Learning from data
After downloading the training set, unpack the tar.gz archive (tar xzvf FILENAME.tar.gz). A directory will then be created, containing the text files. Each text file is one document. In the spam dataset, a document is one email message. In the sentiment dataset, a document is one movie review. The classification label for each document is indicated in the file name. For example, in the spam dataset each file name begins with either SPAM or HAM, so the classes for your spam classifier should be SPAM and HAM.
Your first task is to format the training data. Your classifier must accept training data in exactly one file in the following format:
LABEL_1 FEATURE_11 FEATURE_12 ... FEATURE_1N

Each line in your training data file corresponds to one document. Each line starts with the class label for the document, and continues with the feature vector that represents the document. For both tasks in this assignment, we will use (at least) bag-of-words features.
Suppose your training dataset consists of two files: a HAM email about a meeting and a SPAM email advertising low rates. Your training file could then look like this:
HAM subject : meeting today hi , could we have a meeting today . thank you .
SPAM subject : low rates click here to apply for new low rates do not miss this chance !
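For illustration, here is a minimal sketch of a script that generates such a training file from a directory of documents. The function name build_training_file, the file-name convention (label as the prefix before the first dot), and the latin-1 encoding are assumptions, and whitespace tokenization is only a starting point.

import glob
import os

def build_training_file(doc_dir, out_path):
    with open(out_path, "w", encoding="latin-1") as out:
        for path in sorted(glob.glob(os.path.join(doc_dir, "*.txt"))):
            # Assumed convention: the label is the file-name prefix,
            # e.g. SPAM.00123.txt yields the label SPAM.
            label = os.path.basename(path).split(".")[0]
            with open(path, encoding="latin-1") as f:
                tokens = f.read().lower().split()
            out.write(label + " " + " ".join(tokens) + "\n")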
To learn a classification model from the training data file, your software will be invoked in the following way:
python3 nblearn.py TRAININGFILE MODELFILE
where TRAININGFILE is the name of the training file (this should be spam_training.txt for the spam dataset, and sentiment_training.txt for the sentiment dataset), and MODELFILE is the name of the file that will contain the model learned by your classifier (spam.nb for the spam dataset, sentiment.nb for the sentiment dataset).
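As a rough sketch of what nblearn.py might do (the JSON model format here is an arbitrary choice, not a requirement): count documents per class for the priors and word occurrences per class for the likelihoods, then serialize the counts to MODELFILE.

import json
import sys
from collections import Counter, defaultdict

def main(training_file, model_file):
    doc_counts = Counter()              # documents per class (priors)
    word_counts = defaultdict(Counter)  # word counts per class (likelihoods)
    with open(training_file, encoding="latin-1") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            label, features = parts[0], parts[1:]
            doc_counts[label] += 1
            word_counts[label].update(features)
    with open(model_file, "w") as out:
        json.dump({"docs": doc_counts, "words": word_counts}, out)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])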
Classifying new text
Once you have created a model file (spam.nb or sentiment.nb), you can use the model to classify new documents. Given a file formatted as follows:
FEATURE_11 FEATURE_12 ... FEATURE_1N

where each line contains the features corresponding to one document, your program must write to STDOUT the same number of lines, and each line must contain exactly one string: the predicted label for the corresponding document.
For example, suppose we have the following file:
subject : another meeting hello again can we meet tomorrow please . thanks .

Your program should write to STDOUT:

HAM

To classify a file with new documents, your software will be invoked in the following way:
python3 nbclassify.py MODELFILE TESTFILE
where MODELFILE is the name of the model file generated by nblearn, and TESTFILE is the name of the file containing the features for the new documents to be classified.
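A corresponding sketch of nbclassify.py, assuming the JSON model format from the nblearn sketch above: for each input line, compute the add-one-smoothed Naive Bayes log score of every class and print the highest-scoring label.

import json
import math
import sys

def main(model_file, test_file):
    with open(model_file) as f:
        model = json.load(f)
    docs, words = model["docs"], model["words"]
    total_docs = sum(docs.values())
    vocab = {w for counts in words.values() for w in counts}
    with open(test_file, encoding="latin-1") as f:
        for line in f:
            best_label, best_score = None, float("-inf")
            for label in docs:
                total = sum(words[label].values())
                score = math.log(docs[label] / total_docs)
                for w in line.split():
                    # Add-one smoothing: unseen words get pseudo-count 1.
                    count = words[label].get(w, 0) + 1
                    score += math.log(count / (total + len(vocab)))
                if score > best_score:
                    best_label, best_score = label, score
            print(best_label)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])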
On Blackboard you will find a development data set for the spam filtering task that includes documents formatted in the same way as the training set. You may use the development set to test the performance of your classifier. For the sentiment analysis task, it is your responsibility to designate some of the available data as a development set so you can track your own progress.
Final testing will be done using a test set that will be provided on February 1. It should take you only a few minutes to classify the test data.
Important: The test set will have the same format as the training set, that is, a directory containing text files, except that the file names of the test data will not reveal the correct label. The files will be named TEST.00001.txt, TEST.00002.txt, etc. Your output files should have one label per line, corresponding to each file in numerical order. In other words, the first line should be the label for TEST.00001.txt, the second line should be the label for TEST.00002.txt, and so on. The accuracy of your classifier will be measured automatically. Failure to format your output correctly may result in very low scores, which will not be changed.
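One way to guarantee the numerical ordering is to sort the test file names by their embedded number before building the feature file; the output name test_features.txt below is just a placeholder.

import glob
import re

paths = sorted(glob.glob("TEST.*.txt"),
               key=lambda p: int(re.search(r"TEST\.(\d+)\.txt", p).group(1)))
with open("test_features.txt", "w", encoding="latin-1") as out:
    for path in paths:
        with open(path, encoding="latin-1") as f:
            # One line of features per document, in TEST.00001, TEST.00002, ... order.
            out.write(" ".join(f.read().lower().split()) + "\n")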
What to turn in for part I:
Put the files for part I in a new Bitbucket project called csci544-hw1. The project should contain at least the following files: nblearn.py, nbclassify.py, spam.nb, sentiment.nb, spam.out, sentiment.out, and your code for generating the training file from the training documents.
In the second part of the assignment you will build classifiers using the same datasets as in part I, but using off-the-shelf implementations of Maximum Entropy classification and Support Vector Machines.
For Support Vector Machines, use SVMlight. You will need to read the documentation provided and format the training data according to its specifications. Use the default parameters (or tune them as you like). The software is set up for binary classification, where the labels are -1 and +1. This works fine for our two tasks, but you will need to postprocess the output to match the output format of nbclassify.
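For example, svm_classify writes one real value per line, with the sign indicating the class; a small postprocessing script might look like the following (the prediction file name, and which label you mapped to +1 when building the training file, are assumptions):

labels = {+1: "SPAM", -1: "HAM"}  # must match the mapping used at training time
with open("svm_predictions.txt") as f:  # hypothetical output file of svm_classify
    for line in f:
        if line.strip():
            print(labels[+1] if float(line) > 0 else labels[-1])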
For Maximum Entropy, use MegaM. You will need to read the documentation provided. Use -nc for named classes, and the multiclass setting. Postprocess the output to match the output format of nbclassify. You may need to install OCaml to compile MegaM, and you may need to change the MegaM Makefile: set the right path on the line that starts with WITHCLIBS (WITHCLIBS = -I /opt/local/lib/ocaml/caml), and replace -lstr with -lcamlstr on the line that starts with WITHSTR.
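A similarly small postprocessing sketch for MegaM, under the assumption (verify against the MegaM documentation) that each prediction line begins with the predicted class name followed by scores:

import sys

for line in sys.stdin:
    if line.strip():
        # Keep only the leading class name; drop the scores.
        print(line.split()[0])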
What to turn in for part II:
Finally, create a README.md for your hw1 project with any relevant information about your solutions to part I and part II, disclosure of any sources of information you used (besides this class, the class notes, and Piazza discussions), and answers to the following questions: