CSCI 544 Homework 1: Text classification

Spring 2015

Last updated: January 16, 2015

Due date and time

February 5, 2015, 11:59pm PST
Late submissions are subject to a 10% penalty for every 24h.

Data

The data necessary for this project will be available on Blackboard. The spam dataset will be posted by January 19, and the sentiment dataset will be posted later in that week. Navigate to content and you will find links to the files. You are not allowed to use any other data in this project. Contact csci544.usc on gmail if you have any questions, or if you cannot access the data.

Description

In this assignment you will create your own text classifier and apply it to two datasets corresponding to two tasks: (1) spam filtering, and (2) sentiment analysis. In part I you will write a Naive Bayes classifier and apply it to the two datasets. In part II you will use off-the-shelf classifiers with the same datasets.

Part I

First you will write a Naive Bayes classifier according to the specifications below.

You must write your code in Python running in Ubuntu Linux.
You must not use any additional packages, libraries or modules.
You must create one file with source code to learn the model (nblearn.py) and one file with source code to classify new text (nbclassify.py).

Learning from data

After downloading the training set, unpack the tar.gz archive (tar xzvf FILENAME.tar.gz). A directory will then be created, containing the text files. Each text file is one document. In the spam dataset, a document is one email message. In the sentiment dataset, a document is one movie review. The classification label for each document is indicated in the file name. For example, in the spam dataset each file name begins with either SPAM or HAM, so the classes for your spam classifier should be SPAM and HAM.

Your first task is to format the training data. Your classifier must accept training data in exactly one file in the following format:

LABEL_1 FEATURE_11 FEATURE_12 ... FEATURE_1N
LABEL_2 FEATURE_21 FEATURE_22 ... FEATURE_2N
...
LABEL_M FEATURE_M1 FEATURE_M2 ... FEATURE_MN

Each line in your training data file corresponds to one document. Each line starts with the class label for the document, and continues with the feature vector that represents the document. For both tasks in this assignment, we will use (at least) bag-of-words features.

Suppose your training dataset consists of the following two files:

HAM.1.txt
subject : meeting today
hi , could we have a meeting today .
thank you .
SPAM.1.txt
subject : low rates
click here to apply for new low rates
do not miss this chance !

You training file could look like this:

HAM subject : meeting today hi , could we have a meeting today . thank you .
SPAM subject : low rates click here to apply for new low rates do not miss this chance !

To learn a classification model from the training data file, your software will be invoked in the following way:

python3 nblearn.py TRAININGFILE MODELFILE

where TRAININGFILE is the name of the training file (this should be spam_training.txt for the spam dataset, and sentiment_training.txt for the sentiment dataset), MODELFILE is the name of the file that will contain the model that your classifier will learn (for the spam dataset the file name should be spam.nb, and sentiment.nb for the sentiment dataset).

Classifying new text

Once you created a model file (spam.nb or sentiment.nb), you can use the model to classify new documents. Given a file formatted as follows:

FEATURE_11 FEATURE_12 ... FEATURE_1N
FEATURE_21 FEATURE_22 ... FEATURE_2N
...
FEATURE_M1 FEATURE_M2 ... FEATURE_MN

where each line contains the features corresponding to one document, your program must write to STDOUT the same number of lines, and each line must contain exactly one string: the predicted label for the corresponding document.

For example, suppose we have the following file:

subject : another meeting hello again can we meet tomorrow please . thanks .
subject : more low rates don 't miss out on our low rates today.

Your program should write to STDOUT:

HAM
SPAM

To classify a file with new documents, your software will be invoked in the following way:

python3 nbclassify.py MODELFILE TESTFILE

where MODELFILE is the name of the model file generated by nblearn, TESTFILE is the name of the file containing the features for the new documents to be classified.

On Blackboard you will find a development data set for the spam filtering task that includes documents formatted in the same way as the training set. You may use the development set to test the performance of your classifier. For the sentiment analysis task, it is your responsibillity to designate some of the available data as a development set so you can track your own progress.

Final testing will be done using a test set that will be provided on February 1. It should take you only a few minutes to classify the test data.

Important: The test set will have the same format as the training set, that is, a directory containing text files, except that the file names of the test data will not reveal the correct label. The files will be named TEST.00001.txt, TEST.00002.txt, etc. Your output files should have one label per line, corresponding to each file in numerical order. In other words, the first line should be the label for TEST.00001.txt, the second line should be the label for TEST.00002.txt, and so on. The accuracy of you classifier will be measured automatically. Failure to format your output correctly may result in very low scores, which will not be changed.

What to turn in for part I:

Code
The model files (spam.nb and sentiment.nb) and files containing the STDOUT output of your classification program on the test set. You should name these two files spam.out and sentiment.out.
Optional: a README file with any instructions necessary for running your code

Put the files above in a new bitbucket project called csci544-hw1. For part I of the assigment, this project should contain at least the files: nblearn.py, nbclassify.py, spam.nb, sentiment.nb, spam.out, sentiment.out, and your code for generating the training file from the training documents.

Part II

In the second part of the assignment you will build classifiers using the same datasets as in part I, but using an off-the-shelf implementations of Maximum Entropy classification and Support Vector Machines.

For Support Vector Machines, use SVM^light. You will need to read the documentation provided and format the training data according to the specifications. Use the default parameters (or tune them as you like). The software is setup for binary classification, where the labels are -1 and +1. This works just fine for our two tasks, but you will need to postprocess the output to be in the same format as the output of nbclassify.

For Maximum Entropy, use MegaM. You will need to read the documentation provided. Use -nc for named classes, and the multiclass setting. Postprocess the output to be in the same format as the output of nbclassify. You may need to install ocaml to compile, and you may need to change the MegaM Makefile to have the right path on the line that starts with WITHCLIBS (WITHCLIBS =-I /opt/local/lib/ocaml/caml), and replace -lstr with -lcamlstr in the line that starts with WITHSTR).

What to turn in for part II:

Create a subdirectory called part2 in your project, and place all files related to part 2 in that directory.
Turn in the model files, which should be named spam.svm.model and sentiment.svm.model, and spam.megam.model and sentiment.megam.model.
Turn in classification output files (postprocessed to have the same format as the output of nbclassify), named spam.svm.out, sentiment.svm.out, spam.megam.out and sentiment.megam.out.

Part III

Finally, create a README.md for your hw1 project with any relevant information about your solution to part I and part II, disclosure of any source of information (besides this class, class notes and piazza discussions), and with answers to the following questions:

What are the precision, recall and F-score on the development data for your classifier in part I for each of the two datasets. Report precision, recall and F-score for each label.
What are the precision, recall and F-score for your classifier in part II for each of the two datasets. Report precision, recall and F-score for each label.
What happens exactly to precision, recall and F-score in each of the two tasks (on the development data) when only 10% of the training data is used to train the classifiers in part I and part II? Why do you think that is?

Grading

This assignment will be graded on a scale from 0 to 100 points, and will be worth 15% of your grade for the course. The number of points you are awarded will be based on the following formula:
HW1score = min(x₁⁵ × 60 + x₂³ × 30 + x₃ × 10 + x₄ × 10 + x₅ - p, 100),
where:
x₁ is the F-score of your classifier on the SPAM dataset,
x₂ is the F-score of your classifier on the Sentiment dataset,
x₃ is the F-score of your SPAM classifier from part II of the assignment (max(svm, maxent)),
x₄ is the F-score of your Sentiment classifier from part II of the assignment (max(svm, maxent)),
x₅ is your score on part III of the assignment,
p is a penalty for undesirable characteristics of your solution, including excessive time or space requirements (runs too slow, uses too much RAM or too much disk space), incorrect implementation, wasteful use of external resources (e.g. use of an external dictionary without any improvement in performance).

Important information on collaboration and external resources

DO NOT look for any kind of help on the web
PLEASE DO discuss any issues you have on our Piazza board
This is an individual assignment. You may NOT work in teams or collaborate informally with other students in the class or anyone else. You must be the sole author of 100% of the code and writing you turn in.
You may NOT use any code you find online or anywhere else. You may NOT turn in any material you did not create. This applies to code, writing, and anything else you turn in
If do receive any external help or use any code (even one line!) you did not write yourself specifically for this assignment, you MUST note this explicitly in your code AND in your write-up.
Failing to comply with these guidelines will result in a grade of zero for the project. All cases of cheating or academic dishonesty will be dealt with according to University policy.