CSCI 544 Homework 2

Perceptron, part-of-speech tagging and named entity recognition

Spring 2015
Kenji Sagae

Last updated: Friday, February 19, 2015

Due date and time

February 24, 2015, 11:59pm PST (updated)
Late submissions are subject to a 10% penalty for every 24h. Assignments more than one week late will receive no credit.

Data

The data necessary for this project will be on the class page on Blackboard. Navigate to content and you will find links to the files.

Description

In this assignment you will create your own discriminative classifier and apply it to two NLP sequence labeling tasks: part-of-speech tagging and named entity recognition.

Submission information

Create a new Bitbucket project called csci544-hw2 and give admin permissions to csci544-grader and to appliednlp. You will place the files specified in the rest of this page in this project. Do not add binary files or data files to the project (except for the output files, which of course must be included).

Part I (15 points)

First you will write an Averaged Perceptron classifier according to the specifications below.

You must write your code in Python 3.
You must turn in one file with source code to learn the model (perceplearn.py) and one file with source code to classify new data (percepclassify.py).
Your training program must use the same input and output format as nblearn from homework 1. Optionally, perceplearn.py may use a held-out development set with the option -h DEVFILE (this is not required). Your classification program must use input and output formats similar to those of nbclassify, but instead of taking an input file on the command line, it should read its input from STDIN. Each line of output should be written to STDOUT immediately after the corresponding input line is read from STDIN. Make sure to flush STDOUT after writing each line of output.
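For orientation, here is a minimal sketch of multiclass averaged perceptron training and prediction. It is only one possible formulation, not the required implementation. It assumes examples are already in memory as (label, feature list) pairs; reading the nblearn file format and saving the model are left out because they are defined in homework 1. The function names (score, predict, train) and the parameters iterations and seed are made up for this illustration.

```python
import random
from collections import defaultdict


def score(weights, label, features):
    """Sum the weights of the active features for one label."""
    w = weights[label]
    return sum(w.get(f, 0.0) for f in features)


def predict(weights, features):
    """Return the highest-scoring label; ties go to the alphabetically first label."""
    return max(sorted(weights), key=lambda label: score(weights, label, features))


def train(examples, iterations=10, seed=0):
    """Train on (label, feature_list) pairs and return averaged weights.

    Uses the usual bookkeeping trick for averaging: 'cached' stores each
    update multiplied by the step at which it happened, so the average of
    all weight vectors (including the initial zero vector) can be computed
    at the end as weights - cached / step.
    """
    examples = list(examples)
    labels = {label for label, _ in examples}
    weights = {label: defaultdict(float) for label in labels}
    cached = {label: defaultdict(float) for label in labels}
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    step = 1
    for _ in range(iterations):
        rng.shuffle(examples)
        for gold, features in examples:
            guess = predict(weights, features)
            if guess != gold:
                # Standard perceptron update: reward gold, penalize the guess.
                for f in features:
                    weights[gold][f] += 1.0
                    weights[guess][f] -= 1.0
                    cached[gold][f] += step
                    cached[guess][f] -= step
            step += 1
    return {
        label: {f: w - cached[label][f] / step for f, w in weights[label].items()}
        for label in labels
    }


# Toy data, invented for this example only.
toy = [
    ("spam", ["buy", "cheap"]),
    ("ham", ["meeting", "today"]),
    ("spam", ["cheap", "pills"]),
    ("ham", ["lunch", "today"]),
]
model = train(toy, iterations=5)
print(predict(model, ["cheap", "pills"]))
```

Averaging tends to make the final model less sensitive to the order of the last few training examples than the plain perceptron, which is why the assignment asks for the averaged variant.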

What to turn in for part I:

Code (perceplearn.py and percepclassify.py)
Include in your project a README.md file with any instructions necessary for running your code.

These files should be in the root directory of your project. Do not include binary files or data files in your project.

Important: You do not need to finish part I to complete parts II and III. If you fail to complete part I, you can use MegaM for parts II and III. However, to receive credit for part I, you must use your perceptron in parts II and III. Successful completion of part I means that your perceptron is about as good as MegaM.

Part II (40 points)

In the second part of the assignment you will use your averaged perceptron (or MegaM) to perform part-of-speech tagging.
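One common way to use a classifier for tagging is to classify each word separately, using features drawn from the word and its neighbors. The assignment does not prescribe any feature set, so the sketch below is purely an assumption for illustration: the function name word_features, the feature name prefixes, and the boundary symbols <s> and </s> are all invented here.

```python
def word_features(words, i):
    """Build a small feature list for the word at position i.

    The window (previous, current, next word) and the 3-character suffix
    are illustrative choices, not requirements of the assignment.
    """
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    return [
        "w=" + word,
        "prev=" + prev_word,
        "next=" + next_word,
        "suf3=" + word[-3:],
    ]


sentence = ["I", "saw", "a", "movie", "."]
print(word_features(sentence, 1))
```

Each word in the training data then becomes one classification example whose label is its tag. Richer features generally help, and choosing them is part of the work in this assignment.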

Ideally, you should create programs or scripts to train a POS tagging model (postrain.py), and to tag new text (postag.py).

The most important aspect of part II is to tag the data. If you cannot create the specific postrain and postag commands with the specific format, make sure you can at least produce tagged output, which will account for 80% of the credit for part II.

The training program, postrain, should run as follows:

python3 postrain.py TRAININGFILE MODEL

(Optionally, postrain.py may also accept the option -h DEVFILE, although this is not required.)

where TRAININGFILE is the input file formatted with one sentence per line, and each sentence composed of word/tag pairs. For example, a small training file might contain these lines:
This/DT is/VBZ a/DT test/NN ./.
I/PRP saw/VBD a/DT movie/NN ./.
I/PRP like/VBP cookies/NNS ./.

and MODEL is the output file containing the model.
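As a small illustration of the training file format, the sketch below splits one line into (word, tag) pairs. The function name parse_tagged_sentence is hypothetical. Splitting on the last slash is a defensive choice made here, on the assumption that a word may itself contain a slash while a tag does not.

```python
def parse_tagged_sentence(line):
    """Turn 'This/DT is/VBZ' into [('This', 'DT'), ('is', 'VBZ')]."""
    pairs = []
    for token in line.split():
        # rsplit with maxsplit=1 splits on the LAST slash only,
        # so a token such as '1/2/CD' yields ('1/2', 'CD').
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


print(parse_tagged_sentence("This/DT is/VBZ a/DT test/NN ./."))
```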

The postag program should run as follows:

python3 postag.py MODEL

where MODEL is the model generated by postrain.

postag should take its input from STDIN in the form of one sentence per line, where each sentence is a sequence of words (without tags). Output should be written to STDOUT, and should be a tagged sentence (in the same format as the training data) for each input sentence.
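The input and output handling described above might be organized as in the following sketch. The names tag_stream and dummy_tagger are invented for this example, and dummy_tagger is only a stand-in for a real tagger built on your model. Flushing after every line mirrors the requirement stated for percepclassify in part I; it is included here as a precaution.

```python
import io
import sys


def tag_stream(tag_sentence, infile=sys.stdin, outfile=sys.stdout):
    """Read one sentence per line, write one tagged sentence per line.

    tag_sentence is any function mapping a list of words to a list of tags
    of the same length (a hypothetical interface for this sketch).
    """
    for line in infile:
        words = line.split()
        tags = tag_sentence(words)
        tagged = " ".join(word + "/" + tag for word, tag in zip(words, tags))
        outfile.write(tagged + "\n")
        outfile.flush()  # make each line visible as soon as it is produced


def dummy_tagger(words):
    """Placeholder tagger that labels every word NN (illustration only)."""
    return ["NN"] * len(words)


# Demonstration with in-memory streams instead of real STDIN/STDOUT.
demo_out = io.StringIO()
tag_stream(dummy_tagger, io.StringIO("I like cookies .\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Passing the streams as parameters keeps the loop easy to test without a terminal; in the real program the defaults (STDIN and STDOUT) would be used.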

Your implementation of postag may use percepclassify (or megam.opt, if you cannot complete part I) through a system call, or it may simply read the model and perform the classification without the use of percepclassify.

Training, development and test datasets will be provided.

What to turn in for part II:

Your code (at least postrain.py and postag.py).
Your test set output (pos.test.out).

Put the files above in a directory called postagging

Part III (25 points)

In the third part of the assignment you will use your averaged perceptron to perform named entity recognition.

You will write two programs, as in part II. The training program should be called nelearn.py and the tagging program should be called netag.py.

Your nelearn and netag could be identical to your postrain and postag, but don't need to be.

The NER data format is similar to the POS tagging data format, but also includes a POS tag between each word and its NER BIO tag: WORD/POSTAG/NERTAG WORD/POSTAG/NERTAG ...

The test data will contain POS tags, but no NER tags: WORD/POSTAG WORD/POSTAG ...

Your output should include the NER tags produced by your system, and should be in the same format as the training and dev datasets.
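The two token formats can be handled with small helpers like the ones below. This is a sketch only: the function names are hypothetical, and the example tokens (including the POS tags NP, VB, PREP and the NER tags B-PER, B-LOC, O) are invented for illustration and may not match the tag sets in the provided data.

```python
def parse_ner_token(token):
    """Split a training token WORD/POSTAG/NERTAG on its last two slashes."""
    word, pos, ner = token.rsplit("/", 2)
    return word, pos, ner


def parse_test_token(token):
    """Split a test token WORD/POSTAG on its last slash."""
    word, pos = token.rsplit("/", 1)
    return word, pos


def format_ner_sentence(tokens, ner_tags):
    """Join (word, pos) pairs and predicted NER tags back into one line."""
    return " ".join(
        f"{word}/{pos}/{ner}" for (word, pos), ner in zip(tokens, ner_tags)
    )


# Invented example line; the real data may use different tags.
test_line = "Maria/NP vive/VB en/PREP Madrid/NP"
tokens = [parse_test_token(t) for t in test_line.split()]
print(format_ner_sentence(tokens, ["B-PER", "O", "O", "B-LOC"]))
```

Because the POS tag is available at test time as well as training time, it can also be used as a feature for the NER classifier.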

What to turn in for part III:

Your code (at least nelearn.py and netag.py).
Your test set output (ner.esp.test.out).

Put the files above in a directory called ner

Part IV (20 points)

Finally, add to the README.md any relevant information about your solution to parts I, II and III, a disclosure of any sources of information (besides this class and class notes), and answers to the following questions:

(5 pts) What is the accuracy of your part-of-speech tagger?
(5 pts) What are the precision, recall and F-score for each of the named entity types for your named entity recognizer, and what is the overall F-score?
(10 pts) What happens if you use your Naive Bayes classifier instead of your perceptron classifier (report performance metrics)? Why do you think that is?
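The metrics asked for in the first two questions can be computed along the following lines. This is a sketch under stated assumptions, not an official scorer: it assumes tagging accuracy is measured per token, that an entity counts as correct only when its boundaries and type both match exactly, and that the overall F-score pools counts across all entity types. All function names are invented for this example.

```python
from collections import Counter


def accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in one BIO sequence."""
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        # A stray I- tag with no open span is treated as starting a span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def prf(n_correct, n_predicted, n_gold):
    """Precision, recall and F-score from raw counts (0.0 when undefined)."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def prf_by_type(gold_sentences, predicted_sentences):
    """Per-type and pooled ('overall') precision, recall and F-score."""
    correct, n_gold, n_pred = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_sentences, predicted_sentences):
        g, p = bio_spans(gold), bio_spans(pred)
        n_gold.update(kind for _, _, kind in g)
        n_pred.update(kind for _, _, kind in p)
        correct.update(kind for _, _, kind in g & p)
    results = {
        kind: prf(correct[kind], n_pred[kind], n_gold[kind])
        for kind in sorted(set(n_gold) | set(n_pred))
    }
    results["overall"] = prf(
        sum(correct.values()), sum(n_pred.values()), sum(n_gold.values())
    )
    return results


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(prf_by_type(gold, pred))
```

If the course provides its own evaluation script or defines the metrics differently, those definitions take precedence over the assumptions made here.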

Grading

This assignment will be graded on a scale from 0 to 100 points, and will be worth 15% of your grade for the course.

Important information on collaboration and external resources

This is an individual assignment. You may NOT work in teams or collaborate informally with other students in the class or anyone else. You must be the sole author of 100% of the code and writing you turn in.
You may NOT use any code you find online or anywhere else. You may NOT turn in any material you did not create. This applies to code, writing, and anything else you turn in. The sole exception to this is that you may consult online resources about generic Python programming or Linux. If you use information found in such resources, you must acknowledge all sources in your readme.
If you do receive any external help or use any code (even one line!) you did not write yourself specifically for this assignment, you MUST note this explicitly in your code AND in your readme.
Failing to comply with these guidelines will result in a grade of zero for the project. All cases of cheating or academic dishonesty will be dealt with according to University policy.