CSCI 544 Homework 2

Perceptron, part-of-speech tagging and named entity recognition

Spring 2015
Kenji Sagae


Last updated: Friday, February 19, 2015

Due date and time

February 24, 2015, 11:59pm PST (updated)
Late submissions are subject to a 10% penalty for every 24h. Assignments more than one week late will receive no credit.

Data

The data necessary for this project will be on the class page on Blackboard. Navigate to content and you will find links to the files.

Description

In this assignment you will create your own discriminative classifier and apply it to two NLP sequence labeling tasks: part-of-speech tagging and named entity recognition.

Submission information

Create a new Bitbucket project called csci544-hw2 and give admin permissions to csci544-grader and to appliednlp. You will place the files specified in the rest of this page in this project. Do not add binary files or data files to the project (except for the output files, which of course must be included).

Part I (15 points)

First you will write an Averaged Perceptron classifier according to the specifications below.
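
As a rough illustration of the technique, here is a minimal sketch of a multiclass averaged perceptron. It is not the required interface: the class name AveragedPerceptron, the string features, and the in-memory weight layout are all assumptions made for this example. It also uses the simplest (and slowest) form of averaging, adding the full weight vector to a running total after every training example, and it does not shuffle the training data.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Hypothetical minimal multiclass averaged perceptron (illustration only)."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)  # (label, feature) -> current weight
        self.totals = defaultdict(float)   # (label, feature) -> sum of weights over all steps
        self.steps = 0                     # number of training examples seen

    @staticmethod
    def score(features, label, weights):
        # Sum the weights of the active (binary) features for this label.
        return sum(weights.get((label, f), 0.0) for f in features)

    def predict(self, features, weights=None):
        weights = self.weights if weights is None else weights
        # Ties are broken in favor of the label listed first.
        return max(self.labels, key=lambda y: self.score(features, y, weights))

    def train(self, examples, epochs=5):
        for _ in range(epochs):
            for features, gold in examples:
                guess = self.predict(features)
                if guess != gold:
                    # Standard perceptron update: reward gold, penalize the wrong guess.
                    for f in features:
                        self.weights[(gold, f)] += 1.0
                        self.weights[(guess, f)] -= 1.0
                # Naive averaging: accumulate the current weights after every example.
                for key, value in self.weights.items():
                    self.totals[key] += value
                self.steps += 1

    def averaged_weights(self):
        return {key: total / self.steps for key, total in self.totals.items()}


# Toy data: each example is (list of feature strings, gold label).
toy = [(["w=the"], "DT"), (["w=dog"], "NN"), (["w=runs"], "VBZ")]
model = AveragedPerceptron(["DT", "NN", "VBZ"])
model.train(toy, epochs=3)
avg = model.averaged_weights()
```

Classifying with the averaged weights rather than the final weights is what distinguishes the averaged perceptron from the plain one; here that would be model.predict(features, avg).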

What to turn in for part I:

These files should be in the root directory of your project. Do not include binary files or data files in your project.

Important: You do not need to finish part I to complete parts II and III. If you fail to complete part I, you can use MegaM for parts II and III. However, to receive credit for part I, you must use your perceptron in parts II and III. Successful completion of part I means that your perceptron is about as good as MegaM.

Part II (40 points)

In the second part of the assignment you will use your averaged perceptron (or MegaM) to perform part-of-speech tagging.

Ideally, you should create programs or scripts to train a POS tagging model (postrain.py), and to tag new text (postag.py).

The most important aspect of part II is to tag the data. If you cannot create the specific postrain and postag commands with the specific format, make sure you can at least produce tagged output, which will account for 80% of the credit for part II.

The training program, postrain, should run as follows:

python3 postrain.py TRAININGFILE MODEL

(Optionally, postrain.py may also accept the option -h DEVFILE, although this is not required.)

where TRAININGFILE is the input file formatted with one sentence per line, and each sentence composed of word/tag pairs. For example, a small training file might contain these lines:
This/DT is/VBZ a/DT test/NN ./.
I/PRP saw/VBD a/DT movie/NN ./.
I/PRP like/VBP cookies/NNS ./.

and MODEL is the output file containing the model.
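
To make the training format concrete, the sketch below reads one line of word/tag pairs into a list of tuples. The function name parse_tagged_sentence is hypothetical, and splitting on the last slash (so that a word containing a slash stays intact) is an assumption about the data, not something this page specifies.

```python
def parse_tagged_sentence(line):
    """Split one training line into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST slash, so "1/2/CD" -> ("1/2", "CD").
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


example = parse_tagged_sentence("This/DT is/VBZ a/DT test/NN ./.")
```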

The postag program should run as follows:

python3 postag.py MODEL

where MODEL is the model generated by postrain.

postag should take its input from STDIN in the form of one sentence per line, where each sentence is a sequence of words (without tags). Output should be written to STDOUT, and should be a tagged sentence (in the same format as the training data) for each input sentence.
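
The input/output behavior described above might be organized as in the following sketch. The names format_tagged, tag_lines, dummy_predict and main are hypothetical, and dummy_predict is only a stand-in that labels every word NN; a real postag.py would load MODEL and call its classifier at that point.

```python
import sys


def format_tagged(words, tags):
    """Join words and tags back into the word/tag training format."""
    return " ".join(f"{word}/{tag}" for word, tag in zip(words, tags))


def tag_lines(lines, predict_tags):
    """Yield one tagged output line for each input sentence."""
    for line in lines:
        words = line.split()
        yield format_tagged(words, predict_tags(words))


def dummy_predict(words):
    # Placeholder classifier (assumption): tags every word as NN.
    return ["NN"] * len(words)


def main():
    # Read untagged sentences from STDIN and write tagged sentences to STDOUT.
    for tagged in tag_lines(sys.stdin, dummy_predict):
        print(tagged)
```

A script built this way would call main() at the bottom of the file, so that each line read from STDIN produces exactly one line on STDOUT.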

Your implementation of postag may use percepclassify (or megam.opt, if you cannot complete part I) through a system call, or it may simply read the model and perform the classification without the use of percepclassify.

Training, development and test datasets will be provided.

What to turn in for part II:

Put the files above in a directory called postagging

Part III (25 points)

In the third part of the assignment you will use your averaged perceptron to perform named entity recognition.

You will write two programs, as in part II. The training program should be called nelearn.py and the tagging program should be called netag.py.

Your nelearn and netag could be identical to your postrain and postag, but don't need to be.

The NER data format is similar to the POS tagging data format, but also includes a POS tag between each word and its NER BIO tag: WORD/POSTAG/NERTAG WORD/POSTAG/NERTAG ...

The test data will contain POS tags, but no NER tags: WORD/POSTAG WORD/POSTAG ...

Your output should include the NER tags produced by your system, and should be in the same format as the training and dev datasets.
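
The two NER token formats can be handled by one small helper, sketched below. The function name parse_ner_token and its has_ner_tag parameter are hypothetical, the tag B-PER in the usage lines is only an illustrative BIO label, and splitting from the right is an assumption that keeps any slash inside a word intact.

```python
def parse_ner_token(token, has_ner_tag=True):
    """Return (word, pos_tag, ner_tag); ner_tag is None for test data."""
    if has_ner_tag:
        # Training/dev format: WORD/POSTAG/NERTAG
        word, pos, ner = token.rsplit("/", 2)
        return word, pos, ner
    # Test format: WORD/POSTAG
    word, pos = token.rsplit("/", 1)
    return word, pos, None


train_token = parse_ner_token("John/NNP/B-PER")
test_token = parse_ner_token("John/NNP", has_ner_tag=False)
```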

What to turn in for part III:

Put the files above in a directory called ner

Part IV (20 points)

Finally, add to README.md any relevant information about your solution to parts I, II and III, a disclosure of any sources of information (besides this class and class notes), and answers to the following questions:

Grading

This assignment will be graded on a scale from 0 to 100 points, and will be worth 15% of your grade for the course.

Important information on collaboration and external resources