In written language, errors involving words that sound similar or identical are fairly common. For example, the word its is frequently used where it's should be, and vice versa. Other confusable sets include: they're/their/there, you're/your and loose/lose.
In this assignment you will develop an approach for detecting and correcting such errors. You only have to worry about these specific types of confusion (in either direction):
You will be provided with one text file. You will need to create a new text file that differs from the input file only in that the errors are corrected.
For example, given a file containing:
Then pour water or light oil from a graduated beaker into the chamber to fill the chamber too its gasket surface.
The horses moved at a clump; they were no more on parade than was they're driver; one fork of the road was as good as another.
you will turn in a new text file containing:
Then pour water or light oil from a graduated beaker into the chamber to fill the chamber to its gasket surface.
The horses moved at a clump; they were no more on parade than was their driver; one fork of the road was as good as another.
There may be zero, one or more corrections required per line.
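The input/output requirement above (change only the confusable words, possibly several per line, and leave everything else untouched) can be sketched as follows. This is a minimal sketch, not a prescribed approach: the names CONFUSION_SETS, correct_line, correct_file, choose and keep_original are hypothetical, the set list is an assumption assembled from the pairs named in this handout plus the to/too pair from the example, and the decision function is deliberately a placeholder, since choosing the approach is part of the assignment.

```python
import re

# Assumed confusion sets: the pairs named in this handout plus the to/too
# pair from its example. Adjust to the assignment's official list.
CONFUSION_SETS = [
    ("it's", "its"),
    ("they're", "their", "there"),
    ("you're", "your"),
    ("loose", "lose"),
    ("too", "to"),
]
# Map each word to its full confusion set.
SET_FOR = {w: group for group in CONFUSION_SETS for w in group}
# Longest alternatives first, whole words only, any capitalization.
PATTERN = re.compile(
    r"\b(?:" + "|".join(sorted(map(re.escape, SET_FOR), key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def keep_original(word, candidates, left, right):
    """Placeholder decision function: never changes anything."""
    return word


def correct_line(line, choose=keep_original):
    """Rewrite every confusable word in `line` using `choose`.

    `choose(word, candidates, left, right)` receives the lowercased word,
    its confusion set, and the text before and after it in the original
    line, and returns the (lowercase) word to use.
    """
    def substitute(match):
        original = match.group(0)
        word = original.lower()
        best = choose(word, SET_FOR[word], line[:match.start()], line[match.end():])
        if best == word:
            return original  # no change: keep the exact original spelling
        if original[:1].isupper():
            best = best[:1].upper() + best[1:]  # carry over initial capital
        return best

    return PATTERN.sub(substitute, line)


def correct_file(in_path, out_path, choose=keep_original):
    # newline="" leaves the original line endings untranslated.
    with open(in_path, encoding="utf-8", newline="") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(correct_line(line, choose))
```

Because only the matched words are substituted, whitespace, punctuation and line endings pass through unchanged, and a line with no confusable words is written back exactly as read. A real solution would replace keep_original with a function that scores each candidate in context.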
Unlike in previous homework assignments, there is no further guidance on what approach to use, and no training data will be provided. Each student should be able to implement a suitable approach and find the appropriate data based on what has been covered so far in the course.
Also unlike in previous assignments, you may use open-source tools, libraries and toolkits (e.g. NLTK, Stanford CoreNLP). However, you may not use any commercial software or any free software for which source code is not provided.
Sample input and output files are available on Blackboard. These are meant to show only what input and output files look like; the test file (provided two days before the due date) may contain text of different genres, and errors are expected to be distributed differently. As a result, performance on the sample data may or may not match performance on the test set.
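To get a rough sense of performance on the sample files, one option is to compare your output against the sample output word by word. The sketch below is only an illustration: the handout does not state the grading metric, so the precision/recall definitions here are an assumption, as are the word list and the names confusables and score.

```python
import re

# Assumed word list, drawn from the pairs mentioned in this handout.
WORDS = ["it's", "its", "they're", "their", "there",
         "you're", "your", "loose", "lose", "too", "to"]
PATTERN = re.compile(
    r"\b(?:" + "|".join(sorted(map(re.escape, WORDS), key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def confusables(line):
    """Return the lowercased confusable words of a line, in order."""
    return [m.group(0).lower() for m in PATTERN.finditer(line)]


def score(input_lines, system_lines, gold_lines):
    """Precision and recall of the changes made by the system.

    A change is counted as correct when the system altered a word and the
    result equals the word in the reference output.
    """
    needed = made = correct = 0
    for src, hyp, ref in zip(input_lines, system_lines, gold_lines):
        s, h, r = confusables(src), confusables(hyp), confusables(ref)
        if not (len(s) == len(h) == len(r)):
            continue  # skip lines whose confusable words cannot be aligned
        for a, b, c in zip(s, h, r):
            needed += a != c                     # reference differs from input
            made += a != b                       # system changed the word
            correct += (a != b) and (b == c)     # change matches the reference
    precision = correct / made if made else 0.0
    recall = correct / needed if needed else 0.0
    return precision, recall
```

As the paragraph above notes, the test file may differ in genre and error distribution, so numbers obtained this way on the sample data are at best a sanity check.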
Create a new Bitbucket project called csci544-hw3 and give admin permissions to csci544-grader and to appliednlp. Do not add binary files, third-party software or data files to the project (except for the output files, which of course must be included). Put the following in your repository:
This assignment will be graded on a scale from 0 to 100 points, and will be worth 15% of your grade for the course.