The task for this project is to segment a sequence of English characters into the most likely word sequence.
The main goal is to set up your groups and programming environment.
Make sure you set up your virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Optionally, copy the requirements file and modify it so that we install the packages your code needs when we test it:
cp requirements.txt answer/requirements.txt
You must create the following files:
answer/ensegment.py
answer/ensegment.ipynb
To create the output.zip file for upload to Coursys do:
python3 zipout.py
For more options:
python3 zipout.py -h
To create the source.zip file for upload to Coursys do:
python3 zipsrc.py
For more options:
python3 zipsrc.py -h
To check your accuracy on the dev set:
python3 check.py
For more options:
python3 check.py -h
In particular, use the log file to check how your output was evaluated:
python3 check.py -l log
The accuracy on data/input/test.txt will not be shown. We will
evaluate your output on the test input after the submission deadline.
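As a rough illustration of what an accuracy check on the dev set could look like, the sketch below compares output lines against reference lines. The metric actually used by check.py is not described here, so exact-match-per-line scoring is an assumption, and the function name line_accuracy is hypothetical.

```python
def line_accuracy(output_lines, reference_lines):
    """Fraction of reference lines matched exactly by the output.

    Assumption: scoring is exact match per line after trimming
    surrounding whitespace. check.py may score differently.
    """
    if not reference_lines:
        return 0.0
    correct = sum(
        1
        for out, ref in zip(output_lines, reference_lines)
        if out.strip() == ref.strip()
    )
    # Missing output lines count as wrong because we divide by the
    # number of reference lines, not the number of output lines.
    return correct / len(reference_lines)
```

Dividing by the number of reference lines means a truncated output file is penalized rather than silently ignored.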
The default solution is provided in default.py. To use the default
as your solution:
cp default.py answer/ensegment.py
cp default.ipynb answer/ensegment.ipynb
python3 zipout.py
python3 check.py
Make sure that the command line options are kept as they are in
default.py. You can add to them but you must not delete any
command line options that exist in default.py.
Submitting the default solution without modification will get you zero marks.
The data files provided are:
data/count_1w.txt -- counts taken from the Google n-gram corpus with 1T (one trillion) tokens
data/input -- input files dev.txt and test.txt
data/reference/dev.out -- the reference output for the dev.txt input file
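One possible starting point for the segmentation task is a unigram model with dynamic programming over split points. This is a minimal sketch, not the default solution: it assumes each line of data/count_1w.txt has the form word<TAB>count, and the names parse_counts, make_segmenter, and max_word_len, as well as the length-based penalty for unknown words, are hypothetical choices made for illustration.

```python
import math


def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a dict (file format is assumed)."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t")
        counts[word] = int(count)
    return counts


def make_segmenter(counts, max_word_len=20):
    """Build a segment(text) function from unigram counts."""
    total = sum(counts.values())

    def log_prob(word):
        if word in counts:
            return math.log10(counts[word] / total)
        # Hypothetical smoothing: unknown words get a small probability
        # that shrinks by a factor of 10 per character.
        return math.log10(10.0 / total) - len(word)

    def segment(text):
        # best[i] holds (log probability, words) for the best split of text[:i]
        best = [(0.0, [])] + [(-math.inf, [])] * len(text)
        for end in range(1, len(text) + 1):
            for start in range(max(0, end - max_word_len), end):
                word = text[start:end]
                score = best[start][0] + log_prob(word)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
        return best[-1][1]

    return segment


# Toy counts for illustration only; real counts would come from the data file.
toy_counts = parse_counts(["the\t100", "cat\t20", "sat\t15", "at\t30", "a\t50"])
segment = make_segmenter(toy_counts)
print(" ".join(segment("thecatsat")))  # the cat sat
```

Working in log space avoids underflow when many small probabilities are multiplied, and capping the candidate word length keeps the inner loop bounded on long inputs.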