Deadline: TBA
The purpose of this assignment is threefold:
- Implement a better ranker than the
SimpleRanker
class. For example, inverse document frequency (i.e., considering how frequent the query terms are across the corpus) and static rank (i.e., a query-independent quality score per document) are two factors that you should include. - Realize a simple search engine that is capable of approximate matching, e.g., where the query organik kemmistry matches documents containing the terms organic chemistry. We will do this by using a tokenizer that produces k-grams (i.e., overlapping "shingles" of width k) and combine this with n-of-m matching. For example, for k = 3, the string banana would be tokenized into the shingles {ban, ana, nan, ana}.
- Implement basic methods related to sparse document vectors. For example, related to computing dot products, cosine similarities, and centroids.
Your solution should only contain edits to the files shinglegenerator.py
, sparsedocumentvector.py
, and betterranker.py
. Changes to other files will be ignored.
Implementation notes:
- The
BetterRanker
class implements a better ranker than theSimpleRanker
class. You are not required to compute a static rank score g(d) for each document d, but can for the sake of simplicity assume that this value is stored in a suitably named document field. If the named field is missing or doesn't have a value, a default static rank score of 0.0 can be assumed. - The
ShingleGenerator
class implements a tokenizer that produces character-level shingles. - The
SparseDocumentVector
class provides a representation of a sparse document vector, and is a basic building block used by, e.g., theRocchioClassifier
class.
Your task is to:
- Familiarize yourself with the precode.
- Implement the missing code in the
BetterRanker
class. - Implement the missing code in the
ShingleGenerator
class. - Implement the missing code in the
SparseDocumentVector
class. - Ensure that the code is correct and passes all tests.
Some optional bonus challenges for the interested student:
- Extend your
ShingleGenerator
implementation to make special provisions for the beginning and/or ending of strings, so that these are given more weight than other parts of the words. For example, banana could be processed as ^banana$ where the special marker symbols ^ and $ are added as padding. Do you think this makes a difference? - The
WordShingleGenerator
class does token- or term-level shingling. Experiment with using this to realize a biword index. - When working with shingles, the Jaccard distance between the set of shingles in the query and the set of shingles in the document is sometimes used as a relevancy measure. Write a ranker that implements this.
- Extend your ranker to additionally post-process all matches so that edit distance influences the ranking.
- Extend your ranker to take document length into account. See BM25 for inspiration.
- Extend the
SimpleNormalizer
class to do various linguistically motivated transformations or replace its use with, e.g.,PorterNormalizer
or evenSoundexNormalizer
. Do you think this makes a difference when combined with standard tokenization as performed by theSimpleTokenizer
class?
Example output:
>cd tests
>python3 assignments.py d-1
test_document_id_mismatch (test_betterranker.TestBetterRanker.test_document_id_mismatch) ... ok
test_inverse_document_frequency (test_betterranker.TestBetterRanker.test_inverse_document_frequency) ... ok
test_static_quality_score (test_betterranker.TestBetterRanker.test_static_quality_score) ... ok
test_term_frequency (test_betterranker.TestBetterRanker.test_term_frequency) ... ok
test_ranges (test_shinglegenerator.TestShingleGenerator.test_ranges) ... ok
test_shingled_mesh_corpus (test_shinglegenerator.TestShingleGenerator.test_shingled_mesh_corpus) ... ok
test_strings (test_shinglegenerator.TestShingleGenerator.test_strings) ... ok
test_tokens (test_shinglegenerator.TestShingleGenerator.test_tokens) ... ok
test_uses_yield (test_shinglegenerator.TestShingleGenerator.test_uses_yield) ... ok
test_centroid (test_sparsedocumentvector.TestSparseDocumentVector.test_centroid) ... ok
test_cosine (test_sparsedocumentvector.TestSparseDocumentVector.test_cosine) ... ok
test_dot_product (test_sparsedocumentvector.TestSparseDocumentVector.test_dot_product) ... ok
test_length (test_sparsedocumentvector.TestSparseDocumentVector.test_length) ... ok
test_normalize_empty (test_sparsedocumentvector.TestSparseDocumentVector.test_normalize_empty) ... ok
test_normalize_nonempty (test_sparsedocumentvector.TestSparseDocumentVector.test_normalize_nonempty) ... ok
test_top (test_sparsedocumentvector.TestSparseDocumentVector.test_top) ... ok
test_truncate (test_sparsedocumentvector.TestSparseDocumentVector.test_truncate) ... ok
----------------------------------------------------------------------
Ran 17 tests in 1.266s
OK
>python3 repl.py d-1
Indexing MeSH corpus...
Enter a query and find matching documents.
Lookup options are {'debug': False, 'hit_count': 5, 'match_threshold': 0.5}.
Tokenizer is ShingleGenerator.
Ranker is SimpleRanker.
Ctrl-C to exit.
query>orgaNik kemmistry
[{'document': {'document_id': 16981, 'fields': {'body': 'organic chemistry processes', 'meta': '27'}},
'score': 8.0},
{'document': {'document_id': 16980, 'fields': {'body': 'organic chemistry phenomena', 'meta': '27'}},
'score': 8.0},
{'document': {'document_id': 4411, 'fields': {'body': 'chemistry, organic', 'meta': '18'}},
'score': 8.0},
{'document': {'document_id': 4410, 'fields': {'body': 'chemistry, inorganic', 'meta': '20'}},
'score': 8.0},
{'document': {'document_id': 4408, 'fields': {'body': 'chemistry, bioinorganic', 'meta': '23'}},
'score': 8.0}]
Evaluation took 0.013234000000000634 seconds.
query>^C
Bye!
>python3 repl.py d-2
Indexing English news corpus...
Enter a query and find matching documents.
Lookup options are {'debug': False, 'hit_count': 5, 'match_threshold': 0.5}.
Tokenizer is SimpleTokenizer.
Ranker is BetterRanker.
Ctrl-C to exit.
query>water in the tank
[{'document': {'document_id': 7398, 'fields': {'body': 'The elevated salt levels in the water threatened some of the wildlife in the area that depend on a supply of fresh water.'}},
'score': 1.3941246235276306},
{'document': {'document_id': 9699, 'fields': {'body': 'While there are not many people in the water during the winter months, there are plenty playing by the shore with their jeans rolled up just enough so t
hat they can feel the cool water lap up against their feet.'}},
'score': 1.313894761437516},
{'document': {'document_id': 157, 'fields': {'body': "A Chinese People's Liberation Army cadet sits in a Main Battle Tank during a demonstration at the PLA's Armoured Forces Engineering Academy Base, July 22
, 2014."}},
'score': 1.2521843989957548},
{'document': {'document_id': 2818, 'fields': {'body': 'He noted that 50 percent of his 71 Shark Tank investments have been in female-led companies.'}},
'score': 1.1974759616298138},
{'document': {'document_id': 4515, 'fields': {'body': 'Kate Kralman, who shot the video of the MAX going through the water, was helping a friend load equipment nearby when she saw one light rail train go thr
ough the water.'}},
'score': 1.176740906290895}]
Evaluation took 0.06133769999999927 seconds.
query>^C
Bye!
>python3 repl.py x-6
Initializing Rocchio classifier from movie corpus...
Enter some text and have it classified as a movie genre.
Ctrl-C to exit.
text>It was a dark and stormy night.
[{'category': 'Horror', 'score': 0.09401124737117747},
{'category': 'Thriller', 'score': 0.07818261918043858},
{'category': 'Mystery', 'score': 0.07640408674526705},
{'category': 'Romance', 'score': 0.062220941484383556},
{'category': 'Fantasy', 'score': 0.06041547517917066},
{'category': 'Drama', 'score': 0.05547877862410392},
{'category': 'Comedy', 'score': 0.04404293794572259},
{'category': 'Sci-Fi', 'score': 0.03619449355029178},
{'category': 'Crime', 'score': 0.03318476743126697},
{'category': 'Family', 'score': 0.031568925471141276},
{'category': 'Action', 'score': 0.030971438280831806},
{'category': 'Adventure', 'score': 0.027093731733426146},
{'category': 'History', 'score': 0.020218785708303797}]
Evaluation took 0.014864500146359205 seconds.
text>^C
Bye!