Throughout the COVID-19 pandemic, people’s needs have evolved due to a myriad of closures and stay-at-home orders. Local services have had to adapt every day, and information about them changes at a fast pace. All of this new and dynamic information is difficult to sift through and not always straightforward to find.
- Our goal is to crawl, aggregate, index and search/retrieve information from local news sources in Baltimore and report back relevant and personalized results to the user.
- To achieve this, we built a search engine to retrieve relevant articles. We then extended it to simulate user personalization: a user’s profile is mimicked by the topics the user is biased towards, which are incorporated as a string of bias terms at run time. This allows us to retrieve results personalized to the user’s needs.
To find an appropriate system for real-world data, we considered 4 labelled datasets (CACM, CISI, Medline, Cranfield) and conducted experiments on this data:
- CACM: abstracts and queries from the Communications of the ACM journal
- CISI: documents and queries from Centre for Inventions and Scientific Information
- Medline: collection of articles and queries from Medline journals
- Cranfield: commonly used IR dataset with aerodynamics journal articles, queries, and relevance judgements
- Then, we selected the best-performing permutations, based on evaluation on development data, to deploy on our COVID-19 news data.
- We crawled COVID-19 related articles from Baltimore Sun, CBS Baltimore, and WBALTV since they provide access to focused local information relevant to Baltimore.
- Note: The CBS and WBALTV local news source spiders are based on RISJbot.
- Web crawling and scraping
- Scrape a set of websites to create a corpus of documents
- Preprocessing
- Structured: Stemming and Stop Words Removal
- Unstructured: Acronyms, Emoticons, Spell Check, Contractions
- Vectorization and Scoring
- Word embeddings: Word2Vec, GloVe, FastText, Doc2Vec, OneHot
- Weighting schemes for combining word embeddings into sentence embeddings: Mean, TF-IDF, Smooth Inverse Frequency, Unsupervised Smooth Inverse Frequency
- Similarity: Cosine, Dice, Jaccard, or any other metric from scipy.spatial.distance
- Query Optimization
- Personalize user queries using a modified Rocchio relevance feedback mechanism
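The query optimization step can be illustrated with a minimal sketch. The source does not describe how its Rocchio mechanism is modified, so this example assumes the simplest variant: the string of bias terms is treated as a single relevant document, and there is no negative feedback term. The bag-of-words vectors, the function names, and the `alpha`/`beta` weights are all illustrative assumptions, not the project's actual code.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (a stand-in for real embeddings)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rocchio(query_vec, bias_vec, alpha=1.0, beta=0.5):
    """Shift the query towards the bias terms: q' = alpha * q + beta * b.

    Assumed simplification: one pseudo-relevant "document" (the bias string)
    and no non-relevant set.
    """
    terms = set(query_vec) | set(bias_vec)
    return {
        t: alpha * query_vec.get(t, 0.0) + beta * bias_vec.get(t, 0.0)
        for t in terms
    }


def search(query, docs, bias_terms="", top_k=2):
    """Rank docs by cosine similarity to the (optionally personalized) query."""
    q = vectorize(query)
    if bias_terms:
        q = rocchio(q, vectorize(bias_terms))
    ranked = sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:top_k]


docs = [
    "social distancing rules at costco grocery stores",
    "social distancing guidelines for outdoor sports",
    "city council budget vote",
]
plain = search("social distancing", docs, top_k=1)
biased = search("social distancing", docs, bias_terms="costco grocery", top_k=1)
print(plain)
print(biased)
```

In this toy corpus the unbiased query ranks the shorter sports article first, while the bias string "costco grocery" pulls the Costco article to the top.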
- Personalized search engine for local Baltimore news.
- Web scraper for popular Baltimore news/business websites.
- Find the best
- Word/Sentence embedding
- Ways to personalize any query
- Ways to handle unstructured text
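As one possible way to handle unstructured query text, the sketch below expands acronyms and contractions from small lookup tables and corrects misspellings against a vocabulary using the standard library's `difflib`. The tables, the vocabulary, the function names, and the 0.8 similarity cutoff are all hypothetical; the source does not say which spell checker or dictionaries the project uses.

```python
import difflib

# Hypothetical lookup tables; a real system would use much larger ones.
ACRONYMS = {
    "jhu": "johns hopkins university",
    "icu": "intensive care unit",
    "wk": "week",
}
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}

# Hypothetical vocabulary; in practice it could be built from the crawled corpus.
VOCABULARY = [
    "stay", "at", "home", "order",
    "johns", "hopkins", "university",
    "cases", "last", "week",
]


def correct_spelling(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def normalize_query(query):
    """Lower-case, expand contractions and acronyms, then spell-correct."""
    tokens = []
    for raw in query.lower().split():
        expanded = CONTRACTIONS.get(raw) or ACRONYMS.get(raw)
        if expanded:
            tokens.extend(expanded.split())
        else:
            tokens.append(correct_spelling(raw))
    return " ".join(tokens)


print(normalize_query("stayy at homew ordero"))  # stay at home order
print(normalize_query("cases last wk"))          # cases last week
```

Words with no sufficiently close vocabulary match are passed through unchanged, so unknown terms are not silently replaced.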
Install all the packages this search engine requires to run using:
pip install -r requirements.txt
$ scrapy crawl <news source>
: Runs the Scrapy spider on the chosen news site. Choose from: ['cbs', 'wbaltv']
- Process the JSONL output into the CSV and document formats required by the data loader.
$ cd data/local_news_data
$ python process.py
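For orientation, converting JSON Lines output into CSV can look roughly like the sketch below. This is not the contents of process.py: the field names (`url`, `title`, `text`) and the function name are assumptions, and in-memory buffers stand in for the real input and output files.

```python
import csv
import io
import json

# Hypothetical field names; the actual spider output may differ.
FIELDS = ["url", "title", "text"]


def jsonl_to_csv(jsonl_lines, out_file, fields=FIELDS):
    """Write one CSV row per non-empty JSON line and return the row count."""
    writer = csv.DictWriter(out_file, fieldnames=fields)
    writer.writeheader()
    count = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the known fields; missing ones become empty strings.
        writer.writerow({f: record.get(f, "") for f in fields})
        count += 1
    return count


# In-memory stand-ins for the scraped .jsonl file and the output .csv file.
sample = io.StringIO(
    '{"url": "https://example.com/a", "title": "Masks", "text": "Mask guidance."}\n'
    "\n"
    '{"url": "https://example.com/b", "title": "Vaccines", "extra": 1}\n'
)
out = io.StringIO()
n_rows = jsonl_to_csv(sample, out)
print(n_rows)
print(out.getvalue())
```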
Then, to run the basic search engine, use the following command and type your query once the search engine is initialized. (Note: It takes approximately 3-5 minutes to initialize the search engine.)
$ python deploy.py
Command line arguments
--personalize
: Runs the search engine with mimicked user personalization (biased query results). Note: This mode needs a search history in order to personalize results towards the type of content the user prefers. So, first search for terms that reflect the user's preferences, and then enter your normal queries to see the personalized results. Example:
$ python deploy.py --personalize
--embedding
: Chooses the word embedding method to use. Choose from: ["one-hot", "word2vec-google-news-300", "glove-twitter-100", "glove-wiki-gigaword-100", "glove-wiki-gigaword-200", "fasttext-wiki-news-subwords-300"]. Example:
$ python deploy.py --embedding "one-hot"
--weighting_scheme
: Chooses the weighting scheme to use for computing document vectors from word vectors. Choose from: ["mean", "tf-idf", "sif", "usif"]. Example:
$ python deploy.py --weighting_scheme "tf-idf"
--top_k
: Number of results to return for each query. Example:
$ python deploy.py --top_k 5
--expand_query
: Enables query expansion based on GloVe (glove-wiki-gigaword-100). Example:
$ python deploy.py --expand_query
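The options above could be declared with `argparse` along the following lines. This is a sketch, not the actual deploy.py: the flag names and choices are taken from the list above, but the default values (`one-hot`, `mean`, and a `top_k` of 5) and the function name are assumptions.

```python
import argparse

EMBEDDINGS = [
    "one-hot",
    "word2vec-google-news-300",
    "glove-twitter-100",
    "glove-wiki-gigaword-100",
    "glove-wiki-gigaword-200",
    "fasttext-wiki-news-subwords-300",
]
WEIGHTING_SCHEMES = ["mean", "tf-idf", "sif", "usif"]


def build_parser():
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Local news search engine")
    parser.add_argument("--personalize", action="store_true",
                        help="bias results using the search history")
    parser.add_argument("--embedding", choices=EMBEDDINGS, default="one-hot",
                        help="word embedding method (default is an assumption)")
    parser.add_argument("--weighting_scheme", choices=WEIGHTING_SCHEMES,
                        default="mean",
                        help="word-to-document weighting (default is an assumption)")
    parser.add_argument("--top_k", type=int, default=5,
                        help="results per query (default is an assumption)")
    parser.add_argument("--expand_query", action="store_true",
                        help="enable GloVe-based query expansion")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--embedding", "word2vec-google-news-300", "--weighting_scheme", "usif"]
)
print(args)
```

Using `choices` makes the parser reject unsupported embedding or weighting names before the slow initialization step begins.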
- Default
- Command:
python deploy.py
- Examples:
- General:
- "masks"
- "vaccine"
- "ventilators"
- With misspellings
- "stayy at homew ordero"
- "johnss hopoins universitio"
- With acronyms or contractions
- "JHU"
- "VP"
- "ICU"
- "CDC"
- "cases last wk"
- Pretrained Word Embeddings
- Command:
python deploy.py --embedding "word2vec-google-news-300" --weighting_scheme "usif"
- Effect: Results capture the semantics of the query
- Examples:
- "employment"
- "grocery"
- "medicine"
- User Personalization
- Command:
python deploy.py --personalize
- Effect: Results are personalized towards the user's biases
- Examples:
- Search for "costco" and then "social distancing"; compare with searching for just "social distancing" in a fresh session
- Search for "sports" and then "lakers"; compare with searching for just "lakers" in a fresh session
- Query Expansion
- Command:
python deploy.py --expand_query
- Effect: Results are more focused on the query's topic
- Examples:
- "recession"
- "economy"
- "pizza"
Note: Also try the Pretrained Word Embeddings, User Personalization, and Query Expansion example queries in the default mode (python deploy.py) to see how these techniques improve the search results.