Reuters Search Engine

Caution

Make sure your local copy is up to date by running `git pull origin main`.

For each task, create a new branch named in the form `name/task-name`.

Important

First, run `pip install -r requirements.txt`.

Second, open the `data/indexer.py` file and read the TODO note.

Third, open `requirements.txt`. It lists the libraries the project depends on. If you `pip install` a new library, don't forget to add it to the file along with its version.

Finally, some libraries in the file are commented out. We no longer need them, but we also don't want to remove them.
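
As a small aid for the third step, the sketch below builds a pinned line for an installed library and checks whether it already has an active entry. It assumes `requirements.txt` uses `name==version` pins and `#` for commented-out libraries; the helper names (`format_pin`, `installed_pin`, `is_listed`) are hypothetical and not part of the project.

```python
from importlib import metadata


def format_pin(package: str, version: str) -> str:
    """Build a pinned requirement line such as 'numpy==1.26.4'."""
    return f"{package}=={version}"


def installed_pin(package: str) -> str:
    """Return the pinned line for a library installed in this environment.

    Raises importlib.metadata.PackageNotFoundError if it is not installed.
    """
    return format_pin(package, metadata.version(package))


def is_listed(package: str, requirements_text: str) -> bool:
    """Return True if `package` has an active (non-commented) pinned entry."""
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out libraries
        # Assumption: every active line has the form 'name==version'.
        name = line.split("==")[0].strip()
        if name.lower() == package.lower():
            return True
    return False
```

For example, after installing a new library you could call `installed_pin("<library-name>")` and append the returned line to `requirements.txt` if `is_listed` reports that it is missing.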

Caution

Don't run any of `extractor.py`, `transformer.py`, or `loader.py`, because they require libraries that cannot be run on your machine.

Dataset

This table describes each field in the processed_documents.json file.

New indicates that this feature was generated through data mining and feature-engineering.
Updated indicates that this feature already existed but was refined, cleaned, and transformed to improve data quality.

| Column / Property | Description | New | Updated |
| --- | --- | --- | --- |
| `date` | Timestamp indicating when the Reuters article was published. Stored as `datetime64[ns]`. | False | True |
| `topics` | List of topical tags or subject categories assigned to the article (e.g., economic themes, commodities, industries). | False | True |
| `places` | List of geographic place tags related to the article’s content, often countries or regions mentioned or relevant to the story. | False | True |
| `people` | List of individuals referenced in the article, usually named stakeholders, analysts, officials, or quoted experts. | False | True |
| `organizations` | List of organizations mentioned in the text, such as agencies, companies, government bodies, or institutions. | False | True |
| `exchanges` | List of financial exchanges or markets referenced in the article (often empty when not applicable). | False | False |
| `title` | The headline of the news article as published by Reuters. | False | True |
| `text` | Full body text of the article, containing the narrative, quotes, and analysis. | False | True |
| `keyword` | A high-level category label (e.g., “business-news”) representing the main topic class of the article. | True | False |
| `domain` | A broader thematic domain such as “food-and-drink,” grouping articles into content verticals. | True | False |
| `type` | The type or format of the entry (e.g., “news”). Indicates the nature of the document. | True | False |
| `quality` | Represents the structural and grammatical quality of the article’s text body, i.e. how well-formed and clean the text in the `text` field is. | True | False |
| `geographic_information` | Structured dictionary containing geolocation metadata extracted for the article (e.g., city, state, country, coordinates, ISO codes). | True | False |
| `identities` | List of identity-related descriptors, typically nationalities, religious groups, or political affiliations referenced in the text. | True | False |
| `title_embedding` | Embedding vector generated from the news title, reflecting the semantic meaning of its core intent. | True | False |
| `text_embedding` | Embedding vector generated from the news body text, reflecting the semantic meaning of its content. | True | False |
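
To illustrate how the embedding fields can be used, here is a minimal sketch that ranks records by cosine similarity between a query vector and each record's `title_embedding`. This is one common approach and not necessarily the ranking method this project uses. It assumes each record is a dictionary with the fields described above; the function names and the toy records in the usage note are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_title(query_vector, documents, top_k=3):
    """Return up to `top_k` (score, title) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(query_vector, doc["title_embedding"]), doc["title"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

With toy records such as `{"title": "A", "title_embedding": [1.0, 0.0]}` and `{"title": "B", "title_embedding": [0.0, 1.0]}`, calling `rank_by_title([1.0, 0.0], records, top_k=1)` places record "A" first. The same function works on `text_embedding` if the key is swapped.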
