Caution
Make sure you are up to date by running `git pull origin main`.
For each task, create a new branch named in the format `name/task-name`.
Important
First, run `pip install -r requirements.txt`.
Second, open the data/indexer.py file and read the TODO note.
Third, open requirements.txt, which lists the libraries used in the project. If you pip install a new
library, don't forget to add it to the file along with its version.
Finally, some libraries are commented out because we no longer need them but do not want to remove them from the file.
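The version rule above can be checked with a few lines of Python. This is a minimal sketch, not part of the project: the function name `unpinned_requirements` and the sample entries are hypothetical, and it assumes versions are pinned with `==`. It treats everything after a `#` as a comment, which is a simplification.

```python
def unpinned_requirements(lines):
    """Return active requirement lines that do not pin an exact version."""
    missing = []
    for raw in lines:
        # Drop comments (including commented-out libraries) and whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or fully commented-out entry
        if line.startswith("-"):
            continue  # pip option lines such as "-r other.txt"
        if "==" not in line:
            missing.append(line)
    return missing


# Hypothetical sample entries, for illustration only.
sample = [
    "pandas==2.2.2",
    "# nltk==3.8.1",
    "numpy",
    "",
    "requests>=2.0",
]
print(unpinned_requirements(sample))  # ['numpy', 'requests>=2.0']
```

To check the real file, pass the open file object, for example `unpinned_requirements(open("requirements.txt"))`.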
Caution
Don't run any of extractor.py, transformer.py, or loader.py, because they require libraries that cannot
run on your machine.
This table describes each field in the processed_documents.json file.
- New indicates that this feature was generated through data mining and feature engineering.
- Updated indicates that this feature already existed but was refined, cleaned, and transformed to improve data quality.

| Column / Property | Description | New | Updated |
|---|---|---|---|
| date | Timestamp indicating when the Reuters article was published. Stored as datetime64[ns]. | False | True |
| topics | List of topical tags or subject categories assigned to the article (e.g., economic themes, commodities, industries). | False | True |
| places | List of geographic place tags related to the article’s content, often countries or regions mentioned or relevant to the story. | False | True |
| people | List of individuals referenced in the article, usually named stakeholders, analysts, officials, or quoted experts. | False | True |
| organizations | List of organizations mentioned in the text, such as agencies, companies, government bodies, or institutions. | False | True |
| exchanges | List of financial exchanges or markets referenced in the article (often empty when not applicable). | False | False |
| title | The headline of the news article as published by Reuters. | False | True |
| text | Full body text of the article, containing the narrative, quotes, and analysis. | False | True |
| keyword | A high-level category label (e.g., “business-news”) representing the main topic class of the article. | True | False |
| domain | A broader thematic domain such as “food-and-drink,” grouping articles into content verticals. | True | False |
| type | The type or format of the entry (e.g., “news”). Indicates the nature of the document. | True | False |
| quality | Represents the structural and grammatical quality of the article’s text body. This reflects how well-formed or clean the text in the text field is. | True | False |
| geographic_information | Structured dictionary containing geolocation metadata extracted for the article (e.g., city, state, country, coordinates, ISO codes). | True | False |
| identities | List of identity-related descriptors, typically nationalities, religious groups, or political affiliations referenced in the text. | True | False |
| title_embedding | Embeddings vector generated from the news title reflecting the semantic meaning of its core intent. | True | False |
| text_embedding | Embeddings vector generated from the news body text reflecting the semantic meaning of its content. | True | False |
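
As a rough illustration of how the fields in the table might be consumed, the sketch below checks a record for the documented field names and compares its two embedding vectors. It assumes that processed_documents.json is a JSON array of objects and that both embeddings are equal-length lists of numbers; neither is confirmed here. The helper names (`load_documents`, `missing_fields`, `cosine_similarity`) and the toy record are hypothetical.

```python
import json
import math

# Field names taken from the table above.
EXPECTED_FIELDS = {
    "date", "topics", "places", "people", "organizations", "exchanges",
    "title", "text", "keyword", "domain", "type", "quality",
    "geographic_information", "identities",
    "title_embedding", "text_embedding",
}


def load_documents(path):
    """Load records, assuming the file holds a JSON array of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_fields(record):
    """Return the documented fields that are absent from a record."""
    return sorted(EXPECTED_FIELDS - set(record))


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy record with made-up two-dimensional embeddings, for illustration only.
record = {
    "title": "Example headline",
    "text": "Example body.",
    "title_embedding": [1.0, 0.0],
    "text_embedding": [1.0, 0.0],
}
print(missing_fields(record))
print(cosine_similarity(record["title_embedding"], record["text_embedding"]))
```

With the real data, `load_documents("processed_documents.json")` would return the records to iterate over, provided the file layout matches the assumption above.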