Information retrieval system spanning data extraction, analysis, transformation, and Elasticsearch indexing, integrating large-scale AI operations to deliver robust, accurate search with text, image, and voice queries, summarization, and intelligent reranking.

IsmaelMousa/reuters-search-engine

Reuters Search Engine

Caution

Before you start, make sure your local copy is up to date by running git pull origin main.

For each task, create a new branch named after the pattern name/task-name.
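
The name/task-name convention can be enforced with a small helper. This is only a sketch: the function make_branch_name and the rule that both parts are lowercase words joined by hyphens are assumptions for illustration, not something the repository defines.

```python
import re

# Assumed convention: "<name>/<task-name>", each part made of lowercase
# letters and digits, with single hyphens between words.
BRANCH_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*/[a-z0-9]+(?:-[a-z0-9]+)*$")


def _slug(value: str) -> str:
    """Lowercase the value and turn every run of other characters into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def make_branch_name(author: str, task: str) -> str:
    """Build a branch name of the form name/task-name (hypothetical helper)."""
    branch = f"{_slug(author)}/{_slug(task)}"
    if not BRANCH_PATTERN.match(branch):
        raise ValueError(f"cannot build a valid branch name from {author!r} and {task!r}")
    return branch


print(make_branch_name("Ismael", "Add Query Parser"))  # ismael/add-query-parser
```

The resulting string can then be passed to git checkout -b.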

Important

First, run pip install -r requirements.txt.

Second, open data/indexer.py and read the TODO note.

Third, open requirements.txt, which lists the libraries the project depends on. If you pip install a new library, add it to this file together with its version.

Finally, some libraries in the file are commented out: we no longer need them, but we don't want to remove them either.
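
As a rough illustration of the "add it with its version" rule, the sketch below lists active entries of a requirements file that are not pinned with ==. The function name unpinned_requirements and the sample package lines are hypothetical, and the check assumes that "with its version" means an exact == pin; commented-out entries are skipped, matching the note above.

```python
import re

# Matches "package==1.2.3", optionally with extras such as "package[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==\S+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return the active requirement lines that lack an exact version pin."""
    missing = []
    for raw in text.splitlines():
        # Drop comments, so fully commented-out libraries become empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if not PINNED.match(line):
            missing.append(line)
    return missing


# Hypothetical file content, not the project's real requirements.txt.
sample = """\
elasticsearch==8.13.0
# nltk==3.8.1
pandas
numpy>=1.26
"""
print(unpinned_requirements(sample))  # ['pandas', 'numpy>=1.26']
```

To check the real file, pass the result of reading requirements.txt as text. Option lines such as -r other.txt would also be reported, so the pattern would need loosening if the file uses them.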

Caution

Don't run any of extractor.py, transformer.py, or loader.py: they require libraries that cannot be run on your machine.

Dataset

This table describes each field in the processed_documents.json file.

  • New indicates that this feature was generated through data mining and feature-engineering.

  • Updated indicates that this feature already existed but was refined, cleaned, and transformed to improve data quality.

| Column / Property | Description | New | Updated |
| --- | --- | --- | --- |
| date | Timestamp indicating when the Reuters article was published. Stored as datetime64[ns]. | False | True |
| topics | List of topical tags or subject categories assigned to the article (e.g., economic themes, commodities, industries). | False | True |
| places | List of geographic place tags related to the article’s content, often countries or regions mentioned in or relevant to the story. | False | True |
| people | List of individuals referenced in the article, usually named stakeholders, analysts, officials, or quoted experts. | False | True |
| organizations | List of organizations mentioned in the text, such as agencies, companies, government bodies, or institutions. | False | True |
| exchanges | List of financial exchanges or markets referenced in the article (often empty when not applicable). | False | False |
| title | The headline of the news article as published by Reuters. | False | True |
| text | Full body text of the article, containing the narrative, quotes, and analysis. | False | True |
| keyword | A high-level category label (e.g., “business-news”) representing the main topic class of the article. | True | False |
| domain | A broader thematic domain, such as “food-and-drink,” that groups articles into content verticals. | True | False |
| type | The type or format of the entry (e.g., “news”), indicating the nature of the document. | True | False |
| quality | Structural and grammatical quality of the article body, i.e., how well-formed and clean the content of the text field is. | True | False |
| geographic_information | Structured dictionary containing geolocation metadata extracted for the article (e.g., city, state, country, coordinates, ISO codes). | True | False |
| identities | List of identity-related descriptors, typically nationalities, religious groups, or political affiliations referenced in the text. | True | False |
| title_embedding | Embedding vector generated from the news title, reflecting the semantic meaning of its core intent. | True | False |
| text_embedding | Embedding vector generated from the news body text, reflecting the semantic meaning of its content. | True | False |
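
The field list above can double as a lightweight schema check. The sketch below copies the field names and New/Updated flags from the table; the helper names (FIELDS, new_fields, updated_fields, missing_fields) and the sample record are hypothetical, and the layout of processed_documents.json (assumed here to be a JSON array of objects) is not confirmed by this README.

```python
import json

# Field name -> (new, updated), copied from the table above.
FIELDS = {
    "date": (False, True),
    "topics": (False, True),
    "places": (False, True),
    "people": (False, True),
    "organizations": (False, True),
    "exchanges": (False, False),
    "title": (False, True),
    "text": (False, True),
    "keyword": (True, False),
    "domain": (True, False),
    "type": (True, False),
    "quality": (True, False),
    "geographic_information": (True, False),
    "identities": (True, False),
    "title_embedding": (True, False),
    "text_embedding": (True, False),
}


def new_fields() -> list[str]:
    """Fields produced by data mining and feature engineering."""
    return [name for name, (new, _) in FIELDS.items() if new]


def updated_fields() -> list[str]:
    """Pre-existing fields that were cleaned or transformed."""
    return [name for name, (_, updated) in FIELDS.items() if updated]


def missing_fields(record: dict) -> list[str]:
    """Return the documented fields that a record does not contain."""
    return [name for name in FIELDS if name not in record]


# Hypothetical record; assumes the file holds a JSON array of objects.
documents = json.loads('[{"title": "Sample headline", "text": "Sample body"}]')
for doc in documents:
    print(missing_fields(doc))
```

Running the same loop over the parsed contents of processed_documents.json would flag any record that lacks a documented field.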
