Created by KING-258
This Python script fetches data for a given phrase from multiple APIs (NewsAPI, GDELT, Wikipedia), processes it, and sends it to a Kafka message queue. The goal is to analyze news articles, global trends, and Wikipedia summaries within a set time limit, so that fetching never runs long.
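The combine-and-send step described above could look roughly like the sketch below. It is not the repository's actual code: the function names, the payload keys, and the topic name `news_data` are all hypothetical, and the README does not say which Kafka client is used. The sketch only assumes a producer object with `send(topic, value=...)` and `flush()` methods, which is the interface offered by `KafkaProducer` in the kafka-python package.

```python
import json
from typing import Any, List


def build_message(phrase: str, news: List[dict], gdelt: List[dict], wiki_summary: str) -> bytes:
    """Merge the three sources into one JSON document, encoded as UTF-8 bytes."""
    payload = {
        "phrase": phrase,           # the search phrase entered by the user
        "newsapi": news,            # articles from NewsAPI (hypothetical key name)
        "gdelt": gdelt,             # records from GDELT (hypothetical key name)
        "wikipedia": wiki_summary,  # summary text from Wikipedia (hypothetical key name)
    }
    return json.dumps(payload).encode("utf-8")


def publish(producer: Any, topic: str, message: bytes) -> None:
    """Send one message and block until the producer has flushed it."""
    producer.send(topic, value=message)
    producer.flush()


class RecordingProducer:
    """Stand-in producer so the sketch runs without a Kafka broker."""

    def __init__(self) -> None:
        self.sent = []

    def send(self, topic: str, value: bytes = None) -> None:
        self.sent.append((topic, value))

    def flush(self) -> None:
        pass


producer = RecordingProducer()
publish(producer, "news_data", build_message("climate", [], [], "summary text"))
print(producer.sent[0][0])  # prints news_data
```

With a running broker, the stand-in could be swapped for a real producer such as kafka-python's `KafkaProducer(bootstrap_servers="localhost:9092")`. The broker address here is also an assumption.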
- Multi-API Integration: Fetch data from NewsAPI, GDELT, and Wikipedia based on user input.
- Time-limited Fetching: Stops API calls after 3 minutes or 5 pages of results, keeping processing time bounded.
- Kafka Integration: Sends the combined data to a Kafka queue for real-time streaming after web scraping.
- Customizable Search: Can be modified to adjust the time limits or page size.
- Hadoop Usage: Hadoop stores the web-scraped data and is used to cross-reference new searches against previously searched data.
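The time-limited fetching feature can be illustrated with a minimal sketch. It assumes the two limits apply together, so fetching stops at whichever is reached first; the README does not spell this out. The names `fetch_with_limits`, `fetch_page`, and `fake_fetch` are hypothetical and not taken from the repository. The page fetcher is passed in as a function so the loop can be tried without network access.

```python
import time
from typing import Callable, List

MAX_SECONDS = 180  # 3-minute budget from the feature list
MAX_PAGES = 5      # page cap from the feature list


def fetch_with_limits(
    fetch_page: Callable[[str, int], List[dict]],
    phrase: str,
    max_seconds: float = MAX_SECONDS,
    max_pages: int = MAX_PAGES,
) -> List[dict]:
    """Collect pages until the time budget or the page cap is reached."""
    deadline = time.monotonic() + max_seconds
    results: List[dict] = []
    for page in range(1, max_pages + 1):
        if time.monotonic() >= deadline:
            break  # time budget used up
        items = fetch_page(phrase, page)
        if not items:
            break  # the source has no more results
        results.extend(items)
    return results


def fake_fetch(phrase: str, page: int) -> List[dict]:
    """Stand-in for a real API call: returns one record per page."""
    return [{"phrase": phrase, "page": page}]


articles = fetch_with_limits(fake_fetch, "climate")
print(len(articles))  # prints 5
```

Adjusting the limits, as the Customizable Search item suggests, would then amount to passing different `max_seconds` or `max_pages` values.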
- Python 3.8 or above
- Kafka
- Hadoop
- NewsAPI Client
- GDELT API
- Wikipedia API
- Requests
Clone the Repository
git clone https://github.com/KING-258/BDA_Mini
cd BDA_Mini