GDELT Local Extractor


A command-line tool designed to efficiently download, process, and consolidate data from the GDELT Project. It leverages PySpark to handle large volumes of data and transforms it into local, query-friendly formats like CSV and Parquet.

This tool was created to simplify the initial data engineering phase of GDELT analysis, allowing researchers and data scientists to focus on generating insights rather than data wrangling.

Key Features

  • Automated Downloads: Fetch GDELT 1.0 data files for a specified date range.
  • Parallel Processing: Utilizes a local PySpark session to process files in parallel, significantly speeding up data transformation.
  • Flexible Output: Save processed data as consolidated CSV or highly efficient Parquet files (saving up to 70% of disk space compared to the unzipped files!).
  • Configurable: Easily customize output paths, data columns, and processing settings through a simple configuration file.
  • Command-Line Interface: Easy to use and integrate into automated data pipelines.

Installation

Follow these steps to set up the tool and its environment.

Prerequisites:

  • Python 3.9+ (I used 3.12)
  • Java 17+ (required for PySpark; I have OpenJDK 21.0.8)

Steps:

  1. Clone the repository:

    git clone <your-repository-url>
    cd GDELTLocalExtractor
  2. Create and activate a virtual environment:

    # For Linux/macOS
    python3 -m venv venv
    source venv/bin/activate
    
    # For Windows
    python -m venv venv
    .\venv\Scripts\activate
  3. Install the project and its dependencies: This command reads the pyproject.toml file and installs the tool along with all required libraries.

    pip install .

    For developers who intend to modify the source code, install in editable mode:

    pip install -e .
    

1. Configuration

Before the first run, you must configure your settings in the config.py file located in the project's root directory. This file controls where data is downloaded and stored, and which data columns you wish to keep. By default, the folders for downloaded data are created automatically inside this repository folder. The only parameters you are advised to change are FILTER_TERMS and FILTER_TERMS_COLUMNS, if you wish to filter the GDELT content for a specific subject.
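
For orientation, a minimal config.py could look like the sketch below. This is illustrative only: the folder names match the data directories listed in the project structure, but the variable layout and the example filter values are assumptions, not the repository's actual defaults.

    # config.py -- illustrative sketch; the real file in the repository is authoritative.
    from pathlib import Path

    # Data folders (created automatically inside the repository folder on first run).
    BASE_DIR = Path(__file__).resolve().parent
    DOWNLOAD_DIR = BASE_DIR / "data" / "gdelt_downloaded_data"
    OUTPUT_DIR = BASE_DIR / "data" / "merged_parquet"

    # Filtering settings, applied only when the tool is run with -f / --filter.
    # Hypothetical example: keep rows related to renewable energy.
    FILTER_TERMS = ["SOLAR", "WIND", "RENEWABLE"]
    # Columns the filter terms are matched against (hypothetical choice of columns).
    FILTER_TERMS_COLUMNS = ["Actor1Name", "Actor2Name", "SOURCEURL"]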

In /GDELTLocalExtractor/DataExtractionTool/utils/schema.py, there is a list of columns I have decided to use/prioritize when filtering columns, which can also be changed or modified by the user.
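
The column selection in utils/schema.py is essentially a list of GDELT 1.0 event-table field names. The sketch below shows the general idea with a hypothetical subset (the name SELECTED_COLUMNS and the exact selection are assumptions; the real module may differ).

    # utils/schema.py (hypothetical subset -- the module in the repository defines the authoritative list)
    # GDELT 1.0 event-table fields kept after column pruning.
    SELECTED_COLUMNS = [
        "GLOBALEVENTID",
        "SQLDATE",
        "Actor1Name",
        "Actor2Name",
        "EventCode",
        "GoldsteinScale",
        "NumMentions",
        "AvgTone",
        "ActionGeo_CountryCode",
        "SOURCEURL",
    ]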

2. Execution

Run the tool from the root of the project directory. The main command structure is:

python -m DataExtractionTool.GDELT_Extractor [options]

Command-Line Arguments

The tool accepts the following arguments to control its behavior:

| Argument | Description | Required / Default |
|---|---|---|
| -s, --start_date | The start date for the data range in YYYY-MM-DD format. | Required |
| -e, --end_date | The end date for the data range in YYYY-MM-DD format. | Required |
| --chunk_size | The number of days to download and process in each batch. Helps manage memory for very large date ranges. | Default: 5 |
| -f, --filter | A flag that, when present, activates the filtering logic defined in config.py. | Disabled |
| -u, --only_download_and_unzip | A special mode that only downloads the raw GDELT files and unzips them. It will NOT run the Spark processing pipeline (no filtering, no Parquet/consolidated CSV output). | Disabled |

Examples

Basic Extraction

Download and process all GDELT data from February 1st, 2024, to February 3rd, 2024. This will run the full Spark pipeline but will not apply the filters from config.py.

python -m DataExtractionTool.GDELT_Extractor --start_date "2024-02-01" --end_date "2024-02-03"

Extraction with Filtering

Run the full Spark pipeline and apply the FILTER_CONDITIONS specified in your config.py file.

python -m DataExtractionTool.GDELT_Extractor -s "2024-02-01" -e "2024-02-03" -f

Download-Only Mode

Quickly fetch the raw data for a date range without processing it. This is useful for archiving or manual inspection. This command will not start a Spark session.

python -m DataExtractionTool.GDELT_Extractor -s "2024-02-01" -e "2024-02-03" -u

Advanced Extraction

Process data for a large date range, applying filters and using a smaller chunk size to manage system resources effectively.

python -m DataExtractionTool.GDELT_Extractor -s "2024-01-01" -e "2024-03-31" -f --chunk_size 2
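
Once a run finishes, the consolidated output can be inspected locally. A minimal sketch, assuming the Parquet output lands in data/merged_parquet (the folder shown in the project structure; adjust the path if you changed it in config.py):

    # inspect_output.py -- illustrative only; adjust the path to your configured output directory.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gdelt-inspect").getOrCreate()

    # Read all Parquet part files written by the extractor.
    df = spark.read.parquet("data/merged_parquet")

    df.printSchema()
    print(f"Rows: {df.count()}")
    df.show(5, truncate=False)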

Testing

This project uses the pytest framework to ensure the reliability and correctness of its core logic.

To run the tests, first install the project's development dependencies (which include pytest):

# Make sure you are in your project's root directory with venv activated
pip install .[dev]

Then, you can run the full test suite with a single command:

pytest

Current Coverage

Currently, the tests are focused on the most critical part of the data pipeline: the Spark transformations. The test file test/test_spark_transforms.py contains unit tests that validate the functions within the utils/spark_transforms.py module, ensuring that data is processed correctly.

This is a foundational test suite, and the goal is to expand coverage in the future to include other utility modules such as the downloader, file handler, and input validation to further guarantee the tool's robustness.
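
New tests can follow the same general pattern as test/test_spark_transforms.py: build a tiny in-memory DataFrame, apply a transformation, and assert on the result. The sketch below is hypothetical and not part of the repository; a plain DataFrame filter stands in for the project's transform functions.

    # test/test_example_transform.py -- hypothetical sketch, not shipped with the repository.
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # Small local session shared across the test session.
        return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

    def test_filter_keeps_only_matching_rows(spark):
        df = spark.createDataFrame(
            [("SOLAR ENERGY CORP",), ("FISHING COMPANY",)],
            ["Actor1Name"],
        )
        # Stand-in for a term filter: keep rows whose Actor1Name contains "SOLAR".
        result = df.filter(df.Actor1Name.contains("SOLAR"))
        assert result.count() == 1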


Project Structure

.
├── DataExtractionTool
│   ├── GDELT_Extractor.py               # Main file.
│   ├── test                             # Pytest folder.
│   │   ├── __init__.py
│   │   ├── test_data
│   │   │   ├── sample_gdelt_data.CSV
│   │   │   ├── sample_gdelt_lookup.txt
│   │   │   └── sample_manual_lookup.csv
│   │   └── test_spark_transforms.py
│   └── utils                            # Function modules.
│       ├── __init__.py
│       ├── downloader.py
│       ├── file_handler.py
│       ├── input_validation.py
│       ├── logger_config.py
│       ├── schema.py
│       ├── spark_manager.py
│       └── spark_transforms.py
├── assets                               # Files used for filtering and mapping.
│   ├── MASTER-GDELTDOMAINSBYCOUNTRY-MAY2018.txt
│   ├── cameo_dictionary
│   ├── cameo_dictionary:Zone.Identifier
│   ├── extended_lookup.csv
│   └── gdelt_headers.xlsx
├── data                                 # These data folders are created automatically when you run the application.
│   ├── gdelt_downloaded_data
│   └── merged_parquet
├── config.py                            # User-configurable settings (recommended edit for specific actor filtering)
├── pyproject.toml                       # Project definition and dependencies
└── README.md

Contributing

Contributions are welcome! If you have suggestions for improvements or find a bug, please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
