A command-line tool designed to efficiently download, process, and consolidate data from the GDELT Project. It leverages PySpark to handle large volumes of data and transforms it into local, query-friendly formats like CSV and Parquet.
This tool was created to simplify the initial data engineering phase of GDELT analysis, allowing researchers and data scientists to focus on generating insights rather than data wrangling.
- Automated Downloads: Fetch GDELT 1.0 data files for a specified date range.
- Parallel Processing: Utilizes a local PySpark session to process files in parallel, significantly speeding up data transformation (see the sketch after this list).
- Flexible Output: Save processed data as consolidated CSV or highly efficient Parquet files (saving up to 70% of disk space compared to the unzipped files).
- Configurable: Easily customize output paths, data columns, and processing settings through a simple configuration file.
- Command-Line Interface: Easy to use and integrate into automated data pipelines.
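To make the processing model concrete, here is a minimal sketch (not the tool's actual pipeline code) of how a local Spark session can read unzipped GDELT 1.0 files in parallel and write them back out as Parquet; the paths reuse the default folder names shown in the project structure section below:

```python
# Minimal sketch of the local-Spark processing model; illustrative only,
# not the tool's actual pipeline code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")               # use all local cores for parallel processing
    .appName("gdelt-local-sketch")
    .getOrCreate()
)

# GDELT 1.0 daily export files are tab-separated and have no header row.
events = spark.read.csv("data/gdelt_downloaded_data/*.CSV", sep="\t", header=False)

# Parquet output is columnar and compressed, which is where the large space
# savings over the raw unzipped CSV files come from.
events.write.mode("overwrite").parquet("data/merged_parquet")

spark.stop()
```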
Follow these steps to set up the tool and its environment.
Prerequisites:
- Python 3.9+ (developed and tested with 3.12)
- Java 17+ (required for PySpark; developed against OpenJDK 21.0.8)
Steps:
- Clone the repository:

  ```bash
  git clone <your-repository-url>
  cd GDELTLocalExtractor
  ```

- Create and activate a virtual environment:

  ```bash
  # For Linux/macOS
  python3 -m venv venv
  source venv/bin/activate

  # For Windows
  python -m venv venv
  .\venv\Scripts\activate
  ```

- Install the project and its dependencies. This command reads the pyproject.toml file and installs the tool along with all required libraries:

  ```bash
  pip install .
  ```

  For developers who intend to modify the source code, install in editable mode:

  ```bash
  pip install -e .
  ```
Before the first run, you must configure your settings in the config.py file located in the project's root directory. This file controls where data is downloaded and stored, and which data columns you wish to keep.
By default, the folders for downloaded data are created automatically inside this repository folder. The only parameters you are advised to change are FILTER_TERMS and FILTER_TERMS_COLUMNS, which let you filter the GDELT content toward a specific subject.
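For illustration, the two filter settings might look like the snippet below; the example terms and columns are hypothetical and should be replaced with whatever subject you are researching:

```python
# config.py (illustrative values only -- adapt to your own subject)

# Rows whose filter columns contain any of these terms are kept.
FILTER_TERMS = ["climate", "renewable energy"]

# GDELT 1.0 event columns in which the terms above are searched.
FILTER_TERMS_COLUMNS = ["Actor1Name", "Actor2Name", "SOURCEURL"]
```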
In DataExtractionTool/utils/schema.py there is a list of columns that are used and prioritized when filtering columns; this list can also be modified by the user.
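That column list is a plain Python list of GDELT 1.0 header names, roughly like the sketch below; the variable name and the exact selection here are placeholders, so check schema.py for the real list:

```python
# utils/schema.py (placeholder sketch -- the real file defines the authoritative list)
PRIORITY_COLUMNS = [
    "GLOBALEVENTID",
    "SQLDATE",
    "Actor1Name",
    "Actor2Name",
    "EventCode",
    "GoldsteinScale",
    "AvgTone",
    "SOURCEURL",
]
```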
Run the tool from the root of the project directory. The main command structure is:
```bash
python -m DataExtractionTool.GDELT_Extractor [options]
```

The tool accepts the following arguments to control its behavior:
| Argument | Description | Required / Default |
|---|---|---|
| `-s, --start_date` | The start date for the data range in `YYYY-MM-DD` format. | Required |
| `-e, --end_date` | The end date for the data range in `YYYY-MM-DD` format. | Required |
| `--chunk_size` | The number of days to download and process in each batch. Helps manage memory for very large date ranges. | Default: 5 |
| `-f, --filter` | A flag that, when present, activates the filtering logic defined in `config.py`. | Disabled |
| `-u, --only_download_and_unzip` | A special mode that only downloads the raw GDELT files and unzips them. It will NOT run the Spark processing pipeline (no filtering, no Parquet/consolidated CSV output). | Disabled |
Download and process all GDELT data from February 1st, 2024, to February 3rd, 2024. This will run the full Spark pipeline but will not apply the filters from config.py.
```bash
python -m DataExtractionTool.GDELT_Extractor --start_date "2024-02-01" --end_date "2024-02-03"
```

Run the full Spark pipeline and apply the FILTER_CONDITIONS specified in your config.py file.

```bash
python -m DataExtractionTool.GDELT_Extractor -s "2024-02-01" -e "2024-02-03" -f
```

Quickly fetch the raw data for a date range without processing it. This is useful for archiving or manual inspection. This command will not start a Spark session.

```bash
python -m DataExtractionTool.GDELT_Extractor -s "2024-02-01" -e "2024-02-03" -u
```

Process data for a large date range, applying filters and using a smaller chunk size to manage system resources effectively.

```bash
python -m DataExtractionTool.GDELT_Extractor -s "2024-01-01" -e "2024-03-31" -f --chunk_size 2
```

This project uses the pytest framework to ensure the reliability and correctness of its core logic.
To run the tests, first install the project's development dependencies (which includes pytest):
```bash
# Make sure you are in your project's root directory with venv activated
pip install .[dev]
```

Then, you can run the full test suite with a single command:

```bash
pytest
```

Currently, the tests are focused on the most critical part of the data pipeline: the Spark transformations. The test file test/test_spark_transforms.py contains unit tests that validate the functions within the utils/spark_transforms.py module, ensuring that data is processed correctly.
This is a foundational test suite, and the goal is to expand coverage in the future to include other utility modules such as the downloader, file handler, and input validation to further guarantee the tool's robustness.
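For contributors, a Spark-transform unit test typically follows the pattern sketched below: create a small local Spark session, build a tiny in-memory DataFrame, apply the transformation, and assert on the result. The inline filter expression is only a stand-in for the real functions in utils/spark_transforms.py, whose names are not shown here:

```python
# Sketch of the testing pattern used for Spark transforms (illustrative only).
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


@pytest.fixture(scope="session")
def spark():
    # Small local Spark session shared by all tests in the session.
    session = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    yield session
    session.stop()


def test_term_filter_keeps_matching_rows(spark):
    # Tiny in-memory DataFrame mimicking two GDELT columns.
    df = spark.createDataFrame(
        [("GOVERNMENT", "https://example.com/climate-summit"),
         ("BUSINESS", "https://example.com/quarterly-earnings")],
        ["Actor1Name", "SOURCEURL"],
    )

    # Stand-in for a real transform from utils/spark_transforms.py:
    # keep rows whose SOURCEURL contains the term "climate".
    result = df.filter(F.lower(F.col("SOURCEURL")).contains("climate"))

    assert result.count() == 1
    assert result.first()["Actor1Name"] == "GOVERNMENT"
```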
```
.
├── DataExtractionTool
│ ├── GDELT_Extractor.py # Main file.
│ ├── test # Pytest folder.
│ │ ├── __init__.py
│ │ ├── test_data
│ │ │ ├── sample_gdelt_data.CSV
│ │ │ ├── sample_gdelt_lookup.txt
│ │ │ └── sample_manual_lookup.csv
│ │ └── test_spark_transforms.py
│ └── utils # Function modules.
│ ├── __init__.py
│ ├── downloader.py
│ ├── file_handler.py
│ ├── input_validation.py
│ ├── logger_config.py
│ ├── schema.py
│ ├── spark_manager.py
│ └── spark_transforms.py
│
├── assets # Files used for filtering and mapping.
│ ├── MASTER-GDELTDOMAINSBYCOUNTRY-MAY2018.txt
│ ├── cameo_dictionary
│ ├── cameo_dictionary:Zone.Identifier
│ ├── extended_lookup.csv
│ └── gdelt_headers.xlsx
├── data # These data folders will be automatically created when you run the application.
│ ├── gdelt_downloaded_data
│ └── merged_parquet
├── config.py # User-configurable settings (Recommended edit for specific actor filtering)
├── pyproject.toml # Project definition and dependencies
└── README.md
```
Contributions are welcome! If you have suggestions for improvements or find a bug, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.