This project demonstrates the scalability of Spark on the Wikipedia dataset and its ability to perform offline feature extraction. The scripts let a user find articles that match supplied regular expressions, for use in training machine learning models.
The dataset can be found here: https://huggingface.co/datasets/wikimedia/wikipedia
Three scripts are provided:
download_wikipedia.py
- This script downloads the English Wikipedia dataset as a single file.

repartition.py
- This script uses Spark to repartition the data into convenient ~512 MB Parquet files with Snappy compression.

scan.py
- This script uses Spark to add a regular-expression match column to the dataset, filter out rows that lack any matches, and save the result to disk.

Illustrative sketches of each script follow.
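A rough sketch of what download_wikipedia.py could look like, assuming the Hugging Face `datasets` library and the 20231101.en configuration; the output file name is illustrative, not taken from the actual script.

```python
# Illustrative sketch only: the dataset configuration and output path are assumptions.
from datasets import load_dataset

# Download the English Wikipedia split from the Hugging Face Hub.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

# Persist everything as a single Parquet file for the repartition step.
wiki.to_parquet("wikipedia_en.parquet")
```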
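A minimal sketch of the repartition step, assuming the single-file input produced above; the paths, memory settings, and size-based partition count are assumptions rather than the script's actual values.

```python
# Illustrative sketch: paths, memory settings, and the partition-count heuristic are assumptions.
import math
import os

from pyspark.sql import SparkSession

TARGET_BYTES = 512 * 1024 * 1024  # aim for ~512 MB per output file

spark = (
    SparkSession.builder
    .appName("repartition-wikipedia")
    .config("spark.driver.memory", "8g")    # hardcoded settings, as noted below
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

df = spark.read.parquet("wikipedia_en.parquet")

# Estimate how many ~512 MB partitions the input needs from its on-disk size.
input_bytes = os.path.getsize("wikipedia_en.parquet")
num_partitions = max(1, math.ceil(input_bytes / TARGET_BYTES))

(
    df.repartition(num_partitions)
      .write.mode("overwrite")
      .option("compression", "snappy")   # Snappy-compressed Parquet output
      .parquet("wikipedia_en_repartitioned/")
)
```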
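A sketch of the scan step using a UDF (consistent with the future enhancement below that proposes replacing a UDF with a flatMap); the example pattern, column names, and paths are assumptions.

```python
# Illustrative sketch: the pattern, column names, and paths are assumptions.
import re

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("scan-wikipedia").getOrCreate()

PATTERN = re.compile(r"(?i)machine learning")  # example regular expression

@F.udf(returnType=ArrayType(StringType()))
def find_matches(text):
    """Return every match of the pattern found in the article text."""
    return PATTERN.findall(text or "")

df = spark.read.parquet("wikipedia_en_repartitioned/")

matched = (
    df.withColumn("matches", find_matches(F.col("text")))  # add the match column
      .filter(F.size("matches") > 0)                       # drop rows with no matches
)

matched.write.mode("overwrite").parquet("wikipedia_en_matched/")
```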
Note:
The driver/executor configuration is hardcoded in the repartition and scan scripts. This should likely be driven by a configuration file instead.
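One possible shape for the configuration-file approach is sketched below; the file name and keys are assumptions.

```python
# Illustrative sketch: spark_config.json and its keys are assumptions.
import json

from pyspark.sql import SparkSession

# e.g. {"spark.driver.memory": "8g", "spark.executor.memory": "8g"}
with open("spark_config.json") as f:
    settings = json.load(f)

builder = SparkSession.builder.appName("wikipedia-pipeline")
for key, value in settings.items():
    builder = builder.config(key, value)  # apply each setting from the file
spark = builder.getOrCreate()
```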
Future Enhancements:
- Remove unnecessary columns to save on storage space.
- Replace the UDF with a flatMap to reduce the time spent in the filter step (see the sketch after this list).
- Source a larger dataset and benchmark the performance of the scripts.
- Load driver/executor settings from a configuration file instead of hard-coding.
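The flatMap idea could look roughly like the sketch below, which emits only matching rows in a single pass instead of a UDF followed by a filter; column names, paths, and the pattern are assumptions.

```python
# Illustrative sketch of the proposed flatMap approach; column names, paths,
# and the pattern are assumptions.
import re

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("scan-wikipedia-flatmap").getOrCreate()

PATTERN = re.compile(r"(?i)machine learning")

def matches_only(row):
    """Yield a row only when the article text contains at least one match."""
    hits = PATTERN.findall(row.text or "")
    if hits:
        yield Row(id=row.id, title=row.title, matches=hits)

df = spark.read.parquet("wikipedia_en_repartitioned/")

# flatMap transforms and filters in one pass, so no separate filter step is needed.
matched = spark.createDataFrame(df.rdd.flatMap(matches_only))
matched.write.mode("overwrite").parquet("wikipedia_en_matched/")
```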