Skip to content

ArtemEsper/TopicGeometry

Repository files navigation

BigQuery Analytics Pipeline (Portfolio Overview)

This repository contains a production-grade data processing pipeline built around Google BigQuery and Google Cloud Storage, designed to ingest large historical datasets, clean and deduplicate records, and produce analytics-ready summary tables.

Architecture

GCS (CSV exports) → Raw Tables → Staging Tables → Enriched Views → Summary Tables

Key Features

  • File-based ingestion into BigQuery
  • Business-date partitioned tables
  • Deterministic deduplication and normalization
  • Incremental reprocessing of rolling time windows
  • SQL-based transformation and aggregation logic
  • Stable downstream tables insulated from ingestion changes

This portfolio entry highlights the analytics-engineering and ETL design aspects of the project. Domain-specific research logic is implemented on top of this foundation.

Relevance to Analytics Engineering

While the project originates from large-scale text data analysis, the underlying data engineering patterns are directly applicable to business analytics use cases, including:

  • CSV → GCS → BigQuery ingestion
  • Idempotent incremental loads
  • Partition-aware query design
  • Reproducible SQL transformations
  • Long-term historical data management

Data Cleaning Pipeline for JSTOR Metadata and N-gram Files

This project aims to automate the preprocessing of JSTOR metadata and n-gram files from the Constellate dataset. It cleans ASCII null characters and excess whitespace, prepares metadata for analysis, and uploads cleaned files to Google Cloud Storage for further processing in BigQuery.

Project Structure

  • config.py: Stores file paths and Google Cloud project information.
  • data_cleaning.py: Combines the logic for ASCII characters cleaning and setting correct line delimiters.
  • GCPstorage_upload.py: Sorts and uploads metadata and data files in separate folders (metadata, unigrams, bigrams and trigrams)
  • bq_upload.py: Uploads files from Cloud Storage to BigQuery. Create four tables 'meta_raw', 'unigrams_raw', 'bigrams_raw' and 'trigrams_raw'.
  • pipline.py: Orchestrates the cleaning and uploading steps.

Getting Started

  1. Prerequisites:

    • Python 3.x
    • Required Python libraries: csv, re, os, chardet, shutil, google-cloud-storage, google.api_core.exceptions
    • Google Cloud SDK for uploading cleaned files to Google Cloud Storage
  2. Configuration:

    • Set local file paths, bucket details and BQ dataset in config.py.
    • Example:
      # config.py
      RAW_META_DATA_PATH = "metadata file path"
      CLEANED_META_DATA_PATH = "cleaned metadata file path"
      RAW_DATA_PATH = "path to a folder with uni-,bi-, and trigrams csv files"
      CLEANED_DATA_PATH = "path to a folder with cleaned uni-,bi-, and trigrams csv files"
      BUCKET_NAME = "GCP_bucket_name"
      PROJECT_ID = "GCP_project_name"
      DATASET_ID = "BigQuery dataset name"

Pipeline Workflow

Step 1: Clean Metadata Files (part of pipline.py)

The data_cleaning.py script cleans ASCII null characters from metadata files and sets correct newline delimiters.

Process: Loops through all files in RAW_META_DATA_PATH and RAW_DATA_PATH. Cleans each file of ASCII null characters and preprocesses whitespace and adds newline delimiters to n-gram and other metadata files. Saves cleaned meta files in CLEANED_META_DATA_PATH and data files in CLEANED_DATA_PATH.

  • Usage:
    python data_cleaning.py
    

Step 2: Sort and upload files to GCP bucket (part of pipline.py)

The GCPstorage_upload.py script uploads metadata and n-gram files to corresponding folders GCP clous storage.

  • Usage:
    python GCPstorage_upload.py
    

Process: Sort files in CLEANED_DATA_PATH into three categories unigrams, bigrams, trigrams and save in Google Cloud Storage in
corresponding folders. Saves files from CLEANED_META_DATA_PATH to meta folder in Google Cloud Storage.

Step 3: Load Cleaned Files into BigQuery (part of pipline.py)

The bq_upload.py script uploads csv files from the Google Cloud Storage folders 'meta', 'unigrams', 'bigrams' and 'trigrams' to corresponding tables 'meta_raw', 'unigrams_raw', 'bigrams_raw' and 'trigrams_raw'.

  • Usage:
    python bq_upload.py
    

Combine steps 1 to 3

Combines the cleaning and uploading steps described above. RESULT is four tables 'meta_raw', 'unigrams_raw', 'bigrams_raw' and 'trigrams_raw' in BQ.

  • Usage:
    python pipline.py
    

Step 4: Deduplicate the meta_raw table and other ngram tables in BQ based on title, year and author document metadata.

Important: BigQuery does not support multi-table transactions. We do this in separate steps: first identify duplicates, decide which id to keep, then remove the unwanted ids from the n-gram tables.

  • Usage:
    python deduplication.py
    

Step 5: Clean uni, bi and trigrams tables from the stopwords and various special characters like punctuation and etc.

Use files s5_uni_cleaning.sql, s5_bi_cleaning.sql, s5_tri_cleaning.sql. The cleaning is incomplete and can be updated to remove more irrelevant words that can not be a part of scientific ontology.

Step 6: Extract unique keywords from metadata table and distribute them across three tables of unigrams, bigrams and trigrams

Use s5_keyword_extract.sql to separately extract uni-, bi- and trigrams from the 'keyword' column in the meta table and attribute keywords with uniq id for the extracted keywords and add appropriate prefix U_, B_ and T_.

Step 7: Update the keywords tables with the manually selected most frequent words from a cleaned uni-, bi- and trigram tables

and give to a new concepts distinct id with UN_, BN_ and TN_ prefixes

The most frequent keywords that can be considered as a concepts stored in 'selected_keywords_uni.csv', 'selected_keywords_bi.csv' and 'selected_keywords_tri.csv' files.

Step 8: Join JSTOR meta table with WOS_references and WOS_addresses to extract countries and journal titles

-- metatable_jstor_wos.sql

Step 9: Join the keywords table from step 9 and uni-, bi- and trigram tables to find their frequency across documents

-- tables_match.sql

Notes

Ensure config.py is correctly configured with your file paths and Google Cloud details before running the pipeline.

The cleaning steps apply only to files that match the designated paths in config.py. The initial JSTOR dataset on 'International security' topic can be found on Zenodo repository 10.5281/zenodo.14638434.

Future Improvements

Implement logging for more robust monitoring. Add validation steps to ensure data consistency before uploading. Implement more extensive cleaning.

License

This project is licensed under the MIT License.

Explanation of README.md Sections

  • Project Structure: Summarizes the role of each main file.
  • Getting Started: Lists prerequisites and setup steps for config.py.
  • Pipeline Workflow: Details each step of the ETL process, with usage instructions.
  • Notes: Provides reminders for setup verification and testing individual scripts.
  • Future Improvements: Suggests potential upgrades for pipeline robustness.

About

Concept frequency extraction pipline and information agents impact processing (under construction)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors