Extract

Data Crawler Project

Extract

Utilized government procurement & ACRA dataset from data.gov.sg

Transform

Normalize text (e.g., agency names, supplier names).
Standardize date formats.
Clean categorical variables.
Derive new columns where useful.

Load

Designed a star schema with fact and dimensions tables
Loaded data into PostgreSQL using SQLAlchemy.

Future Enhancements

Load DataFrame to Microsoft Azure SQL Database
Train text classifier for procurement_description column
Use tools like Airflow for scheduling & orchestration
Data quality testing with PyTest or Great Expectations

Basic database schema (for procurement dataset only)

dim_agency(agency_id, agency_name, agency_sub)
dim_supplier(supplier_id, supplier_name)
dim_tender(tender_id, tender_no, description, year)
dim_status(status_id, status)
fact_tender_supplier(fact_id, tender_id, supplier_id, status_id, awarded_amt)

Sample Queries

SELECT COUNT(*) 
FROM fact_tender_supplier f
JOIN dim_tender t ON f.tender_id = t.tender_id
WHERE t.year = 2024;

# Top 5 suppliers by awarded amount
SELECT s.supplier_name, SUM(f.awarded_amt) AS total_awarded
FROM fact_tender_supplier f
JOIN dim_supplier s ON f.supplier_id = s.supplier_id
GROUP BY s.supplier_name
ORDER BY total_awarded DESC
LIMIT 5;

Sample Chart

Top 10 Suppliers in 2024

query = """
SELECT s.supplier_name, SUM(f.awarded_amt) AS total_awarded
FROM fact_awards f
JOIN dim_supplier s ON f.supplier_id = s.supplier_id
WHERE f.award_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY s.supplier_name
ORDER BY total_awarded DESC
LIMIT 10;
"""
df_suppliers_2024 = pd.read_sql(text(query), engine)

# plot
df_suppliers_2024 = df_suppliers_2024.sort_values("total_awarded", ascending=True)  # small → large

plt.figure(figsize=(10,6))
plt.barh(df_suppliers_2024['supplier_name'], df_suppliers_2024['total_awarded'])
plt.title("Top 10 Suppliers by Awarded Amount (2024)")
plt.xlabel("Total Awarded ($ Billions)")
plt.ylabel("Supplier")
plt.show()

Setup Instructions

Create new .env file and add in your API key
Update PostgreSQL login details and port

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
datasets		datasets
prototype		prototype
.gitignore		.gitignore
Charts.ipynb		Charts.ipynb
README.md		README.md
acraEntities.ipynb		acraEntities.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Extract

Transform

Load

Future Enhancements

Basic database schema (for procurement dataset only)

Sample Queries

Sample Chart

Top 10 Suppliers in 2024

Setup Instructions

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

khairiah/gov_procurement

Folders and files

Latest commit

History

Repository files navigation

Extract

Transform

Load

Future Enhancements

Basic database schema (for procurement dataset only)

Sample Queries

Sample Chart

Top 10 Suppliers in 2024

Setup Instructions

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages