This project explores and analyzes real-world datasets from the City of Chicago using SQL within a Jupyter Notebook. It was completed as the final project for Course 5 – Databases and SQL for Data Science with Python in the IBM Data Engineering Professional Certificate.
The analysis is performed using SQLite and %sql magic commands, combining Python and SQL to uncover insights about crime rates, public schools, and socioeconomic conditions in Chicago.
- Create and connect to a SQLite database using Jupyter and SQL magic
- Import CSV datasets into pandas and persist them as relational tables in SQLite
- Perform SQL queries to extract insights using filtering, grouping, and aggregation
- Apply subqueries and joins to analyze relationships across datasets
- Investigate correlations between crime rates, community demographics, and school safety
| Category | Tools/Technologies |
|---|---|
| Languages | Python, SQL |
| Libraries | pandas, sqlite3, ipython-sql |
| Database | SQLite |
| Environment | Jupyter Notebook |
| File Formats | CSV |
-
Chicago Census Data
Socioeconomic indicators by community area (2008–2012)
Source -
Chicago Public Schools Progress Report Cards
Performance and safety data (2011–2012)
Source -
Chicago Crime Data
Reported crimes from 2001 to present
Source
Note: Cleaned and reduced versions of the datasets were provided by IBM for educational use.
- Created a new SQLite database and established a connection using
%load_ext sql - Loaded three CSV files into pandas DataFrames
- Created corresponding tables in SQLite using pandas
.to_sql()method - Queried data using SQL operations:
- Filtering with
WHEREandLIKEclauses - Joining and grouping data across tables
- Using aggregation functions (
COUNT,AVG) - Writing subqueries to solve analytical problems
- Filtering with
- Answered 10 domain-specific questions involving:
- Crimes involving minors and children
- School safety scores grouped by type
- Community areas with high poverty and hardship
- Identifying the most crime-prone neighborhood
-- Total number of crimes
SELECT COUNT(*) FROM CHICAGO_CRIME_DATA;
-- Community areas with per capita income < $11,000
SELECT COMMUNITY_AREA_NAME, COMMUNITY_AREA_NUMBER
FROM CENSUS_DATA
WHERE PER_CAPITA_INCOME < 11000;
-- Crimes committed at school locations (distinct types)
SELECT DISTINCT PRIMARY_TYPE
FROM CHICAGO_CRIME_DATA
WHERE LOCATION_DESCRIPTION LIKE '%SCHOOL%';- Riverdale had both the highest hardship index and highest percentage of households below the poverty line.
- Austin (Community Area 25) was the most crime-prone area in the dataset.
- Safety scores varied across school types, with high schools and elementary schools showing similar average ratings.
SQL_Chicago_Data_Analysis_Project/
├── README.md # Project overview and instructions
├── data/
│ ├── ChicagoCensusData.csv # Socioeconomic data by community area
│ ├── ChicagoCrimeData.csv # Crime records
│ └── ChicagoPublicSchools.csv # School performance and safety scores
├── images/ # Screenshots of database setup and SQL query results
│ ├── low_income_areas.png # Community areas with low per capita income (Problem 2)
│ ├── most_crime_prone_area.png # Area with the highest crime count using a subquery (Problem 10)
│ ├── school_crimes_summary.png # Types of crimes reported at school locations (Problem 5)
│ ├── table_creation_and_database_setup.png # Creating SQLite DB and loading tables using pandas and df.to_sql()
│ └── total_crimes_count.png # Total number of crimes in the dataset (Problem 1)
├── notebook/
│ └── SQL_Chicago_Data_Analysis_Project.ipynb # Final notebook with all tasks and query outputs
This project was completed as part of the IBM Data Engineering Professional Certificate and is intended for educational use.