This repository contains an end-to-end workflow for analyzing CAD/RMS and domestic violence (DV) data reported to the State of New Jersey from January 2023 through October 31, 2025. It includes automation for exporting Excel files, AI-assisted data profiling, ETL utilities, and documentation to support data quality reviews, demographic insights, and GIS-ready outputs.
## Project Structure

```
dv_doj/
├── raw_data/
│   ├── xlsx/                         # Source Excel workbooks
│   └── csv/                          # Optional CSV inputs
├── output/                           # Auto-generated CSV exports
├── analysis/
│   └── ai_responses/                 # AI-generated analysis reports
├── etl_scripts/                      # All Python utilities and pipelines
│   ├── ai_data_analyzer.py           # Main AI data profiler
│   ├── backfill_dv.py                # CSV-first DV backfill + validation suite
│   ├── base_etl.py                   # ETL framework and helpers
│   ├── export_excel_sheets_to_csv.py # Excel ➜ CSV converter
│   ├── fix_dv_headers.py             # DV header normalization
│   ├── transform_dv_data.py          # DV data consolidation
│   ├── map_dv_to_rms_locations.py    # DV ↔ RMS location mapping
│   ├── verify_transformations.py     # Transformation QA checks
│   ├── check_dv_columns.py           # Column diagnostics
│   ├── examine_dv_structure.py       # Structural inspection helper
│   ├── git_automation.py             # Git/GitHub automation utilities
│   └── quick_analysis.py             # Lightweight inspection script
├── processed_data/                   # Cleaned & transformed data
├── logs/                             # Pipeline and analysis logs
├── notebooks/                        # Jupyter exploration notebooks
├── data/                             # Additional reference data
├── docs/
│   └── archive/                      # Historical docs, guides, requirements
├── PROJECT_SUMMARY.md                # Project-wide summary
└── README.md                         # You are here
```
## Quickstart

1. **Install dependencies**

   ```bash
   make setup
   ```

   This creates `.venv`, installs the project in editable mode, and adds QA tooling (`ruff`, `mypy`, `pytest`). Prefer `pip install -e .` if `make` is unavailable.
2. **Convert Excel workbooks to CSV (optional)**

   ```bash
   python etl.py export --src raw_data/xlsx --out output
   ```

   CSV exports are written to `output/` and logged in `output/conversion_summary.json`.
3. **Profile the data with AI**

   ```bash
   python etl.py profile --src output --out analysis/ai_responses
   ```

   The script inspects every Excel/CSV file in `raw_data/` and `output/`, producing detailed JSON reports in `analysis/ai_responses/`.
4. **Build or run ETL pipelines**

   ```bash
   python etl.py transform --src processed_data --out processed_data
   ```

   The CLI orchestrates the header fix, data transformation, mapping, and verification helpers.
5. **Backfill & validate DV outputs**

   ```bash
   python -m etl_scripts.backfill_dv \
       --dv processed_data/_2023_2025_10_31_dv_fixed_transformed_transformed.csv \
       --rms raw_data/xlsx/output/_2023_2025_10_31_dv_rms.csv \
       --cad raw_data/xlsx/output/_2023_2025_10_31_dv_cad.csv \
       --out processed_data/dv_final.csv
   ```

   The script merges RMS/CAD context, flags missing case numbers/ages, writes `processed_data/dv_final.csv`, and emits validation artifacts in `logs/` (`validation_report.txt`, `quality_metrics.json`).
6. **Explore results**
   - Review AI outputs in `analysis/ai_responses/`
   - Inspect cleaned files in `processed_data/`
   - Use notebooks in `notebooks/` for ad hoc analysis
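At its core, the backfill step is a left join of RMS/CAD context onto the DV export, followed by missing-value flags. A minimal sketch of that pattern in pandas (the column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Illustrative frames standing in for the DV export and the RMS extract.
dv = pd.DataFrame({
    "case_number": ["23-001", None, "23-003"],
    "victim_age": [34, 27, None],
})
rms = pd.DataFrame({
    "case_number": ["23-001", "23-003"],
    "location": ["123 Main St", "456 Oak Ave"],
})

# Left-join RMS context onto the DV records, keeping every DV row.
merged = dv.merge(rms, on="case_number", how="left")

# Quality flags analogous to CN_Flag (missing case number)
# and VA_Flag (missing victim age).
merged["CN_Flag"] = merged["case_number"].isna()
merged["VA_Flag"] = merged["victim_age"].isna()

print(merged[["case_number", "location", "CN_Flag", "VA_Flag"]])
```

Rows lacking a case number keep their DV fields but gain no RMS context, which is exactly what the validation report is meant to surface.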
## Tooling & Quality

- **Environment** — `pyproject.toml` pins runtime dependencies; `make setup` provisions a local `.venv`.
- **CLI** — `python etl.py` exposes `export`, `profile`, `transform`, `map`, and `verify` workflows (implemented with Click).
- **Quality Gates** — `make qa` runs `ruff`, `mypy`, and `pytest`; `make fmt` auto-formats imports and style issues.
- **Tests** — Extensive unit and integration coverage (`pytest tests/test_backfill_functions_edge_cases.py` + `tests/integration/test_suite.py`).
- **CI** — `.github/workflows/ci.yml` executes the same lint/type/test pipeline on Windows runners for every push and PR.
## Command Reference

Run the commands from the repository root using the project virtual environment (`.venv\Scripts\python` on Windows):

| Command | Purpose | Example |
|---|---|---|
| `export` | Convert Excel workbooks to CSV. | `python etl.py export --src raw_data/xlsx --out output` |
| `profile` | Generate AI-driven profiling reports. | `python etl.py profile --src output --out analysis/ai_responses` |
| `transform` | Apply DV-specific transformations. | `python etl.py transform --src processed_data --out processed_data` |
| `map` | Join DV and RMS data for mapping. | `python etl.py map --src processed_data --out processed_data` |
| `verify` | Emit a JSON verification report. | `python etl.py verify --src processed_data --out logs` |
| `python -m etl_scripts.backfill_dv` | Backfill RMS/CAD context, flag data quality, and write validation logs. | `python -m etl_scripts.backfill_dv --dv ... --rms ... --cad ... --out processed_data/dv_final.csv` |
## Data Sources

- `_2023_2025_10_31_dv.xlsx` (245 KB) – summary DV reporting
- `_2023_2025_10_31_dv_cad.xlsx` (19 MB) – CAD incidents
- `_2023_2025_10_31_dv_rms.xlsx` (17 MB) – RMS records
- Time span: January 2023 through October 31, 2025

Converted CSV counterparts are stored in `output/`. All sensitive data should remain within the raw/processed directories, which are excluded from version control.
## Typical Workflow

1. Place raw Excel or CSV files into `raw_data/xlsx/` or `raw_data/csv/`.
2. Run `ai_data_analyzer.py` to generate:
   - Column inventories with samples
   - Random record spot checks
   - Data-quality metrics (nulls, outliers, suspicious values)
   - AI prompts for deeper exploration
3. Use ETL scripts to clean and standardize:
   - `etl_scripts/fix_dv_headers.py` – normalize column names and booleans
   - `etl_scripts/transform_dv_data.py` – consolidate race, ethnicity, time fields, and more
   - `etl_scripts/map_dv_to_rms_locations.py` – join DV cases to RMS locations for GIS
   - `etl_scripts/verify_transformations.py` – confirm transformations succeeded
4. Load cleaned data into visualization tools, GIS, or downstream analytics.
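The header fix boils down to two operations: collapsing whitespace and punctuation in column names, and coercing yes/no text into booleans. A rough sketch of that idea (the function names are illustrative, not the actual API of `fix_dv_headers.py`):

```python
import re

def normalize_header(name: str) -> str:
    """Lowercase, trim, and convert runs of non-alphanumerics to underscores."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")

def to_bool(value: str):
    """Map common yes/no spellings to booleans; anything else becomes None."""
    mapping = {"y": True, "yes": True, "true": True,
               "n": False, "no": False, "false": False}
    return mapping.get(str(value).strip().lower())

print(normalize_header("Victim  Age (Years)"))  # victim_age_years
print(to_bool("Yes"))                           # True
```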
## Key Scripts

- `etl_scripts/ai_data_analyzer.py` – Automates profiling of every source file and saves JSON reports with human-readable findings and AI-ready prompts.
- `etl_scripts/base_etl.py` – Provides the `BaseETL` framework plus demographic helpers for reusable Extract-Transform-Load pipelines.
- `etl_scripts/backfill_dv.py` – Reconciles DV outputs with RMS/CAD exports (CSV-first), adds quality flags (`CN_Flag`, `VA_Flag`), standardizes times, drops incomplete records, and writes validation reports/metrics.
- `etl_scripts/export_excel_sheets_to_csv.py` – Batch converts each sheet in supported Excel workbooks to CSV with progress bars, error handling, and execution logs.
- `etl_scripts/git_automation.py` – Wraps Git commands to initialize repositories, commit/push changes, create tags, and sync with GitHub.
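The real `BaseETL` interface lives in `etl_scripts/base_etl.py` and isn't reproduced here, but frameworks of this kind typically follow the template-method pattern: subclasses supply `extract` and `transform`, while the base class sequences the run. A sketch under that assumption (method names are illustrative):

```python
from abc import ABC, abstractmethod

class BaseETL(ABC):
    """Template-method skeleton: subclasses supply extract/transform."""

    @abstractmethod
    def extract(self):
        ...

    @abstractmethod
    def transform(self, records):
        ...

    def load(self, records):
        # Default sink collects results; a real pipeline would write CSV.
        self.results = list(records)

    def run(self):
        self.load(self.transform(self.extract()))
        return self.results

class UppercaseNames(BaseETL):
    def extract(self):
        return [{"name": "alice"}, {"name": "bob"}]

    def transform(self, records):
        return [{**r, "name": r["name"].upper()} for r in records]

pipeline = UppercaseNames()
print(pipeline.run())  # [{'name': 'ALICE'}, {'name': 'BOB'}]
```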
## Outputs

- `analysis/ai_responses/` – `(unknown)_analysis_{timestamp}.json` and `(unknown)_prompts_{timestamp}.json` contain profiling results and follow-up prompt templates.
- `processed_data/` – Stores cleaned datasets, such as `_2023_2025_10_31_dv_fixed_transformed.xlsx`, ready for reporting or GIS use.
- `logs/analysis.log` – Captures ETL and analyzer execution details for troubleshooting.
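The profiling reports center on per-column quality metrics. Computing a null rate per column, for instance, needs only the standard library; the real analyzer's output schema may differ, and the inline CSV below is synthetic:

```python
import csv
import io

# Inline CSV standing in for one exported sheet; empty cells count as nulls.
sample = io.StringIO("case_number,victim_age\n23-001,34\n,27\n23-003,\n")

rows = list(csv.DictReader(sample))
# Fraction of rows where each column is empty.
null_rates = {
    col: sum(1 for r in rows if not r[col]) / len(rows)
    for col in rows[0]
}
print(null_rates)  # each column is missing in 1 of 3 rows
```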
## Version Control

Manual workflow:

```bash
git add .
git commit -m "Describe your changes"
git push origin main
```

Automation (optional):

```bash
python etl_scripts/git_automation.py --status
python etl_scripts/git_automation.py --commit-push "Your commit message"
```

The remote repository is hosted at https://github.com/racmac57/dv_doj.git. Make sure your Git credentials are configured (GitHub token or HTTPS credentials) before pushing changes.
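`git_automation.py`'s internals aren't shown here; wrappers like it usually shell out via `subprocess`. A minimal sketch with a dry-run switch (the `dry_run` flag and function name are illustrative, not the script's actual options):

```python
import subprocess

def commit_and_push(message: str, branch: str = "main", dry_run: bool = False):
    """Stage, commit, and push; with dry_run=True, just return the commands."""
    commands = [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
    ]
    if dry_run:
        return commands
    for cmd in commands:
        # check=True raises CalledProcessError if any git step fails.
        subprocess.run(cmd, check=True)
    return commands

print(commit_and_push("Update ETL docs", dry_run=True))
```

Passing argument lists (rather than a shell string) avoids quoting issues in commit messages.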
## Documentation

- `PROJECT_SUMMARY.md` – high-level overview of objectives and accomplishments
- `docs/archive/START_HERE.md` – orientation guide and project history
- `docs/archive/ANALYSIS_SUMMARY.md` – synthesized findings from prior data reviews
- `docs/archive/QUICKSTART.md` – concise walkthrough for new users
- `docs/archive/SETUP_GIT.md` – step-by-step instructions for configuring GitHub access
- `docs/archive/TRANSFORMATION_SUMMARY.md` – detailed transformation notes
- `docs/data_dictionary.md` – canonical field definitions, types, and examples
- `docs/pii_policy.md` – expectations for handling sensitive information
- `docs/mappings/` – CSV lookup tables referenced by the ETL pipeline

The archived documents capture historical context, prior analysis, and setup guides.
## Troubleshooting

- Review `logs/analysis.log` for execution issues.
- Check AI reports in `analysis/ai_responses/` for data anomalies.
- Consult the documentation listed above for deeper dives.
## Next Steps

- Expand pytest coverage (target ≥80%) across `etl_scripts/`.
- Introduce secret scanning/pre-commit hooks and maintain a published `.env.example`.
- Profile large Excel workloads (≥20 MB) and document and, where possible, optimize long-running steps.
- Broaden GitHub Actions coverage (e.g., add an `ubuntu-latest` matrix) and cache dependencies.
- Automate semantic versioning or release tagging once the toolchain stabilizes.
## Known Risks

- Potential PII exposure if raw data leaks: enforce secret scanning and redact outputs.
- Performance limits on large files: profile with `cProfile`, and consider batching or chunked reads.
- CI is currently Windows-only: add cross-platform runners to catch portability issues.
- Release management: no automated tagging/versioning yet; plan for semantic releases.
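For the large-file risk above, pandas' `chunksize` option keeps memory flat by streaming a CSV in fixed-size batches instead of loading one big frame. A sketch with synthetic data:

```python
import io
import pandas as pd

# Synthetic 10-row CSV standing in for a multi-GB export.
buf = io.StringIO("id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

total = 0
n_rows = 0
# chunksize=4 yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(buf, chunksize=4):
    total += chunk["value"].sum()
    n_rows += len(chunk)

print(n_rows, total)  # 10 rows processed; values 0,2,...,18 sum to 90
```

The same aggregation logic runs per chunk, so peak memory depends on the chunk size rather than the file size.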
Happy analyzing!