This repository contains an end-to-end workflow for analyzing CAD/RMS and domestic violence (DV) data reported to the State of New Jersey from January 2023 through October 31, 2025. It includes automation for exporting Excel files, AI-assisted data profiling, ETL utilities, and documentation to support data quality reviews, demographic insights, and GIS-ready outputs.
## Project Structure

```
dv_doj/
├── raw_data/
│   ├── xlsx/                         # Source Excel workbooks
│   └── csv/                          # Optional CSV inputs
├── output/                           # Auto-generated CSV exports
├── analysis/
│   └── ai_responses/                 # AI-generated analysis reports
├── etl_scripts/                      # All Python utilities and pipelines
│   ├── ai_data_analyzer.py           # Main AI data profiler
│   ├── backfill_dv.py                # CSV-first DV backfill + validation suite
│   ├── base_etl.py                   # ETL framework and helpers
│   ├── export_excel_sheets_to_csv.py # Excel ➜ CSV converter
│   ├── fix_dv_headers.py             # DV header normalization
│   ├── transform_dv_data.py          # DV data consolidation
│   ├── map_dv_to_rms_locations.py    # DV ↔ RMS location mapping
│   ├── verify_transformations.py     # Transformation QA checks
│   ├── check_dv_columns.py           # Column diagnostics
│   ├── examine_dv_structure.py       # Structural inspection helper
│   ├── git_automation.py             # Git/GitHub automation utilities
│   └── quick_analysis.py             # Lightweight inspection script
├── processed_data/                   # Cleaned & transformed data
├── logs/                             # Pipeline and analysis logs
├── notebooks/                        # Jupyter exploration notebooks
├── data/                             # Additional reference data
├── docs/
│   └── archive/                      # Historical docs, guides, requirements
├── PROJECT_SUMMARY.md                # Project-wide summary
└── README.md                         # You are here
```
## Quickstart

1. **Install dependencies**

   ```bash
   make setup
   ```

   This creates `.venv`, installs the project in editable mode, and adds QA tooling (`ruff`, `mypy`, `pytest`). Prefer `pip install -e .` if `make` is unavailable.
2. **Convert Excel workbooks to CSV (optional)**

   ```bash
   python etl.py export --src raw_data/xlsx --out output
   ```

   CSV exports are written to `output/` and logged in `output/conversion_summary.json`.
3. **Profile the data with AI**

   ```bash
   python etl.py profile --src output --out analysis/ai_responses
   ```

   The script inspects every Excel/CSV file in `raw_data/` and `output/`, producing detailed JSON reports in `analysis/ai_responses/`.
4. **Build or run ETL pipelines**

   ```bash
   python etl.py transform --src processed_data --out processed_data
   ```

   The CLI orchestrates the header fix, data transformation, mapping, and verification helpers.
5. **Backfill & validate DV outputs**

   ```bash
   python -m etl_scripts.backfill_dv \
       --dv processed_data/_2023_2025_10_31_dv_fixed_transformed_transformed.csv \
       --rms raw_data/xlsx/output/_2023_2025_10_31_dv_rms.csv \
       --cad raw_data/xlsx/output/_2023_2025_10_31_dv_cad.csv \
       --out processed_data/dv_final.csv
   ```

   The script merges RMS/CAD context, flags missing case numbers/ages, writes `processed_data/dv_final.csv`, and emits validation artifacts in `logs/` (`validation_report.txt`, `quality_metrics.json`).
6. **Explore results**
   - Review AI outputs in `analysis/ai_responses/`
   - Inspect cleaned files in `processed_data/`
   - Use notebooks in `notebooks/` for ad hoc analysis
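At its core, the backfill step is a left join of RMS/CAD context onto the DV export, followed by missing-value flags. A minimal sketch of that pattern in pandas (the column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Illustrative frames standing in for the DV export and the RMS extract.
dv = pd.DataFrame({
    "case_number": ["23-001", None, "23-003"],
    "victim_age": [34, 27, None],
})
rms = pd.DataFrame({
    "case_number": ["23-001", "23-003"],
    "location": ["123 Main St", "456 Oak Ave"],
})

# Left-join RMS context onto the DV records, keeping every DV row.
merged = dv.merge(rms, on="case_number", how="left")

# Quality flags analogous to CN_Flag (missing case number)
# and VA_Flag (missing victim age).
merged["CN_Flag"] = merged["case_number"].isna()
merged["VA_Flag"] = merged["victim_age"].isna()

print(merged[["case_number", "location", "CN_Flag", "VA_Flag"]])
```

Rows lacking a case number keep their DV fields but gain no RMS context, which is exactly what the validation report is meant to surface.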
## Tooling & Quality

- **Environment** — `pyproject.toml` pins runtime dependencies; `make setup` provisions a local `.venv`.
- **CLI** — `python etl.py` exposes `export`, `profile`, `transform`, `map`, and `verify` workflows (implemented with Click).
- **Quality Gates** — `make qa` runs `ruff`, `mypy`, and `pytest`; `make fmt` auto-formats imports and style issues.
- **Tests** — Extensive unit and integration coverage (`pytest tests/test_backfill_functions_edge_cases.py` + `tests/integration/test_suite.py`).
- **CI** — `.github/workflows/ci.yml` executes the same lint/type/test pipeline on Windows runners for every push and PR.
## Command Reference

Run the commands from the repository root using the project virtual environment (`.venv\Scripts\python` on Windows):

| Command | Purpose | Example |
|---|---|---|
| `export` | Convert Excel workbooks to CSV. | `python etl.py export --src raw_data/xlsx --out output` |
| `profile` | Generate AI-driven profiling reports. | `python etl.py profile --src output --out analysis/ai_responses` |
| `transform` | Apply DV-specific transformations. | `python etl.py transform --src processed_data --out processed_data` |
| `map` | Join DV and RMS data for mapping. | `python etl.py map --src processed_data --out processed_data` |
| `verify` | Emit a JSON verification report. | `python etl.py verify --src processed_data --out logs` |
| `python -m etl_scripts.backfill_dv` | Backfill RMS/CAD context, flag data quality, and write validation logs. | `python -m etl_scripts.backfill_dv --dv ... --rms ... --cad ... --out processed_data/dv_final.csv` |
## Data Sources

- `_2023_2025_10_31_dv.xlsx` (245 KB) – summary DV reporting
- `_2023_2025_10_31_dv_cad.xlsx` (19 MB) – CAD incidents
- `_2023_2025_10_31_dv_rms.xlsx` (17 MB) – RMS records
- Time span: January 2023 through October 31, 2025

Converted CSV counterparts are stored in `output/`. All sensitive data should remain within the raw/processed directories, which are excluded from version control.
## Typical Workflow

1. Place raw Excel or CSV files into `raw_data/xlsx/` or `raw_data/csv/`.
2. Run `ai_data_analyzer.py` to generate:
   - Column inventories with samples
   - Random record spot checks
   - Data-quality metrics (nulls, outliers, suspicious values)
   - AI prompts for deeper exploration
3. Use ETL scripts to clean and standardize:
   - `etl_scripts/fix_dv_headers.py` – normalize column names and booleans
   - `etl_scripts/transform_dv_data.py` – consolidate race, ethnicity, time fields, and more
   - `etl_scripts/map_dv_to_rms_locations.py` – join DV cases to RMS locations for GIS
   - `etl_scripts/verify_transformations.py` – confirm transformations succeeded
4. Load cleaned data into visualization tools, GIS, or downstream analytics.
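The header fix boils down to two operations: collapsing whitespace and punctuation in column names, and coercing yes/no text into booleans. A rough sketch of that idea (the function names are illustrative, not the actual API of `fix_dv_headers.py`):

```python
import re

def normalize_header(name: str) -> str:
    """Lowercase, trim, and convert runs of non-alphanumerics to underscores."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")

def to_bool(value: str):
    """Map common yes/no spellings to booleans; anything else becomes None."""
    mapping = {"y": True, "yes": True, "true": True,
               "n": False, "no": False, "false": False}
    return mapping.get(str(value).strip().lower())

print(normalize_header("Victim  Age (Years)"))  # victim_age_years
print(to_bool("Yes"))                           # True
```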
## Key Scripts

- `etl_scripts/ai_data_analyzer.py` – Automates profiling of every source file and saves JSON reports with human-readable findings and AI-ready prompts.
- `etl_scripts/base_etl.py` – Provides the `BaseETL` framework plus demographic helpers for reusable Extract-Transform-Load pipelines.
- `etl_scripts/backfill_dv.py` – Reconciles DV outputs with RMS/CAD exports (CSV-first), adds quality flags (`CN_Flag`, `VA_Flag`), standardizes times, drops incomplete records, and writes validation reports/metrics.
- `etl_scripts/export_excel_sheets_to_csv.py` – Batch converts each sheet in supported Excel workbooks to CSV with progress bars, error handling, and execution logs.
- `etl_scripts/git_automation.py` – Wraps Git commands to initialize repositories, commit/push changes, create tags, and sync with GitHub.
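The real `BaseETL` interface lives in `etl_scripts/base_etl.py` and isn't reproduced here, but frameworks of this kind typically follow the template-method pattern: subclasses supply `extract` and `transform`, while the base class sequences the run. A sketch under that assumption (method names are illustrative):

```python
from abc import ABC, abstractmethod

class BaseETL(ABC):
    """Template-method skeleton: subclasses supply extract/transform."""

    @abstractmethod
    def extract(self):
        ...

    @abstractmethod
    def transform(self, records):
        ...

    def load(self, records):
        # Default sink collects results; a real pipeline would write CSV.
        self.results = list(records)

    def run(self):
        self.load(self.transform(self.extract()))
        return self.results

class UppercaseNames(BaseETL):
    def extract(self):
        return [{"name": "alice"}, {"name": "bob"}]

    def transform(self, records):
        return [{**r, "name": r["name"].upper()} for r in records]

pipeline = UppercaseNames()
print(pipeline.run())  # [{'name': 'ALICE'}, {'name': 'BOB'}]
```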
## Outputs

- `analysis/ai_responses/` – `(unknown)_analysis_{timestamp}.json` and `(unknown)_prompts_{timestamp}.json` contain profiling results and follow-up prompt templates.
- `processed_data/` – Stores cleaned datasets, such as `_2023_2025_10_31_dv_fixed_transformed.xlsx`, ready for reporting or GIS use.
- `logs/analysis.log` – Captures ETL and analyzer execution details for troubleshooting.
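The profiling reports center on per-column quality metrics. Computing a null rate per column, for instance, needs only the standard library; the real analyzer's output schema may differ, and the inline CSV below is synthetic:

```python
import csv
import io

# Inline CSV standing in for one exported sheet; empty cells count as nulls.
sample = io.StringIO("case_number,victim_age\n23-001,34\n,27\n23-003,\n")

rows = list(csv.DictReader(sample))
# Fraction of rows where each column is empty.
null_rates = {
    col: sum(1 for r in rows if not r[col]) / len(rows)
    for col in rows[0]
}
print(null_rates)  # each column is missing in 1 of 3 rows
```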
## Version Control

Manual workflow:

```bash
git add .
git commit -m "Describe your changes"
git push origin main
```

Automation (optional):

```bash
python etl_scripts/git_automation.py --status
python etl_scripts/git_automation.py --commit-push "Your commit message"
```

The remote repository is hosted at https://github.com/racmac57/dv_doj.git. Make sure your Git credentials are configured (GitHub token or HTTPS credentials) before pushing changes.
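`git_automation.py`'s internals aren't shown here; wrappers like it usually shell out via `subprocess`. A minimal sketch with a dry-run switch (the `dry_run` flag and function name are illustrative, not the script's actual options):

```python
import subprocess

def commit_and_push(message: str, branch: str = "main", dry_run: bool = False):
    """Stage, commit, and push; with dry_run=True, just return the commands."""
    commands = [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
    ]
    if dry_run:
        return commands
    for cmd in commands:
        # check=True raises CalledProcessError if any git step fails.
        subprocess.run(cmd, check=True)
    return commands

print(commit_and_push("Update ETL docs", dry_run=True))
```

Passing argument lists (rather than a shell string) avoids quoting issues in commit messages.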
## Documentation

- `PROJECT_SUMMARY.md` – high-level overview of objectives and accomplishments
- `docs/archive/START_HERE.md` – orientation guide and project history
- `docs/archive/ANALYSIS_SUMMARY.md` – synthesized findings from prior data reviews
- `docs/archive/QUICKSTART.md` – concise walkthrough for new users
- `docs/archive/SETUP_GIT.md` – step-by-step instructions for configuring GitHub access
- `docs/archive/TRANSFORMATION_SUMMARY.md` – detailed transformation notes
- `docs/data_dictionary.md` – canonical field definitions, types, and examples
- `docs/pii_policy.md` – expectations for handling sensitive information
- `docs/mappings/` – CSV lookup tables referenced by the ETL pipeline

The archived documents capture historical context, prior analysis, and setup guides.
## Troubleshooting

- Review `logs/analysis.log` for execution issues.
- Check AI reports in `analysis/ai_responses/` for data anomalies.
- Consult the documentation listed above for deeper dives.
## Next Steps

- Expand pytest coverage (target ≥80%) across `etl_scripts/`.
- Introduce secret scanning/pre-commit hooks and maintain a published `.env.example`.
- Profile large Excel workloads (≥20 MB) and document and, where possible, optimize long-running steps.
- Broaden GitHub Actions coverage (e.g., add an `ubuntu-latest` matrix) and cache dependencies.
- Automate semantic versioning or release tagging once the toolchain stabilizes.
## Known Risks

- Potential PII exposure if raw data leaks: enforce secret scanning and redact outputs.
- Performance limits on large files: profile with `cProfile`, and consider batching or chunked reads.
- CI is currently Windows-only: add cross-platform runners to catch portability issues.
- Release management: no automated tagging/versioning yet; plan for semantic releases.
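For the large-file risk above, pandas' `chunksize` option keeps memory flat by streaming a CSV in fixed-size batches instead of loading one big frame. A sketch with synthetic data:

```python
import io
import pandas as pd

# Synthetic 10-row CSV standing in for a multi-GB export.
buf = io.StringIO("id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

total = 0
n_rows = 0
# chunksize=4 yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(buf, chunksize=4):
    total += chunk["value"].sum()
    n_rows += len(chunk)

print(n_rows, total)  # 10 rows processed; values 0,2,...,18 sum to 90
```

The same aggregation logic runs per chunk, so peak memory depends on the chunk size rather than the file size.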
Happy analyzing!