This works in two parts:
- A Ruby program that converts the PDFs to CSV files using Tabula PDF.
- A Node program that converts the CSVs to JSON files.

The parsers expect the data to reside in two directories:
- raw_data/pdf for the PDF files (these can be in any number of subdirectories)
- raw_data/csv for the CSV files (these can be in any number of subdirectories)
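
The directory layout above can be illustrated with a minimal Node sketch of the second stage. It walks a CSV root at any depth and writes one JSON file per CSV, mirroring the subdirectory structure. Note the assumptions: the function names (`findCsvFiles`, `parseCsv`, `convertTree`), the output location, and the array-of-rows JSON shape are all hypothetical, and the hand-rolled CSV parser is a stand-in, not necessarily what the actual Node program does.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect every .csv file under dir, at any depth.
function findCsvFiles(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findCsvFiles(full));
    } else if (entry.isFile() && entry.name.toLowerCase().endsWith('.csv')) {
      found.push(full);
    }
  }
  return found.sort();
}

// Minimal CSV parser (an assumption, not the project's parser):
// handles quoted fields, embedded commas, and "" escapes.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') { inQuotes = false; }
      else { field += ch; }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field); field = '';
    } else if (ch === '\n') {
      row.push(field); rows.push(row); row = []; field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }
  // Flush the last row when the file does not end with a newline.
  if (field !== '' || row.length > 0) { row.push(field); rows.push(row); }
  return rows;
}

// Convert every CSV under csvRoot to a JSON file under jsonRoot,
// keeping the same relative subdirectory path. Returns the paths written.
function convertTree(csvRoot, jsonRoot) {
  const written = [];
  for (const csvPath of findCsvFiles(csvRoot)) {
    const relative = path.relative(csvRoot, csvPath);
    const jsonPath = path.join(jsonRoot, relative.replace(/\.csv$/i, '.json'));
    const rows = parseCsv(fs.readFileSync(csvPath, 'utf8'));
    fs.mkdirSync(path.dirname(jsonPath), { recursive: true });
    fs.writeFileSync(jsonPath, JSON.stringify(rows, null, 2));
    written.push(jsonPath);
  }
  return written;
}
```

Under these assumptions, calling `convertTree('raw_data/csv', 'raw_data/json')` would process the CSV tree described above; raw_data/json is a made-up output directory used only for illustration.
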

The raw PDFs are from ajschumacher's repository at https://github.com/ajschumacher/nypd