Author: Mark Bauer
The Metropolitan Transportation Authority (MTA) recently released the 2023 and 2024 Subway Origin-Destination Ridership Estimates datasets on the New York State Open Data Portal. The 2023 dataset alone comprises approximately 116 million rows. While this dataset is fascinating, its massive size can pose challenges for users, even for experienced analysts and researchers. This prompted me to investigate the sizes of the largest datasets available on the NYC Open Data Portal and how users interact with these datasets compared to others.
This project aims to:
- Investigate Dataset Sizes: Examine the largest datasets available on the NYC Open Data Portal.
- Analyze Usage Patterns: Compare row counts, download counts, and view counts to understand how these large datasets are used.
- Explore Broader Implications: Address broader questions, such as: Can open data include Big Data? If so, what characteristics define an optimal open big data platform? How can open big data benefit a broad range of producers and users? Which agencies produce the largest datasets, and what methods do they use to support users?
Metric | Value |
---|---|
Number of datasets | 2,491 |
Number of agencies | 201 |
Number of rows | 5,965,739,051 |
Number of views | 27,761,756 |
Number of downloads | 11,529,518 |
Table xx: Summary statistics of NYC Open Data. Note: This analysis includes only datasets whose asset type is dataset and whose display type is table, and that contain at least one row.
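The inclusion rule in the note above can be sketched as a small filter. This is only an illustration, not the code from the notebooks: the keys `asset_type` and `display_type` are hypothetical names for the two metadata fields, and the sample records other than `rmhc-afj9` are made up.

```python
def keep_for_analysis(record):
    """Return True if a catalog record meets the three inclusion criteria."""
    return (
        record.get("asset_type") == "dataset"      # hypothetical key name
        and record.get("display_type") == "table"  # hypothetical key name
        and int(record.get("count_rows") or 0) >= 1
    )


# Made-up catalog records for illustration; only rmhc-afj9 is a real id.
catalog = [
    {"id": "rmhc-afj9", "asset_type": "dataset", "display_type": "table", "count_rows": 376404531},
    {"id": "aaaa-0001", "asset_type": "map", "display_type": "map", "count_rows": 10},
    {"id": "aaaa-0002", "asset_type": "dataset", "display_type": "table", "count_rows": 0},
]

kept_ids = [r["id"] for r in catalog if keep_for_analysis(r)]
print(kept_ids)
```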
Figure xx: Top 10 Agencies by Total Number of Rows on NYC Open Data.
Figure xx: Top 10 Agencies by Median Number of Rows on NYC Open Data.
Figure xx: Boxplots of Download Counts by Number of Rows on NYC Open Data.
Figure xx: Total and Average Download Count with IQR Error Bars on NYC Open Data.
id | name | attribution | count_rows | viewCount | downloadCount |
---|---|---|---|---|---|
rmhc-afj9 | DSNY - PlowNYC Data | Department of Sanitation (DSNY) | 376,404,531 | 1,854 | 504 |
Table xx: Dataset with the Largest Number of Rows on NYC Open Data.
Figure xx: Top 10 Datasets by Number of Rows on NYC Open Data.
Figure xx: Top 10 Datasets by Number of Rows on NYC Open Data (Excluding Taxi Data).
Figure xx: Top 10 Datasets by Number of Rows on NYC Open Data (Only Taxi Data).
There are two main methods for exporting a dataset on NYC Open Data: 1) download files locally and 2) use the Socrata Open Data API (SODA API):
- Download Files Locally: You can download tabular data in various formats, including JSON, CSV, RDF, RSS, TSV, and XML. For geospatial data, additional formats such as KML, KMZ, Shapefile, and GeoJSON are also available.
- Socrata Open Data API (SODA API): The SODA API provides access via unique URLs, known as endpoints, that represent datasets or individual records. The API follows the REST (REpresentational State Transfer) design pattern, using HTTP methods for CRUD (Create, Read, Update, Delete) operations. It supports querying and filtering through the Socrata Query Language (SoQL), a SQL-like language tailored for web data. Note that for performance reasons, SODA APIs are paged and return a maximum of 50,000 records per page. More information is available in the SODA API endpoint documentation.
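To make the paging behavior concrete, here is a minimal sketch that builds one request URL per page using the SoQL `$limit`, `$offset`, and `$order` parameters. It only constructs URLs and does not send any requests. The host and `/resource/<id>.json` path are my assumption of the usual NYC Open Data endpoint pattern, and `page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint pattern for NYC Open Data; check the portal's API docs.
BASE_URL = "https://data.cityofnewyork.us/resource"
PAGE_SIZE = 50_000  # maximum records per page noted above


def page_urls(dataset_id, total_rows, page_size=PAGE_SIZE):
    """Yield one SODA request URL per page needed to cover total_rows."""
    for offset in range(0, total_rows, page_size):
        query = urlencode(
            # Ordering by the internal :id keeps pages stable between requests.
            {"$limit": page_size, "$offset": offset, "$order": ":id"},
            safe="$:",  # leave "$" and ":" unescaped for readability
        )
        yield f"{BASE_URL}/{dataset_id}.json?{query}"


urls = list(page_urls("rmhc-afj9", 120_000))
print(len(urls))
print(urls[0])
```

At 50,000 records per page, the 376,404,531-row DSNY - PlowNYC Data dataset would take 7,529 requests to retrieve in full, which illustrates why the request overhead matters for the largest datasets.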
Limitations: The methods mentioned above have some limitations. For instance, there is no support for columnar file formats such as Parquet, an open-source column-oriented file format optimized for efficient data storage and retrieval. Additionally, the SODA API can experience performance issues due to the overhead associated with HTTP requests and responses, particularly when querying large volumes of data.
As highlighted earlier, many of the largest datasets on NYC Open Data originate from the NYC Taxi and Limousine Commission (TLC). It’s no surprise that these Taxi Trip Datasets are commonly used in big data tutorials, and popular cloud services often provide them for free as sample data. In addition to hosting datasets (typically by year) on NYC Open Data, TLC also publishes the trip records as Parquet files on its website, distributed via Amazon Web Services (AWS), specifically Amazon CloudFront.
These options cater to both types of users: those who prefer accessing data directly from NYC Open Data and those who opt for optimized Parquet files.
- The code to calculate the row count for each dataset is located in the data-export.ipynb notebook.
- Brief data cleaning before the analysis can be found in the data-cleaning.ipynb notebook.
- The code to generate the figures can be found in the analysis.ipynb notebook.
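The notebooks are not reproduced here. As a rough idea of how a row count can be obtained without downloading a dataset, the sketch below builds a SoQL `count(*)` query and parses the kind of JSON response it returns. This is not necessarily how data-export.ipynb does it: the endpoint pattern is assumed, `count_url` and `parse_count` are hypothetical helpers, and the sample response body is hand-written to match the row count reported above.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint pattern for NYC Open Data; check the portal's API docs.
BASE_URL = "https://data.cityofnewyork.us/resource"


def count_url(dataset_id):
    """Build a SODA URL that asks the server to count rows."""
    query = urlencode({"$select": "count(*) AS count_rows"}, safe="$*()")
    return f"{BASE_URL}/{dataset_id}.json?{query}"


def parse_count(body):
    """Extract the integer count from the JSON response text."""
    rows = json.loads(body)
    return int(rows[0]["count_rows"])  # SODA returns numbers as strings


print(count_url("rmhc-afj9"))
# Hand-written sample body, not a live API response.
print(parse_count('[{"count_rows": "376404531"}]'))
```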
Data was retrieved from NYC Open Data.
Socrata: NYC Open Data runs on the Socrata Platform.
- Socrata - Home
- Socrata System Architecture: This blog post from SODA 2.0 was originally released in 2011.
Note: Socrata was acquired by Tyler Technologies in 2018 and is now the Data and Insights division of Tyler.
NYC Open Data Portal
NYC Taxi and Limousine Commission (TLC)
- TLC Data: Aggregate and Raw Data
- TLC Trip Record Data
Feel free to reach out.
- LinkedIn: markebauer
- Portfolio: mebauer.github.io
- GitHub: mebauer