This repository contains code and data supporting our investigation "Dollars to Megabits: You May Be Paying 400 Times As Much As Your Neighbor for Internet" from the series Still Loading.
Our methodology is described in detail in "How We Uncovered Disparities in Internet Deals".
Please read that document to understand the context for the code and data in this repository. The data in this repository, described in more detail below, include the results of our automated collecting of ISP offers, plus records from the U.S. Census Bureau and other sources necessary for the analysis.
The code in this repository, also described in more detail below, demonstrates how we processed and analyzed that data.
This directory is where inputs, intermediaries, and outputs are saved.
Here is an overview of how the directory is organized:
data
├── input
│   ├── redlining
│   ├── addresses
│   │   ├── cities.ndjson
│   │   └── open_addresses_enriched
│   ├── fcc
│   │   └── fbd_us_with_satellite_dec2020_v1.csv.gz
│   ├── census
│   │   ├── shape
│   │   └── acs5
│   └── isp
│       ├── att
│       ├── centurylink
│       ├── earthlink
│       └── verizon
├── intermediary
│   ├── fcc
│   │   └── bg_providers.csv
│   ├── census
│   │   ├── aggregated_tables_plus_features.csv.gz
│   │   └── 2019_acs_5_shapes.geojson.gz
│   └── isp
│       ├── att
│       ├── centurylink
│       ├── earthlink
│       └── verizon
└── output
    ├── speed_price_att.csv.gz
    ├── speed_price_centurylink.csv.gz
    ├── speed_price_earthlink.csv.gz
    ├── speed_price_verizon.csv.gz
    ├── by_city
    ├── figs
    └── tables
Tables and figures featured in our methodology and story can be found in data/output/tables/ and data/output/figs/, respectively.
The data/ directory also features data/input/ and data/intermediary/ files that were collected and processed to create the files in data/output/. These files are not stored in full on GitHub due to space restrictions. See the section below to access this data.
In data/input/, we stored historical redlining maps that were digitized by the University of Richmond's Mapping Inequality project (data/input/redlining/), as well as TIGER shapefiles from the U.S. Census Bureau (data/input/census/shape/).
In data/intermediary/, you will find aggregated data from the American Community Survey (data/intermediary/census/) and the FCC's Form 477 (data/intermediary/fcc/bg_providers.csv).
Below, we highlight three components of the data that we believe others will find most useful: all offers collected, by ISP; all offers collected, by ISP and city; and summary data regarding the disparities observed for each city-ISP combination.
The address-level internet service offers we collected are stored in the data/output/ directory, with one file per internet service provider. (For instance, data/output/speed_price_att.csv.gz contains the offers we collected from AT&T.)
Those files contain the following columns:
| column | description |
|---|---|
| address_full | The complete postal address of a household we searched. |
| incorporated_place | The incorporated city that the address belongs to. |
| major_city | The city that the address is in. |
| state | The state that the address is in. |
| lat | The address's latitude. From OpenAddresses or NYC Open Data. |
| lon | The address's longitude. From OpenAddresses or NYC Open Data. |
| block_group | The Census block group of the address, as of 2019. From the Census Geocoder API based on lat and lon. |
| collection_datetime | The Unix timestamp at which the address was used to query the provider's website. |
| provider | The internet service provider. |
| speed_down | The advertised download speed of the cheapest plan for the address. |
| speed_up | The advertised upload speed of the cheapest plan for the address. |
| speed_unit | The unit of speed. This is always megabits per second (Mbps). |
| price | The cost in USD of the cheapest advertised internet plan for the address. |
| technology | The kind of technology (fiber or non-fiber) used to serve the cheapest internet plan. |
| package | The name of the cheapest internet plan. |
| fastest_speed_down | The advertised download speed of the fastest package. This is usually the same as the cheapest plan if the speed_down is less than 200 Mbps. |
| fastest_speed_price | The advertised price of the fastest internet package for the address. |
| fn | The name of the file of API responses that this record was parsed from. To be used for troubleshooting. API responses are hosted externally in AWS S3. |
| redlining_grade | The redlining grade, merged from Mapping Inequality based on the lat and lon of the address. |
| race_perc_non_white | The percentage of people of color (not non-Hispanic White) in the address's Census block group, expressed as a proportion. Sourced from the 2019 5-year American Community Survey. |
| median_household_income | The median household income in the address's Census block group. Sourced from the 2019 5-year American Community Survey. |
| income_lmi | median_household_income divided by the city median household income (sourced from U.S. Census Bureau). |
| income_dollars_below_median | City median household income minus the median_household_income. |
| ppl_per_sq_mile | People per square mile, used as the measure of population density. Sourced from 2019 TIGER shapefiles from the U.S. Census Bureau. |
| n_providers | The number of other wired competitors in the address's Census block group. Sourced from FCC Form 477. |
| internet_perc_broadband | The percentage of the population that is already subscribed to broadband in the address's Census block group, expressed as a proportion. |
This dataset was created in notebooks/1-process-offers.ipynb.
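As a minimal sketch of how one of these files could be read outside the notebooks, the snippet below uses only the Python standard library. It assumes the file is a gzipped, comma-separated file with a header row and that collection_datetime is expressed in seconds; the function name `load_offers` and the added `collected_at` key are hypothetical and not part of the repository.

```python
import csv
import gzip
from datetime import datetime, timezone


def load_offers(path):
    """Read one gzipped offers file into a list of dicts with typed fields."""
    offers = []
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Numeric columns arrive as strings; empty cells become None.
            for col in ("speed_down", "speed_up", "price"):
                row[col] = float(row[col]) if row.get(col) else None
            # Assumption: the Unix timestamp is in seconds, interpreted as UTC.
            ts = row.get("collection_datetime")
            row["collected_at"] = (
                datetime.fromtimestamp(float(ts), tz=timezone.utc) if ts else None
            )
            offers.append(row)
    return offers
```

For example, `load_offers("data/output/speed_price_att.csv.gz")` would return one dictionary per address searched on AT&T's site, provided the file has been downloaded.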
You can find similar files for individual cities below.
In addition to the ISP-level offer files described above, we have generated similar data files for each ISP-city combination, listed and linked below. For column definitions, see the section above.
Do you want to write a local story based on the data we collected? We wrote a story recipe guide to help you do that.
- Albuquerque, N.M. (CenturyLink, EarthLink)
- Atlanta, Ga. (AT&T, EarthLink)
- Billings, Mont. (CenturyLink, EarthLink)
- Boise, Idaho (CenturyLink, EarthLink)
- Charleston, S.C. (AT&T, EarthLink)
- Charlotte, N.C. (AT&T, EarthLink)
- Cheyenne, Wyo. (CenturyLink, EarthLink)
- Chicago, Ill. (AT&T, EarthLink)
- Columbus, Ohio (AT&T, EarthLink)
- Denver, Colo. (CenturyLink, EarthLink)
- Des Moines, Iowa (CenturyLink, EarthLink)
- Detroit, Mich. (AT&T, EarthLink)
- Fargo, N.D. (CenturyLink, EarthLink)
- Houston, Texas (AT&T, EarthLink)
- Huntsville, Ala. (AT&T, EarthLink)
- Indianapolis, Ind. (AT&T, EarthLink)
- Jackson, Miss. (AT&T)
- Jacksonville, Fla. (AT&T, EarthLink)
- Kansas City, Mo. (AT&T, EarthLink)
- Las Vegas, Nev. (CenturyLink, EarthLink)
- Little Rock, Ark. (AT&T, EarthLink)
- Los Angeles, Calif. (AT&T, EarthLink)
- Louisville, Ky. (AT&T, EarthLink)
- Milwaukee, Wis. (AT&T, EarthLink)
- Minneapolis, Minn. (CenturyLink)
- Nashville, Tenn. (AT&T, EarthLink)
- New Orleans, La. (AT&T, EarthLink)
- Newark, N.J. (Verizon)
- Oklahoma City, Okla. (AT&T, EarthLink)
- Omaha, Neb. (CenturyLink, EarthLink)
- Phoenix, Ariz. (CenturyLink, EarthLink)
- Portland, Ore. (CenturyLink, EarthLink)
- Salt Lake City, Utah (CenturyLink, EarthLink)
- Seattle, Wash. (CenturyLink, EarthLink)
- Sioux Falls, S.D. (CenturyLink, EarthLink)
- Virginia Beach, Va. (Verizon)
- Washington (Verizon)
- Wichita, Kan. (AT&T, EarthLink)
To view an interactive address-level map for the cities in our investigation, you can download the Kepler.gl maps for each provider.
Click any of the links below to view a map for the provider you are interested in.
- Map for AT&T
- Map for CenturyLink
- Map for EarthLink
- Map for Verizon
Once a map is open, you can use the search bar to jump to specific addresses or cities, which is most useful if you already know the area. If you would like an overlay of any socioeconomic factor in our investigation (median household income, the percentage of non-Hispanic White residents in the area, or redlining grades), we can produce one on request.
These maps should be viewed alongside summaries of how speeds vary across each city and between areas.
Please refer to the methodology or this summary file for that information.
The data/output/tables/table1_disparities_by_city.csv file summarizes the disparities we observed for each city-ISP combination, and represents the core of our findings.
It contains the following:
| column | description |
|---|---|
| major_city | The city analyzed. |
| state | The state that the city is in. |
| isp | The internet service provider. |
| uniform_speed | Whether the city had virtually the same speeds offered. We omit these cities from our disparate outcome analysis. |
| income_disparity | Whether we identified a disparity between lower- and upper-income areas. |
| pct_slow_lower_income | Percentage of addresses in lower-income areas that were offered slow speeds (<25 Mbps), expressed as a proportion. |
| pct_slow_upper_income | Percentage of addresses in upper-income areas that were offered slow speeds, expressed as a proportion. |
| income_pct_pt_diff | The percentage point difference in slow speed offers between lower- and upper-income areas. If this was at or greater than 5, income_disparity is True. |
| flag_income | In cases where we did not analyze this city for income-based disparities, the reason why. See our methodology document for more details. |
| race_disparity | Whether we identified a disparity between the most-White and least-White areas. |
| pct_slow_least_white | Percentage of addresses in least-White areas that were offered slow speeds, expressed as a proportion. |
| pct_slow_most_white | Percentage of addresses in most-White areas that were offered slow speeds, expressed as a proportion. |
| race_pct_pt_diff | The percentage point difference in slow speed offers between the most-White and least-White areas. If this was at or greater than 5, race_disparity is True. |
| flag_race | In cases where we did not analyze this city based on racial or ethnic groups, the reason why. See our methodology document for more details. |
| redlining_disparity | Whether we identified a disparity between HOLC-rated A/B vs. D areas. |
| pct_slow_d_rated | Percentage of addresses in historically D-rated areas that were offered slow speeds, expressed as a proportion. |
| pct_slow_ab_rated | Percentage of addresses in historically A- and B-rated areas that were offered slow speeds, expressed as a proportion. |
| redlining_pct_pt_diff | The percentage point difference in slow speed offers between historically D-rated and A/B-rated neighborhoods. If this was at or greater than 5, redlining_disparity is True. |
| flag_redlining | In cases where we did not analyze this city with redlining grades, the reason why. See our methodology document for more details. |
This file was generated in notebooks/3-statistical-tests-and-regression.ipynb.
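The arithmetic behind the disparity columns can be sketched as follows. This is an illustration, not the notebook's code: it assumes "slow" means a download speed below 25 Mbps, that the difference is taken as the first group's share minus the second group's, and that a disparity is flagged at 5 percentage points or more. The function names are hypothetical.

```python
def share_slow(speeds_mbps, slow_cutoff=25):
    """Proportion of offers whose download speed is below the cutoff."""
    if not speeds_mbps:
        raise ValueError("need at least one offer")
    return sum(s < slow_cutoff for s in speeds_mbps) / len(speeds_mbps)


def pct_pt_diff(share_a, share_b):
    """Difference between two proportions, in percentage points."""
    # Rounding guards against floating-point noise right at the threshold.
    return round((share_a - share_b) * 100, 2)


def has_disparity(share_a, share_b, threshold=5):
    """True when group A's slow-offer share exceeds group B's by >= threshold points."""
    return pct_pt_diff(share_a, share_b) >= threshold


# Made-up speeds for two groups of addresses, for illustration only.
lower_income = share_slow([10, 18, 768, 1000])   # 2 of 4 offers are slow
upper_income = share_slow([18, 300, 1000, 1000])  # 1 of 4 offers is slow
print(pct_pt_diff(lower_income, upper_income))    # 25.0
print(has_disparity(lower_income, upper_income))  # True
```

Because the pct_slow columns are stored as proportions, they need to be multiplied by 100 before being compared with the 5-point threshold.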
Certain data files were too large to host on GitHub but have been uploaded to Amazon Web Services' Simple Storage Service (Amazon S3):
- s3://markup-public-data/isp/input.tar.xz (~7.7 GB uncompressed) contains open source addresses (data/input/addresses/open_addresses_enriched/ and data/input/isp/), as well as bulk data from government sources: the U.S. Census Bureau (data/input/census/acs5/) and FCC Form 477 (data/input/fcc/fbd_us_with_satellite_dec2020_v1.csv.gz).
- s3://markup-public-data/isp/isp-intermedairy.tar.xz (~5.7 GB uncompressed) contains API responses from each ISP (data/intermediary/isp/) appended to the geographic data we pulled in above.
For your convenience we have included a command-line script to download these files:
data/download_external_data.sh
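If you would rather fetch an archive from Python than run the shell script, something like the following might work. It is a sketch under two assumptions that are not confirmed by this repository: that the bucket permits anonymous reads over HTTPS at the standard virtual-hosted S3 address, and that the archive unpacks relative to the directory you extract into. The helper names are hypothetical.

```python
import tarfile
import urllib.request
from urllib.parse import urlparse


def s3_uri_to_https(uri):
    """Turn s3://bucket/key into the bucket's virtual-hosted HTTPS URL."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


def download_and_extract(uri, archive_path, dest_dir="."):
    """Download a .tar.xz archive and unpack it under dest_dir."""
    # Assumption: the object is publicly readable without AWS credentials.
    urllib.request.urlretrieve(s3_uri_to_https(uri), archive_path)
    with tarfile.open(archive_path, mode="r:xz") as archive:
        archive.extractall(dest_dir)
```

Calling `download_and_extract("s3://markup-public-data/isp/input.tar.xz", "input.tar.xz")` would then download several gigabytes, so check available disk space first.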
Make sure you have Python 3.8+ installed. We used Miniconda to create a Python 3.8 virtual environment.
Install the Python packages with pip:
pip install -r requirements.txt
The notebooks are intended to be run sequentially, but can also be run independently.
To run all notebooks in sequence, you can use the command
nbexec notebooks
Note that when recalculate = False in each notebook, files that exist are not regenerated.
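The effect of the recalculate flag can be pictured with a small sketch like the one below. It is not the notebooks' actual code; `build_if_needed` and its `build` callback are hypothetical names used to show the skip-if-exists pattern.

```python
import os


def build_if_needed(path, build, recalculate=False):
    """Call build(path) unless the file already exists and recalculate is False."""
    if os.path.exists(path) and not recalculate:
        return False  # the existing output is reused as-is
    build(path)
    return True
```

Setting recalculate to True forces every output to be rebuilt, which is slower but useful after changing the inputs or the processing code.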
The Python/Jupyter notebooks in this repository’s notebooks/ directory demonstrate the steps we took to process and analyze the data we collected. If you want a quick overview of the main methodology, you can skip directly to 3-statistical-tests-and-regression.ipynb.
This notebook collects data from the U.S. Census Bureau's American Community Survey. If you want to re-fetch this data, you'll need to register for an API key and assign it as the environment variable CENSUS_API_KEY. Otherwise, this is not necessary, as all outputs we used in this analysis are already saved in this repository.
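A hedged sketch of how the key might be picked up and used is shown below. The endpoint is the Census Bureau's public 2019 ACS 5-year API; the function name, the example variable (B19013_001E, median household income), and the state-level geography are illustrative assumptions and may differ from what the notebook actually requests.

```python
import os
from urllib.parse import urlencode

ACS5_2019_ENDPOINT = "https://api.census.gov/data/2019/acs/acs5"


def acs_request_url(variables, geography="state:*"):
    """Build a request URL for the 2019 ACS 5-year API using CENSUS_API_KEY."""
    key = os.environ.get("CENSUS_API_KEY")
    if not key:
        raise RuntimeError("Set CENSUS_API_KEY before re-fetching ACS data")
    # The API takes a comma-separated variable list and a "for" geography clause.
    params = {"get": ",".join(variables), "for": geography, "key": key}
    return f"{ACS5_2019_ENDPOINT}?{urlencode(params)}"
```

Reading the key from the environment keeps it out of the notebook and out of version control.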
This notebook parses and preprocesses the JSON responses for offers collected from each ISP's service lookup tools. The functions that parse each API response can be found in notebooks/parsers.py.
See examples of the API response JSON in data/intermediary/isp/, or download all the data.
2a-att-reports.ipynb / 2b-verizon-reports.ipynb / 2c-centurylink-reports.ipynb / 2d-earthlink-reports.ipynb
An overview of offers by each ISP. This contains breakdowns for each city served by the ISP by income level, race/ethnicity, and historical redlining grades.
The code to produce the charts in these notebooks can be found in notebooks/aggregators.py.
This notebook contains the bulk of our analyses. In it, we test for disparities in slow speed offers by income-level, race/ethnicity, and historical redlining grades.
This is also where we use logistic regression to adjust for business factors to see if accounting for them would eliminate the disparities we observed.
This notebook examines Verizon's price changes, addressed in the Limitations section of the methodology document.
This notebook uses scikit-learn's BallTree algorithm to find the closest address with blazing fast speeds (≥200 Mbps) for any address in a city. We used the results to create the topper graphic in the main story.
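To show what that nearest-neighbor query computes, here is a brute-force version written with the standard library only. It deliberately swaps BallTree for a plain linear scan over great-circle (haversine) distances, so it is slower but returns the same kind of answer; scikit-learn's BallTree with the haversine metric serves the same query far more efficiently on large address sets. The function names and the dictionary layout are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_fast_offer(address, offers, fast_mbps=200):
    """Return (offer, distance_km) for the closest offer at or above fast_mbps."""
    fast = [o for o in offers if o["speed_down"] >= fast_mbps]
    if not fast:
        return None, None

    def distance(offer):
        return haversine_km(address["lat"], address["lon"], offer["lat"], offer["lon"])

    best = min(fast, key=distance)
    return best, distance(best)
```

Each dictionary is expected to carry lat, lon, and speed_down keys, matching the column names in the offer files.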