Match addresses in the Energy Performance Certificates (EPC) dataset and HM Land Registry Price Paid Data (PPD) with a simple but effective rule-based approach.
You need Python 3.9 and `pipenv`. If you don't have them, see our instructions for macOS.
- Clone this repo.
- `pipenv sync --dev` to install dependencies.
- If you're a CNZ colleague and downloading the data from our data warehouse, `cp .env.template .env` and fill in the blanks.
- `pipenv run pytest` to run tests.
Alternatively, if you are familiar with Docker, you can use the included Dockerfile to build an image and match addresses in a container.
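As a rough sketch, building and running might look like this (the image tag, mount path, and entrypoint here are illustrative assumptions, not taken from the repo):

```sh
# Illustrative only: image tag, mount paths and entrypoint are assumptions.
docker build -t address-matching .

# Mount your data directory into the container and run the matcher.
docker run --rm -v "$PWD/data:/data" address-matching \
  python -m matching /data/epc_addresses/ /data/ppd_addresses/ /data/matches/
```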
You need the EPC and PPD addresses in Parquet format. At CNZ we manage EPC and PPD using dbt in BigQuery, and you can use that code to clean up EPC and PPD and transform it into the schema required for matching. Alternatively, you can produce Parquet files with the following schemas however you like (all columns are strings); one possible approach is sketched after the schema lists.
PPD addresses:

- `ppd_id`
- `primary_addressable_object_name`
- `secondary_addressable_object_name`
- `street`
- `town_city`
- `postcode`

EPC addresses:

- `epc_id`
- `address_line_1`
- `address_line_2`
- `address_line_3`
- `post_town`
- `postcode`
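One way to produce a PPD file with this schema is sketched below using pandas (the input CSV path and its column names are assumptions about your source data; `to_parquet` needs pyarrow or fastparquet installed). An EPC file can be produced the same way with its own columns.

```python
import pandas as pd

# Illustrative only: the source file and its column names are assumptions;
# map whatever raw data you have onto the schema required for matching.
raw = pd.read_csv("raw_ppd.csv", dtype=str)

ppd = pd.DataFrame(
    {
        "ppd_id": raw["transaction_id"],
        "primary_addressable_object_name": raw["paon"],
        "secondary_addressable_object_name": raw["saon"],
        "street": raw["street"],
        "town_city": raw["town_city"],
        "postcode": raw["postcode"],
    }
).astype("string")  # all columns must be strings

# The output directory must already exist.
ppd.to_parquet("ppd_addresses/ppd.parquet", index=False)
```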
`ppd_id` and `epc_id` must uniquely identify each address record across both datasets. You can use whatever identifier you want; we use the MD5 hash of each record in JSON format so that we can join the matches back to our data warehouse. In BigQuery, this is `md5(to_json_string(addresses))`.
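If you generate identifiers outside BigQuery, the Python equivalent might look like the sketch below. This is an illustration only: for the hashes to join against BigQuery output, the JSON serialisation (key order, whitespace, escaping) would need to match `to_json_string` byte for byte, which `json.dumps` does not guarantee.

```python
import hashlib
import json

def record_id(record: dict) -> str:
    """MD5 hash of an address record serialised as JSON.

    Illustrative only: the serialisation must match whatever you join
    against (e.g. BigQuery's to_json_string) for the IDs to line up.
    """
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

print(record_id({"street": "High Street", "postcode": "AB1 2CD"}))
```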
To match addresses, run:

```sh
python -m matching epc_addresses/ ppd_addresses/ matches/
```
On multicore machines you can get much faster performance by matching addresses in parallel. You need to partition the data using `python -m matching.partition` and have GNU Parallel (Tange, 2021) installed; a sketch of the workflow follows.
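A rough sketch of that workflow, assuming the partitioner writes matching numbered subdirectories for each dataset (the arguments to `matching.partition` and the directory layout here are assumptions, so check the module's actual interface):

```sh
# Illustrative only: partition arguments and layout are assumptions.
python -m matching.partition epc_addresses/ ppd_addresses/ partitions/

# One matching process per partition; GNU Parallel uses all cores by default.
ls partitions/ | parallel \
  python -m matching partitions/{}/epc_addresses/ partitions/{}/ppd_addresses/ matches/{}/
```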
We use `./scripts/container_run.sh` to extract and partition data, match records, and finally load the matches back into BigQuery. A GitHub Action runs this in a container on GCP.
Tange, O. (2021, November 22). GNU Parallel 20211122 ('Peng Shuai'). Zenodo. https://doi.org/10.5281/zenodo.5719513