Skip to content

Latest commit

 

History

History
69 lines (49 loc) · 2.61 KB

README.md

File metadata and controls

69 lines (49 loc) · 2.61 KB

epc-ppd-address-matching

Match addresses in the Energy Performance Certificates (EPC) dataset and HM Land Registry Price Paid Data (PPD). A simple, but effective, rule-based approach.

Installation

You need Python 3.9 and pipenv. If you don't have them, see our instructions for macOS.

  1. Clone this repo.
  2. pipenv sync --dev to install dependencies.
  3. If you're a CNZ colleague and downloading the data from our data warehouse, cp .env.template .env and fill in the blanks.
  4. pipenv run pytest to run tests.

Alternatively, if you are familiar with Docker you can use the included Dockerfile to build an image and match addresses in a container.

Data Requirements

You need the EPC and PPD addresses in Parquet format. At CNZ we manage EPC and PPD using dbt in BigQuery. You can use that code to clean up EPC and PPD and transform it into the schema required for matching. Alternatively you can produce Parquet files with the following schemas however you want (all columns are strings).

PPD schema

ppd_id
primary_addressable_object_name
secondary_addressable_object_name
street
town_city
postcode

EPC schema

epc_id
address_line_1
address_line_2
address_line_3
post_town
postcode

Record IDs

ppd_id and epc_id must uniquely identify each address record across both datasets. You can use whatever identifier you want. We use the MD5 hash of each record in JSON format so that we can join the matches back to our data warehouse. In BigQuery, this is md5(to_json_string(addresses)).

Matching Addresses

python -m matching epc_addresses/ ppd_addresses/ matches/

On multicore machines you can get much faster performance by matching addresses in parallel. You need to partition the data using python -m matching.partition and have GNU Parallel installed.

We use ./scripts/container_run.sh to extract and partition data, match records and finally load the matches back into BigQuery. A Github Action runs this in a container on GCP.

Acknowledgements

Tange, O. (2021, November 22). GNU Parallel 20211122 ('Peng Shuai'). Zenodo. https://doi.org/10.5281/zenodo.5719513