lieu is a Python library for deduping places/POIs, addresses, and streets around the world using libpostal's international street address normalization.
pip install lieu
Note: libpostal and its Python binding (pypostal) are required to use this library; see the libpostal repo for setup instructions.
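As a quick sanity check that libpostal and the binding are installed correctly, the expansion API should work from a Python shell:

from postal.expand import expand_address

# Prints a list of normalized expansions,
# e.g. ['781 franklin avenue crown heights brooklyn']
print(expand_address('781 Franklin Ave Crown Heights Brooklyn'))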
Inputs are expected to be GeoJSON files. The command-line client works on both standard GeoJSON (wrapped in a FeatureCollection) and line-delimited GeoJSON, but for Spark/EMR the input must be line-delimited GeoJSON so it can be effectively split across machines.
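If a file is wrapped in a FeatureCollection, a few lines of standard-library Python can flatten it to line-delimited GeoJSON (the filenames here are placeholders):

import json

# Flatten a GeoJSON FeatureCollection into line-delimited GeoJSON
with open('features.geojson') as f:
    collection = json.load(f)

with open('features.ldjson', 'w') as out:
    for feature in collection['features']:
        # One feature per line so the input can be split across machines
        out.write(json.dumps(feature) + '\n')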
Lieu supports two primary schemas: Who's On First and OpenStreetMap, which are mapped to libpostal's tagset.
For the purposes of blocking/candidate generation (grouping similar items together to narrow the number of pairwise checks lieu has to do to significantly fewer than N²), we need at least one field that specifies a geographic area over which to compare records, so we don't have to compare every instance of a very common address ("123 Main St") or a very common name ("Ben & Jerry's") with every other instance. As such, at least one of the following fields must be present in all records:
- lat/lon: by default we use a prefix of the geohash of the lat/lon plus its neighbors (to avoid faultlines). See here for the distance each prefix size covers (and multiply those numbers by 3 for neighboring tiles). The default setting is a geohash precision of 6 characters, and since the geohash is only used to block/group candidate pairs together, it's possible for pairs within ~2-3km of each other with the same name/address to be considered duplicates. This should work reasonably well for real-world place data where the locations may have been recorded with varying devices and degrees of precision.
- postcode: postal codes tend to constrain the geography to a few neighborhoods, and can work well if the data set is for a single country, or for multiple countries where the postcodes do not overlap (although even if they do overlap, e.g. postcodes in the US and Italy, the use of street names may still be sufficient to disambiguate). The postcode will be used in place of the lat/lon when the --use-postal-code flag is set.
- city, city_district, suburb, or island: libpostal will use any of the named place tags found in the address components when the --use-city flag is set. Simple normalizations will match names like "Saint Louis" with "St Louis" and "IXe Arrondissement" with "9e Arrondissement", but we do not currently have a database-backed method for matching city name variants like "New York City" vs. "NYC", or containment, e.g. suburb="Crown Heights" vs. city_district="Brooklyn". Note: this method does handle tagging differences, so suburb="Harlem" vs. city="Harlem" will match.
- state_district: if addresses are already known to be within a certain small geographic boundary (for instance in the US, county governments are often the purveyors of address-related data), and address dupes within that boundary are rare/unlikely, the state_district tag may be used as well when the --use-small-containing flag is set.
Note: none of these fields are used in pairwise comparisons, only for blocking/grouping.
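For illustration, the default blocking key for a record can be thought of as its geohash prefix plus the prefixes of the neighboring tiles. Here's a minimal sketch of that idea in Python using the python-geohash package (this illustrates the scheme, and is not lieu's internal code):

import geohash  # pip install python-geohash

def block_keys(lat, lon, precision=6):
    # Encode the point at the configured geohash precision (default 6 chars)
    gh = geohash.encode(lat, lon, precision)
    # Also emit the 8 neighboring tiles so near-dupes that fall just
    # across a geohash boundary (faultlines) still share a key
    return [gh] + geohash.neighbors(gh)

# Records become candidate pairs only if their key sets overlap
print(block_keys(37.785415, -122.406645))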
For name deduping, each record must contain:
- name: the venue/company/person's name
Note: when the --name-only flag is set, only name and a geo qualifier (see above) are required. This option is useful e.g. for deduping check-in or simple POI data sets of names and lat/lons, though this use case has not been as thoroughly tested and may require some parameter tuning.
By default, we assume every record has an address, which is composed of these fields:
- street: street names are used in addresses in most countries. Lieu/libpostal can match a wide variety of variations here including abbreviations like "Main St" vs. "Main Street" in 60+ languages, missing thoroughfare types e.g. simply "Main", missing ordinal types like "W 149th St" vs. "W 149 St", and spacing differences like "Sea Grape Ln" vs. "Seagrape Ln".
- house_number: the house number needs to be parsed into its own field. If the source does not separate the house number from the street, libpostal's parser can be used to extract it. Any subparsing of compound house numbers should be done as a preprocessing step (e.g. 1-3-5 and 3-5 could be the same address in Japan provided that they're both in 1-chome).
Lieu will also handle cases where neither entry has a house number (e.g. England) or where neither entry has a street (e.g. Japan).
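If a source stores the house number and street in a single field, libpostal's parser can split them as a preprocessing step. A minimal sketch using pypostal (note that libpostal labels street names as 'road'):

from postal.parser import parse_address

# parse_address returns (value, label) pairs in the order they appear
components = {label: value for value, label in
              parse_address('870 Market St San Francisco CA 94102')}
print(components.get('house_number'))  # '870'
print(components.get('road'))          # 'market st' (parser output is lowercased)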
Optionally, lieu may also compare secondary units when the --with-unit flag is set. In that case, the following fields may be compared as well:
- unit: normalized unit numbers. Lieu can handle many variations in apartment or floor numbers like "Apt 1C" vs. "#1C" vs. "Apt No. 1 C"
- floor: normalized floor numbers. Again, here lieu can handle many variations like "Fl 1" vs. "1st Floor" vs. "1/F".
Lieu will also use the following information to increase the accuracy/quality of the dupes:
- phone: this uses the Python port of Google's libphonenumber to parse phone numbers in various countries, flagging dupes for review if they have different phone numbers, and upgrading needs_review entries to likely dupes if the phone numbers match.
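For example, with the phonenumbers package (the Python port of libphonenumber), two differently-formatted numbers can be normalized to E.164 for comparison; a minimal sketch with made-up numbers:

import phonenumbers

def same_phone(a, b, region='US'):
    # Compare the canonical E.164 representations of two raw phone strings
    to_e164 = lambda raw: phonenumbers.format_number(
        phonenumbers.parse(raw, region), phonenumbers.PhoneNumberFormat.E164)
    return to_e164(a) == to_e164(b)

print(same_phone('(415) 555-0100', '+1 415-555-0100'))  # True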
The dedupe_geojson command-line tool will be installed in the environment's bin dir and can be used like so:
dedupe_geojson file1.geojson [files ...] -o /some/output/dir
               [--address-only] [--geocode] [--name-only]
               [--address-only-candidates] [--dupes-only] [--no-latlon]
               [--use-city] [--use-small-containing]
               [--use-postal-code] [--no-phone-numbers]
               [--no-fuzzy-street-names] [--with-unit]
               [--features-db-name FEATURES_DB_NAME]
               [--index-type {tfidf,info_gain}]
               [--info-gain-index INFO_GAIN_INDEX]
               [--tfidf-index TFIDF_INDEX]
               [--temp-filename TEMP_FILENAME]
               [--output-filename OUTPUT_FILENAME]
               [--name-dupe-threshold NAME_DUPE_THRESHOLD]
               [--name-review-threshold NAME_REVIEW_THRESHOLD]
Option descriptions:
- --address-only: address duplicates only (ignore names).
- --geocode: only compare entries without a lat/lon to canonicals with lat/lons.
- --name-only: name duplicates only (ignore addresses).
- --address-only-candidates: use the address-only hash keys for candidate generation.
- --dupes-only: only output the dupes.
- --no-latlon: do not use lat/lon and geohashing (if one data set has no lat/lon, for instance).
- --use-city: use the city name as a geo qualifier (for local data sets where the city is relatively unambiguous).
- --use-small-containing: use small containing boundaries like county as a geo qualifier (for local data sets).
- --use-postal-code: use the postcode as a geo qualifier (for single-country data sets or cases where the postcode is unambiguous).
- --no-phone-numbers: turn off comparison of normalized phone numbers as a postprocessing step (when available), which revises dupe classifications for phone number matches or definite mismatches.
- --no-fuzzy-street-names: do not use fuzzy street name comparison for minor misspellings, etc.; only use libpostal expansion equality.
- --with-unit: include secondary unit/floor comparisons in deduplication (only if both addresses have a unit).
- --features-db-name: path to the database used to store features for lookup (default='features_db').
- --index-type: choice of {info_gain, tfidf} (default='info_gain').
- --info-gain-index: information gain index filename (default='info_gain.index').
- --tfidf-index: TF-IDF index filename (default='tfidf.index').
- --temp-filename: temporary file for near-dupe hashes (default='near_dupes').
- --output-filename: output filename (default='deduped.geojson').
- --name-dupe-threshold: likely-dupe threshold between 0 and 1 for name deduping with Soft-TFIDF/Soft-Information-Gain (default=0.9).
- --name-review-threshold: human review threshold between 0 and 1 for name deduping with Soft-TFIDF/Soft-Information-Gain (default=0.7).
It's also possible to dedupe larger/global data sets using Apache Spark and AWS ElasticMapReduce (EMR). Using Spark/EMR should look and feel pretty similar to the command-line script (thanks in large part to the mrjob project by David Marin at Yelp). However, instead of running on your local machine, mrjob spins up a cluster, runs the Spark job, writes the results to S3, shuts down the cluster, and optionally downloads/prints all the results to stdout. There's no need to worry about provisioning the machines or maintaining a standing cluster, and it requires only minimal configuration.
To get started, you'll need to create an Amazon Web Services account and an IAM role that has the permissions required for ElasticMapReduce. Once that's set up, we need to configure the job to use your account:
cd scripts/jobs
cp mrjob.conf.example mrjob.conf
Open up mrjob.conf in your favorite text editor. The config is a YAML file, and under runners.emr there are comments describing the few required fields (e.g. access key and secret, instance types, number of instances, etc.) and some optional ones (AWS region, spot instance bid price, etc.)
The example config includes a sample of the configuration used for deduping the global SimpleGeo data set (with the number of instances scaled back). The full run used 18 r3.2xlarge machines (num_core_instances=18), an r3.xlarge for the master instance, and the following values for the jobconf section of the config:
| jobconf option | value |
|---|---|
| spark.driver.memory | 16g |
| spark.driver.cores | 3 |
| spark.executor.instances | 36 |
| spark.executor.cores | 4 |
| spark.executor.memory | 30g |
| spark.network.timeout | 900s |
These values should be adjusted depending on the number and type of core instances.
Data should be on S3 as line-delimited GeoJSON files (i.e. not part of a FeatureCollection, just one GeoJSON feature per line) in a bucket that your IAM user can access.
Once the config values are set and the data are on S3, usage is simple:
python dedupe_geojson.py -r emr s3://YOURBUCKET/some/file [more S3 files ...] --output-dir=s3://YOURBUCKET/path/to/output/ --no-output --conf-path=mrjob.conf [--name-dupe-threshold=0.9] [--name-review-threshold=0.7] [--address-only] [--dupes-only] [--with-unit] [--no-latlon] [--use-city] [--use-postal-code] [--no-geo-model]
Note: if you want the output streamed back to stdout on the machine running the job (e.g. your local machine), remove the --no-output option.
The output is a per-line JSON response which wraps the original GeoJSON object and references any duplicates. Note that here the original WoF GeoJSON properties have been simplified for readability, indentation has been added, and the addresses from SimpleGeo were parsed with libpostal as a preprocessing step to get the addr:housenumber and addr:street fields (which are not part of the original data set). Here's an example of a duplicate:
{
  "is_dupe": true,
  "object": {
    "geometry": {
      "coordinates": [
        -122.406645,
        37.785415
      ],
      "type": "Point"
    },
    "properties": {
      "addr:full": "870 Market St San Francisco CA 94102",
      "addr:housenumber": "870",
      "addr:postcode": "94102",
      "addr:street": "Market St",
      "lieu:guid": "1968d59a119e442fa9c66dc9012be89d",
      "name": "Consulate General Of Honduras"
    },
    "type": "Feature"
  },
  "possibly_same_as": [
    {
      "classification": "needs_review",
      "explain": {
        "name_dupe_threshold": 0.9,
        "name_review_threshold": 0.7,
        "type": "venue",
        "with_unit": false
      },
      "is_canonical": true,
      "object": {
        "geometry": {
          "coordinates": [
            -122.406645,
            37.785415
          ],
          "type": "Point"
        },
        "properties": {
          "addr:full": "870 Market St San Francisco CA 94102",
          "addr:housenumber": "870",
          "addr:postcode": "94102",
          "addr:street": "Market St",
          "lieu:guid": "d804e17f538b4307a2237dbd7992699c",
          "wof:name": "Honduras Consulates"
        },
        "type": "Feature"
      },
      "similarity": 0.8511739191000001
    }
  ],
  "same_as": [
    {
      "classification": "likely_dupe",
      "explain": {
        "name_dupe_threshold": 0.9,
        "name_review_threshold": 0.7000000000000001,
        "type": "venue",
        "with_unit": false
      },
      "is_canonical": true,
      "object": {
        "geometry": {
          "coordinates": [
            -122.406645,
            37.785415
          ],
          "type": "Point"
        },
        "properties": {
          "addr:full": "870 Market St San Francisco CA 94102",
          "addr:housenumber": "870",
          "addr:postcode": "94102",
          "addr:street": "Market St",
          "lieu:guid": "ec28adce0a134cbfbaacb87e71f4ab34",
          "wof:name": "Honduras Consulate General of"
        },
        "type": "Feature"
      },
      "similarity": 1.0
    }
  ]
}
Note: the property "lieu:guid" is added by the deduping job and should be retained for users who want to keep a canonical index and dedupe files against it regularly. If an incoming record already has a lieu:guid property, it has a higher priority for being considered canonical than an incoming record without said property. This way it's possible to ingest different data sets using a "cleanest-first" policy, so that the more trusted names (i.e. from a human-edited data set like OpenStreetMap) are ingested first and preferred over less-clean data sets where perhaps only the ID needs to be added to the combined record.
In Spark, the output will be split across some number of part-* files on S3 in the directory specified. They can be downloaded and concatenated as needed.
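Since each line of the (concatenated) output is a self-contained JSON response, postprocessing is straightforward. A minimal sketch that separates automatically-mergeable dupes from entries needing human review, based on the response fields shown above (the filename is a placeholder):

import json

with open('deduped.geojson') as f:
    for line in f:
        response = json.loads(line)
        name = response['object']['properties'].get('name')
        if response['is_dupe']:
            for dupe in response.get('same_as', []):
                print('dupe:', name, '=>', dupe['object']['properties'].get('lieu:guid'))
        elif response.get('possibly_same_as'):
            # possibly_same_as is sorted by similarity, most similar first
            top = response['possibly_same_as'][0]
            print('needs review:', name, 'similarity:', top['similarity'])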
exact_dupe: addresses are not an exact science, so even the term "exact" here means "sharing at least one libpostal expansion in common". As such, "Market Street" and "Market St" would be considered exact matches, as would "Third Avenue" and "3rd Ave", etc. For street name/house number, we require this sort of exact match, but more freedom is allowed in the venue/business name. If both the venue name and the address are exact matches after expansion, they are considered exact dupes.
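The expansion-sharing test is easy to demonstrate with pypostal directly (a minimal sketch; lieu's actual comparison covers the full address, not just the street):

from postal.expand import expand_address

# Two surface forms match "exactly" if their expansion sets intersect
a = set(expand_address('Market Street'))
b = set(expand_address('Market St'))
print(bool(a & b))  # True: both contain 'market street'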
likely_dupe: a likely dupe may have some minor misspellings, may be missing common words like "Inc" or "Restaurant", and may use different word orders (often the case for professionals such as lawyers e.g. "Michelle Obama" might be written "Obama, Michelle").
needs_review: these entries might be duplicates and have high similarity, but don't quite meet the threshold required for classification as a likely dupe which can be automatically merged. If all of an entry's potential dupes are classified as "needs_review", that entry will not be considered a dupe (is_dupe=False in the response), but it may be prudent to flag the entry for a human to look at. The needs_review entries are stored as a separate list in the response (possibly_same_as) and are sorted in reverse order of their similarity to the candidate object, so the most similar entry will be listed first.
Below are some of the likely dupes extracted during a test run using WoF/SimpleGeo and a subset of OSM venues in San Francisco (note that all of these also share a house number and street address expansion, and have the same geohash or are immediate neighbors):