World Countries update #188

Merged on Oct 6, 2020 (44 commits)

Member

@jsanz jsanz commented Sep 21, 2020

fixes #179
fixes #164

This PR updates the World Countries dataset with more detailed GeoJSON and TopoJSON datasets. It also updates the Admin Regions dataset to include country data.

The size of the World Countries dataset is mostly controlled by the interval parameter of the simplify command for Mapshaper. Current intervals produce a 1.8MB GeoJSON and a 2.2MB TopoJSON (uncompressed).

The README.md file in the sources/world folder also explains in detail how to query Wikidata to generate a CSV file with the population and area for each ISO2 code, which is then joined to the geometries with Mapshaper.
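The join itself is done with Mapshaper in this PR. As a rough illustration of what that step produces, here is a minimal Python sketch that merges CSV columns into GeoJSON feature properties by ISO2 code. The `iso2` property name appears elsewhere in this conversation; the `population` and `area` column names, the `join_metrics` helper, and the sample values are assumptions made up for the example.

```python
import csv
import io


def join_metrics(features, csv_text, key="iso2"):
    """Return copies of the features with CSV columns merged into properties.

    Hypothetical helper: it only mimics the effect of a join by key,
    it is not how Mapshaper implements it.
    """
    rows = {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}
    joined = []
    for feature in features:
        props = dict(feature["properties"])
        # Features without a matching CSV row are kept unchanged.
        extra = rows.get(props.get(key), {})
        for name, value in extra.items():
            if name != key:
                props[name] = float(value) if value else None
        joined.append({**feature, "properties": props})
    return joined


# Placeholder data, not real statistics.
sample_features = [
    {"type": "Feature", "properties": {"iso2": "ES"}, "geometry": None},
    {"type": "Feature", "properties": {"iso2": "AQ"}, "geometry": None},
]
sample_csv = "iso2,population,area\nES,100,50\n"
result = join_metrics(sample_features, sample_csv)
```

Features with no row in the CSV (here `AQ`) simply keep their original properties, which mirrors the gaps discussed later in this thread.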

This is the visual result for the layer compared with the current production dataset, including actual download size.

[image: visual comparison of the new layer with the current production dataset]

@kibanamachine

💔 Build Failed

@kibanamachine

💚 Build Succeeded

@jsanz jsanz marked this pull request as ready for review September 22, 2020 15:08
@jsanz
Member Author

jsanz commented Sep 22, 2020

Another option we can check is to replace the Wikidata area and population figures with World Bank data. The exported datasets already include the ISO3 code, so the join should be straightforward. The clear benefits are a well-defined source of information, data available for several years, an open license, and many more indicators.

@nickpeihl let me know if you think we should explore this path and I'll revert to a Draft PR.

@nickpeihl
Member

> An option we can also check is to replace Wikidata area/pop by World Bank data. In the exported datasets the ISO3 code is added so this should be quite straightforward. A clear benefit is a well-defined origin of information, available for several years, also in an Open License, and with many more indicators available.
>
> @nickpeihl let me know if you think we should explore this path and I'll revert to a Draft PR.

I think it's worth exploring sourcing the population data from the World Bank as an authoritative source.

@jsanz jsanz marked this pull request as draft September 22, 2020 15:21
@jsanz jsanz removed the request for review from nickpeihl September 22, 2020 15:21
@jsanz
Member Author

jsanz commented Sep 22, 2020

The Admin Regions dataset needs some further work: some ISO3 codes are missing. This counts the affected ISO2 codes:

jq -r '.features[] | select( .properties.country_iso3_code == "" ) | .properties.country_iso2_code ' data/admin_regions_lvl2_v1.geo.json |\
 sort -u |\
 wc -l
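For readers without jq at hand, the same check can be sketched in Python. The property names `country_iso3_code` and `country_iso2_code` come from the command above; the `countries_missing_iso3` helper and the sample collection are hypothetical. Unlike the jq filter, this sketch also treats a missing or null ISO3 property as empty.

```python
def countries_missing_iso3(collection):
    """Return the sorted, de-duplicated ISO2 codes whose features lack an ISO3 code."""
    return sorted({
        feature["properties"]["country_iso2_code"]
        for feature in collection["features"]
        # Empty string, None, and a missing key all count as "no ISO3 code".
        if not feature["properties"].get("country_iso3_code")
    })


# Placeholder feature collection with made-up region entries.
sample_regions = {
    "type": "FeatureCollection",
    "features": [
        {"properties": {"country_iso2_code": "XK", "country_iso3_code": ""}},
        {"properties": {"country_iso2_code": "ES", "country_iso3_code": "ESP"}},
        {"properties": {"country_iso2_code": "XK", "country_iso3_code": ""}},
        {"properties": {"country_iso2_code": "AA"}},
    ],
}
missing = countries_missing_iso3(sample_regions)
```

Using a set makes the result independent of feature order, which is why the shell version needs a sort before de-duplicating.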

On the other hand, the World Bank offers a convenient API to fetch data per metric, which we can later join using the current Mapshaper workflow.

export METRIC="SP.POP.TOTL" &&\
echo "iso,measure" > ${METRIC}.csv &&\
  curl -s "http://api.worldbank.org/v2/country/all/indicator/${METRIC}?format=json&per_page=300&date=2018" |\
  jq -cr '.[1][] | select(.countryiso3code != "") | select(.value > 0) | .countryiso3code + "," + (.value | tostring)' |\
  sort >> ${METRIC}.csv
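The filtering in the jq expression above can be sketched in Python as follows. It assumes, as the jq path `.[1][]` implies, that the response is a two-element array whose second element is the list of records; the field names `countryiso3code` and `value` come from the command. The helper names and the sample payload are made up for illustration, and no network request is made.

```python
def worldbank_rows(payload):
    """Keep records that have an ISO3 code and a positive value, sorted by code."""
    rows = []
    for record in payload[1]:
        code = record.get("countryiso3code", "")
        value = record.get("value")
        # Aggregates come without an ISO3 code; missing data arrives as null.
        if code and value is not None and value > 0:
            rows.append((code, value))
    return sorted(rows)


def to_csv(rows, header="iso,measure"):
    """Render the rows with the same header the shell snippet writes."""
    lines = [header] + [f"{code},{value}" for code, value in rows]
    return "\n".join(lines) + "\n"


# Placeholder payload with dummy values, shaped like [metadata, records].
sample_payload = [
    {"page": 1},
    [
        {"countryiso3code": "", "value": 999},
        {"countryiso3code": "ESP", "value": 10},
        {"countryiso3code": "SDN", "value": None},
        {"countryiso3code": "AND", "value": 5},
    ],
]
csv_text = to_csv(worldbank_rows(sample_payload))
```

Records with a null value are dropped here, which is what leaves gaps for some countries and motivates the fallback discussed later in the thread.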

Member

@nickpeihl nickpeihl left a comment

Nice work. I did a preliminary review of the data.

The updated world countries dataset is missing the following ISO-2 codes that are in the current production dataset. I suspect many of these countries have no subdivisions in the admin regions dataset.

"AX"
"BQ"
"BV"
"CC"
"CX"
"GF"
"GP"
"MQ"
"RE"
"SJ"
"YT"

You can see this for yourself with:

comm -23 \
  <(git show master:data/world_countries_v1.geo.json | jq '.features[].properties.iso2' | sort | uniq) \
  <(jq '.features[].properties.iso2' < data/world_countries_v1.geo.json | sort | uniq)
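The comparison boils down to a set difference between the ISO2 codes of the two versions of the file. A minimal Python sketch, assuming both versions are already parsed into GeoJSON dictionaries; the `iso2` property is taken from the command, while `iso2_codes`, `dropped_codes`, and the sample collections are hypothetical:

```python
def iso2_codes(collection):
    """Collect the distinct ISO2 codes of a GeoJSON feature collection."""
    return {feature["properties"]["iso2"] for feature in collection["features"]}


def dropped_codes(old, new):
    """Codes present in the old collection but absent from the new one."""
    return sorted(iso2_codes(old) - iso2_codes(new))


# Placeholder collections standing in for the master and branch files.
old_collection = {"features": [
    {"properties": {"iso2": "AX"}},
    {"properties": {"iso2": "ES"}},
    {"properties": {"iso2": "RE"}},
]}
new_collection = {"features": [
    {"properties": {"iso2": "ES"}},
]}
dropped = dropped_codes(old_collection, new_collection)
```

Swapping the arguments gives the codes that were added instead, the equivalent of `comm -13`.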

As an alternative, we could use the Admin 0 Countries layer from Natural Earth. The NE Countries and NE States Provinces layers use the same boundary lines for countries, though there might be some minor differences between the EMS layers depending on the mapshaper simplification settings we use.

@kibanamachine

💔 Build Failed

@kibanamachine

💔 Build Failed

@kibanamachine

💚 Build Succeeded

@jsanz
Member Author

jsanz commented Sep 23, 2020

@nickpeihl thanks for the review!

With the latest changes we have a complete World Countries dataset in which every record has area and population, except Antarctica and the British Indian Ocean Territory. I used the World Bank dataset as the primary source and filled its gaps (for example, both Sudan and South Sudan are empty) with the remaining information from Wikidata. I also added a Makefile to run the full process.
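The gap-filling rule described here (World Bank first, Wikidata as fallback) can be sketched in a few lines of Python. This is only an illustration of the rule, not the Makefile's actual implementation; `fill_gaps`, the dictionary layout keyed by ISO3 code, and the values are assumptions.

```python
def fill_gaps(primary, fallback):
    """Prefer primary values; use the fallback only where primary has none."""
    merged = dict(fallback)
    # Overwrite fallback values with every primary value that is not empty.
    merged.update({code: value for code, value in primary.items() if value is not None})
    return merged


# Placeholder values, not real statistics.
world_bank = {"ESP": 10, "SDN": None, "SSD": None}
wikidata = {"ESP": 11, "SDN": 7, "SSD": 3, "ATA": None}
population = fill_gaps(world_bank, wikidata)
```

Codes that are empty in both sources, such as `ATA` in this sample, stay empty, which matches the two exceptions mentioned above.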

But then 😅 I checked the CIA Factbook and found this project, which parses the website and offers a single JSON file with data for all countries. The ISO 3166 code is available in the Internet section.

cat factbook.json |\
jq -c '.countries | to_entries | .[] | { name: .value.data.name, pop: .value.data.people.population.total, pop_date: .value.data.people.population.date, code: (.value.data.communications.internet.country_code[1:3]) }'
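A Python sketch of the same extraction, assuming the JSON layout implied by the jq paths above (`countries` mapping to entries with a `data` object). The `[1:3]` slice drops the leading dot of the internet country code, so `.es` becomes `es`. The `factbook_rows` helper and the sample document are hypothetical, and missing sections yield empty values rather than an error.

```python
def factbook_rows(factbook):
    """Extract name, population, population date and ISO2-like code per country."""
    rows = []
    for entry in factbook["countries"].values():
        data = entry["data"]
        population = data.get("people", {}).get("population", {})
        internet = data.get("communications", {}).get("internet", {})
        tld = internet.get("country_code") or ""
        rows.append({
            "name": data.get("name"),
            "pop": population.get("total"),
            "pop_date": population.get("date"),
            # ".es" -> "es": skip the leading dot and keep two characters.
            "code": tld[1:3],
        })
    return rows


# Placeholder document with dummy values.
sample_factbook = {"countries": {
    "spain": {"data": {
        "name": "Spain",
        "people": {"population": {"total": 10, "date": "2020"}},
        "communications": {"internet": {"country_code": ".es"}},
    }},
    "nowhere": {"data": {"name": "Nowhere"}},
}}
rows = factbook_rows(sample_factbook)
```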

I like the WB data because we don't rely on scraped HTML to source our data, but let me know if the CIA Factbook is still interesting as a replacement for the current WB or Wikidata usage.

@kibanamachine

💚 Build Succeeded

@nickpeihl nickpeihl self-requested a review September 23, 2020 23:57
Member

@nickpeihl nickpeihl left a comment

1. > I like WB data because we don't rely on scraped HTML to source our data, but let me know if CIA Factbook is still interesting in exchange of WB or Wikidata current usage.

   I agree. WB has a proper authoritative API, which I prefer.

2. I also see that my comments here and here are antithetical. I'm starting to reconsider using the admin regions to create the world countries because we are deleting ISO codes that previously existed.

3. We should make sure metric fields like population and area don't show as join fields in Elastic Maps and region maps in older releases of Kibana.

@jsanz
Member Author

jsanz commented Sep 24, 2020

> I also see that my comments here and here are antithetical. I'm starting to reconsider using the admin regions to create the world countries because we are deleting ISO codes that previously existed.

From your previous #188 (review), these are the ISO codes removed from the PR data. All of them are overseas territories of different countries:

| ISO | Name | Notes |
| --- | --- | --- |
| AX | Åland Islands | Finland |
| BQ | Caribbean Netherlands | Netherlands |
| BV | Bouvet Island | Norway |
| CC | Cocos (Keeling) Islands | Australia |
| CX | Christmas Island | Australia |
| GF | French Guiana | France |
| GP | Guadeloupe | France |
| MQ | Martinique | France |
| RE | Réunion | France |
| SJ | Svalbard and Jan Mayen | Norway |
| YT | Mayotte | France |

@jsanz jsanz marked this pull request as ready for review September 24, 2020 16:32
@kibanamachine

💚 Build Succeeded

@jsanz
Member Author

jsanz commented Sep 24, 2020

@nickpeihl in f4129ef I've moved the dataset to the world folder and updated instructions and so on. I've tested this branch on Kibana and apart from the well-known issue of these fields being offered for JOINs on EMS folders, everything seems to be working fine.

I still want to check for those ISO codes mentioned earlier today to see if they show up in the GeoIP database. I'll do that next week.

jsanz and others added 21 commits October 2, 2020 12:40
Co-authored-by: Nick Peihl <nickpeihl@gmail.com>
@jsanz jsanz changed the base branch from master to feature-layers October 2, 2020 10:44
@jsanz
Member Author

jsanz commented Oct 2, 2020

@nickpeihl sorry for the extra commits, I did a rebase from master to include the Italy Provinces change but I should have merged 😓

I've also updated the feature-layers branch to align with master, and this PR now targets feature-layers, so we can deploy it to the staging environment for further testing after the merge (or maybe better a squash in this case).

@kibanamachine

💚 Build Succeeded

@nickpeihl nickpeihl self-requested a review October 5, 2020 20:31
Member

@nickpeihl nickpeihl left a comment

final changes lgtm! thanks.

@jsanz jsanz merged commit 30e7235 into elastic:feature-layers Oct 6, 2020
@jsanz jsanz deleted the 179-world-countries branch October 6, 2020 08:03
Successfully merging this pull request may close these issues.

Add population data to EMS files
Update World Countries Layer
3 participants