
2. How to import datasets


You can import into the dataset.big table both datasets described by a datapackage and non-prepared datasets:

  • prepared with a standard datapackage.json v1.0:
    • from the Web at GitHub.com: use the key github.com.
    • from the Web at other hosts like GitLab: not implemented yet (it would be easy, but comes later).
    • from localhost: on your machine or on the server of a remote PostgreSQL. Use the key local.
  • non-prepared: any CSV file, but only the local option at this time. Use the key local-csv.

With these basic clues you can understand and edit your conf.json to select what you will import. It is a "configuration + import list" for the generator software, which generates a makefile that you can run as a shell script anywhere (local or server) to import the datasets.

The conf.json file

All configurations for the make-generator, and the list of resources of your datasets, are expressed as simple key-value pairs in this file. Let's start with Example 1, which is the "default" configuration of the distribution:

  "db": "postgresql://postgres:postgres@localhost:5432/trydatasets",
  "github.com":{
    "lexml/lexml-vocabulary":null,
    "datasets/language-codes":null,
    "datasets/country-codes":null,
    "datasets/world-cities":{
      "_corrections_":{"resources":[{"primaryKey": "geonameid"}]}
    },
    "datasets-br/state-codes":"br-state-codes",
    "datasets-br/city-codes":null
  },
  "useBig":true,  "useIDX":false,        "useRename":true,
  "useYUml":true, "useAllNsAsDft":false
}
  • db is the PostgreSQL connection string (a sanity-check query is sketched after this list).

  • github.com is a well-known place for datasets, so with only the GitHub project name the software can get all the files. Contents and some explanations:

    • "datasets/language-codes":null is a Github project at http://github.com/datasets/language-codes
      There are a /datapackage.json file and a /data folder with the CSV files pointed by datapackage.json. There are 4 CVS files, the null say that you need all of them.

    • "datasets-br/state-codes":"br-state-codes", here the string "br-state-codes" reduced "all" to only one CSV. It is at datasets-br/state-codes/data.

    • "datasets/world-cities":{...} is also not null, but now have some information. The typical one is to do correctins, and it is only a replacement for other informations at the project's datapackage.json. The first item at "resources" array.

  • useBig, useIDX, etc. are boolean flags for the generator.
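
A quick way to verify that the db connection string works before running anything is to open it with psql and run a trivial query. A minimal sanity check, assuming the Example 1 string:

-- Sanity-check the "db" connection string, e.g. via:
--   psql postgresql://postgres:postgres@localhost:5432/trydatasets
SELECT current_database() AS db, current_user AS usr, version();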
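
The replacement semantics of _corrections_ can be pictured with PostgreSQL's own jsonb functions. This is a minimal sketch, not the generator's actual code, and the inline datapackage fragment below is invented for illustration:

-- Replace resources[0].primaryKey in a simplified (invented) datapackage
-- fragment, mimicking the conf-correction for datasets/world-cities:
SELECT jsonb_set(
  '{"resources": [{"name": "world-cities", "primaryKey": "name"}]}'::jsonb,
  '{resources,0,primaryKey}',
  '"geonameid"'
) AS corrected_datapackage;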

Using local and local-csv

Another conf.json example:

{
   "db":"postgresql://postgres:postgres@localhost:5432/trydatasets",
   "github.com":{
        "datasets/country-codes":null,
        "datasets-br/state-codes":"br-state-codes",
        "datasets-br/city-codes":null
   },
   "local": {
        "/tmp/test1":null
   },
   "local-csv":{
     "test2017":{
       "separator":";",
       "folder":"/home/user/mytests"
     },
     "otherTests":"/home/user/myOthertests"
   },
   "useBig":true, "useIDX":false, "useRename":true
}
  • "local" lists the local folders containing usual datapackage.json at root, so all other behaviours are the same tham Github's.

  • "local-csv" poits directly a CSV files, with no datapackage descriptor. So, some more information is necessary. Most commom is the CSV-separator. The name is used to define dataset's namespace.

Messages of the make-generator

...

BEGIN of cache-scripts generation

 CONFIGS (github.com): NsAsDft= useIDX=, count=6 items.

 Creating cache-scripts for lexml/lexml-vocabulary of github.com:
	 Building table1 with data/autoridade.csv.
	 Building table2 with data/localidade.csv.
	 Building table3 with data/tipoDocumento.csv.
	 Building table4 with data/evento.csv.
	 Building table5 with data/lingua.csv.
	 Building table6 with data/tipoConteudo.csv.
 Creating cache-scripts for datasets/language-codes of github.com:
	 Building table7 with data/language-codes.csv.
	 Building table8 with data/language-codes-3b2.csv.
	 Building table9 with data/language-codes-full.csv.
	 Building table10 with data/ietf-language-tags.csv.
 Creating cache-scripts for datasets/country-codes of github.com:
	 Building table11 with data/country-codes.csv.
 Creating cache-scripts for datasets/world-cities of github.com:
	 -- Notice: using conf-corrections for datapackage
		... Replacing resources[0][primaryKey] by 'geonameid'
	 Building table12 with data/world-cities.csv.
 Creating cache-scripts for datasets-br/state-codes of github.com:
	 Building table13 with data/br-state-codes.csv.
 Creating cache-scripts for datasets-br/city-codes of github.com:
	 Building table14 with data/br-city-synonyms.csv.
	 Building table15 with data/br-city-codes.csv.
END of cache-scripts generation

Associated importation

The configuration and the first output results are those of Example 1. To check what was imported, compare the conf.json directives with the vmeta_summary view:

select * from  dataset.vmeta_summary;
 id |               urn               |          pkey          |   jtd   | n_cols | n_rows 
----+---------------------------------+------------------------+---------+--------+--------
  1 | (2)lexml:autoridade             | id                     | tab-aoa |      9 |    601
  4 | (2)lexml:evento                 | id                     | tab-aoa |      9 |     14
 ...
 15 | (4)datasets-br:br_city_codes    | state/lexLabel         | tab-aoa |      9 |   5570
 14 | (4)datasets-br:br_city_synonyms | state/lexLabel/synonym | tab-aoa |      5 |     26
 13 | (4)datasets-br:br_state_codes   | id                     | tab-aoa |     15 |     33
(15 rows)
  • Conf's "datasets/country-codes" entry generated the datasets namespace, and the datasets-br namespace comes from "datasets-br/state-codes":"br-state-codes" and "datasets-br/city-codes":null.
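
To inspect a single namespace, the urn column shown above can be filtered directly. A small query sketch, using only the view already introduced:

-- List only the datasets-br rows of the summary
-- (urn format is "(n)namespace:name"):
SELECT id, urn, pkey, n_rows
FROM dataset.vmeta_summary
WHERE urn LIKE '%datasets-br:%'
ORDER BY id;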

...
