## 2. How to import datasets
You can import into the `dataset.big` table datasets described by a datapackage, or non-prepared datasets:

- prepared with a standard `datapackage.json` v1.0:
   - from Web at Github.com: use the key `github.com`.
   - from Web at others like GitLab: later; it is easy but not implemented yet.
   - from localhost: at your machine or at the server of a remote PostgreSQL. Key `local`.
- non-prepared: any CSV file, but only the local option at this time. Key `local-csv`.
With these basic clues you can understand and edit your `conf.json` to select what you will import. It is a "configuration + import list" for the generator software, which generates a makefile that you can run as a shell script anywhere (local or server) to import the datasets.

All configurations for the make-generator, and the list of your datasets' resources, are expressed as simple key-value pairs in this file. Let's start with Example 1, the default configuration of the distribution:
"db": "postgresql://postgres:postgres@localhost:5432/trydatasets",
"github.com":{
"lexml/lexml-vocabulary":null,
"datasets/language-codes":null,
"datasets/country-codes":null,
"datasets/world-cities":{
"_corrections_":{"resources":[{"primaryKey": "geonameid"}]}
},
"datasets-br/state-codes":"br-state-codes",
"datasets-br/city-codes":null
},
"useBig":true, "useIDX":false, "useRename":true,
"useYUml":true, "useAllNsAsDft":false
}
- `db` is the PostgreSQL connection string.
- `github.com` is a well-known place for datasets... So, with only the Github project name the software can get all the files. Contents and some explanations:
   - `"datasets/language-codes": null` is a Github project at http://github.com/datasets/language-codes. There is a `/datapackage.json` file and a `/data` folder with the CSV files pointed to by `datapackage.json`. There are 4 CSV files, and the `null` says that you need all of them.
   - `"datasets-br/state-codes": "br-state-codes"`: here the string "br-state-codes" reduces "all" to only one CSV. It is at `datasets-br/state-codes/data`.
   - `"datasets/world-cities": {...}` is also not null, but now carries some information. The typical use is to make corrections: each one is a replacement for information in the project's `datapackage.json`, here the first item of the "resources" array (see the SQL sketch after this list).
- `useBig`, `useIDX`, etc. are flags.
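The effect of that `_corrections_` entry can be illustrated with PostgreSQL's `jsonb_set()`: the value at path `resources[0].primaryKey` of the project's `datapackage.json` is replaced before the import scripts are built. The fragment below is a hypothetical slice of that descriptor, and the query is an illustration of the merge rule only, not the generator's actual code:

```sql
-- Illustration only: apply the correction resources[0].primaryKey = 'geonameid'
-- to a hypothetical fragment of datapackage.json, using jsonb_set().
SELECT jsonb_set(
  '{"name":"world-cities","resources":[{"path":"data/world-cities.csv"}]}'::jsonb,
  '{resources,0,primaryKey}',  -- path: first item of the "resources" array
  '"geonameid"'                -- new value, as a JSON string
);
```

Everything else in the descriptor is preserved; only `primaryKey` is set (or added), which matches the `Replacing resources[0][primaryKey] by 'geonameid'` notice in the generation log below.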
Another `conf.json` example:
```json
{
  "db": "postgresql://postgres:postgres@localhost:5432/trydatasets",
  "github.com": {
    "datasets/country-codes": null,
    "datasets-br/state-codes": "br-state-codes",
    "datasets-br/city-codes": null
  },
  "local": {
    "/tmp/test1": null
  },
  "local-csv": {
    "test2017": {
      "separator": ";",
      "folder": "/home/user/mytests"
    },
    "otherTests": "/home/user/myOthertests"
  },
  "useBig": true, "useIDX": false, "useRename": true
}
```
- `"local"` lists the local folders containing a usual `datapackage.json` at the root, so all other behaviours are the same as Github's.
- `"local-csv"` points directly to CSV files, with no datapackage descriptor, so some more information is necessary; the most common is the CSV separator. The name is used to define the dataset's namespace (see the `COPY` sketch after this list).
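A `local-csv` import boils down to loading a delimited file into PostgreSQL, which is where the configured separator matters. A minimal sketch of that step, assuming a hypothetical staging table whose layout you would normally derive from the CSV header (the real loading script is produced by the generator):

```sql
-- Sketch only: load one local CSV using the separator ';' from the config above.
-- Table and column names are assumptions for illustration.
CREATE TABLE tmp_test2017 (col1 text, col2 text, col3 text);
COPY tmp_test2017
  FROM '/home/user/mytests/test2017.csv'
  WITH (FORMAT csv, HEADER true, DELIMITER ';');
-- In psql, use \copy instead of COPY to read a client-side file.
```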
...
```
BEGIN of cache-scripts generation
CONFIGS (github.com): NsAsDft= useIDX=, count=6 items.
Creating cache-scripts for lexml/lexml-vocabulary of github.com:
 Building table1 with data/autoridade.csv.
 Building table2 with data/localidade.csv.
 Building table3 with data/tipoDocumento.csv.
 Building table4 with data/evento.csv.
 Building table5 with data/lingua.csv.
 Building table6 with data/tipoConteudo.csv.
Creating cache-scripts for datasets/language-codes of github.com:
 Building table7 with data/language-codes.csv.
 Building table8 with data/language-codes-3b2.csv.
 Building table9 with data/language-codes-full.csv.
 Building table10 with data/ietf-language-tags.csv.
Creating cache-scripts for datasets/country-codes of github.com:
 Building table11 with data/country-codes.csv.
Creating cache-scripts for datasets/world-cities of github.com:
 -- Notice: using conf-corrections for datapackage
 ... Replacing resources[0][primaryKey] by 'geonameid'
 Building table12 with data/world-cities.csv.
Creating cache-scripts for datasets-br/state-codes of github.com:
 Building table13 with data/br-state-codes.csv.
Creating cache-scripts for datasets-br/city-codes of github.com:
 Building table14 with data/br-city-synonyms.csv.
 Building table15 with data/br-city-codes.csv.
END of cache-scripts generation
```
The configuration and its first output results are Example 1. To check what was imported you can compare the `conf.json` directives with the `vmeta_summary` view:
```
select * from dataset.vmeta_summary;

 id |               urn               |          pkey          |   jtd   | n_cols | n_rows
----+---------------------------------+------------------------+---------+--------+--------
  1 | (2)lexml:autoridade             | id                     | tab-aoa |      9 |    601
  4 | (2)lexml:evento                 | id                     | tab-aoa |      9 |     14
 ...
 15 | (4)datasets-br:br_city_codes    | state/lexLabel         | tab-aoa |      9 |   5570
 14 | (4)datasets-br:br_city_synonyms | state/lexLabel/synonym | tab-aoa |      5 |     26
 13 | (4)datasets-br:br_state_codes   | id                     | tab-aoa |     15 |     33
(15 rows)
```
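Since the `urn` values above carry a namespace prefix, you can also filter the summary per namespace; for example, only the datasets-br rows of the table above:

```sql
-- Filter the summary by namespace, using the urn column shown above:
SELECT urn, pkey, n_rows
  FROM dataset.vmeta_summary
 WHERE urn LIKE '(4)datasets-br:%';
```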
- Conf's `datasets/country-codes` generated the *datasets* namespace, and there were `"datasets-br/state-codes": "br-state-codes"` and `"datasets-br/city-codes": null`
...