Utilities to create and manage Scythica S-Datasets. S-Datasets are meant to be analyzed by analytical packages that support memory mapped files, such as the R RScythica package.
See the [RScythica][http://github.com/geraldthewes/RScythica] for details.
Currently supports limited CSV creation.
LICENSE: LGPL 2.1
-
S Dataframes consists of multiple binary, column oriented files that hold vector data in binary form
-
S Dataframes are partioned.
-
Partitions are further broken into large splits, typically 64MB in size for efficiency
The following types are currently supported:
- int32:32-Bit integer
- double: 64-Bit Double
- factor: Strings interpreted as R factors
- date: Date type
- datetime: DateTime type. Format can be specified in the attribute section using the Go Time formating syntax
Download recent go distrupution from the golang site Latest build were done using Go 1.3
You will need various version control utilities installed including
- GIT
- Mercurial
- Bazaar
sudo apt-get install bzr
Then:
mkdir ~/go # Or any other path
export GOPATH=~/go
mkdir -p ~/go/src/github.com/geraldthewes/
cd ~/go/src/github.com/geraldthewes/
git clone https://github.com/geraldthewes/scythica.git
cd scythica/sdsmeta
go get
go build
go install
cd ../sdscreate
go build
go install
sudo cp sdscreate /usr/local/bin
Configuration files are in YAML format.
First section is columns. For each column specify the column name, column data type and optional attributes.
Column attributes include:
- pkey - Indicates the column is part of the partition key. Multiple columns can form the partition key. One column must be a partition key
- pkey0p2 - Same as pkey, but column is expected to be a number, and number will be 0 padded to two digits. Useful if you have date fields that need to be part of the key that are not 0 padded
- A go Time datetime format - such as "20060102 15:04"
The keyspace section contain S-Dataset wide configuration settings
- key_size - Unused at the momemnt
- nodes - Unused at the moment
- rows_per_split - Max number of rows to be part of a split. Should be a large number like 1000000 or more.
- isna - Global default for NA value.
columns:
- colname: STATION
coltype: factor
attributes: pkey
- colname: STATION_NAME
coltype: factor
- colname: ELEVATION
coltype: double
- colname: LATITUDE
coltype: double
- colname: LONGITUDE
coltype: double
- colname: DATE
coltype: datetime
attributes: "20060102 15:04"
- colname: MEASUREMENT
coltype: int32
- colname: FLAG
coltype: factor
- colname: QUALITY
coltype: factor
- colname: UNITS
coltype: factor
keyspace:
key_size: 8192
nodes: 1
rows_per_split: 1000
isna: "NA"
sdscreate data/boston.yaml boston.db data/boston-1970-2014.csv