Description
ENH: Linked Datasets (RDF)
- This is very much a meta ticket.
- There are a number of bare links here.
- They are for documentation (UPDATE: see westurner/pandasrdf#1)
Use Case
So I:
- retrieved some data
- from somewhere
- about a certain #topic
- performed analysis
- with certain transformations and aggregations
- with certain versions of certain tools
- confirmed/rejected a [null] hypothesis
and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.
User Story
As a data analyst, I would like to share or publish `Series`, `DataFrame`s, `Panel`s, and `Panel4D`s as structured, hierarchical, RDF linked data ("DataSet").
Status Quo: Pandas IO
How do I go from a [CSV] to a DataFrame to something shareable with a URL?
http://pandas.pydata.org/pandas-docs/dev/io.html
- http://pandas.pydata.org/pandas-docs/dev/dsintro.html
- https://github.com/pydata/pandas/blob/master/pandas/core/format.py
- https://github.com/pydata/pandas/blob/master/pandas/rpy/common.py
- Series (1D)
- index
- data
- NumPy datatypes
- DataFrame (2D)
- index
- column(s)
- NumPy datatypes
- Panel (3D)
- Panel4D (4D)
Read or parse a data format into a DataSet:
`pandas.read_*`:
- `read_clipboard`
- `read_csv`
- `read_excel`
- `read_fwf`
- `read_gbq`
- `read_hdf`
- `read_html`
- `read_json`
- `read_msgpack`
- `read_pickle`
- `read_sql`
- `read_stata`
- `read_table`
- `pandas.HDFStore`
Add metadata:
- Add RDF metadata (RDFa, JSON-LD)
Save or serialize a DataSet into a data format:
`pandas.DataFrame.*`:
- `to_csv`
- `to_dict`
- `to_excel`
- `to_gbq`
- `to_html`
- `to_latex`
- `to_panel`
- `to_period`
- `to_records`
- `to_sparse`
- `to_sql`
- `to_stata`
- `to_string`
- `to_timestamp`
- `to_wide`
Proposed additions:
- `to_rdf`
- `to_csvw`
- `to_html` + RDFa
- `to_jsonld`
  - create a JSON-LD `@context` (see the sketch below)
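A minimal sketch of what creating a JSON-LD `@context` for a `DataFrame` could look like. The `frame_jsonld_context` helper and the default column-to-property mapping are hypothetical, not an existing pandas API; the schema.org and DBpedia URIs are only illustrative choices:

```python
import json

import pandas as pd


def frame_jsonld_context(df, vocab="http://schema.org/", column_terms=None):
    """Build a JSON-LD @context mapping DataFrame column names to URIs (hypothetical)."""
    column_terms = column_terms or {}
    context = {"@vocab": vocab}
    for col in df.columns:
        # default: append the column name to the vocabulary namespace
        context[str(col)] = column_terms.get(col, vocab + str(col))
    return {"@context": context}


df = pd.DataFrame({"name": ["NYC", "LA"], "population": [8405837, 3884307]})
print(json.dumps(
    frame_jsonld_context(
        df, column_terms={"population": "http://dbpedia.org/ontology/populationTotal"}),
    indent=2))
```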
Share or publish a serialized DataSet with the internet:
- Email Attachment (Table in a PDF)
- opendatahandbook.org
- project-open-data.github.io
- FTP, SFTP, RSYNC, NFS
- HTML web upload form with metadata form fields
- CLI tool
- Version Control: Git, Hg, Svn
- challenge: 'large' files ("binary blobs") in VCS systems
- HTTP API: Object Storage (~LDP)
  - GET/POST /container/filename.csv # [.json|.xml|.xls|.rdf|.html]
  - challenge: indexing metadata from a separate document / named graph
    (GET/POST to /container/filename.csv)
- Push to CKAN
- Host DataSet metadata
  - python -m SimpleHTTPServer 8088
  - e.g. http://datasets.schema-labs.appspot.com/about indexes http://schema.org/Dataset instances
Implementation
What changes would be needed for Pandas core to support this workflow?
- `.meta` schema
- `to_rdf` for Series, DataFrames, Panels, and Panel4Ds
- `read_rdf` for Series, DataFrames, Panels, and Panel4Ds
- ~ `@datastep` process decorators
- ~ `DataSet`
- ~ `DataCatalog` of precomputed aggregations/views/slices
- Units support (`.meta`?)
.meta schema
It's easy enough to serialize a dict and a table to naive RDF.
For interoperability, it would be helpful to standardize on a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.
There is currently no standard method for storing columnar metadata
within Pandas (e.g. in `.meta['columns'][colname]['schema']`, or as a JSON-LD `@context`).
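A sketch of the kind of columnar schema such a `.meta` could hold, written here as a standalone dict because pandas defines no `.meta` attribute; the column names, DBpedia property URIs, and nested layout are all assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"population": [8405837, 3884307],
                   "areaKm2": [783.8, 1302.0]},
                  index=["NewYork", "LosAngeles"])

# Hypothetical columnar metadata standing in for a future df.meta
df_meta = {
    "@context": {"@vocab": "http://schema.org/"},
    "columns": {
        "population": {
            "schema": "http://dbpedia.org/ontology/populationTotal",
            "datatype": "http://www.w3.org/2001/XMLSchema#integer",
        },
        "areaKm2": {
            "schema": "http://dbpedia.org/ontology/areaTotal",
            "datatype": "http://www.w3.org/2001/XMLSchema#decimal",
        },
    },
}
```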
Ontology Resources
- http://www.w3.org/TR/rdf-schema/ (`rdfs:`)
- http://www.w3.org/TR/owl-overview/ (`owl:`)
- http://www.w3.org/TR/sparql11-query/#sparqlDefinition
- http://lov.okfn.org
- http://prefix.cc
CSV2RDF (`csvw`)
W3C PROV (`prov:`)
- http://www.w3.org/TR/prov-primer/#intuitive-overview-of-prov
- http://www.w3.org/TR/prov-o/
- http://www.w3.org/2011/prov/wiki/ProvImplementations
schema.org (`schema:`)
- http://schema.org
- http://www.w3.org/wiki/WebSchemas
- http://schema.rdfs.org/
- https://schema.org/docs/full.html:
- schema:Dataset -- A body of structured information describing some topic(s) of interest.
- [schema:Thing, schema:CreativeWork]
- distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
- spatial, temporal
- catalog -- A data catalog which contains a dataset (DataCatalog)
- schema:DataCatalog -- collection of Datasets
- [schema:Thing, schema:CreativeWork]
- dataset -- A dataset contained in a catalog. (Dataset)
- schema:DataDownload -- A dataset in downloadable form.
- [schema:Thing, schema:CreativeWork]
- contentSize
- contentURL
- uploadDate
W3C RDF Data Cube (`qb:`)
- http://www.w3.org/TR/vocab-data-cube/
- http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary#The_history_of_Data_Cube.2C_SDMX-RDF_and_SCOVO
- http://www.w3.org/TR/vocab-data-cube/#vocab-reference :
- qb:DataSet -- a collection of observations, possibly organized into various slices, conforming to some common dimensional structure
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values.
- qb:Observation -- a single observation in the cube, may have one or more associated measured values.
- qb:dataset -- data set of which this observation is a part.
- qb:ObservationGroup -- a, possibly arbitrary, group of observations.
- qb:observation -- an observation contained within this slice of the data set.
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values, component properties on the Slice.
- [Components, Properties, Dimensions, Attributes, Measures]
to_rdf
http://pandas.pydata.org/pandas-docs/dev/io.html
Arguments:
- output `fmt`
- JSON-LD: compaction
- `Series.meta`
- `Series.to_rdf()`
- `DataFrame.meta`
- `DataFrame.to_rdf()`
- `Panel.meta`
- `Panel.to_rdf()`
- `Panel4D.meta`
- `Panel4D.to_rdf()`
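A minimal sketch of what a `DataFrame.to_rdf` might do, written here as a free function with rdflib; the `to_rdf` name, the row-URI pattern, and the column-to-predicate mapping are assumptions, not an existing pandas or rdflib API:

```python
import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef


def to_rdf(df, base_uri="http://example.com/datasets/example#", fmt="turtle"):
    """Serialize a DataFrame to RDF: one subject per row, one predicate per column."""
    ns = Namespace(base_uri)
    g = Graph()
    g.bind("ds", ns)
    for idx, row in df.iterrows():
        subject = URIRef(base_uri + str(idx))              # row index -> subject URI
        for col, value in row.items():
            g.add((subject, ns[str(col)], Literal(value)))  # column name -> predicate
    return g.serialize(format=fmt)


df = pd.DataFrame({"population": [8405837, 3884307]},
                  index=["NewYork", "LosAngeles"])
print(to_rdf(df))
```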
read_rdf
http://pandas.pydata.org/pandas-docs/dev/remote_data.html
- `Series.read_rdf()`
- `DataFrame.read_rdf()`
- `Panel.read_rdf()`
- `Panel4D.read_rdf()`
Arguments to `read_rdf` would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
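A minimal sketch of a 2D `read_rdf`, using rdflib to run a SPARQL pattern and load the bindings into a `DataFrame`; the `read_rdf` name, the query-to-columns convention, and the `dataset.ttl` file are assumptions for illustration:

```python
import pandas as pd
from rdflib import Graph


def read_rdf(source, query, format="turtle"):
    """Parse an RDF document and return one DataFrame row per SPARQL result binding."""
    g = Graph()
    g.parse(source, format=format)
    results = g.query(query)
    columns = [str(v) for v in results.vars]   # projected ?variables become columns
    rows = [[None if value is None else value.toPython() for value in row]
            for row in results]
    return pd.DataFrame(rows, columns=columns)


df = read_rdf(
    "dataset.ttl",  # hypothetical input file
    "SELECT ?s ?population WHERE { ?s <http://dbpedia.org/ontology/populationTotal> ?population }",
)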
@datastep / PROV
- Objective: Additive journal of transformations
- Link to source script(s) URIs
- Decorator for annotating data transformations with metadata.
- Generate PROV metadata for data transformations
Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
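A minimal sketch of a hypothetical `@datastep` decorator that appends a PROV-style record (activity name, start/end time, input types) to an additive journal; the decorator name and the journal structure come from this ticket's wish list, not an existing library:

```python
import datetime
import functools

PROV_JOURNAL = []  # additive, append-only journal of transformations


def datastep(func):
    """Record a PROV-like activity each time a data transformation runs (hypothetical)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.datetime.utcnow().isoformat()
        result = func(*args, **kwargs)
        PROV_JOURNAL.append({
            "prov:Activity": func.__name__,
            "prov:startedAtTime": started,
            "prov:endedAtTime": datetime.datetime.utcnow().isoformat(),
            "prov:used": [type(arg).__name__ for arg in args],
        })
        return result
    return wrapper


@datastep
def normalize(df):
    return (df - df.mean()) / df.std()
```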
DataCatalog
A collection of Datasets.
- `DataCatalog = dict(that=df1, this=df1.groupby(...).apply(...), also_this=df2)`
- 'this is an aggregation of that'
- 'this' has a URI
- 'that' has a URI
- 'this is an aggregation of that'
- What if there is no metadata for df2?
Units support
- Series.meta
- DataFrame.column.meta
- NumPy [, PyTables]
- http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
- https://pint.readthedocs.org/en/latest/
- http://pythonhosted.org/quantities/
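A short sketch of attaching units with pint, independent of any `.meta` convention; pint's unit registry and Quantity wrapping of NumPy arrays are real, but storing the unit string alongside the Series is just one possible approach:

```python
import pandas as pd
import pint

ureg = pint.UnitRegistry()

distances = pd.Series([1.2, 3.4, 5.6], name="distance")

# pandas itself does not track units, so the unit is attached out-of-band
# by wrapping the underlying NumPy array in a pint Quantity
distances_m = distances.values * ureg.meter
distances_km = distances_m.to(ureg.kilometer)

# one possible way a .meta convention could record the same information
distances_meta = {"unit": str(distances_m.units)}   # 'meter'
```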
RDF Datatypes
- http://en.wikipedia.org/wiki/ISO_8601
- http://www.w3.org/TR/xmlschema-2/#decimal
- http://schema.org/Date
- http://schema.org/DateTime
- http://schema.org/Float
- http://schema.org/Quantity
- https://github.com/RDFLib/rdflib
from rdflib.namespace import XSD, RDF, RDFS
from rdflib import URIRef, Literal
- https://github.com/RDFLib/rdflib-sqlalchemy (SQLAlchemy)
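For example, XSD-typed literals for the datatypes listed above can be built directly with rdflib (the values here are arbitrary):

```python
from rdflib import Literal
from rdflib.namespace import XSD

# RDF literals carry an explicit XSD datatype URI alongside the lexical value
price = Literal("19.95", datatype=XSD.decimal)
observed = Literal("2014-03-01T12:00:00", datatype=XSD.dateTime)

price.toPython()     # Decimal('19.95')
observed.toPython()  # datetime.datetime(2014, 3, 1, 12, 0)
```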
JSON-LD (RDF in JSON)
- https://github.com/digitalbazaar/pyld (JSON-LD)
- https://github.com/RDFLib/rdflib-jsonld (JSON-LD)
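A short sketch of serializing a small rdflib graph to JSON-LD; the `"json-ld"` format is provided by the rdflib-jsonld plugin, and the subject and predicate URIs are invented for illustration:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import XSD

g = Graph()
g.add((URIRef("http://example.com/datasets/example#NewYork"),
       URIRef("http://dbpedia.org/ontology/populationTotal"),
       Literal("8405837", datatype=XSD.integer)))

# "json-ld" is registered by the rdflib-jsonld plugin
print(g.serialize(format="json-ld"))
```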
Linked Data Primer
Linked Data Abstractions
- Graphs are represented as triples of (s,p,o)
- Subject, Predicate, Object
- Queries are patterns with ?references
  - graph.triples((None, None, None))
  - SELECT ?s ?p ?o WHERE { ?s ?p ?o }
- subjects are linked to objects by predicates
- subjects and predicates are identified by URI 'keys'
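For example, both query styles above over a small rdflib graph (the URIs are invented for illustration):

```python
from rdflib import Graph, Literal, URIRef

g = Graph()
s = URIRef("http://example.com/datasets/example#NewYork")
p = URIRef("http://dbpedia.org/ontology/populationTotal")
g.add((s, p, Literal(8405837)))

# triple-pattern style: None is a wildcard for that position
for subj, pred, obj in g.triples((None, p, None)):
    print(subj, pred, obj)

# SPARQL style: ?variables play the same role as the Nones above
for row in g.query("SELECT ?s ?o WHERE { ?s ?p ?o }"):
    print(row.s, row.o)
```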
URIs and URLs
- a URI is like a URL
- usually, we expect URLs to be 'dereferenceable' HTTP URIs
- HTTP GET http://en.wikipedia.org/
- a URI may start with a different URI prefix
  - urn:
  - uuid:
SQL and Linked Data
- there exist standard mappings for whole SQL tablesets
- rdb2rdf
- similar to application scaffolding
- ACL support adds complexity
- Virtuoso supports SQL, RDF, and SPARQL
- standard mappings
- Virtuoso powers http://dbpedia.org/
- dbpedia.org has a high degree of centrality
- rdflib-sqlalchemy maps RDF onto SQL tables
- fairly inefficiently, when compared to native triplestores
Named Graphs
- Quads: (g, s, p, o)
- g: sometimes called the 'context' of a triple
- Metadata about a `GRAPH ?g`
- Multiple named graphs in one file: TriX, TriG
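A short example of working with quads in rdflib; `ConjunctiveGraph`, `get_context`, and the TriG serializer are real rdflib features, while the graph and statement URIs are invented:

```python
from rdflib import ConjunctiveGraph, Literal, URIRef

ds = ConjunctiveGraph()
graph_uri = URIRef("http://example.com/datasets/example")   # the named graph / context
ctx = ds.get_context(graph_uri)
ctx.add((URIRef("http://example.com/datasets/example#NewYork"),
         URIRef("http://dbpedia.org/ontology/populationTotal"),
         Literal(8405837)))

# quads are (s, p, o, g): the fourth element is the triple's context
for s, p, o, g in ds.quads((None, None, None, None)):
    print(g, s, p, o)

# TriG keeps multiple named graphs in a single serialization
print(ds.serialize(format="trig"))
```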
Linked Data Formats
- NTriples
- RDF/XML
- TriX
- Turtle, N3
- TriG
- JSON-LD
Choosing Schema
- XSD, RDF, RDFS, DCTERMS
- Which schema is most popular?
- Which schema is a best fit for the data?
- Which schema will search engines index for us?
- What do the queries look like?
- Years Later... What is OWL?
- Why would we start with RDFS now?
Linked Data Process, Provenance, and Schema
DataSets have [implicit] URIs:
http://example.com/datasets/#<key>
Shared or published DataSets have URLs:
http://ckan.example.org/datasets/<key>
DataSets are about certain things:
e.g. URIs for #Tags, Categories, Taxonomy, Ontology
DataSets are derived from somewhere, somehow:
- where and how was it downloaded? (digital sense)
- how was it collected? (process control sense)
Datasets have structure:
- Tabular, Hierarchical
- 1D, 2D, 3D, 4D
- Graph-based
- Chains
- Flows
- Schema
5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.