Description
ENH: Linked Datasets (RDF)
- This is very much a meta ticket.
- There are a number of bare links here.
- They are for documentation (UPDATE: see westurner/pandasrdf#1)
Use Case
So I:
- retrieved some data
- from somewhere
- about a certain #topic
- performed analysis
- with certain transformations and aggregations
- with certain versions of certain tools
- confirmed/rejected a [null] hypothesis
and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.
User Story
As a data analyst, I would like to share or publish `Series`, `DataFrame`s, `Panel`s, and `Panel4D`s as structured, hierarchical, RDF linked data ("DataSet").
Status Quo: Pandas IO
How do I go from a [CSV] to a DataFrame to something shareable with a URL?
http://pandas.pydata.org/pandas-docs/dev/io.html
- http://pandas.pydata.org/pandas-docs/dev/dsintro.html
- https://github.com/pydata/pandas/blob/master/pandas/core/format.py
- https://github.com/pydata/pandas/blob/master/pandas/rpy/common.py
- Series (1D)
- index
- data
- NumPy datatypes
- DataFrame (2D)
- index
- column(s)
- NumPy datatypes
- Panel (3D)
- Panel4D (4D)
Read or parse a data format into a DataSet:
`pandas.read_*`:
- `read_clipboard`
- `read_csv`
- `read_excel`
- `read_fwf`
- `read_gbq`
- `read_hdf`
- `read_html`
- `read_json`
- `read_msgpack`
- `read_pickle`
- `read_sql`
- `read_stata`
- `read_table`
- `pandas.HDFStore`
Add metadata:
- Add RDF metadata (RDFa, JSON-LD)
Save or serialize a DataSet into a data format:
`pandas.DataFrame.*`:
- `to_csv`
- `to_dict`
- `to_excel`
- `to_gbq`
- `to_html`
- `to_latex`
- `to_panel`
- `to_period`
- `to_records`
- `to_sparse`
- `to_sql`
- `to_stata`
- `to_string`
- `to_timestamp`
- `to_wide`
Proposed additions:
- `to_rdf`
- `to_csvw`
- `to_html` + RDFa
- `to_jsonld`
  - create a JSON-LD `@context` (see the sketch below)
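A minimal sketch of what creating a JSON-LD `@context` for a `DataFrame` could look like. The `frame_jsonld_context` helper and the default column-to-property mapping are hypothetical, not an existing pandas API; the schema.org and DBpedia URIs are only illustrative choices:

```python
import json

import pandas as pd


def frame_jsonld_context(df, vocab="http://schema.org/", column_terms=None):
    """Build a JSON-LD @context mapping DataFrame column names to URIs (hypothetical)."""
    column_terms = column_terms or {}
    context = {"@vocab": vocab}
    for col in df.columns:
        # default: append the column name to the vocabulary namespace
        context[str(col)] = column_terms.get(col, vocab + str(col))
    return {"@context": context}


df = pd.DataFrame({"name": ["NYC", "LA"], "population": [8405837, 3884307]})
print(json.dumps(
    frame_jsonld_context(
        df, column_terms={"population": "http://dbpedia.org/ontology/populationTotal"}),
    indent=2))
```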
Share or publish a serialized DataSet with the internet:
- Email Attachment (Table in a PDF)
- opendatahandbook.org
- project-open-data.github.io
- FTP, SFTP, RSYNC, NFS
- HTML web upload form with metadata form fields
- CLI tool
- Version Control: Git, Hg, Svn
- challenge: 'large' files ("binary blobs") in VCS systems
- HTTP API: Object Storage (~LDP)
  - GET/POST /container/filename.csv # [.json|.xml|.xls|.rdf|.html]
  - challenge: indexing metadata from a separate document / named graph
    (GET/POST to /container/filename.csv)
- Push to CKAN
- Host DataSet metadata
  - python -m SimpleHTTPServer 8088
  - e.g. http://datasets.schema-labs.appspot.com/about indexes http://schema.org/Dataset instances
Implementation
What changes would be needed for Pandas core to support this workflow?
- `.meta` schema
- `to_rdf` for Series, DataFrames, Panels, and Panel4Ds
- `read_rdf` for Series, DataFrames, Panels, and Panel4Ds
- ~ `@datastep` process decorators
- ~ `DataSet`
- ~ `DataCatalog` of precomputed aggregations/views/slices
- Units support (`.meta`?)
.meta schema
It's easy enough to serialize a dict and a table to naive RDF.
For interoperability, it would be helpful to standardize on a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.
There is currently no standard method for storing columnar metadata
within Pandas (e.g. in `.meta['columns'][colname]['schema']`, or as a JSON-LD `@context`).
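A sketch of the kind of columnar schema such a `.meta` could hold, written here as a standalone dict because pandas defines no `.meta` attribute; the column names, DBpedia property URIs, and nested layout are all assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"population": [8405837, 3884307],
                   "areaKm2": [783.8, 1302.0]},
                  index=["NewYork", "LosAngeles"])

# Hypothetical columnar metadata standing in for a future df.meta
df_meta = {
    "@context": {"@vocab": "http://schema.org/"},
    "columns": {
        "population": {
            "schema": "http://dbpedia.org/ontology/populationTotal",
            "datatype": "http://www.w3.org/2001/XMLSchema#integer",
        },
        "areaKm2": {
            "schema": "http://dbpedia.org/ontology/areaTotal",
            "datatype": "http://www.w3.org/2001/XMLSchema#decimal",
        },
    },
}
```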
Ontology Resources
- http://www.w3.org/TR/rdf-schema/ (`rdfs:`)
- http://www.w3.org/TR/owl-overview/ (`owl:`)
- http://www.w3.org/TR/sparql11-query/#sparqlDefinition
- http://lov.okfn.org
- http://prefix.cc
CSV2RDF (`csvw`)
W3C PROV (`prov:`)
- http://www.w3.org/TR/prov-primer/#intuitive-overview-of-prov
- http://www.w3.org/TR/prov-o/
- http://www.w3.org/2011/prov/wiki/ProvImplementations
schema.org (`schema:`)
- http://schema.org
- http://www.w3.org/wiki/WebSchemas
- http://schema.rdfs.org/
- https://schema.org/docs/full.html:
- schema:Dataset -- A body of structured information describing some topic(s) of interest.
- [schema:Thing, schema:CreativeWork]
- distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
- spatial, temporal
- catalog -- A data catalog which contains a dataset (DataCatalog)
- schema:DataCatalog -- collection of Datasets
- [schema:Thing, schema:CreativeWork]
- dataset -- A dataset contained in a catalog. (Dataset)
- schema:DataDownload -- A dataset in downloadable form.
- [schema:Thing, schema:CreativeWork]
- contentSize
- contentURL
- uploadDate
W3C RDF Data Cube (`qb:`)
- http://www.w3.org/TR/vocab-data-cube/
- http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary#The_history_of_Data_Cube.2C_SDMX-RDF_and_SCOVO
- http://www.w3.org/TR/vocab-data-cube/#vocab-reference :
- qb:DataSet -- a collection of observations, possibly organized into various slices, conforming to some common dimensional structure
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values.
- qb:Observation -- a single observation in the cube, may have one or more associated measured values.
- qb:dataset -- data set of which this observation is a part.
- qb:ObservationGroup -- a, possibly arbitrary, group of observations.
- qb:observation -- an observation contained within this slice of the data set.
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values, component properties on the Slice.
- [Components, Properties, Dimensions, Attributes, Measures]
to_rdf
http://pandas.pydata.org/pandas-docs/dev/io.html
Arguments:
- output `fmt`
- JSON-LD: compaction
- `Series.meta`
- `Series.to_rdf()`
- `DataFrame.meta`
- `DataFrame.to_rdf()`
- `Panel.meta`
- `Panel.to_rdf()`
- `Panel4D.meta`
- `Panel4D.to_rdf()`
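A minimal sketch of what a `DataFrame.to_rdf` might do, written here as a free function with rdflib; the `to_rdf` name, the row-URI pattern, and the column-to-predicate mapping are assumptions, not an existing pandas or rdflib API:

```python
import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef


def to_rdf(df, base_uri="http://example.com/datasets/example#", fmt="turtle"):
    """Serialize a DataFrame to RDF: one subject per row, one predicate per column."""
    ns = Namespace(base_uri)
    g = Graph()
    g.bind("ds", ns)
    for idx, row in df.iterrows():
        subject = URIRef(base_uri + str(idx))              # row index -> subject URI
        for col, value in row.items():
            g.add((subject, ns[str(col)], Literal(value)))  # column name -> predicate
    return g.serialize(format=fmt)


df = pd.DataFrame({"population": [8405837, 3884307]},
                  index=["NewYork", "LosAngeles"])
print(to_rdf(df))
```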
read_rdf
http://pandas.pydata.org/pandas-docs/dev/remote_data.html
- `Series.read_rdf()`
- `DataFrame.read_rdf()`
- `Panel.read_rdf()`
- `Panel4D.read_rdf()`
Arguments to `read_rdf` would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
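A minimal sketch of a 2D `read_rdf`, using rdflib to run a SPARQL pattern and load the bindings into a `DataFrame`; the `read_rdf` name, the query-to-columns convention, and the `dataset.ttl` file are assumptions for illustration:

```python
import pandas as pd
from rdflib import Graph


def read_rdf(source, query, format="turtle"):
    """Parse an RDF document and return one DataFrame row per SPARQL result binding."""
    g = Graph()
    g.parse(source, format=format)
    results = g.query(query)
    columns = [str(v) for v in results.vars]   # projected ?variables become columns
    rows = [[None if value is None else value.toPython() for value in row]
            for row in results]
    return pd.DataFrame(rows, columns=columns)


df = read_rdf(
    "dataset.ttl",  # hypothetical input file
    "SELECT ?s ?population WHERE { ?s <http://dbpedia.org/ontology/populationTotal> ?population }",
)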
@datastep / PROV
- Objective: Additive journal of transformations
- Link to source script(s) URIs
- Decorator for annotating data transformations with metadata.
- Generate PROV metadata for data transformations
Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
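A minimal sketch of a hypothetical `@datastep` decorator that appends a PROV-style record (activity name, start/end time, input types) to an additive journal; the decorator name and the journal structure come from this ticket's wish list, not an existing library:

```python
import datetime
import functools

PROV_JOURNAL = []  # additive, append-only journal of transformations


def datastep(func):
    """Record a PROV-like activity each time a data transformation runs (hypothetical)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.datetime.utcnow().isoformat()
        result = func(*args, **kwargs)
        PROV_JOURNAL.append({
            "prov:Activity": func.__name__,
            "prov:startedAtTime": started,
            "prov:endedAtTime": datetime.datetime.utcnow().isoformat(),
            "prov:used": [type(arg).__name__ for arg in args],
        })
        return result
    return wrapper


@datastep
def normalize(df):
    return (df - df.mean()) / df.std()
```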
DataCatalog
A collection of Datasets.
- `DataCatalog = dict(that=df1, this=df1.groupby(...).apply(...), also_this=df2)`
- 'this is an aggregation of that'
- 'this' has a URI
- 'that' has a URI
- 'this is an aggregation of that'
- What if there is no metadata for df2?
Units support
- Series.meta
- DataFrame.column.meta
- NumPy [, PyTables]
- http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
- https://pint.readthedocs.org/en/latest/
- http://pythonhosted.org/quantities/
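A short sketch of attaching units with pint, independent of any `.meta` convention; pint's unit registry and Quantity wrapping of NumPy arrays are real, but storing the unit string alongside the Series is just one possible approach:

```python
import pandas as pd
import pint

ureg = pint.UnitRegistry()

distances = pd.Series([1.2, 3.4, 5.6], name="distance")

# pandas itself does not track units, so the unit is attached out-of-band
# by wrapping the underlying NumPy array in a pint Quantity
distances_m = distances.values * ureg.meter
distances_km = distances_m.to(ureg.kilometer)

# one possible way a .meta convention could record the same information
distances_meta = {"unit": str(distances_m.units)}   # 'meter'
```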
RDF Datatypes
- http://en.wikipedia.org/wiki/ISO_8601
- http://www.w3.org/TR/xmlschema-2/#decimal
- http://schema.org/Date
- http://schema.org/DateTime
- http://schema.org/Float
- http://schema.org/Quantity
- https://github.com/RDFLib/rdflib
from rdflib.namespace import XSD, RDF, RDFS
from rdflib import URIRef, Literal
- https://github.com/RDFLib/rdflib-sqlalchemy (SQLAlchemy)
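For example, XSD-typed literals for the datatypes listed above can be built directly with rdflib (the values here are arbitrary):

```python
from rdflib import Literal
from rdflib.namespace import XSD

# RDF literals carry an explicit XSD datatype URI alongside the lexical value
price = Literal("19.95", datatype=XSD.decimal)
observed = Literal("2014-03-01T12:00:00", datatype=XSD.dateTime)

price.toPython()     # Decimal('19.95')
observed.toPython()  # datetime.datetime(2014, 3, 1, 12, 0)
```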
JSON-LD (RDF in JSON)
- https://github.com/digitalbazaar/pyld (JSON-LD)
- https://github.com/RDFLib/rdflib-jsonld (JSON-LD)
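A short sketch of serializing a small rdflib graph to JSON-LD; the `"json-ld"` format is provided by the rdflib-jsonld plugin, and the subject and predicate URIs are invented for illustration:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import XSD

g = Graph()
g.add((URIRef("http://example.com/datasets/example#NewYork"),
       URIRef("http://dbpedia.org/ontology/populationTotal"),
       Literal("8405837", datatype=XSD.integer)))

# "json-ld" is registered by the rdflib-jsonld plugin
print(g.serialize(format="json-ld"))
```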
Linked Data Primer
Linked Data Abstractions
- Graphs are represented as triples of (s,p,o)
- Subject, Predicate, Object
- Queries are patterns with ?references
  - graph.triples((None, None, None))
  - SELECT ?s ?p ?o WHERE { ?s ?p ?o }
- subjects are linked to objects by predicates
- subjects and predicates are identified by URI 'keys'
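For example, both query styles above over a small rdflib graph (the URIs are invented for illustration):

```python
from rdflib import Graph, Literal, URIRef

g = Graph()
s = URIRef("http://example.com/datasets/example#NewYork")
p = URIRef("http://dbpedia.org/ontology/populationTotal")
g.add((s, p, Literal(8405837)))

# triple-pattern style: None is a wildcard for that position
for subj, pred, obj in g.triples((None, p, None)):
    print(subj, pred, obj)

# SPARQL style: ?variables play the same role as the Nones above
for row in g.query("SELECT ?s ?o WHERE { ?s ?p ?o }"):
    print(row.s, row.o)
```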
URIs and URLs
- a URI is like a URL
- usually, we expect URLs to be 'dereferenceable' HTTP URIs
- HTTP GET http://en.wikipedia.org/
- a URI may start with a different URI prefix
  - urn:
  - uuid:
SQL and Linked Data
- there exist standard mappings for whole SQL tablesets
- rdb2rdf
- similar to application scaffolding
- ACL support adds complexity
- Virtuoso supports SQL, RDF, and SPARQL
- standard mappings
- Virtuoso powers http://dbpedia.org/
- dbpedia.org has a high degree of centrality
- rdflib-sqlalchemy maps RDF onto SQL tables
- fairly inefficiently, when compared to native triplestores
Named Graphs
- Quads: (g, s, p, o)
- g: sometimes called the 'context' of a triple
- Metadata about a `GRAPH ?g`
- Multiple named graphs in one file: TriX, TriG
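A short example of working with quads in rdflib; `ConjunctiveGraph`, `get_context`, and the TriG serializer are real rdflib features, while the graph and statement URIs are invented:

```python
from rdflib import ConjunctiveGraph, Literal, URIRef

ds = ConjunctiveGraph()
graph_uri = URIRef("http://example.com/datasets/example")   # the named graph / context
ctx = ds.get_context(graph_uri)
ctx.add((URIRef("http://example.com/datasets/example#NewYork"),
         URIRef("http://dbpedia.org/ontology/populationTotal"),
         Literal(8405837)))

# quads are (s, p, o, g): the fourth element is the triple's context
for s, p, o, g in ds.quads((None, None, None, None)):
    print(g, s, p, o)

# TriG keeps multiple named graphs in a single serialization
print(ds.serialize(format="trig"))
```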
Linked Data Formats
- NTriples
- RDF/XML
- TriX
- Turtle, N3
- TriG
- JSON-LD
Choosing Schema
- XSD, RDF, RDFS, DCTERMS
- Which schema is most popular?
- Which schema is a best fit for the data?
- Which schema will search engines index for us?
- What do the queries look like?
- Years Later... What is OWL?
- Why would we start with RDFS now?
Linked Data Process, Provenance, and Schema
DataSets have [implicit] URIs:
http://example.com/datasets/#<key>
Shared or published DataSets have URLs:
http://ckan.example.org/datasets/<key>
DataSets are about certain things:
e.g. URIs for #Tags, Categories, Taxonomy, Ontology
DataSets are derived from somewhere, somehow:
- where and how was it downloaded? (digital sense)
- how was it collected? (process control sense)
Datasets have structure:
- Tabular, Hierarchical
- 1D, 2D, 3D, 4D
- Graph-based
- Chains
- Flows
- Schema
5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.