Recommended file format for large files #129

Datashader is agnostic about file formats, working with anything that can be loaded into a dataframe-like object (currently Pandas and Dask dataframes). But because Datashader focuses on good performance for large datasets, the performance of the file format is a major factor in how usable the library is. Our examples should therefore guide users towards good solutions for their own problems, recommending and demonstrating approaches that we find to work well.

Right now, our examples use CSV plus either castra or HDF5. It is of course important to show a CSV example, since nearly every dataset can be obtained as CSV for import into the library. However, CSV is highly inefficient in both file size and reading speed, and CSV files often carry truncated floating-point precision (values are stored as decimal text, frequently with a fixed number of decimal places), which is problematic when zooming in closely on a dataset.
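To make the precision and size points concrete, here is a minimal sketch using only Pandas and NumPy. The coordinate ranges, the `csv_roundtrip` helper, and the `"%.2f"` format are all made up for illustration; `"%.2f"` just stands in for whatever fixed precision a CSV exporter might have used. It round-trips a small dataframe through CSV text twice, once truncated and once at full precision, and compares the result against the in-memory float64 data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical coordinates with large magnitudes, so that fine detail
# lives in the trailing decimal digits.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.uniform(-8.3e6, -8.2e6, size=1000),
    "y": rng.uniform(4.9e6, 5.0e6, size=1000),
})


def csv_roundtrip(frame, float_format=None):
    """Write `frame` to CSV text and read it back (hypothetical helper).

    Returns the re-read dataframe and the size of the CSV text in bytes.
    """
    buf = io.StringIO()
    frame.to_csv(buf, index=False, float_format=float_format)
    text = buf.getvalue()
    reread = pd.read_csv(io.StringIO(text), float_precision="round_trip")
    return reread, len(text.encode("utf-8"))


# A CSV written with two decimal places loses information.
truncated, truncated_bytes = csv_roundtrip(df, float_format="%.2f")
# A CSV written at full precision keeps the values but costs more bytes.
full, full_bytes = csv_roundtrip(df)

max_error_truncated = float((truncated - df).abs().max().max())
max_error_full = float((full - df).abs().max().max())

# Raw size of the same values stored as binary float64.
binary_bytes = int(df.to_numpy().nbytes)

print("max error, truncated CSV:", max_error_truncated)
print("max error, full-precision CSV:", max_error_full)
print("bytes (truncated CSV, full CSV, raw float64):",
      truncated_bytes, full_bytes, binary_bytes)
```

The truncated file is smaller than the full-precision one but no longer reproduces the original values, while the full-precision text takes noticeably more space than the raw float64 data. A binary format avoids that trade-off, which is the motivation for choosing one for the examples.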

[Castra](https://pypi.python.org/pypi/castra) is a relatively high-performance binary format that works well for the large datasets in the examples, but it is not yet a mature project and is not available on the main conda channel. Should we invest in supporting castra more fully? If not, I think we should choose another binary format (HDF5?) for our examples.

