Skip to content

Recommended file format for large files #129

@jbednar

Description

@jbednar

Datashader is agnostic about file formats, working with anything that can be loaded into a dataframe-like object (currently supporting Pandas and Dask dataframes). But because datashader focuses on having good performance for large datasets, the performance of the file format is a major factor in the usability of the library. Thus we should use examples that serve to guide users towards good solutions for their own problems, recommending and demonstrating approaches that we find to work well.

Right now, our examples use CSV and castra or HDF5 formats. It is of course important to show a CSV example, since nearly every dataset can be obtained in CSV for import into the library. However, CSV is highly inefficient in both file size and reading speed, and it also truncates floating-point precision in ways that are problematic when zooming in closely to a dataset.

Castra is a relatively high-performance binary format that works well for the large datasets in the examples, but it is not yet a mature project, and is not available on the main conda channel. Should we invest in making castra be more fully supported? If not, I think we should choose another binary format (HDF5?) to use for our examples.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions