feat: direct support for feather, parquet, and json
dmyersturnbull committed Mar 3, 2021
1 parent c8ee847 commit bb990ff
Showing 18 changed files with 417 additions and 338 deletions.
15 changes: 9 additions & 6 deletions .github/dependabot.yml
@@ -1,10 +1,13 @@
version: 2
# see options in https://docs.github.com/en/github/administering-a-repository/configuration-options-for-dependency-updates
updates:
- package-ecosystem: "pip"
directory: "/"
- package-ecosystem: pip
directory: /
schedule:
interval: "weekly"
- package-ecosystem: "github-actions"
directory: "/"
interval: weekly
labels: ["kind: infrastructure"]
- package-ecosystem: github-actions
directory: /
schedule:
interval: "weekly"
interval: weekly
labels: ["kind: infrastructure"]
2 changes: 1 addition & 1 deletion .github/workflows/checks.yml
@@ -6,7 +6,7 @@ on:
push:
pull_request:
schedule:
- cron: 0 7 * * 6
- cron: "0 7 * * 6"
jobs:
markdown-link-check:
name: Check Markdown links
6 changes: 2 additions & 4 deletions .github/workflows/publish.yml
@@ -3,11 +3,9 @@
name: Publish on release creation
on:
release:
types:
- published
types: [published]
repository_dispatch:
types:
- release-made
types: [release-made]
jobs:
deploy:
runs-on: ubuntu-20.04
10 changes: 4 additions & 6 deletions .github/workflows/pull-test.yml
@@ -3,27 +3,25 @@
name: Test on pull request
on:
pull_request:
branches:
- main
branches: [main]
jobs:
build:
strategy:
max-parallel: 1
matrix:
os: ["ubuntu-20.04"]
os: ["ubuntu-20.04", "windows-2019", "macos-10.15"]
python-version: ["3.9"]
runs-on: "${{ matrix.os }}"
steps:
- name: Checkout
uses: actions/checkout@v2
- name: 'Set up Python ${{ matrix.python-version }}'
- name: "Set up Python ${{ matrix.python-version }}"
uses: actions/setup-python@v2
with:
python-version: '${{ matrix.python-version }}'
python-version: "${{ matrix.python-version }}"
- name: Install build meta-dependencies
run: |
pip install tox poetry
- name: Test with tox
run: |
tox -v
5 changes: 2 additions & 3 deletions .github/workflows/push-main.yml
@@ -3,8 +3,7 @@
name: Build & test
on:
push:
branches:
- main
branches: [main]
jobs:
build:
strategy:
@@ -30,6 +29,6 @@ jobs:
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
pip install coveralls
pip install 'coveralls>=3,<4'
coveralls --service=github
if: ${{ matrix.os }} == ubuntu-20.04
9 changes: 4 additions & 5 deletions .pre-commit-config.yaml
@@ -1,20 +1,19 @@
minimum_pre_commit_version: 2.9.0
minimum_pre_commit_version: 2.10.0

repos:
- repo: meta
hooks:
- id: check-hooks-apply
- repo: 'https://github.com/psf/black'
rev: 20.8b1
hooks:
- id: black
- repo: 'https://github.com/pre-commit/pre-commit-hooks'
rev: v3.4.0
hooks:
- id: check-symlinks
- id: check-case-conflict
- id: fix-byte-order-marker
- id: end-of-file-fixer
- id: check-merge-conflict
- id: debug-statements
- id: check-builtin-literals
- id: check-toml
- id: check-json
- id: check-yaml
11 changes: 10 additions & 1 deletion CHANGELOG.md
@@ -3,14 +3,23 @@
Adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and
[Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.6.0] - 2021-03-02

### Added
- Read/write wrappers for Feather, Parquet, and JSON

### Fixed
- Slightly better build config

## [0.5.0] - 2021-01-19

### Changed
- Made `tables` an optional dependency; use `typeddfs[hdf5]`
- `natsort` is no longer pinned to version 7; it's now `>=7`.
Added a note in the readme that this just requires some caution.
Added a note in the readme that this just requires some caution.

### Fixed
- Slight improvement to build and metadata

## [0.4.0] - 2020-08-29

28 changes: 23 additions & 5 deletions CONTRIBUTING.md
@@ -1,9 +1,27 @@
# Contributing

[New issues](https://github.com/dmyersturnbull/typed-dfs/issues) and pull requests are welcome.
Typed-Dfs is licensed under the [Apache License, version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
Typed-dfs is licensed under the
[Apache License, version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
New issues and pull requests are welcome.
Feel free to direct a question to the authors by creating an [issue with the _question_ tag](https://github.com/dmyersturnbull/tyrannosaurus/issues/new?assignees=&labels=kind%3A+question&template=question.md).
Contributors are asked to abide by both the [GitHub community guidelines](https://docs.github.com/en/github/site-policy/github-community-guidelines)
and the [Contributor Code of Conduct, version 2.0](https://www.contributor-covenant.org/version/2/0/code_of_conduct/).

##### For pull requests:
If you can, please update `CHANGELOG.md` and add your name to the contributors in `pyproject.toml`.
Then run `tyrannosaurus sync`.
#### Pull requests

Please update `CHANGELOG.md` and add your name to the contributors in `pyproject.toml`
so that you’re credited. Run `poetry lock` and `tyrannosaurus sync` to sync metadata.
Feel free to make a draft pull request and solicit feedback from the authors.

#### Publishing a new version

1. Bump the version in `tool.poetry.version` in `pyproject.toml`, following
[Semantic Versioning](https://semver.org/spec/v2.0.0.html).
2. Run `tyrannosaurus sync` so that the Poetry lock file is up-to-date
and metadata are synced to pyproject.toml.
3. Create a [new release](https://github.com/dmyersturnbull/typed-dfs/releases/new)
with both the name and tag set to something like `v1.4.13` (keep the _v_).
4. An hour later, check that the *publish on release creation*
[workflow](https://github.com/dmyersturnbull/typed-dfs/actions) passes
and that the PyPI, Docker Hub, and GitHub Package versions are updated as shown in the
shields on the readme.
102 changes: 78 additions & 24 deletions README.md
@@ -13,26 +13,33 @@
[![Created with Tyrannosaurus](https://img.shields.io/badge/Created_with-Tyrannosaurus-0000ff.svg)](https://github.com/dmyersturnbull/tyrannosaurus)


Pandas DataFrame subclasses that enforce structure and can self-organize.
Because your functions can’t exactly accept _any_ DataFrame.
Pandas DataFrame subclasses that enforce structure and can self-organize.
*Because your functions can’t exactly accept **any** DataFrame.*
`pip install typeddfs[feather]`

The subclassed DataFrames can have required and/or optional columns and indices,
and support custom requirements.
Columns are automatically turned into indices,
which means **`read_csv` and `to_csv` are always inverses**.
`MyDf.read_csv(mydf.to_csv())` is just `mydf`.
```python
from typeddfs import TypedDfs
MyDfType = (
TypedDfs.typed("MyDfType")
.require("name", index=True) # always keep in index
.require("value", dtype=float) # require a column and type
.drop("_temp") # auto-drop a column
.condition(lambda df: len(df)==12) # require exactly 12 rows
).build()
# All normal Pandas functions work (plus a few more, like sort_natural)
```

The DataFrames will display nicely in Jupyter notebooks,
and a few convenience methods are added, such as `sort_natural` and `drop_cols`.
**[See the docs](https://typed-dfs.readthedocs.io/en/stable/)** for more information.
### 🎁 Features

`pip install typeddfs[hdf5]` to install.
- Columns are turned into indices as needed,
so **`read_csv` and `to_csv` are inverses**.
`MyDf.read_csv(mydf.to_csv())` is `mydf`.
- DataFrames display elegantly in Jupyter notebooks.
- Extra methods such as `sort_natural` and `drop_cols`.

Please note that HDF5 via pytables is
[unsupported in Python 3.9 on Windows](https://github.com/PyTables/PyTables/issues/854)
as of 2021-02-03.
### 🎨 Example

Simple example for a CSV like this:
For a CSV like this:

| key | value | note |
| ----- | ------ | ---- |
@@ -43,11 +50,11 @@ from typeddfs import TypedDfs

# Build me a Key-Value-Note class!
KeyValue = (
TypedDfs.typed("KeyValue") # typed means enforced requirements
.require("key", dtype=str, index=True) # automagically make this an index
.require("value") # required
.reserve("note") # permitted but not required
.strict() # don’t allow other columns
TypedDfs.typed("KeyValue") # With enforced reqs / typing
.require("key", dtype=str, index=True) # automagically add to index
.require("value") # required
.reserve("note") # permitted but not required
.strict() # disallow other columns
).build()

# This will self-organize and use "key" as the index:
@@ -66,11 +73,58 @@ def my_special_function(df: KeyValue) -> float:

All of the normal DataFrame methods are available.
Use `.untyped()` or `.vanilla()` to make a detyped copy that doesn’t enforce requirements.

A small note of caution: [natsort](https://github.com/SethMMorton/natsort) is no longer pinned
to a specific major version as of version 0.5 because it receives somewhat frequent major updates.
**[See the docs 📚](https://typed-dfs.readthedocs.io/en/stable/)** for more information.

### 🔌 Serialization support

Serialization is provided through Pandas, and some formats require additional packages.
Pandas does not specify compatible versions, so typed-dfs provides
[extras](https://python-poetry.org/docs/pyproject/#extras)
to ensure that those packages are installed with compatible versions.
- To install with [Feather](https://arrow.apache.org/docs/python/feather.html) support,
use `pip install typeddfs[feather]`.
- To install with support for all serialization formats,
use `pip install typeddfs[feather] fastparquet tables`.

However, hdf5 and parquet have limited compatibility,
restricted to some platforms and Python versions.
In particular, neither is supported in Python 3.9 on Windows as of 2021-03-02.
(See the [llvmlite issue](https://github.com/numba/llvmlite/issues/669)
and [tables issue](https://github.com/PyTables/PyTables/issues/854).)

Feather offers massively better performance than CSV, gzipped CSV, and HDF5
in read speed, write speed, memory overhead, and compression ratios.
Parquet typically results in smaller file sizes than Feather at some cost in speed.
Feather is the preferred format for most cases.

**⚠ Note:** The `hdf5` and `parquet` extras are currently disabled.

| format | packages | extra | compatibility | performance |
| -------- | -------------------- | --------- | ------------- | ------------ |
| pickle | none | none | ❗ ️ ||
| CSV | none | none || −− |
| CSV.GZ | none | none || −− |
| JSON | none | none | /️ | −− |
| JSON.GZ | none | none | /️ | −− |
| .npy † | none | none | †️ | + |
| .npz † | none | none | †️ | + |
| Feather | `pyarrow` | `feather` || ++++ |
| Parquet | `pyarrow,fastparquet` | `parquet` || +++ |
| HDF5 | `tables` | `hdf5` |||

❗ == Pickle is explicitly not supported due to vulnerabilities and other issues.
/ == Mostly. JSON has inconsistent handling of `None`.
† == .npy and .npz only serialize numpy objects and therefore skip indices.

### 📝 Extra notes

A small note of caution: [natsort](https://github.com/SethMMorton/natsort) is not pinned
to a specific major version because it receives somewhat frequent major updates.
This means that the result of typed-dfs’s `sort_natural` could change.
You can pin natsort to a specific major version; e.g. `natsort = "^7"` with [Poetry](https://python-poetry.org/).
You can pin natsort to a specific major version;
e.g. `natsort = "^7"` with [Poetry](https://python-poetry.org/) or `natsort>=7,<8` with pip.
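
To make the caution above concrete, here is a minimal standard-library sketch of what natural ordering means and how it differs from plain string ordering. It is only an illustration: it is not natsort's algorithm and not how `sort_natural` is implemented, and `natural_key` is a hypothetical helper that does not exist in typed-dfs or natsort.

```python
import re


def natural_key(text: str) -> list:
    """Hypothetical helper: split text into digit and non-digit chunks.

    Digit chunks become integers so that "10" sorts after "2".
    Each chunk is tagged (0 for numbers, 1 for text) so integers and
    strings are never compared with each other directly.
    """
    chunks = [c for c in re.split(r"(\d+)", text) if c]
    return [(0, int(c)) if c.isdigit() else (1, c.lower()) for c in chunks]


names = ["item10", "item2", "item1"]

plain_order = sorted(names)                     # character-by-character comparison
natural_order = sorted(names, key=natural_key)  # numeric chunks compared as numbers

print(plain_order)    # ['item1', 'item10', 'item2']
print(natural_order)  # ['item1', 'item2', 'item10']
```

Edge cases such as signs, decimals, and locale are where natural-sort implementations tend to differ, which is why a major natsort update could plausibly change an ordering that downstream code relies on.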

### 🍁 Contributing

Typed-Dfs is licensed under the [Apache License, version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
[New issues](https://github.com/dmyersturnbull/typed-dfs/issues) and pull requests are welcome.
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -6,7 +6,7 @@ These versions of typed-dfs are supported:

| Version | Supported |
| ------- | ------------------ |
| 0.5.x | :white_check_mark: |
| 0.6.x | :white_check_mark: |


## How to report a vulnerability
2 changes: 1 addition & 1 deletion codemeta.json
@@ -7,7 +7,7 @@
"codeRepository":"https://github.com/dmyersturnbull/typed-dfs",
"issueTracker":"https://github.com/dmyersturnbull/typed-dfs/issues",
"license":"https://www.apache.org/licenses/LICENSE-2.0",
"version":"0.5.0",
"version":"0.6.0",
"author":[
{
"@type":"Person",