ENH: add fsspec support #34266
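
For context, a minimal sketch of the behaviour this pull request enables (not part of the diff shown on this page): paths carrying a protocol prefix are handed to fsspec, so any filesystem with an installed driver (s3fs, gcsfs, ...) can be read from and written to. Bucket and object names are placeholders, and the storage handling noted in the comments follows the commit messages below rather than a documented public signature.

# Illustrative sketch only; assumes fsspec plus s3fs/gcsfs are installed,
# and that the bucket/object names are placeholders.
import pandas as pd

# Reading: the s3:// URL is opened through fsspec rather than a hard-coded
# S3 code path, so other fsspec protocols (gcs://, abfs://, ...) work too.
df = pd.read_csv("s3://example-bucket/input.csv")

# Writing follows the same route; filesystem-specific configuration is
# carried internally as an explicit storage_options dict (see the
# "Make storage_options a dict rather than swallowing kwargs" commit).
df.to_parquet("s3://example-bucket/output.parquet")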

Merged
Changes from 1 commit
39 commits
94e717f
Add remote file io using fsspec.
Apr 14, 2020
fd7e072
Attempt refactor and clean
May 19, 2020
302ba13
Merge branch 'master' into feature/add-fsspec-support
May 20, 2020
9e6d3b2
readd and adapt s3/gcs tests
May 21, 2020
4564c8d
remove gc from test
May 21, 2020
0654537
Simpler is_fsspec
May 21, 2020
8d45cbb
add test
May 21, 2020
006e736
Answered most points
May 28, 2020
724ebd8
Implemented suggestions
May 28, 2020
9da1689
lint
May 28, 2020
a595411
Add versions info
May 29, 2020
6dd1e92
Update some deps
May 29, 2020
6e13df7
issue link syntax
May 29, 2020
3262063
More specific test versions
Jun 2, 2020
4bc2411
Account for alternate S3 protocols, and ignore type error
Jun 2, 2020
68644ab
Add comment to mypy ignore instruction
Jun 2, 2020
32bc586
more mypy
Jun 2, 2020
037ef2c
more black
Jun 2, 2020
c3c3075
Make storage_options a dict rather than swallowing kwargs
Jun 3, 2020
85d6452
More requested changes
Jun 5, 2020
263dd3b
Remove fsspec from locale tests
Jun 10, 2020
d0afbc3
tweak
Jun 10, 2020
6a587a5
Merge branch 'master' into feature/add-fsspec-support
Jun 10, 2020
b2992c1
Merge branch 'master' into feature/add-fsspec-support
Jun 11, 2020
9c03745
requested changes
Jun 11, 2020
7982e7b
add gcsfs to environment.yml
Jun 12, 2020
946297b
rerun deps script
Jun 12, 2020
145306e
Merge branch 'master' into feature/add-fsspec-support
Jun 12, 2020
06e5a3a
account for passed filesystem again
Jun 12, 2020
8f3854c
specify should_close
Jun 12, 2020
50c08c8
lint
Jun 12, 2020
9b20dc6
Except http passed to fsspec in parquet
Jun 12, 2020
eb90fe8
lint
Jun 12, 2020
b3e2cd2
Merge branch 'master' into feature/add-fsspec-support
Jun 16, 2020
4977a00
redo whatsnew
Jun 16, 2020
29a9785
simplify parquet write
Jun 18, 2020
565031b
Retry S3 file probe with timeout, in test_to_s3
Jun 18, 2020
606ce11
expand user in non-fsspec paths for parquet; add test for this
Jun 19, 2020
60b80a6
reorder imports!
Jun 19, 2020
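
Several of the commits above ("Simpler is_fsspec", "Account for alternate S3 protocols, and ignore type error", "Except http passed to fsspec in parquet") concern how pandas decides whether a path should be routed through fsspec. The following is a hedged sketch of that decision using a hypothetical name, not the PR's actual helper:

# Hypothetical helper name; the real check lives in the PR's io code.
def _looks_like_fsspec_url(url) -> bool:
    """Return True when a path should be handed to fsspec.

    Plain local paths and HTTP(S) URLs keep their existing code paths;
    anything else with a protocol prefix (s3://, gcs://, s3a://, ...) is
    treated as an fsspec URL.
    """
    return (
        isinstance(url, str)
        and "://" in url
        and not url.startswith(("http://", "https://"))
    )

assert _looks_like_fsspec_url("s3://bucket/key.parquet")
assert _looks_like_fsspec_url("s3a://bucket/key.parquet")  # alternate S3 protocol
assert not _looks_like_fsspec_url("https://example.com/data.csv")
assert not _looks_like_fsspec_url("/local/path/data.csv")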
Merge branch 'master' into feature/add-fsspec-support
Martin Durant committed May 20, 2020
commit 302ba1307c193aeccd1d1787b5725609fabb9d9c
24 changes: 21 additions & 3 deletions .travis.yml
@@ -14,6 +14,8 @@ cache:

env:
global:
# Variable for test workers
- PYTEST_WORKERS="auto"
# create a github personal access token
# cd pandas-dev/pandas
# travis encrypt 'PANDAS_GH_TOKEN=personal_access_token' -r pandas-dev/pandas
@@ -27,12 +29,21 @@ matrix:
fast_finish: true

include:
# In allowed failures
- dist: bionic
python: 3.9-dev
env:
- JOB="3.9-dev" PATTERN="(not slow and not network and not clipboard)"
- env:
- JOB="3.8" ENV_FILE="ci/deps/travis-38.yaml" PATTERN="(not slow and not network and not clipboard)"

- env:
- JOB="3.7" ENV_FILE="ci/deps/travis-37.yaml" PATTERN="(not slow and not network and not clipboard)"

- arch: arm64
env:
- JOB="3.7, arm64" PYTEST_WORKERS=8 ENV_FILE="ci/deps/travis-37-arm64.yaml" PATTERN="(not slow and not network and not clipboard)"

- env:
- JOB="3.6, locale" ENV_FILE="ci/deps/travis-36-locale.yaml" PATTERN="((not slow and not network and not clipboard) or (single and db))" LOCALE_OVERRIDE="zh_CN.UTF-8" SQL="1"
services:
@@ -53,11 +64,18 @@ matrix:
services:
- mysql
- postgresql
allow_failures:
- arch: arm64
env:
- JOB="3.7, arm64" PYTEST_WORKERS=8 ENV_FILE="ci/deps/travis-37-arm64.yaml" PATTERN="(not slow and not network and not clipboard)"
- dist: bionic
python: 3.9-dev
env:
- JOB="3.9-dev" PATTERN="(not slow and not network)"

before_install:
- echo "before_install"
# set non-blocking IO on travis
# https://github.com/travis-ci/travis-ci/issues/8920#issuecomment-352661024
# Use blocking IO on travis. Ref: https://github.com/travis-ci/travis-ci/issues/8920#issuecomment-352661024
- python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags&~os.O_NONBLOCK);'
- source ci/travis_process_gbq_encryption.sh
- export PATH="$HOME/miniconda3/bin:$PATH"
@@ -83,7 +101,7 @@ install:
script:
- echo "script start"
- echo "$JOB"
- source activate pandas-dev
- if [ "$JOB" != "3.9-dev" ]; then source activate pandas-dev; fi
- ci/run_tests.sh

after_script:
1 change: 1 addition & 0 deletions README.md
@@ -16,6 +16,7 @@
[![Downloads](https://anaconda.org/conda-forge/pandas/badges/downloads.svg)](https://pandas.pydata.org)
[![Gitter](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/pydata/pandas)
[![Powered by NumFOCUS](https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://numfocus.org)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

## What is it?

17 changes: 14 additions & 3 deletions asv_bench/benchmarks/algorithms.py
@@ -34,7 +34,16 @@ class Factorize:
params = [
[True, False],
[True, False],
["int", "uint", "float", "string", "datetime64[ns]", "datetime64[ns, tz]"],
[
"int",
"uint",
"float",
"string",
"datetime64[ns]",
"datetime64[ns, tz]",
"Int64",
"boolean",
],
]
param_names = ["unique", "sort", "dtype"]

@@ -49,13 +58,15 @@ def setup(self, unique, sort, dtype):
"datetime64[ns, tz]": pd.date_range(
"2011-01-01", freq="H", periods=N, tz="Asia/Tokyo"
),
"Int64": pd.array(np.arange(N), dtype="Int64"),
"boolean": pd.array(np.random.randint(0, 2, N), dtype="boolean"),
}[dtype]
if not unique:
data = data.repeat(5)
self.idx = data
self.data = data

def time_factorize(self, unique, sort, dtype):
self.idx.factorize(sort=sort)
pd.factorize(self.data, sort=sort)


class Duplicated:
63 changes: 60 additions & 3 deletions asv_bench/benchmarks/arithmetic.py
@@ -67,7 +67,7 @@ def time_series_op_with_fill_value_no_nas(self):
self.ser.add(self.ser, fill_value=4)


class MixedFrameWithSeriesAxis0:
class MixedFrameWithSeriesAxis:
params = [
[
"eq",
@@ -78,7 +78,7 @@ class MixedFrameWithSeriesAxis0:
"gt",
"add",
"sub",
"div",
"truediv",
"floordiv",
"mul",
"pow",
@@ -87,15 +87,72 @@
param_names = ["opname"]

def setup(self, opname):
arr = np.arange(10 ** 6).reshape(100, -1)
arr = np.arange(10 ** 6).reshape(1000, -1)
df = DataFrame(arr)
df["C"] = 1.0
self.df = df
self.ser = df[0]
self.row = df.iloc[0]

def time_frame_op_with_series_axis0(self, opname):
getattr(self.df, opname)(self.ser, axis=0)

def time_frame_op_with_series_axis1(self, opname):
getattr(operator, opname)(self.df, self.ser)


class FrameWithFrameWide:
# Many-columns, mixed dtypes

params = [
[
# GH#32779 has discussion of which operators are included here
operator.add,
operator.floordiv,
operator.gt,
]
]
param_names = ["op"]

def setup(self, op):
# we choose dtypes so as to make the blocks
# a) not perfectly match between right and left
# b) appreciably bigger than single columns
n_cols = 2000
n_rows = 500

# construct dataframe with 2 blocks
arr1 = np.random.randn(n_rows, int(n_cols / 2)).astype("f8")
arr2 = np.random.randn(n_rows, int(n_cols / 2)).astype("f4")
df = pd.concat(
[pd.DataFrame(arr1), pd.DataFrame(arr2)], axis=1, ignore_index=True,
)
# should already be the case, but just to be sure
df._consolidate_inplace()

# TODO: GH#33198 the setting here shouldn't need two steps
arr1 = np.random.randn(n_rows, int(n_cols / 4)).astype("f8")
arr2 = np.random.randn(n_rows, int(n_cols / 2)).astype("i8")
arr3 = np.random.randn(n_rows, int(n_cols / 4)).astype("f8")
df2 = pd.concat(
[pd.DataFrame(arr1), pd.DataFrame(arr2), pd.DataFrame(arr3)],
axis=1,
ignore_index=True,
)
# should already be the case, but just to be sure
df2._consolidate_inplace()

self.left = df
self.right = df2

def time_op_different_blocks(self, op):
# blocks (and dtypes) are not aligned
op(self.left, self.right)

def time_op_same_blocks(self, op):
# blocks (and dtypes) are aligned
op(self.left, self.left)


class Ops:

2 changes: 1 addition & 1 deletion asv_bench/benchmarks/frame_methods.py
@@ -564,7 +564,7 @@ def setup(self):

def time_frame_get_dtype_counts(self):
with warnings.catch_warnings(record=True):
self.df._data.get_dtype_counts()
self.df.dtypes.value_counts()

def time_info(self):
self.df.info()
92 changes: 92 additions & 0 deletions asv_bench/benchmarks/groupby.py
@@ -626,4 +626,96 @@ def time_first(self):
self.df_nans.groupby("key").transform("first")


class TransformEngine:
def setup(self):
N = 10 ** 3
data = DataFrame(
{0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N},
columns=[0, 1],
)
self.grouper = data.groupby(0)

def time_series_numba(self):
def function(values, index):
return values * 5

self.grouper[1].transform(function, engine="numba")

def time_series_cython(self):
def function(values):
return values * 5

self.grouper[1].transform(function, engine="cython")

def time_dataframe_numba(self):
def function(values, index):
return values * 5

self.grouper.transform(function, engine="numba")

def time_dataframe_cython(self):
def function(values):
return values * 5

self.grouper.transform(function, engine="cython")


class AggEngine:
def setup(self):
N = 10 ** 3
data = DataFrame(
{0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N},
columns=[0, 1],
)
self.grouper = data.groupby(0)

def time_series_numba(self):
def function(values, index):
total = 0
for i, value in enumerate(values):
if i % 2:
total += value + 5
else:
total += value * 2
return total

self.grouper[1].agg(function, engine="numba")

def time_series_cython(self):
def function(values):
total = 0
for i, value in enumerate(values):
if i % 2:
total += value + 5
else:
total += value * 2
return total

self.grouper[1].agg(function, engine="cython")

def time_dataframe_numba(self):
def function(values, index):
total = 0
for i, value in enumerate(values):
if i % 2:
total += value + 5
else:
total += value * 2
return total

self.grouper.agg(function, engine="numba")

def time_dataframe_cython(self):
def function(values):
total = 0
for i, value in enumerate(values):
if i % 2:
total += value + 5
else:
total += value * 2
return total

self.grouper.agg(function, engine="cython")


from .pandas_vb_common import setup # noqa: F401 isort:skip
4 changes: 2 additions & 2 deletions asv_bench/benchmarks/io/parsers.py
@@ -2,7 +2,7 @@

try:
from pandas._libs.tslibs.parsing import (
_concat_date_cols,
concat_date_cols,
_does_string_look_like_datetime,
)
except ImportError:
@@ -39,4 +39,4 @@ def setup(self, value, dim):
)

def time_check_concat(self, value, dim):
_concat_date_cols(self.object)
concat_date_cols(self.object)
25 changes: 12 additions & 13 deletions asv_bench/benchmarks/rolling.py
@@ -150,19 +150,18 @@ def time_quantile(self, constructor, window, dtype, percentile, interpolation):
self.roll.quantile(percentile, interpolation=interpolation)


class PeakMemFixed:
def setup(self):
N = 10
arr = 100 * np.random.random(N)
self.roll = pd.Series(arr).rolling(10)

def peakmem_fixed(self):
# GH 25926
# This is to detect memory leaks in rolling operations.
# To save time this is only ran on one method.
# 6000 iterations is enough for most types of leaks to be detected
for x in range(6000):
self.roll.max()
class PeakMemFixedWindowMinMax:

params = ["min", "max"]

def setup(self, operation):
N = int(1e6)
arr = np.random.random(N)
self.roll = pd.Series(arr).rolling(2)

def peakmem_fixed(self, operation):
for x in range(5):
getattr(self.roll, operation)()


class ForwardWindowMethods:
4 changes: 2 additions & 2 deletions asv_bench/benchmarks/stat_ops.py
@@ -11,8 +11,8 @@ class FrameOps:
param_names = ["op", "dtype", "axis"]

def setup(self, op, dtype, axis):
if op == "mad" and dtype == "Int64" and axis == 1:
# GH-33036
if op == "mad" and dtype == "Int64":
# GH-33036, GH#33600
raise NotImplementedError
values = np.random.randn(100000, 4)
if dtype == "Int64":
3 changes: 3 additions & 0 deletions azure-pipelines.yml
@@ -5,6 +5,9 @@ trigger:
pr:
- master

variables:
PYTEST_WORKERS: auto

jobs:
# Mac and Linux use the same template
- template: ci/azure/posix.yml
21 changes: 21 additions & 0 deletions ci/build39.sh
@@ -0,0 +1,21 @@
#!/bin/bash -e
# Special build for python3.9 until numpy puts its own wheels up

sudo apt-get install build-essential gcc xvfb
pip install --no-deps -U pip wheel setuptools
pip install python-dateutil pytz pytest pytest-xdist hypothesis
pip install cython --pre # https://github.com/cython/cython/issues/3395

git clone https://github.com/numpy/numpy
cd numpy
python setup.py build_ext --inplace
python setup.py install
cd ..
rm -rf numpy

python setup.py build_ext -inplace
python -m pip install --no-build-isolation -e .

python -c "import sys; print(sys.version_info)"
python -c "import pandas as pd"
python -c "import hypothesis"