feat!: Use pandas custom data types for BigQuery DATE and TIME columns, remove date_as_object argument #972

Merged 87 commits on Nov 10, 2021
87 commits
849f2c0
feat: add pandas time arrays to hold BigQuery TIME data
jimfulton Sep 9, 2021
99b43b6
remove commented/aborted base class and simplify super call in __init__
jimfulton Sep 9, 2021
f722570
dtypes tests pass with pandas 0.24.2
jimfulton Sep 9, 2021
724f62f
blacken and lint
jimfulton Sep 9, 2021
9b03449
Added DateArray and generalized tests to handle dates and times
jimfulton Sep 9, 2021
6af561d
blacken
jimfulton Sep 9, 2021
7fc299c
handle bad values passed to array construction
jimfulton Sep 9, 2021
869f5f7
Add null/None handling
jimfulton Sep 9, 2021
5505069
Added repeat and copy tests
jimfulton Sep 11, 2021
15ca7e1
don't use extract_array and make _from_sequence_of_strings an alias …
jimfulton Sep 11, 2021
9392cb4
Summary test copying ndarray arg
jimfulton Sep 11, 2021
04c1e6a
expand construction test to test calling class directly and calling _…
jimfulton Sep 11, 2021
f59169c
blacken
jimfulton Sep 11, 2021
c33f608
test size and shape
jimfulton Sep 11, 2021
41bbde6
Enable properties for pandas 1.3 and later
jimfulton Sep 13, 2021
e9ed1c4
Test more pandas versions.
jimfulton Sep 13, 2021
d6d81fe
Updated import_default with an option to force use of the default
jimfulton Sep 13, 2021
a8697f3
simplified version-checking code
jimfulton Sep 13, 2021
8ea43c5
_from_factorized
jimfulton Sep 13, 2021
2f37929
fix small DRY violation for parametrization by date and time dtypes
jimfulton Sep 14, 2021
ad0e3c0
isna
jimfulton Sep 14, 2021
c1ebb5c
take
jimfulton Sep 14, 2021
eaa2e96
_concat_same_type
jimfulton Sep 14, 2021
82a2d84
fixed __getitem__ to handle array indexes
jimfulton Sep 14, 2021
6f4178f
test __getitem__ with array index and dropna
jimfulton Sep 14, 2021
d8818e0
fix assignment with array keys and fillna test
jimfulton Sep 14, 2021
6bfb75b
reminder (for someone :) ) to override some abstract implementations …
jimfulton Sep 14, 2021
3364bdd
unique test
jimfulton Sep 14, 2021
661c6b2
test argsort
jimfulton Sep 14, 2021
ac7330c
fix version in constraint
jimfulton Sep 14, 2021
9b7a72c
blacken/lint
jimfulton Sep 14, 2021
47b0756
stop fighting the framework and store as ns
jimfulton Sep 14, 2021
48f2e11
blacken/lint
jimfulton Sep 14, 2021
b1025b7
Implement astype to fix Python 3.7 failure
jimfulton Sep 15, 2021
7acdb05
test assigning None
jimfulton Sep 15, 2021
731634e
test astype self type w copy
jimfulton Sep 15, 2021
903e23c
test a concatenation dtype sanity check
jimfulton Sep 15, 2021
e52c65d
Added conversion of date to datetime
jimfulton Sep 15, 2021
74ef1b0
convert times to time deltas
jimfulton Sep 15, 2021
91d9e2b
Use new pandas date and time dtypes
jimfulton Sep 15, 2021
517307c
Get rid of date_as_object argument
jimfulton Sep 15, 2021
711cfaf
fixed a comment
jimfulton Sep 15, 2021
7cdea07
added *unit* test for dealing with dates and timestamps that can't f…
jimfulton Sep 15, 2021
a20f67b
Removed brittle hack that enabled series properties.
jimfulton Sep 16, 2021
96ed76a
Add note on possible zero-copy optimization for dates
jimfulton Sep 16, 2021
159c202
Implemented any, all, min, max and median
jimfulton Sep 16, 2021
a81b26e
make pytype happy
jimfulton Sep 16, 2021
98b3603
test (and fix) load from dataframe with date and time columns
jimfulton Sep 16, 2021
4f71c9c
Make sure insert_rows_from_dataframe works
jimfulton Sep 16, 2021
5c25ba4
Renamed date and time dtypes to bqdate and bqtime
jimfulton Sep 16, 2021
c6dabe2
make fallback date and time dtype names strings to make pytype happy
jimfulton Sep 17, 2021
7021601
date and time arrays implement __arrow_array__
jimfulton Sep 17, 2021
2585dbc
Document new dtypes
jimfulton Sep 17, 2021
77c1c9e
blacken/lint
jimfulton Sep 18, 2021
4261b80
Make conversion of date columns from arrow pandas output to pandas ze…
jimfulton Sep 18, 2021
2671718
Added date math support
jimfulton Sep 18, 2021
36eb58c
fix end tag
jimfulton Sep 18, 2021
a8d0cb0
fixed snippet ref
jimfulton Sep 18, 2021
3eabca3
Support date math with DateOffset scalars
jimfulton Sep 18, 2021
b81b996
use db-dtypes
jimfulton Sep 27, 2021
dad0d36
Include db-dtypes in sample requirements
jimfulton Sep 27, 2021
35df752
Updated docs and snippets for db-dtypes
jimfulton Sep 27, 2021
ba776f6
gaaaa, missed a bqdate
jimfulton Sep 27, 2021
10555e1
Merge branch 'v3' into dtypes-v3
parthea Oct 8, 2021
fc451f5
Merge remote-tracking branch 'upstream/v3' into dtypes-v3
tswast Nov 1, 2021
08d1b70
move db-dtypes samples to db-dtypes docs
tswast Nov 1, 2021
d3b57e0
update to work with db-dtypes 0.2.0+
tswast Nov 1, 2021
0442789
update dtype names in system tests
tswast Nov 2, 2021
c7ff18e
comment with link to arrow data types
tswast Nov 2, 2021
b99fa5f
update db-dtypes version
tswast Nov 2, 2021
8253559
experiment with direct pyarrow to extension array conversion
tswast Nov 3, 2021
c8b5d67
Merge branch 'v3' into dtypes-v3
tswast Nov 5, 2021
4a6ec3e
use types mapper for dbdate
tswast Nov 9, 2021
4d5d229
Merge branch 'dtypes-v3' of github.com:jimfulton/python-bigquery into…
tswast Nov 9, 2021
3953ead
fix constraints
tswast Nov 9, 2021
c4a1f2c
use types mapper where possible
tswast Nov 9, 2021
d7e7c5b
always use types mapper
tswast Nov 9, 2021
eec4103
adjust unit tests to use arrow not avro
tswast Nov 9, 2021
69deb6f
avoid "ValueError: need at least one array to concatenate" with empty…
tswast Nov 9, 2021
732fe86
link to pandas issue
tswast Nov 9, 2021
2cd97a1
remove unnecessary variable
tswast Nov 9, 2021
852d5a8
Update tests/unit/job/test_query_pandas.py
tswast Nov 9, 2021
ea5d254
Update tests/unit/job/test_query_pandas.py
tswast Nov 9, 2021
1d8adb1
add missing db-dtypes requirement
tswast Nov 10, 2021
ef8847d
Merge branch 'dtypes-v3' of github.com:jimfulton/python-bigquery into…
tswast Nov 10, 2021
25a8f75
avoid arrow_schema on older versions of bqstorage
tswast Nov 10, 2021
0d81fa1
Merge remote-tracking branch 'upstream/v3' into dtypes-v3
tswast Nov 10, 2021
take
jimfulton committed Sep 14, 2021
commit c1ebb5cb3fc1db5d4b6466b0923e5c513ec9ab7b
37 changes: 36 additions & 1 deletion google/cloud/bigquery/dtypes.py
@@ -14,10 +14,12 @@

import datetime
import operator
from typing import Any, Sequence

import numpy
import packaging.version
from pandas._libs import NaT
import pandas.core.algorithms
import pandas.core.dtypes.base
import pandas.core.dtypes.dtypes
import pandas.core.dtypes.generic
@@ -114,7 +116,6 @@ def copy(self):
    def repeat(self, n):
        return self.__class__(self._ndarray.repeat(n), self._dtype)


#
###########################################################################

@@ -180,6 +181,40 @@ def _from_factorized(self, unique, original):
    def isna(self):
        return pandas.isna(self._ndarray)

    def _validate_scalar(self, value):
        if pandas.isna(value):
            return None

        if not isinstance(value, self.dtype.type):
            raise ValueError(value)

        return value

    def take(
        self,
        indices: Sequence[int],
        *,
        allow_fill: bool = False,
        fill_value: Any = None,
    ):
        indices = numpy.asarray(indices, dtype=numpy.intp)
        data = self._ndarray
        if allow_fill:
            fill_value = self._validate_scalar(fill_value)
            fill_value = (
                numpy.datetime64() if fill_value is None
                else numpy.datetime64(self._datetime(fill_value))
            )
            if (indices < -1).any():
                raise ValueError(
                    "take called with negative indexes other than -1,"
                    " when a fill value is provided.")
        out = data.take(indices)
        if allow_fill:
            out[indices == -1] = fill_value

        return self.__class__(out)


@pandas.core.dtypes.dtypes.register_extension_dtype
class TimeDtype(_BaseDtype):
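
The fill handling that this commit adds to `take` can be tried outside the extension-array class. The following is a minimal sketch, assuming the backing store is a plain `datetime64[ns]` NumPy array. `take_with_fill` is a hypothetical helper name, not part of the PR, and it skips the `_validate_scalar` type check.

```python
import datetime

import numpy as np


def take_with_fill(data, indices, allow_fill=False, fill_value=None):
    """Hypothetical standalone mirror of the take() logic in this commit."""
    indices = np.asarray(indices, dtype=np.intp)
    if allow_fill:
        # None maps to NaT; anything else is converted to datetime64.
        fill = (
            np.datetime64("NaT") if fill_value is None
            else np.datetime64(fill_value)
        )
        # With fill enabled, -1 is the only negative index allowed.
        if (indices < -1).any():
            raise ValueError(
                "take called with negative indexes other than -1,"
                " when a fill value is provided.")
    # ndarray.take returns a new array, so `data` is never mutated.
    out = data.take(indices)
    if allow_fill:
        out[indices == -1] = fill
    return out


days = np.array(
    ["2021-09-14", "2021-09-15", "2021-09-16"], dtype="datetime64[ns]"
)
wrapped = take_with_fill(days, [1, -1])  # -1 means "last element"
missing = take_with_fill(days, [1, -1], allow_fill=True)  # -1 becomes NaT
filled = take_with_fill(
    days, [1, -1], allow_fill=True, fill_value=datetime.date(1971, 4, 2)
)
```

The same index `-1` therefore has two meanings: without `allow_fill` it wraps around to the last element, and with `allow_fill` it marks a slot to be filled.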
45 changes: 45 additions & 0 deletions tests/unit/test_dtypes.py
@@ -307,3 +307,48 @@ def test__from_factorized(dtype):
def test_isna(dtype):
    a = _make_one(dtype)
    assert list(a.isna()) == [False, False, True]


@for_date_and_time
def test__validate_scalar_invalid(dtype):
    with pytest.raises(ValueError):
        _make_one(dtype)._validate_scalar("bad")


@for_date_and_time
@pytest.mark.parametrize(
    "allow_fill, fill_value",
    [
        (False, None),
        (True, None),
        (True, pd._libs.NaT if pd else None),
        (True, np.NaN if pd else None), (True, 42)])
def test_take(dtype, allow_fill, fill_value):
    sample_values = SAMPLE_VALUES[dtype]
    a = _cls(dtype)(sample_values)
    if allow_fill:
        if fill_value == 42:
            fill_value = expected_fill = (
                datetime.date(1971, 4, 2) if dtype == 'date'
                else datetime.time(0, 42, 42, 424242))
        else:
            expected_fill = None
        b = a.take([1, -1, 3], allow_fill=True, fill_value=fill_value)
        expect = [sample_values[1], expected_fill, sample_values[3]]
    else:
        b = a.take([1, -4, 3])
        expect = [sample_values[1], sample_values[-4], sample_values[3]]

    assert list(b) == expect


@for_date_and_time
def test_take_bad_index(dtype):
    # When allow_fill is set, negative indexes < -1 raise ValueError.
    # This is based on testing with an integer series/array.
    # The documentation isn't clear on this at all.
    sample_values = SAMPLE_VALUES[dtype]
    a = _cls(dtype)(sample_values)
    with pytest.raises(ValueError):
        a.take([1, -2, 3], allow_fill=True, fill_value=None)
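
The comment in `test_take_bad_index` says the rule was inferred by testing an integer array. A short way to reproduce that check is the public `pandas.api.extensions.take` helper. This is a sketch under the assumption that the helper follows the same convention as `ExtensionArray.take`; the variable names are illustrative only.

```python
import numpy as np
from pandas.api.extensions import take

values = np.array([10, 20, 30])

# Without allow_fill, negative indices count from the end (NumPy semantics).
plain = take(values, np.array([0, -1]))

# With allow_fill=True, -1 marks a missing slot that gets the fill value
# (NaN by default here, which forces a float result).
filled = take(values, np.array([0, -1]), allow_fill=True)

# Any other negative index is rejected once allow_fill is set.
try:
    take(values, np.array([0, -2]), allow_fill=True)
    rejected = False
except ValueError:
    rejected = True
```

If the assumption holds, `rejected` ends up `True`, which matches the `ValueError` the test expects from the new date and time arrays.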