
feat!: Use pandas custom data types for BigQuery DATE and TIME columns, remove date_as_object argument #972

Merged
merged 87 commits into from Nov 10, 2021
Changes from 60 commits
87 commits
849f2c0
feat: add pandas time arrays to hold BigQuery TIME data
jimfulton Sep 9, 2021
99b43b6
remove commented/aborted base class and simplify super call in __init__
jimfulton Sep 9, 2021
f722570
dtypes tests pass with pandas 0.24.2
jimfulton Sep 9, 2021
724f62f
blacken and lint
jimfulton Sep 9, 2021
9b03449
Added DateArray and generalized tests to handle dates and times
jimfulton Sep 9, 2021
6af561d
blacken
jimfulton Sep 9, 2021
7fc299c
handle bad values passed to array construction
jimfulton Sep 9, 2021
869f5f7
Add null/None handling
jimfulton Sep 9, 2021
5505069
Added repeat and copy tests
jimfulton Sep 11, 2021
15ca7e1
don't use extract_array and make _from_sequence_of_strings and alias …
jimfulton Sep 11, 2021
9392cb4
Summary test copying ndarray arg
jimfulton Sep 11, 2021
04c1e6a
expand construction test to test calling class directly and calling _…
jimfulton Sep 11, 2021
f59169c
blacken
jimfulton Sep 11, 2021
c33f608
test size and shape
jimfulton Sep 11, 2021
41bbde6
Enable properties for pandas 1.3 and later
jimfulton Sep 13, 2021
e9ed1c4
Test more pandas versions.
jimfulton Sep 13, 2021
d6d81fe
Updated import_default with an option to force use of the default
jimfulton Sep 13, 2021
a8697f3
simplified version-checking code
jimfulton Sep 13, 2021
8ea43c5
_from_factorized
jimfulton Sep 13, 2021
2f37929
fix small DRY violation for parametrization by date and time dtypes
jimfulton Sep 14, 2021
ad0e3c0
isna
jimfulton Sep 14, 2021
c1ebb5c
take
jimfulton Sep 14, 2021
eaa2e96
_concat_same_type
jimfulton Sep 14, 2021
82a2d84
fixed __getitem__ to handle array indexes
jimfulton Sep 14, 2021
6f4178f
test __getitem__ w array index and dropna
jimfulton Sep 14, 2021
d8818e0
fix assignment with array keys and fillna test
jimfulton Sep 14, 2021
6bfb75b
reminder (for someone :) ) to override some abstract implementations …
jimfulton Sep 14, 2021
3364bdd
unique test
jimfulton Sep 14, 2021
661c6b2
test argsort
jimfulton Sep 14, 2021
ac7330c
fix version in constraint
jimfulton Sep 14, 2021
9b7a72c
blacken/lint
jimfulton Sep 14, 2021
47b0756
stop fighting the framework and store as ns
jimfulton Sep 14, 2021
48f2e11
blacken/lint
jimfulton Sep 14, 2021
b1025b7
Implement astype to fix Python 3.7 failure
jimfulton Sep 15, 2021
7acdb05
test assigning None
jimfulton Sep 15, 2021
731634e
test astype self type w copy
jimfulton Sep 15, 2021
903e23c
test a concatenation dtype sanity check
jimfulton Sep 15, 2021
e52c65d
Added conversion of date to datetime
jimfulton Sep 15, 2021
74ef1b0
convert times to time deltas
jimfulton Sep 15, 2021
91d9e2b
Use new pandas date and time dtypes
jimfulton Sep 15, 2021
517307c
Get rid of date_as_object argument
jimfulton Sep 15, 2021
711cfaf
fixed a comment
jimfulton Sep 15, 2021
7cdea07
added *unit* test for dealing with dates and timestamps that can't f…
jimfulton Sep 15, 2021
a20f67b
Removed brittle hack that enabled series properties.
jimfulton Sep 16, 2021
96ed76a
Add note on possible zero-copy optimization for dates
jimfulton Sep 16, 2021
159c202
Implemented any, all, min, max and median
jimfulton Sep 16, 2021
a81b26e
make pytype happy
jimfulton Sep 16, 2021
98b3603
test (and fix) load from dataframe with date and time columns
jimfulton Sep 16, 2021
4f71c9c
Make sure insert_rows_from_dataframe works
jimfulton Sep 16, 2021
5c25ba4
Renamed date and time dtypes to bqdate and bqtime
jimfulton Sep 16, 2021
c6dabe2
make fallback date and time dtype names strings to make pytype happy
jimfulton Sep 17, 2021
7021601
date and time arrays implement __arrow_array__
jimfulton Sep 17, 2021
2585dbc
Document new dtypes
jimfulton Sep 17, 2021
77c1c9e
blacken/lint
jimfulton Sep 18, 2021
4261b80
Make conversion of date columns from arrow pandas output to pandas ze…
jimfulton Sep 18, 2021
2671718
Added date math support
jimfulton Sep 18, 2021
36eb58c
fix end tag
jimfulton Sep 18, 2021
a8d0cb0
fixed snippet ref
jimfulton Sep 18, 2021
3eabca3
Support date math with DateOffset scalars
jimfulton Sep 18, 2021
b81b996
use db-dtypes
jimfulton Sep 27, 2021
dad0d36
Include db-dtypes in sample requirements
jimfulton Sep 27, 2021
35df752
Updated docs and snippets for db-dtypes
jimfulton Sep 27, 2021
ba776f6
gaaaa, missed a bqdate
jimfulton Sep 27, 2021
10555e1
Merge branch 'v3' into dtypes-v3
parthea Oct 8, 2021
fc451f5
Merge remote-tracking branch 'upstream/v3' into dtypes-v3
tswast Nov 1, 2021
08d1b70
move db-dtypes samples to db-dtypes docs
tswast Nov 1, 2021
d3b57e0
update to work with db-dtypes 0.2.0+
tswast Nov 1, 2021
0442789
update dtype names in system tests
tswast Nov 2, 2021
c7ff18e
comment with link to arrow data types
tswast Nov 2, 2021
b99fa5f
update db-dtypes version
tswast Nov 2, 2021
8253559
experiment with direct pyarrow to extension array conversion
tswast Nov 3, 2021
c8b5d67
Merge branch 'v3' into dtypes-v3
tswast Nov 5, 2021
4a6ec3e
use types mapper for dbdate
tswast Nov 9, 2021
4d5d229
Merge branch 'dtypes-v3' of github.com:jimfulton/python-bigquery into…
tswast Nov 9, 2021
3953ead
fix constraints
tswast Nov 9, 2021
c4a1f2c
use types mapper where possible
tswast Nov 9, 2021
d7e7c5b
always use types mapper
tswast Nov 9, 2021
eec4103
adjust unit tests to use arrow not avro
tswast Nov 9, 2021
69deb6f
avoid "ValueError: need at least one array to concatenate" with empty…
tswast Nov 9, 2021
732fe86
link to pandas issue
tswast Nov 9, 2021
2cd97a1
remove unnecessary variable
tswast Nov 9, 2021
852d5a8
Update tests/unit/job/test_query_pandas.py
tswast Nov 9, 2021
ea5d254
Update tests/unit/job/test_query_pandas.py
tswast Nov 9, 2021
1d8adb1
add missing db-dtypes requirement
tswast Nov 10, 2021
ef8847d
Merge branch 'dtypes-v3' of github.com:jimfulton/python-bigquery into…
tswast Nov 10, 2021
25a8f75
avoid arrow_schema on older versions of bqstorage
tswast Nov 10, 2021
0d81fa1
Merge remote-tracking branch 'upstream/v3' into dtypes-v3
tswast Nov 10, 2021
91 changes: 91 additions & 0 deletions docs/usage/pandas.rst
@@ -95,3 +95,94 @@ and load it into a new table:
:dedent: 4
:start-after: [START bigquery_load_table_dataframe]
:end-before: [END bigquery_load_table_dataframe]

Pandas date and time arrays used for BigQuery DATE and TIME columns
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When BigQuery DATE [#date]_ and TIME data are loaded into Pandas,
BigQuery-supplied date and time series are used.

Date and time series support comparisons and the basic statistics
`min`, `max`, and `median`.

Date series
-----------

Date series are created when loading BigQuery DATE data [#date]_, but
they can also be created directly from `datetime.date` data or date strings:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_bqdate_create]
:end-before: [END bigquery_bqdate_create]

The data type name for BigQuery-supplied date series is `bqdate`.
Import `google.cloud.bigquery.dtypes` to register this dtype with
pandas.
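Since the snippet file isn't reproduced here, a minimal stand-in sketches the
kind of data involved. It uses stock pandas object dtype rather than the
registered `bqdate` dtype (which the real snippet would pass via
``dtype="bqdate"`` after the import above):

```python
import datetime

import pandas as pd

# Stand-in sketch: plain pandas stores datetime.date values as
# object dtype; the real sample would use the registered bqdate dtype.
dates = pd.Series(
    [datetime.date(2021, 9, 17), datetime.date(2021, 9, 18)]
)
assert dates.iloc[0].year == 2021
```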

You can convert date series to date-time series using `astype("datetime64")`:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_bqdate_as_datetime]
:end-before: [END bigquery_bqdate_as_datetime]
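The same conversion can be sketched with stock pandas (spelling out the
``datetime64[ns]`` unit, which newer pandas versions require):

```python
import pandas as pd

# Cast date strings to a nanosecond-precision datetime series.
dates = pd.Series(["2021-09-17", "2021-09-18"])
datetimes = dates.astype("datetime64[ns]")
assert str(datetimes.dtype) == "datetime64[ns]"
```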

You can subtract date series to get timedelta64 series:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_bqdate_sub]
:end-before: [END bigquery_bqdate_sub]
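A stock-pandas sketch of the subtraction (the `bqdate` series behaves
analogously):

```python
import pandas as pd

ends = pd.Series(["2021-09-25", "2021-09-20"], dtype="datetime64[ns]")
starts = pd.Series(["2021-09-18", "2021-09-17"], dtype="datetime64[ns]")
# Subtracting two date-like series yields timedelta64[ns].
deltas = ends - starts
assert str(deltas.dtype) == "timedelta64[ns]"
assert deltas.iloc[0] == pd.Timedelta(days=7)
```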

You can also add and subtract Pandas date offsets, either as scalars or as arrays:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_bqdate_do]
:end-before: [END bigquery_bqdate_do]
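Date-offset arithmetic can be sketched with stock pandas as well; the point of
`DateOffset` is that it is calendar-aware rather than a fixed duration:

```python
import pandas as pd

dates = pd.Series(["2021-01-31"], dtype="datetime64[ns]")
# Calendar-aware shift: adding one month to Jan 31 clips to the
# end of February rather than adding a fixed number of days.
shifted = dates + pd.DateOffset(months=1)
assert shifted.iloc[0] == pd.Timestamp("2021-02-28")
```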

Time series
-----------

Time series are created when loading BigQuery TIME data, but
they can also be created directly from `datetime.time` data or time strings:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_bqtime_create]
:end-before: [END bigquery_bqtime_create]

The data type name for BigQuery-supplied time series is `bqtime`.

You can convert time series to time-delta series using `astype("timedelta64")`:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_bqtime_as_timedelta]
:end-before: [END bigquery_bqtime_as_timedelta]
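The conversion amounts to re-expressing a wall-clock time as an offset since
midnight, which can be sketched for a single value like this (the helper name
is illustrative, not part of the library):

```python
import datetime

import pandas as pd

def time_to_timedelta(t: datetime.time) -> pd.Timedelta:
    # A wall-clock TIME value re-expressed as an offset since midnight,
    # which is what the timedelta64 conversion above produces per element.
    return pd.Timedelta(
        hours=t.hour, minutes=t.minute,
        seconds=t.second, microseconds=t.microsecond,
    )

assert time_to_timedelta(datetime.time(1, 32, 0)) == pd.Timedelta("01:32:00")
```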

This lets you combine dates and times to create date-time data:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_combine_bqdate_bqtime]
:end-before: [END bigquery_combine_bqdate_bqtime]

But you can also add dates and times directly:

.. literalinclude:: ../samples/snippets/pandas_date_and_time.py
:language: python
:dedent: 4
:start-after: [START bigquery_combine2_bqdate_bqtime]
:end-before: [END bigquery_combine2_bqdate_bqtime]
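The combination works because, in stock pandas terms, adding a timedelta64
series to a datetime64 series yields datetime64 timestamps:

```python
import pandas as pd

dates = pd.Series(["2021-09-17"], dtype="datetime64[ns]")
times = pd.Series(pd.to_timedelta(["01:32:00"]))
# datetime64 + timedelta64 -> datetime64 timestamps.
combined = dates + times
assert combined.iloc[0] == pd.Timestamp("2021-09-17 01:32:00")
```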

.. [#date] Dates before 1678 can't be represented using
   BigQuery-supplied date series and will be converted to
   object series instead.
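The 1678 cutoff in the footnote comes from nanosecond precision:
`datetime64[ns]` can only represent a window of roughly 1677-09-21 through
2262-04-11, which pandas exposes directly:

```python
import pandas as pd

# Bounds of nanosecond-precision timestamps; dates outside this
# window fall back to object series.
assert pd.Timestamp.min.year == 1677
assert pd.Timestamp.max.year == 2262
```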
7 changes: 7 additions & 0 deletions google/cloud/bigquery/_pandas_helpers.py
@@ -25,9 +25,12 @@
import pandas
except ImportError: # pragma: NO COVER
pandas = None
date_dtype_name = time_dtype_name = "" # Use '' rather than None because pytype
else:
import numpy

from db_dtypes import date_dtype_name, time_dtype_name

import pyarrow
import pyarrow.parquet

@@ -85,6 +88,8 @@ def _to_wkb(v):
"FLOAT64": "float64",
"INT64": "Int64",
"INTEGER": "Int64",
"DATE": date_dtype_name,
"TIME": time_dtype_name,
}
_PANDAS_DTYPE_TO_BQ = {
"bool": "BOOLEAN",
@@ -102,6 +107,8 @@ def _to_wkb(v):
"uint16": "INTEGER",
"uint32": "INTEGER",
"geometry": "GEOGRAPHY",
date_dtype_name: "DATE",
time_dtype_name: "TIME",
}


16 changes: 0 additions & 16 deletions google/cloud/bigquery/job/query.py
@@ -1481,7 +1481,6 @@ def to_dataframe(
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
max_results: Optional[int] = None,
geography_as_object: bool = False,
) -> "pandas.DataFrame":
@@ -1524,12 +1523,6 @@

.. versionadded:: 1.24.0

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

.. versionadded:: 1.26.0

max_results (Optional[int]):
Maximum number of rows to include in the result. No limit by default.

@@ -1563,7 +1556,6 @@
dtypes=dtypes,
progress_bar_type=progress_bar_type,
create_bqstorage_client=create_bqstorage_client,
date_as_object=date_as_object,
geography_as_object=geography_as_object,
)

@@ -1576,7 +1568,6 @@ def to_geodataframe(
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
max_results: Optional[int] = None,
geography_column: Optional[str] = None,
) -> "geopandas.GeoDataFrame":
@@ -1619,12 +1610,6 @@

.. versionadded:: 1.24.0

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

.. versionadded:: 1.26.0

max_results (Optional[int]):
Maximum number of rows to include in the result. No limit by default.

@@ -1657,7 +1642,6 @@ def to_geodataframe(
dtypes=dtypes,
progress_bar_type=progress_bar_type,
create_bqstorage_client=create_bqstorage_client,
date_as_object=date_as_object,
geography_column=geography_column,
)

77 changes: 39 additions & 38 deletions google/cloud/bigquery/table.py
@@ -28,6 +28,8 @@
import pandas
except ImportError: # pragma: NO COVER
pandas = None
else:
from db_dtypes import DateArray, date_dtype_name

import pyarrow

@@ -1872,7 +1874,6 @@ def to_dataframe(
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
geography_as_object: bool = False,
) -> "pandas.DataFrame":
"""Create a pandas DataFrame by loading all pages of a query.
@@ -1922,12 +1923,6 @@

.. versionadded:: 1.24.0

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

.. versionadded:: 1.26.0

geography_as_object (Optional[bool]):
If ``True``, convert GEOGRAPHY data to :mod:`shapely`
geometry objects. If ``False`` (default), don't cast
@@ -1973,36 +1968,43 @@
self.schema
)

# When converting date or timestamp values to nanosecond precision, the result
# can be out of pyarrow bounds. To avoid the error when converting to
# Pandas, we set the date_as_object or timestamp_as_object parameter to True,
# if necessary.
date_as_object = not all(
self.__can_cast_timestamp_ns(col)
for col in record_batch
if str(col.type).startswith("date")
)
if date_as_object:
default_dtypes = {
name: type_
for name, type_ in default_dtypes.items()
if type_ != _pandas_helpers.date_dtype_name
}

# Let the user-defined dtypes override the default ones.
# https://stackoverflow.com/a/26853961/101923
dtypes = {**default_dtypes, **dtypes}

# When converting timestamp values to nanosecond precision, the result
# can be out of pyarrow bounds. To avoid the error when converting to
# Pandas, we set the timestamp_as_object parameter to True, if necessary.
types_to_check = {
pyarrow.timestamp("us"),
pyarrow.timestamp("us", tz=datetime.timezone.utc),
}

for column in record_batch:
if column.type in types_to_check:
try:
column.cast("timestamp[ns]")
except pyarrow.lib.ArrowInvalid:
timestamp_as_object = True
break
else:
timestamp_as_object = False

extra_kwargs = {"timestamp_as_object": timestamp_as_object}
timestamp_as_object = not all(
self.__can_cast_timestamp_ns(col)
for col in record_batch
if str(col.type).startswith("timestamp")
)

df = record_batch.to_pandas(
date_as_object=date_as_object, integer_object_nulls=True, **extra_kwargs
date_as_object=date_as_object,
timestamp_as_object=timestamp_as_object,
integer_object_nulls=True,
)

for column in dtypes:
df[column] = pandas.Series(df[column], dtype=dtypes[column])
data = df[column]
if dtypes[column] == date_dtype_name:
data = DateArray(data.to_numpy(copy=False), copy=False)
Contributor
Rather than a special check for the date dtype, I wonder if we can use the types_mapper parameter on to_pandas?

https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch.to_pandas

Contributor
Yes, it should be possible, but we'll need some additional work on the db-dtypes package. googleapis/python-db-dtypes-pandas#38

df[column] = pandas.Series(data, dtype=dtypes[column], copy=False)

if geography_as_object:
for field in self.schema:
@@ -2011,6 +2013,15 @@ def to_dataframe(

return df

@staticmethod
def __can_cast_timestamp_ns(column):
try:
column.cast("timestamp[ns]")
except pyarrow.lib.ArrowInvalid:
return False
else:
return True
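The same "attempt the cast, report failure" pattern as
`__can_cast_timestamp_ns` can be sketched with pandas rather than pyarrow (the
helper name is illustrative):

```python
import pandas as pd

def can_cast_ns(values) -> bool:
    # Try to coerce to nanosecond-precision datetimes; out-of-range
    # values (pandas raises OutOfBoundsDatetime, a ValueError subclass)
    # mean the column must stay as objects instead.
    try:
        pd.Series(values).astype("datetime64[ns]")
    except (ValueError, OverflowError):
        return False
    return True

assert can_cast_ns(["2021-09-17"])
assert not can_cast_ns(["9999-12-31"])
```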

# If changing the signature of this method, make sure to apply the same
# changes to job.QueryJob.to_geodataframe()
def to_geodataframe(
@@ -2019,7 +2030,6 @@
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
geography_column: Optional[str] = None,
) -> "geopandas.GeoDataFrame":
"""Create a GeoPandas GeoDataFrame by loading all pages of a query.
@@ -2067,10 +2077,6 @@

This argument does nothing if ``bqstorage_client`` is supplied.

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

geography_column (Optional[str]):
If there are more than one GEOGRAPHY column,
identifies which one to use to construct a geopandas
@@ -2126,7 +2132,6 @@
dtypes,
progress_bar_type,
create_bqstorage_client,
date_as_object,
geography_as_object=True,
)

@@ -2183,7 +2188,6 @@ def to_dataframe(
dtypes=None,
progress_bar_type=None,
create_bqstorage_client=True,
date_as_object=True,
geography_as_object=False,
) -> "pandas.DataFrame":
"""Create an empty dataframe.
@@ -2193,7 +2197,6 @@
dtypes (Any): Ignored. Added for compatibility with RowIterator.
progress_bar_type (Any): Ignored. Added for compatibility with RowIterator.
create_bqstorage_client (bool): Ignored. Added for compatibility with RowIterator.
date_as_object (bool): Ignored. Added for compatibility with RowIterator.

Returns:
pandas.DataFrame: An empty :class:`~pandas.DataFrame`.
@@ -2208,7 +2211,6 @@ def to_geodataframe(
dtypes=None,
progress_bar_type=None,
create_bqstorage_client=True,
date_as_object=True,
geography_column: Optional[str] = None,
) -> "pandas.DataFrame":
"""Create an empty dataframe.
@@ -2218,7 +2220,6 @@
dtypes (Any): Ignored. Added for compatibility with RowIterator.
progress_bar_type (Any): Ignored. Added for compatibility with RowIterator.
create_bqstorage_client (bool): Ignored. Added for compatibility with RowIterator.
date_as_object (bool): Ignored. Added for compatibility with RowIterator.

Returns:
pandas.DataFrame: An empty :class:`~pandas.DataFrame`.