DOC: Document BigQuery to dtype translation for read_gbq
Adds a table documenting the current behavior, including that pandas
0.24.0+ stores TIMESTAMP columns as a time-zone-aware dtype while earlier
versions store them as naive. I could not figure out how to make 0.24.0+
store a naive dtype, nor could I figure out how to make earlier versions
use a time-zone-aware one.
tswast committed Apr 3, 2019
1 parent 7edfc3e commit 7aeacca
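The behavior described in the commit message can be demonstrated with pandas alone. A minimal sketch (editor's illustration, not part of this commit), assuming pandas 0.24.0+: while ``read_gbq`` itself cannot be told to return a naive dtype on these versions, the zone can be stripped after reading.

import pandas as pd

# Editor's sketch: on pandas 0.24.0+, TIMESTAMP columns come back
# time-zone-aware; tz_localize(None) converts them to naive afterwards.
df = pd.DataFrame(
    {"ts": pd.to_datetime(["2019-04-03 12:00:00"]).tz_localize("UTC")}
)
print(df["ts"].dtype)  # datetime64[ns, UTC]  (tz-aware)

df["ts"] = df["ts"].dt.tz_localize(None)
print(df["ts"].dtype)  # datetime64[ns]  (naive)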
Showing 3 changed files with 64 additions and 19 deletions.
6 changes: 6 additions & 0 deletions docs/source/changelog.rst
@@ -8,6 +8,12 @@ Changelog

- This fixes a bug where pandas-gbq could not upload an empty DataFrame. (:issue:`237`)

+Documentation
+~~~~~~~~~~~~~
+
+- Document :ref:`BigQuery data type to pandas dtype conversion
+  <reading-dtypes>` for ``read_gbq``. (:issue:`TBD`)
+
Dependency updates
~~~~~~~~~~~~~~~~~~

74 changes: 55 additions & 19 deletions docs/source/reading.rst
@@ -9,21 +9,32 @@ Suppose you want to load all data from an existing BigQuery table

.. code-block:: python

-   # Insert your BigQuery Project ID Here
-   # Can be found in the Google web console
+   import pandas_gbq
+
+   # TODO: Set your BigQuery Project ID.
    projectid = "xxxxxxxx"

-   data_frame = read_gbq('SELECT * FROM test_dataset.test_table', projectid)
+   data_frame = pandas_gbq.read_gbq(
+       'SELECT * FROM `test_dataset.test_table`',
+       project_id=projectid)
+.. note::
+
+   A project ID is sometimes optional if it can be inferred during
+   authentication, but it is required when authenticating with user
+   credentials. You can find your project ID in the `Google Cloud console
+   <https://console.cloud.google.com>`__.

You can define which column from BigQuery to use as an index in the
destination DataFrame as well as a preferred column order as follows:

.. code-block:: python

-   data_frame = read_gbq('SELECT * FROM test_dataset.test_table',
-                         index_col='index_column_name',
-                         col_order=['col1', 'col2', 'col3'], projectid)
+   data_frame = pandas_gbq.read_gbq(
+       'SELECT * FROM `test_dataset.test_table`',
+       project_id=projectid,
+       index_col='index_column_name',
+       col_order=['col1', 'col2', 'col3'])
You can specify the query config as parameter to use additional options of
@@ -37,20 +37,45 @@ your job. For more information about query configuration parameters see `here
"useQueryCache": False
}
}
data_frame = read_gbq('SELECT * FROM test_dataset.test_table',
configuration=configuration, projectid)
data_frame = read_gbq(
'SELECT * FROM `test_dataset.test_table`',
project_id=projectid,
configuration=configuration)
-.. note::
-
-   You can find your project id in the `Google developers console
-   <https://console.developers.google.com>`__.
+The ``dialect`` argument can be used to indicate whether to use
+BigQuery's ``'legacy'`` SQL or BigQuery's ``'standard'`` SQL (beta). The
+default value is ``'standard'``. For more information on BigQuery's
+standard SQL, see `BigQuery SQL Reference
+<https://cloud.google.com/bigquery/docs/reference/standard-sql/>`__.

-.. note::
-
-   The ``dialect`` argument can be used to indicate whether to use BigQuery's ``'legacy'`` SQL
-   or BigQuery's ``'standard'`` SQL (beta). The default value is ``'legacy'``, though this will change
-   in a subsequent release to ``'standard'``. For more information
-   on BigQuery's standard SQL, see `BigQuery SQL Reference
-   <https://cloud.google.com/bigquery/sql-reference/>`__
+.. code-block:: python
+
+   data_frame = pandas_gbq.read_gbq(
+       'SELECT * FROM [test_dataset.test_table]',
+       project_id=projectid,
+       dialect='legacy')
+.. _reading-dtypes:
+
+Inferring the DataFrame's dtypes
+--------------------------------
+
+The :func:`~pandas_gbq.read_gbq` method infers the pandas dtype for each
+column, based on the BigQuery table schema.
+
+================== =========================================================
+BigQuery Data Type dtype
+================== =========================================================
+FLOAT              float
+------------------ ---------------------------------------------------------
+TIMESTAMP          **pandas versions 0.24.0+**
+                   :class:`~pandas.DatetimeTZDtype` with ``unit='ns'`` and
+                   ``tz='UTC'``
+
+                   **Earlier versions**
+                   datetime64[ns]
+------------------ ---------------------------------------------------------
+DATETIME           datetime64[ns]
+TIME               datetime64[ns]
+DATE               datetime64[ns]
+================== =========================================================
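To see the mapping from this table in practice, inspect ``DataFrame.dtypes`` on the result of ``read_gbq``. A minimal sketch (editor's illustration, reusing the placeholder table and project ID from the docs above; both are hypothetical):

import pandas_gbq

# Placeholder query and project ID, as in the docs above (hypothetical).
df = pandas_gbq.read_gbq(
    'SELECT * FROM `test_dataset.test_table`',
    project_id='xxxxxxxx')

# Per the table: FLOAT -> float64; DATETIME/TIME/DATE -> datetime64[ns];
# TIMESTAMP -> datetime64[ns, UTC] on pandas 0.24.0+, naive earlier.
print(df.dtypes)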
3 changes: 3 additions & 0 deletions pandas_gbq/gbq.py
@@ -650,6 +650,9 @@ def _bqschema_to_nullsafe_dtypes(schema_fields):
    # See:
    # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
    # #missing-data-casting-rules-and-indexing
+    #
+    # If you update this mapping, also update the table at
+    # `docs/source/reading.rst`.
    dtype_map = {
        "FLOAT": np.dtype(float),
        # Even though TIMESTAMPs are timezone-aware in BigQuery, pandas doesn't
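For context, a simplified reconstruction (editor's sketch, not the library's exact code) of how a mapping like ``dtype_map`` can be applied to a BigQuery schema to build per-column dtypes; the ``name``/``type``/``mode`` field keys are assumed from the BigQuery REST schema format:

import numpy as np

# Editor's sketch of a schema-to-dtype mapping (simplified; the real
# function lives in pandas_gbq/gbq.py). TIMESTAMP is omitted because its
# dtype depends on the pandas version, as documented above.
dtype_map = {
    "FLOAT": np.dtype(float),
    "DATETIME": "datetime64[ns]",
    "TIME": "datetime64[ns]",
    "DATE": "datetime64[ns]",
}

def schema_to_dtypes(schema_fields):
    """Return {column_name: dtype} for fields with a null-safe mapping."""
    dtypes = {}
    for field in schema_fields:
        if field.get("mode", "").upper() == "REPEATED":
            continue  # repeated fields have no scalar dtype
        dtype = dtype_map.get(field["type"].upper())
        if dtype is not None:
            dtypes[field["name"]] = dtype
    return dtypes

print(schema_to_dtypes([
    {"name": "score", "type": "FLOAT", "mode": "NULLABLE"},
    {"name": "day", "type": "DATE", "mode": "NULLABLE"},
]))
# -> {'score': dtype('float64'), 'day': 'datetime64[ns]'}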
