Commit c21c983

Focus on Arrow in documentation.
1 parent f987c6d commit c21c983

3 files changed

+79
-43
lines changed

doc/src/api_manual/dataframe.rst

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,12 @@
1-
.. _oracledataframeobj:
1+
.. _oracledataframe:
22

33
****************
44
API: Data Frames
55
****************
66

7-
Python-oracledb can fetch directly to the `Python DataFrame Interchange
8-
Protocol <https://data-apis.org/dataframe-protocol/latest/index.html>`__
9-
format.
7+
Python-oracledb can fetch directly to data frames that expose an Apache Arrow
8+
PyCapsule Interface. These can be used by many numerical and data analysis
9+
libraries.
1010

1111
See :ref:`dataframeformat` for more information, including the type mapping
1212
from Oracle Database types to Arrow data types.
@@ -16,12 +16,18 @@ from Oracle Database types to Arrow data types.
1616
The data frame support in python-oracledb 3.0.0 is a pre-release and may
1717
change in the next version.
1818

19+
.. _oracledataframeobj:
20+
1921
OracleDataFrame Objects
2022
=======================
2123

2224
OracleDataFrame objects are returned from the methods
2325
:meth:`Connection.fetch_df_all()` and :meth:`Connection.fetch_df_batches()`.
2426

27+
Each column in an OracleDataFrame exposes an `Apache Arrow PyCapsule
28+
<https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html>`__
29+
interface, giving access to the underlying Arrow array.
30+
2531
The OracleDataFrame object is an extension to the DB API.
2632

2733
.. versionadded:: 3.0.0

doc/src/release_notes.rst

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -99,9 +99,10 @@ Common Changes
9999
#) Added new methods :meth:`Connection.fetch_df_all()`,
100100
:meth:`Connection.fetch_df_batches()`,
101101
:meth:`AsyncConnection.fetch_df_all()`, and
102-
:meth:`AsyncConnection.fetch_df_batches()` to fetch data as DataFrames
103-
compliant with the Python DataFrame Interchange protocol. See
104-
:ref:`dataframeformat`.
102+
:meth:`AsyncConnection.fetch_df_batches()` to fetch data as
103+
:ref:`OracleDataFrame objects <oracledataframeobj>` that expose an Apache
104+
Arrow PyCapsule interface for efficient data exchange with external
105+
libraries. See :ref:`dataframeformat`.
105106
#) Added support for Oracle Database 23.7
106107
:ref:`SPARSE vectors <sparsevectors>`.
107108
#) Added support for :ref:`naming and caching connection pools

doc/src/user_guide/sql_execution.rst

Lines changed: 65 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -738,18 +738,17 @@ unnecessarily, and avoid objects with large numbers of attributes.
738738

739739
.. _dataframeformat:
740740

741-
Fetching using the DataFrame Interchange Protocol
742-
-------------------------------------------------
743-
744-
Python-oracledb can fetch directly to the `Python DataFrame Interchange
745-
Protocol <https://data-apis.org/dataframe-protocol/latest/index.html>`__
746-
format. This can reduce application memory requirements and allow zero-copy
747-
data interchanges between Python data frame libraries. It is an efficient way
748-
to work with data using Python libraries such as `Apache Arrow
749-
<https://arrow.apache.org/>`__, `Pandas <https://pandas.pydata.org>`__, `Polars
750-
<https://pola.rs/>`__, `NumPy <https://numpy.org/>`__, `PyTorch
751-
<https://pytorch.org/>`__, or to write files in `Apache Parquet
752-
<https://parquet.apache.org/>`__ format.
741+
Fetching Data Frames
742+
--------------------
743+
744+
Python-oracledb can fetch directly to data frames that expose an Apache Arrow
745+
PyCapsule Interface. This can reduce application memory requirements and allow
746+
zero-copy data interchanges between Python data frame libraries. It is an
747+
efficient way to work with data using Python libraries such as `Apache PyArrow
748+
<https://arrow.apache.org/docs/python/index.html>`__, `Pandas
749+
<https://pandas.pydata.org>`__, `Polars <https://pola.rs/>`__, `NumPy
750+
<https://numpy.org/>`__, `PyTorch <https://pytorch.org/>`__, or to write files
751+
in `Apache Parquet <https://parquet.apache.org/>`__ format.
753752

754753
.. note::
755754

@@ -759,9 +758,7 @@ to work with data using Python libraries such as `Apache Arrow
759758
The method :meth:`Connection.fetch_df_all()` fetches all rows from a query.
760759
The method :meth:`Connection.fetch_df_batches()` implements an iterator for
761760
fetching batches of rows. The methods return :ref:`OracleDataFrame
762-
<oracledataframeobj>` objects, whose :ref:`methods <oracledataframemeth>`
763-
implement the Python DataFrame Interchange Protocol `DataFrame API Interface
764-
<https://data-apis.org/dataframe-protocol/latest/API.html>`__.
761+
<oracledataframeobj>` objects.
765762

766763
For example, to fetch all rows from a query and print some information about
767764
the results:
@@ -782,13 +779,36 @@ With Oracle Database's standard DEPARTMENTS table, this would display::
782779
4 columns
783780
27 rows
784781

785-
To do more extensive operations on an :ref:`OracleDataFrame
786-
<oracledataframeobj>`, it can be converted to an appropriate library class, and
787-
then methods of that library can be used. For example it could be converted to
788-
a `Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.
789-
DataFrame.html#pandas.DataFrame>`__, or to a `PyArrow table
790-
<https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__ as shown
791-
later.
782+
**Summary of Converting OracleDataFrame to Other Data Frames**
783+
784+
To do more extensive operations, :ref:`OracleDataFrames <oracledataframeobj>`
785+
can be converted to your chosen library data frame, and then methods of that
786+
library can be used. This section has an overview of how best to do
787+
conversions. Some examples are shown in subsequent sections.
788+
789+
To convert :ref:`OracleDataFrame <oracledataframeobj>` to a `PyArrow Table
790+
<https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__, use
791+
`pyarrow.Table.from_arrays()
792+
<https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_arrays>`__
793+
which leverages the Arrow PyCapsule interface.
794+
795+
To convert :ref:`OracleDataFrame <oracledataframeobj>` to a `Pandas DataFrame
796+
<https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame>`__,
797+
use `pyarrow.Table.to_pandas()
798+
<https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas>`__.
799+
800+
If you want to use a data frame library other than Pandas or PyArrow, use the
801+
library's ``from_arrow()`` method to convert a PyArrow Table to the applicable
802+
data frame, if your library supports this. For example, with `Polars
803+
<https://pola.rs/>`__ use `polars.from_arrow()
804+
<https://docs.pola.rs/api/python/dev/reference/api/polars.from_arrow.html>`__.
805+
806+
Lastly, if your data frame library does not support ``from_arrow()``, then use
807+
``from_dataframe()`` if the library supports it. This can be slower, depending
808+
on the implementation.
809+
810+
The general recommendation is to use Apache Arrow wherever possible, and to
811+
fall back to ``from_dataframe()`` only when no Arrow-based option exists.
792812

793813
**Data Frame Type Mapping**
794814

@@ -797,8 +817,8 @@ support makes use of `Apache nanoarrow <https://arrow.apache.org/nanoarrow/>`__
797817
libraries to build data frames.
798818

799819
The following data type mapping occurs from Oracle Database types to the Arrow
800-
types used in OracleDataFrame objects. Querying any other types from Oracle
801-
Database will result in an exception.
820+
types used in OracleDataFrame objects. Querying any other data types from
821+
Oracle Database will result in an exception.
802822

803823
.. list-table-with-summary::
804824
:header-rows: 1
@@ -830,7 +850,6 @@ Database will result in an exception.
830850
* - DB_TYPE_TIMESTAMP_TZ
831851
- TIMESTAMP
832852

833-
834853
When converting Oracle Database NUMBERs, if :attr:`defaults.fetch_decimals` is
835854
*True*, the Arrow data type is DECIMAL128. Note Arrow's DECIMAL128 format only
836855
supports precision of up to 38 decimal digits. Else, if the Oracle number data
@@ -895,6 +914,11 @@ An example that creates and uses a `PyArrow Table
895914
This makes use of :meth:`OracleDataFrame.column_arrays()` which returns a list
896915
of :ref:`OracleArrowArray Objects <oraclearrowarrayobj>`.
897916

917+
Internally `pyarrow.Table.from_arrays() <https://arrow.apache.org/docs/python/
918+
generated/pyarrow.Table.html#pyarrow.Table.from_arrays>`__ leverages the Apache
919+
Arrow PyCapsule interface that :ref:`OracleDataFrame <oracledataframeobj>`
920+
exposes.
921+
898922
See `samples/dataframe_pyarrow.py <https://github.com/oracle/python-oracledb/
899923
blob/main/samples/dataframe_pyarrow.py>`__ for a runnable example.
900924

@@ -924,17 +948,19 @@ org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame>`__ is:
924948
print(df.T) # transpose
925949
print(df.tail(3)) # last three rows
926950
927-
Using python-oracledb to fetch the interchange format will be more efficient
928-
than using the Pandas ``read_sql()`` method.
951+
The `to_pandas() <https://arrow.apache.org/docs/python/generated/pyarrow.Table.
952+
html#pyarrow.Table.to_pandas>`__ method supports arguments like
953+
``types_mapper=pandas.ArrowDtype`` and ``deduplicate_objects=False``, which may
954+
be useful for some data sets.
929955

930956
See `samples/dataframe_pandas.py <https://github.com/oracle/python-oracledb/
931957
blob/main/samples/dataframe_pandas.py>`__ for a runnable example.
932958

933-
Creating Polars Series
934-
++++++++++++++++++++++
959+
Creating Polars DataFrames
960+
++++++++++++++++++++++++++
935961

936-
An example that creates and uses a `Polars Series
937-
<https://docs.pola.rs/api/python/stable/reference/series/index.html>`__ is:
962+
An example that creates and uses a `Polars DataFrame
963+
<https://docs.pola.rs/api/python/stable/reference/dataframe/index.html>`__ is:
938964

939965
.. code-block:: python
940966
@@ -946,13 +972,16 @@ An example that creates and uses a `Polars Series
946972
sql = "select id from SampleQueryTab order by id"
947973
odf = connection.fetch_df_all(statement=sql, arraysize=100)
948974
949-
# Convert to a Polars Series
950-
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
951-
p = polars.from_arrow(pyarrow_array)
975+
# Convert to a Polars DataFrame
976+
pyarrow_table = pyarrow.Table.from_arrays(
977+
odf.column_arrays(), names=odf.column_names()
978+
)
979+
df = polars.from_arrow(pyarrow_table)
952980
953-
# Perform various Polars operations on the Series
981+
# Perform various Polars operations on the DataFrame
982+
r, c = df.shape
983+
print(f"{r} rows, {c} columns")
954984
print(p.sum())
955-
print(p.log10())
956985
957986
See `samples/dataframe_polars.py <https://github.com/oracle/python-oracledb/
958987
blob/main/samples/dataframe_polars.py>`__ for a runnable example.
