Commit 0cbc1b8

DOCSP-37414 Cleanup comparison page (#1)
1 parent 4f52672 commit 0cbc1b8

File tree

6 files changed: +126 additions, -245 deletions


README.rst

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ==================================
-MongoDB PymongoArrow Documentation
+MongoDB PyMongoArrow Documentation
 ==================================
 
 This repository contains documentation for PyMongoArrow, an extension to the

source/comparison.txt

Lines changed: 125 additions & 106 deletions
@@ -4,11 +4,22 @@
 Comparing to PyMongo
 ====================
 
-This tutorial is intended as a comparison between using **PyMongoArrow**,
-versus just PyMongo. The reader is assumed to be familiar with basic
-`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`_ and
-`MongoDB <https://docs.mongodb.com>`_ concepts.
-
+.. contents:: On this page
+   :local:
+   :backlinks: none
+   :depth: 1
+   :class: singlecol
+
+.. facet::
+   :name: genre
+   :values: reference
+
+.. meta::
+   :keywords: PyMongo, equivalence
+
+In this guide, you can learn about the differences between {+driver-short+} and the
+PyMongo driver. This guide assumes familiarity with basic :driver:`PyMongo
+</pymongo>` and `MongoDB <https://docs.mongodb.com>`__ concepts.
 
 Reading Data
 ------------
@@ -17,93 +28,98 @@ The most basic way to read data using PyMongo is:
 
 .. code-block:: python
 
-  coll = db.benchmark
-  f = list(coll.find({}, projection={"_id": 0}))
-  table = pyarrow.Table.from_pylist(f)
+   coll = db.benchmark
+   f = list(coll.find({}, projection={"_id": 0}))
+   table = pyarrow.Table.from_pylist(f)
 
-This works, but we have to exclude the "_id" field because otherwise we get this error::
+This works, but you have to exclude the ``_id`` field, otherwise you get the following error:
 
-  pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type
+.. code-block:: python
 
-The workaround gets ugly (especially if you're using more than ObjectIds):
+   pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type
 
-.. code-block:: pycon
+The following code example shows a workaround for the preceding error when
+using PyMongo:
 
-  >>> f = list(coll.find({}))
-  >>> for doc in f:
-  ...     doc["_id"] = str(doc["_id"])
-  ...
-  >>> table = pyarrow.Table.from_pylist(f)
-  >>> print(table)
-  pyarrow.Table
-  _id: string
-  x: int64
-  y: double
+.. code-block:: python
 
-Even though this avoids the error, an unfortunate drawback is that Arrow cannot identify that it is an ObjectId,
-as noted by the schema showing "_id" is a string.
-The primary benefit that PyMongoArrow gives is support for BSON types through Arrow/Pandas Extension Types. This allows you to avoid the ugly workaround:
+   >>> f = list(coll.find({}))
+   >>> for doc in f:
+   ...     doc["_id"] = str(doc["_id"])
+   ...
+   >>> table = pyarrow.Table.from_pylist(f)
+   >>> print(table)
+   pyarrow.Table
+   _id: string
+   x: int64
+   y: double
 
-.. code-block:: pycon
+Even though this avoids the error, a drawback is that Arrow can't identify that ``_id`` is an ObjectId,
+as noted by the schema showing ``_id`` as a string.
 
-  >>> from pymongoarrow.types import ObjectIdType
-  >>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
-  >>> table = find_arrow_all(coll, {}, schema=schema)
-  >>> print(table)
-  pyarrow.Table
-  _id: extension<arrow.py_extension_type<ObjectIdType>>
-  x: int64
-  y: double
+{+driver-short+} supports BSON types
+through Arrow or Pandas Extension Types. This allows you to avoid the preceding
+workaround.
+
+.. code-block:: python
 
-And it also lets Arrow correctly identify the type! This is limited in utility for non-numeric extension types,
-but if you wanted to for example, sort datetimes, it avoids unnecessary casting:
+   >>> from pymongoarrow.types import ObjectIdType
+   >>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
+   >>> table = find_arrow_all(coll, {}, schema=schema)
+   >>> print(table)
+   pyarrow.Table
+   _id: extension<arrow.py_extension_type<ObjectIdType>>
+   x: int64
+   y: double
+
+With this method, Arrow correctly identifies the type. This has limited
+use for non-numeric extension types, but avoids unnecessary casting for certain
+operations, such as sorting datetimes.
 
 .. code-block:: python
 
-  f = list(coll.find({}, projection={"_id": 0, "x": 0}))
-  naive_table = pyarrow.Table.from_pylist(f)
+   f = list(coll.find({}, projection={"_id": 0, "x": 0}))
+   naive_table = pyarrow.Table.from_pylist(f)
 
-  schema = Schema({"time": pyarrow.timestamp("ms")})
-  table = find_arrow_all(coll, {}, schema=schema)
+   schema = Schema({"time": pyarrow.timestamp("ms")})
+   table = find_arrow_all(coll, {}, schema=schema)
 
-  assert (
-      table.sort_by([("time", "ascending")])["time"]
-      == naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
-  )
+   assert (
+       table.sort_by([("time", "ascending")])["time"]
+       == naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
+   )
 
-Additionally, PyMongoArrow supports Pandas extension types.
-With PyMongo, a Decimal128 value behaves as follows:
+Additionally, {+driver-short+} supports Pandas extension types.
+With PyMongo, a ``Decimal128`` value behaves as follows:
 
 .. code-block:: python
 
-  coll = client.test.test
-  coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
-  cursor = coll.find({})
-  df = pd.DataFrame(list(cursor))
-  print(df.dtypes)
-  # _id      object
-  # value    object
+   coll = client.test.test
+   coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
+   cursor = coll.find({})
+   df = pd.DataFrame(list(cursor))
+   print(df.dtypes)
+   # _id      object
+   # value    object
 
-The equivalent in PyMongoArrow would be:
+The equivalent in {+driver-short+} is:
 
 .. code-block:: python
 
-  from pymongoarrow.api import find_pandas_all
-
-  coll = client.test.test
-  coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
-  df = find_pandas_all(coll, {})
-  print(df.dtypes)
-  # _id      bson_PandasObjectId
-  # value    bson_PandasDecimal128
+   from pymongoarrow.api import find_pandas_all
+   coll = client.test.test
+   coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
+   df = find_pandas_all(coll, {})
+   print(df.dtypes)
+   # _id      bson_PandasObjectId
+   # value    bson_PandasDecimal128
 
-In both cases the underlying values are the bson class type:
+In both cases, the underlying values are the BSON class type:
 
 .. code-block:: python
 
-  print(df["value"][0])
-  Decimal128("0")
-
+   print(df["value"][0])
+   Decimal128("0")
 
 Writing Data
 ------------
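An aside on the ``_id`` workaround shown in the hunk above: it can be exercised without PyMongo or a running MongoDB server. The sketch below is illustrative only — ``FakeObjectId`` is a hypothetical stand-in for ``bson.ObjectId``, used to show why stringifying ``_id`` before table construction sidesteps Arrow's type-inference failure.

```python
# Sketch of the "stringify _id before building a table" workaround.
# FakeObjectId stands in for bson.ObjectId so the example runs without
# pymongo, pyarrow, or a MongoDB server.

class FakeObjectId:
    """Minimal stand-in for bson.ObjectId (hypothetical)."""

    def __init__(self, hex_id: str):
        self._hex = hex_id

    def __str__(self) -> str:
        return self._hex


def stringify_ids(docs):
    """Replace each ObjectId-like _id with its string form, in place."""
    for doc in docs:
        doc["_id"] = str(doc["_id"])
    return docs


docs = [
    {"_id": FakeObjectId("642f2f4720d92a85355671b3"), "x": 1, "y": 2.0},
    {"_id": FakeObjectId("642f2f4720d92a85355671b4"), "x": 3, "y": 4.0},
]
docs = stringify_ids(docs)
# Every _id is now a plain string, which a type-inferring table builder
# (such as pyarrow.Table.from_pylist) can accept without error.
assert all(isinstance(d["_id"], str) for d in docs)
```

As the diff notes, this loses the ObjectId type information — the schema-based ``find_arrow_all`` approach preserves it.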
@@ -112,53 +128,56 @@ Writing data from an Arrow table using PyMongo looks like the following:
 
 .. code-block:: python
 
-  data = arrow_table.to_pylist()
-  db.collname.insert_many(data)
+   data = arrow_table.to_pylist()
+   db.collname.insert_many(data)
 
-The equivalent in PyMongoArrow is:
+The equivalent in {+driver-short+} is:
 
 .. code-block:: python
 
-  from pymongoarrow.api import write
-
-  write(db.collname, arrow_table)
+   from pymongoarrow.api import write
 
-As of PyMongoArrow 1.0, the main advantage to using the ``write`` function
-is that it will iterate over the arrow table/ data frame / numpy array
-and not convert the entire object to a list.
+   write(db.collname, arrow_table)
 
+As of {+driver-short+} 1.0, the main advantage to using the ``write`` function
+is that it iterates over the arrow table, data frame, or numpy array,
+and doesn't convert the entire object to a list.
 
 Benchmarks
 ----------
 
-The following measurements were taken with PyMongoArrow 1.0 and PyMongo 4.4.
-For insertions, the library performs about the same as when using PyMongo
-(conventional), and uses the same amount of memory.::
-
-  ProfileInsertSmall.peakmem_insert_conventional   107M
-  ProfileInsertSmall.peakmem_insert_arrow          108M
-  ProfileInsertSmall.time_insert_conventional      202±0.8ms
-  ProfileInsertSmall.time_insert_arrow             181±0.4ms
-
-  ProfileInsertLarge.peakmem_insert_arrow          127M
-  ProfileInsertLarge.peakmem_insert_conventional   125M
-  ProfileInsertLarge.time_insert_arrow             425±1ms
-  ProfileInsertLarge.time_insert_conventional      440±1ms
-
-For reads, the library is somewhat slower for small documents and nested
-documents, but faster for large documents . It uses less memory in all cases::
-
-  ProfileReadSmall.peakmem_conventional_arrow      85.8M
-  ProfileReadSmall.peakmem_to_arrow                83.1M
-  ProfileReadSmall.time_conventional_arrow         38.1±0.3ms
-  ProfileReadSmall.time_to_arrow                   60.8±0.3ms
-
-  ProfileReadLarge.peakmem_conventional_arrow      138M
-  ProfileReadLarge.peakmem_to_arrow                106M
-  ProfileReadLarge.time_conventional_ndarray       243±20ms
-  ProfileReadLarge.time_to_arrow                   186±0.8ms
-
-  ProfileReadDocument.peakmem_conventional_arrow   209M
-  ProfileReadDocument.peakmem_to_arrow             152M
-  ProfileReadDocument.time_conventional_arrow      865±7ms
-  ProfileReadDocument.time_to_arrow                937±1ms
+The following measurements were taken with {+driver-short+} version 1.0 and
+PyMongo version 4.4. For insertions, the library performs about the same as when
+using conventional PyMongo, and uses the same amount of memory.
+
+.. code-block:: none
+
+   ProfileInsertSmall.peakmem_insert_conventional   107M
+   ProfileInsertSmall.peakmem_insert_arrow          108M
+   ProfileInsertSmall.time_insert_conventional      202±0.8ms
+   ProfileInsertSmall.time_insert_arrow             181±0.4ms
+
+   ProfileInsertLarge.peakmem_insert_arrow          127M
+   ProfileInsertLarge.peakmem_insert_conventional   125M
+   ProfileInsertLarge.time_insert_arrow             425±1ms
+   ProfileInsertLarge.time_insert_conventional      440±1ms
+
+For reads, the library is slower for small documents and nested
+documents, but faster for large documents. It uses less memory in all cases.
+
+.. code-block:: none
+
+   ProfileReadSmall.peakmem_conventional_arrow      85.8M
+   ProfileReadSmall.peakmem_to_arrow                83.1M
+   ProfileReadSmall.time_conventional_arrow         38.1±0.3ms
+   ProfileReadSmall.time_to_arrow                   60.8±0.3ms
+
+   ProfileReadLarge.peakmem_conventional_arrow      138M
+   ProfileReadLarge.peakmem_to_arrow                106M
+   ProfileReadLarge.time_conventional_ndarray       243±20ms
+   ProfileReadLarge.time_to_arrow                   186±0.8ms
+
+   ProfileReadDocument.peakmem_conventional_arrow   209M
+   ProfileReadDocument.peakmem_to_arrow             152M
+   ProfileReadDocument.time_conventional_arrow      865±7ms
+   ProfileReadDocument.time_to_arrow                937±1ms
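The ``write`` advantage described in the hunk above — iterating over the source object rather than converting it wholesale to a list — can be sketched in plain Python. This is an illustrative sketch, not the {+driver-short+} implementation: ``iter_batches`` and the dict-of-columns ``table`` are hypothetical.

```python
# Illustrative sketch (not the pymongoarrow implementation): stream rows
# from a columnar table in fixed-size batches instead of materializing
# one giant list of documents, which is what keeps peak memory flat for
# large inputs.

def iter_batches(columns, batch_size):
    """Yield lists of row dicts from a dict of equal-length columns."""
    names = list(columns)
    n = len(columns[names[0]])
    for start in range(0, n, batch_size):
        yield [
            {name: columns[name][i] for name in names}
            for i in range(start, min(start + batch_size, n))
        ]


table = {"x": list(range(10)), "y": [float(i) for i in range(10)]}
batches = list(iter_batches(table, batch_size=4))
# 10 rows at batch_size=4 -> batches of 4, 4, and 2 rows
assert [len(b) for b in batches] == [4, 4, 2]
assert batches[0][0] == {"x": 0, "y": 0.0}
```

Each batch could then be passed to an ``insert_many``-style call, so only ``batch_size`` row dicts exist at a time.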

source/developer-guide.txt

Lines changed: 0 additions & 17 deletions
This file was deleted.

source/developer-guide/benchmarks.txt

Lines changed: 0 additions & 17 deletions
This file was deleted.
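The benchmark tables this commit consolidates into source/comparison.txt also support a quick derived check. The raw numbers below are copied from the ProfileReadLarge rows above; the percentage savings are computed here and are not stated in the source.

```python
# Derived comparison from the ProfileReadLarge benchmark rows:
# peak memory 138M (conventional) vs 106M (to_arrow),
# time 243ms (conventional ndarray) vs 186ms (to_arrow).

read_large_mem = {"conventional": 138, "to_arrow": 106}   # MB
read_large_time = {"conventional": 243, "to_arrow": 186}  # ms

mem_saving = 1 - read_large_mem["to_arrow"] / read_large_mem["conventional"]
time_saving = 1 - read_large_time["to_arrow"] / read_large_time["conventional"]

# For large-document reads, to_arrow uses roughly 23% less peak memory
# and is roughly 23% faster than the conventional path.
assert round(mem_saving * 100) == 23
assert round(time_saving * 100) == 23
```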
