Add bulk operations utilities #224

KevinJBoyer · 2024-05-08T14:20:50Z

Ticket

n/a

Changes

Add bulk_ops.py, which exposes a bulk_upsert function for efficiently upserting large amounts of data into the database

Context for reviewers

Projects frequently need to load large amounts of data into the database from external sources such as CSV files. This utility provides a flexible and efficient way of doing so.
I'm not familiar with the platform's approach to abstracting away the underlying database -- the code here is Postgres specific and uses the psycopg library. Feedback on how to adapt the code here to the platform's approach is welcome/appreciated.
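For readers unfamiliar with the pattern, here is a minimal sketch of the SQL a temp-table upsert might issue. It is not the code in bulk_ops.py: `build_upsert_statements` and `quote_ident` are hypothetical names, and the use of `COPY` and `CREATE TEMP TABLE ... (LIKE ...)` is an assumption about how the staging step could work. Only the `temp_{table}` naming, the `constraint` and `update_condition` parameters, and `ON COMMIT DROP` come from this PR.

```python
from typing import List, Optional, Sequence


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_upsert_statements(
    table: str,
    attributes: Sequence[str],
    constraint: str,
    update_condition: Optional[str] = None,
) -> List[str]:
    """Return the three statements of a temp-table upsert (hypothetical helper)."""
    temp_table = quote_ident(f"temp_{table}")
    target = quote_ident(table)
    columns = ", ".join(quote_ident(a) for a in attributes)
    assignments = ", ".join(
        f"{quote_ident(a)} = EXCLUDED.{quote_ident(a)}" for a in attributes
    )

    # 1. Stage rows in a temp table shaped like the target; dropped at commit.
    create = f"CREATE TEMP TABLE {temp_table} (LIKE {target}) ON COMMIT DROP"
    # 2. Stream the rows in with COPY (assumed here as the fast load path).
    copy = f"COPY {temp_table} ({columns}) FROM STDIN"
    # 3. Merge staged rows into the target, updating rows that hit the constraint.
    upsert = (
        f"INSERT INTO {target} ({columns}) "
        f"SELECT {columns} FROM {temp_table} "
        f"ON CONFLICT ON CONSTRAINT {quote_ident(constraint)} "
        f"DO UPDATE SET {assignments}"
    )
    if update_condition is not None:
        # update_condition is treated as trusted SQL text, never user input.
        upsert += f" WHERE {update_condition}"
    return [create, copy, upsert]
```

In real code, psycopg's `sql.Identifier` and `sql.SQL` composition would be a safer choice than hand-rolled quoting; the plain-string version above only keeps the sketch dependency-free.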

Testing

make test args="tests/src/db/test_bulk_ops.py"

KevinJBoyer · 2024-05-08T14:22:13Z

app/tests/src/db/test_bulk_ops.py

+    conn = db_session.connection().connection
+    # Override mypy, because SQLAlchemy's DBAPICursor type doesn't specify the row_factory attribute, or that it functions as a context manager
+    with conn.cursor(row_factory=rows.class_row(Number)) as cur:  # type: ignore


Would love to know if there's a better way of doing this. I also considered:

db_client = db.PostgresDBClient()
conn = db_client._engine.raw_connection()

but accessing _engine directly did not feel appropriate (and doesn't solve for the type issue in any case)

hmm, not sure

We could consider adding a raw_connection() method to the client class which does what you suggested. For the docs, mention that unless you're trying to do something very low level (i.e., in psycopg) you'll almost never actually want to use it.

I added this and included a comment with that context -- LMK what you think!
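A rough sketch of what such a method could look like, for illustration only. The class below is a hypothetical stand-in for the real client: its constructor takes the engine directly, which is an assumption made to keep the example self-contained. The sketch relies on `_engine` being a SQLAlchemy `Engine`, whose `raw_connection()` method returns the underlying DBAPI connection.

```python
from typing import Any


class PostgresDBClient:
    """Hypothetical stand-in for the platform client; only the relevant parts."""

    def __init__(self, engine: Any) -> None:
        # Assumed to be a SQLAlchemy Engine in the real client.
        self._engine = engine

    def raw_connection(self) -> Any:
        """Return the underlying DBAPI (psycopg) connection.

        Only needed for very low-level work, such as using psycopg features
        directly. Almost all callers should use a regular session instead.
        """
        return self._engine.raw_connection()
```

Callers would then write `conn = db_client.raw_connection()` instead of reaching into `_engine`. This does not address the mypy issue on `cursor(row_factory=...)` by itself, since the return type is still not psycopg-specific.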


lorenyu · 2024-05-08T21:50:02Z

app/tests/src/db/test_bulk_ops.py

+    conn = db_session.connection().connection
+    # Override mypy, because SQLAlchemy's DBAPICursor type doesn't specify the row_factory attribute, or that it functions as a context manager
+    with conn.cursor(row_factory=rows.class_row(Number)) as cur:  # type: ignore


hmm, not sure

lorenyu · 2024-05-08T21:52:26Z

app/tests/src/db/test_bulk_ops.py

+        # Now modify half of the objects
+        for obj in objects[: int(len(objects) / 2)]:
+            obj.num = random.randint(1, 10000)
+
+        bulk_ops.bulk_upsert(
+            cur,
+            table,
+            attributes,
+            objects,
+            constraint,
+        )
+        conn.commit()


nit: it'd be nice to have the test case do a combination of inserts and updates rather than just inserts and updates separately

Added -- one round of inserts, then a second round of combo insert + updates
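To make the two-round shape concrete, here is a small in-memory sketch. It does not touch the database: `fake_bulk_upsert` is a hypothetical stand-in that upserts into a dict keyed by `id`, and the `id` field on `Number` is an assumption (only `num` appears in the test above).

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Number:
    id: int  # assumed key column
    num: int


def fake_bulk_upsert(store: Dict[int, int], objects: List[Number]) -> None:
    """Insert new ids and overwrite existing ones, like an upsert would."""
    for obj in objects:
        store[obj.id] = obj.num


def run_two_rounds() -> Tuple[Dict[int, int], Dict[int, int]]:
    store: Dict[int, int] = {}

    # Round 1: inserts only.
    objects = [Number(id=i, num=random.randint(1, 10000)) for i in range(100)]
    fake_bulk_upsert(store, objects)

    # Round 2: modify half of the existing objects and add new ones,
    # so a single call exercises both the insert and the update path.
    for obj in objects[: len(objects) // 2]:
        obj.num = random.randint(1, 10000)
    objects += [Number(id=i, num=random.randint(1, 10000)) for i in range(100, 150)]
    fake_bulk_upsert(store, objects)

    expected = {obj.id: obj.num for obj in objects}
    return store, expected
```

The real test would call `bulk_ops.bulk_upsert` with a cursor in place of the stand-in and read the rows back before comparing.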

chouinar · 2024-05-09T13:49:44Z

app/src/db/bulk_ops.py

+    temp_table = f"temp_{table}"
+    create_temp_table(cur, temp_table=temp_table, src_table=table)


This is probably a very niche edge case, but what would happen if two temp tables were created with the same name by different processes? Does that cause any issues, or does their being inside transactions shield them entirely?

Great question. I tested it locally, and it looks like the transaction isolation works as you'd expect. Here's the SQL I ran:

CREATE TEMP TABLE test (id INT) ON COMMIT DROP;
SELECT * FROM test;

-- In a separate connection!
BEGIN;
CREATE TEMP TABLE test (other INT) ON COMMIT DROP;
SELECT * FROM test;
COMMIT;

-- Back in the original connection
COMMIT;


chouinar · 2024-05-09T13:56:47Z

app/tests/src/db/test_bulk_ops.py

+    conn = db_session.connection().connection
+    # Override mypy, because SQLAlchemy's DBAPICursor type doesn't specify the row_factory attribute, or that it functions as a context manager
+    with conn.cursor(row_factory=rows.class_row(Number)) as cur:  # type: ignore


We could consider adding a raw_connection() method to the client class which does what you suggested. For the docs, mention that unless you're trying to do something very low level (i.e., in psycopg) you'll almost never actually want to use it.


lorenyu

looks great. just a nit on the test, don't feel too strongly about it though


Add bulk_ops

741c5b7

KevinJBoyer commented May 8, 2024


KevinJBoyer requested review from chouinar, lorenyu and rocketnova May 8, 2024 14:22

KevinJBoyer added 2 commits May 8, 2024 10:25

Add return type and move function call out of default parameter

272da42

Set type of update_condition to Optional

3ddb27d

lorenyu reviewed May 8, 2024


chouinar reviewed May 9, 2024


Address reviewer comments

fd15ef9

KevinJBoyer requested review from chouinar and lorenyu June 6, 2024 17:37

lorenyu approved these changes Jun 6, 2024


Update test

87e5e33

KevinJBoyer merged commit 0f5619c into main Jun 10, 2024

KevinJBoyer deleted the kb/add-bulk-ops branch June 10, 2024 19:47


4 participants
