This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit 6807991

Merge pull request #288 from jardayn/docs_imp
New DB Driver guide update
2 parents 6308ad9 + 2e3703a commit 6807991

1 file changed

+39
-16
lines changed

docs/new-database-driver-guide.rst

Lines changed: 39 additions & 16 deletions
@@ -24,11 +24,24 @@ Then, users can install the dependencies needed for your database driver, with `
2424

2525
This way, data-diff can support a wide variety of drivers, without requiring our users to install libraries that they won't use.
2626

27-
2. Implement database module
27+
2. Implement a database module
2828
------------------------------
2929

3030
New database modules belong in the ``data_diff/databases`` directory.
3131

32+
The module consists of:
33+
1. A dialect class, responsible for normalizing and casting fields (e.g. numbers and timestamps).
34+
2. A database class that handles connecting to the DB, querying (if the default doesn't work), closing connections, etc.
35+
36+
Choosing a base class, based on threading Model
37+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
38+
39+
You can choose to inherit from either ``base.Database`` or ``base.ThreadedDatabase``.
40+
41+
Usually, databases with cursor-based connections, like MySQL or PostgreSQL, only allow one thread per connection. In order to support multithreading, we implement them by inheriting from ``ThreadedDatabase``, which holds a pool of worker threads, and creates a new connection per thread.
42+
43+
Usually, cloud databases, such as Snowflake and BigQuery, open a new connection per request, and support simultaneous queries from any number of threads. In other words, they already support multithreading, so we can implement them by inheriting directly from ``Database``.
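
The one-connection-per-worker-thread idea can be sketched roughly as follows. This is only an illustration, not data-diff's actual ``ThreadedDatabase``: the class ``ThreadedDatabaseSketch`` and its method names are hypothetical, and ``sqlite3`` merely stands in for a cursor-based driver.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadedDatabaseSketch:
    """Hypothetical stand-in, not data-diff's real ThreadedDatabase."""

    def __init__(self, thread_count: int = 2):
        # Per-thread storage, so each worker sees only its own connection.
        self._local = threading.local()
        self._pool = ThreadPoolExecutor(thread_count, initializer=self._init_thread)

    def _init_thread(self):
        # Runs once in each worker thread: open that thread's own connection.
        self._local.conn = sqlite3.connect(":memory:")

    def _query_in_worker(self, sql: str):
        # Executes on a worker thread, using only that thread's connection.
        cursor = self._local.conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()

    def query(self, sql: str):
        # Submit the query to the pool and block until the rows come back.
        return self._pool.submit(self._query_in_worker, sql).result()

    def close(self):
        self._pool.shutdown()


db = ThreadedDatabaseSketch(thread_count=2)
rows = db.query("SELECT 1 + 1")
db.close()
```

A database whose client already handles concurrent requests would not need the pool and could run the query directly on the calling thread.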
44+
3245
Import on demand
3346
~~~~~~~~~~~~~~~~~
3447

@@ -50,16 +63,6 @@ Instead, they should be imported and initialized within a function. Example:
5063

5164
We use the ``import_helper()`` decorator to provide a uniform and informative error. The string argument should be the name of the package, as written in ``pyproject.toml``.
5265

53-
Choosing a base class, based on threading Model
54-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
55-
56-
You can choose to inherit from either ``base.Database`` or ``base.ThreadedDatabase``.
57-
58-
Usually, databases with cursor-based connections, like MySQL or Postgresql, only allow one thread per connection. In order to support multithreading, we implement them by inheriting from ``ThreadedDatabase``, which holds a pool of worker threads, and creates a new connection per thread.
59-
60-
Usually, cloud databases, such as snowflake and bigquery, open a new connection per request, and support simultaneous queries from any number of threads. In other words, they already support multithreading, so we can implement them by inheriting directly from ``Database``.
61-
62-
6366
:meth:`_query()`
6467
~~~~~~~~~~~~~~~~~~
6568

@@ -124,19 +127,40 @@ Docs:
124127

125128
- :meth:`data_diff.databases.database_types.AbstractDatabase.close`
126129

127-
:meth:`quote()`, :meth:`to_string()`, :meth:`normalize_number()`, :meth:`normalize_timestamp()`, :meth:`md5_to_int()`
130+
:meth:`quote()`, :meth:`to_string()`
128131
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
129132

130-
These methods are used when creating queries.
131-
132-
They accept an SQL code fragment, and returns a new code fragment representing the appropriate computation.
133+
These methods are used when creating queries, to quote a value or cast it to ``VARCHAR``.
133134

134135
For more information, read their docs:
135136

136137
- :meth:`data_diff.databases.database_types.AbstractDatabase.quote`
137138

138139
- :meth:`data_diff.databases.database_types.AbstractDatabase.to_string`
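
As a rough illustration of what such helpers might produce, here is a minimal sketch. The standalone functions below are hypothetical simplifications, not the signatures of the real methods, and they assume a database that quotes with double quotes and supports ``CAST(... AS VARCHAR)``.

```python
def quote(name: str) -> str:
    # Assumption: the target database quotes names with double quotes.
    return f'"{name}"'


def to_string(expr: str) -> str:
    # Wrap an SQL fragment so that its result is cast to text.
    return f"CAST({expr} AS VARCHAR)"


fragment = to_string(quote("price"))
```

Both helpers take and return plain SQL text, so they can be nested when a query is assembled.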
139140

141+
:meth:`normalize_number()`, :meth:`normalize_timestamp()`, :meth:`md5_to_int()`
142+
143+
Comparing data between two databases requires both sides to be in the same format, so we provide normalization functions.
144+
145+
Databases can have the same data in different formats, e.g. ``DECIMAL`` vs ``FLOAT`` vs ``VARCHAR``, with different precisions.
146+
data-diff works by converting the values to ``VARCHAR`` and comparing them.
147+
Your ``normalize_number()`` and ``normalize_timestamp()`` functions should account for differing precisions between columns.
148+
149+
These functions accept an SQL code fragment, and return a new code fragment representing the appropriate computation.
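
To make the "fragment in, fragment out" idea concrete, here is a minimal sketch of a number normalizer. The standalone function and its ``precision`` parameter are assumptions for illustration, not the signature of the real method, and the SQL it emits assumes a database that supports ``DECIMAL(38, n)`` and ``CAST``.

```python
def normalize_number(value: str, precision: int) -> str:
    # Cast to a decimal with a fixed number of fractional digits, then to
    # text, so both databases render the value with the same scale.
    return f"CAST(CAST({value} AS DECIMAL(38, {precision})) AS VARCHAR)"


sql = normalize_number("price", 3)
```

Applying the same precision on both sides is what makes the resulting strings comparable.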
150+
151+
:meth:`parse_type`
152+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
153+
154+
This is used to determine types that the system cannot detect on its own.
155+
Example:
156+
``DECIMAL(10,3)`` needs to be parsed by a custom algorithm. You would typically use a regex to split it into type name, width, and scale.
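
A regex-based split of this kind might look like the sketch below. The helper ``parse_decimal_type`` and the ``ParsedDecimal`` tuple are hypothetical names for illustration; the real ``parse_type`` method takes different arguments and returns the project's own column-type objects.

```python
import re
from typing import NamedTuple, Optional


class ParsedDecimal(NamedTuple):
    type_name: str
    width: int
    scale: int


# Matches e.g. "DECIMAL(10,3)" or "numeric( 38 , 0 )".
_DECIMAL_RE = re.compile(r"^\s*(\w+)\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*$")


def parse_decimal_type(type_repr: str) -> Optional[ParsedDecimal]:
    match = _DECIMAL_RE.match(type_repr)
    if match is None:
        # Not of the form NAME(width, scale); let other detection handle it.
        return None
    name, width, scale = match.groups()
    return ParsedDecimal(name.upper(), int(width), int(scale))


parsed = parse_decimal_type("DECIMAL(10,3)")
```

Returning ``None`` for non-matching input leaves room to fall back to the default type detection.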
157+
158+
4. Debugging
159+
-----------------------
160+
161+
You can enable debug logging for tests by setting the logger level to ``DEBUG`` in ``tests/common.py``.
162+
This displays all the queries that were run, along with the types detected for each column.
163+
140164
3. Add tests
141165
--------------
142166

@@ -176,4 +200,3 @@ When debugging, we recommend using the `-f` flag, to stop on error. Also, use th
176200
-----------------------
177201

178202
Open a pull-request on github, and we'll take it from there!
179-
