Turn pandas dataframes into SQL expressions to overload tables in SQL queries

gdf2sql


Goal

Simplify testing of PostgreSQL queries

The idea is to inject mock tables as SQL code around the tested query, so that the mock tables effectively overload the actual tables. The benefit is that you don't need to modify the tables themselves. Under the hood, it uses Common Table Expressions (the WITH keyword) to overload the table name with fake data (the actual table stays unchanged).

As input, it accepts either geopandas dataframes or pandas dataframes.
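To see the core transformation, here is a minimal sketch of how a dataframe can be rendered as a `VALUES` CTE fragment. Note that `df_to_values_cte` is a hypothetical helper written for illustration, not part of the gdf2sql API:

```python
import pandas as pd

def df_to_values_cte(name: str, df: pd.DataFrame) -> str:
    """Render a dataframe as a `name(cols) AS (VALUES ...)` CTE fragment."""
    def lit(v):
        # Quote strings; rely on str() for numbers. A real implementation
        # would also need to escape quotes and handle NULLs, dates, geometry, ...
        return "'" + v + "'" if isinstance(v, str) else str(v)
    cols = ", ".join(df.columns)
    rows = ",\n".join(
        "(" + ", ".join(lit(v) for v in row) + ")"
        for row in df.itertuples(index=False)
    )
    return f"{name}({cols}) AS (VALUES\n{rows})"

df = pd.DataFrame({"name": ["Brasilia"], "Population": [25.0]})
print(df_to_values_cte("nyc_subway_stats", df))
# nyc_subway_stats(name, Population) AS (VALUES
# ('Brasilia', 25.0))
```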

Installation

pip install git+https://github.com/lbruand/gdf2sql#egg=gdf2sql

Requirements

  • python 3.5+
  • pandas
  • geopandas

Code example

import pandas as pd
from typing import List

from gdf2sql import build_test_sql_query, build_vtable, VTable  # adjust to the actual import path

df = pd.DataFrame(
    {
        "name": ["Buenos Aires", "Brasilia", "Santiago", "Bogota", "Caracas"],
        "Country": ["Argentina", "Brazil", "Chile", "Colombia", "Venezuela"],
        "Population": [20., 25., 9., 8., 5.],
        "Latitude": [-34.58, -15.78, -33.45, 4.60, 10.48],
        "Longitude": [-58.66, -47.91, -70.66, -74.08, -66.86],
    }
)
inner_query = "SELECT name, ST_AsText(geometry) " \
              "FROM nyc_subway_stats " \
              "WHERE nyc_subway_stats.name = 'Brasilia'"
tables: List[VTable] = [build_vtable("nyc_subway_stats", df)]
result_query = build_test_sql_query(tables, inner_query)

# Run the resulting query inside PostgreSQL.
# It will run as if there were a nyc_subway_stats table containing the `df` dataframe.

How it works under the hood: injection

Let's take an example query:

SELECT name
FROM nyc_subway_stats
WHERE nyc_subway_stats.name = 'Brasilia'

(A)

It is transformed into:

WITH
nyc_subway_stats(name, Country, Population, Latitude, Longitude) AS (VALUES
('Buenos Aires', 'Argentina', 20.0, -34.58, -58.66),
('Brasilia', 'Brazil', 25.0, -15.78, -47.91),
('Santiago', 'Chile', 9.0, -33.45, -70.66),
('Bogota', 'Colombia', 8.0, 4.6, -74.08),
('Caracas', 'Venezuela', 5.0, 10.48, -66.86)),
INNERQUERY_6cee68 AS (
  SELECT name FROM nyc_subway_stats
  WHERE nyc_subway_stats.name = 'Brasilia')
SELECT * FROM INNERQUERY_6cee68
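You can verify that such a wrapped query runs without any real table existing. The sketch below executes a hand-written version of the generated query against an empty in-memory SQLite database (SQLite also supports `WITH ... AS (VALUES ...)`, since version 3.8.3):

```python
import sqlite3

# Hand-written version of the generated query shown above.
query = """
WITH
nyc_subway_stats(name, Country, Population, Latitude, Longitude) AS (VALUES
  ('Buenos Aires', 'Argentina', 20.0, -34.58, -58.66),
  ('Brasilia', 'Brazil', 25.0, -15.78, -47.91),
  ('Santiago', 'Chile', 9.0, -33.45, -70.66),
  ('Bogota', 'Colombia', 8.0, 4.6, -74.08),
  ('Caracas', 'Venezuela', 5.0, 10.48, -66.86)),
INNERQUERY_6cee68 AS (
  SELECT name FROM nyc_subway_stats
  WHERE nyc_subway_stats.name = 'Brasilia')
SELECT * FROM INNERQUERY_6cee68
"""
conn = sqlite3.connect(":memory:")  # empty database: no nyc_subway_stats table exists
rows = conn.execute(query).fetchall()
print(rows)  # [('Brasilia',)]
```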

with two common table expressions:

  • one for the mock table nyc_subway_stats, which overrides the content of the real nyc_subway_stats (which might not even exist)
  • one for the inner query (A)

Diagram

Benefits

  • No need to update the underlying table before the test (which might be slow).
  • The data and the query are sent to the PostgreSQL engine and tested there directly.
  • gdf2sql is independent of the database considered, so you could unit-test against SQLite and then run integration tests against PostgreSQL. (Beware that SQLite and PostgreSQL are different SQL dialects, which might affect your queries. You may want to use sqlglot to convert from one to the other.)
  • This lets you test the functional behaviour of your queries, in particular in the face of malformed data in your tables.

Caveats

  • Of course, there is no point in using this to test actual performance, indexes, etc. Those aspects still require tests against the real tables.
  • You can still break your queries down into manageable subqueries and compose them using WITH/CTE.

Why not use to_sql from pandas?

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html To Be Continued
