Skip to content
This repository has been archived by the owner on Sep 20, 2023. It is now read-only.

Clarification regarding missing query metrics #84

Closed
katsou55 opened this issue Jun 24, 2019 · 3 comments
Closed

Clarification regarding missing query metrics #84

katsou55 opened this issue Jun 24, 2019 · 3 comments

Comments

@katsou55
Copy link

katsou55 commented Jun 24, 2019

Hello aughingman7743,

First of all, awesome work you have here! Your implementation is quite handy.

I was looking for upgrading the version from 1 to 2 due to the new driver released by AWS and I noticed that the queries' execution information was lost (e.g. data_scanned_in_bytes).

I was looking into the code to try to hack this information since it is relevant to me, although without any success. Also, I tried to look for the query id to fetch the execution properties offline as work around.

Q1: Therefore, my question is if this is a limitation from the Driver itself ?
Your help is highly appreciated!


Moreover, I noticed that your PyAthena implementation has the information I am looking for since you use the athena api. Do you have any benchmark or concern regarding the utilization on the latter vs the former (PyAthenaJDBC).

Up to know I ran a simple test "select * from mytable limit 1000" and consistently, the PyAthenaJDBC has better performance over PyAthena with the simple cursor or pandas cursor.

Q2: Did you find the same results?

Cheers!


Thanks in advance

@laughingman7743
Copy link
Owner

Q1:
Older versions of the JDBC driver could obtain a query ID, but newer versions of the JDBC driver were not likely to be available.
https://github.com/laughingman7743/PyAthenaJDBC/blob/v1.3.4/pyathenajdbc/cursor.py#L95
Therefore, the query result information has been deleted when supporting the new driver.

Q2:
I have not taken a benchmark comparing PyAthena and PyAthenaJDBC.

Here's a comparison of PyAthena's Cursor and PandasCursor:
https://gist.github.com/laughingman7743/2e4d83ca4e394dc645e9ea9a45fe78ba

I'm interested in the performance of the JDBC version.

@laughingman7743
Copy link
Owner

The benchmark results have been added to the following branches:
https://github.com/laughingman7743/PyAthena/tree/add_benchmark_scripts/benchmarks
I also measure the performance of the JDBC version.

@katsou55
Copy link
Author

katsou55 commented Jul 5, 2019

I have ported my results into your benchmark script, here are the results in case you want to append the case.

Regarding the data, it is stored in ORC format compressed with snappy and it's partitioned by day.
Hope it helps.

And again, thanks for the awesome work.


1) SELECT * FROM "mydb"."mydata" limit 500;

PyAthenaJDBC Cursor ===========================
loop:0 count:500 elasped:6.10137915611
loop:1 count:500 elasped:2.72790908813
loop:2 count:500 elasped:2.40618085861
loop:3 count:500 elasped:2.7293741703
loop:4 count:500 elasped:2.67691302299
Avg: 3.32835125923
===============================================

PyAthena Cursor ===============================
loop:0 count:500 elasped:3.79181885719
loop:1 count:500 elasped:2.284968853
loop:2 count:500 elasped:2.23489999771
loop:3 count:500 elasped:3.27493000031
loop:4 count:500 elasped:2.25128793716
Avg: 2.76758112907
===============================================

PyAthena PandasCursor =========================
loop:0 count:500 elasped:3.91924095154
loop:1 count:500 elasped:1.97232317924
loop:2 count:500 elasped:3.05096411705
loop:3 count:500 elasped:2.04229307175
loop:4 count:500 elasped:2.98776197433
Avg: 2.79451665878
===============================================



2) SELECT * FROM "mydb"."mydata" limit 5000;

PyAthenaJDBC Cursor ===========================
loop:0 count:5000 elasped:11.7849650383
loop:1 count:5000 elasped:14.5608298779
loop:2 count:5000 elasped:14.3484158516
loop:3 count:5000 elasped:13.3424770832
loop:4 count:5000 elasped:14.0001211166
Avg: 13.6073617935
===============================================

PyAthena Cursor ===============================
loop:0 count:5000 elasped:13.2565009594
loop:1 count:5000 elasped:15.2821121216
loop:2 count:5000 elasped:14.016723156
loop:3 count:5000 elasped:14.1107928753
loop:4 count:5000 elasped:13.1769390106
Avg: 13.9686136246
===============================================

PyAthena PandasCursor =========================
loop:0 count:5000 elasped:7.07447004318
loop:1 count:5000 elasped:4.74981999397
loop:2 count:5000 elasped:4.56278800964
loop:3 count:5000 elasped:6.75243997574
loop:4 count:5000 elasped:5.7084209919
Avg: 5.76958780289
===============================================


3) SELECT * FROM "mydb"."mydata" limit 50000;

PyAthenaJDBC Cursor ===========================
loop:0 count:50000 elasped:103.403305054
loop:1 count:50000 elasped:102.769503832
loop:2 count:50000 elasped:104.066136122
loop:3 count:50000 elasped:124.901746035
loop:4 count:50000 elasped:102.194267988
Avg: 107.466991806
===============================================

PyAthena Cursor ===============================
loop:0 count:50000 elasped:98.2539699078
loop:1 count:50000 elasped:101.936420918
loop:2 count:50000 elasped:100.894956112
loop:3 count:50000 elasped:92.7668831348
loop:4 count:50000 elasped:93.4813268185
Avg: 97.4667113781
===============================================

PyAthena PandasCursor =========================
loop:0 count:50000 elasped:14.0817611217
loop:1 count:50000 elasped:13.3427770138
loop:2 count:50000 elasped:14.3496661186
loop:3 count:50000 elasped:14.4232530594
loop:4 count:50000 elasped:13.1008570194
Avg: 13.8596628666

@katsou55 katsou55 closed this as completed Jul 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants