Commit e966c38
[SPARK-34265][PYTHON][SQL] Instrument Python UDFs using SQL metrics
### What changes are proposed in this pull request?
This proposes to add SQLMetrics instrumentation for Python UDF execution, including Pandas UDF, and related operations such as MapInPandas and MapInArrow.
The proposed metrics are:
- data sent to Python workers
- data returned from Python workers
- number of output rows
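As a rough illustration of what these three counters measure, here is a minimal pure-Python sketch. It is not Spark's implementation: `UdfMetrics` and `run_batch` are hypothetical names, and `pickle` merely stands in for the real JVM-to-Python serialization so that the byte counts have something to measure.

```python
import pickle
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UdfMetrics:
    """Hypothetical counters mirroring the three proposed metrics."""
    bytes_sent: int = 0       # data sent to the (simulated) Python worker
    bytes_returned: int = 0   # data returned from the (simulated) worker
    output_rows: int = 0      # number of output rows


def run_batch(func: Callable[[List[float]], List[float]],
              batch: List[float],
              metrics: UdfMetrics) -> List[float]:
    """Serialize a batch, apply func to it, and update the counters."""
    payload = pickle.dumps(batch)          # stand-in for the transfer to the worker
    metrics.bytes_sent += len(payload)
    result = func(pickle.loads(payload))
    reply = pickle.dumps(result)           # stand-in for the transfer back
    metrics.bytes_returned += len(reply)
    metrics.output_rows += len(result)
    return pickle.loads(reply)


metrics = UdfMetrics()
squares: List[float] = []
for start in range(0, 10, 5):              # two batches of five rows each
    rows = [float(i) for i in range(start, start + 5)]
    squares.extend(run_batch(lambda xs: [x * x for x in xs], rows, metrics))

print(metrics)
```

The counters accumulate across batches, which is the property that makes such metrics useful for judging how much work has been done so far.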
### Why are the changes needed?
This aims at improving monitoring and performance troubleshooting of Python UDFs.
In particular, it is intended to help answer performance-related questions such as:
why is the UDF slow? How much work has been done so far?
### Does this PR introduce _any_ user-facing change?
SQL metrics are made available in the Web UI.
See the following examples:

### How was this patch tested?
Manually tested + a Python unit test and a Scala unit test have been added.
Example code used for testing:
```
from pyspark.sql.functions import col, pandas_udf
import time

@pandas_udf("long")
def test_pandas(col1):
    # Sleep to make the time spent in the UDF visible
    time.sleep(0.02)
    return col1 * col1

spark.udf.register("test_pandas", test_pandas)
spark.sql("select rand(42)*rand(51)*rand(12) col1 from range(10000000)").createOrReplaceTempView("t1")
spark.sql("select max(test_pandas(col1)) from t1").collect()
```
This is used to test with more data pushed to the Python workers:
```
from pyspark.sql.functions import col, pandas_udf
import time

@pandas_udf("long")
def test_pandas(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17):
    # All 17 columns are sent to the worker; only the first is returned
    time.sleep(0.02)
    return col1

spark.udf.register("test_pandas", test_pandas)
spark.sql("select rand(42)*rand(51)*rand(12) col1 from range(10000000)").createOrReplaceTempView("t1")
spark.sql("select max(test_pandas(col1,col1+1,col1+2,col1+3,col1+4,col1+5,col1+6,col1+7,col1+8,col1+9,col1+10,col1+11,col1+12,col1+13,col1+14,col1+15,col1+16)) from t1").collect()
```
This (from the Spark doc) has been used to test with MapInPandas, where the number of output rows is different from the number of input rows:
```
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def filter_func(iterator):
    for pdf in iterator:
        yield pdf[pdf.id == 1]

df.mapInPandas(filter_func, schema=df.schema).show()
```
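Why a separate output-row counter matters for iterator-style functions can be sketched without Spark. The example below is only an illustration under stated assumptions: batches are plain lists of dicts rather than pandas DataFrames, and `count_rows` is a hypothetical helper, not part of PySpark. It wraps the input and output iterators of a filter function like the one above and counts the rows flowing through each.

```python
from typing import Dict, Iterable, Iterator, List

Batch = List[Dict[str, int]]


def filter_func(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Keep only rows with id == 1, batch by batch."""
    for batch in batches:
        yield [row for row in batch if row["id"] == 1]


def count_rows(batches: Iterable[Batch],
               counter: Dict[str, int],
               key: str) -> Iterator[Batch]:
    """Hypothetical helper: lazily count rows as batches pass through."""
    for batch in batches:
        counter[key] = counter.get(key, 0) + len(batch)
        yield batch


counts: Dict[str, int] = {}
data: List[Batch] = [[{"id": 1, "age": 21}, {"id": 2, "age": 30}]]

counted_in = count_rows(data, counts, "input_rows")
counted_out = count_rows(filter_func(counted_in), counts, "output_rows")
result = [row for batch in counted_out for row in batch]

print(counts, result)
```

Because the wrappers are generators, the counts are only updated as batches are actually consumed, so partial progress is visible while the job is still running.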
This is for testing BatchEvalPython and the metrics related to data transfer (bytes sent and received):
```
from pyspark.sql.functions import udf

@udf
def test_udf(col1, col2):
    # col2 is unused; it only increases the data sent to the worker
    return col1 * col1

spark.sql("select id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' col2 from range(10)").select(test_udf("id", "col2")).collect()
```
Closes #33559 from LucaCanali/pythonUDFKeySQLMetrics.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
1 parent b7a88cd commit e966c38
File tree
20 files changed, +193 -23 lines changed
- dev/sparktestsupport
- docs
- python/pyspark/sql/tests
- sql/core/src
  - main/scala/org/apache/spark/sql/execution
    - python
    - streaming
  - test/scala/org/apache/spark/sql/execution/python