Commit 928162f

Minor: Add new Extended ClickBench benchmark queries (#8950)
1 parent 4a3986a commit 928162f

2 files changed, +170 -11 lines changed

benchmarks/queries/clickbench/README.md

Lines changed: 167 additions & 10 deletions
@@ -11,23 +11,180 @@ ClickBench is focused on aggregation and filtering performance (though it has no
 [ClickBench repository]: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql
 
 ## "Extended" Queries
-The "extended" queries are not part of the official ClickBench benchmark.
-Instead they are used to test other DataFusion features that are not
-covered by the standard benchmark
 
-Each description below is for the corresponding line in `extended.sql` (line 1
-is `Q0`, line 2 is `Q1`, etc.)
+The "extended" queries are not part of the official ClickBench benchmark.
+Instead they are used to test other DataFusion features that are not covered by
+the standard benchmark. Each description below is for the corresponding line in
+`extended.sql` (line 1 is `Q0`, line 2 is `Q1`, etc.)
+
+### Q0: Data Exploration
+
+**Question**: "How many distinct searches, mobile phones, and mobile phone models are there in the dataset?"
+
+**Important Query Properties**: multiple `COUNT DISTINCT`s, with low and high cardinality
+distinct string columns.
+
+```sql
+SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
+FROM hits;
+```
+
+
+### Q1: Data Exploration
+
+**Question**: "How many distinct "hit color", "browser country" and "language" are there in the dataset?"
+
+**Important Query Properties**: multiple `COUNT DISTINCT`s. All three are small strings (length either 1 or 2).
 
-### Q0
-Models initial Data exploration, to understand some statistics of data.
-Import Query Properties: multiple `COUNT DISTINCT` on strings
 
 ```sql
-SELECT
-COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
+SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
 FROM hits;
 ```
 
+### Q2: Top 10 analysis
 
+**Question**: "Find the top 10 "browser country" by number of distinct "social network"s,
+including the distinct counts of "hit color", "browser language",
+and "social action"."
 
+**Important Query Properties**: GROUP BY a short string column, multiple `COUNT DISTINCT`s. There are several small strings (length either 1 or 2).
 
+```sql
+SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
+FROM hits
+GROUP BY 1
+ORDER BY 2 DESC
+LIMIT 10;
+```
+
+
+## Data Notes
+
+Here are some interesting statistics about the data used in the queries.
+Max length of `"SearchPhrase"` is 1113 characters:
+```sql
+select min(length("SearchPhrase")) as "SearchPhrase_len_min", max(length("SearchPhrase")) "SearchPhrase_len_max" from 'hits.parquet' limit 10;
++----------------------+----------------------+
+| SearchPhrase_len_min | SearchPhrase_len_max |
++----------------------+----------------------+
+| 0                    | 1113                 |
++----------------------+----------------------+
+```
+
+
+Here is the schema of the data
+```sql
+❯ describe 'hits.parquet';
++-----------------------+-----------+-------------+
+| column_name           | data_type | is_nullable |
++-----------------------+-----------+-------------+
+| WatchID               | Int64     | NO          |
+| JavaEnable            | Int16     | NO          |
+| Title                 | Utf8      | NO          |
+| GoodEvent             | Int16     | NO          |
+| EventTime             | Int64     | NO          |
+| EventDate             | UInt16    | NO          |
+| CounterID             | Int32     | NO          |
+| ClientIP              | Int32     | NO          |
+| RegionID              | Int32     | NO          |
+| UserID                | Int64     | NO          |
+| CounterClass          | Int16     | NO          |
+| OS                    | Int16     | NO          |
+| UserAgent             | Int16     | NO          |
+| URL                   | Utf8      | NO          |
+| Referer               | Utf8      | NO          |
+| IsRefresh             | Int16     | NO          |
+| RefererCategoryID     | Int16     | NO          |
+| RefererRegionID       | Int32     | NO          |
+| URLCategoryID         | Int16     | NO          |
+| URLRegionID           | Int32     | NO          |
+| ResolutionWidth       | Int16     | NO          |
+| ResolutionHeight      | Int16     | NO          |
+| ResolutionDepth       | Int16     | NO          |
+| FlashMajor            | Int16     | NO          |
+| FlashMinor            | Int16     | NO          |
+| FlashMinor2           | Utf8      | NO          |
+| NetMajor              | Int16     | NO          |
+| NetMinor              | Int16     | NO          |
+| UserAgentMajor        | Int16     | NO          |
+| UserAgentMinor        | Utf8      | NO          |
+| CookieEnable          | Int16     | NO          |
+| JavascriptEnable      | Int16     | NO          |
+| IsMobile              | Int16     | NO          |
+| MobilePhone           | Int16     | NO          |
+| MobilePhoneModel      | Utf8      | NO          |
+| Params                | Utf8      | NO          |
+| IPNetworkID           | Int32     | NO          |
+| TraficSourceID        | Int16     | NO          |
+| SearchEngineID        | Int16     | NO          |
+| SearchPhrase          | Utf8      | NO          |
+| AdvEngineID           | Int16     | NO          |
+| IsArtifical           | Int16     | NO          |
+| WindowClientWidth     | Int16     | NO          |
+| WindowClientHeight    | Int16     | NO          |
+| ClientTimeZone        | Int16     | NO          |
+| ClientEventTime       | Int64     | NO          |
+| SilverlightVersion1   | Int16     | NO          |
+| SilverlightVersion2   | Int16     | NO          |
+| SilverlightVersion3   | Int32     | NO          |
+| SilverlightVersion4   | Int16     | NO          |
+| PageCharset           | Utf8      | NO          |
+| CodeVersion           | Int32     | NO          |
+| IsLink                | Int16     | NO          |
+| IsDownload            | Int16     | NO          |
+| IsNotBounce           | Int16     | NO          |
+| FUniqID               | Int64     | NO          |
+| OriginalURL           | Utf8      | NO          |
+| HID                   | Int32     | NO          |
+| IsOldCounter          | Int16     | NO          |
+| IsEvent               | Int16     | NO          |
+| IsParameter           | Int16     | NO          |
+| DontCountHits         | Int16     | NO          |
+| WithHash              | Int16     | NO          |
+| HitColor              | Utf8      | NO          |
+| LocalEventTime        | Int64     | NO          |
+| Age                   | Int16     | NO          |
+| Sex                   | Int16     | NO          |
+| Income                | Int16     | NO          |
+| Interests             | Int16     | NO          |
+| Robotness             | Int16     | NO          |
+| RemoteIP              | Int32     | NO          |
+| WindowName            | Int32     | NO          |
+| OpenerName            | Int32     | NO          |
+| HistoryLength         | Int16     | NO          |
+| BrowserLanguage       | Utf8      | NO          |
+| BrowserCountry        | Utf8      | NO          |
+| SocialNetwork         | Utf8      | NO          |
+| SocialAction          | Utf8      | NO          |
+| HTTPError             | Int16     | NO          |
+| SendTiming            | Int32     | NO          |
+| DNSTiming             | Int32     | NO          |
+| ConnectTiming         | Int32     | NO          |
+| ResponseStartTiming   | Int32     | NO          |
+| ResponseEndTiming     | Int32     | NO          |
+| FetchTiming           | Int32     | NO          |
+| SocialSourceNetworkID | Int16     | NO          |
+| SocialSourcePage      | Utf8      | NO          |
+| ParamPrice            | Int64     | NO          |
+| ParamOrderID          | Utf8      | NO          |
+| ParamCurrency         | Utf8      | NO          |
+| ParamCurrencyID       | Int16     | NO          |
+| OpenstatServiceName   | Utf8      | NO          |
+| OpenstatCampaignID    | Utf8      | NO          |
+| OpenstatAdID          | Utf8      | NO          |
+| OpenstatSourceID      | Utf8      | NO          |
+| UTMSource             | Utf8      | NO          |
+| UTMMedium             | Utf8      | NO          |
+| UTMCampaign           | Utf8      | NO          |
+| UTMContent            | Utf8      | NO          |
+| UTMTerm               | Utf8      | NO          |
+| FromTag               | Utf8      | NO          |
+| HasGCLID              | Int16     | NO          |
+| RefererHash           | Int64     | NO          |
+| URLHash               | Int64     | NO          |
+| CLID                  | Int32     | NO          |
++-----------------------+-----------+-------------+
+105 rows in set. Query took 0.034 seconds.
+
+```
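
The Q2 query shape from the README diff above (group by one short string column, several `COUNT(DISTINCT ...)` aggregates, ordinal `GROUP BY 1` and `ORDER BY 2 DESC`) can be tried on a toy table. The following is a minimal sketch and not part of the commit: it assumes Python's built-in `sqlite3` module as a stand-in engine rather than DataFusion, and the six rows are made up for illustration.

```python
import sqlite3

# Made-up rows standing in for the `hits` table; only the columns Q2 touches.
# Column order: BrowserCountry, SocialNetwork, HitColor, BrowserLanguage, SocialAction
rows = [
    ("US", "net_a", "D", "en", "like"),
    ("US", "net_b", "D", "en", "share"),
    ("US", "net_b", "T", "es", "like"),
    ("DE", "net_a", "D", "de", "like"),
    ("DE", "net_a", "D", "de", "like"),
    ("FR", "", "T", "fr", ""),
]

# Assumption: SQLite is used here only because it ships with Python.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE hits ("BrowserCountry" TEXT, "SocialNetwork" TEXT, '
    '"HitColor" TEXT, "BrowserLanguage" TEXT, "SocialAction" TEXT)'
)
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)

# Same query text as Q2: `GROUP BY 1` refers to "BrowserCountry" and
# `ORDER BY 2 DESC` to the first COUNT(DISTINCT ...) column.
q2 = '''
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"),
       COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
FROM hits
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
'''
result = conn.execute(q2).fetchall()
for row in result:
    print(row)
```

The `US` group sorts first because it has two distinct social networks. The empty strings in the `FR` row each count as one distinct value, since an empty string is not NULL; the schema above lists every column as non-nullable.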
Lines changed: 3 additions & 1 deletion
@@ -1 +1,3 @@
-SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
+SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
+SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM hits;
+SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
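
The README text in this commit says each description corresponds to a line in `extended.sql` (line 1 is `Q0`, line 2 is `Q1`, etc.). A minimal sketch of that mapping follows, with the three queries from the diff above embedded as a string. `label_queries` is a hypothetical helper written for this illustration, not something taken from the DataFusion repository.

```python
# The three lines of extended.sql after this commit, copied from the diff.
EXTENDED_SQL = '''SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM hits;
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
'''


def label_queries(sql_text: str) -> dict[str, str]:
    """Map Q0, Q1, ... to queries, assuming one query per line (hypothetical helper)."""
    lines = sql_text.splitlines()
    # The label index follows the line position: line 1 is Q0, line 2 is Q1, etc.
    return {f"Q{i}": line.strip() for i, line in enumerate(lines) if line.strip()}


labels = label_queries(EXTENDED_SQL)
for name, query in labels.items():
    # Print a shortened form so each label fits on one line.
    print(name, query[:60] + "...")
```

Because the label comes from the line position, a multi-line query in `extended.sql` would shift every later label, which is presumably why the file keeps each query on a single line.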
