@@ -11,23 +11,180 @@ ClickBench is focused on aggregation and filtering performance (though it has no
1111[ClickBench repository]: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql
1212
1313## "Extended" Queries
14- The "extended" queries are not part of the official ClickBench benchmark.
15- Instead they are used to test other DataFusion features that are not
16- covered by the standard benchmark
1714
18- Each description below is for the corresponding line in ` extended.sql ` (line 1
19- is ` Q0 ` , line 2 is ` Q1 ` , etc.)
15+ The "extended" queries are not part of the official ClickBench benchmark.
16+ Instead, they are used to test other DataFusion features that are not covered by
17+ the standard benchmark. Each description below is for the corresponding line in
18+ `extended.sql` (line 1 is `Q0`, line 2 is `Q1`, etc.).
19+
20+ ### Q0: Data Exploration
21+
22+ **Question**: "How many distinct searches, mobile phones, and mobile phone models are there in the dataset?"
23+
24+ **Important Query Properties**: multiple `COUNT DISTINCT`s, with low and high cardinality
25+ distinct string columns.
26+
27+ ```sql
28+ SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
29+ FROM hits;
30+ ```
31+
32+
33+ ### Q1: Data Exploration
34+
35+ **Question**: "How many distinct "hit color", "browser country" and "language" are there in the dataset?"
36+
37+ **Important Query Properties**: multiple `COUNT DISTINCT`s. All three are small strings (length either 1 or 2).
2038
21- ### Q0
22- Models initial Data exploration, to understand some statistics of data.
23- Import Query Properties: multiple ` COUNT DISTINCT ` on strings
2439
2540``` sql
26- SELECT
27- COUNT (DISTINCT " SearchPhrase" ), COUNT (DISTINCT " MobilePhone" ), COUNT (DISTINCT " MobilePhoneModel" )
41+ SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
2842FROM hits;
2943```
3044
45+ ### Q2: Top 10 analysis
3146
47+ **Question**: "Find the top 10 "browser country" by number of distinct "social network"s,
48+ including the distinct counts of "hit color", "browser language",
49+ and "social action"."
3250
51+ **Important Query Properties**: `GROUP BY` on a short string column, multiple `COUNT DISTINCT`s. Several of the columns are small strings (length either 1 or 2).
3352
53+ ```sql
54+ SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
55+ FROM hits
56+ GROUP BY 1
57+ ORDER BY 2 DESC
58+ LIMIT 10;
59+ ```
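
In Q2, `GROUP BY 1` and `ORDER BY 2` refer to columns of the `SELECT` list by position. The sketch below spells those positions out by name. It is an illustration only, not part of `extended.sql`: the alias `distinct_social_networks` is hypothetical, and the query assumes the `hits` table is registered as in the queries above and that the engine accepts a `SELECT` alias in `ORDER BY`.

```sql
-- Same shape as Q2, with the grouping and ordering columns named explicitly.
-- The alias distinct_social_networks is hypothetical, chosen only for readability.
SELECT "BrowserCountry",
       COUNT(DISTINCT "SocialNetwork") AS distinct_social_networks,
       COUNT(DISTINCT "HitColor"),
       COUNT(DISTINCT "BrowserLanguage"),
       COUNT(DISTINCT "SocialAction")
FROM hits
GROUP BY "BrowserCountry"                -- position 1 in the SELECT list
ORDER BY distinct_social_networks DESC   -- position 2 in the SELECT list
LIMIT 10;
```

The positional form keeps the benchmark query on a single short line, while the named form is easier to read when the `SELECT` list changes.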
60+
61+
62+ ## Data Notes
63+
64+ Here are some interesting statistics about the data used in the queries.
65+ The maximum length of `"SearchPhrase"` is 1113 characters:
66+ ```sql
67+ ❯ select min(length("SearchPhrase")) as "SearchPhrase_len_min", max(length("SearchPhrase")) "SearchPhrase_len_max" from 'hits.parquet' limit 10;
68+ +----------------------+----------------------+
69+ | SearchPhrase_len_min | SearchPhrase_len_max |
70+ +----------------------+----------------------+
71+ | 0                    | 1113                 |
72+ +----------------------+----------------------+
73+ ```
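
The same kind of check can be run for the short string columns used in Q1 and Q2, which are described above as having length 1 or 2. This is a sketch under the same assumptions as the query above (a local `hits.parquet` file queried directly); the output column aliases are hypothetical and no expected output is shown.

```sql
-- Minimum and maximum string lengths for the short string columns in Q1 and Q2.
-- The aliases (HitColor_len_min, etc.) are hypothetical names for illustration.
select
  min(length("HitColor"))        as "HitColor_len_min",
  max(length("HitColor"))        as "HitColor_len_max",
  min(length("BrowserCountry"))  as "BrowserCountry_len_min",
  max(length("BrowserCountry"))  as "BrowserCountry_len_max",
  min(length("BrowserLanguage")) as "BrowserLanguage_len_min",
  max(length("BrowserLanguage")) as "BrowserLanguage_len_max"
from 'hits.parquet';
```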
74+
75+
76+ Here is the schema of the data:
77+ ```sql
78+ ❯ describe 'hits.parquet';
79+ +-----------------------+-----------+-------------+
80+ | column_name | data_type | is_nullable |
81+ +-----------------------+-----------+-------------+
82+ | WatchID | Int64 | NO |
83+ | JavaEnable | Int16 | NO |
84+ | Title | Utf8 | NO |
85+ | GoodEvent | Int16 | NO |
86+ | EventTime | Int64 | NO |
87+ | EventDate | UInt16 | NO |
88+ | CounterID | Int32 | NO |
89+ | ClientIP | Int32 | NO |
90+ | RegionID | Int32 | NO |
91+ | UserID | Int64 | NO |
92+ | CounterClass | Int16 | NO |
93+ | OS | Int16 | NO |
94+ | UserAgent | Int16 | NO |
95+ | URL | Utf8 | NO |
96+ | Referer | Utf8 | NO |
97+ | IsRefresh | Int16 | NO |
98+ | RefererCategoryID | Int16 | NO |
99+ | RefererRegionID | Int32 | NO |
100+ | URLCategoryID | Int16 | NO |
101+ | URLRegionID | Int32 | NO |
102+ | ResolutionWidth | Int16 | NO |
103+ | ResolutionHeight | Int16 | NO |
104+ | ResolutionDepth | Int16 | NO |
105+ | FlashMajor | Int16 | NO |
106+ | FlashMinor | Int16 | NO |
107+ | FlashMinor2 | Utf8 | NO |
108+ | NetMajor | Int16 | NO |
109+ | NetMinor | Int16 | NO |
110+ | UserAgentMajor | Int16 | NO |
111+ | UserAgentMinor | Utf8 | NO |
112+ | CookieEnable | Int16 | NO |
113+ | JavascriptEnable | Int16 | NO |
114+ | IsMobile | Int16 | NO |
115+ | MobilePhone | Int16 | NO |
116+ | MobilePhoneModel | Utf8 | NO |
117+ | Params | Utf8 | NO |
118+ | IPNetworkID | Int32 | NO |
119+ | TraficSourceID | Int16 | NO |
120+ | SearchEngineID | Int16 | NO |
121+ | SearchPhrase | Utf8 | NO |
122+ | AdvEngineID | Int16 | NO |
123+ | IsArtifical | Int16 | NO |
124+ | WindowClientWidth | Int16 | NO |
125+ | WindowClientHeight | Int16 | NO |
126+ | ClientTimeZone | Int16 | NO |
127+ | ClientEventTime | Int64 | NO |
128+ | SilverlightVersion1 | Int16 | NO |
129+ | SilverlightVersion2 | Int16 | NO |
130+ | SilverlightVersion3 | Int32 | NO |
131+ | SilverlightVersion4 | Int16 | NO |
132+ | PageCharset | Utf8 | NO |
133+ | CodeVersion | Int32 | NO |
134+ | IsLink | Int16 | NO |
135+ | IsDownload | Int16 | NO |
136+ | IsNotBounce | Int16 | NO |
137+ | FUniqID | Int64 | NO |
138+ | OriginalURL | Utf8 | NO |
139+ | HID | Int32 | NO |
140+ | IsOldCounter | Int16 | NO |
141+ | IsEvent | Int16 | NO |
142+ | IsParameter | Int16 | NO |
143+ | DontCountHits | Int16 | NO |
144+ | WithHash | Int16 | NO |
145+ | HitColor | Utf8 | NO |
146+ | LocalEventTime | Int64 | NO |
147+ | Age | Int16 | NO |
148+ | Sex | Int16 | NO |
149+ | Income | Int16 | NO |
150+ | Interests | Int16 | NO |
151+ | Robotness | Int16 | NO |
152+ | RemoteIP | Int32 | NO |
153+ | WindowName | Int32 | NO |
154+ | OpenerName | Int32 | NO |
155+ | HistoryLength | Int16 | NO |
156+ | BrowserLanguage | Utf8 | NO |
157+ | BrowserCountry | Utf8 | NO |
158+ | SocialNetwork | Utf8 | NO |
159+ | SocialAction | Utf8 | NO |
160+ | HTTPError | Int16 | NO |
161+ | SendTiming | Int32 | NO |
162+ | DNSTiming | Int32 | NO |
163+ | ConnectTiming | Int32 | NO |
164+ | ResponseStartTiming | Int32 | NO |
165+ | ResponseEndTiming | Int32 | NO |
166+ | FetchTiming | Int32 | NO |
167+ | SocialSourceNetworkID | Int16 | NO |
168+ | SocialSourcePage | Utf8 | NO |
169+ | ParamPrice | Int64 | NO |
170+ | ParamOrderID | Utf8 | NO |
171+ | ParamCurrency | Utf8 | NO |
172+ | ParamCurrencyID | Int16 | NO |
173+ | OpenstatServiceName | Utf8 | NO |
174+ | OpenstatCampaignID | Utf8 | NO |
175+ | OpenstatAdID | Utf8 | NO |
176+ | OpenstatSourceID | Utf8 | NO |
177+ | UTMSource | Utf8 | NO |
178+ | UTMMedium | Utf8 | NO |
179+ | UTMCampaign | Utf8 | NO |
180+ | UTMContent | Utf8 | NO |
181+ | UTMTerm | Utf8 | NO |
182+ | FromTag | Utf8 | NO |
183+ | HasGCLID | Int16 | NO |
184+ | RefererHash | Int64 | NO |
185+ | URLHash | Int64 | NO |
186+ | CLID | Int32 | NO |
187+ +-----------------------+-----------+-------------+
188+ 105 rows in set. Query took 0.034 seconds.
189+
190+ ```