
Commit b741910

Design: Clustering in SQLflow (#737)
* Fix executor test
* Design: Clustering in SQLflow
* fix: Design of Clustering in SQLflow
* cluster_model_train_overview.png
* fix 2.0 Design: Clustering in SQLflow
* fix2.0 Design: Clustering in SQLflow
* fix3.0 Design: Clustering in SQLflow
* modify cluster_model_train_overview.png
1 parent 9f9bba7 commit b741910

2 files changed: +146 −0 lines changed

doc/cluster_design.md

Lines changed: 146 additions & 0 deletions
# Design: Clustering in SQLFlow to analyze patterns in data

## ClusterModel introduction

Most of the time, when business people and analysts work with data, they need not only supervised learning models for classification and prediction, but also unsupervised learning to uncover hidden patterns. Unsupervised learning helps analysts draw inferences from datasets consisting of input data without labeled responses, for example grouping users by their behavioral characteristics.

This design document describes how to support a `Cluster Model` in SQLFlow.

The figure below shows the overall workflow for ClusterModel training, which includes both the pre-trained autoencoder model and the clustering model.

<img src="figures/cluster_model_train_overview.png">

1. The first part loads a pre-trained model. We use the output of the trained encoder layer as the input to the clustering model.
2. The clustering model then starts training with randomly initialized weights and generates clusters after multiple iterations.
3. The overall training process ultimately outputs an unsupervised clustering model (a minimal sketch of this two-stage setup follows below).
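
The following is a minimal Keras sketch of this two-stage workflow. The file path, the layer name `encoder_output`, and the plain softmax clustering head are illustrative assumptions for the sketch, not the actual SQLFlow implementation.

```python
import tensorflow as tf

# Stage 1: load a pre-trained autoencoder and keep only its encoder part
# (the path and the layer name "encoder_output" are assumptions for this sketch).
autoencoder = tf.keras.models.load_model("/tmp/ae_pretrain.h5")
encoder = tf.keras.Model(inputs=autoencoder.input,
                         outputs=autoencoder.get_layer("encoder_output").output)

# Stage 2: stack a randomly initialized clustering head on the encoder output and
# refine it over multiple iterations; a Dense softmax stands in for the clustering layer.
n_clusters = 5
cluster_head = tf.keras.layers.Dense(n_clusters, activation="softmax", name="clusters")
cluster_model = tf.keras.Model(inputs=encoder.input, outputs=cluster_head(encoder.output))
cluster_model.compile(optimizer="sgd", loss="kld")  # KL divergence against a target distribution
```
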
## How to implement ClusterModel in SQLFlow

### User interface in SQLFlow

In this scenario, we focus on extracting data patterns with unsupervised learning.

The user can use the `TRAIN` keyword to train a model, specify the training hyper-parameters with the keyword `WITH`, and decide whether to use a pre-trained model with `USING`. The training and prediction syntax looks like:

TRAIN SQL:

```sql
SELECT * FROM input_table
TRAIN clusterModel
WITH
  model.encode_units = [100, 7]
  model.n_clusters = 5
  model.run_pretrain = false
COLUMN m1, m2, m3, m4, m5, m6, m7, m8, m9, m10
USING existed_pretrain_model
INTO my_cluster_model;
```

PREDICT SQL:

```sql
SELECT *
FROM input_table
PREDICT output_table
USING my_cluster_model;
```

where:

- `input_table` is the high-dimensional table to be clustered.
- `model.encode_units` is the list of encoder units for the autoencoder model; the decoder units are simply `encode_units` reversed (see the sketch after this list).
- `model.n_clusters` is the number of patterns (clusters) after clustering.
- `my_cluster_model` is the trained cluster model.
- `model.run_pretrain` determines whether the autoencoder pre-training needs to be run; it defaults to true.
- `existed_pretrain_model` specifies an existing pre-trained model.
- `output_table` is the clustering result for `input_table`: it is `input_table` with an added `group_id` column predicted by the cluster model. The `group_id` is the category label predicted by the cluster model.
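
As a rough illustration of how `model.encode_units = [100, 7]` could expand into a symmetric autoencoder, consider the sketch below; the helper name, activations, and loss are assumptions for the sketch, not the actual SQLFlow code.

```python
import tensorflow as tf

def build_autoencoder(input_dim, encode_units):
    # Encoder: input_dim -> 100 -> 7; the decoder mirrors it back: 7 -> 100 -> input_dim.
    decode_units = list(reversed(encode_units[:-1])) + [input_dim]
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for units in encode_units:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    encoded = x  # the encoder output fed into the clustering model
    for i, units in enumerate(decode_units):
        activation = "relu" if i < len(decode_units) - 1 else None
        x = tf.keras.layers.Dense(units, activation=activation)(x)
    return tf.keras.Model(inputs, x), tf.keras.Model(inputs, encoded)

# 10 input columns (m1 ... m10), matching the TRAIN statement above.
autoencoder, encoder = build_autoencoder(input_dim=10, encode_units=[100, 7])
autoencoder.compile(optimizer="adam", loss="mse")
```
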
### Code Details

- sqlflow_models/clusterModel.py

```python
import tensorflow as tf


class clusterModel(tf.keras.Model):
    def pre_train(self, dataset):
        # Pre-train the autoencoder and save the pre-trained model for later reuse.
        ...
        self.autoencoder.fit(dataset)
        pretrainmodel.save("/tmp/ae_pretrain.h5")

    def target_distribution(self, q):
        ...

    def cluster_train_loop(self):
        for ite in range(int(maxiter)):
            if ite % update_interval == 0:
                q = model.predict(x, verbose=0)
                p = self.target_distribution(q)  # update the auxiliary target distribution p
                y_pred = q.argmax(1)
            idx = index_array[index * batch_size: min((index + 1) * batch_size, x.shape[0])]
            loss = model.train_on_batch(x=x[idx], y=p[idx])
            index = index + 1 if (index + 1) * batch_size <= x.shape[0] else 0
```
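
The body of `target_distribution` is elided above. A common choice for this kind of training loop is the auxiliary target distribution from deep embedded clustering (DEC), which sharpens the soft assignments `q`; the version below is a sketch of that standard formulation, not necessarily the SQLFlow implementation.

```python
import numpy as np

def target_distribution(q):
    # q: (n_samples, n_clusters) soft assignments; sharpen them and renormalize per sample
    # so that high-confidence assignments are emphasized during training.
    weight = q ** 2 / q.sum(axis=0)           # q_ij^2 / f_j, where f_j is the soft cluster frequency
    return (weight.T / weight.sum(axis=1)).T  # each row sums to 1
```
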
- template_tf.go (the Python training code generated from this Go template)

```python
if hasattr(classifier, 'pre_train'):
    classifier.pre_train(...)
if hasattr(classifier, 'cluster_train_loop'):
    classifier.cluster_train_loop(...)
```

## Note

The user can choose whether to run pre-training before training the cluster model, i.e. `run_pretrain=true`, and can also choose to load an already trained model via `existed_pretrain_model`.

Therefore, there are four cases in total:

1. model.run_pretrain = true & the user does not use the `USING` keyword:

   Run the autoencoder pre-training + randomly initialize the weights of the clustering model. (Note that `model.encode_units` "does work" in this case.)

2. model.run_pretrain = true & using existed_pretrain_model:

   Use existed_pretrain_model as the pre-trained model + randomly initialize the weights of the clustering model. (Note that `model.encode_units` "does not work" in this case.)

3. model.run_pretrain = false & the user does not use the `USING` keyword:

   Randomly initialize the weights of the clustering model. (Note that `model.encode_units` "does not work" in this case.)

4. model.run_pretrain = false & using existed_pretrain_model:

   Use existed_pretrain_model as the pre-trained model + randomly initialize the weights of the clustering model. (Note that `model.encode_units` "does not work" in this case.)

- Users can use the trained cluster model in `PREDICT SQL` to predict the group of each row in `input_table` and obtain `output_table`.

- Finally, the user can run an aggregation query on `output_table` to obtain a `result_table`, which can be saved to a local dataframe and then analyzed according to the user's own needs.

For example, analysts often compare the mean of each feature across the groups of users, which helps them understand how behavioral characteristics differ between groups:

```mysql
%%sqlflow
select
    group_id
    , avg(m1) as avgm1
    , avg(m2) as avgm2
    , avg(m3) as avgm3
    , avg(m4) as avgm4
    , avg(m5) as avgm5
    , avg(m6) as avgm6
    , avg(m7) as avgm7
    , avg(m8) as avgm8
    , avg(m9) as avgm9
    , avg(m10) as avgm10
from output_table
group by group_id
```

```python
_.to_dataframes(result_table)
```

- An example of `result_table`:

| group_id | m1 | m2 | m3 | m4 | m5 | m6 | m7 | m8 | m9 | m10 |
|---------|------|------|------|------|------|------|------|------|------|------|
| 0 | 0.017| 0.015| 0.013| 0.012| 0.01 | 0.01 | 0.009| 0.008| 0.008| 0.008|
| 1 | 0.195| 0.173| 0.154| 0.138| 0.124| 0.111| 0.1 | 0.091| 0.083| 0.076|
| 2 | 0.014| 0.012| 0.011| 0.01 | 0.009| 0.008| 0.007| 0.005| 0.005| 0.004|
| 3 | 0.005| 0.003| 0.003| 0.002| 0.001| 0.001| 0.001| 0.0 | 0.0 | 0.0 |
| 4 | 0.311| 0.291| 0.274| 0.257| 0.24 | 0.224| 0.209| 0.196| 0.185| 0.175|
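
Once the aggregated result is loaded into a local dataframe, the per-group feature means can be compared directly. A minimal sketch, assuming the dataframe is named `result_df` and has the columns shown in the example table:

```python
# result_df: one row per group_id, one column per per-group feature mean (m1 ... m10).
result_df = result_df.set_index("group_id")

# Rank the groups by their overall mean activity to see which cluster stands out.
print(result_df.mean(axis=1).sort_values(ascending=False))

# Contrast two specific groups feature by feature (groups 4 and 3 from the example).
print(result_df.loc[4] - result_df.loc[3])
```
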
figures/cluster_model_train_overview.png — 193 KB (binary image, not shown)
