# _Design:_ Intermediate Representation

## Overview

As SQLFlow supports more and more machine learning toolkits, the corresponding code generation logic is better organized as separate packages. An intermediate representation (IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package.

The core `sql` package should include the following functionalities:
1. The entry point of running extended SQL statements.
1. The [parsing](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/sql_parser.md) of extended SQL statements.
1. The verification of extended SQL statements, including verifying the syntax and the existence of the selected fields.
1. The [feature derivation](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/feature_derivation.md), including the name, type, shape, and preprocessing method of the selected fields.
1. The [training data and validation data split](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/training_and_validation.md).

With these functionalities, the `sql` package can translate a user-typed extended SQL statement into an IR exposed as a Go struct. The codegen package takes the IR and returns a generated Python program for the `sql` package to execute.
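To make the intended flow concrete, here is a minimal sketch of how the `sql` package might glue these steps together. All helpers called below (`parse`, `verify`, `deriveFeatures`, `splitTrainingAndValidation`, `buildTrainIR`, `executeProgram`) are hypothetical names used only for illustration; the `tensorflow.Train` call refers to the codegen API proposed in the next section.

```go
// A hypothetical sketch of the training flow inside the `sql` package.
// None of the helpers below exist yet; they only name the steps listed above.
func runTrainStatement(dataSource, statement string) error {
	parsed, err := parse(statement) // parse the extended SQL statement
	if err != nil {
		return err
	}
	// verify the syntax and the existence of the selected fields
	if err := verify(dataSource, parsed); err != nil {
		return err
	}
	// feature derivation: name, type, shape, and preprocessing of the selected fields
	if err := deriveFeatures(dataSource, parsed); err != nil {
		return err
	}
	// split the data into training and validation selects
	splitTrainingAndValidation(parsed)
	// assemble the IR and hand it to a codegen package (see Code Structure below)
	ir := buildTrainIR(dataSource, parsed)
	program, err := tensorflow.Train(ir)
	if err != nil {
		return err
	}
	return executeProgram(program) // the `sql` package executes the generated Python
}
```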
## Code Structure

We propose the following code structure.

```
sql/
  ...
  codegen/
    tensorflow/
      train.go
      predict.go
      analyze.go
    xgboost/
      ...
```

The `tensorflow` package will expose a function `func Train(ir sql.TrainIR) (string, error)`, which takes the `sql` package's `TrainIR` and returns a generated Python program.
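As an illustration of this contract, `train.go` might render the IR into a Python program with `text/template`. The template text below is a placeholder, and the import path `github.com/sql-machine-learning/sqlflow/sql` for the IR types is an assumption; only the `Train(ir sql.TrainIR) (string, error)` signature comes from this proposal.

```go
// A minimal sketch of sql/codegen/tensorflow/train.go. The template text is a
// placeholder; a real generator would also render feature columns, attributes,
// and the validation select.
package tensorflow

import (
	"bytes"
	"text/template"

	"github.com/sql-machine-learning/sqlflow/sql"
)

var trainTemplate = template.Must(template.New("train").Parse(`
# generated by SQLFlow -- do not edit
estimator_name = "{{.Estimator}}"
train_select = """{{.Select}}"""
validation_select = """{{.ValidationSelect}}"""
# ... construct the tf.estimator, its input_fn, then train and save the model ...
`))

// Train takes the sql package's TrainIR and returns a generated Python program.
func Train(ir sql.TrainIR) (string, error) {
	var program bytes.Buffer
	if err := trainTemplate.Execute(&program, ir); err != nil {
		return "", err
	}
	return program.String(), nil
}
```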
## Intermediate Representation

We propose the following structs as the IR for code generation.

```go
package sql

import (
	"github.com/sql-machine-learning/sqlflow/sql/columns"
)

type FieldType int

const (
	Int FieldType = iota
	Float
	String
)

// FieldMeta contains the meta information for decoding and feature columns
type FieldMeta struct {
	DType         FieldType               // e.g. "float", "int32"
	Delimiter     string                  // e.g. ","
	Shape         []int                   // e.g. [1], [1 2 3]
	IsSparse      bool                    // e.g. false
	FeatureColumn []columns.FeatureColumn // e.g. [EmbeddingColumn, CategoryIDColumn]
}

// TrainIR is the intermediate representation for code generation of a training job
type TrainIR struct {
	DataSource       string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select           string                          // e.g. "select * from iris.train"
	ValidationSelect string                          // e.g. "select * from iris.val;"
	Estimator        string                          // e.g. "DNNClassifier"
	Attribute        map[string]interface{}          // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
	Feature          map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
	Label            map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
}

// PredictIR is the intermediate representation for code generation of a prediction job
type PredictIR struct {
	DataSource  string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select      string                          // e.g. "select * from iris.test"
	Estimator   string                          // e.g. "DNNClassifier"
	Attribute   map[string]interface{}          // e.g. {"predict.batch_size": 32}
	Feature     map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
	Label       map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
	ResultTable string                          // e.g. "iris.predict"
}

// AnalyzeIR is the intermediate representation for code generation of an analysis job
type AnalyzeIR struct {
	DataSource string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select     string                          // e.g. "select * from iris.train"
	Estimator  string                          // e.g. "DNNClassifier"
	Attribute  map[string]interface{}          // e.g. {"analyze.plot_type": "bar"}
	Feature    map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
	Label      map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
}
```
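For concreteness, a hypothetical `TrainIR` value for an Iris `DNNClassifier` training job might look as follows. The data source string and the exact attribute values are made up to match the examples in the comments above, and `FeatureColumn` entries are omitted for brevity.

```go
// A hypothetical TrainIR for an Iris DNNClassifier job; in practice the
// struct is filled in by parsing, verification, and feature derivation.
var exampleTrainIR = TrainIR{
	DataSource:       "hive://root:root@localhost:10000/iris", // made-up data source
	Select:           "select * from iris.train",
	ValidationSelect: "select * from iris.val",
	Estimator:        "DNNClassifier",
	Attribute: map[string]interface{}{
		"train.epoch":        1000,
		"model.hidden_units": []int{10, 10},
	},
	Feature: map[string]map[string]FieldMeta{
		"feature_columns": {
			"sepal_length": {DType: Float, Delimiter: "", Shape: []int{1}, IsSparse: false},
			// ... the remaining iris features ...
		},
	},
	Label: map[string]FieldMeta{
		"class": {DType: Int, Delimiter: "", Shape: []int{1}, IsSparse: false},
	},
}
```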
Please be aware that all the IRs exclude information about the current working directory. This information belongs to the `executor` in the `sql` package. For a prediction/analysis job, the `executor` should recover everything produced by the training job.

Please be aware that `TrainIR` also excludes the name of the table into which the trained model is saved. This information likewise belongs to the `executor` in the `sql` package.
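To make this boundary explicit, the executor might keep such job-local state itself, roughly along the lines below; the struct and field names are purely illustrative.

```go
// Illustrative only: state the executor keeps outside of the IRs, so that
// codegen stays independent of where models are written on disk or in the
// database.
type executorContext struct {
	cwd            string // per-job working directory; recreated for prediction/analysis jobs
	modelSaveTable string // table name used to save (and later restore) the trained model
}
```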