Skip to content

Conversation

@tonyyang-svail
Copy link
Collaborator

No description provided.

@tonyyang-svail tonyyang-svail changed the title [wip] Intermediate Representation [Design] Intermediate Representation Sep 6, 2019
Copy link
Collaborator

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree to have IR now, we still need to look into the details of the IR struct.

ExtraSelect map[string]string // e.g. {"validation": "select * from iris.val;"}
Estimator string // e.g. "DNNClassifier"
Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature only have information about how to parse the column data, we still need information about how to construct a feature column on it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I added FeatureColumn field in FieldMeta.

Select string // e.g. "select * from iris.train"
ExtraSelect map[string]string // e.g. {"validation": "select * from iris.val;"}
Estimator string // e.g. "DNNClassifier"
Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#774 here's some of my thoughts about refactoring attribute resolution.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@typhoonzero Thanks for the link. I think Attribute map[string]interface{} here is fine.

I believe AttrMeta belong to the code generation package since different machine learning toolkits accept different attributes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe AttrMeta belong to the code generation package since different machine learning toolkits accept different attributes

It is. Each submitter's code generation package define it's AttrMeta and call a function to get an Attribute map[string]interface{}.

ExtraConfig string // Extra configuration in JSON format. e.g. OSS credential
Select string // e.g. "select * from iris.train"
Estimator string // e.g. "DNNClassifier"
Attribute map[string]interface{} // e.g. {"predict.batch_size": 32}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attribute(s)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think singular is fine. The map type already indicates there are multiple attributes. The same reasoning applies to Feature.

IsSparse bool // e.g. false
}

// TrainIR is the intermediate representation for code generation of a training job
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need to implement an IR for each job, how about simplifying like:

type FeatureMeta struct {
    DType string
    Delimiter string 
    ...
}
type DBConn struct {
    Driver string
    User string
    ....
}
type ClauseIR struct {
    Estimator string
    SelectClause string
    Attributes map[string]interface{}
    DBConn DBConn
    Features map[string]FeatureMeta
    ...
}

Each generator can extend the ClauseIR as needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yancey1989 Thanks for the suggestion. Combining all three IRs to a single ClauseIR does save some code. However, I still advocate using separate IRs for different job types. Here is my reasoning.

  • Avoid confusion. The developer of xgboost.Predict would be confused by the ValidationSelect field in ClauseIR. Also, as we adding more features to SQLFlow, more fields would be added to CluaseIR, and the confusion will increase.
  • Less work. We either distinguish the job type in sql or in codegen. However, there are many codegens and only one sql. Distinguishing the job type in sql saves works in all codegens.

Copy link
Collaborator

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally

@weiguoz weiguoz self-requested a review September 9, 2019 00:57
Copy link
Collaborator

@weiguoz weiguoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@Yancey0623 Yancey0623 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants