@typhoonzero typhoonzero commented Aug 14, 2020

When training on PAI, the generated training code calls runtime.pai.tensorflow.train, which in turn calls oss.save_oss_model to save the model and the model meta-data to OSS. However, we also generate code from tfSaveModelTmplText, which saves the model meta-data a second time:

const tfSaveModelTmplText = tfImportsText + `
import types
estimator = import_model('''{{.Estimator}}''')
is_estimator = is_tf_estimator(estimator)
# Keras single node is using h5 format to save the model, no need to deal with export model format.
# Keras distributed mode will use estimator, so this is also needed.
FLAGS = tf.app.flags.FLAGS
if is_estimator:
    if FLAGS.task_index == 0:
        with open("exported_path", "r") as fn:
            saved_model_path = fn.read()
        oss.save_dir("{{.OSSModelDir}}", saved_model_path)
        oss.save_file("{{.OSSModelDir}}", "exported_path")
else:
    if len(FLAGS.worker_hosts.split(",")) > 1:
        if FLAGS.task_index == 0:
            oss.save_file("{{.OSSModelDir}}", "exported_path")
    else:
        oss.save_file("{{.OSSModelDir}}", "model_save")
oss.save_metas("{{.OSSModelDir}}",
               {{.NumWorkers}},
               "tensorflow_model_desc",
               "{{.Estimator}}",
               feature_column_names,
               feature_column_names_map,
               feature_metas,
               label_meta,
               model_params,
               feature_columns_code)
`

We do not need to save the meta-data twice.
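
The branching in the template above can be read as a small decision table. The following is only a sketch: `artifacts_to_upload` is a hypothetical helper that does not exist in SQLFlow, the `"<saved_model_dir>"` string is a placeholder for the path read from `exported_path`, and the sketch assumes the second `else` pairs with the worker-count check.

```python
def artifacts_to_upload(is_estimator, worker_hosts, task_index):
    """Return the local artifacts one worker would upload to OSS.

    Hypothetical helper mirroring the template's branches; it performs
    no I/O and only names what would be uploaded.
    """
    num_workers = len(worker_hosts.split(","))
    if is_estimator:
        # Estimator models: only the chief worker (task_index 0) uploads
        # the exported SavedModel directory and the "exported_path" file.
        if task_index == 0:
            return ["<saved_model_dir>", "exported_path"]
        return []
    if num_workers > 1:
        # Distributed Keras runs through an estimator, so again only the
        # chief worker uploads "exported_path".
        return ["exported_path"] if task_index == 0 else []
    # Single-node Keras saves the model locally as "model_save".
    return ["model_save"]


chief = artifacts_to_upload(True, "host0:2222,host1:2222", 0)
other = artifacts_to_upload(True, "host0:2222,host1:2222", 1)
single_keras = artifacts_to_upload(False, "host0:2222", 0)
```

Note that none of these branches touch the meta-data; in the template, `oss.save_metas` runs unconditionally afterwards, which is the duplicate save described here.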

In addition, we need to save the model meta-data as Python code so that the feature column and optimizer information is preserved. However, runtime.pai.tensorflow.train tries to save model_params after it has been converted to objects, which causes the error below. We need to save model_params before it is converted to objects.

  File "/tensorflow_jobs/runtime/pai/tensorflow/train.py", line 123, in train
    feature_columns_code, num_workers)
  File "/tensorflow_jobs/runtime/model/oss.py", line 292, in save_oss_model
    label_meta, model_params, feature_columns_code)
  File "/tensorflow_jobs/runtime/model/oss.py", line 232, in save_metas
    serialized = pickle.dumps(list(meta))
  ...
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle SwigPyObject objects
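
The failure mode in this traceback can be reproduced in isolation. The sketch below is an illustration under stated assumptions, not SQLFlow code: `NativeHandle` is a hypothetical stand-in for an object that wraps a native resource (here a `threading.Lock`, which Python refuses to pickle), and the parameter names and values are made up. It suggests why pickling the raw parameters before conversion succeeds while pickling the converted objects does not.

```python
import pickle
import threading


class NativeHandle:
    """Hypothetical stand-in for an object wrapping a native resource."""

    def __init__(self):
        # Lock objects cannot be pickled, much like the SWIG-wrapped
        # objects named in the traceback.
        self._lock = threading.Lock()


def try_pickle(meta):
    """Pickle the meta list, returning the TypeError instead of raising."""
    try:
        return pickle.dumps(list(meta))
    except TypeError as err:
        return err


# Raw parameters as plain strings (hypothetical values): picklable.
raw_params = {"optimizer": "Adam", "learning_rate": "0.01"}
# The same parameters after conversion to live objects: not picklable.
converted_params = {"optimizer": NativeHandle()}

raw_result = try_pickle([raw_params])
converted_result = try_pickle([converted_params])
```

Here `raw_result` holds the pickled bytes, while `converted_result` holds the `TypeError` raised by `pickle.dumps`.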

We should polish this further after this fix is merged.

@sneaxiy sneaxiy left a comment

I don't quite understand what this PR does. Could you explain why this PR is needed?

@sneaxiy sneaxiy left a comment

LGTM.

@typhoonzero typhoonzero merged commit a2871a3 into sql-machine-learning:develop Aug 14, 2020
@typhoonzero typhoonzero deleted the fix_pai_training_with_optimizer_save_model branch August 14, 2020 08:46
sneaxiy added a commit that referenced this pull request Aug 14, 2020
* Update OptFlow FSL generation code when len(variables) == 2 (#2778)

* change optflow api

* polish

* add more ut

* alps submitter and codegen (#2771)

* test alps submitter

* add alps codegen

* update

* Support download model to local in cli (#2779)

* support download model to local in cli

* update

* update

* update

* Add query api to db.py (#2782)

* Add query to db.py

* change delete with truncate

* Set EnableWindowFunc to be true by default for TiDB parser. (#2786)

* Make the attribute check for XGBoost model compatible with reg:linear

* set EnableWindowFunc to true by default for TiDB parser.

* fix Tensorflow -> TensorFlow (#2783)

* Add optimization guide doc (#2785)

* add optimization guide doc

* polish according to comments

* Generate workflow using runtime (#2784)

* WIP generate workflow using runtime

* wip update

* update

* update

* fix hive ci

* DB api base class (#2787)

* Add query to db.py

* change delete with truncate

* DB interface base class

* add to_dict and from_dict method (#2792)

* Fix develop jupyter image build (#2790)

* fix develop jupyter image build

* update

* Generate workflow step code using runtime fea der (#2791)

* WIP generate workflow step code using runtime fea der

* tested local

* update

* update

* update

* fix tests

* Add MySQL db-api implementation (#2793)

* Add query to db.py

* change delete with truncate

* DB interface base class

* Add MySQL db-api implementation

* remove unused import

* fix actions maxcompute test not running (#2795)

* Enable flake8 check on CI (#2788)

* test ci

* test again

* update

* update and fix

* fix travis ci env

* generate python feacol code (#2797)

* Add hive DB-API (#2798)

* Add query to db.py

* change delete with truncate

* DB interface base class

* Add MySQL db-api implementation

* remove unused import

* polish mysql db-api

* Add hive DB-API

* modify doc

* format code

* modify cora dataset to adapt csv format (#2780)

* Add json dump and load support for FeatureColumn (#2794)

* add json dump load support

* update vocabulary type

* update

* update

* update

* Add maxcompute DB-API (#2801)

* Add maxcompute DB-API

* remove unused import

* format code

* Push images on self hosted machine (#2799)

* push images on self hosted machine

* update

* update

* update

* update

* test install.sh

* fix go mirrors

* clean up

* add clean up

* update clean up script

* fix pai xgboost package deps (#2803)

* Simplify TO RUN command - use filename instead of absolute path for the executable or script program (#2804)

* Make the attribute check for XGBoost model compatible with reg:linear

* Derive the absolute path of the runnable program if users just input a file name.

* Use python -m command to invoke the TO RUN statement in default submitter.

* Move getRunnableProgramAbsPath to alisa.go

* Polish DB-API code, export unified connect function from package. (#2808)

* Add maxcompute DB-API

* remove unused import

* format code

* polish db-api

* add solved y to optimize (#2810)

* Generate couler code of workflow steps (#2806)

* wip

* fix yaml generate

* fix tests

* fix package deps

* fix pip package deps

* update

* Refine metadata collect and save/load (#2807)

* move and refine metadata

* fix ci ut

* fix ut

* follow lhw comment

* Adapt paiio with DB-API (#2809)

* Add maxcompute DB-API

* remove unused import

* format code

* polish db-api

* Adapt paiio with DB-API

* Adapt paiio with DB-API

* add try import paiio

* fix typo

* disable actions maxcompute test (#2814)

* make constraint optional (#2812)

* fix typo (#2820)

* Install BARON solver in Docker image (#2811)

* install baron solver in Docker image

* polish

* add pyomo baron into step docker image

* Polish DB-API to support Python2 so can run on PAI (#2815)

* polish db-api to support Python2 so can run on PAI

* enable unittest for hive db-api

* switch to github actions (#2818)

* Add experimental workflow end2end test (#2813)

* add experimental workflow end2end test

* fix workflow ci env

* update test code

* pull latest step before running workflow

* Add  Model.save_to_oss and Model.load_from_oss (#2817)

* add save_to_oss/load_from_oss

* change pickle protocol

* add more explanations on oss_model_dir doc

* fix ut

* Fix relative importing cause error (#2823)

* fix relative importing cause error

* clean up

* Use unified DB-API in codebase (#2821)

* Add maxcompute DB-API

* remove unused import

* format code

* polish db-api

* Adapt paiio with DB-API

* Adapt paiio with DB-API

* add try import paiio

* use db-api in old code

* DB-API support Python2 so can run on PAI

* polish db-api to support Python2 so can run on PAI

* polish db-api to support Python2 so can run on PAI

* polish db-api to support Python2 so can run on PAI

* Use unified DB-API in codebase.

* Use unified DB-API in codebase.

* polish code

* remove debug info

* fix ut

* Generate workflow step for normal statement run (#2824)

* generate workflow step for normal statement run

* clean up

* build step image before run workflow test

* fix is_query

* Fix pai training with optimizer config (#2828)

* fix pai training with optimizer config

* remove template

* Save the trained xgboost model (#2822)

* save trained xgboost model

* fix flake8 check

* fix ut

* fix ut

* fix workflow ut

* fix cwd error

Co-authored-by: Wu Yi <typhoonzero1986@gmail.com>
Co-authored-by: HongwuLin <lhw362950217@sina.com>
Co-authored-by: brightcoder01 <55301748+brightcoder01@users.noreply.github.com>
2 participants