Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoML: Add a TablesClient for automl-tables specific behavior. #8720

Merged
merged 11 commits into from
Aug 14, 2019
Merged

AutoML: Add a TablesClient for automl-tables specific behavior. #8720

merged 11 commits into from
Aug 14, 2019

Conversation

lwander
Copy link
Contributor

@lwander lwander commented Jul 19, 2019

AutoML tables is heavily used by datascientists in jupyter notebooks. Based on feedback we've gotten, the existing client is hard to use, obtuse, not notebook-friendly. This wrapper client aims to fix that.

Using the client (see integration tests for more detailed snippets):

from google.cloud import automl_v1beta1

client = automl_v1beta1.TablesClient(project='my-project-123')

client.list_datasets()

dataset = client.create_dataset('my_dataset')
client.import_data(dataset=dataset, gcs_input_uris='gs://my-data/data.csv').result()
client.set_target_column(dataset=dataset, column_spec_display_name='output')

model = client.create_model(dataset=dataset, train_budget_milli_node_hours=1000).result()

Testing

Most testing is done in unit tests, however we have some integration tests covering the common user flows as well (create dataset -> import data -> train -> predict). To speed up the testing, we create resources the first time tests are run, and run tests off of those in subsequent runs. This mitigates the minimum 1h+ model creation & deployment time.

Future work

In no particular order, we plan to add:

  • IPython integration to pretty print objects
  • Matplotlib integration to show dataset & model stats
  • Pandas/numpy integration for data import & batch predict

Sorry for the large change, most of it is doc strings and unit tests. We were recommended to check in the full basic functionality first.

Lars Wander and others added 5 commits July 19, 2019 16:32
Additional test, docs & proposed cleanup needs to happen on top of this.
#16)

* update create_model to allow user to specify included or excluded columns

* made minor changes stylistically and with added ValueError outputs
* added two new func: set time, get table address

* changed indentation
@lwander lwander requested a review from busunkim96 as a code owner July 19, 2019 20:42
@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added the cla: no This human has *not* signed the Contributor License Agreement. label Jul 19, 2019
@lwander
Copy link
Contributor Author

lwander commented Jul 19, 2019

@jonathan1920 you need to sign the CLA, or resign the commits with the correct @google.com address

@beccasaurus
Copy link
Contributor

@merla18 Please take a look.

At a high level, it looks like this custom layer does the following:

  • Instantiate client with Project ID so the client knows the Project ID
  • Treat the resources' Display Names as a shorthand identifier
  • CRUD operations can accept any of the following to identify a resource:
    • Full name (projects/X/locations/Y/resource_type/ID)
    • The resource's ID only (project ID stored on client)
    • The resource's Display Name
    • Python pb object representing the resource, itself
  • This is branded for Tables but most of the functionality would work across the board for AutoML
    • The identifier shorthand functionality would also work for any resourceful GCP API

Project ID on client

IIRC there have been proposals over the years for storing project on the client.

As this manual layer implements this feature, I'm wondering: is this a feature that has any active efforts for GAX/core?

Calling RPC methods with shorthand for resource names/identifiers

Are there any other Python Google Cloud Client Libraries that implement something like this?

I'm a little worried about consistency. When learning GCP APIs, part of the learning curve is resources (their "names" which are fully qualified identifiers -vs- display names which many resources have -vs- identifiers which are often server-side generated but sometimes client provided).

This layer simplifies working with resources a lot, it looks very elegant. But learning this interface might confuse developers when they use this interface and then try using any of the other GCP API Python Cloud Client Libraries.

Looks elegant! I'd like to find out more, though. Especially because this layer simplifies working with GCP API resources in general, which I love.


/cc @sirtorry @andrewferlitsch @dizcology @lukesneeringer @crwilcox @JustinBeckwith

@andrewferlitsch
Copy link
Contributor

andrewferlitsch commented Jul 19, 2019 via email

@busunkim96 busunkim96 added the api: automl Issues related to the AutoML API. label Jul 19, 2019
@busunkim96 busunkim96 changed the title Adds a TablesClient for automl-tables specific behavior. AutoML: Adds a TablesClient for automl-tables specific behavior. Jul 19, 2019
@busunkim96 busunkim96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 19, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 19, 2019
@andrewferlitsch
Copy link
Contributor

@beccasaurus I recommend we do a review and see the best "wrapper" approach that can be applied uniformly across all the APIs.

For example, it maybe better to provide an attribute to a generic Client type that would indicate the type of dataset/model (Tables, etc) vs. subclassing (i.e., TablesClient is a subclass of Client).

@lwander
Copy link
Contributor Author

lwander commented Jul 20, 2019

That's a great point:

I'm a little worried about consistency. When learning GCP APIs, part of the learning curve is resources (their "names" which are fully qualified identifiers

The real difficultly in using the original client is that a lot of these identifiers are hidden behind multiple API calls & string transformations. For example, to update a column (a necessary and required step), you need to know first the fully-qualified name of the dataset, use that to lookup the fully-qualified name of the primary table, and then search the primary table for the column in question. At that point, you need to extract the API-assigned identifier from the column's fully qualified name (something of the form TBL123...). At this point you're ready to update the column. This doesn't even get into the odd behavior you see where certain properties of the column are required in the update, and others are optional.

This API works great when you're a browser & always have the entire state of all your resources in memory. However, when you're just getting started in a jupyter notebook and want to quickly run some experiments using autoML, you're out of luck.

Would the "correct" solution be to rewrite all the APIs with both client users & browsers in mind? Maybe. But that doesn't cover the future work we want to implement, namely integration with pandas, integration with matplotlib, and integration with IPython.

On top of that, we have complaints from both internal users who are experts in GCP, as well as users who are brand new to the platform (and these complaints don't even address the lack of the above mentioned enhancements). Overall, there is just too much complexity in the client to perform very simple operations.

In the end, the tables team wants a polished, first-class, data scientist-friendly client, as this is what our users are asking for. Consequently, we're ready to put in the effort to maintain this client.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Jul 20, 2019
automl/README.rst Outdated Show resolved Hide resolved
@tseaver tseaver added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 23, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 23, 2019
@tseaver tseaver added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 29, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jul 29, 2019
@tseaver tseaver added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2019
@lwander
Copy link
Contributor Author

lwander commented Aug 1, 2019

Updated

@tseaver tseaver added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2019
@beccasaurus
Copy link
Contributor

✅ Green light from my end. We followed up about Project ID / Resource Objects offline.

I'm +1 to merge this when it's considered ready.

@busunkim96 Could you add @sirtorry as a reviewer?

@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2019
@busunkim96
Copy link
Contributor

I opened a PR on sloth to add @sirtorry to yoshi-python. You should be able to add comments for now.

@busunkim96 busunkim96 requested a review from sirtorry August 1, 2019 20:39
Copy link
Contributor

@sirtorry sirtorry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests lgtm

@tseaver tseaver changed the title AutoML: Adds a TablesClient for automl-tables specific behavior. AutoML: Add a TablesClient for automl-tables specific behavior. Aug 7, 2019
@lwander
Copy link
Contributor Author

lwander commented Aug 8, 2019

I've made the list system tests more stringent. Anything blocking us from merging?

@busunkim96 busunkim96 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 8, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 8, 2019
@lwander
Copy link
Contributor Author

lwander commented Aug 14, 2019

Friendly ping

@busunkim96
Copy link
Contributor

@sirtorry Please merge it when you're ready. You should have write permissions to the repo.

@busunkim96 busunkim96 requested a review from sirtorry August 14, 2019 17:31
@sirtorry sirtorry merged commit e9ea9ed into googleapis:master Aug 14, 2019
@lwander
Copy link
Contributor Author

lwander commented Aug 15, 2019

Thanks all! 🙂

emar-kar pushed a commit to MaxxleLLC/google-cloud-python that referenced this pull request Sep 18, 2019
…leapis#8720)

* Checking in staged client helper code

Additional test, docs & proposed cleanup needs to happen on top of this.

* update create_model to allow user to specify included or excluded col… (#16)

* update create_model to allow user to specify included or excluded columns

* made minor changes stylistically and with added ValueError outputs

* Update doc gen & module structure. Add unit & system tests

* added two new func: set time, get table address (#23)

* added two new func: set time, get table address

* changed indentation

* Add system tests

* Address linter & python2.7 import errors

* Passes **kwargs through to client & implements missing methods

* Support BQ as input/output in batch_predict

* Address first round of feedback

* Switch to pytest.raises, fix .rst formatting exception

* Make list system tests more stringent
parthea pushed a commit that referenced this pull request Oct 21, 2023
* Checking in staged client helper code

Additional test, docs & proposed cleanup needs to happen on top of this.

* update create_model to allow user to specify included or excluded col… (#16)

* update create_model to allow user to specify included or excluded columns

* made minor changes stylistically and with added ValueError outputs

* Update doc gen & module structure. Add unit & system tests

* added two new func: set time, get table address (#23)

* added two new func: set time, get table address

* changed indentation

* Add system tests

* Address linter & python2.7 import errors

* Passes **kwargs through to client & implements missing methods

* Support BQ as input/output in batch_predict

* Address first round of feedback

* Switch to pytest.raises, fix .rst formatting exception

* Make list system tests more stringent
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: automl Issues related to the AutoML API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants