Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added BQML pipeline library for remote inference #2

Merged
merged 10 commits into from
Apr 3, 2024
Merged

Added BQML pipeline library for remote inference #2

merged 10 commits into from
Apr 3, 2024

Conversation

chtyim
Copy link
Collaborator

@chtyim chtyim commented Apr 2, 2024

​​Remote inference in BQML is subject to retryable error caused by quotas and limitations of the remote endpoints. This Dataform package is to provide a library to retry on retryable errors. The general idea is to have a REPEAT loop to continuously process a batch of rows from the source table and merge into the target tables for those that don't have retryable error.

@chtyim chtyim requested review from Ekrekr, BenBirt and lewish April 2, 2024 15:18
Copy link

@Ekrekr Ekrekr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM!

Some of the SQL is a bit beyond be, but the general Dataform package part of the approach looks pretty clean now.

Some tests about the generated SQL might help with making sure this works and is stable - installing mocha or something else as a dev dependency so that it's not included in the main package.

@@ -0,0 +1,22 @@
module.exports = {
/**
* Declares the given source as a resolvable Dataform data source.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Declares the given source as a resolvable Dataform data source.
* Declares the resolvable as a Dataform data source.

* @param {Number} batch_duration_secs the number of seconds to pass before breaking the batching loop if it
* hasn't been finished before within this duration. Default value is 22 hours
*/
const obj_table_ml = (source_table, source, target_table, accept_filter, {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: make these functions, rather than just consts, if there's no reason they can't be?

* @param {Number} batch_duration_secs the number of seconds to pass before breaking the batching loop if it
* hasn't been finished before within this duration. Default value is 22 hours
*/
const obj_table_ml = (source_table, source, target_table, accept_filter, {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of suggesting that all parameters should be passed in via an object, so that they're all named by default, so that the example at the start would be

obj_table_ml({
    // Name of the object table that points to a set of images
    obj_table: "product_image",
    ...
});

but this gives nice error messages by default when the positional arguments are missing - so it's probably best as it is.

README.md Outdated
// Name of the multimodel that has `gemini-pro-vision` as the endpoint
let model = "multi-llm";
// Name of the object table that points to a set of images
let obj_table = "product_image";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameters in all the functions call this source_table rather than obj_table - so probs best to pick one and stick with it

* @param {Number} batch_duration_secs the number of seconds to pass before breaking the batching loop if it
* hasn't been finished before within this duration. Default value is 22 hours
*/
const obj_table_ml = (source_table, source, target_table, accept_filter, {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term target is a bit overloaded in the context of dataform, so target_table may be better named output_table

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We normally call files like this utils.js instead, because CommonJS is a specific thing in JS

Comment on lines 33 to 36
// Initialize by creating the target table with a small limit to avoid timeout
operate(`init_${target_table}`)
.queries((ctx) =>
`CREATE TABLE IF NOT EXISTS ${ctx.resolve(target_table)} AS ${source_func(ctx)} WHERE ${accept_filter} LIMIT 10`);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a table by a Dataform action, that isn't made by the Dataform action itself, breaks the Dataform paradigm a bit.

IIUC, this shouldn't actually be needed anyways - the preOps below is what references target_table, but that only happens on the incremental run due to the ctx.when(ctx.incremental(), when the table should exist. So I would be surprised if this function didn't work without this operate( removed entirely.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because I want the function contract be "Give me a source table and a ML function, it will guarantee to process all rows with the ML function".

If I just rely on the publish with incremental type, it would require the script to be triggered twice at least, with the first run to create the table, and the second one to have a repeat loop and merge statement.

I admit that it is a bit hacky to have this init operation, but just to make this package easier to consume without a strict requirement on scheduling.

I also tried using two publish call, but seems like dataform disallow to have two publish calls on the same table.

package.json Outdated
{
"name": "bqml",
"dependencies": {
"@dataform/core": "2.9.0"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"@dataform/core": "2.9.0"
"@dataform/core": "3.0.0-beta.4"

Just so it's up to date as possible.

@chtyim
Copy link
Collaborator Author

chtyim commented Apr 3, 2024

Thanks for the review. Will add tests using the mocha in the next PR.

@chtyim chtyim merged commit f07956e into main Apr 3, 2024
1 check passed
@chtyim chtyim deleted the bqml branch April 3, 2024 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants