-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added BQML pipeline library for remote inference #2
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM!
Some of the SQL is a bit beyond be, but the general Dataform package part of the approach looks pretty clean now.
Some tests about the generated SQL might help with making sure this works and is stable - installing mocha or something else as a dev dependency so that it's not included in the main package.
modules/common.js
Outdated
@@ -0,0 +1,22 @@ | |||
module.exports = { | |||
/** | |||
* Declares the given source as a resolvable Dataform data source. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Declares the given source as a resolvable Dataform data source. | |
* Declares the resolvable as a Dataform data source. |
modules/object_table_ml.js
Outdated
* @param {Number} batch_duration_secs the number of seconds to pass before breaking the batching loop if it | ||
* hasn't been finished before within this duration. Default value is 22 hours | ||
*/ | ||
const obj_table_ml = (source_table, source, target_table, accept_filter, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: make these functions, rather than just consts, if there's no reason they can't be?
modules/object_table_ml.js
Outdated
* @param {Number} batch_duration_secs the number of seconds to pass before breaking the batching loop if it | ||
* hasn't been finished before within this duration. Default value is 22 hours | ||
*/ | ||
const obj_table_ml = (source_table, source, target_table, accept_filter, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of suggesting that all parameters should be passed in via an object, so that they're all named by default, so that the example at the start would be
obj_table_ml({
// Name of the object table that points to a set of images
obj_table: "product_image",
...
});
but this gives nice error messages by default when the positional arguments are missing - so it's probably best as it is.
README.md
Outdated
// Name of the multimodel that has `gemini-pro-vision` as the endpoint | ||
let model = "multi-llm"; | ||
// Name of the object table that points to a set of images | ||
let obj_table = "product_image"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameters in all the functions call this source_table
rather than obj_table
- so probs best to pick one and stick with it
modules/object_table_ml.js
Outdated
* @param {Number} batch_duration_secs the number of seconds to pass before breaking the batching loop if it | ||
* hasn't been finished before within this duration. Default value is 22 hours | ||
*/ | ||
const obj_table_ml = (source_table, source, target_table, accept_filter, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The term target
is a bit overloaded in the context of dataform, so target_table
may be better named output_table
modules/common.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We normally call files like this utils.js instead, because CommonJS is a specific thing in JS
modules/object_table_ml.js
Outdated
// Initialize by creating the target table with a small limit to avoid timeout | ||
operate(`init_${target_table}`) | ||
.queries((ctx) => | ||
`CREATE TABLE IF NOT EXISTS ${ctx.resolve(target_table)} AS ${source_func(ctx)} WHERE ${accept_filter} LIMIT 10`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a table by a Dataform action, that isn't made by the Dataform action itself, breaks the Dataform paradigm a bit.
IIUC, this shouldn't actually be needed anyways - the preOps below is what references target_table
, but that only happens on the incremental run due to the ctx.when(ctx.incremental()
, when the table should exist. So I would be surprised if this function didn't work without this operate(
removed entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because I want the function contract be "Give me a source table and a ML function, it will guarantee to process all rows with the ML function".
If I just rely on the publish
with incremental type, it would require the script to be triggered twice at least, with the first run to create the table, and the second one to have a repeat loop and merge statement.
I admit that it is a bit hacky to have this init operation, but just to make this package easier to consume without a strict requirement on scheduling.
I also tried using two publish
call, but seems like dataform disallow to have two publish calls on the same table.
package.json
Outdated
{ | ||
"name": "bqml", | ||
"dependencies": { | ||
"@dataform/core": "2.9.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"@dataform/core": "2.9.0" | |
"@dataform/core": "3.0.0-beta.4" |
Just so it's up to date as possible.
Thanks for the review. Will add tests using the mocha in the next PR. |
Remote inference in BQML is subject to retryable error caused by quotas and limitations of the remote endpoints. This Dataform package is to provide a library to retry on retryable errors. The general idea is to have a
REPEAT
loop to continuously process a batch of rows from the source table and merge into the target tables for those that don't have retryable error.