Add support for asynchronous embeddings export #394
if status["status"] != "Completed":
    raise JobError(status, self)
Why raise a JobError if the job is not completed? Perhaps it's still running?
My thought process was that the usage pattern would be the following:
export_job = dataset.export_embeddings()
export_job.sleep_until_complete(False)
result = export_job.result_urls()
We could also wait for the job inside result_urls(), but then I'd want to make it obvious that obtaining the results could take a long time.
Alright, that makes sense, I didn't notice the AsyncJob inheritance.
This is a neat idea, to have customized job result classes.
Maybe we should add a wait_for_completion parameter. It might even be the default to wait for the job to complete.
Good idea, let's do that
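For illustration, here is a minimal sketch of how a wait_for_completion parameter might behave. FakeExportJob, its poll counter, the poll_interval argument, and the placeholder URL are all hypothetical stand-ins so the example runs without a server; this is not the actual nucleus implementation.

```python
import time
from typing import Dict, List


class JobError(Exception):
    """Stand-in for the JobError raised in the diff above."""


class FakeExportJob:
    """Hypothetical stand-in for EmbeddingsExportJob; makes no network calls."""

    def __init__(self, polls_until_done: int = 2) -> None:
        # Number of status() calls that report "Running" before "Completed".
        self._remaining = polls_until_done

    def status(self) -> Dict[str, str]:
        # Pretend the server finishes after a fixed number of polls.
        if self._remaining > 0:
            self._remaining -= 1
            return {"status": "Running"}
        return {"status": "Completed"}

    def sleep_until_complete(self, poll_interval: float = 0.01) -> None:
        # Block until the job stops reporting "Running".
        while self.status()["status"] == "Running":
            time.sleep(poll_interval)

    def result_urls(self, wait_for_completion: bool = True) -> List[str]:
        # By default, block until the job is done before reading results.
        if wait_for_completion:
            self.sleep_until_complete()
        status = self.status()
        if status["status"] != "Completed":
            raise JobError(status)
        return ["https://example.com/embeddings/part-0"]  # placeholder URL
```

With the default, callers get the URLs in one call; passing wait_for_completion=False keeps the current behavior of raising immediately when the job has not finished.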
gatli left a comment
Looks good to me! Let's address the instantiation of the EmbeddingsExportJob (and, for that matter, any AsyncJob) so that people can trigger this in one process and poll in another.
  poetry run black --check .
  - run:
-     name: Ruff Lint Check # See pyproject.tooml [tool.ruff]
+     name: Ruff Lint Check # See pyproject.toml [tool.ruff]
nucleus/async_job.py
class EmbeddingsExportJob(AsyncJob):
    def result_urls(self) -> List[str]:
I'm wondering how you would instantiate this in another process. I think we need a classmethod from_id that would let you spin this up in one environment and then poll it in another using only the job_id.
Good point. That would be used through the NucleusClient.list_jobs method though, right? So something like this:
jobs = NucleusClient.list_jobs()
export_job = EmbeddingsExportJob.from_job_id(jobs[0].job_id)
Added a from_id to AsyncJob, but I couldn't make it more type-safe (e.g. the client argument is still inferred as Any). Do you have any ideas on how to improve it?
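One possible direction for the typing question, sketched under assumptions: describe what the job needs from the client as a typing.Protocol, and annotate cls with a bound TypeVar so that from_id called on a subclass returns that subclass. The JobClient protocol, its get_job_payload method, the payload keys, and the placeholder URL are hypothetical and not part of NucleusClient; in the real module, importing NucleusClient under typing.TYPE_CHECKING might serve the same purpose.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol, Type, TypeVar


class JobClient(Protocol):
    """Hypothetical minimal interface a job needs from the client."""

    def get_job_payload(self, job_id: str) -> Dict[str, Any]:
        ...


# Bound TypeVar: from_id returns the class it was called on.
T = TypeVar("T", bound="AsyncJob")


@dataclass
class AsyncJob:
    job_id: str
    last_known_status: str
    client: JobClient

    @classmethod
    def from_id(cls: Type[T], job_id: str, client: JobClient) -> T:
        # Rebuild the job object from its id alone, e.g. in another process.
        payload = client.get_job_payload(job_id)
        return cls(
            job_id=payload["job_id"],
            last_known_status=payload["status"],
            client=client,
        )


class EmbeddingsExportJob(AsyncJob):
    def result_urls(self) -> List[str]:
        # Placeholder result; a real job would ask the server for the URLs.
        return [f"https://example.com/{self.job_id}/embeddings"]
```

Because the protocol is structural, any object with a matching get_job_payload method type-checks as a client, and a checker can infer EmbeddingsExportJob.from_id(...) as EmbeddingsExportJob rather than AsyncJob or Any.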

