Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GH-1221] import snapshots into a workspace #347

Merged
merged 18 commits into from
Apr 2, 2021

Conversation

ehigham
Copy link
Member

@ehigham ehigham commented Mar 19, 2021

RR: https://broadinstitute.atlassian.net/browse/GH-1221
In this PR, I'm adding test demonstration code for how to import data from a TDR snapshot into a Terra Workspace. This includes:

  • Querying all the rows in the specified table in the snapshot
  • Selecting and renaming the columns of the table to match those of the target entity
  • Importing the rows as new entities into the workspace.

The renaming works by viewing the rows of the table as a list of maps, mapping those names onto workspace names, then reconstructing the table.
Added dataset/snapshot querying utilities to the datarepo namespace.

Bug fixes:

  • "flattening" the big-query rows didn't flatten rows of arrays

@ehigham ehigham marked this pull request as ready for review March 30, 2021 18:20
@rexwangcc rexwangcc self-requested a review March 30, 2021 20:08
@ehigham ehigham changed the title [GH-1221] import snapshots to workspace [GH-1221] import snapshots into a workspace Mar 31, 2021
@ehigham ehigham force-pushed the ehigham/GH-1221-import-snapshots-to-workspace branch from e456f28 to 2b8d226 Compare March 31, 2021 01:56
Copy link
Contributor

@rexwangcc rexwangcc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for adding the import functionalities!! I left some questions, mainly for my own understanding :octocat:

api/src/wfl/service/datarepo.clj Show resolved Hide resolved
api/src/wfl/service/datarepo.clj Show resolved Hide resolved
api/src/wfl/service/google/bigquery.clj Show resolved Hide resolved
api/src/wfl/util.clj Show resolved Hide resolved
(throw (TimeoutException. "Max number of attempts exceeded")))
(log/debugf "Sleeping - attempt #%s" attempt)
(.sleep TimeUnit/MILLISECONDS 100)
(recur (inc attempt)))))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you help explain the use case of this function? it looks like this is more like "check thrice in 0.3 seconds, throw or proceed with thunk" than a regular polling to me. (i guess it's the short sleep time and the small attempt number cap got me?)

Copy link
Member Author

@ehigham ehigham Mar 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because importing the entity then getting the entity sometimes returns empty. This code tries at most 3 times to call the get. I've just pulled out the thunk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because importing the entity then getting the entity sometimes returns empty.

That is weird… So the point of this function is more like repeat more times than wait?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the goal is to poll the endpoint until the new entity is visible. It'll call at most 3 times.

api/test/wfl/integration/datarepo_test.clj Show resolved Hide resolved
api/test/wfl/tools/snapshots.clj Show resolved Hide resolved
api/test/wfl/unit/google/bigquery_test.clj Show resolved Hide resolved
@tbl3rd tbl3rd self-assigned this Mar 31, 2021
Copy link
Contributor

@tbl3rd tbl3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for figuring this out.

api/src/wfl/service/datarepo.clj Outdated Show resolved Hide resolved
api/src/wfl/service/datarepo.clj Outdated Show resolved Hide resolved
@@ -74,24 +77,23 @@
"ingest"
dataset-id
{:format "json"
:load_tag "string"
:load_tag (new-load-tag)
:max_bad_records 0
:path path
:table table}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd formatting, but old.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could it be improved?

api/src/wfl/service/datarepo.clj Outdated Show resolved Hide resolved
api/src/wfl/service/datarepo.clj Show resolved Hide resolved
api/test/wfl/integration/datarepo_test.clj Show resolved Hide resolved
api/src/wfl/util.clj Show resolved Hide resolved
(map :name)
set)]
(doseq [[_ name] entities] (is (contains? names name))))
(finally (firecloud/delete-entities workspace entities))))))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could (keep :name) and drop contains?.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't (keep :name) and (map :name) both deliver the same thing?

api/test/wfl/unit/google/bigquery_test.clj Outdated Show resolved Hide resolved
api/test/wfl/unit/google/bigquery_test.clj Show resolved Hide resolved
#{"running"})]
(while (running?) (.sleep TimeUnit/SECONDS seconds))
(result)))
([job-id]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clojure q - is a multi-arity function the best / only way to provide default argument(s)? What about the case where you might wish to allow multiple default arguments?

Copy link
Member Author

@ehigham ehigham Apr 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly one way, not sure if it's the best way. You can simulate named arguments by taking key-value pairs. Here's one way to implement that. There other ways and I'm not sure this is the best:

(let [defaults {:foo 'value1
                :bar 'value2
                :baz 'value3}]
  (defn my-named-args-function
    [& options]
    (let [inputs (apply assoc defaults options)]
      (use inputs))))

;; usage
(my-named-args-function :bar 12 :baz pi)

@ehigham ehigham merged commit e3d500d into main Apr 2, 2021
@ehigham ehigham deleted the ehigham/GH-1221-import-snapshots-to-workspace branch April 2, 2021 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants