Revisit ingestion of VIDRL flat files #161

Open
joverlee521 opened this issue Sep 11, 2024 · 7 comments · May be fixed by #164

@joverlee521
Contributor

Brought up by @huddlej on Slack that OneDrive includes flat files. Ingesting the flat files should make the rest of #158 easier?

Revisit changes made in #103 and update it to work with the latest version of the flat files.

@joverlee521
Contributor Author

Thanks @j23414 for investigating the latest flat files 🙏

Jotting down notes for updating the flat file ingest:

  • the vidrl_flat_file_column_map.tsv will definitely need to be updated
  • there is a single homologous titre column so our ingest needs to create a row per reference strain to capture these homologous titers
  • the reference strains use the full strain name so we would no longer need the serum mapping 🎉
  • human sera pools include the reference strain so we would no longer need to keep track of the vaccine mapping 🎉

@joverlee521
Contributor Author

there is a single homologous titre column so our ingest needs to create a row per reference strain to capture these homologous titers

Oh, there's a separate file for the reference panel results. Each flat file has a matching *_reference_panel.csv file that includes the references' homologous titers.

@joverlee521
Contributor Author

joverlee521 commented Oct 7, 2024

The *_reference_panel.csv has a subset of the columns used in the main *_flat_file.csv, and it only includes the shortened name of the antisera. So the antisera -> reference name mapping from the *_flat_file.csv will need to be preserved and reused when processing the matching *_reference_panel.csv file.
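
A minimal sketch of how that mapping could be carried from one file to the other. The column names (`antisera`, `reference_antigen`), the strain names, and both helper functions are hypothetical placeholders, not the real VIDRL headers or anything in tdb/vidrl_upload.py:

```python
import csv
import io

# Hypothetical column names; the real headers in the VIDRL files may differ.
ANTISERA_COL = "antisera"
REFERENCE_COL = "reference_antigen"


def build_antisera_map(flat_file_rows):
    """Map each shortened antisera name to the full reference strain name."""
    mapping = {}
    for row in flat_file_rows:
        short, full = row[ANTISERA_COL], row[REFERENCE_COL]
        # Fail loudly if one abbreviation points at two different strains.
        if mapping.setdefault(short, full) != full:
            raise ValueError(f"Conflicting reference names for antisera {short!r}")
    return mapping


def annotate_reference_panel(panel_rows, mapping):
    """Yield reference panel rows with the full reference strain name added."""
    for row in panel_rows:
        yield {**row, REFERENCE_COL: mapping[row[ANTISERA_COL]]}


# Made-up example data standing in for a flat file and its reference panel.
flat_file = io.StringIO(
    "test_virus,antisera,reference_antigen,titre\n"
    "A/Test/1/2024,Ref1,A/Ref/1/2023,160\n"
    "A/Test/2/2024,Ref1,A/Ref/1/2023,320\n"
)
reference_panel = io.StringIO(
    "test_virus,antisera,titre\n"
    "A/Ref/1/2023,Ref1,640\n"
)

mapping = build_antisera_map(csv.DictReader(flat_file))
annotated = list(annotate_reference_panel(csv.DictReader(reference_panel), mapping))
```

Raising on a conflicting abbreviation is a design choice for the sketch: silently keeping the first or last match would hide a mismatch between the two files.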

@huddlej
Contributor

huddlej commented Oct 7, 2024

@joverlee521 I think we originally asked for the reference panel file and Sheena made it for us. Then later Sheena modified her script that produces the flat files to pull in the relevant information from the reference panel file, so we didn't have to parse that reference information separately.

Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?

We could jump on a huddle tomorrow to chat, if that's helpful. It's been a little while since I looked at these files, too...

@joverlee521
Contributor Author

joverlee521 commented Oct 7, 2024

Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?

Yeah, looking at the *_flat_file.csv more closely, they are completely missing the reference titer measurements. They only include the results for test virus x reference virus, but do not include any of the reference virus x reference virus results.

@huddlej
Contributor

huddlej commented Oct 7, 2024

Got it. I can't see the latest files any more (curse OneDrive!), but in the last view I had of those files, they included columns for reference antigen, reference passage, and homologous titre, which would represent most of the reference titer measurements we need, but maybe that isn't enough.

To get those homologous reference values into our standard format, we would need to make a new record for each unique combination of antigen, passage, and titer, with the test virus set to the reference antigen, the test virus passage set to the reference passage, and the titre set to the homologous titre. We would be missing the antisera and ferret columns, though. We don't need antisera when it is just an abbreviation of the reference virus name, but we probably want ferret. That supports the case for parsing the separate reference panel file, if that file has that information.
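
A rough sketch of that record-building step, assuming hypothetical column names (`reference_antigen`, `reference_passage`, `homologous_titre`) and made-up example rows. The real flat file headers and the standard record format may differ:

```python
def homologous_records(rows):
    """Build one record per unique (reference antigen, passage, homologous titre)."""
    seen = set()
    records = []
    for row in rows:
        key = (
            row["reference_antigen"],
            row["reference_passage"],
            row["homologous_titre"],
        )
        if key in seen:
            continue
        seen.add(key)
        antigen, passage, titre = key
        records.append({
            # The reference virus is treated as its own test virus.
            "test_virus": antigen,
            "test_virus_passage": passage,
            "reference_antigen": antigen,
            "reference_passage": passage,
            "titre": titre,
            # Not recoverable from these three columns alone.
            "antisera": None,
            "ferret": None,
        })
    return records


# Made-up rows: two test viruses measured against the same reference.
example_rows = [
    {"test_virus": "A/Test/1/2024", "reference_antigen": "A/Ref/1/2023",
     "reference_passage": "E3", "homologous_titre": "640"},
    {"test_virus": "A/Test/2/2024", "reference_antigen": "A/Ref/1/2023",
     "reference_passage": "E3", "homologous_titre": "640"},
    {"test_virus": "A/Test/1/2024", "reference_antigen": "A/Ref/2/2023",
     "reference_passage": "C2", "homologous_titre": "320"},
]
records = homologous_records(example_rows)
```

The `None` values for `antisera` and `ferret` mark exactly the gap described above: those fields would have to come from somewhere else, such as the reference panel file.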

@joverlee521
Contributor Author

We chatted about this today and decided that we do need to ingest the additional reference_panel.csv. This will ensure our ingest of the flat files includes all the same measurements as the previous Excel files.

I'll update tdb/vidrl_upload.py to work with the new flat files and test on a couple of Excel/flat file pairs to get a diff of the two paths.
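
One way such a diff could be sketched, assuming both paths produce lists of dict records. The `diff_measurements` helper, the key fields, and the example records are all hypothetical and not part of tdb/vidrl_upload.py:

```python
def diff_measurements(excel_records, flat_records, key_fields):
    """Report measurements that appear in only one of the two ingest paths."""
    def as_keys(records):
        return {tuple(record[field] for field in key_fields) for record in records}

    excel_keys = as_keys(excel_records)
    flat_keys = as_keys(flat_records)
    return {
        "only_in_excel": sorted(excel_keys - flat_keys),
        "only_in_flat": sorted(flat_keys - excel_keys),
    }


# Made-up records: the flat file path is missing a reference x reference result.
key_fields = ("virus_strain", "serum_strain", "titer")
excel_records = [
    {"virus_strain": "A/Test/1/2024", "serum_strain": "A/Ref/1/2023", "titer": "160"},
    {"virus_strain": "A/Ref/1/2023", "serum_strain": "A/Ref/1/2023", "titer": "640"},
]
flat_records = [
    {"virus_strain": "A/Test/1/2024", "serum_strain": "A/Ref/1/2023", "titer": "160"},
]
result = diff_measurements(excel_records, flat_records, key_fields)
```

Comparing on a tuple of key fields rather than whole records keeps the diff insensitive to columns that only one path populates.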

joverlee521 added a commit that referenced this issue Oct 9, 2024
The column map will be more complicated with the need to ingest two
slightly different flat files (_flat_file.csv and _reference_panel.csv)
as discussed in #161 (comment).

I also found myself constantly toggling back and forth between the
separate column_map.tsv and the upload script to figure out how the
columns are being used, so it makes more sense to just hard-code the
column map in the script.
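
A hedged sketch of what a hard-coded column map for the two file types might look like. The source headers, target field names, and the `rename_columns` helper are invented for illustration and are not the actual contents of the upload script:

```python
# Hypothetical maps from file headers to internal field names.
FLAT_FILE_COLUMN_MAP = {
    "Test Virus": "virus_strain",
    "Reference Antigen": "serum_strain",
    "Titre": "titer",
}

# The reference panel has only a subset of the columns, so it gets its own map.
REFERENCE_PANEL_COLUMN_MAP = {
    "Test Virus": "virus_strain",
    "Titre": "titer",
}


def rename_columns(row, column_map):
    """Keep only mapped columns and rename them to the internal field names."""
    return {column_map[name]: value for name, value in row.items() if name in column_map}


# Made-up input row with one column the map does not know about.
row = {"Test Virus": "A/Test/1/2024", "Titre": "160", "Unused": "x"}
renamed = rename_columns(row, FLAT_FILE_COLUMN_MAP)
```

Keeping the maps next to the code that consumes them is the point made in the commit message: the mapping and its usage can be read in one place.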
@huddlej huddlej self-assigned this Oct 9, 2024
joverlee521 added a commit that referenced this issue Oct 16, 2024
@joverlee521 joverlee521 linked a pull request Oct 17, 2024 that will close this issue