-
Couldn't load subscription status.
- Fork 79
Add qiimp xlsx #2599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add qiimp xlsx #2599
Changes from 5 commits
51986a1
d701b81
011b339
2fe8e71
fd564e6
df980db
39d7906
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -51,6 +51,9 @@ | |
| from os.path import join, basename, isdir, exists | ||
| from os import walk, remove, listdir, makedirs, rename | ||
| from shutil import move, rmtree, copy as shutil_copy | ||
| from openpyxl import load_workbook | ||
| from tempfile import mkstemp | ||
| from csv import writer as csv_writer | ||
| from json import dumps | ||
| from datetime import datetime | ||
| from itertools import chain | ||
|
|
@@ -1730,6 +1733,34 @@ def _get_filehandle(filepath_or, *args, **kwargs): | |
| if _is_string_or_bytes(filepath_or): | ||
| if h5py.is_hdf5(filepath_or): | ||
| fh, own_fh = h5py.File(filepath_or, *args, **kwargs), True | ||
| elif filepath_or.endswith('.xlsx'): | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is going to fail when I think the XLSX parsing should be moved elsewhere this function seems too generic as it should only open a file path or file handle. Is there any motivation to have this here instead of in the sample/prep information parsing? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right! I initially was gonna add the code in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair enough, the use of this function is rather limited (see these search results). It still seems too odd to me that a function that's intended to return a file handle will parse a file write it as a txt file and return that file handle. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, the idea of parsing the xmlx is that it will be transformed to txt and then use the regular prep/sample code ... ¯_(ツ)_/¯ ... I don't have a better place to move it but if you do, let me know. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it makes more sense to have this outside of this function. I see as two very different things, (1) opening a file and (2) parsing a file. And this function opens the file and shouldn't be parsing it. Is there any technical limitation to putting this in load_template_to_dataframe? Perhaps as a helper function somewhere in there? |
||
| # due to extension, let's assume Excel file | ||
| wb = load_workbook(filename=filepath_or, data_only=True) | ||
| sheetnames = wb.sheetnames | ||
| # let's check if Qiimp, they must be in same order | ||
| first_cell_index = 0 | ||
| is_qiimp_wb = False | ||
| if sheetnames == ["Metadata", "Validation", "Data Dictionary", | ||
| "metadata_schema", "metadata_form", | ||
| "Instructions"]: | ||
| first_cell_index = 1 | ||
| is_qiimp_wb = True | ||
| first_sheet = wb[sheetnames[0]] | ||
| cell_range = range(first_cell_index, first_sheet.max_column) | ||
| _, fp = mkstemp(suffix='.txt') | ||
| with open(fp, 'w') as fh: | ||
| cfh = csv_writer(fh, delimiter='\t') | ||
| for r in first_sheet.rows: | ||
| if is_qiimp_wb: | ||
| # check contents of first column; if they are a zero | ||
| # (not a valid QIIMP sample_id) or a "No more than | ||
| # max samples" message, there are no more valid rows, | ||
| # so don't examine any more rows. | ||
| fcv = str(r[cell_range[0]].value) | ||
| if fcv == "0" or fcv.startswith("No more than"): | ||
| break | ||
| cfh.writerow([r[x].value for x in cell_range]) | ||
| fh, own_fh = open(fp, *args, **kwargs), True | ||
| else: | ||
| fh, own_fh = open(filepath_or, *args, **kwargs), True | ||
| else: | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a show-stopper, but I am concerned because I think the auto-generated exception is inaccurate ... an empty QIIMP template will contain a sample_name column, as shown below, and the tsv produced from it will also include that column--actually, the tsv will contain ONLY a row of column names :) I fear that an error message indicating the sample_name column isn't there when it actually is will mislead users/future maintainers when they try to figure out why their upload failed.

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AmandaBirmingham, good point. The reason this was happening was all columns are empty so all of them are dropped and the final dataframe was empty, and the first test is to check for validity is if sample_name column is there. Anyway, I added a new tests (empty) and now the error should make more sense.