Fix analysis tests #1106
Merged: ElDeveloper merged 17 commits into qiita-spots:relax-md-req from josenavas:fix-analysis-tests on Apr 28, 2015.
Commits (17), all by josenavas:

18efa9c  Adding the missing prep template to the DB
e845794  Fixing type on populate_test_db.sql
c6147b0  Merge branch 'fix-qiita-db-tests' into fix-analysis-tests
0ddb98a  Atatching the new prep template file to the prep template
5d7a752  fixing test_setup.py
e9fab3e  Fixing test_reference.py by removing magic numbers
ff305f9  Fixing test_job.py by removing magic numbers
01ec8bc  Fixing test_prep_template.py by removing magic numbers
69cfcc1  Fixing test_meta_util.py
c5f3107  Adding the qiime mapping file
4ef8b42  Fixing all the analysis tests. Fixes partially #247. Fixes #465
847485b  Fixing tests due to the addition of the mapping file
b43c3cc  Flake8
e7cefb0  Merge branch 'fix-qiita-db-tests' into fix-analysis-tests
e76a10c  Merge branch 'relax-md-req' of https://github.com/biocore/qiita into …
d1bee5e  Addressing @ElDeveloper comments
8be914f  Removing inference - good catch @ElDeveloper!
@@ -24,15 +24,17 @@
 from future.utils import viewitems
 from biom import load_table
 from biom.util import biom_open
+import pandas as pd
+from skbio.util import find_duplicates

 from qiita_core.exceptions import IncompetentQiitaDeveloperError
 from .sql_connection import SQLConnectionHandler
 from .base import QiitaStatusObject
-from .data import ProcessedData, RawData
+from .data import ProcessedData
 from .study import Study
-from .exceptions import QiitaDBStatusError  # QiitaDBNotImplementedError
+from .exceptions import QiitaDBStatusError, QiitaDBError
 from .util import (convert_to_id, get_work_base_dir,
-                   get_mountpoint, get_table_cols, insert_filepaths)
+                   get_mountpoint, insert_filepaths)


 class Analysis(QiitaStatusObject):
@@ -719,81 +721,61 @@ def _build_mapping_file(self, samples, conn_handler=None):
            Code modified slightly from qiime.util.MetadataMap.__add__"""
         conn_handler = conn_handler if conn_handler is not None \
             else SQLConnectionHandler()
-        # We will keep track of all unique sample_ids and metadata headers
-        # we have seen as we go, as well as studies already seen

         all_sample_ids = set()
-        all_headers = set(get_table_cols("required_sample_info", conn_handler))
-        all_studies = set()
+        sql = """SELECT filepath_id, filepath
+                 FROM qiita.filepath
+                    JOIN qiita.prep_template_filepath USING (filepath_id)
+                    JOIN qiita.prep_template_preprocessed_data
+                        USING (prep_template_id)
+                    JOIN qiita.preprocessed_processed_data
+                        USING (preprocessed_data_id)
+                    JOIN qiita.filepath_type USING (filepath_type_id)
+                 WHERE processed_data_id = %s
+                    AND filepath_type = 'qiime_map'
+                 ORDER BY filepath_id DESC"""
+        _id, fp = get_mountpoint('templates')[0]
+        to_concat = []

-        merged_data = defaultdict(lambda: defaultdict(lambda: None))
         for pid, samples in viewitems(samples):
-            if any([all_sample_ids.intersection(samples),
-                    len(set(samples)) != len(samples)]):
-                # duplicate samples so raise error
-                raise ValueError("Duplicate sample ids found: %s" %
-                                 str(all_sample_ids.intersection(samples)))
-            all_sample_ids.update(samples)
-            study_id = ProcessedData(pid).study
-
-            # create a convenience study object
-            s = Study(study_id)
-
-            # get the ids to retrieve the data from the sample and prep tables
-            sample_template_id = s.sample_template
-            # you can have multiple different prep templates but we are only
-            # using the one for 16S i. e. the last one ... sorry ;l
-            # see issue https://github.com/biocore/qiita/issues/465
-            prep_template_id = RawData(s.raw_data()[0]).prep_templates[-1]
-
-            if study_id in all_studies:
-                # samples already added by other processed data file
-                # with the study_id
-                continue
-            all_studies.add(study_id)
-            # add headers to set of all headers found
-            all_headers.update(get_table_cols("sample_%d" % sample_template_id,
-                                              conn_handler))
-            all_headers.update(get_table_cols("prep_%d" % prep_template_id,
-                                              conn_handler))
-            # NEED TO ADD COMMON PREP INFO Issue #247
-            sql = ("SELECT rs.*, p.*, ss.* "
-                   "FROM qiita.required_sample_info rs JOIN qiita.sample_{0} "
-                   "ss USING(sample_id) JOIN qiita.prep_{1} p USING(sample_id)"
-                   " WHERE rs.sample_id IN {2} AND rs.study_id = {3}".format(
-                       sample_template_id, prep_template_id,
-                       "(%s)" % ",".join("'%s'" % s for s in samples),
-                       study_id))
-            metadata = conn_handler.execute_fetchall(sql)
-            # add all the metadata to merged_data
-            for data in metadata:
-                sample_id = data['sample_id']
-                for header, value in viewitems(data):
-                    if header in {'sample_id'}:
-                        continue
-                    merged_data[sample_id][header] = str(value)
-
-        # prep headers, making sure they follow mapping file format rules
-        all_headers = list(all_headers - {'linkerprimersequence',
-                           'barcodesequence', 'description', 'sample_id'})
-        all_headers.sort()
-        all_headers = ['BarcodeSequence', 'LinkerPrimerSequence'] + all_headers
-        all_headers.append('Description')
-
-        # write mapping file out
+            if len(samples) != len(set(samples)):
+                duplicates = find_duplicates(samples)
+                raise QiitaDBError("Duplicate sample ids found: %s"
+                                   % ', '.join(duplicates))
+            # Get the QIIME mapping file
+            qiime_map_fp = conn_handler.execute_fetchall(sql, (pid,))[0][1]
+            # Parse the mapping file
+            qiime_map = pd.read_csv(
+                join(fp, qiime_map_fp), sep='\t', keep_default_na=False,
+                na_values=['unknown'], index_col=False,
+                converters=defaultdict(lambda: str))
+            qiime_map.set_index('#SampleID', inplace=True, drop=True)
+            qiime_map = qiime_map.loc[samples]
+
+            duplicates = all_sample_ids.intersection(qiime_map.index)
+            if duplicates or len(samples) != len(set(samples)):
+                # Duplicate samples so raise error
+                raise QiitaDBError("Duplicate sample ids found: %s"
+                                   % ', '.join(duplicates))
+            all_sample_ids.update(qiime_map.index)
+            to_concat.append(qiime_map)
+
+        merged_map = pd.concat(to_concat)
+
+        cols = merged_map.columns.values.tolist()
+        cols.remove('BarcodeSequence')
+        cols.remove('LinkerPrimerSequence')
+        cols.remove('Description')
+        new_cols = ['BarcodeSequence', 'LinkerPrimerSequence']
+        new_cols.extend(cols)
+        new_cols.append('Description')
+        merged_map = merged_map[new_cols]
+
+        # Save the mapping file
         _, base_fp = get_mountpoint(self._table)[0]
         mapping_fp = join(base_fp, "%d_analysis_mapping.txt" % self._id)
-        with open(mapping_fp, 'w') as f:
-            f.write("#SampleID\t%s\n" % '\t'.join(all_headers))
-            for sample, metadata in viewitems(merged_data):
-                data = [sample]
-                for header in all_headers:
-                    l_head = header.lower()
-                    data.append(metadata[l_head] if
-                                metadata[l_head] is not None else "no_data")
-                f.write("%s\n" % "\t".join(data))
+        merged_map.to_csv(mapping_fp, index_label='#SampleID',
+                          na_rep='unknown', sep='\t')

         self._add_file("%d_analysis_mapping.txt" % self._id,
                        "plain_text", conn_handler=conn_handler)

     def _add_file(self, filename, filetype, data_type=None, conn_handler=None):
         """adds analysis item to database

Inline review comment on merged_map = merged_map[new_cols]:
Ah, this is super awesome! ✨
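To make the new flow easier to follow outside the diff, here is a condensed, standalone sketch of what _build_mapping_file now does. The merge_qiime_maps helper and its arguments are hypothetical (the real method resolves each per-study QIIME mapping file through the database and mountpoints), and dtype=str stands in for the converters approach discussed in the review thread below:

import pandas as pd


def merge_qiime_maps(map_fps, samples_per_map, out_fp):
    """Merge per-study QIIME mapping files into one analysis mapping file."""
    all_sample_ids = set()
    to_concat = []
    for fp, samples in zip(map_fps, samples_per_map):
        # Parse everything as strings; 'unknown' is the NA sentinel
        qiime_map = pd.read_csv(fp, sep='\t', keep_default_na=False,
                                na_values=['unknown'], index_col=False,
                                dtype=str)
        qiime_map.set_index('#SampleID', inplace=True, drop=True)
        # Keep only the samples requested for this processed-data object
        qiime_map = qiime_map.loc[samples]

        # Refuse to merge overlapping sample ids across studies
        duplicates = all_sample_ids.intersection(qiime_map.index)
        if duplicates:
            raise ValueError("Duplicate sample ids found: %s"
                             % ', '.join(duplicates))
        all_sample_ids.update(qiime_map.index)
        to_concat.append(qiime_map)

    merged = pd.concat(to_concat)

    # QIIME mapping file rules: BarcodeSequence and LinkerPrimerSequence
    # come right after #SampleID, and Description goes last
    cols = merged.columns.tolist()
    for name in ('BarcodeSequence', 'LinkerPrimerSequence', 'Description'):
        cols.remove(name)
    merged = merged[['BarcodeSequence', 'LinkerPrimerSequence'] +
                    cols + ['Description']]
    merged.to_csv(out_fp, index_label='#SampleID', na_rep='unknown', sep='\t')

The duplicate check mirrors the new QiitaDBError behavior, though a plain ValueError is raised here to keep the sketch dependency-free.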
qiita_db/support_files/test_data/analysis/1_analysis_mapping_exp.txt (4 additions, 0 deletions)
@@ -0,0 +1,4 @@
+#SampleID BarcodeSequence LinkerPrimerSequence center_name center_project_name emp_status experiment_center experiment_design_description experiment_title illumina_technology library_construction_protocol pcr_primers platform run_center run_date run_prefix samp_size sample_center sequencing_meth study_center target_gene target_subfragment altitude anonymized_name assigned_from_geo collection_timestamp common_name country depth description_duplicate elevation env_biome env_feature has_extracted_data has_physical_specimen host_subject_id host_taxid latitude longitude ph physical_location samp_salinity sample_type season_environment taxon_id temp texture tot_nitro tot_org_carb water_content_soil Description
+1.SKB8.640193 AGCGCTCACATC GTGCCAGCMGCCGCGGTAA ANL EMP ANL micro biome of soil and rhizosphere of cannabis plants from CA Cannabis Soil Microbiome MiSeq This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA (both bacteria and archaea), which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair amplifies the region 533_786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA_id:470367).] The reverse PCR primer is barcoded with a 12-base error-correcting Golay code to facilitate multiplexing of up to 1,500 samples per lane, and both PCR primers contain sequencer adapter regions. FWD:GTGCCAGCMGCCGCGGTAA; REV:GGACTACHVGGGTWTCTAAT Illumina ANL 8/1/12 s_G1_L001_sequences .25,g ANL Sequencing by synthesis CCME 16S rRNA V4 0.0 SKB8 n 2011-11-11 13:00:00 root metagenome GAZ:United States of America 0.15 Burmese root 114.0 ENVO:Temperate grasslands, savannas, and shrubland biome ENVO:plant-associated habitat True True 1001:M7 3483 74.0894932572 65.3283470202 6.94 ANL 7.15 ENVO:soil winter 1118232 15.0 64.6 sand, 17.6 silt, 17.8 clay 1.41 5.0 0.16399999999999998 Cannabis Soil Microbiome
+1.SKD8.640184 TGAGTGGTCTGT GTGCCAGCMGCCGCGGTAA ANL EMP ANL micro biome of soil and rhizosphere of cannabis plants from CA Cannabis Soil Microbiome MiSeq This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA (both bacteria and archaea), which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair amplifies the region 533_786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA_id:470367).] The reverse PCR primer is barcoded with a 12-base error-correcting Golay code to facilitate multiplexing of up to 1,500 samples per lane, and both PCR primers contain sequencer adapter regions. FWD:GTGCCAGCMGCCGCGGTAA; REV:GGACTACHVGGGTWTCTAAT Illumina ANL 8/1/12 s_G1_L001_sequences .25,g ANL Sequencing by synthesis CCME 16S rRNA V4 0.0 SKD8 n 2011-11-11 13:00:00 root metagenome GAZ:United States of America 0.15 Diesel Root 114.0 ENVO:Temperate grasslands, savannas, and shrubland biome ENVO:plant-associated habitat True True 1001:D9 3483 57.571893782 32.5563076447 6.8 ANL 7.1 ENVO:soil winter 1118232 15.0 66 sand, 16.3 silt, 17.7 clay 1.51 4.32 0.17800000000000002 Cannabis Soil Microbiome
+1.SKB7.640196 CGGCCTAAGTTC GTGCCAGCMGCCGCGGTAA ANL EMP ANL micro biome of soil and rhizosphere of cannabis plants from CA Cannabis Soil Microbiome MiSeq This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA (both bacteria and archaea), which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair amplifies the region 533_786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA_id:470367).] The reverse PCR primer is barcoded with a 12-base error-correcting Golay code to facilitate multiplexing of up to 1,500 samples per lane, and both PCR primers contain sequencer adapter regions. FWD:GTGCCAGCMGCCGCGGTAA; REV:GGACTACHVGGGTWTCTAAT Illumina ANL 8/1/12 s_G1_L001_sequences .25,g ANL Sequencing by synthesis CCME 16S rRNA V4 0.0 SKB7 n 2011-11-11 13:00:00 root metagenome GAZ:United States of America 0.15 Burmese root 114.0 ENVO:Temperate grasslands, savannas, and shrubland biome ENVO:plant-associated habitat True True 1001:M8 3483 13.089194595 92.5274472082 6.94 ANL 7.15 ENVO:soil winter 1118232 15.0 64.6 sand, 17.6 silt, 17.8 clay 1.41 5.0 0.16399999999999998 Cannabis Soil Microbiome
ElDeveloper:
Loading the mapping file with Pandas may not be the best idea here: the datatype of a few columns may be inferred, and that in turn will cause problems down the road. The simplest example is a column whose values are 1001, 0001, 0002 and 0003; pandas will cast these to ints and turn them into 1001, 1, 2 and 3. If there's any way we can load only string values and then do all the needed transformations, that would be ideal.
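A minimal sketch of the inference problem being described; the column name and values here are hypothetical:

from io import StringIO

import pandas as pd

# A metadata column whose values look numeric but are really labels
tsv = "sample_id\twell\ns1\t1001\ns2\t0001\ns3\t0002\n"

df = pd.read_csv(StringIO(tsv), sep='\t')
print(df['well'].tolist())    # [1001, 1, 2] -- the leading zeros are gone

# Forcing strings preserves the values verbatim (dtype=str in recent pandas)
df = pd.read_csv(StringIO(tsv), sep='\t', dtype=str)
print(df['well'].tolist())    # ['1001', '0001', '0002']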
squirrelo:
I'm personally still wary about using the files vs using the database values, but that's a separate conversation.
josenavas:
@ElDeveloper I think I'm going to pull out a wrapper function around pandas so all the parsing is always done consistently.
@squirrelo With the changes that I've made in the metadata template, the files and the DB are always synced. If we are not going to trust our files, why store them at all?
ElDeveloper:
@josenavas, as long as we can guarantee that all values are loaded as strings, then that should work just fine. Searching on the intertubes I came across this: http://stackoverflow.com/questions/12101113/prevent-pandas-from-automatically-infering-type-in-read-csv
josenavas:
I've been thinking more about this (because we already have the template parsing centralizes and it is smarter). Thanks for the link but I think it is an overkill. I think the easier way is to do use the converters parameters like this:
converters=defaultdict)lambda: str)
so it is always parsing all columns as str, which is what we want here 😄There was a problem hiding this comment.
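For illustration, a self-contained sketch of converter-forced string parsing; the columns here are spelled out explicitly rather than built with a defaultdict, since an explicit mapping works regardless of how read_csv looks the converters up, and the data is hypothetical:

from io import StringIO

import pandas as pd

tsv = "sample_id\twell\ns1\t1001\ns2\t0001\n"

# A str converter per listed column keeps every value verbatim
df = pd.read_csv(StringIO(tsv), sep='\t',
                 converters={'sample_id': str, 'well': str})
print(df['well'].tolist())    # ['1001', '0001']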
ElDeveloper:
Got it, that makes sense. If that's the case then we don't need to infer the datetime or parse dates; would you mind setting those to False or just removing them altogether?
josenavas:
Done!