remove pyensembl requirement and solve Arabidopsis bug #57


Merged
15 commits merged into GoekeLab:master on Jun 18, 2021

Collaborator

@yuukiiwa commented on May 25, 2021

Here are the xpore-dataprep runs with the human and Arabidopsis references and annotations:

(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh
total 3557856
-rw-------@  1 yukkei  staff   215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--@  1 yukkei  staff    94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--@  1 yukkei  staff   1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@  1 yukkei  staff   366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
drwxr-xr-x   6 yukkei  staff   192B May 25 15:06 nanopolish
drwxr-xr-x  16 yukkei  staff   512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % xpore-dataprep \                            
--eventalign nanopolish/eventalign.txt \
--summary nanopolish/summary.txt \
--out_dir human_dataprep \
--genome --gtf_path_or_url Homo_sapiens.GRCh38.91.gtf --transcript_fasta_paths_or_urls Homo_sapiens.GRCh38.cdna.ncrna.fa --merge_transcript_id_version 
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:102: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:51: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh 
total 4203216
-rw-------@  1 yukkei  staff   215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--@  1 yukkei  staff    94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--@  1 yukkei  staff   1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@  1 yukkei  staff   366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
-rw-r--r--   1 yukkei  staff   315M May 25 15:29 Homo_sapiens.GRCh38.cdna.ncrna.fa.pickle
drwxr-xr-x   9 yukkei  staff   288B May 25 15:29 human_dataprep
drwxr-xr-x   6 yukkei  staff   192B May 25 15:06 nanopolish
drwxr-xr-x  16 yukkei  staff   512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh human_dataprep 
total 382408
-rw-r--r--  1 yukkei  staff   141B May 25 15:29 data.index
-rw-r--r--  1 yukkei  staff   953K May 25 15:29 data.json
-rw-r--r--  1 yukkei  staff   145B May 25 15:29 data.log
-rw-r--r--  1 yukkei  staff    98B May 25 15:29 data.readcount
-rw-r--r--  1 yukkei  staff   6.3K May 25 15:29 eventalign.index
-rw-r--r--  1 yukkei  staff   142M May 25 15:29 transcript_id_version_merged.gtf
-rw-r--r--  1 yukkei  staff    41M May 25 15:29 transcript_id_version_merged.gtf.pickle
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % xpore-dataprep \     
--eventalign nanopolish/arabidopsis_eventalign.txt \ 
--summary nanopolish/summary.txt \
--out_dir arabidopsis_dataprep \
--genome --gtf_path_or_url Arabidopsis_thaliana.TAIR10.50.gtf --transcript_fasta_paths_or_urls Arabidopsis_thaliana.TAIR10.cdna.all.fa 
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:102: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:51: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh
total 4394912
-rw-------@  1 yukkei  staff   215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--   1 yukkei  staff    10M May 25 15:31 Arabidopsis_thaliana.TAIR10.50.gtf.pickle
-rw-r--r--@  1 yukkei  staff    94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--   1 yukkei  staff    83M May 25 15:31 Arabidopsis_thaliana.TAIR10.cdna.all.fa.pickle
-rw-r--r--@  1 yukkei  staff   1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@  1 yukkei  staff   366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
-rw-r--r--   1 yukkei  staff   315M May 25 15:29 Homo_sapiens.GRCh38.cdna.ncrna.fa.pickle
drwxr-xr-x   7 yukkei  staff   224B May 25 15:31 arabidopsis_dataprep
drwxr-xr-x   9 yukkei  staff   288B May 25 15:29 human_dataprep
drwxr-xr-x   6 yukkei  staff   192B May 25 15:06 nanopolish
drwxr-xr-x  16 yukkei  staff   512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh arabidopsis_dataprep 
total 2000
-rw-r--r--  1 yukkei  staff   106B May 25 15:31 data.index
-rw-r--r--  1 yukkei  staff   979K May 25 15:31 data.json
-rw-r--r--  1 yukkei  staff   109B May 25 15:31 data.log
-rw-r--r--  1 yukkei  staff    62B May 25 15:31 data.readcount
-rw-r--r--  1 yukkei  staff   5.3K May 25 15:31 eventalign.index
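
Both runs print the same two pandas warnings (SettingWithCopyWarning and PerformanceWarning). As a side note, here is a minimal sketch of the usual way to avoid each of them. The DataFrame, its column names, and the filter are made up for illustration and are not taken from dataprep.py, so treat this as a pattern rather than a patch.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one eventalign chunk (column names are assumptions).
chunk = pd.DataFrame({
    "contig": ["tx2", "tx1", "tx1", "tx2"],
    "read_index": [1, 0, 0, 1],
    "line": ["aaaa", "bb", "ccc", "d"],
})

# Taking an explicit copy of the filtered slice makes the later column
# assignment unambiguous, which is what SettingWithCopyWarning asks for.
chunk_split = chunk[chunk["line"].str.len() > 0].copy()
chunk_split["line_length"] = np.array([len(s) for s in chunk_split["line"]])

# Sorting the MultiIndex lets .loc use a fully lexsorted lookup, which is
# the situation the "indexing past lexsort depth" warning points at.
eventalign_result = chunk_split.set_index(["contig", "read_index"]).sort_index()

# Select all rows for one (contig, read_index) key and sum their line lengths.
total = eventalign_result.loc[("tx1", 0), "line_length"].sum()
print(total)
```

Whether either change is worth making in dataprep.py depends on how `chunk_split` and `eventalign_result` are actually built there.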

I also compared the p-value rankings obtained with the xpore-dataprep on GoekeLab/xpore and the xpore-dataprep on yuukiiwa/xpore: with the demo test dataset, both versions of dataprep lead to the same xpore-diffmod results.
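
For anyone who wants to repeat this kind of check, a p-value ranking comparison can be sketched roughly as follows. The tables below are tiny made-up examples, and the column names (`id`, `position`, `kmer`, `pval_KO_vs_WT`) are assumptions about the diffmod table layout, so adjust them to the real output.

```python
import csv
import io


def ranked_sites(table_text, pval_column):
    """Return (id, position) keys sorted by ascending p-value.

    Ties are broken by id and position so the ordering is deterministic.
    """
    rows = list(csv.DictReader(io.StringIO(table_text)))
    rows.sort(key=lambda r: (float(r[pval_column]), r["id"], int(r["position"])))
    return [(r["id"], int(r["position"])) for r in rows]


# Made-up tables standing in for the two diffmod outputs (not real results).
upstream = """id,position,kmer,pval_KO_vs_WT
tx1,10,GGACT,0.001
tx1,25,AGACA,0.2
tx2,7,GGACC,0.03
"""
fork = """id,position,kmer,pval_KO_vs_WT
tx2,7,GGACC,0.03
tx1,10,GGACT,0.001
tx1,25,AGACA,0.2
"""

# The row order differs between the files, but the ranking by p-value matches.
same_ranking = ranked_sites(upstream, "pval_KO_vs_WT") == ranked_sites(fork, "pval_KO_vs_WT")
print(same_ranking)
```

Comparing ranked keys instead of raw files keeps the check independent of row order in the output table.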

I have also cross-checked the xpore-diffmod results with and without transcript versions on the demo test dataset; the results are the same.
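
To illustrate what "with transcript versions" means here, the sketch below merges a separate `transcript_version` attribute into `transcript_id` in a GTF attribute string. It is only an illustration of the idea behind `--merge_transcript_id_version`, not the code in dataprep.py; the function name and the example attribute strings are hypothetical.

```python
import re


def merge_id_and_version(attributes):
    """Rewrite transcript_id "X"; transcript_version "N"; so the id becomes "X.N".

    Attribute strings without a transcript_version are returned unchanged.
    """
    tid = re.search(r'transcript_id "([^"]+)"', attributes)
    ver = re.search(r'transcript_version "([^"]+)"', attributes)
    if tid is None or ver is None:
        return attributes
    merged = f'transcript_id "{tid.group(1)}.{ver.group(1)}"'
    return attributes.replace(tid.group(0), merged, 1)


# Illustrative attribute strings in Ensembl GTF style (not copied from the files above).
human = 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; transcript_version "2";'
arabidopsis = 'gene_id "AT1G01010"; transcript_id "AT1G01010.1";'

print(merge_id_and_version(human))
print(merge_id_and_version(arabidopsis))
```

In this sketch the second string is left untouched because it has no separate version attribute, which is the case where no merging is needed.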

@ploy-np merged commit ed1ce6d into GoekeLab:master on Jun 18, 2021