added snpdata, splitting and merging #25

Closed

kose-y
Member

kose-y commented Jan 12, 2019

I tried to implement #24.

I'm new to Julia, so I might have done something considered unconventional or inefficient in Julia.

@dmbates
Collaborator

dmbates commented Jan 14, 2019

May I suggest using CSV.jl or IndexedTables.jl to read the .bim and .fam files? Both are well-maintained packages that allow more flexible specifications for the files to be read than DelimitedFiles does. In this case CSV may be more appropriate if you want to produce a DataFrame.

As an example

julia> using CSV, DataFrames

julia> const SNP_INFO_KEYS = [:chromosome, :snpid, :genetic_distance, :position, :allele1, :allele2]

julia> snp_info = categorical!(CSV.read("data/EUR_subset.bim", delim='\t', header=SNP_INFO_KEYS, types=[Int8,String,Float64,Int,String,String]), [:allele1, :allele2])
54051×6 DataFrame
│ Row   │ chromosome │ snpid       │ genetic_distance │ position │ allele1      │ allele2      │
│       │ Int8       │ String      │ Float64          │ Int64    │ Categorical… │ Categorical… │
├───────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1     │ 17         │ rs34151105  │ 0.0              │ 1665     │ T            │ C            │
│ 2     │ 17         │ rs143500173 │ 0.0              │ 2748     │ T            │ A            │
│ 3     │ 17         │ rs113560219 │ 0.0              │ 4702     │ T            │ C            │
│ 4     │ 17         │ rs1882989   │ 5.6e-5           │ 15222    │ G            │ A            │
│ 5     │ 17         │ rs8069133   │ 0.000499         │ 32311    │ G            │ A            │
│ 6     │ 17         │ rs112221137 │ 0.000605         │ 36405    │ G            │ T            │
│ 7     │ 17         │ rs34889101  │ 0.00062          │ 36975    │ A            │ C            │
│ 8     │ 17         │ rs35840960  │ 0.000668         │ 38827    │ T            │ A            │
│ 9     │ 17         │ rs144918387 │ 0.000775         │ 42965    │ C            │ T            │
│ 10    │ 17         │ rs62057022  │ 0.000948         │ 49640    │ G            │ A            │
│ 11    │ 17         │ rs4890182   │ 0.000949         │ 49663    │ C            │ T            │
│ 12    │ 17         │ rs1882990   │ 0.001001         │ 51696    │ C            │ T            │
│ 13    │ 17         │ rs62057050  │ 0.001573         │ 65610    │ G            │ T            │
│ 14    │ 17         │ rs8081881   │ 0.002141         │ 78176    │ A            │ G            │
│ 15    │ 17         │ rs11150892  │ 0.002271         │ 80772    │ C            │ T            │
│ 16    │ 17         │ rs34314694  │ 0.002351         │ 82381    │ C            │ T            │
│ 17    │ 17         │ rs4890198   │ 0.002392         │ 83196    │ C            │ G            │
│ 18    │ 17         │ rs182915197 │ 0.002506         │ 85472    │ T            │ C            │
│ 19    │ 17         │ rs148130198 │ 0.002508         │ 85522    │ C            │ T            │

@kose-y
Member Author

kose-y commented Jan 16, 2019

@dmbates Thanks for the comment. This looks better. I will update it tomorrow.

@Hua-Zhou
Member

@kose-y @dmbates, thanks for the nice contribution!

@kose-y Can you add documentation for the SnpData type and split-merge functionalities to /docs/SnpArrays.ipynb? Package documentation will be generated from this notebook.

Further thought: when working with Biobank data, the sample size can reach 1 million. Analysts split the Plink files into regions much smaller than a chromosome, because for large chromosomes like 1 and 21 the Plink files can still be too large for current analysis software to handle. We may need to think about a unified interface for subsetting, splitting, and merging SnpData along either SNPs or individuals.
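
A rough sketch of what a single interface for subsetting, splitting, and merging could look like, using only Base Julia. Everything here (`SnpInfo`, `ToySnpData`, `subset`, `split_by_chromosome`, `merge_snps`, and the `rsA`/`rsB`/`rsC` ids) is a hypothetical stand-in for illustration, not the SnpArrays.jl API, and the genotypes are held in a plain dense `Matrix{UInt8}` rather than a memory-mapped Plink file.

```julia
# Hypothetical per-SNP record (a few columns of a .bim row).
struct SnpInfo
    chromosome::Int
    snpid::String
    position::Int
end

# Hypothetical toy container: genotypes are individuals × SNPs.
struct ToySnpData
    genotypes::Matrix{UInt8}
    people::Vector{String}    # one id per row
    snps::Vector{SnpInfo}     # one record per column
end

# One entry point for both axes. `rows` and `cols` may be `:`, a range,
# a vector of indices, or a Bool mask (not a single scalar index).
subset(d::ToySnpData; rows=:, cols=:) =
    ToySnpData(d.genotypes[rows, cols], d.people[rows], d.snps[cols])

# Splitting by chromosome is just repeated subsetting along the SNP axis.
function split_by_chromosome(d::ToySnpData)
    chromosomes = unique(s.chromosome for s in d.snps)
    return Dict(chr => subset(d; cols=findall(s -> s.chromosome == chr, d.snps))
                for chr in chromosomes)
end

# Merging along SNPs requires that all parts describe the same individuals.
function merge_snps(parts::AbstractVector{ToySnpData})
    people = first(parts).people
    all(p -> p.people == people, parts) ||
        throw(ArgumentError("all parts must share the same individuals"))
    genotypes = reduce(hcat, [p.genotypes for p in parts])
    snps = reduce(vcat, [p.snps for p in parts])
    return ToySnpData(genotypes, people, snps)
end

# Made-up example data: 2 individuals, 3 SNPs on chromosomes 17 and 18.
snps = [SnpInfo(17, "rsA", 1665), SnpInfo(17, "rsB", 2748), SnpInfo(18, "rsC", 500)]
data = ToySnpData(UInt8[0 1 2; 2 1 0], ["id1", "id2"], snps)

parts = split_by_chromosome(data)
merged = merge_snps([parts[17], parts[18]])

# A region smaller than a chromosome is the same call with a narrower filter.
in_region = findall(s -> s.chromosome == 17 && 1000 <= s.position <= 2000, data.snps)
region = subset(data; cols=in_region)
```

With this shape, splitting by chromosome, by region, or by a block of individuals all reduce to `subset` with different `rows`/`cols`, and merging is the inverse along one axis. For Biobank-scale files the subsetting would presumably need to work on views or write straight to disk instead of copying, which this sketch does not attempt.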

@kose-y
Member Author

kose-y commented Jan 19, 2019

@Hua-Zhou Sure. I will update it sometime next week.

@Hua-Zhou
Member

@kose-y As I understand it, packages do not need to include the Manifest.toml file; Julia should be able to resolve the manifest from Project.toml alone. Let me know if I'm wrong.
