added snpdata, splitting and merging #25
May I suggest using As an example
julia> using CSV, DataFrames
julia> const SNP_INFO_KEYS = [:chromosome, :snpid, :genetic_distance, :position, :allele1, :allele2]
julia> snp_info = categorical!(CSV.read("data/EUR_subset.bim", delim='\t', header=SNP_INFO_KEYS, types=[Int8,String,Float64,Int,String,String]), [:allele1, :allele2])
54051×6 DataFrame
│ Row │ chromosome │ snpid │ genetic_distance │ position │ allele1 │ allele2 │
│ │ Int8 │ String │ Float64 │ Int64 │ Categorical… │ Categorical… │
├───────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1 │ 17 │ rs34151105 │ 0.0 │ 1665 │ T │ C │
│ 2 │ 17 │ rs143500173 │ 0.0 │ 2748 │ T │ A │
│ 3 │ 17 │ rs113560219 │ 0.0 │ 4702 │ T │ C │
│ 4 │ 17 │ rs1882989 │ 5.6e-5 │ 15222 │ G │ A │
│ 5 │ 17 │ rs8069133 │ 0.000499 │ 32311 │ G │ A │
│ 6 │ 17 │ rs112221137 │ 0.000605 │ 36405 │ G │ T │
│ 7 │ 17 │ rs34889101 │ 0.00062 │ 36975 │ A │ C │
│ 8 │ 17 │ rs35840960 │ 0.000668 │ 38827 │ T │ A │
│ 9 │ 17 │ rs144918387 │ 0.000775 │ 42965 │ C │ T │
│ 10 │ 17 │ rs62057022 │ 0.000948 │ 49640 │ G │ A │
│ 11 │ 17 │ rs4890182 │ 0.000949 │ 49663 │ C │ T │
│ 12 │ 17 │ rs1882990 │ 0.001001 │ 51696 │ C │ T │
│ 13 │ 17 │ rs62057050 │ 0.001573 │ 65610 │ G │ T │
│ 14 │ 17 │ rs8081881 │ 0.002141 │ 78176 │ A │ G │
│ 15 │ 17 │ rs11150892 │ 0.002271 │ 80772 │ C │ T │
│ 16 │ 17 │ rs34314694 │ 0.002351 │ 82381 │ C │ T │
│ 17 │ 17 │ rs4890198 │ 0.002392 │ 83196 │ C │ G │
│ 18 │ 17 │ rs182915197 │ 0.002506 │ 85472 │ T │ C │
│ 19 │ 17 │ rs148130198 │ 0.002508 │ 85522 │ C │ T │
@dmbates Thanks for the comment. This looks better. I will update it tomorrow.
@kose-y @dmbates, thanks for the nice contribution! @kose-y Can you add documentation for the
Further thought: When working with Biobank data, the sample size can be up to 1 million. Analysts split the Plink file into regions much smaller than a chromosome, because for large chromosomes like 1 and 21 the Plink files can still be too large for current analysis software to handle. We may need to think about a unified interface for subsetting, splitting, and merging SnpData along either SNPs or individuals.
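To make the region-splitting idea concrete, here is a minimal sketch in plain Julia, assuming SNPs are grouped into fixed-width position windows within each chromosome. The function name `region_indices` and the 10,000 bp window are hypothetical choices for illustration and are not part of this PR or of the package API. The sample values are the first five rows of the .bim listing earlier in this thread.

```julia
# Hypothetical helper (not part of the package): group SNP indices into
# fixed-width position windows within each chromosome.
function region_indices(chromosome::AbstractVector{<:Integer},
                        position::AbstractVector{<:Integer},
                        window::Integer)
    length(chromosome) == length(position) ||
        throw(ArgumentError("chromosome and position must have equal length"))
    regions = Dict{Tuple{Int,Int},Vector{Int}}()
    for i in eachindex(chromosome, position)
        # Key = (chromosome, zero-based window number along that chromosome)
        key = (Int(chromosome[i]), Int(fld(position[i], window)))
        push!(get!(regions, key, Int[]), i)
    end
    return regions
end

# First five SNPs from the .bim listing above (all on chromosome 17).
chromosome = Int8[17, 17, 17, 17, 17]
position   = [1665, 2748, 4702, 15222, 32311]

regions = region_indices(chromosome, position, 10_000)

# Merging is the inverse operation: concatenate the index sets and
# restore the original SNP order.
merged = sort(reduce(vcat, collect(values(regions))))
```

Each value in `regions` is a set of SNP indices that could be used to subset the SNP dimension of the genotype data; the same indexing idea would apply along individuals.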
@Hua-Zhou Sure. I will update it sometime next week.
@kose-y I understand packages do not need to include the
I tried to implement #24.
I'm new to Julia, so I might have done something considered unconventional or inefficient in Julia.