added snpdata, splitting and merging #25

kose-y · 2019-01-12T05:13:55Z

I tried to implement #24.

I'm new to Julia, so I might have done something considered unconventional or inefficient in Julia.

dmbates · 2019-01-14T17:33:58Z

May I suggest using CSV.jl or IndexedTables.jl to read the .bim and .fam files? Both are well-maintained packages that allow more flexible specifications for the files to be read than DelimitedFiles does. In this case CSV.jl may be more appropriate if you want to produce a DataFrame.

As an example:

julia> using CSV, DataFrames

julia> const SNP_INFO_KEYS = [:chromosome, :snpid, :genetic_distance, :position, :allele1, :allele2]

julia> snp_info = categorical!(CSV.read("data/EUR_subset.bim", delim='\t', header=SNP_INFO_KEYS, types=[Int8,String,Float64,Int,String,String]), [:allele1, :allele2])
54051×6 DataFrame
│ Row   │ chromosome │ snpid       │ genetic_distance │ position │ allele1      │ allele2      │
│       │ Int8       │ String      │ Float64          │ Int64    │ Categorical… │ Categorical… │
├───────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1     │ 17         │ rs34151105  │ 0.0              │ 1665     │ T            │ C            │
│ 2     │ 17         │ rs143500173 │ 0.0              │ 2748     │ T            │ A            │
│ 3     │ 17         │ rs113560219 │ 0.0              │ 4702     │ T            │ C            │
│ 4     │ 17         │ rs1882989   │ 5.6e-5           │ 15222    │ G            │ A            │
│ 5     │ 17         │ rs8069133   │ 0.000499         │ 32311    │ G            │ A            │
│ 6     │ 17         │ rs112221137 │ 0.000605         │ 36405    │ G            │ T            │
│ 7     │ 17         │ rs34889101  │ 0.00062          │ 36975    │ A            │ C            │
│ 8     │ 17         │ rs35840960  │ 0.000668         │ 38827    │ T            │ A            │
│ 9     │ 17         │ rs144918387 │ 0.000775         │ 42965    │ C            │ T            │
│ 10    │ 17         │ rs62057022  │ 0.000948         │ 49640    │ G            │ A            │
│ 11    │ 17         │ rs4890182   │ 0.000949         │ 49663    │ C            │ T            │
│ 12    │ 17         │ rs1882990   │ 0.001001         │ 51696    │ C            │ T            │
│ 13    │ 17         │ rs62057050  │ 0.001573         │ 65610    │ G            │ T            │
│ 14    │ 17         │ rs8081881   │ 0.002141         │ 78176    │ A            │ G            │
│ 15    │ 17         │ rs11150892  │ 0.002271         │ 80772    │ C            │ T            │
│ 16    │ 17         │ rs34314694  │ 0.002351         │ 82381    │ C            │ T            │
│ 17    │ 17         │ rs4890198   │ 0.002392         │ 83196    │ C            │ G            │
│ 18    │ 17         │ rs182915197 │ 0.002506         │ 85472    │ T            │ C            │
│ 19    │ 17         │ rs148130198 │ 0.002508         │ 85522    │ C            │ T            │
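The same approach should carry over to the .fam file mentioned above. The following is a minimal sketch, not code from this PR: the column symbols in `PERSON_INFO_KEYS` are illustrative names for the six PLINK .fam fields, the file is a two-row toy written to a temporary directory, and the call uses the `CSV.read(path, DataFrame; ...)` form that newer CSV.jl releases expect, which differs from the call shown in the example above. Reading the phenotype as `Float64` is an assumption that holds only when that column is numeric.

```julia
using CSV, DataFrames

# Illustrative names for the six PLINK .fam columns (not taken from the PR).
const PERSON_INFO_KEYS = [:fid, :iid, :father, :mother, :sex, :phenotype]

# Write a tiny space-delimited .fam file so the sketch is self-contained.
fam_path = joinpath(mktempdir(), "example.fam")
write(fam_path, "F1 I1 0 0 1 -9\nF1 I2 0 0 2 -9\n")

# Passing `header` as a vector of names means the first line is read as data.
person_info = CSV.read(fam_path, DataFrame;
    delim=' ',
    header=PERSON_INFO_KEYS,
    types=[String, String, String, String, Int, Float64])
```

Keeping the ID columns as `String` avoids misreading numeric-looking identifiers as integers.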

kose-y · 2019-01-16T14:58:18Z

@dmbates Thanks for the comment. This looks better. I will update it tomorrow.

Hua-Zhou · 2019-01-18T20:37:52Z

@kose-y @dmbates, thanks for the nice contribution!

@kose-y Can you add documentation for the SnpData type and split-merge functionalities to /docs/SnpArrays.ipynb? Package documentation will be generated from this notebook.

Further thought: when working with Biobank data, the sample size can be up to 1 million. Analysts split the Plink files into regions much smaller than a chromosome, because for large chromosomes like 1 and 21 the Plink files can still be too large for current analysis software to handle. We may need to think about a unified interface for subsetting, splitting, and merging SnpData along either SNPs or individuals.
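One way to picture such an interface is a single subsetting function that takes indices for both axes, with splitting and merging built on top of it. The sketch below is purely illustrative and is not the SnpArrays.jl API: `ToySnpData`, `toy_subset`, `split_snps`, and `merge_snps` are hypothetical names, and a plain in-memory `Matrix{UInt8}` (individuals × SNPs) stands in for the packed Plink genotype storage.

```julia
# Hypothetical stand-in for SnpData: rows are individuals, columns are SNPs.
struct ToySnpData
    genotypes::Matrix{UInt8}
    person_ids::Vector{String}
    snp_ids::Vector{String}
end

# One entry point for subsetting along either axis.
# `people` and `snps` may be ranges, index vectors, or Bool masks.
function toy_subset(d::ToySnpData; people=Colon(), snps=Colon())
    ToySnpData(d.genotypes[people, snps], d.person_ids[people], d.snp_ids[snps])
end

# Split along SNPs into consecutive chunks of at most `chunk` columns.
function split_snps(d::ToySnpData, chunk::Integer)
    [toy_subset(d; snps=r) for r in Iterators.partition(1:length(d.snp_ids), chunk)]
end

# Merge along SNPs; all parts must describe the same individuals in the same order.
function merge_snps(parts::Vector{ToySnpData})
    ids = parts[1].person_ids
    all(p.person_ids == ids for p in parts) || error("individuals differ between parts")
    ToySnpData(reduce(hcat, [p.genotypes for p in parts]),
               ids,
               reduce(vcat, [p.snp_ids for p in parts]))
end

# Round trip on a 2-individual, 4-SNP toy data set.
d = ToySnpData(UInt8[0 1 2 3; 3 2 1 0], ["p1", "p2"], ["s1", "s2", "s3", "s4"])
parts = split_snps(d, 3)
merged = merge_snps(parts)
```

Splitting and merging along individuals would mirror these two functions with `people=` and `vcat` on the genotype matrix.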

kose-y · 2019-01-19T06:11:59Z

@Hua-Zhou Sure. I will update it sometime next week.

Hua-Zhou · 2019-01-23T04:06:34Z

@kose-y My understanding is that packages do not need to include the Manifest.toml file. Julia should be able to generate the manifest from Project.toml alone. Let me know if I'm wrong.

added snpdata, splitting and merging

a9465f3

CSV.jl for SnpData parsing

895434b

kose-y mentioned this pull request Jan 19, 2019

Further subsetting/splitting/merging support #29

Closed

Hua-Zhou mentioned this pull request Jan 23, 2019

Generalization of linear algebra-related routines #28

Merged

Hua-Zhou closed this Jan 23, 2019

kose-y mentioned this pull request Jan 23, 2019

Juliav0.7 snpdata #30

Merged
