Commit 942c86f

added DNA-Base Counting using External Python function
1 parent 277c522 commit 942c86f

4 files changed, +124 -0 lines changed

tutorial/dna-basecount/README.md

Lines changed: 3 additions & 0 deletions
@@ -1,4 +1,7 @@
 DNA Base Counting
 =================
 * [DNA Base Counting Without In-Mapper Combiner](./dna-basecount.md)
+
 * [DNA Base Counting With In-Mapper Combiner](./dna-basecount2.md)
+
+* [DNA Base Counting With External Python Function](./dna-basecount3.md)
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
#!/usr/bin/python

def mapper(seq):
    freq = dict()
    for x in list(seq):
        if x in freq:
            freq[x] += 1
        else:
            freq[x] = 1
    # emit one (base, count) pair per distinct base in this sequence
    kv = [(x, freq[x]) for x in freq]
    return kv
#
#print mapper("ATCGATCGATAT")
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
DNA Base Counting using PySpark
===============================

DNA Base Count Definition
-------------------------
[DNA Base Counting is defined here.](https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch24.html)

Solution in PySpark
-------------------
This solution assumes that each record is a DNA sequence.
It emits a ````(base, 1)```` pair for every base in
a given sequence and then sums the counts for each
unique base. The map step can also be delegated to an
external Python function, defined in ````basemapper.py````,
which counts bases within one sequence and returns a
````(base, count)```` pair per distinct base.

* Define Python Function

````
$ export SPARK_HOME=/home/mparsian/spark-1.6.1-bin-hadoop2.6
$ cat $SPARK_HOME/basemapper.py
#!/usr/bin/python

def mapper(seq):
    freq = dict()
    for x in list(seq):
        if x in freq:
            freq[x] += 1
        else:
            freq[x] = 1
    # emit one (base, count) pair per distinct base in this sequence
    kv = [(x, freq[x]) for x in freq]
    return kv
#
#for testing:
#print mapper("ATCGATCGATAT")
````
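The commented-out test line above uses the Python 2 `print` statement. As a quick local check that does not need Spark, the sketch below restates `mapper` so that it also runs under Python 3 and calls it on the same test string. The `sorted` call and the `pairs` variable are additions for readability, since the order of the returned pairs depends on dictionary ordering.

```python
def mapper(seq):
    """Return one (base, count) pair per distinct base in seq."""
    freq = dict()
    for x in seq:
        if x in freq:
            freq[x] += 1
        else:
            freq[x] = 1
    return [(x, freq[x]) for x in freq]

# Sort the pairs so the printed order does not depend on dict ordering.
pairs = sorted(mapper("ATCGATCGATAT"))
print(pairs)
# [('A', 4), ('C', 2), ('G', 2), ('T', 4)]
```

The counts add up to 12, the length of the test sequence.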
* Define Very Basic Sample Input

````
$ cat /home/mparsian/dna_seq.txt
ATATCCCCGGGAT
ATCGATCGATAT
````

* Sample PySpark Run

````
# ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

SparkContext available as sc, HiveContext available as sqlContext.
>>> recs = sc.textFile('file:///home/mparsian/dna_seq.txt')

>>> recs.collect()
[
 u'ATATCCCCGGGAT',
 u'ATCGATCGATAT'
]

>>> ones = recs.flatMap(lambda x : [(c,1) for c in list(x)])
>>> ones.collect()
[
 (u'A', 1),
 (u'T', 1),
 (u'A', 1),
 (u'T', 1),
 (u'C', 1),
 (u'C', 1),
 (u'C', 1),
 (u'C', 1),
 (u'G', 1),
 (u'G', 1),
 (u'G', 1),
 (u'A', 1),
 (u'T', 1),
 (u'A', 1),
 (u'T', 1),
 (u'C', 1),
 (u'G', 1),
 (u'A', 1),
 (u'T', 1),
 (u'C', 1),
 (u'G', 1),
 (u'A', 1),
 (u'T', 1),
 (u'A', 1),
 (u'T', 1)
]
>>> baseCount = ones.reduceByKey(lambda x,y : x+y)
>>> baseCount.collect()
[
 (u'A', 7),
 (u'C', 6),
 (u'G', 5),
 (u'T', 7)
]
>>>
````
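The session above builds its pairs with an inline lambda and never calls `basemapper.mapper`. The following is a minimal sketch of how the external function might be plugged into the same pipeline. It simulates `flatMap` and `reduceByKey` on plain Python lists so that it runs without Spark; the helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins, not Spark APIs. In a real PySpark shell the equivalent would presumably be to ship the module with `sc.addPyFile(...)`, `import basemapper`, and then call `recs.flatMap(basemapper.mapper).reduceByKey(lambda x, y: x + y)`; that step is an assumption and was not part of the original run.

```python
def mapper(seq):
    # Same logic as basemapper.mapper: one (base, count) pair per distinct base.
    freq = {}
    for x in seq:
        freq[x] = freq.get(x, 0) + 1
    return list(freq.items())

def flat_map(func, records):
    # Hypothetical local stand-in for RDD.flatMap: apply func, flatten results.
    out = []
    for rec in records:
        out.extend(func(rec))
    return out

def reduce_by_key(func, pairs):
    # Hypothetical local stand-in for RDD.reduceByKey: fold values per key.
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return sorted(acc.items())

recs = ["ATATCCCCGGGAT", "ATCGATCGATAT"]
partial = flat_map(mapper, recs)
base_count = reduce_by_key(lambda x, y: x + y, partial)

print(len(partial))  # 8
print(base_count)    # [('A', 7), ('C', 6), ('G', 5), ('T', 7)]
```

Because each sequence is pre-aggregated, the reduce step sees 8 pairs for this input instead of the 25 `(base, 1)` pairs shown in the session, while the final counts are unchanged.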

tutorial/dna-basecount/dna_seq.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
ATATCCCCGGGAT
ATCGATCGATAT