Commit 4fb21c7

added DNA-Base Counting using External Python function
1 parent 942c86f commit 4fb21c7

1 file changed: +13 -39 lines changed

tutorial/dna-basecount/dna-basecount3.md

Lines changed: 13 additions & 39 deletions
@@ -62,44 +62,18 @@ SparkContext available as sc, HiveContext available as sqlContext.
 u'ATCGATCGATAT'
 ]
 
->>> ones = recs.flatMap(lambda x : [(c,1) for c in list(x)])
->>> ones.collect()
-[
-(u'A', 1),
-(u'T', 1),
-(u'A', 1),
-(u'T', 1),
-(u'C', 1),
-(u'C', 1),
-(u'C', 1),
-(u'C', 1),
-(u'G', 1),
-(u'G', 1),
-(u'G', 1),
-(u'A', 1),
-(u'T', 1),
-(u'A', 1),
-(u'T', 1),
-(u'C', 1),
-(u'G', 1),
-(u'A', 1),
-(u'T', 1),
-(u'C', 1),
-(u'G', 1),
-(u'A', 1),
-(u'T', 1),
-(u'A', 1),
-(u'T', 1)
-]
+>>> basemapper = "/Users/mparsian/spark-1.6.1-bin-hadoop2.6/basemapper.py"
+>>> import basemapper
+>>> basemapper
+<module 'basemapper' from 'basemapper.py'>
+>>>
+>>> recs = sc.textFile('file:////Users/mparsian/zmp/github/pyspark-tutorial/tutorial/dna-basecount/dna_seq.txt')
+>>> rdd = recs.flatMap(basemapper.mapper)
+>>> rdd.collect()
+[(u'A', 3), (u'C', 4), (u'T', 3), (u'G', 3), (u'A', 4), (u'C', 2), (u'T', 4), (u'G', 2)]
+
 >>> baseCount = rdd.reduceByKey(lambda x,y : x+y)
 >>> baseCount.collect()
-[
-(u'A', 7),
-(u'C', 6),
-(u'G', 5),
-(u'T', 7)
-]
->>>
-````
-
-
+[(u'A', 7), (u'C', 6), (u'G', 5), (u'T', 7)]
+>>>
+````
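
The commit calls `basemapper.mapper` but does not include the contents of `basemapper.py`. Below is a minimal sketch of what such a mapper might look like, assuming it returns one `(base, count)` pair per distinct base in a record, which is consistent with the `rdd.collect()` output above. The function body, the `reduce_by_key` helper (a plain-Python stand-in for Spark's `reduceByKey`), and the first sample record (inferred from the removed `ones.collect()` output) are assumptions, not the author's code.

```python
from collections import Counter


def mapper(record):
    """Return a list of (base, count) pairs for one DNA record.

    Hypothetical stand-in for basemapper.mapper: the real file is not
    part of this commit, so this only mirrors the output shape shown
    in the diff.
    """
    counts = Counter(record.strip())
    return list(counts.items())


def reduce_by_key(pairs):
    """Sum counts per base; a local imitation of rdd.reduceByKey(lambda x, y: x + y)."""
    totals = Counter()
    for base, n in pairs:
        totals[base] += n
    return dict(totals)


# Sample records: the second appears in the diff context; the first is
# inferred from the removed per-character output (an assumption).
records = ["ATATCCCCGGGAT", "ATCGATCGATAT"]

# flatMap equivalent: concatenate the per-record pair lists.
pairs = [pair for rec in records for pair in mapper(rec)]
totals = reduce_by_key(pairs)

print(sorted(totals.items()))
# [('A', 7), ('C', 6), ('G', 5), ('T', 7)]
```

Emitting one pair per distinct base per record, rather than one `(c, 1)` pair per character as in the removed lines, means the same reduction step has far fewer pairs to combine.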
