added DNA-Base Counting using External Python function
pyspark-in-action committed Apr 10, 2016
1 parent 277c522 commit 942c86f
Showing 4 changed files with 124 additions and 0 deletions.
3 changes: 3 additions & 0 deletions tutorial/dna-basecount/README.md
@@ -1,4 +1,7 @@
DNA Base Counting
=================
* [DNA Base Counting Without In-Mapper Combiner](./dna-basecount.md)

* [DNA Base Counting With In-Mapper Combiner](./dna-basecount2.md)

* [DNA Base Counting With External Python Function](./dna-basecount3.md)
14 changes: 14 additions & 0 deletions tutorial/dna-basecount/basemapper.py
@@ -0,0 +1,14 @@
#!/usr/bin/python

def mapper(seq):
    freq = dict()
    for x in list(seq):
        if x in freq:
            freq[x] += 1
        else:
            freq[x] = 1
    #
    kv = [(x, freq[x]) for x in freq]
    return kv
#
#print mapper("ATCGATCGATAT")
105 changes: 105 additions & 0 deletions tutorial/dna-basecount/dna-basecount3.md
@@ -0,0 +1,105 @@
DNA Base Counting using PySpark
===============================

DNA Base Count Definition
-------------------------
[DNA Base Counting is defined here.](https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch24.html)

Solution in PySpark
-------------------
This solution assumes that each record is a DNA sequence.
It emits a ````(base, 1)```` pair for every base in a given
sequence and then sums the counts for each unique base.
The solution uses an external Python function defined in
````basemapper.py````.

* Define Python Function

````
$ export SPARK_HOME=/home/mparsian/spark-1.6.1-bin-hadoop2.6
$ cat $SPARK_HOME/basemapper.py
#!/usr/bin/python
def mapper(seq):
    freq = dict()
    for x in list(seq):
        if x in freq:
            freq[x] += 1
        else:
            freq[x] = 1
    #
    kv = [(x, freq[x]) for x in freq]
    return kv
#
#for testing:
#print mapper("ATCGATCGATAT")
````
* Define Very Basic Sample Input

````
$ cat /home/mparsian/dna_seq.txt
ATATCCCCGGGAT
ATCGATCGATAT
````

* Sample PySpark Run

````
$ ./bin/pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
SparkContext available as sc, HiveContext available as sqlContext.
>>> recs = sc.textFile('file:///home/mparsian/dna_seq.txt')
>>> recs.collect()
[
u'ATATCCCCGGGAT',
u'ATCGATCGATAT'
]
>>> ones = recs.flatMap(lambda x : [(c,1) for c in list(x)])
>>> ones.collect()
[
(u'A', 1),
(u'T', 1),
(u'A', 1),
(u'T', 1),
(u'C', 1),
(u'C', 1),
(u'C', 1),
(u'C', 1),
(u'G', 1),
(u'G', 1),
(u'G', 1),
(u'A', 1),
(u'T', 1),
(u'A', 1),
(u'T', 1),
(u'C', 1),
(u'G', 1),
(u'A', 1),
(u'T', 1),
(u'C', 1),
(u'G', 1),
(u'A', 1),
(u'T', 1),
(u'A', 1),
(u'T', 1)
]
>>> baseCount = ones.reduceByKey(lambda x,y : x+y)
>>> baseCount.collect()
[
(u'A', 7),
(u'C', 6),
(u'G', 5),
(u'T', 7)
]
>>>
````
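
* Using the External Function (Sketch)

The session above builds its pairs with an inline lambda. The sketch below shows how `mapper` from `basemapper.py` could take that role instead. It is a plain-Python stand-in, not a Spark run: `count_bases` is a hypothetical helper that imitates `flatMap` followed by `reduceByKey`, and `mapper` is re-implemented here with `collections.Counter` so the block is self-contained. The PySpark calls in the trailing comment are an assumption about how the function would be wired in; the file path passed to `sc.addPyFile` depends on your setup.

```python
from collections import Counter


def mapper(seq):
    """Return (base, count) pairs for one DNA sequence.

    Same contract as mapper() in basemapper.py, rewritten with Counter
    so this sketch does not need the external file.
    """
    return list(Counter(seq).items())


def count_bases(records):
    """Hypothetical local stand-in for
    recs.flatMap(mapper).reduceByKey(lambda x, y: x + y)."""
    totals = Counter()
    for seq in records:
        # flatMap step: each record yields several (base, count) pairs
        for base, n in mapper(seq):
            # reduceByKey step: sum the counts that share a base
            totals[base] += n
    return sorted(totals.items())


records = ["ATATCCCCGGGAT", "ATCGATCGATAT"]
base_count = count_bases(records)
print(base_count)

# Assumed PySpark equivalent (not run here):
#   sc.addPyFile('basemapper.py')
#   from basemapper import mapper
#   recs.flatMap(mapper).reduceByKey(lambda x, y: x + y).collect()
```

Because `mapper` pre-aggregates within each sequence, it emits at most one pair per distinct base per record, rather than one pair per character, so fewer pairs reach the reduce step while the totals stay the same.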


2 changes: 2 additions & 0 deletions tutorial/dna-basecount/dna_seq.txt
@@ -0,0 +1,2 @@
ATATCCCCGGGAT
ATCGATCGATAT
