added DNA base count
pyspark-in-action committed Jan 29, 2016
1 parent 2cd208a commit 85c037d
Showing 3 changed files with 165 additions and 0 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -11,6 +11,8 @@ to be used for production environment.

PySpark Examples and Tutorials
==============================
* [DNA Base Counting Without In-Mapper Combiner](./tutorial/dna-basecount.md)
* [DNA Base Counting With In-Mapper Combiner](./tutorial/dna-basecount2.md)
* [Classic Word Count](./tutorial/wordcount)
* [Find Frequency of Bigrams](./tutorial/bigrams)
* [Join of Two Relations R(K, V<sub>1</sub>), S(K, V<sub>2</sub>)](./tutorial/basic-join)
@@ -28,6 +30,13 @@ PySpark Examples and Tutorials
[How to Minimize the Verbosity of Spark](./howto/minimize_verbosity.md)
=======================================================================

More PySpark Tutorial References...
===================================
* [Getting started with PySpark - Part 1](http://www.mccarroll.net/blog/pyspark/)
* [Getting started with PySpark - Part 2](http://www.mccarroll.net/blog/pyspark2/index.html)
* [A really really fast introduction to PySpark](http://www.slideshare.net/hkarau/a-really-really-fast-introduction-to-py-spark-lightning-fast-cluster-computing-with-python-1)
* [PySpark](http://www.slideshare.net/thegiivee/pysaprk?qid=81cf1b31-8b19-4570-89a5-21d03cad6ecd&v=default&b=&from_search=9)

Questions/Comments
==================
* [View Mahmoud Parsian's profile on LinkedIn](http://www.linkedin.com/in/mahmoudparsian)
80 changes: 80 additions & 0 deletions tutorial/dna-basecount.md
@@ -0,0 +1,80 @@
DNA Base Counting using PySpark
===============================

DNA Base Count Definition
-------------------------
[DNA Base Counting is defined here.](https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch24.html)

Solution in PySpark
-------------------
This solution assumes that each record is a DNA sequence.
It emits a ````(base, 1)```` pair for every base in
a given sequence and then sums the counts for each
unique base.


````
$ cat /Users/mparsian/dna_seq.txt
ATATCCCCGGGAT
ATCGATCGATAT
# ./bin/pyspark
Python 2.7.10 (default, Aug 22 2015, 20:33:39)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
SparkContext available as sc, HiveContext available as sqlContext.
>>> recs = sc.textFile('file:///Users/mparsian/dna_seq.txt')
>>> recs.collect()
[
u'ATATCCCCGGGAT',
u'ATCGATCGATAT'
]
>>> ones = recs.flatMap(lambda x : [(c,1) for c in list(x)])
>>> ones.collect()
[
(u'A', 1),
(u'T', 1),
(u'A', 1),
(u'T', 1),
(u'C', 1),
(u'C', 1),
(u'C', 1),
(u'C', 1),
(u'G', 1),
(u'G', 1),
(u'G', 1),
(u'A', 1),
(u'T', 1),
(u'A', 1),
(u'T', 1),
(u'C', 1),
(u'G', 1),
(u'A', 1),
(u'T', 1),
(u'C', 1),
(u'G', 1),
(u'A', 1),
(u'T', 1),
(u'A', 1),
(u'T', 1)
]
>>> baseCount = ones.reduceByKey(lambda x,y : x+y)
>>> baseCount.collect()
[
(u'A', 7),
(u'C', 6),
(u'G', 5),
(u'T', 7)
]
>>>
````
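
The counts above can be sanity-checked without a Spark installation. The following is a minimal plain-Python sketch, not the PySpark API: `emit_ones` and `reduce_by_key` are hypothetical helper names that only mimic what `flatMap` and `reduceByKey` do in the session, applied to the two sample sequences held in a local list.

```python
from collections import defaultdict

# The two sample sequences from dna_seq.txt (assumed to be held in memory here).
sequences = ["ATATCCCCGGGAT", "ATCGATCGATAT"]


def emit_ones(seq):
    """Mimic the flatMap step: one (base, 1) pair per character."""
    return [(base, 1) for base in seq]


def reduce_by_key(pairs):
    """Mimic reduceByKey(lambda x, y: x + y) on a local list of pairs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


ones = [pair for seq in sequences for pair in emit_ones(seq)]
base_count = reduce_by_key(ones)

print(len(ones))                   # 25 pairs, one per base
print(sorted(base_count.items()))  # [('A', 7), ('C', 6), ('G', 5), ('T', 7)]
```

Every base produces its own pair, so the number of intermediate pairs equals the total number of bases in the input (25 for this sample).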


76 changes: 76 additions & 0 deletions tutorial/dna-basecount2.md
@@ -0,0 +1,76 @@
DNA Base Counting using PySpark with an In-Mapper Combiner
==========================================================

DNA Base Count Definition
-------------------------
[DNA Base Counting is defined here.](https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch24.html)

Solution in PySpark
-------------------
This solution assumes that each record is a DNA sequence.
It uses the "In-Mapper Combiner" design pattern: the bases
of each sequence are counted locally before the final
aggregation of all frequencies for unique bases.


````
$ cat /Users/mparsian/dna_seq.txt
ATATCCCCGGGAT
ATCGATCGATAT
# ./bin/pyspark
Python 2.7.10 (default, Aug 22 2015, 20:33:39)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
SparkContext available as sc, HiveContext available as sqlContext.
>>> recs = sc.textFile('file:///Users/mparsian/dna_seq.txt')
>>> recs.collect()
[
u'ATATCCCCGGGAT',
u'ATCGATCGATAT'
]
>>> def mapper(seq):
...     # count each base within a single sequence
...     freq = dict()
...     for x in list(seq):
...         if x in freq:
...             freq[x] += 1
...         else:
...             freq[x] = 1
...     # emit one (base, count) pair per distinct base
...     kv = [(x, freq[x]) for x in freq]
...     return kv
...
>>> rdd = recs.flatMap(mapper)
>>> rdd.collect()
[
(u'A', 3),
(u'C', 4),
(u'T', 3),
(u'G', 3),
(u'A', 4),
(u'C', 2),
(u'T', 4),
(u'G', 2)
]
>>> baseCount = rdd.reduceByKey(lambda x,y : x+y)
>>> baseCount.collect()
[
(u'A', 7),
(u'C', 6),
(u'G', 5),
(u'T', 7)
]
>>>
````
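
To see what the in-mapper combiner saves, the same logic can be sketched in plain Python without Spark. This is an illustration under assumptions, not the PySpark API: the sequences are held in a local list, and `collections.Counter` stands in for the hand-written dictionary in `mapper`.

```python
from collections import Counter

# The two sample sequences from dna_seq.txt (assumed to be held in memory here).
sequences = ["ATATCCCCGGGAT", "ATCGATCGATAT"]


def mapper(seq):
    """In-mapper combining: count bases within one sequence before emitting."""
    return list(Counter(seq).items())


# Intermediate pairs, at most one per distinct base per sequence.
combined = [pair for seq in sequences for pair in mapper(seq)]

# Final aggregation across sequences, standing in for reduceByKey.
totals = Counter()
for base, count in combined:
    totals[base] += count

print(len(combined))           # 8 pairs instead of 25
print(sorted(totals.items()))  # [('A', 7), ('C', 6), ('G', 5), ('T', 7)]
```

For this sample the mapper emits 8 intermediate pairs rather than one pair per base (25), while the final totals are unchanged.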

