Feature/scala code/ch02 biman #8

Merged
merged 43 commits into from
Jan 7, 2022
75e693e
added kmers for FASTA and FASTQ formats
mahmoudparsian Dec 27, 2021
6ec2ea2
DNA Based count in scala
deepakmca05 Dec 27, 2021
e1ceb10
Indentation fix
deepakmca05 Dec 27, 2021
863b527
improved documentation
mahmoudparsian Dec 27, 2021
e8c6d28
ch02
deepakmca05 Dec 29, 2021
606ad70
Feature/scala code/ch01 (#5)
deepakmca05 Dec 29, 2021
ccfcd22
updated README.md
mahmoudparsian Dec 29, 2021
f4082db
updated README.md
mahmoudparsian Dec 29, 2021
077af38
ch02-changes
deepakmca05 Dec 30, 2021
01df972
ch02-changes
deepakmca05 Dec 30, 2021
c3bbf35
Feature/scala code/ch01 missing class gradle (#7)
deepakmca05 Dec 30, 2021
4c96c7c
added bonus chapter correlation
mahmoudparsian Dec 30, 2021
db3647b
added bonus chapter correlation
mahmoudparsian Dec 30, 2021
1e18e86
updated docs
mahmoudparsian Dec 30, 2021
fa2c1ab
updated docs
mahmoudparsian Dec 30, 2021
7669234
updated docs
mahmoudparsian Dec 30, 2021
a70edfa
updated docs
mahmoudparsian Dec 30, 2021
8207fcf
updated docs
mahmoudparsian Dec 30, 2021
a717c9a
updated docs
mahmoudparsian Dec 30, 2021
a348f11
updated docs
mahmoudparsian Dec 30, 2021
3da5bd5
updated docs
mahmoudparsian Dec 30, 2021
5358172
updated docs
mahmoudparsian Dec 30, 2021
b343f4d
updated docs
mahmoudparsian Dec 30, 2021
fa9eb2a
improved documentation
mahmoudparsian Dec 31, 2021
b36a229
improved documentation
mahmoudparsian Dec 31, 2021
bf2a4ba
improved documentation
mahmoudparsian Dec 31, 2021
4e3b63e
improved documentation
mahmoudparsian Dec 31, 2021
f24f095
improved documentation
mahmoudparsian Dec 31, 2021
dc1e22e
improved documentation
mahmoudparsian Dec 31, 2021
a02276a
improved documentation
mahmoudparsian Dec 31, 2021
4755a19
improved documentation
mahmoudparsian Dec 31, 2021
9d12125
improved documentation
mahmoudparsian Dec 31, 2021
ecc2cb5
improved documentation
mahmoudparsian Dec 31, 2021
efcf612
improved documentation
mahmoudparsian Dec 31, 2021
067596d
improved documentation
mahmoudparsian Dec 31, 2021
cb4048c
improved documentation
mahmoudparsian Dec 31, 2021
c8ef9b9
improved documentation
mahmoudparsian Jan 1, 2022
f6747e9
DNABaseCountFastq
bimanmandal Jan 1, 2022
9f9353c
resolved merge conflict
bimanmandal Jan 1, 2022
d7a1116
added the code changes for chapter 2
bimanmandal Jan 7, 2022
2b59199
added the run_spark_applications_scripts
bimanmandal Jan 7, 2022
6cd297f
added the conditions for 1GB data
bimanmandal Jan 7, 2022
37b2eaf
added the readme file
bimanmandal Jan 7, 2022
DNA Based count in scala
deepakmca05 committed Dec 27, 2021
commit 6ec2ea2bc8638bc2f52b3c506d63358d55475b79
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
/code/chap02/scala/.idea/
.DS_Store
.idea
build
gradle
2 changes: 2 additions & 0 deletions code/chap01/scala/.gradle/buildOutputCleanup/cache.properties
@@ -0,0 +1,2 @@
#Sat Dec 25 20:45:37 IST 2021
gradle.version=6.8
Binary file not shown.
Binary file added code/chap01/scala/.gradle/checksums/checksums.lock
2 changes: 2 additions & 0 deletions code/chap02/scala/.gradle/buildOutputCleanup/cache.properties
@@ -0,0 +1,2 @@
#Mon Dec 27 15:39:02 IST 2021
gradle.version=6.8
Binary file not shown.
Binary file added code/chap02/scala/.gradle/checksums/checksums.lock
21 changes: 21 additions & 0 deletions code/chap02/scala/build.gradle
@@ -0,0 +1,21 @@
apply plugin: 'scala'

sourceCompatibility = JavaVersion.VERSION_1_8
targetCompatibility = JavaVersion.VERSION_1_8

ext.scalaClassifier = '2.13'
ext.scalaVersion = '2.13.7'

group 'com.spark.algos.data'
version '1.0-SNAPSHOT'

repositories {
    mavenLocal()
    mavenCentral()
}

dependencies {
    implementation group: "org.scala-lang", name: "scala-library", version: "2.13.7"
    implementation group: "org.apache.spark", name: "spark-core_2.13", version: "3.2.0"
    implementation group: "org.apache.spark", name: "spark-sql_2.13", version: "3.2.0"
}
185 changes: 185 additions & 0 deletions code/chap02/scala/gradlew
@@ -0,0 +1,185 @@
#!/usr/bin/env sh

#
# Copyright 2015 the original author or authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

##############################################################################
##
## Gradle start up script for UN*X
##
##############################################################################

# Attempt to set APP_HOME
# Resolve links: $0 may be a link
PRG="$0"
# Need this for relative symlinks.
while [ -h "$PRG" ] ; do
ls=`ls -ld "$PRG"`
link=`expr "$ls" : '.*-> \(.*\)$'`
if expr "$link" : '/.*' > /dev/null; then
PRG="$link"
else
PRG=`dirname "$PRG"`"/$link"
fi
done
SAVED="`pwd`"
cd "`dirname \"$PRG\"`/" >/dev/null
APP_HOME="`pwd -P`"
cd "$SAVED" >/dev/null

APP_NAME="Gradle"
APP_BASE_NAME=`basename "$0"`

# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'

# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD="maximum"

warn () {
echo "$*"
}

die () {
echo
echo "$*"
echo
exit 1
}

# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "`uname`" in
CYGWIN* )
cygwin=true
;;
Darwin* )
darwin=true
;;
MINGW* )
msys=true
;;
NONSTOP* )
nonstop=true
;;
esac

CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar


# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
# IBM's JDK on AIX uses strange locations for the executables
JAVACMD="$JAVA_HOME/jre/sh/java"
else
JAVACMD="$JAVA_HOME/bin/java"
fi
if [ ! -x "$JAVACMD" ] ; then
die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
else
JAVACMD="java"
which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi

# Increase the maximum file descriptors if we can.
if [ "$cygwin" = "false" -a "$darwin" = "false" -a "$nonstop" = "false" ] ; then
MAX_FD_LIMIT=`ulimit -H -n`
if [ $? -eq 0 ] ; then
if [ "$MAX_FD" = "maximum" -o "$MAX_FD" = "max" ] ; then
MAX_FD="$MAX_FD_LIMIT"
fi
ulimit -n $MAX_FD
if [ $? -ne 0 ] ; then
warn "Could not set maximum file descriptor limit: $MAX_FD"
fi
else
warn "Could not query maximum file descriptor limit: $MAX_FD_LIMIT"
fi
fi

# For Darwin, add options to specify how the application appears in the dock
if $darwin; then
GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\""
fi

# For Cygwin or MSYS, switch paths to Windows format before running java
if [ "$cygwin" = "true" -o "$msys" = "true" ] ; then
APP_HOME=`cygpath --path --mixed "$APP_HOME"`
CLASSPATH=`cygpath --path --mixed "$CLASSPATH"`

JAVACMD=`cygpath --unix "$JAVACMD"`

# We build the pattern for arguments to be converted via cygpath
ROOTDIRSRAW=`find -L / -maxdepth 1 -mindepth 1 -type d 2>/dev/null`
SEP=""
for dir in $ROOTDIRSRAW ; do
ROOTDIRS="$ROOTDIRS$SEP$dir"
SEP="|"
done
OURCYGPATTERN="(^($ROOTDIRS))"
# Add a user-defined pattern to the cygpath arguments
if [ "$GRADLE_CYGPATTERN" != "" ] ; then
OURCYGPATTERN="$OURCYGPATTERN|($GRADLE_CYGPATTERN)"
fi
# Now convert the arguments - kludge to limit ourselves to /bin/sh
i=0
for arg in "$@" ; do
CHECK=`echo "$arg"|egrep -c "$OURCYGPATTERN" -`
CHECK2=`echo "$arg"|egrep -c "^-"` ### Determine if an option

if [ $CHECK -ne 0 ] && [ $CHECK2 -eq 0 ] ; then ### Added a condition
eval `echo args$i`=`cygpath --path --ignore --mixed "$arg"`
else
eval `echo args$i`="\"$arg\""
fi
i=`expr $i + 1`
done
case $i in
0) set -- ;;
1) set -- "$args0" ;;
2) set -- "$args0" "$args1" ;;
3) set -- "$args0" "$args1" "$args2" ;;
4) set -- "$args0" "$args1" "$args2" "$args3" ;;
5) set -- "$args0" "$args1" "$args2" "$args3" "$args4" ;;
6) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" ;;
7) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" ;;
8) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" ;;
9) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" "$args8" ;;
esac
fi

# Escape application args
save () {
for i do printf %s\\n "$i" | sed "s/'/'\\\\''/g;1s/^/'/;\$s/\$/' \\\\/" ; done
echo " "
}
APP_ARGS=`save "$@"`

# Collect all arguments for the java command, following the shell quoting and substitution rules
eval set -- $DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS"

exec "$JAVACMD" "$@"
89 changes: 89 additions & 0 deletions code/chap02/scala/gradlew.bat
@@ -0,0 +1,89 @@
@rem
@rem Copyright 2015 the original author or authors.
@rem
@rem Licensed under the Apache License, Version 2.0 (the "License");
@rem you may not use this file except in compliance with the License.
@rem You may obtain a copy of the License at
@rem
@rem https://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
@rem

@if "%DEBUG%" == "" @echo off
@rem ##########################################################################
@rem
@rem Gradle startup script for Windows
@rem
@rem ##########################################################################

@rem Set local scope for the variables with windows NT shell
if "%OS%"=="Windows_NT" setlocal

set DIRNAME=%~dp0
if "%DIRNAME%" == "" set DIRNAME=.
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%

@rem Resolve any "." and ".." in APP_HOME to make it shorter.
for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi

@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
set DEFAULT_JVM_OPTS="-Xmx64m" "-Xms64m"

@rem Find java.exe
if defined JAVA_HOME goto findJavaFromJavaHome

set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
if "%ERRORLEVEL%" == "0" goto execute

echo.
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.

goto fail

:findJavaFromJavaHome
set JAVA_HOME=%JAVA_HOME:"=%
set JAVA_EXE=%JAVA_HOME%/bin/java.exe

if exist "%JAVA_EXE%" goto execute

echo.
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.

goto fail

:execute
@rem Setup the command line

set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar


@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %*

:end
@rem End local scope for the variables with windows NT shell
if "%ERRORLEVEL%"=="0" goto mainEnd

:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
if not "" == "%GRADLE_EXIT_CONSOLE%" exit 1
exit /b 1

:mainEnd
if "%OS%"=="Windows_NT" endlocal

:omega
1 change: 1 addition & 0 deletions code/chap02/scala/settings.gradle
@@ -0,0 +1 @@
rootProject.name = 'data-algos-with-spark-ch02'
12 changes: 12 additions & 0 deletions code/chap02/scala/src/main/resources/input.txt
@@ -0,0 +1,12 @@
>seq1
cGTAaccaataaaaaaacaagcttaacctaattc
>seq2
agcttagTTTGGatctggccgggg
>seq3
gcggatttactcCCCCCAAAAANNaggggagagcccagataaatggagtctgtgcgtccaca
gaattcgcacca
AATAAAACCTCACCCAT
agagcccagaatttactcCCC
>seq4
gcggatttactcaggggagagcccagGGataaatggagtctgtgcgtccaca
gaattcgcacca
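
For checking the job's output on a small sample like the one above, the expected totals can be computed without Spark. The following is a minimal sketch in plain Scala 2.13, assuming the same convention as the job in this commit (each header line counts once under the key "z", and bases are lowercased). The names `FastaBaseCountCheck` and `countBases` are hypothetical and are not part of this pull request.

```scala
object FastaBaseCountCheck {
  // Hypothetical helper: counts bases across FASTA lines without Spark.
  // A header line (">...") contributes one "z"; every other character
  // contributes one count under its lowercased form.
  def countBases(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap { line =>
        if (line.startsWith(">")) Seq("z")
        else line.toLowerCase.toSeq.map(_.toString)
      }
      .groupMapReduce(identity)(_ => 1)(_ + _)
}
```

Calling `FastaBaseCountCheck.countBases(Seq(">seq1", "cGTAacc"))` gives one "z", two "a", three "c", one "g" and one "t". Feeding it the lines of input.txt should give totals that the Spark job is expected to reproduce.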
@@ -0,0 +1,60 @@
package org.data.algorithms.spark.ch02

import org.apache.spark.sql.SparkSession

import scala.sys.exit

/*
 -----------------------------------------------------
 Version-1
 This is a DNA-Base-Count in Spark (Scala).
 The goal is to show how "DNA-Base-Count" works.
 ------------------------------------------------------
 Input Parameters:
    args(0): String, input path
 -------------------------------------------------------
 @author Deepak Kumar
 -------------------------------------------------------
*/
object DNABaseCountVER1 {

  // Emits one (key, 1) pair per base; a header line (">...") emits ("z", 1).
  // A Seq is returned rather than a Map so that repeated bases in a record
  // are all kept (a Map would collapse them to a single entry per base).
  def processFASTARecord(fastaRecord: String): Seq[(String, Int)] =
    if (fastaRecord.startsWith(">"))
      Seq("z" -> 1)
    else
      fastaRecord.toLowerCase.toSeq.map(c => c.toString -> 1)

  def main(args: Array[String]): Unit = {
    // Scala's args does not include the program name, so the input path is args(0).
    if (args.length != 1) {
      println("Usage: DNABaseCountVER1 <input-path>")
      exit(-1)
    }
    // create an instance of SparkSession object
    val spark = SparkSession.builder().appName("DNABaseCountVER1").master("local[*]").getOrCreate()
    println("spark initialised")
    val inputPath = args(0)
    println("inputPath : " + inputPath)
    val recordsRDD = spark.sparkContext.textFile(inputPath)
    println("recordsRDD.count() : " + recordsRDD.count())
    val recordsAsList = recordsRDD.collect()
    println("recordsAsList : " + recordsAsList.mkString(", "))
    // if you do not have enough RAM, then do the following
    // (requires import org.apache.spark.storage.StorageLevel):
    // recordsRDD.persist(StorageLevel.MEMORY_AND_DISK)
    //
    val pairsRDD = recordsRDD.flatMap(processFASTARecord)
    pairsRDD.collect().foreach(println)

    // sum the 1s emitted for each key to get the frequency of each base
    val frequenciesRDD = pairsRDD.reduceByKey((x, y) => x + y)
    println("frequenciesRDD : debug")
    val frequenciesAsList = frequenciesRDD.collect()
    println("frequenciesAsList : " + frequenciesAsList.mkString(", "))
    spark.stop()
  }

}
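
A record such as `aac` contains repeated bases, so the container used for the emitted (base, 1) pairs affects the final counts. The sketch below contrasts a `Map` with a `Seq` in plain Scala 2.13, using `groupMapReduce` as a local stand-in for Spark's `reduceByKey`. The names `PairContainerDemo`, `pairsAsMap`, `pairsAsSeq` and `sumByKey` are hypothetical and exist only for illustration.

```scala
object PairContainerDemo {
  // Pairs collected into a Map: a repeated base overwrites its earlier entry.
  def pairsAsMap(record: String): Map[String, Int] =
    record.toLowerCase.toSeq.map(c => c.toString -> 1).toMap

  // Pairs collected into a Seq: every occurrence is kept.
  def pairsAsSeq(record: String): Seq[(String, Int)] =
    record.toLowerCase.toSeq.map(c => c.toString -> 1)

  // Local stand-in for reduceByKey: sums the values of each key.
  def sumByKey(pairs: Iterable[(String, Int)]): Map[String, Int] =
    pairs.groupMapReduce(_._1)(_._2)(_ + _)
}
```

For the record `aac`, summing the `Map` version yields a count of 1 for "a", while summing the `Seq` version yields 2. With a `Map`, the job would effectively count how many records contain each base rather than how often each base occurs.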