Download, Install, and Run PySpark

1. For macOS Users: Enable "Remote Login"

  System Preferences --> Sharing --> enable "Remote Login" service

2. Make Sure Java is Installed Properly

Spark 3.3 runs on Java 8, 11, or 17; the output below shows Java 8:

java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
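
If you prefer to script this check, here is a minimal sketch in Python (note that java -version writes its output to stderr, not stdout):

    # confirm the java launcher is on PATH and show its version
    # ("java -version" prints to stderr, so we read stderr, not stdout)
    import subprocess

    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr)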

3. Download

Download the latest binary Spark from the following URL:

https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
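
If you would rather script the download, here is a minimal sketch using only the standard library; the URL below points at archive.apache.org, Apache's long-term archive for this release (the closer.lua link above is a mirror selector):

    # download the Spark 3.3.0 tarball into the current directory
    import urllib.request

    url = ("https://archive.apache.org/dist/spark/"
           "spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz")
    urllib.request.urlretrieve(url, "spark-3.3.0-bin-hadoop3.tgz")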

4. Extract the Downloaded File

Assuming that I have downloaded the file to /home/mparsian/spark-3.3.0-bin-hadoop3.tgz:

cd /home/mparsian

tar zxvf spark-3.3.0-bin-hadoop3.tgz
x spark-3.3.0-bin-hadoop3/
x spark-3.3.0-bin-hadoop3/NOTICE
x spark-3.3.0-bin-hadoop3/CHANGES.txt
...
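
The same extraction can be done from Python if tar is not handy; a sketch using the standard tarfile module:

    # equivalent of tar zxvf: extract the gzipped tarball into /home/mparsian
    import tarfile

    with tarfile.open("/home/mparsian/spark-3.3.0-bin-hadoop3.tgz", "r:gz") as tgz:
        tgz.extractall("/home/mparsian")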

5. Start the Spark Cluster

cd /home/mparsian/spark-3.3.0-bin-hadoop3/

./sbin/start-all.sh

NOTE: If you are going to run Spark locally on your own PC/MacBook/Windows machine,
then you do NOT need to start a cluster at all. When you invoke
./bin/pyspark, your laptop acts as the cluster.
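
For that local mode, you can also create the session yourself from a standalone Python script instead of the pyspark shell; a minimal sketch, assuming the pyspark package is importable (e.g., installed with pip install pyspark):

    # run Spark locally: local[*] uses all cores of this machine as the "cluster"
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-test")
             .getOrCreate())
    sc = spark.sparkContext
    print(sc.parallelize([1, 2, 3, 4]).sum())   # prints 10
    spark.stop()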

6. Check Master and Worker

Make sure that the Master and Worker processes are running (jps ships with the JDK):

jps
1347 Master
1390 Worker

7. Check the Spark Master Web UI

http://localhost:8080
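
To confirm the master UI is serving without opening a browser, a quick probe (assuming the default port 8080):

    # probe the Spark master web UI; HTTP 200 means it is up
    import urllib.request

    with urllib.request.urlopen("http://localhost:8080") as resp:
        print(resp.status)   # 200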

8. Define Two Very Basic Python Programs

Both scripts can be sanity-checked outside Spark; see the sketch right after the list.

  • Python program: /home/mparsian/spark-3.3.0-bin-hadoop3/test.py

      #!/usr/bin/env python3
      # read lines from stdin and greet each one; note that each line keeps
      # its trailing newline (see the note under step 9)
      import sys

      for line in sys.stdin:
          print("hello " + line)
    
  • Python program: /home/mparsian/spark-3.3.0-bin-hadoop3/test2.py

      #!/usr/bin/env python3
      # append " zaza" to the given string
      def fun2(s):
          return s + " zaza"
    
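Before wiring these into Spark, both scripts can be sanity-checked from plain Python, run from the Spark directory so test2.py is importable (and assuming python3 is on PATH). The subprocess call below mimics what rdd.pipe() will do to test.py:

    # check test2.fun2 directly
    import test2
    print(test2.fun2("john"))   # john zaza

    # feed test.py two lines on stdin, the way rdd.pipe() will
    import subprocess
    out = subprocess.run(["python3", "test.py"],
                         input="john\npaul\n",
                         capture_output=True, text=True)
    print(out.stdout)           # hello john / hello paul (with blank lines between)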

9. Start and Run pyspark

	cd /home/mparsian/spark-3.3.0-bin-hadoop3/
	./bin/pyspark
			...
	...
	Welcome to
		  ____              __
		 / __/__  ___ _____/ /__
		_\ \/ _ \/ _ `/ __/  '_/
	   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
		  /_/

	>>> data = ["john","paul","george","ringo"]
	>>> data
	['john', 'paul', 'george', 'ringo']

	>>> rdd = sc.parallelize(data)
	>>> rdd.collect()
	['john', 'paul', 'george', 'ringo']


	>>> # test.py is run by pipe() as an external command, so it is not imported
	>>> # here (it must be executable: chmod +x test.py); test2 is imported
	>>> # because fun2() is called from Python
	>>> test = "/home/mparsian/spark-3.3.0-bin-hadoop3/test.py"
	>>> import test2


	>>> pipeRDD = rdd.pipe(test)
	>>> pipeRDD.collect()
	['hello john', '', 'hello paul', '', 'hello george', '', 'hello ringo', '']
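
The empty strings appear because each input line still carries its trailing newline and print() adds one more, so test.py emits a blank line after every greeting; pipe() returns every output line as an element. They can be filtered out (or avoided by calling line.rstrip() in test.py before printing):

	>>> pipeRDD.filter(lambda s: s != "").collect()
	['hello john', 'hello paul', 'hello george', 'hello ringo']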


	>>> rdd.collect()
	['john', 'paul', 'george', 'ringo']


	>>> rdd2 = rdd.map(lambda x: test2.fun2(x))
	>>> rdd2.collect()
	['john zaza', 'paul zaza', 'george zaza', 'ringo zaza']
	>>>
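
As a final note, the lambda in the last step is not required: map() accepts the function object directly:

	>>> rdd.map(test2.fun2).collect()
	['john zaza', 'paul zaza', 'george zaza', 'ringo zaza']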