This project implements Dijkstra’s shortest path algorithm using Apache Spark RDDs for distributed graph processing. It was developed as part of a cloud computing assignment focused on distributed systems and big data frameworks.
The goal is to compute the shortest distance from a source node to all other nodes in a large graph using parallel data processing techniques.
- Computes shortest distances from a source node to all other nodes.
- Marks unreachable nodes as `"INF"`.
- Utilizes Spark RDD transformations (`map`, `reduce`, `filter`, etc.).
- Designed to handle large graphs with 10,000 nodes and 100,000 edges.
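For context, below is a minimal, hedged sketch of the iterative edge-relaxation loop commonly used to express Dijkstra-style shortest paths on RDDs (effectively a Bellman-Ford formulation, since RDDs offer no global priority queue). All names are illustrative and not taken from `dijkstra_spark.py`:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dijkstra-sketch")

# Edges as (u, (v, w)) pairs; this is the sample graph from the
# input-format example further below.
edges = sc.parallelize([(0, (1, 7)), (0, (2, 3)), (1, (3, 9)),
                        (2, (4, 4)), (3, (4, 6)), (1, (4, 2))]).cache()
INF = float("inf")
source = 0

nodes = edges.flatMap(lambda e: [e[0], e[1][0]]).distinct().cache()
dist = nodes.map(lambda n: (n, 0 if n == source else INF))

# Relax every edge once per round; |V| - 1 rounds guarantee convergence.
for _ in range(nodes.count() - 1):
    candidates = dist.join(edges).map(
        lambda x: (x[1][1][0], x[1][0] + x[1][1][1]))  # (v, dist[u] + w)
    dist = dist.union(candidates).reduceByKey(min)     # keep the per-node minimum

print(sorted(dist.collect()))  # [(0, 0), (1, 7), (2, 3), (3, 16), (4, 7)]
```

Each round joins the current distances with the edge list, proposes `dist[u] + w` along every edge, and keeps the per-node minimum with `reduceByKey` — the `map`/`join`/`reduce`-style transformations listed above.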
- Apache Spark 3.4.4
- Python 3.10
- macOS (Apple M3 chip, 8-core CPU, 8 GB RAM)
- Visual Studio Code
| File | Description |
|---|---|
| `dijkstra_spark.py` | Main Python script implementing Dijkstra’s algorithm with Spark RDDs. |
| `weighted_graph.txt` | Input graph file containing edges and weights. |
| `README.md` | Documentation and project instructions. |
- Install dependencies: `pip install pyspark`
- Set required environment variables: `export PYTHONHASHSEED=0`
- Run the program: `python3 dijkstra_spark.py`
⚡ Tip: If you use a virtual environment (e.g., `spark_env`), make sure it is activated before running the commands above.
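If exporting the variable in the shell is inconvenient, Spark's documented `spark.executorEnv.*` configuration pattern can propagate it to executors instead — a sketch, not the script's actual setup:

```python
# Sketch: pass PYTHONHASHSEED to executors via SparkConf instead of the shell.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dijkstra_spark")
        .set("spark.executorEnv.PYTHONHASHSEED", "0"))
sc = SparkContext(conf=conf)
```

Note that the driver's own interpreter reads `PYTHONHASHSEED` at startup, so the shell export remains the simplest way to make hashing deterministic everywhere.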
The input graph must be provided in edge list format:
```
num_nodes num_edges
u1 v1 weight1
u2 v2 weight2
...
```
Example:

```
5 6
0 1 7
0 2 3
1 3 9
2 4 4
3 4 6
1 4 2
```
Where:
- The first line gives the number of nodes and the number of edges.
- Each subsequent line describes a directed edge from node `u` to node `v` with the given `weight`.
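For illustration, here is one way such a file could be parsed into a pair RDD of `(u, (v, weight))` tuples; the variable names are assumptions, not the actual code in `dijkstra_spark.py`:

```python
# Sketch: parse the edge-list file into a pair RDD (names are illustrative).
lines = sc.textFile("weighted_graph.txt")
header = lines.first()                          # "num_nodes num_edges"
num_nodes, num_edges = map(int, header.split())

edges = (lines
         .filter(lambda line: line != header)   # drop the header row
         .map(str.split)
         .map(lambda t: (int(t[0]), (int(t[1]), int(t[2])))))  # (u, (v, w))
```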
The output displays the shortest distance from the source node to every other node.
Example Output:

```
Shortest distances from node 0:
Node 0: 0
Node 1: 7
Node 2: 3
Node 3: 16
Node 4: 7
```
- If a node is unreachable from the source, its distance is reported as `"INF"`.
- Recommended number of RDD partitions: 16 (for best performance on an 8-core CPU); see the sketch after this list.
- Execution on large graphs may be slow locally because of limited RAM and CPU parallelism.
- Installing `psutil` (optional) can improve Spark shuffle performance: `pip install psutil`
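As a sketch of the partitioning note above (same illustrative names as earlier):

```python
# Sketch: read the input with at least 16 partitions, per the recommendation above.
lines = sc.textFile("weighted_graph.txt", minPartitions=16)
print(lines.getNumPartitions())   # minPartitions is a lower bound, not an exact count

# For an already-keyed pair RDD, hash-partitioning achieves the same goal:
edges_16 = edges.partitionBy(16).cache()
```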