The Distributed File System (DFS) is designed to manage and store data across multiple networked nodes, ensuring high availability, reliability, and scalability. This project simulates a simplified version of a DFS of the kind commonly used in large-scale computing environments to distribute data storage and processing across many servers, enabling efficient data management and access.
Purpose:
The primary purpose of this project is to demonstrate the implementation of a distributed file system that can handle client requests for reading and writing data. The system is designed to distribute data blocks across multiple DataNodes and coordinate these operations through a central NameNode. This setup is inspired by real-world distributed systems like Hadoop's HDFS (Hadoop Distributed File System).
- Distributed File System
- Introduction
- Component Tree
- Goals and Objectives
- Architecture with Diagram
- Getting Started
- Future Work
- License
Distributed File System
├── Client
│ ├── startConnection(String ip, int port)
│ ├── sendMessage(String msg)
│ ├── stopConnection()
│ ├── handleReadCommand(String filename)
│ ├── handleAppendCommand(String filename, String content)
│ └── handleShutdownCommand()
├── NameNode
│ ├── start(int port)
│ ├── initiateShutdown()
│ ├── append(String filename, String content, NameNodeHandlerClient dataNodeClient)
│ ├── read(String filename, NameNodeHandlerClient dataNodeClient)
│ ├── stop()
│ ├── NameNodeHandler
│ │ ├── run()
│ │ ├── shutdown()
│ │ ├── append(String filename, String content, NameNodeHandlerClient dataNodeClient)
│ │ ├── read(String filename, NameNodeHandlerClient dataNodeClient)
│ │ └── sendResponse(String message)
│ └── NameNodeHandlerClient
│ ├── startConnection(String ip, int port)
│ ├── sendMessage(String msg)
│ └── stopConnection()
├── DataNode
│ ├── start(int port)
│ ├── alloc()
│ ├── read(int blk_id)
│ ├── write(int blk_id, String contents)
│ ├── stop()
│ ├── DataNodeHandler
│ │ ├── run()
│ │ ├── shutdown()
│ │ └── handleCommand(String command)
│ └── Helper Methods
│ ├── isFull()
│ ├── numEmptyBlks()
│ ├── print()
│ ├── mockRun()
│ └── parseCmdLine(String[] args)
├── Block
│ ├── getFilename()
│ ├── getRLock()
│ └── getWLock()
├── Central
│ └── main(String[] args)
├── Pair
│ ├── getDataNode()
│ ├── setDataNode(String DNum)
│ ├── getBlockNode()
│ └── setBlockNode(int BNum)
└── StartDataNodes
  ├── main(String[] args)
  └── startNode(int port, String logFile)
Main Goals:
- Develop a Functional Distributed File System: Implement a DFS that can distribute file storage across multiple DataNodes and coordinate these operations through a NameNode.
- Ensure Data Availability and Reliability: Distribute data blocks across multiple nodes to ensure data redundancy and reliability.
- Handle Concurrent Client Requests: Implement mechanisms to manage multiple client requests simultaneously, ensuring thread safety and data consistency.
- Implement Robust Error Handling: Ensure the system can gracefully handle errors and maintain data integrity.
Specific Objectives:
- NameNode Implementation:
  - Manage metadata about file locations and data blocks.
  - Coordinate read and write operations between clients and DataNodes.
  - Handle client connections and process requests.
- DataNode Implementation:
  - Store and manage data blocks.
  - Handle read and write operations as directed by the NameNode.
  - Ensure data consistency and integrity.
- Client Interface:
  - Provide a user-friendly interface for interacting with the DFS.
  - Allow users to perform read, write, and shutdown operations.
  - Handle network communication with the NameNode.
- Networking and Concurrency:
  - Implement socket programming for communication between nodes and clients.
  - Ensure thread safety and manage concurrent client requests using multithreading.
  - Implement synchronization mechanisms to prevent race conditions and data inconsistencies (a per-block locking sketch follows this list).
- Robust Error Handling and Graceful Shutdown:
  - Implement error handling to manage network failures, data inconsistencies, and other potential issues.
  - Implement a graceful shutdown mechanism to safely terminate the system, ensuring all operations are completed and resources are released.
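As a rough illustration of the synchronization objective, the sketch below shows per-block read/write locking in the spirit of the Block class's getRLock()/getWLock() methods; the structure and field names are assumptions, not the project's exact code.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: a block guarded by a read/write lock so that many
// readers can proceed concurrently while writers get exclusive access.
public class BlockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final String filename;
    private String contents = "";

    public BlockSketch(String filename) {
        this.filename = filename;
    }

    public String getFilename() { return filename; }
    public Lock getRLock() { return lock.readLock(); }
    public Lock getWLock() { return lock.writeLock(); }

    // Shared read: multiple threads may hold the read lock at the same time.
    public String read() {
        getRLock().lock();
        try {
            return contents;
        } finally {
            getRLock().unlock();
        }
    }

    // Exclusive write: blocks until no readers or writers hold the lock.
    public void append(String more) {
        getWLock().lock();
        try {
            contents += more;
        } finally {
            getWLock().unlock();
        }
    }
}
```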
Overview:
The Distributed File System is designed with a client-server architecture consisting of three main components: the NameNode, DataNodes, and Client. The system manages file storage and retrieval across multiple networked nodes, ensuring high availability, reliability, and scalability.
Architecture Diagram:
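A simplified text view of the components and data flow described below:

```
+--------+   read / append / shutdown   +----------+
| Client | ---------------------------> | NameNode |
+--------+ <--------------------------- +----------+
                    responses              |  |  |
                                 block read / write requests
                                           |  |  |
                          +----------+ +----------+ +----------+
                          | DataNode | | DataNode | | DataNode |
                          +----------+ +----------+ +----------+
```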
Explanation:
- Client:
  - The Client is the interface through which users interact with the DFS. It sends requests to the NameNode to read or append data to files.
- NameNode:
  - The NameNode acts as the central coordinator. It manages metadata about file locations and coordinates communication with DataNodes to handle client requests.
- DataNodes:
  - DataNodes are storage nodes responsible for storing actual data blocks. They handle read and write operations as directed by the NameNode.
Workflow:
- Client Request:
  - The Client sends a request to the NameNode (e.g., read, append, or shutdown).
- Request Handling:
  - The NameNode processes the request, updates metadata, and coordinates with DataNodes for data operations.
- Data Operations:
  - DataNodes perform the necessary data storage or retrieval operations and communicate the results back to the NameNode.
- Response:
  - The NameNode sends a response back to the Client, completing the operation.
1. NameNode
Responsibilities:
- Manage metadata about file locations and data blocks.
- Coordinate read and write operations between clients and DataNodes.
- Handle client connections and process requests.
Components:
- ServerSocket: Listens for incoming client connections.
- Handler Threads: Manages individual client requests in separate threads.
- Metadata Storage: Stores information about file locations and associated data blocks.
Key Methods:
- start(int port): Starts the NameNode server on the specified port.
- initiateShutdown(): Initiates the shutdown process.
- append(String filename, String content, NameNodeHandlerClient dataNodeClient): Appends content to a file.
- read(String filename, NameNodeHandlerClient dataNodeClient): Reads content from a file.
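As a rough illustration of the server loop (not the project's exact implementation; the class and helper names here are hypothetical), a NameNode-style service can accept client connections on a ServerSocket and hand each one to a handler thread:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch: accept client connections and serve each one on its own thread.
public class NameNodeSketch {
    private ServerSocket serverSocket;
    private volatile boolean running = true;

    public void start(int port) throws IOException {
        serverSocket = new ServerSocket(port);
        while (running) {
            Socket client = serverSocket.accept();      // block until a client connects
            new Thread(() -> handle(client)).start();   // one handler thread per connection
        }
    }

    private void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String request = in.readLine();              // e.g. "::read file.txt"
            out.println("ACK: " + request);              // placeholder; a real handler parses and dispatches
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void stop() throws IOException {
        running = false;
        serverSocket.close();                            // unblocks accept() with an exception
    }
}
```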
2. DataNode
Responsibilities:
- Store and manage data blocks.
- Handle read and write operations as directed by the NameNode.
- Ensure data consistency and integrity.
Components:
- ServerSocket: Listens for incoming connections from the NameNode.
- Handler Threads: Manages individual requests from the NameNode in separate threads.
- Block Storage: Stores data blocks in files.
Key Methods:
- alloc(): Allocates a new data block.
- read(int blk_id): Reads data from a specified block.
- write(int blk_id, String contents): Writes data to a specified block.
- start(int port): Starts the DataNode server on the specified port.
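For intuition only, here is a minimal in-memory sketch of the alloc/read/write operations; the real DataNode persists blocks to files and serves them through its handler thread, so treat the details below as assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory block store; the actual DataNode writes blocks to files.
public class BlockStoreSketch {
    private final List<StringBuilder> blocks = new ArrayList<>();

    // Allocate a new empty block and return its id.
    public synchronized int alloc() {
        blocks.add(new StringBuilder());
        return blocks.size() - 1;
    }

    // Return the current contents of a block.
    public synchronized String read(int blkId) {
        return blocks.get(blkId).toString();
    }

    // Append contents to an existing block.
    public synchronized void write(int blkId, String contents) {
        blocks.get(blkId).append(contents);
    }
}
```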
3. Client
Responsibilities:
- Provide a user-friendly interface for interacting with the DFS.
- Allow users to perform read, write, and shutdown operations.
- Handle network communication with the NameNode.
Components:
- Socket: Establishes a connection to the NameNode.
- Input/Output Streams: Sends requests to and receives responses from the NameNode.
Key Methods:
- startConnection(String ip, int port): Starts a connection to the NameNode.
- sendMessage(String msg): Sends a message to the NameNode and returns the response.
- stopConnection(): Closes the connection to the NameNode.
- handleReadCommand(String filename): Handles read operations.
- handleAppendCommand(String filename, String content): Handles append operations.
- handleShutdownCommand(): Handles shutdown operations.
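The connection helpers can be pictured roughly as follows (a sketch built on standard java.net.Socket usage, not a copy of the project's Client class):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Hypothetical sketch of the Client's connection helpers.
public class ClientSketch {
    private Socket socket;
    private PrintWriter out;
    private BufferedReader in;

    public void startConnection(String ip, int port) throws IOException {
        socket = new Socket(ip, port);
        out = new PrintWriter(socket.getOutputStream(), true);
        in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
    }

    // Send one request line (e.g. "::read file.txt") and return the NameNode's reply.
    public String sendMessage(String msg) throws IOException {
        out.println(msg);
        return in.readLine();
    }

    public void stopConnection() throws IOException {
        in.close();
        out.close();
        socket.close();
    }
}
```

A typical session calls startConnection, issues one or more commands with sendMessage, and ends with stopConnection.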
Relationships:
- Client to NameNode:
  - The Client sends read, append, and shutdown requests to the NameNode.
- NameNode to DataNodes:
  - The NameNode coordinates with DataNodes to perform data storage and retrieval operations.
- DataNodes:
  - DataNodes are independent of each other but work collectively to store data blocks distributed by the NameNode.
Software Requirements:
- Java Development Kit (JDK):
  - Version: JDK 8 or higher
  - Download: Oracle JDK or OpenJDK
- Integrated Development Environment (IDE):
  - Recommended: IntelliJ IDEA, Eclipse, or Visual Studio Code
- Build Tools:
  - Apache Maven (optional, if you prefer using Maven for dependency management and build automation)
  - Download: Apache Maven
- Version Control System (optional):
  - Git
  - Download: Git
Hardware Requirements:
- Processor: Intel i5 or equivalent
- RAM: 4 GB minimum (8 GB recommended for smoother performance)
- Disk Space: 500 MB for project files and dependencies
Other Requirements:
- Network Connection: Required for downloading dependencies and for testing network communication between nodes
Step 1: Install JDK
- Download and install the JDK from the Oracle or OpenJDK website.
- Set the JAVA_HOME environment variable to point to the JDK installation directory.
- Add the JDK bin directory to your system's PATH variable.
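For example, on Linux or macOS (the JDK path below is only illustrative; point it at your actual installation):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export PATH="$JAVA_HOME/bin:$PATH"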
Step 2: Install an IDE
- Download and install your preferred IDE (IntelliJ IDEA, Eclipse, or Visual Studio Code).
- Configure the IDE to use the installed JDK.
Step 3: Set Up the Project Directory
- Create a new directory for the project.
  mkdir DistributedFileSystem
  cd DistributedFileSystem
- Initialize a Git repository (optional).
  git init
Step 4: Download Project Files
- Clone the project repository from GitHub (if available).
  git clone <repository-url>
- If not using Git, download the project files and place them in the project directory.
Step 5: Compile the Project
- Open a terminal or command prompt.
- Navigate to the project directory.
- Compile the Java files.
javac -d bin src/*.java
Step 1: Start the NameNode
- Open a terminal or command prompt.
- Navigate to the project directory.
- Run the NameNode.
java -cp bin NameNode
Step 2: Start DataNodes
- Open additional terminal windows for each DataNode.
- Navigate to the project directory in each terminal.
- Run the DataNode instances with different ports.
java -cp bin DataNode 65530
java -cp bin DataNode 65531
java -cp bin DataNode 65532
Step 3: Run the Client
- Open another terminal window.
- Navigate to the project directory.
- Run the Client and follow the prompts to perform operations.
java -cp bin Client
Example Client Commands:
- Append Data to a File:
  ::append file.txt Hi man, How are you?
- Read Data from a File:
  ::read file.txt
- Shutdown the System:
  ::shutdown
- Data Replication for Fault Tolerance:
  - Replicate Data Across Multiple DataNodes: Implement a replication mechanism where each data block is stored on multiple DataNodes to ensure redundancy and fault tolerance. In case of a DataNode failure, the system can retrieve data from a replica on another node.
  - Configurable Replication Factor: Allow the system administrator to configure the replication factor, enabling flexibility in determining how many copies of each block should be stored for redundancy.
- NameNode Clustering for High Availability:
  - Primary and Standby NameNodes: Develop a cluster of NameNodes with one primary and multiple standby nodes. The standby nodes can take over automatically if the primary fails, ensuring continuous operation of the file system.
  - Metadata Replication: Synchronize metadata (block locations, file information) across all NameNodes in the cluster to ensure a seamless failover process and avoid data loss.
- Leader Election for NameNode Failover:
  - ZooKeeper-Based Coordination: Use ZooKeeper or a similar coordination service to handle leader election among the NameNodes. This ensures that when the primary NameNode fails, a new leader is elected automatically, minimizing downtime.
- Data Rebalancing Across DataNodes:
  - Automatic Data Rebalancing: Implement a rebalancing mechanism to redistribute data blocks across DataNodes when new nodes are added or removed. This will optimize storage utilization and improve performance by evenly distributing the load.
Copyright © 2024 Umar Mohammad.
