
Conversation


@GizzZmo GizzZmo commented Oct 19, 2025

This pull request adds comprehensive top-level documentation to the repository, making it easier for both users and developers to find, understand, and navigate Spark's documentation and architecture. The main changes include the addition of an in-depth architecture overview, a complete documentation index, and improvements to the README.md to point to these new resources and module-level documentation.

Major documentation enhancements:

Documentation structure and discoverability:

  • Added DOCUMENTATION_INDEX.md, providing a thorough, organized index of all documentation in the repository, including links and descriptions for user, developer, administrator, and API docs.
  • Updated README.md to include direct links to the new architecture overview, development guide, contribution guide, and all major module and tool documentation, improving onboarding and navigation for new users.

Architecture overview:

  • Added a new ARCHITECTURE.md file with a detailed description of Spark’s architecture, including core components, execution model, key subsystems (memory, shuffle, storage, serialization), module structure, performance tuning, and contribution guidelines.

These changes significantly improve the accessibility and clarity of Spark's documentation, making it easier for contributors and users to understand the project structure and find the resources they need.

dependabot bot and others added 8 commits October 19, 2025 15:31
Bumps the pip group with 1 update in the /dev directory: [black](https://github.com/psf/black).


Updates `black` from 23.12.1 to 24.3.0
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@23.12.1...24.3.0)

---
updated-dependencies:
- dependency-name: black
  dependency-version: 24.3.0
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Bump black from 23.12.1 to 24.3.0 in /dev in the pip group across 1 directory
Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com>
…n guide

Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com>
…ation

Add comprehensive documentation for Spark architecture, modules, and tools
@GizzZmo
Author

GizzZmo commented Oct 19, 2025

This pull request significantly improves the documentation structure and discoverability for the Apache Spark project. It introduces a comprehensive architecture overview, a detailed documentation index, and enhances the main README.md to help users and contributors quickly find relevant guides and references.

Documentation improvements:

  • Added a new ARCHITECTURE.md file that provides an in-depth overview of Spark's architecture, core components, execution model, key subsystems, data flow, module structure, and performance tuning tips.
  • Introduced DOCUMENTATION_INDEX.md, a complete index of all documentation in the repository, organized by topic, audience, and module, with links to official guides, module READMEs, and external resources.
  • Updated the README.md to reference the new architecture and documentation index files, and to provide direct links to module-level and tool documentation for easier navigation.

These changes make it much easier for both new users and contributors to understand Spark's architecture and to find the documentation they need.


@Copilot Copilot AI left a comment


Pull Request Overview

Adds comprehensive, top-level documentation and module READMEs to improve discoverability and onboarding, plus a docs index and architecture overview.

  • New ARCHITECTURE.md and DOCUMENTATION_INDEX.md
  • Adds/updates module-level READMEs (core, common, streaming, mllib, graphx, launcher, examples, resource-managers, sbin)
  • Root README now links to the new docs; dev requirements update Black to 24.3.0

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.

Summary per file:

| File | Description |
|------|-------------|
| `streaming/README.md` | Adds an in-depth guide for DStreams and Structured Streaming with examples, build/test, and migration guidance |
| `sbin/README.md` | Documents standalone admin scripts, HA, security, and troubleshooting |
| `resource-managers/README.md` | Describes YARN, Kubernetes, Mesos integrations, configs, and comparisons |
| `mllib/README.md` | Covers MLlib APIs, pipelines, algorithms, and examples |
| `launcher/README.md` | Documents programmatic launching APIs and examples |
| `graphx/README.md` | Provides GraphX concepts, APIs, algorithms, and examples |
| `examples/README.md` | Explains running and organizing example apps across languages |
| `dev/requirements.txt` | Bumps Black formatter to 24.3.0 |
| `core/README.md` | Documents Spark Core architecture, APIs, and internals |
| `common/README.md` | Documents common modules (network, kvstore, sketch, etc.) |
| `README.md` | Adds pointers to architecture, development, and module docs |
| `DOCUMENTATION_INDEX.md` | Introduces a comprehensive index of repository and official docs |
| `DEVELOPMENT.md` | Adds developer setup, build, testing, debugging, and contribution guidance |
| `CODE_DOCUMENTATION_GUIDE.md` | Establishes code documentation standards across languages |
| `ARCHITECTURE.md` | Adds a high-level architecture overview and module structure |


## Configuration

Key configuration parameters:


Copilot AI Oct 19, 2025


`spark.streaming.checkpoint.interval` is not a valid Spark configuration. The checkpoint interval for DStreams is set per DStream in code via `dstream.checkpoint(Seconds(n))`. Please remove this property from the configuration block and, if desired, add a code example showing `dstream.checkpoint(Seconds(10))`.
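For reference, a minimal sketch of per-DStream checkpointing (the app name, master, checkpoint directory, and socket source below are illustrative, not part of this PR):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Checkpointing is enabled per DStream in code, not via a
// spark.streaming.* configuration property.
val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/spark-checkpoints")            // directory for checkpoint data
val lines = ssc.socketTextStream("localhost", 9999) // placeholder input stream
lines.checkpoint(Seconds(10))                       // per-DStream checkpoint interval
ssc.start()
ssc.awaitTermination()
```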


Comment on lines +491 to +502
| Script | Purpose |
|--------|---------|
| `start-all.sh` | Start master and all workers |
| `stop-all.sh` | Stop master and all workers |
| `start-master.sh` | Start master on current machine |
| `stop-master.sh` | Stop master |
| `start-worker.sh` | Start worker on current machine |
| `stop-worker.sh` | Stop worker |
| `start-workers.sh` | Start workers on all machines in `conf/workers` |
| `stop-workers.sh` | Stop all workers |
| `start-history-server.sh` | Start history server |
| `stop-history-server.sh` | Stop history server |

Copilot AI Oct 19, 2025


The markdown table is malformed due to double leading pipes and an incorrect separator row, which will render improperly. Replace with a standard markdown table format as shown below.


Comment on lines +491 to +492
| Script | Purpose |
|--------|---------|

Copilot AI Oct 19, 2025


Suggested fix for the table header and separator:


Comment on lines +491 to +502

Copilot AI Oct 19, 2025


Replace the above with:


Comment on lines +491 to +502

Copilot AI Oct 19, 2025


Corrected table formatting.



### EdgeTriplet

Represents a edge with its source and destination vertex properties.

Copilot AI Oct 19, 2025


Fix the grammatical error: 'a edge' should be 'an edge'.

Suggested change
Represents a edge with its source and destination vertex properties.
Represents an edge with its source and destination vertex properties.



MLlib provides distributed linear algebra through Breeze.

**Location**: `src/main/scala/org/apache/spark/mllib/linalg/`

Copilot AI Oct 19, 2025


This section mixes package paths: the stated location points to `mllib/linalg` (RDD-based), while the example imports `org.apache.spark.ml.linalg` (DataFrame-based, located under `mllib-local`). Please clarify by either updating the location to `mllib-local/src/main/scala/org/apache/spark/ml/linalg` for `ml.linalg` types, or changing the example to import `org.apache.spark.mllib.linalg` if you intend to reference the RDD-based API.

Suggested change
**Location**: `src/main/scala/org/apache/spark/mllib/linalg/`
**Location**: `mllib-local/src/main/scala/org/apache/spark/ml/linalg/`
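To illustrate the distinction, a small sketch contrasting the two packages (the import renames and values are arbitrary):

```scala
// DataFrame-based API (spark.ml), implemented under mllib-local:
import org.apache.spark.ml.linalg.{Vectors => MlVectors}
// RDD-based API (spark.mllib):
import org.apache.spark.mllib.linalg.{Vectors => MllibVectors}

val mlVec    = MlVectors.dense(1.0, 2.0, 3.0)    // org.apache.spark.ml.linalg.Vector
val mllibVec = MllibVectors.dense(1.0, 2.0, 3.0) // org.apache.spark.mllib.linalg.Vector
```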



```bash
# Enable remote debugging
export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
```

Copilot AI Oct 19, 2025


For spark-shell and spark-submit, SPARK_SUBMIT_OPTS is the standard way to pass JVM options; SPARK_JAVA_OPTS may not be respected. Recommend replacing with SPARK_SUBMIT_OPTS, e.g.: export SPARK_SUBMIT_OPTS='-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005'.

Suggested change
export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"


@HyukjinKwon
Member

I think this would need an SPIP (https://spark.apache.org/improvement-proposals.html).
