
Conversation


@GizzZmo GizzZmo commented Oct 19, 2025

This pull request adds comprehensive top-level documentation to the repository, making it easier for both users and developers to find, understand, and navigate Spark's documentation and architecture. The main changes include the addition of an in-depth architecture overview, a complete documentation index, and improvements to the README.md to point to these new resources and module-level documentation.

Major documentation enhancements:

Documentation structure and discoverability:

  • Added DOCUMENTATION_INDEX.md, providing a thorough, organized index of all documentation in the repository, including links and descriptions for user, developer, administrator, and API docs.
  • Updated README.md to include direct links to the new architecture overview, development guide, contribution guide, and all major module and tool documentation, improving onboarding and navigation for new users.

Architecture overview:

  • Added a new ARCHITECTURE.md file with a detailed description of Spark’s architecture, including core components, execution model, key subsystems (memory, shuffle, storage, serialization), module structure, performance tuning, and contribution guidelines.

These changes significantly improve the accessibility and clarity of Spark's documentation, making it easier for contributors and users to understand the project structure and find the resources they need.

dependabot bot and others added 8 commits October 19, 2025 15:31
Bumps the pip group with 1 update in the /dev directory: [black](https://github.com/psf/black).


Updates `black` from 23.12.1 to 24.3.0
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@23.12.1...24.3.0)

---
updated-dependencies:
- dependency-name: black
  dependency-version: 24.3.0
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Bump black from 23.12.1 to 24.3.0 in /dev in the pip group across 1 directory
Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com>
…n guide

Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com>
…ation

Add comprehensive documentation for Spark architecture, modules, and tools
@GizzZmo
Author

GizzZmo commented Oct 19, 2025

This pull request significantly improves the documentation structure and discoverability for the Apache Spark project. It introduces a comprehensive architecture overview, a detailed documentation index, and enhances the main README.md to help users and contributors quickly find relevant guides and references.

Documentation improvements:

  • Added a new ARCHITECTURE.md file that provides an in-depth overview of Spark's architecture, core components, execution model, key subsystems, data flow, module structure, and performance tuning tips.
  • Introduced DOCUMENTATION_INDEX.md, a complete index of all documentation in the repository, organized by topic, audience, and module, with links to official guides, module READMEs, and external resources.
  • Updated the README.md to reference the new architecture and documentation index files, and to provide direct links to module-level and tool documentation for easier navigation.

These changes make it much easier for both new users and contributors to understand Spark's architecture and to find the documentation they need.


@Copilot Copilot AI left a comment


Pull Request Overview

Adds comprehensive, top-level documentation and module READMEs to improve discoverability and onboarding, plus a docs index and architecture overview.

  • New ARCHITECTURE.md and DOCUMENTATION_INDEX.md
  • Adds/updates module-level READMEs (core, common, streaming, mllib, graphx, launcher, examples, resource-managers, sbin)
  • Root README now links to the new docs; dev requirements update Black to 24.3.0

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.

Summary per file:

| File | Description |
|------|-------------|
| `streaming/README.md` | Adds an in-depth guide for DStreams and Structured Streaming with examples, build/test, and migration guidance |
| `sbin/README.md` | Documents standalone admin scripts, HA, security, and troubleshooting |
| `resource-managers/README.md` | Describes YARN, Kubernetes, Mesos integrations, configs, and comparisons |
| `mllib/README.md` | Covers MLlib APIs, pipelines, algorithms, and examples |
| `launcher/README.md` | Documents programmatic launching APIs and examples |
| `graphx/README.md` | Provides GraphX concepts, APIs, algorithms, and examples |
| `examples/README.md` | Explains running and organizing example apps across languages |
| `dev/requirements.txt` | Bumps Black formatter to 24.3.0 |
| `core/README.md` | Documents Spark Core architecture, APIs, and internals |
| `common/README.md` | Documents common modules (network, kvstore, sketch, etc.) |
| `README.md` | Adds pointers to architecture, development, and module docs |
| `DOCUMENTATION_INDEX.md` | Introduces a comprehensive index of repository and official docs |
| `DEVELOPMENT.md` | Adds developer setup, build, testing, debugging, and contribution guidance |
| `CODE_DOCUMENTATION_GUIDE.md` | Establishes code documentation standards across languages |
| `ARCHITECTURE.md` | Adds a high-level architecture overview and module structure |


## Configuration

Key configuration parameters:


Copilot AI Oct 19, 2025


`spark.streaming.checkpoint.interval` is not a valid Spark configuration. The checkpoint interval for DStreams is set per DStream in code via `dstream.checkpoint(Seconds(n))`. Please remove this property from the configuration block and, if desired, add a code example showing `dstream.checkpoint(Seconds(10))`.
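For reference, a minimal sketch of per-DStream checkpointing (the app name, master, checkpoint directory, and socket source below are illustrative, not part of this PR):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Checkpointing is enabled per DStream in code, not via a
// spark.streaming.* configuration property.
val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/spark-checkpoints")            // directory for checkpoint data
val lines = ssc.socketTextStream("localhost", 9999) // placeholder input stream
lines.checkpoint(Seconds(10))                       // per-DStream checkpoint interval
ssc.start()
ssc.awaitTermination()
```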


Comment on lines +491 to +502
| Script | Purpose |
|--------|---------|
| `start-all.sh` | Start master and all workers |
| `stop-all.sh` | Stop master and all workers |
| `start-master.sh` | Start master on current machine |
| `stop-master.sh` | Stop master |
| `start-worker.sh` | Start worker on current machine |
| `stop-worker.sh` | Stop worker |
| `start-workers.sh` | Start workers on all machines in `conf/workers` |
| `stop-workers.sh` | Stop all workers |
| `start-history-server.sh` | Start history server |
| `stop-history-server.sh` | Stop history server |

Copilot AI Oct 19, 2025


The markdown table is malformed due to double leading pipes and an incorrect separator row, which will render improperly. Replace with a standard markdown table format as shown below.


Comment on lines +491 to +492
| Script | Purpose |
|--------|---------|

Copilot AI Oct 19, 2025


Suggested fix for the table header and separator:


Comment on lines +491 to +502

Copilot AI Oct 19, 2025


Replace the above with:


Comment on lines +491 to +502

Copilot AI Oct 19, 2025


Corrected table formatting.



### EdgeTriplet

Represents a edge with its source and destination vertex properties.

Copilot AI Oct 19, 2025


Fix the grammatical error: 'a edge' should be 'an edge'.

Suggested change
Represents a edge with its source and destination vertex properties.
Represents an edge with its source and destination vertex properties.



MLlib provides distributed linear algebra through Breeze.

**Location**: `src/main/scala/org/apache/spark/mllib/linalg/`

Copilot AI Oct 19, 2025


This section mixes package paths: the stated location points to `mllib/linalg` (RDD-based), while the example imports `org.apache.spark.ml.linalg` (DataFrame-based, located under `mllib-local`). Please clarify by either updating the location to `mllib-local/src/main/scala/org/apache/spark/ml/linalg` for `ml.linalg` types, or changing the example to import `org.apache.spark.mllib.linalg` if you intend to reference the RDD-based API.

Suggested change
**Location**: `src/main/scala/org/apache/spark/mllib/linalg/`
**Location**: `mllib-local/src/main/scala/org/apache/spark/ml/linalg/`
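To illustrate the distinction, a small sketch contrasting the two packages (the import renames and values are arbitrary):

```scala
// DataFrame-based API (spark.ml), implemented under mllib-local:
import org.apache.spark.ml.linalg.{Vectors => MlVectors}
// RDD-based API (spark.mllib):
import org.apache.spark.mllib.linalg.{Vectors => MllibVectors}

val mlVec    = MlVectors.dense(1.0, 2.0, 3.0)    // org.apache.spark.ml.linalg.Vector
val mllibVec = MllibVectors.dense(1.0, 2.0, 3.0) // org.apache.spark.mllib.linalg.Vector
```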



```bash
# Enable remote debugging
export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
```

Copilot AI Oct 19, 2025


For spark-shell and spark-submit, SPARK_SUBMIT_OPTS is the standard way to pass JVM options; SPARK_JAVA_OPTS may not be respected. Recommend replacing with SPARK_SUBMIT_OPTS, e.g.: export SPARK_SUBMIT_OPTS='-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005'.

Suggested change
export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"


@HyukjinKwon
Member

I think this would need an SPIP (https://spark.apache.org/improvement-proposals.html).
