Adds comprehensive top-level documentation to the repository #52656
base: master
Conversation
Bumps the pip group with 1 update in the /dev directory: [black](https://github.com/psf/black). Updates `black` from 23.12.1 to 24.3.0.

- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@23.12.1...24.3.0)

---
updated-dependencies:
- dependency-name: black
  dependency-version: 24.3.0
  dependency-type: direct:production
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Bump black from 23.12.1 to 24.3.0 in /dev in the pip group across 1 directory
Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com>
…n guide Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com>
…ation Add comprehensive documentation for Spark architecture, modules, and tools
This pull request significantly improves the documentation structure and discoverability of the Apache Spark project. It introduces a comprehensive architecture overview and a detailed documentation index, and enhances the main README.

These changes make it much easier for both new users and contributors to understand Spark's architecture and to find the documentation they need.
Pull Request Overview
Adds comprehensive, top-level documentation and module READMEs to improve discoverability and onboarding, plus a docs index and architecture overview.
- New ARCHITECTURE.md and DOCUMENTATION_INDEX.md
- Adds/updates module-level READMEs (core, common, streaming, mllib, graphx, launcher, examples, resource-managers, sbin)
- Root README now links to the new docs; dev requirements update Black to 24.3.0
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 8 comments.
| File | Description |
|------|-------------|
| streaming/README.md | Adds an in-depth guide for DStreams and Structured Streaming with examples, build/test, and migration guidance |
| sbin/README.md | Documents standalone admin scripts, HA, security, and troubleshooting |
| resource-managers/README.md | Describes YARN, Kubernetes, Mesos integrations, configs, and comparisons |
| mllib/README.md | Covers MLlib APIs, pipelines, algorithms, and examples |
| launcher/README.md | Documents programmatic launching APIs and examples |
| graphx/README.md | Provides GraphX concepts, APIs, algorithms, and examples |
| examples/README.md | Explains running and organizing example apps across languages |
| dev/requirements.txt | Bumps Black formatter to 24.3.0 |
| core/README.md | Documents Spark Core architecture, APIs, and internals |
| common/README.md | Documents common modules (network, kvstore, sketch, etc.) |
| README.md | Adds pointers to architecture, development, and module docs |
| DOCUMENTATION_INDEX.md | Introduces a comprehensive index of repository and official docs |
| DEVELOPMENT.md | Adds developer setup, build, testing, debugging, and contribution guidance |
| CODE_DOCUMENTATION_GUIDE.md | Establishes code documentation standards across languages |
| ARCHITECTURE.md | Adds a high-level architecture overview and module structure |
## Configuration

Key configuration parameters:
Copilot AI · Oct 19, 2025
`spark.streaming.checkpoint.interval` is not a valid Spark configuration. The checkpoint interval for DStreams is set per DStream in code via `dstream.checkpoint(Seconds(n))`. Please remove this property from the configuration block and, if desired, add a code example showing `dstream.checkpoint(Seconds(10))`.
Copilot uses AI. Check for mistakes.
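As the review notes, the DStream checkpoint interval is set in application code rather than through a configuration key. A minimal sketch (the app name, checkpoint directory, and socket source are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointExample")
val ssc = new StreamingContext(conf, Seconds(1))

// Directory where checkpoint data is written; required for stateful operations.
ssc.checkpoint("hdfs:///tmp/checkpoints")

val lines = ssc.socketTextStream("localhost", 9999)

// The checkpoint interval is set on the DStream itself, not via a config property.
lines.checkpoint(Seconds(10))
```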
| Script | Purpose |
|--------|---------|
| `start-all.sh` | Start master and all workers |
| `stop-all.sh` | Stop master and all workers |
| `start-master.sh` | Start master on current machine |
| `stop-master.sh` | Stop master |
| `start-worker.sh` | Start worker on current machine |
| `stop-worker.sh` | Stop worker |
| `start-workers.sh` | Start workers on all machines in `conf/workers` |
| `stop-workers.sh` | Stop all workers |
| `start-history-server.sh` | Start history server |
| `stop-history-server.sh` | Stop history server |
Copilot AI · Oct 19, 2025
The markdown table is malformed due to double leading pipes and an incorrect separator row, which will render improperly. Replace with a standard markdown table format as shown below.
| Script | Purpose |
|--------|---------|
Copilot AI · Oct 19, 2025
Suggested fix for the table header and separator:
### EdgeTriplet

Represents an edge with its source and destination vertex properties.
Copilot AI · Oct 19, 2025
Fix the grammatical error: 'a edge' should be 'an edge'.
Represents a edge with its source and destination vertex properties.
Represents an edge with its source and destination vertex properties.
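A short GraphX sketch of how `EdgeTriplet` exposes the edge attribute together with both endpoint properties (the tiny example graph is illustrative):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TripletExample").getOrCreate()
val sc = spark.sparkContext

// A tiny property graph: vertices carry names, edges carry a relationship label.
val vertices: RDD[(Long, String)] = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph = Graph(vertices, edges)

// Each EdgeTriplet combines the edge attribute (attr) with the
// source (srcAttr) and destination (dstAttr) vertex properties.
graph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
}
```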
MLlib provides distributed linear algebra through Breeze.

**Location**: `src/main/scala/org/apache/spark/mllib/linalg/`
Copilot AI · Oct 19, 2025
This section mixes package paths: the stated location points to `mllib/linalg` (RDD-based), while the example imports `org.apache.spark.ml.linalg` (DataFrame-based, located under `mllib-local`). Please clarify by either updating the location to `mllib-local/src/main/scala/org/apache/spark/ml/linalg` for `ml.linalg` types, or changing the example to import `org.apache.spark.mllib.linalg` if you intend to reference the RDD-based API.
**Location**: `src/main/scala/org/apache/spark/mllib/linalg/`
**Location**: `mllib-local/src/main/scala/org/apache/spark/ml/linalg/`
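If the DataFrame-based API is the one intended, a minimal sketch using the `org.apache.spark.ml.linalg` local vector types (the values are illustrative):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Dense and sparse local vectors from the DataFrame-based ml.linalg package.
val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Both representations implement the same Vector interface.
println(dense(2))        // indexed element access
println(sparse.toDense)  // conversion between representations
```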
```bash
# Enable remote debugging
export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
```
Copilot AI · Oct 19, 2025
For `spark-shell` and `spark-submit`, `SPARK_SUBMIT_OPTS` is the standard way to pass JVM options; `SPARK_JAVA_OPTS` may not be respected. Recommend replacing with `SPARK_SUBMIT_OPTS`, e.g.: `export SPARK_SUBMIT_OPTS='-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005'`.
export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
I think this would need an SPIP (https://spark.apache.org/improvement-proposals.html).
This pull request adds comprehensive top-level documentation to the repository, making it easier for both users and developers to find, understand, and navigate Spark's documentation and architecture. The main changes include the addition of an in-depth architecture overview, a complete documentation index, and improvements to the README.md to point to these new resources and module-level documentation.
Major documentation enhancements:
Documentation structure and discoverability:
Added DOCUMENTATION_INDEX.md, providing a thorough, organized index of all documentation in the repository, including links and descriptions for user, developer, administrator, and API docs.
Updated README.md to include direct links to the new architecture overview, development guide, contribution guide, and all major module and tool documentation, improving onboarding and navigation for new users.
Architecture overview:
Added a new ARCHITECTURE.md file with a detailed description of Spark’s architecture, including core components, execution model, key subsystems (memory, shuffle, storage, serialization), module structure, performance tuning, and contribution guidelines.
These changes significantly improve the accessibility and clarity of Spark's documentation, making it easier for contributors and users to understand the project structure and find the resources they need.