
Commit 84b33b8

HADOOP-18470. index.md update for 3.3.5 release

1 parent 8a9bdb1 commit 84b33b8

3 files changed: 86 additions, 196 deletions

hadoop-common-project/hadoop-common/src/site/markdown/ClusterSetup.md

Lines changed: 11 additions & 1 deletion
@@ -22,7 +22,17 @@ Purpose
 This document describes how to install and configure Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes. To play with Hadoop, you may first want to install it on a single machine (see [Single Node Setup](./SingleCluster.html)).
 
-This document does not cover advanced topics such as [Security](./SecureMode.html) or High Availability.
+This document does not cover advanced topics such as High Availability.
+
+*Important*: all production Hadoop clusters use Kerberos to authenticate callers
+and secure access to HDFS data, as well as restricting access to computation
+services (YARN etc.).
+
+These instructions do not cover integration with any Kerberos services;
+everyone bringing up a production cluster should include connecting to their
+organisation's Kerberos infrastructure as a key part of the deployment.
+
+See [Security](./SecureMode.html) for details on how to secure a cluster.
 
 Prerequisites
 -------------

hadoop-common-project/hadoop-common/src/site/markdown/SingleCluster.md.vm

Lines changed: 9 additions & 2 deletions
@@ -26,15 +26,22 @@ Purpose
 This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).
 
+
+*Important*: all production Hadoop clusters use Kerberos to authenticate callers
+and secure access to HDFS data, as well as restricting access to computation
+services (YARN etc.).
+
+These instructions do not cover integration with any Kerberos services;
+everyone bringing up a production cluster should include connecting to their
+organisation's Kerberos infrastructure as a key part of the deployment.
+
 Prerequisites
 -------------
 
 $H3 Supported Platforms
 
 * GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
 
-* Windows is also a supported platform but the followings steps are for Linux only. To set up Hadoop on Windows, see [wiki page](http://wiki.apache.org/hadoop/Hadoop2OnWindows).
-
 $H3 Required Software
 
 Required software for Linux include:

hadoop-project/src/site/markdown/index.md.vm

Lines changed: 66 additions & 193 deletions
@@ -15,226 +15,99 @@
 Apache Hadoop ${project.version}
 ================================
 
-Apache Hadoop ${project.version} incorporates a number of significant
-enhancements over the previous major release line (hadoop-2.x).
+Apache Hadoop ${project.version} is an update to the Hadoop 3.3.x release branch.
 
-This release is generally available (GA), meaning that it represents a point of
-API stability and quality that we consider production-ready.
-
-Overview
-========
+Overview of Changes
+===================
 
 Users are encouraged to read the full set of release notes.
 This page provides an overview of the major changes.
 
-Minimum required Java version increased from Java 7 to Java 8
-------------------
+Vectored IO API
+---------------
 
-All Hadoop JARs are now compiled targeting a runtime version of Java 8.
-Users still using Java 7 or below must upgrade to Java 8.
+The `PositionedReadable` interface has added an operation for
+vectored IO (also known as scatter/gather IO):
 
-Support for erasure coding in HDFS
-------------------
+```java
+void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)
+```
 
-Erasure coding is a method for durably storing data with significant space
-savings compared to replication. Standard encodings like Reed-Solomon (10,4)
-have a 1.4x space overhead, compared to the 3x overhead of standard HDFS
-replication.
+All the requested ranges will be retrieved into the supplied byte buffers, possibly
+asynchronously, possibly in parallel, with results potentially arriving out of order.
 
-Since erasure coding imposes additional overhead during reconstruction
-and performs mostly remote reads, it has traditionally been used for
-storing colder, less frequently accessed data. Users should consider
-the network and CPU overheads of erasure coding when deploying this
-feature.
+1. The default implementation uses a series of `readFully()` calls, so delivers
+equivalent performance.
+2. The local filesystem uses Java native IO calls for higher-performance reads
+than `readFully()`.
+3. The S3A filesystem issues parallel HTTP GET requests in different threads.
 
-More details are available in the
-[HDFS Erasure Coding](./hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html)
-documentation.
+Benchmarking of (modified) ORC and Parquet clients through `file://` and `s3a://`
+shows tangible improvements in query times.
 
-YARN Timeline Service v.2
--------------------
+Further Reading: [FsDataInputStream](./hadoop-project-dist/hadoop-common/filesystem/fsdatainputstream.html).
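To make the scatter/gather semantics concrete, here is a minimal self-contained sketch of the pattern using plain `java.nio`; the `Range` record and `readVectored` helper below are illustrative stand-ins for Hadoop's `FileRange` API and its implementation, not the actual Hadoop code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Illustrative sketch of vectored (scatter/gather) reads; not the Hadoop API. */
public class VectoredReadSketch {

    /** Stand-in for Hadoop's FileRange: an (offset, length) pair. */
    record Range(long offset, int length) {}

    /** Fetch each range asynchronously; the futures may complete out of order. */
    static List<CompletableFuture<ByteBuffer>> readVectored(
            FileChannel ch, List<Range> ranges, IntFunction<ByteBuffer> allocate) {
        List<CompletableFuture<ByteBuffer>> results = new ArrayList<>();
        for (Range r : ranges) {
            results.add(CompletableFuture.supplyAsync(() -> {
                try {
                    // Caller-supplied allocator, mirroring the allocate parameter.
                    ByteBuffer buf = allocate.apply(r.length());
                    int read = 0;
                    while (read < r.length()) {
                        // Positional read: no shared file pointer, so ranges
                        // can be fetched concurrently from one channel.
                        int n = ch.read(buf, r.offset() + read);
                        if (n < 0) {
                            throw new UncheckedIOException(
                                new IOException("EOF before end of range " + r));
                        }
                        read += n;
                    }
                    buf.flip();
                    return buf;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }));
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("vectored", ".bin");
        Files.write(tmp, "abcdefghijklmnopqrstuvwxyz".getBytes(StandardCharsets.US_ASCII));
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            List<CompletableFuture<ByteBuffer>> reads = readVectored(
                ch, List.of(new Range(0, 3), new Range(10, 4)), ByteBuffer::allocate);
            // Results come back in request order even if reads finished out of order.
            for (CompletableFuture<ByteBuffer> f : reads) {
                System.out.println(StandardCharsets.US_ASCII.decode(f.join()));
            }
        }
        Files.delete(tmp);
    }
}
```

Note how the caller supplies the buffer allocator: the `readVectored` signature above has the same shape, which lets callers choose heap or direct buffers.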
 
-We are introducing an early preview (alpha 2) of a major revision of YARN
-Timeline Service: v.2. YARN Timeline Service v.2 addresses two major
-challenges: improving scalability and reliability of Timeline Service, and
-enhancing usability by introducing flows and aggregation.
+Manifest Committer for Azure ABFS and Google GCS performance
+------------------------------------------------------------
 
-YARN Timeline Service v.2 alpha 2 is provided so that users and developers
-can test it and provide feedback and suggestions for making it a ready
-replacement for Timeline Service v.1.x. It should be used only in a test
-capacity.
+A new "intermediate manifest committer" uses a manifest file
+to commit the work of successful task attempts, rather than
+renaming directories.
+Job commit is a matter of reading all the manifests, creating the
+destination directories (parallelized) and renaming the files,
+again in parallel.
+
+This is fast and correct on Azure Storage and Google GCS,
+and should be used there instead of the classic v1/v2 file
+output committers.
+
+It is also safe to use on HDFS, where it should be faster
+than the v1 committer. It is however optimized for
+cloud storage, where list and rename operations are significantly
+slower, so the benefits there may be smaller.
 
 More details are available in the
-[YARN Timeline Service v.2](./hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
+[manifest committer](./hadoop-mapreduce-client/hadoop-mapreduce-client-core/manifest_committer.html)
 documentation.
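In 3.3.x the committer is wired up per filesystem scheme through a committer factory setting; the fragment below follows the pattern described in the linked manifest committer documentation for ABFS, but treat the property and class names as something to verify against that page for your release:

```xml
<!-- Bind job output on abfs:// URLs to the manifest committer factory.
     Property and class names should be checked against the manifest
     committer documentation for the Hadoop release in use. -->
<property>
  <name>mapreduce.outputcommitter.factory.scheme.abfs</name>
  <value>org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory</value>
</property>
```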
 
-Shell script rewrite
--------------------
+Transitive CVE fixes
+--------------------
 
-The Hadoop shell scripts have been rewritten to fix many long-standing
-bugs and include some new features. While an eye has been kept towards
-compatibility, some changes may break existing installations.
+A lot of dependencies have been upgraded to address recent CVEs.
+Many of the CVEs were not actually exploitable through Hadoop itself,
+so much of this work is simply due diligence.
+However, applications which have these libraries on their classpath may
+be vulnerable, and the upgrades should also reduce the number of false
+positives security scanners report.
 
-Incompatible changes are documented in the release notes, with related
-discussion on [HADOOP-9902](https://issues.apache.org/jira/browse/HADOOP-9902).
+We have not been able to upgrade every single dependency to the latest
+version available; some of those upgrades would simply be incompatible.
+If you have concerns about the state of a specific library, consult the Apache JIRA
+issue tracker to see what discussions have taken place about the library in question.
 
-More details are available in the
-[Unix Shell Guide](./hadoop-project-dist/hadoop-common/UnixShellGuide.html)
-documentation. Power users will also be pleased by the
-[Unix Shell API](./hadoop-project-dist/hadoop-common/UnixShellAPI.html)
-documentation, which describes much of the new functionality, particularly
-related to extensibility.
-
-Shaded client jars
-------------------
-
-The `hadoop-client` Maven artifact available in 2.x releases pulls
-Hadoop's transitive dependencies onto a Hadoop application's classpath.
-This can be problematic if the versions of these transitive dependencies
-conflict with the versions used by the application.
-
-[HADOOP-11804](https://issues.apache.org/jira/browse/HADOOP-11804) adds
-new `hadoop-client-api` and `hadoop-client-runtime` artifacts that
-shade Hadoop's dependencies into a single jar. This avoids leaking
-Hadoop's dependencies onto the application's classpath.
-
-Support for Opportunistic Containers and Distributed Scheduling.
---------------------
+As an open source project, contributions in this area are always welcome,
+especially in testing the active branches, testing applications downstream of
+those branches, and verifying whether updated dependencies trigger regressions.
 
-A notion of `ExecutionType` has been introduced, whereby Applications can
-now request for containers with an execution type of `Opportunistic`.
-Containers of this type can be dispatched for execution at an NM even if
-there are no resources available at the moment of scheduling. In such a
-case, these containers will be queued at the NM, waiting for resources to
-be available for it to start. Opportunistic containers are of lower priority
-than the default `Guaranteed` containers and are therefore preempted,
-if needed, to make room for Guaranteed containers. This should
-improve cluster utilization.
-
-Opportunistic containers are by default allocated by the central RM, but
-support has also been added to allow opportunistic containers to be
-allocated by a distributed scheduler which is implemented as an
-AMRMProtocol interceptor.
-
-Please see [documentation](./hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html)
-for more details.
-
-MapReduce task-level native optimization
---------------------
+HDFS: Router Based Federation
+-----------------------------
 
-MapReduce has added support for a native implementation of the map output
-collector. For shuffle-intensive jobs, this can lead to a performance
-improvement of 30% or more.
+A lot of effort has been invested into stabilizing/improving the HDFS Router Based Federation feature.
 
-See the release notes for
-[MAPREDUCE-2841](https://issues.apache.org/jira/browse/MAPREDUCE-2841)
-for more detail.
+1. HDFS-13522, HDFS-16767 & related JIRAs: allow Observer reads in HDFS Router Based Federation.
+2. HDFS-13248: RBF supports client locality.
 
-Support for more than 2 NameNodes.
---------------------
 
-The initial implementation of HDFS NameNode high-availability provided
-for a single active NameNode and a single Standby NameNode. By replicating
-edits to a quorum of three JournalNodes, this architecture is able to
-tolerate the failure of any one node in the system.
-
-However, some deployments require higher degrees of fault-tolerance.
-This is enabled by this new feature, which allows users to run multiple
-standby NameNodes. For instance, by configuring three NameNodes and
-five JournalNodes, the cluster is able to tolerate the failure of two
-nodes rather than just one.
-
-The [HDFS high-availability documentation](./hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html)
-has been updated with instructions on how to configure more than two
-NameNodes.
-
-Default ports of multiple services have been changed.
-------------------------
-
-Previously, the default ports of multiple Hadoop services were in the
-Linux ephemeral port range (32768-61000). This meant that at startup,
-services would sometimes fail to bind to the port due to a conflict
-with another application.
-
-These conflicting ports have been moved out of the ephemeral range,
-affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our
-documentation has been updated appropriately, but see the release
-notes for [HDFS-9427](https://issues.apache.org/jira/browse/HDFS-9427) and
-[HADOOP-12811](https://issues.apache.org/jira/browse/HADOOP-12811)
-for a list of port changes.
-
-Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
----------------------
-
-Hadoop now supports integration with Microsoft Azure Data Lake and
-Aliyun Object Storage System as alternative Hadoop-compatible filesystems.
-
-Intra-datanode balancer
--------------------
-
-A single DataNode manages multiple disks. During normal write operation,
-disks will be filled up evenly. However, adding or replacing disks can
-lead to significant skew within a DataNode. This situation is not handled
-by the existing HDFS balancer, which concerns itself with inter-, not intra-,
-DN skew.
-
-This situation is handled by the new intra-DataNode balancing
-functionality, which is invoked via the `hdfs diskbalancer` CLI.
-See the disk balancer section in the
-[HDFS Commands Guide](./hadoop-project-dist/hadoop-hdfs/HDFSCommands.html)
-for more information.
-
-Reworked daemon and task heap management
----------------------
-
-A series of changes have been made to heap management for Hadoop daemons
-as well as MapReduce tasks.
-
-[HADOOP-10950](https://issues.apache.org/jira/browse/HADOOP-10950) introduces
-new methods for configuring daemon heap sizes.
-Notably, auto-tuning is now possible based on the memory size of the host,
-and the `HADOOP_HEAPSIZE` variable has been deprecated.
-See the full release notes of HADOOP-10950 for more detail.
-
-[MAPREDUCE-5785](https://issues.apache.org/jira/browse/MAPREDUCE-5785)
-simplifies the configuration of map and reduce task
-heap sizes, so the desired heap size no longer needs to be specified
-in both the task configuration and as a Java option.
-Existing configs that already specify both are not affected by this change.
-See the full release notes of MAPREDUCE-5785 for more details.
-
-HDFS Router-Based Federation
----------------------
-HDFS Router-Based Federation adds a RPC routing layer that provides a federated
-view of multiple HDFS namespaces. This is similar to the existing
-[ViewFs](./hadoop-project-dist/hadoop-hdfs/ViewFs.html)) and
-[HDFS Federation](./hadoop-project-dist/hadoop-hdfs/Federation.html)
-functionality, except the mount table is managed on the server-side by the
-routing layer rather than on the client. This simplifies access to a federated
-cluster for existing HDFS clients.
-
-See [HDFS-10467](https://issues.apache.org/jira/browse/HDFS-10467) and the
-HDFS Router-based Federation
-[documentation](./hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html) for
-more details.
-
-API-based configuration of Capacity Scheduler queue configuration
-----------------------
-
-The OrgQueue extension to the capacity scheduler provides a programmatic way to
-change configurations by providing a REST API that users can call to modify
-queue configurations. This enables automation of queue configuration management
-by administrators in the queue's `administer_queue` ACL.
-
-See [YARN-5734](https://issues.apache.org/jira/browse/YARN-5734) and the
-[Capacity Scheduler documentation](./hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html) for more information.
-
-YARN Resource Types
----------------
+HDFS: Dynamic Datanode Reconfiguration
+--------------------------------------
+
+HDFS-16400, HDFS-16399, HDFS-16396, HDFS-16397, HDFS-16413, HDFS-16457.
 
-The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.
+A number of DataNode configuration options can be changed without having to restart
+the DataNode. This makes it possible to tune deployment configurations without
+cluster-wide DataNode restarts.
 
-See [YARN-3926](https://issues.apache.org/jira/browse/YARN-3926) and the [YARN resource model documentation](./hadoop-yarn/hadoop-yarn-site/ResourceModel.html) for more information.
+See [DataNode.java](https://github.com/apache/hadoop/blob/branch-3.3.5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L346-L361)
+for the list of dynamically reconfigurable attributes.
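Reconfiguration is driven through the existing `hdfs dfsadmin -reconfig` subcommand. A typical sequence against a live cluster looks like the following; the hostname is a placeholder and 9867 is the default DataNode IPC port, so substitute your own values:

```
# Edit hdfs-site.xml on the DataNode, then ask it to reload its configuration:
hdfs dfsadmin -reconfig datanode dn1.example.com:9867 start

# Poll until the reconfiguration task reports completion:
hdfs dfsadmin -reconfig datanode dn1.example.com:9867 status

# List which properties the DataNode supports reconfiguring:
hdfs dfsadmin -reconfig datanode dn1.example.com:9867 properties
```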
 
 Getting Started
 ===============
