Apache Hadoop ${project.version}
================================

Apache Hadoop ${project.version} is an update to the Hadoop 3.3.x release branch.

Overview of Changes
===================

Users are encouraged to read the full set of release notes.
This page provides an overview of the major changes.

|
30 |
| -Minimum required Java version increased from Java 7 to Java 8 |
31 |
| ------------------- |
| 26 | +Vectored IO API |
| 27 | +--------------- |
32 | 28 |
|
33 |
| -All Hadoop JARs are now compiled targeting a runtime version of Java 8. |
34 |
| -Users still using Java 7 or below must upgrade to Java 8. |
| 29 | +The `PositionedReadable` interface has now added an operation for |
| 30 | +Vectored (also known as Scatter/Gather IO): |
35 | 31 |
|
36 |
| -Support for erasure coding in HDFS |
37 |
| ------------------- |
| 32 | +```java |
| 33 | +void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate) |
| 34 | +``` |
38 | 35 |
|
39 |
| -Erasure coding is a method for durably storing data with significant space |
40 |
| -savings compared to replication. Standard encodings like Reed-Solomon (10,4) |
41 |
| -have a 1.4x space overhead, compared to the 3x overhead of standard HDFS |
42 |
| -replication. |
| 36 | +All the requested ranges will be retrieved into the supplied byte buffers -possibly asynchronously, |
| 37 | +possibly in parallel, with results potentially coming in out-of-order. |
43 | 38 |
|
44 |
| -Since erasure coding imposes additional overhead during reconstruction |
45 |
| -and performs mostly remote reads, it has traditionally been used for |
46 |
| -storing colder, less frequently accessed data. Users should consider |
47 |
| -the network and CPU overheads of erasure coding when deploying this |
48 |
| -feature. |
| 39 | +1. The default implementation uses a series of `readFully()` calls, so delivers |
| 40 | + equivalent performance. |
| 41 | +2. The local filesystem uses java native IO calls for higher performance reads than `readFully()` |
| 42 | +3. The S3A filesystem issues parallel HTTP GET requests in different threads. |
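The submit-then-collect pattern this API enables — issue every range at once, then wait on each buffer as it completes — can be sketched with only the JDK's asynchronous channel API. This is an illustrative analogue, not Hadoop code; the class and helper names below are invented for the sketch:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class VectoredReadSketch {

  /** A requested (offset, length) pair plus the future that will hold its data. */
  static final class Range {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();
    Range(long offset, int length) { this.offset = offset; this.length = length; }
  }

  /** Issue every read at once; each future completes independently, possibly out of order. */
  static void readVectored(AsynchronousFileChannel channel, List<Range> ranges) {
    for (Range r : ranges) {
      ByteBuffer buf = ByteBuffer.allocate(r.length);
      channel.read(buf, r.offset, null, new CompletionHandler<Integer, Void>() {
        @Override public void completed(Integer bytesRead, Void attachment) {
          buf.flip();                          // make the read bytes consumable
          r.data.complete(buf);
        }
        @Override public void failed(Throwable t, Void attachment) {
          r.data.completeExceptionally(t);
        }
      });
    }
  }

  public static void main(String[] args) throws Exception {
    Path file = Files.createTempFile("vectored", ".bin");
    Files.write(file, "abcdefghijklmnopqrstuvwxyz".getBytes(StandardCharsets.US_ASCII));
    try (AsynchronousFileChannel channel =
             AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
      // Two non-contiguous ranges of the file, requested in a single call.
      List<Range> ranges = Arrays.asList(new Range(0, 3), new Range(23, 3));
      readVectored(channel, ranges);
      for (Range r : ranges) {
        ByteBuffer b = r.data.join();          // block until this range arrives
        byte[] bytes = new byte[b.remaining()];
        b.get(bytes);
        System.out.println(r.offset + ": " + new String(bytes, StandardCharsets.US_ASCII));
      }
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

In the real API each `FileRange` similarly exposes a future for its data once `readVectored()` has been called, so callers can process ranges that have already arrived while other reads are still in flight.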

Benchmarking of (modified) ORC and Parquet clients through `file://` and `s3a://`
shows tangible improvements in query times.

Further reading: [FsDataInputStream](./hadoop-project-dist/hadoop-common/filesystem/fsdatainputstream.html).

|
57 |
| -We are introducing an early preview (alpha 2) of a major revision of YARN |
58 |
| -Timeline Service: v.2. YARN Timeline Service v.2 addresses two major |
59 |
| -challenges: improving scalability and reliability of Timeline Service, and |
60 |
| -enhancing usability by introducing flows and aggregation. |
| 49 | +Manifest Committer for Azure ABFS and google GCS performance |
| 50 | +------------------------------------------------------------ |
61 | 51 |
|
62 |
| -YARN Timeline Service v.2 alpha 2 is provided so that users and developers |
63 |
| -can test it and provide feedback and suggestions for making it a ready |
64 |
| -replacement for Timeline Service v.1.x. It should be used only in a test |
65 |
| -capacity. |
| 52 | +A new "intermediate manifest committer" uses a manifest file |
| 53 | +to commit the work of successful task attempts, rather than |
| 54 | +renaming directories. |
| 55 | +Job commit is matter of reading all the manifests, creating the |
| 56 | +destination directories (parallelized) and renaming the files, |
| 57 | +again in parallel. |
| 58 | + |
| 59 | +This is fast and correct on Azure Storage and Google GCS, |
| 60 | +and should be used there instead of the classic v1/v2 file |
| 61 | +output committers. |
| 62 | + |
| 63 | +It is also safe to use on HDFS, where it should be faster |
| 64 | +than the v1 committer. It is however optimized for |
| 65 | +cloud storage where list and rename operations are significantly |
| 66 | +slower; the benefits may be less. |
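As a sketch of how a job might opt in, the committer is bound to a filesystem scheme through the output committer factory mechanism. The property names and factory classes below follow the manifest committer documentation, but verify them against your release before relying on them:

```xml
<!-- mapred-site.xml: route abfs:// and gs:// job output through the manifest committer. -->
<!-- Factory class names as documented for the committer; confirm for your release. -->
<property>
  <name>mapreduce.outputcommitter.factory.scheme.abfs</name>
  <value>org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory</value>
</property>
<property>
  <name>mapreduce.outputcommitter.factory.scheme.gs</name>
  <value>org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory</value>
</property>
```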

More details are available in the
[manifest committer](./hadoop-mapreduce-client/hadoop-mapreduce-client-core/manifest_committer.html)
documentation.

Transitive CVE fixes
--------------------

A lot of dependencies have been upgraded to address recent CVEs.
Many of the CVEs were not actually exploitable through Hadoop,
so much of this work is simply due diligence.
However, applications which have these libraries on their classpath may
be vulnerable, and the upgrades should also reduce the number of false
positives reported by security scanners.

We have not been able to upgrade every single dependency to its latest
version; some of those upgrades would simply be incompatible.
If you have concerns about the state of a specific library, consult the Apache JIRA
issue tracker to see what discussions have taken place about the library in question.

As an open source project, contributions in this area are always welcome,
especially in testing the active branches, testing applications downstream of
those branches, and verifying whether updated dependencies trigger regressions.

HDFS: Router Based Federation
-----------------------------

Significant effort has been invested in stabilizing and improving the HDFS Router Based Federation feature.

1. HDFS-13522, HDFS-16767 and related JIRAs: allow Observer reads in HDFS Router Based Federation.
2. HDFS-13248: RBF supports client locality.

HDFS: Dynamic Datanode Reconfiguration
--------------------------------------

HDFS-16400, HDFS-16399, HDFS-16396, HDFS-16397, HDFS-16413, HDFS-16457.

A number of Datanode configuration options can now be changed without restarting
the Datanode, making it possible to tune deployment configurations without
cluster-wide Datanode restarts.

See [DataNode.java](https://github.com/apache/hadoop/blob/branch-3.3.5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L346-L361)
for the list of dynamically reconfigurable attributes.
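Reconfiguration is driven through the `hdfs dfsadmin -reconfig` command against a live cluster; a typical sequence might look like the following (the hostname is a placeholder, and 9867 is the default Datanode IPC port):

```shell
# Edit the reconfigurable property in hdfs-site.xml on the Datanode host, then:
hdfs dfsadmin -reconfig datanode dn1.example.com:9867 start

# Poll until the reconfiguration task finishes and see which properties changed:
hdfs dfsadmin -reconfig datanode dn1.example.com:9867 status

# List the properties this Datanode supports reconfiguring:
hdfs dfsadmin -reconfig datanode dn1.example.com:9867 properties
```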

Getting Started
===============