Coding Standards for Apache Hadoop

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.

(This is a WiP, with a goal of getting it into the Hadoop docs. As it is, I'm still tuning it, especially the section on testing.)

Introduction

Apache Hadoop is one of the most complicated open source projects being developed. Its codebase is small compared to the scale of the Linux kernel and its device drivers, yet it has some of the same characteristics: a filesystem API and implementations (HDFS), a scheduler for executing work along with a means of submitting work to it (YARN), programs to process data on it (MapReduce, Apache Tez), databases (Apache HBase, Apache Accumulo), and management and monitoring tools, both open source and closed.

What makes it complicated is that it is designed to run across hundreds to thousands of individual servers, to store tens of petabytes of data, and to execute scheduled programs in the face of failures of those servers.

It is both a distributed system and a filesystem and OS for data- and datacenter-scale applications. It is possibly the most ambitious "unified" distributed computing project ever attempted. (HTTP servers and browsers atop the network stack are significantly larger, but are developed in a loosely coupled way that reflects their layering and distribution.) The core of Hadoop is a single system; the key applications that run atop it, developed alongside it, are very tightly coupled to that underlying Hadoop application.

For that reason, the project has high expectations and standards of changes to the code. This document attempts to describe them.

It is also intended to provide guidelines for anyone writing applications on top of Hadoop, including YARN applications.

Submitting Pull Requests

Github is now where Hadoop patches are reviewed.

  1. Submit changes as a pull request against apache/hadoop.
  2. You MUST have an existing Apache JIRA for the PR: Github PRs without JIRAs tend to get lost.
  3. You MUST use the JIRA ID in the title. This is used to automatically cross-link the PR.
  4. Object Store patches: you MUST run all integration tests and declare the store endpoint you ran against.
  5. You MUST NOT assign other people as reviewers without asking them first. It's too impersonal a way to seek assistance and just alienates the assignee.

Here are some things which scare the developers when they arrive in JIRA:

  • Vast changes which arrive without warning and are accompanied by press releases. Please, get on the developer list, create the JIRAs, start collaboratively developing things. It's how we get confidence that the code is good —and that you can follow a collaborative development process.
  • Large patches which span the project. They are a nightmare to review and can change the source tree enough to stop other patches applying.
  • Patches which delve into the internals of critical classes. The HDFS NameNode, edit log and YARN schedulers stand out here. Any mistake here can cost data (HDFS) or so much CPU time (the schedulers) that it has a tangible performance impact on the big Hadoop users.
  • Changes to the key public APIs of Hadoop. That includes the FileSystem and FileContext APIs, YARN submission protocols, MapReduce APIs, and the like.
  • Patches that reorganise the code as part of the diff. That includes imports. They make the patch bigger (hence harder to review) and may make it harder to merge in other patches.
  • Big patches without tests.
  • Medium sized patches without tests.
  • Small patches without tests unless they are simple things like spelling corrections in strings or comments.

Things that are appreciated:

  • Documentation, in javadocs and in the main/site packages.
  • Good tests.
  • For code that is delving into the internals of the concurrency/consensus logic, well argued explanations of how the code works in all possible circumstances. State machine models and even TLA+ specifications can be used here to argue your case.
  • Any description of what scale/workload your code was tested with. If it was a big test, that reassures people that this scales well. And if not, at least you are being open about it.

Designing for Scale; Designing for Fail

Hadoop is designed to work at the scale of thousands of machines. Some of them will fail on large jobs or jobs that take time. Code needs to be designed for both the scale and the failure modes.

Scale

Hadoop applications need to be designed to scale to thousands —even tens of thousands— of distributed components.

  1. Data structures must be designed to scale, ideally at O(1), O(log2(n)) or similar. At the scale of Hadoop, O(n) is too expensive.
  2. The algorithms to work with the data structures must also scale well.
  3. Services need to be designed to handle the challenge of many thousands of services reporting near-simultaneously. The "cluster restart" scenario is a classic example here.
  4. Cluster operations MUST be designed to be O(1). That is: no harder to run a cluster of a thousand machines than a cluster of 10. Equally importantly, that single-node cluster should be easy to operate.
  5. Do not have one-thread-per-request service architectures without size-limited thread pools or other form of thread sharing.

Datasets may be measured in tens or hundreds of Terabytes.

  1. Algorithms to work with these datasets must be designed to scale across the cluster: no single node will suffice.
  2. In-memory storage is too expensive for data of this scale. While the cost will fall over time, the dataset size is likely to grow at a similar rate. Design algorithms to work efficiently when data is stored on a (slower) persistent storage, be it Hard disk or, in future, SSD.
  3. Hadoop is moving towards heterogeneous data storage; data may be stored at different levels in the hierarchy, with these details being potentially transparent to the application. Consider using this rather than trying to implement your own memory caches —the HDFS system is designed to support administrator-managed quotas, and to distribute the cached layers around the cluster so as to make more efficient use of the storage tiers.
  4. Applications need to be careful not to accidentally overload other parts of the infrastructure. Making too many simultaneous requests to the Hadoop Namenode (i.e. directory listing and file status queries) can impact HDFS. Even something as "harmless" as a DNS lookup can be disruptive if a single DNS server is used to service the requests of a large cluster.
  5. A sensible type for storing things like counts is long rather than int.

Failure

  1. Don't expect everything to complete: rely on timeouts. YARN applications will have failures reported to their Application Master —this has to handle them.
  2. Slow processes/YARN container-hosted applications can be as troublesome as failing ones, as they can slow down entire workflows. An example of this is the MapReduce "straggler". MapReduce handles these by supporting speculative execution of the trailing work, and by blacklisting nodes. Different applications will need different strategies.
  3. Failures can just as easily be caused by bad application configurations and data: tracking repeated failures of part of an application can be used with some heuristics to conclude that there may be a problem. Here MapReduce can be set to react by skipping specific portions of its input data. Other applications may be able to follow this strategy, or they may have to react differently.
  4. Failure at scale can have its own problems, especially in the specific case of a Hadoop cluster partition. An application may get simultaneous reports of hundreds or thousands of components failing within a few seconds or minutes —it has to respond to these, possibly by recognising the cluster partition simply by the size of the failure report. (HDFS itself has known weaknesses here.)
  5. YARN applications: do not assume that reallocated containers will come up on the same nodes, or be able to open the same network ports.
  6. YARN applications: the AM itself may fail. The default policy in this situation is that YARN kills all the containers then relaunches the AM somewhere. It is possible to request an alternative policy of "retain containers over AM restarts". In this case it becomes the duty of the AM to rebuild its state on restart. Long lived YARN applications should consider this policy.
  7. Dependent services such as HDFS and YARN may themselves fail. While the HA modes should recover from the failures, the cost of this recovery is delays until the failover is complete. Applications need to consider whether they want the default mode of operation (block until failover/restart) or be designed to detect the failures and react in some way. Failure detection is relatively straightforward: configure the IPC layer to report problems rather than block. Reacting to the failure is a significantly harder issue which must be addressed on a case-by-case basis —even within an application.
  8. At sufficiently large scale, data corruption may set in. This is why HDFS checksums its data —and why Hadoop may add its own checksums to Hadoop's IPC layers before long.

Hard disks fail more often than individual servers —they are mechanical and fail over time, and as there are many HDDs to a server, statistics catches up with them. SSDs have their own failure modes as they wear out.

In production clusters, disk and server failures tend to surface when a cluster is restarted; this is when disks are checked most thoroughly by the OS, and power cycles themselves can cause problems. Applications do not have to worry about this startup phase, but HDFS does.

Code Style

For Java code, start with the original "Sun" guidelines.

  1. Two spaces are used for indentation. Not tabs, not four spaces, not eight spaces. Two.
  2. Eighty characters wide. This may seem dated, but it aids patch reviewing through side-by-side comparisons.
  3. Don't use .* imports except for import static imports. This means: turn off IDE features that automatically update and re-order imports. This feature makes merging patches harder.

Comments

  1. SHOULD use javadoc above methods, class, and field declarations.

  2. SHOULD avoid explaining how a method works: it will only become obsolete in time and just confuse people. Write down what the method is expected to do.

  3. SHOULD use // comments within methods, on their own lines, indented to match the source. Not at the end of a line.

  4. MUST NOT: mix // comments within multiline code

     String bad = evaluate()
                 //  update interval
                 + conf.getUpdateInterval()
    
  5. MUST NOT: use comments to turn log commands on or off. Downgrade them or delete them; do not just comment them out.

Configuration options

  1. Declare configuration options as constants, instead of inline

       public static final String KEY_REGISTRY_ZK_QUORUM = "hadoop.registry.zk.quorum";
    
  2. Give them meaningful names, scoped by the service on which they operate.

  3. Code which retrieves string options SHOULD use Configuration.getTrimmed() unless it has a specific need to include leading and trailing whitespace in the values.

  4. Provide meaningful default values in the java source

  5. Provide valid default values in core-default.xml and HDFS/YARN equivalents, with meaningful text to accompany the values. These XML files are used to automatically generate documentation about configuration values. If no XML entry or text is provided, the configuration option will remain an undocumented mystery.
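
A minimal sketch of this pattern, assuming a hypothetical RegistryConfig holder class (the key name comes from the example above; the default value is purely illustrative):

import org.apache.hadoop.conf.Configuration;

public class RegistryConfig {
  // Constant key, scoped by the service it configures.
  public static final String KEY_REGISTRY_ZK_QUORUM =
      "hadoop.registry.zk.quorum";
  // Meaningful default in the Java source; core-default.xml should match it.
  public static final String DEFAULT_REGISTRY_ZK_QUORUM = "localhost:2181";

  public static String getQuorum(Configuration conf) {
    // getTrimmed() strips accidental leading/trailing whitespace from XML values.
    return conf.getTrimmed(KEY_REGISTRY_ZK_QUORUM, DEFAULT_REGISTRY_ZK_QUORUM);
  }
}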

Public, Private and Limited Private code

The Hadoop codebase consists of internal implementation classes, public Java-level classes and APIs, and public IPC protocol interfaces. The Java language access modifiers (public, private, protected and package-scoped) aren't sufficient to describe these: we need to distinguish a class that may be public, yet intended for internal use, from something that is for external callers. We also need to warn those external callers whether an API is considered stable, versus an API that may change from release to release as it stabilizes or evolves.

For this reason, the hadoop-annotations module/JAR contains Java annotations for declaring the scope and stability of classes.

The interface audience is defined in the org.apache.hadoop.classification.InterfaceAudience class:

public class InterfaceAudience {
  /**
   * Intended for use by any project or application.
   */
  public @interface Public {};

  /**
   * Intended only for the project(s) specified in the annotation.
   * For example, "Common", "HDFS", "MapReduce", "ZooKeeper", "HBase".
   */
  public @interface LimitedPrivate {
    String[] value();
  };
  /**
   * Intended for use only within Hadoop itself.
   */
  public @interface Private {};
}

The @Private and @Public annotations resemble the Java access modifiers. @Private means within the Hadoop source tree ONLY. @Public means any application MAY use the class —though the stability annotation warns whether that interface is something the code may rely on.

The unusual one is @LimitedPrivate. This is used to declare that a class or interface is intended for use by specific modules and applications.

The @LimitedPrivate attribute is a troublesome one. It implies that the Hadoop core codebase is not only aware of applications downstream, but that it is explicitly adding features purely for those applications and nothing else. As well as showing favoritism to those applications, it's potentially dangerous. Those special features created for specific applications are effectively commitments to maintain the special feature indefinitely. In which case: why the special scope? Why not add a public API and make it a good one? This is not a hypothetical question, because special support was added into HDFS for HBase support -an append operation, and an atomic create-a-directory-but-not-its-parent method. The append operation eventually evolved into a public method, with HBase then needing to transition from its back-door operation to the public API. And HADOOP-10995 showed where some operations thought to be unused/deprecated were pulled —only to discover that HBase stopped compiling.

What does that mean? Try to avoid this notion altogether.

Stability

The stability of a class's public methods is declared via the org.apache.hadoop.classification.InterfaceStability annotation, which has three values, Unstable, Stable and Evolving.

public class InterfaceStability {
  /**
   * Can evolve while retaining compatibility for minor release boundaries;
   * can break compatibility only at major release (ie. at m.0).
   */
  public @interface Stable {};

  /**
   * Evolving, but can break compatibility at minor release (i.e. m.x)
   */
  public @interface Evolving {};

  /**
   * No guarantee is provided as to reliability or stability across any
   * level of release granularity.
   */
  public @interface Unstable {};
}

It is a requirement that all public interfaces have a stability annotation:

  1. All classes that are annotated with Public or LimitedPrivate MUST have an InterfaceStability annotation.
  2. Classes that are Private MUST be considered unstable unless a different InterfaceStability annotation states otherwise.
  3. Incompatible changes MUST NOT be made to classes marked as stable.

What does that mean?

  1. The interface of a class is, according to the Hadoop Compatibility Guidelines, defined as the API level binding, the signature, and the actual behavior of the methods, the semantics. A stable interface not only has to be compatible at the source and binary level, it has to work the same.
  2. During development of a new feature, tag the public APIs as Unstable or Evolving. Declaring that something new is Stable is unrealistic. There will be changes, so do not constrain yourself by declaring that it isn't going to change.
  3. There's also an implicit assumption that any class or interface that does not have any scope attribute is private. Even so, there is no harm in explicitly stating this.

LimitedPrivate is falling out of favor, as it is unclear what it means. It was originally used to mark internal APIs (MapReduce, YARN), but many of those APIs turn out to be necessary for all YARN applications. New LimitedPrivate APIs SHOULD NOT be added, or, if they are, MUST come with justification. New patches SHOULD remove this attribute when broader access is required, rather than just adding new applications to the list.

Submitted patches which provide new APIs for use within the Hadoop stack MUST have scope attributes for all public APIs.
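
As a sketch of what this looks like in practice (the class name here is hypothetical), a new public-but-still-evolving API would be annotated like this:

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

/**
 * Available to all applications, but its signature and semantics
 * may still change between minor releases.
 */
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class ClusterTopologyInfo {
  // ...
}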

Concurrency

Hadoop services are highly concurrent. This codebase is not a place to learn about Java concurrency —developers are expected to have acquired knowledge and experience of Java threading and concurrency before getting involved in Hadoop internals.

Background: the Java memory model:

  1. Use the java.util.concurrent classes wherever possible.
  2. If simple datatypes are shared across threads outside of synchronized blocks, they MUST be marked volatile.
  3. If you want atomic get/set or increment operations, use the AtomicBoolean and AtomicInteger/AtomicLong classes. Exception: when you don't want the cost of these operations and can tolerate a race condition in the increment; some of the metrics do this.

Note that Java volatile fields are more expensive than their C/C++ counterparts (they are memory barriers), so they should not be used for fields which do not need synchronization, or which are only accessed within synchronized blocks.

  1. In the core services, try to avoid overreaching service-wide locks.
  2. Consider operation-specific locks through having (final) fields which can be locked for access to specific areas.
  3. If-and-only-if-absolutely-necessary, lock on the .class object. This is justifiable if the operation would affect all instances of a class.
  4. Avoid calling native code in locked regions.
  5. Avoid calling expensive operations (including System.currentTimeMillis()) in a locked region. If these operations do not need to be synchronized, consider calling them in advance and cache the results.
  6. Be aware: System.getProperty() uses a synchronized Hashtable, so calling it inside a synchronized block will block other threads. Pull such calls out of the block.
  7. Code MUST NOT ignore an InterruptedException —it is a sign that part of the system wishes to stop that thread, often on a process shutdown. Wrap it in an InterruptedIOException if you need to convert to an IOException.
  8. Code MUST NOT start threads inside constructors. These may execute before the class is fully constructed, leading to bizarre failure conditions.
  9. Use Callable<V> over Runnable, as it can not only return a value —it can raise an exception.
  10. Use Executors over Threads
  11. Use fixed pool executors over creating a new thread per request/operation.

Key java.util.concurrent classes include

  • ExecutorService
  • Callable<V>
  • Queues, including the BlockingQueue.
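
As an illustrative sketch (the task body is a placeholder), this is the preferred Executors-plus-Callable pattern: a fixed-size pool, with work submitted as a Callable so that both the result and any exception surface through the Future:

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChecksumDemo {
  public static void main(String[] args) throws Exception {
    // Fixed-size pool rather than a thread per request.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      // A Callable can return a value and throw a checked exception;
      // both surface through Future.get().
      Callable<Long> task = () -> {
        long checksum = 42L;                 // placeholder for real work
        if (checksum < 0) {
          throw new IOException("checksum failure");
        }
        return checksum;
      };
      Future<Long> result = pool.submit(task);
      System.out.println("checksum = " + result.get());
    } finally {
      pool.shutdown();                       // always release the threads
    }
  }
}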

There is a reasonable amount of code that can be considered dated in Hadoop, using threads and runnables. These should be cleaned up at some point —rather than mimicked.

Warning signs of dated code

  1. Subclassing Thread.
  2. Spawning off threads, rather than using thread pools via executors.
  3. Using volatile fields rather than Atomic* classes.
  4. Using Object.notify() and Object.wait() to signal a thread. If the notification signal is lost, there is a risk that the waiting thread may miss things. Much better: have a concurrent queue with the sending thread posting something to the queue; the receiver fetching it blocking with take() or via a time limited poll().

Scale problems of threaded code.

One danger in Hadoop code is to write anything which can spawn unlimited threads. That is, for every incoming network call, IO operation or other action, a new thread is created. Under heavy workloads, the number of threads created will exhaust the amount of memory available.

This is not just server side; it has surfaced in pure client-side code, such as the S3A filesystem client (HADOOP-11446). A good design for a small application, one that works in the test cases, can fail dramatically in the field.

Any asynchronous code which is not directly part of a system singleton class must be launched in a thread pool. The simple "one thread per JVM" strategy is acceptable for singleton items such as the UserGroupInformation keytab renewer, the Namenode token renewer and similar classes, for which only a single instance is expected in production use. What is unacceptable is anything spawning an unbounded number of threads proportional to the number of callers or recipients of a remote API call, to the number of users, blocks, hosts or any other variable which scales with cluster size, or anything which can scale with large local-JVM workflows.

Asynchronous code spawned off in this way must

  1. Be started in a thread pool of limited size. (Choosing that size is a hard problem; they are usually configurable).
  2. Have a policy to handle exhausted thread pools: blocking callers, for example, or failing with explicit retry-later messages. Expect the thread pool to be exhausted in day-to-day operation, so design the code to handle it.
  3. Have some means of being stopped.
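
A sketch of what that can look like, assuming the pool and queue sizes would really come from configuration: a bounded ThreadPoolExecutor whose rejection policy throttles callers rather than dropping work, plus an explicit stop method:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedWorkerPool {
  private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
      4, 4,                            // fixed, limited number of worker threads
      60L, TimeUnit.SECONDS,
      new ArrayBlockingQueue<>(64),    // bounded queue of pending work
      // When the queue is full, run the task in the caller's thread:
      // this throttles callers instead of spawning unbounded threads.
      new ThreadPoolExecutor.CallerRunsPolicy());

  public void submit(Runnable work) {
    pool.execute(work);
  }

  /** Some means of being stopped. */
  public void stop() throws InterruptedException {
    pool.shutdown();
    pool.awaitTermination(30, TimeUnit.SECONDS);
  }
}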

volatile vs. non-volatile vs. java.util.concurrent.atomic

Java's volatile keyword indicates that a field may change in different threads. It has the following semantics:

  1. read/write operations are not re-ordered above or below volatile access. That also applies to access of non-volatile data.
  2. values are not cached
  3. volatile accesses to int, byte, char, short, boolean, float and all object references are guaranteed to be atomic, and the JLS also requires volatile long and double accesses to be atomic. What is not guaranteed to be atomic is access to non-volatile long and double fields: their atomicity may depend upon the architecture of the CPU running the Java code, and possibly the byte alignment of the data.
  4. operations such as ++, --, +=, -=, &=, |=, ^= are not atomic.

This mix of atomic and non-atomic behaviour is dangerous: even if the original code was written by people who knew which accesses were atomic, future maintainers may widen a field or add a read-modify-write sequence to handle scale, at which point a race condition which can lead to invalid data is entirely possible.

Because reads are not cached, and because volatile accesses are "barrier" operations, accesses to volatile fields are still less efficient than non-volatile accesses. The latter can be rearranged by the compiler, pulled out of loops or cached for re-use, and accessed out of order on any CPU capable of out-of-order execution (that is, all conventional modern non-embedded CPUs).

Use Atomic classes instead

Dangerous unless used with caution

volatile int counter;
volatile boolean finished;
volatile float average;
volatile Object handle;

Forbidden operations on volatile values

counter++;
counter += 1;
counter = counter + 1;
finished &= updated;
average /= 4;

// has a race condition
if (!finished) { finished = true; doSomething(); }

All those operations are non-atomic. Essentially, volatile fields may only be used for simple get/set operations, not for any operation which uses the current value of the field as part of the operation.

Good

final AtomicInteger counter = new AtomicInteger(0);
final AtomicBoolean finished = new AtomicBoolean(false);
final AtomicLong counter2 = new AtomicLong(0);
final AtomicDouble average = new AtomicDouble(0);
final AtomicReference<MyType> handle = new AtomicReference<>(null);

Accessing the atomic types is more expensive than accessing volatile fields; there's a lot more code underneath, and potentially mutual-exclusion operations. If you want to have a simple near-static value that is shared across threads and guaranteed not to cause deadlock, use a volatile.

Logging

There are a number of audiences for Hadoop logging:

  • People who are new to Hadoop and trying to get a single node cluster to work.
  • Hadoop sysadmins who don't want to have to become experts in reading Java stack traces to diagnose local cluster problems.
  • People who help other people's Hadoop clusters to work (that includes those companies that provide some form of Hadoop support).
  • Hadoop JIRA bug reports.
  • Hadoop developers.
  • Programs that analyze the logs to extract information.

Hadoop's logging could be improved: there's a bias towards logging for Hadoop developers rather than for other people, because it's the developers who add the logs they need to get things to work.

Areas for improvement include: including some links to diagnostics pages on the wiki, including more URLs for the hadoop services just brought up, and printing out some basic statistics.

Hadoop is also (slowly) migrating from the apache commons-logging API to SLF4J. This style guide covers SLF4J only:

  1. SLF4J is the logging API to use for all new classes. It SHOULD be used.
  2. The main body of the code uses Commons Logging APIs. These can be migrated —though check for tests which access the logs, and grab the matching log4j appender by way of class casts. These uses cannot be migrated easily.
  3. With commons logging, all log calls must be guarded with a check for log level.
  4. Production code MUST NOT assume that log4J is the logger behind the log APIs.
  5. Test code MAY assume that Log4J is in use and can tune back log levels in certain circumstances. It MUST use the methods in GenericTestUtils to do this, for ease of future maintenance.

SLF4J specific issues:

  1. ERROR, WARN and INFO level events SHOULD be logged without guarding their invocation.

  2. DEBUG-level log calls MAY be guarded. This eliminates the need to construct string instances.

  3. Logging which creates complex string output (e.g. iterates through a list to build the message) MUST be guarded.

  4. Unguarded statements MUST NOT use string concatenation operations to build the log string, as these are called even if the logging does not take place. Use {} clauses in the log string. Bad:

     LOG.info("Operation "+ action + " outcome: " +outcome)
    

Good:

    LOG.info("Operation {} outcome: {}", action, outcome)
  5. Classes SHOULD have lightweight toString() operators to aid logging. These MUST be robust against failures if some of the inner fields are null.

  6. Exceptions SHOULD be logged with the exception included as the final argument, so that the stack trace is preserved.

  7. Do not uprate debug-level messages to LOG.info() for your own debugging. Edit the log4j configuration files instead; that is what they are for.

  8. Exception.toString() MUST be used instead of Exception.getMessage(), as some exception classes have a null message.

      LOG.error("Failed to start: {}", ex, ex)
    

There is also the opportunity to add logs for machines to parse, not people. The HDFS Audit Log is an example of this: it enables off-line analysis of what a filesystem has been used for, including programs to detect which files are popular, and which files never get used after a few hours and so can be deleted. Any contributions here are welcome.
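
Pulling the logging rules above together, here is a small illustrative sketch (the class and its methods are hypothetical): parameterized INFO logging, a guard only around expensive DEBUG output, and the exception passed as the final argument:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class UploadService {
  private static final Logger LOG =
      LoggerFactory.getLogger(UploadService.class);

  void logOutcome(String action, String outcome, Exception ex) {
    // Unguarded INFO with {} placeholders: no concatenation cost if not logged.
    LOG.info("Operation {} outcome: {}", action, outcome);

    // Guard only the expensive-to-build DEBUG output.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Full state: {}", buildExpensiveDiagnostics());
    }

    if (ex != null) {
      // toString() rather than getMessage() (which may be null);
      // the trailing exception argument gets the stack trace logged.
      LOG.error("Operation {} failed: {}", action, ex.toString(), ex);
    }
  }

  private String buildExpensiveDiagnostics() {
    return "...";
  }
}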

Metrics and Monitoring

Hadoop's Metrics 2 is the framework for collecting and publishing statistics on running applications.

All pieces of code collecting information which may need monitoring should consider adding metrics 2 support. Note that as these can be used in tests and debugging, adding them early aids development.

There is ongoing work in adding htrace tracing across the Hadoop stack, so that it will become easy to correlate load issues in one layer (such as HDFS) with work going on at a higher layer (e.g. specific Apache Hive queries). Contributions in this area are welcome. Any component which performs work for callers, especially one which then invokes remote work on those callers' behalf, should be htrace-enabled.

Cross-Platform Support

Hadoop is used in production server-side on Linux and Windows, with some Solaris and AIX deployments. Code needs to work on all of these.

Java 8 is the minimum version targeted by branch-2 and branch-3. Hadoop is used client-side on Linux, Windows, OS/X and other systems.

CPUs may be 32-bit or 64-bit, x86, PPC, ARM or other parts.

JVMs may be: the classic "Sun" JVM, OpenJDK, the IBM JDK, or other JVMs based off the Sun source tree. These tend to differ in

  • Heap management.
  • Non-standard libraries (com.sun, com.ibm, ...). Some parts of the code —in particular the Kerberos support— have to use reflection to make use of these JVM-specific libraries.
  • Garbage collection implementation, pauses and suchlike.
  • Error messages.

Operating Systems vary more, with key areas being:

  • Case sensitivity and semantics of underlying native filesystem. Example, Windows NTFS does not support rename operations while a file in a path is open.
  • Native paths. Again, Windows with its drive-letter c:\path structure stands out. Hadoop's Path class contains logic to handle these... sometimes tests construct paths by going

    File file = something();
    Path p = new Path("file://" + file.toString());

    Instead, use:

    Path p = new Path(file.toURI());
  • Process execution. Example: as OS/X does not support process groups, YARN containers do not automatically destroy all children when the container's parent (launcher) process terminates.
  • Process signalling
  • Environment variables. Names and case may be different.
  • Arguments to the standard Unix commands such as ls, ps, kill, top and suchlike.

Performance

Hadoop prioritizes correctness over performance. It absolutely prioritizes data preservation over performance. Data must not get lost or corrupted.

That said, Hadoop is designed to scale to tens of thousands of machines. Algorithms should be designed to scale: ideally O(1), failing that something O(lg(n)).

Protobuf and RPC

  • New fields on existing messages must be optional, otherwise all existing clients will be unable to talk to the far end.
  • Accordingly, services must be designed to have a valid default for a new field, handling the absence of a field value gracefully.
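
A self-contained sketch of the second point, using a stand-in interface in place of a real proto2-generated message (generated code exposes hasXxx() presence checks for optional fields; the names here are made up):

/** Stand-in for a proto2-generated message with a newly added optional field. */
interface TaskReport {
  boolean hasPriority();   // presence check generated for the optional field
  int getPriority();
}

class PriorityReader {
  static final int DEFAULT_PRIORITY = 0;   // valid default when the field is absent

  static int priorityOf(TaskReport report) {
    // Older clients never set the new field; handle its absence gracefully.
    return report.hasPriority() ? report.getPriority() : DEFAULT_PRIORITY;
  }
}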

Security

Hadoop supports insecure clusters and secure "Kerberized" clusters. The latter uses Kerberos to authenticate services as well as users. This means it is critical that Hadoop code works in secure environments.

As a developer, that means you need to understand (a) how to set up a secure Hadoop cluster and (b) how to write code that works in a secure Hadoop cluster.

Set up a machine/VM as a Kerberos Domain Controller (KDC) and use this to create the keytabs needed to run Hadoop in secure mode. This can take a couple of hours, hours in which you will learn the basics of Kerberos.

Insecure clusters run in-cluster code in different accounts from the users submitting work. Access to HDFS propagates by passing the HADOOP_USER_NAME environment variable around. This variable is picked up by programs which use the Hadoop HDFS client libraries and used to impersonate that user (in an unsecured cluster, obviously).

YARN applications MUST set this environment variable when launching an application in an insecure cluster.

Secure clusters use Kerberos and require each user submitting work to have an account of the same name in the cluster.

Prerequisite Knowledge

  • Basic Kerberos concepts and architecture.
  • How to set up a secure Hadoop cluster (at least a VM) —including SPNEGO-authenticated HDFS and RM web UIs.
  • How to read and edit a krb5.conf file.
  • What SPNEGO is; how to set your web browser up to use it.
  • What Hadoop Delegation Tokens are for; how they differ from Authentication tokens, and when to use them.
  • Web services: how to use the Authentication Filter and how to offer a token renewal service.
  • REST clients: how to set up Jersey for SPNEGO & how to react to authentication failures.
  • RPC services: how to declare the principal for communications in META-INF/services/org.apache.hadoop.security.SecurityInfo and the matching SecurityInfo subclass.
  • YARN applications: how to get delegation tokens from clients to your application; which ones you will need.
  • Long-lived YARN services: how to work with keytabs
  • The meaning of obscure GSS API error messages (you can always learn these as you go along).

Is this a lot to know? Yes. Is it intimidating? Don't worry: we all find Kerberos hard. But do not think you can get away without this knowledge -all you are doing is putting your learning off.

References

Read these. You do need to know the details.

  1. Hadoop and Kerberos: The Madness beyond the Gate
  2. Adding Security to Apache Hadoop
  3. The Role of Delegation Tokens in Apache Hadoop Security
  4. Chapter 8. Secure Apache HBase
  5. Hadoop Operations 1st edition, p135+
  6. Java 8 Kerberos Requirements
  7. Troubleshooting Kerberos on Java 8
  8. JAAS Configuration (Java 8)

Coding requirements

  1. DO NOT PUT SECURITY AND KERBEROS OFF UNTIL THE END OF YOUR WORK
  2. Do not assume that user names are simple "unix" names; they may have spaces and kerberos realms in them.
  3. Use the UserGroupInformation class to manage user identities; its doAs() operation performs actions as a specific user (a sketch follows at the end of this section).
  4. Test against both secure and insecure clusters. The MiniKDC server provides a basic in-JVM Kerberos controller for tests.
  5. Some parts of the Hadoop stack (e.g. ZooKeeper) are also controlled by JVM system properties. Be careful when setting up such applications to set the properties before starting them.

The key item is the first point: security cannot be an afterthought. If you put it off, you will find your near-deadline time spent trying to debug security issues in code that wasn't designed to be secure, while struggling to read the documents referenced above and learn the associated concepts in a hurry. Avoid this by putting in the effort early.
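
A minimal sketch of the doAs() pattern referenced above (the helper class is hypothetical; a secure cluster would normally use a Kerberos-authenticated or proxy user rather than createRemoteUser()):

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ListingAsUser {
  public static boolean pathExists(Configuration conf, String user, Path path)
      throws Exception {
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);
    // All filesystem access inside the closure runs as "user".
    return ugi.doAs((PrivilegedExceptionAction<Boolean>) () -> {
      FileSystem fs = FileSystem.get(conf);
      return fs.exists(path);
    });
  }
}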

Imports

The ordering of imports in Hadoop was somehow agreed on a long time ago and never formally documented.

import java.*

import javax.*

import non.org.apache.*

import org.apache.*

import static *
  • Non-static imports MUST NOT use wildcards
  • Wildcards MAY be used in static imports.

Imports are where false-positive merge conflicts arise: patch A adds one import; patch B adds another nearby; the patches refuse to coexist. Trying to cherry-pick patches across branches is equally painful.

  1. Please do not reorder imports, especially with the IDE "clean up imports" feature. It only complicates all of this.
  2. Similarly: if your IDE automatically creates .* package imports at a certain threshold, set that threshold to be "999".
  3. If you are patching some code and the imports are a complete mess of randomness, do at least try to add new imports adjacent to their existing packages.
  4. And always add new static imports at the bottom.

It's interesting to see how Spark's commit checker reviews all its patches and enforces the ordering. It'd be good to know if that reduces merge and backport conflict.

Exceptions

Exceptions are a critical form of diagnostics on system failures.

  • They should be designed to provide enough information to enable experienced Hadoop operators to identify the problem.
  • They should provide enough information to enable new Hadoop users to identify problems starting or connecting to their cluster.
  • They need to provide information for the Hadoop developers too.
  • Information MUST NOT be lost as the exception is passed up the stack.

Exceptions written purely for the benefit of developers are not what end users or operations teams need —and in some cases can be misleading. As an example, the java network stack returns errors such as java.net.ConnectionRefusedException which returns none of the specifics about what connection was being refused, especially the destination host and port, and can be misinterpreted by people unfamiliar with Java exceptions or the sockets API as a Java-side problem.

This is why Hadoop wraps the standard socket exceptions in NetUtils.wrapException()

  1. These extend the normal error messages with host and port information for the experts,
  2. They add links to Hadoop wiki pages for the newbies who interpret "Connection Refused" as the namenode refusing connections, rather than them getting their destination port misconfigured.
  3. They retain the existing exception classes. The exceptions aren't just wrapped in a general IOException —they are wrapped in new instances of the same exception class. This ensures that catch() clauses can select on exception types.

In general:

  1. Try to use Hadoop's existing exception classes where possible.
  2. Except for some special cases, exceptions MUST be derived from IOException
  3. Use specific exceptions in preference to the generic IOException
  4. IllegalArgumentException SHOULD be used for some checking of input parameters, but only where consistent with other parts of the stack.
  5. The Guava Preconditions methods MAY be used for argument checking, but MUST have meaningful messages on failure, e.g. Preconditions.checkArgument(readahead >= 0, "Negative readahead value")
  6. If an exception is generated from another exception, the inner exception must be included, either via the constructor or through Exception.initCause().
  7. If an exception is wrapping another exception, the inner exception text MUST be included in the message of the outer exception.
  8. The text value of an exception MUST be extracted from Exception.toString().
  9. Exception.getMessage() MUST NOT be used. For some exceptions this returns null.
  10. Where Hadoop adds support for extra exception diagnostics (such as with NetUtils.wrapException()) —use it.
  11. Exceptions should provide any extra information to aid diagnostics, including —but not limited to— paths, remote hosts, and any details about the operation being attempted.
  12. Where Hadoop doesn't add support for extra diagnostics —try implementing it.

The requirement to derive exceptions from IOException means that developers SHOULD NOT use Guava's Preconditions.checkState() checks, as these throw IllegalStateException instances. That said: sometimes it is appropriate.
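
A small sketch of the wrapping rules above (readConfigFile() is a hypothetical helper): the outer exception includes the inner exception's text in its message and passes it on as the cause, so nothing is lost as it travels up the stack:

import java.io.IOException;

public final class ConfigLoader {
  public static void load(String path) throws IOException {
    try {
      readConfigFile(path);
    } catch (IOException e) {
      // Inner text in the message, inner exception as the cause.
      throw new IOException("Failed to read configuration " + path + ": " + e, e);
    }
  }

  private static void readConfigFile(String path) throws IOException {
    // placeholder for the real work
  }
}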

Internationalization

  1. Error messages use US English (EN_US) in their text.
  2. Log messages must use US English as their text.
  3. Classes, methods and variables must use US English in their names.
  4. Names that are misspelled can be near-impossible to remove: please check with a spell checker set to the US if you have any doubts about your spelling.
  5. Code must use String.toLowerCase(Locale.US).equals() rather than String.equalsIgnoreCase(). Otherwise the comparison will fail in some locales (example: Turkey).
  6. Similarly, use String.toLowerCase(Locale.US) and String.toUpperCase(Locale.US) to change the case of text.
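
For example (a hypothetical helper, using java.util.Locale.US):

import java.util.Locale;

public final class CaseUtils {
  /** Locale-independent case-insensitive check of a configuration value. */
  public static boolean isDisabled(String value) {
    // In a Turkish locale, "DISABLED".toLowerCase() yields a dotless i,
    // so a default-locale comparison would fail.
    return "disabled".equals(value.toLowerCase(Locale.US));
  }
}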

Main functions

A static main(String[] args) method routine can be invoked via the /bin/hadoop script, a script which will set up the classpath and other environment variables consistently.

Hadoop uses its ToolRunner class as the entry point to code —both client and server.

This provides a standard set of configuration parameters, including adding the ability to define and extend the Hadoop XML configuration to use.

Entry points SHOULD use the Tool interface and the ToolRunner class for their entry point logic.

This is not mandatory; there may be valid reasons to avoid doing this, a key one being that the application may be using an alternative CLI parser such as JCommander. If an alternative CLI parser library is used, the main() routine SHOULD support the standard command line options, especially the -D name=value syntax.
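
Here is a minimal sketch of the Tool/ToolRunner pattern (the tool class and the property it prints are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ShowDefaultFS extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already has any -D name=value and -conf file.xml options applied.
    System.out.println("fs.defaultFS = " + getConf().get("fs.defaultFS"));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options, then hands the rest to run().
    System.exit(ToolRunner.run(new Configuration(), new ShowDefaultFS(), args));
  }
}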

Tests

Tests are a critical part of the Hadoop codebase.

What are those tests for? More fundamentally: what is any test for?

Tests are there to show that a feature works, that it recognises and reacts to invalid/illegal conditions, and fails acceptably in the face of problems.

That is: some tests verify that a feature behaves as expected given valid inputs and starting state of the system. Other tests should be designed to break that behaviour, to create invalid states, generate failure conditions, pass in invalid values —and then verify that the component under test is robust to such states.

The quality standard for tests is as high as for the production code itself. They must be easy for successor developers to maintain, and good at finding bugs and regressions in the production software —while not being buggy themselves. A bad test (unreliable, brittle to changes in implementation details, or prone to race conditions) is troublesome, and may end up being cut, disabled or simply ignored. When a test fails, it must not be "crying wolf" —it needs to show there is a real problem, and provide as much information about the problem as it can.

Tests MUST be

  • Executable by anyone, on any of the supported development platforms (Linux, Windows and OS/X).
  • Automated. They MUST validate the correctness of the system through assertions and operations, not by relying on someone to read the logs.
  • Designed to show something works as intended even in the face of failure. That doesn't just mean "shows that given the right parameters, you get the right answers", it means "given the wrong args/state, it fails as expected". Good tests try to break the code under test.
  • Fast to run. Slow tests hamper the entire build and waste people's time.
  • Designed for failures to be diagnosed purely from the assertion failure text and generated logs. Everything needed to understand why a test failed should be determinable from the results of a remote Jenkins/CI tool-managed run, the generated test reports and any log data collected.
  • Deterministic. Everyone hates tests that fail intermittently.
  • Designed for maintenance. Comments, meaningful names, etc. Equally importantly: do not hard-code strings used in log and exception searches; instead use constants from the classes under test and refer to them.

Tests MUST

  • Have meaningful names, in both classname and test method.

  • Use the prefix test for test cases. This avoids confusion about what is an entry point vs helper method.

  • Use directories under the property test.dir for temporary data. The way to get this dir dynamically is:

      new File(System.getProperty("test.dir", "target"));
    
  • Shut down services after the test run.

  • Not leave threads running or services running irrespective of whether they fail or not. That is: always clean up afterwards, either in try {} finally {} clauses or @After and @AfterClass methods. This teardown code must also be robust against incomplete state, such as null object references.

  • Work on OS/X, non-Intel platforms and Windows. There's a field in org.apache.hadoop.util.Shell which can be used in an Assume.assumeTrue() clause to skip test cases which do not work there.

  • Use port '0' for registering TCP & UDP endpoints, or scan for a free port with ServerSocketUtil.

Tests MUST NOT

  • Use the prefix test for any method other than test cases. This avoids confusion about what is an entry point vs helper method.
  • Contain any assumptions about the ordering of previous tests —such as expecting a prior test to have set up the system. Tests may run in different orders, or purely standalone. (there is a special JUnit @FixMethodOrder attribute if ordering really is needed)
  • Rely on a specific log-level for generating output that is then analyzed. Some tests do this, and they are problematic. The preference is to move away from these and instrument the classes better.
  • Require specific timings of operations, including the execution performance or ordering of asynchronous operations.
  • Have hard-coded network ports. This causes problems in parallel runs, especially on the Apache Jenkins servers. Either use port 0, or scan for a free port. ServerSocketUtil has code to pick a free port: tests should use this.
  • Have hard-coded sleep times. They may work for you, but will fail intermittently elsewhere. They also encourage people to address those failures by extending the time, which makes for longer tests. Use GenericTestUtils.waitFor() to block for a condition being met (see the sketch after this list).
  • Take long times to complete. There are some in the codebase which are slow; these do not begin with the word Test to stop them being run except when explicitly requested to do so.
  • Assume they are running on a Unix system, with /unix/style/paths.
  • Require human intervention to validate the test outcome.
  • Store data in /tmp, or in the temp dir suggested by java.io.File.createTempFile(String, String). All temporary data must be created under the directory ./target. This will be cleaned up in test runs, and will not interfere with parallel test runs.
  • Use the user's home directory or assume write access to it.
  • Require internet access. That includes DNS lookup of remote sites. It also includes expecting lookups of non-resolvable hosts to fail —some ISPs return a search site in this situation, so an nslookup invalid.example.org does return an IP address.
  • Run up large bills against remote cloud storage infrastructures by default. The object store client test suites are automatically skipped for this reason.
  • Require cloud infrastructure keys to be added into SCM-managed files for test runs. This makes it all too easy to accidentally commit AWS login credentials to public repositories, which can be an expensive mistake.
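
As referenced above, here is a sketch of polling for a condition with GenericTestUtils.waitFor() rather than sleeping (the service under test is a stand-in; the Supplier argument type varies between Hadoop versions, but a lambda or method reference compiles against either):

import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hadoop.test.GenericTestUtils;
import org.junit.Test;

public class TestServiceStartup {

  /** Stand-in for a real service under test. */
  static class MyService {
    private final AtomicBoolean started = new AtomicBoolean(false);
    void start() { started.set(true); }
    boolean isStarted() { return started.get(); }
  }

  @Test
  public void testServiceStarts() throws Exception {
    MyService service = new MyService();
    service.start();
    // Poll every 100ms; fail with a timeout after 30s, instead of a fixed sleep.
    GenericTestUtils.waitFor(service::isStarted, 100, 30_000);
  }
}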

Tests MAY

  • Assume the test machine is well-configured. That is, the machine knows its own name, has adequate disk space.

Tests SHOULD

  • Provide extra logging information for debugging failures.
  • Use loopback addresses localhost rather than hostnames, because the hostname to IP mapping may loop through the external network, where a firewall may then block network access.
  • Clean up after themselves.

Assertions

Here are the best-practice requirements for JUnit assertions. We are now adopting AssertJ, but haven't used it enough to have any best practices there.

  1. Complex assertions should be broken into smaller ones, so that failure causes can be more easily determined.
  2. Assertions should use the more specific assertXXX checks over assertTrue() and assertFalse().
  3. Uses of assertTrue() and assertFalse() MUST include an error message which provides information on what has failed and why -a message which itself must be robust against null pointers.
  4. Other assertions SHOULD provide an error message for extra diagnostics.
  5. Checks for equality should use assertEquals(expected, actual) and assertNotEquals(expected, actual).
  6. Checks for equality of double and float MUST use assertEquals(expected, actual, delta) and assertNotEquals(expected, actual, delta).
  7. Array Equality checks should use assertArrayEquals(expected, actual).

assertTrue() and assertFalse()

These are the default assertions. While simple, they provide minimal diagnostics. The value or purpose of the test is unknown, with test reports only stating the line which failed. This is not enough.

A good tactic for coming up with some meaningful messages and assertions is to imagine that a test run has failed on an assertion, and all you have for diagnostics is the exception stack trace, not any logs. Ask yourself: "If I had to fix this, is there enough information for me to even start to guess what has failed?"

By default, assertTrue() utterly fails this test.

Bad

assertTrue(client != null && client.connected() && client.remoteHost().equals("localhost"))

This is bad because if there is a failure, there's no clue as to what it was. Furthermore, it contains the assumption that when client.connected() is true, client.remoteHost() is never null. That may be something to check for explicitly.

Good

assertNotNull("null client", client)
assertTrue("Not connected: " + client, client.connected())
assertEquals("Wrong remote host:" + client, "localhost", client.remoteHost())

This pulls out the three checks and orders them such that the successive assertions can rely on the predecessors being true. There is no need to add an assertNotNull("no remote host: "+ client, client.remoteHost()) check, as the assertEquals() assertion does that automatically.

With three separate assertions, we'll know immediately which one failed. With a stack trace, we can even get the specific line.

assertEquals() and assertNotEquals()

These assertions SHOULD be used in place of assertTrue(expected.equals(actual)) and assertFalse(expected.equals(actual))

  1. They handle null values in the equality check (though there, assertNull() and assertNotNull() should be used wherever the expected value is known to be null).
  2. They automatically generate error messages listing the expected and actual values, highlighting the difference.
  3. There's a special variant for asserting equality of floating point numbers, where a delta value is provided; two float or double arguments are considered equal if the difference between them is less than the delta

For assertEquals() and assertNotEquals(), the expected value MUST be the first element listed. This is because when the test fails, the assertion thrown states that the first value was the expected one. It gets confusing trying to diagnose the failure when the messages are misleading.

Bad

assertEquals(counter.get(), 7);

If the counter was only "3", the error text would be

expected:<3> but was:<7>

Good

assertEquals(7, counter.get());

For the counter==3 error condition, the text would now be

expected:<7> but was:<3>

Which correctly describes the problem.

Best

assertEquals("current counter", 7, counter.get());

This would give an error message

current counter expected:<7> but was:<3>

Because the assertion already has a useful error message, adding a string is only a SHOULD or a MAY, not a MUST. Its purpose is to provide some more information in the test report, rather than to show the values of the arguments.

Extra text is critical if the assertion is only on a fraction of the actual data structure being compared, and on a failure you will need the whole structure to debug things:

Assert.assertNotEquals("Modification time of raw matches that of guarded"
       +"\nraw=" + rawFileStatus
      + " guarded=" + guardedFileStatus,
        rawFileStatus.getModificationTime(),
        guardedFileStatus.getModificationTime());

fail()

Assert.fail() can be called to explicitly fail something. It should always be called with meaningful text

old example

try {
  String s = operationExpectedToFail(null);
  Assert.fail("expected a failure -but got " + s);
} catch (IOException expected) {
  // expected
}

That specific use has generally been superseded in new code by LambdaTestUtils.intercept, which handles the generation of the assertion and returns the caught exception for further analysis.

intercept(IOException.class, 
  () -> operationExpectedToFail());

Assertion Text

Tests SHOULD provide meaningful text on assertion failures. The best metric is "can someone looking at the test results get any idea of what failed without chasing up the stack trace in an IDE?"

That means adding meaningful text to assertions, along with any diagnostics information that can be picked up.

Bad

assertTrue(target.exec());

This is utterly meaningless on a failure

Good

assertTrue("exec() failed on " + target, target.exec());

Bad

assertTrue(target.update() == 0);

Good

assertEquals("update() failed on " + target, 0, target.update());

Such assertions are aided by the classes under test having meaningful toString() operators —which is why these should be implemented.

Sidenote: this is also why toString() values must not throw exceptions of their own.

Timeouts

Tests MUST have timeouts. The test runner will kill tests that take too long, but this loses information about which test failed. By giving the individual tests a timeout, diagnostics are improved.

Add a timeout to the test method

@Test(timeout = 90000)

Better: declare a test rule which is picked up by all test methods

@Rule
public final Timeout testTimeout = new Timeout(90000);

Important: make it a long enough timeout that only failing tests fail. The target machine here is not your own, it is an underpowered Linux VM spun up by Jenkins somewhere.

Keeping tests fast

Slow tests don't get run. A big cause of delay in Hadoop's current test suite is the time to start up new Miniclusters ... the time to wait for an instance to start up can often take longer than the rest of the test.

Exposing class methods for testing

Sometimes classes in Hadoop expose internal methods for use in testing.

There are three ways of exposing private methods in a production class for this:

  1. Make public and mark @VisibleForTesting. Easiest, but risks implicitly becoming part of the API.
  2. Mark @VisibleForTesting, but make package scoped. This allows tests in the same package to use the method, so is unlikely to become part of the API. It does require all tests to be in the same package, which can be a constraint.
  3. Mark @VisibleForTesting, but make protected, then subclass in the test source tree with a class that exposes the methods. This adds a new risk: that it is subclassed in production. It may also add more maintenance costs.
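
A sketch of the second option, using Guava's @VisibleForTesting annotation (newer Hadoop versions ship their own copy of this annotation; the class here is hypothetical):

import com.google.common.annotations.VisibleForTesting;

public class PlacementTracker {

  public void place(String block) {
    // public API ...
  }

  /**
   * Package-scoped so only tests in the same package can call it; the
   * annotation documents that this is not part of the public API.
   */
  @VisibleForTesting
  int pendingPlacements() {
    return 0;   // placeholder
  }
}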

There is no formal Hadoop policy here.

There is, however, another strategy: don't expose your internal methods for testing. The test cases around a class are the test of that class's APIs —and often the code people start with to use the code.

If the functionality of a class cannot be used without diving below that API, it's a sign that the API is limited. People will end up using your test methods, either because they needed to, or just because they copied your code as a starting point —and it was using those methods.

For any class which is designed for external invocation, this implies you should think about improving that public API for testability, rather than sneaking into the internals.

If your class isn't designed for direct public consumption, but is instead a small part of a service remotely accessible over the network —then yes, exposing the internals may be justifiable. Just bear in mind that you are increasing the costs of maintaining the code: someone will need to update all the tests as changes are made to the internals of the class.

Further Reading

Testing with Exceptions

Catching and validating exceptions are an essential aspect of testing failure modes. However, it is dangerously easy to write brittle tests which fail the moment anyone changes the exception text.

To avoid this:

  1. Use specific exception classes, then catch purely those exceptions.
  2. Use constants when defining error strings, so that the test can look for the same text
  3. When looking for the text, use String.contains() over String.equals() or String.startsWith().

Bad

try {
  doSomething("arg");
  Assert.fail("should not have got here");
} catch (Exception e) {
  Assert.assertEquals("failure on path /tmp", e.getMessage());
}

This is far too brittle, and doesn't help you find out what is going on when there is a failure.

Good

@Test
public void testSomething() throws Exception {
  try {
    doSomething("arg");
    Assert.fail("should not have got here");
  } catch (PathNotFoundException e) {
    Assert.assertTrue(
      "did not find " + Errors.FAILURE_ON_PATH + " in " + e,
      e.toString().contains(Errors.FAILURE_ON_PATH));
  }
}

Here a constant is used to define what is looked for (obviously, one used in the exception's constructor). It also uses the String.contains() operator —so if extra details are added to the exception, the assertion still holds.

Good

@Test(expected = PathNotFoundException.class)
public void testSomething() throws Exception {
  doSomething("arg");
}

This takes advantage of JUnit's ability to expect a specific exception class, and looks for it. This is nice and short. Where it is weak is that it doesn't let you check the contents of the exception. If the exception is sufficiently unique within the actions, that may be enough.

Good: examine the contents of the exception as well as the type. Rethrow the exception if it doesn't match, after adding a log message explaining why it was rethrown:

@Test
public void testSomething() throws Throwable {
  try {
    Object result = doSomething("arg");
    Assert.fail("expected a failure, got: " + result);
  } catch(PathNotFoundException e) {
    if (!e.toString().contains(Errors.FAILURE_ON_PATH)) {
      LOG.error("No " + Errors.FAILURE_ON_PATH + " in {}", e, e);
      throw e;
    }
  }
}

This implementation ensures all exception information is propagated. If the operation does not fail, its return value is included in the assertion raised, to aid debugging. As the test is failing because the code in question is not behaving as expected, having a stack trace in the test results can be invaluable.

Even better: rather than write your own handler (repeatedly), use the one in org.apache.hadoop.test.GenericTestUtils, which is bundled in the hadoop-common test JAR:

@Test
public void testSomething() throws Throwable {
  try {
    Object result = doSomething("arg");
    Assert.fail("expected a failure, got: " + result);
  } catch(PathNotFoundException e) {
    GenericTestUtils.assertExceptionContains(Errors.FAILURE_ON_PATH, e);
  }
}

Comparing the various options, the JUnit 4 expected attribute is the least informative, but it makes for a much easier to understand test. For it to be a good test, some conditions must be met:

  1. There's only one place in the test case where raising the expected exception can happen. If the exception could get raised before or after the core operation being tested, then the test could be failing in the wrong place —with the test runner not picking it up.
  2. The type of the exception is sufficient to verify that the failure was as expected; a high-level Exception or IOException is unlikely to be adequate. If these conditions don't hold, go for the GenericTestUtils check instead.

Testing with Java 8 closures through LambdaTestUtils

An evolving class in the hadoop common test JAR is org.apache.hadoop.test.LambdaTestUtils. This contains a set of static methods intended to make it easy to test with Java 8 closures. It is based on concepts in Scalatest, to the extent of using the same method names.

A key method is intercept(), which takes the class of an exception, an optional piece of text required to be in the exception, and, finally, a closure:

intercept(IllegalArgumentException.class, 
  Errors.FAILURE_ON_PATH,
  () -> Paths.getLocalTaskAttemptTempDir(conf, jobUUID, tac.getTaskAttemptID()));

The closure MUST throw an exception of the specified type and, if specified, containing the given text. Any other exception is rethrown; if the text is missing, an assertion is raised. And if the closure does not fail at all, the assertion raised includes the output of the operation in its error text. The overall goal is to make it easy to make assertions without try/catch clauses, while including as much diagnostics as possible, including the stack traces.

As with the Scalatest intercept() method, if the desired exception is raised, then Hadoop's intercept() returns it for further checks. So far our tests haven't made use of that feature.
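For illustration, here is a sketch of how that returned exception could be examined further; doSomething() and Errors.FAILURE_ON_PATH are the hypothetical operation and constant from the earlier examples:

PathNotFoundException ex = intercept(PathNotFoundException.class,
    Errors.FAILURE_ON_PATH,
    () -> doSomething("arg"));
// extra assertion: the failing argument should appear in the exception text
Assert.assertTrue("no mention of the path in " + ex,
    ex.toString().contains("arg"));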

A more complex call is eventually(); again based on its Scalatest namesake, and mixing in experience with a Groovy test mechanism in Apache Slider (incubating). Its aim is to support retrying until a condition is met, in a more structured way than the ad hoc sleep-and-retry loops used today.

public static <T> T eventually(int timeoutMillis,
    Callable<T> eval,
    Callable<Integer> retry) throws Exception;

This will repeatedly invoke the eval closure until it stops throwing exceptions, at which point its result is returned. The retry closure defines how to react to a failure: simple sleep, exponential backoff, etc.

If the timeout is reached and the closure has not yet succeeded, the last exception the closure threw is rethrown. There are also two ways to exit the closure with a failure ahead of the timeout: by throwing InterruptedException or an org.apache.hadoop.test.LambdaTestUtils.FailFastException instance. The latter is useful to raise if the closure concludes that the condition it is waiting for is never going to be reached.
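Here is a sketch of the three-argument form shown above, assuming the integer returned by the retry closure is interpreted as the sleep, in milliseconds, before the next attempt (fs and testpath are the same hypothetical fixtures as in the example below):

// probe until getFileStatus() stops failing, sleeping 500ms between attempts
FileStatus status = eventually(30 * 1000,
    () -> fs.getFileStatus(testpath),
    () -> 500);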


Here is an example in which an eventually consistent object store is polled for thirty seconds, awaiting the length property of a path to catch up with the state of the last write.

eventually(30 * 1000, 1000,
  () -> assertEquals(shortLen, fs.getFileStatus(testpath).getLen()));

There are more operations in the LambdaTestUtils module, with it evolving as more Java 8 code is adopted across Hadoop. In particular, the await() method extends eventually() with the ability to control what is thrown on a timeout. This can be used to generate complex diagnostics for logging and/or inclusion in the exception text, helping to understand why a closure never succeeded. Its equivalent proved invaluable in Slider's groovy-lang integration tests; it can now be used for similar benefit in Hadoop.

The test code has been a key place for us to learn this style and, because intercept and eventually closures can be replaced with anonymous classes implementing Callable and VoidCallable (an easy way to return void), it has been backported to Java 7 versions of Hadoop with minor effort.
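For example, here is a sketch of one of the intercept() calls above rewritten in the Java 7 style, with the lambda replaced by an anonymous Callable:

intercept(PathNotFoundException.class, Errors.FAILURE_ON_PATH,
    new Callable<Object>() {
      @Override
      public Object call() throws Exception {
        // the operation expected to fail
        return doSomething("arg");
      }
    });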

Using AssertionError to include an inner exception

Take an operation which returns a caught exception:

PathIOException pathIOE = intercept(PathIOException.class, () -> testfs.openFile("").build());

Now imagine you want to examine it some more, and, if it doesn't match your expectations, fail with the exception included.

Bad

assertNotNull(pathIOE.getPath());

This is bad: if the exception isn't what was expected, the JUnit assertion will fail, but the stack trace of the underlying exception is lost.

Better to explicitly throw an AssertionError, using the caught exception as the cause:

Good

if (pathIOE.getPath() == null) {
  throw new AssertionError("no path in " + pathIOE, pathIOE);
}

This will fail the test, and the test report will contain the underlying exception.

Skipping tests that aren't available on the test system

Not all tests work everywhere; use JUnit's Assume mechanism to skip those whose prerequisites are not met. Here are two helper methods which log before skipping:

public static void skip(String message) {
  log.warn("Skipping test: {}", message);
  Assume.assumeTrue(message, false);
}

public static void assume(boolean condition, String message) {
  if (!condition) {
    log.warn("Skipping test: {}", message);
    Assume.assumeTrue(message, false);
  }
}
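For example, using the assume() helper above (Shell.WINDOWS is an existing constant in org.apache.hadoop.util.Shell; the test itself is illustrative):

@Test
public void testPosixPermissions() throws Throwable {
  // skip on Windows, where POSIX permissions aren't available
  assume(!Shell.WINDOWS, "POSIX permissions are not supported on Windows");
  // ... the rest of the test runs only on Unix-like systems
}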

Testing Tips

You can automatically pick the name to use for instances of mini clusters and the like by extracting the method name from JUnit:

@Rule
public TestName methodName = new TestName();
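Here is a sketch of how the rule might then be used; the naming convention and what the name is used for are illustrative, not a specific Hadoop API:

@Test
public void testClusterRestart() throws Throwable {
  // derive a unique, human-readable name for this test's cluster and directories
  String name = getClass().getSimpleName() + "-" + methodName.getMethodName();
  // ... pass "name" to the minicluster or test directory being created
}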

Hadoop Coding Standards: Other Languages

Code Style: Python

  1. The indentation should be two spaces.
  2. Code for Windows as well as Unix.

Code Style: Bash

  1. Bash lines MAY exceed the 80 character limit where necessary.
  2. Try not to be too clever in use of the more obscure bash features —most Hadoop developers don't know them.
  3. Scripts MUST recognise problems and fail with exit codes. That is, a script MUST check for non-zero return codes from its operations and SHOULD then exit with an error.
  4. You MUST use bats for your bash unit tests.

The key thing to assume when writing Bash scripts is that the majority of the Hadoop project developers are not bash experts who know all the subtleties. If you are doing something reasonably complex, do add some comments explaining what you are doing.

Code Style: Native

MUST

  1. C code MUST follow Linux kernel style, with one exception: indent by two spaces.
  2. Make no assumptions about ASCII/Unicode, 16 vs 32 bits: use typedef and TSTR definitions; int32 and int64 for explicit integer sizes.
  3. Use CMake for building.
  4. Assembly code MUST be optional; the code and algorithms around it MUST NOT be optimized for one specific CPU family.
  5. While you can try optimising for memory models of modern systems, with NUMA storage, three levels of cache and the like, it does produce code that is brittle against CPU part evolution. Don't optimize prematurely here.
  6. MUST compile on Linux, OSX and Windows platforms. You SHOULD test this as well as you can. If others are expected to do the work, it is likely to languish in the patch-pending state.
  7. MUST be accompanied by tests to verify the functionality.

MUST NOT

  1. MUST NOT impact the performance of the x86 code. This is the primary CPU family used in production Hadoop. While the project is happy to accept patches for other CPUs (e.g. ARM, PPC, ...), it must not be at the expense of x86.
  2. MUST NOT remove or rename existing methods. This has created hard-to-debug version compatibility problems in the past.

Maven POM files

  • Patches containing only changes to Maven files MUST be marked in JIRA as against the build component.
  • All declarations of dependencies with their versions must be in the file hadoop-project/pom.xml.
  • Version information must be included as a property, set in the <properties> section. This is to make it easy for people to switch versions of dependencies —be it experimentally or when making a unified release of a set of Hadoop-stack artifacts.
  • Be as restrictive as possible in your dependency scoping: in particular, use the test scope (<scope>test</scope>) for anything only needed in tests.
  • If a dependency is for an optional feature, set its scope to provided (<scope>provided</scope>). That means it will be used at build time, but not forced onto downstream dependents.
  • Avoid adding anything else to the main projects. If it adds something to the hadoop-client transitive dependency set, there's a risk of causing version problems with downstream users.
  • Be very cautious when updating dependencies.

Dependency updates

https://issues.apache.org/jira/browse/HADOOP-9991 covers dependency updates.

Dependency updates are an eternal problem in Hadoop. While it seems simple and trivial to move up to later versions, it often turns out that something downstream breaks. Nobody likes this.

The most troublesome have proven to be Guava, Jackson, and Protobuf; for a more detailed list, see Fear of Dependencies.

  1. Updates of versions of existing dependencies MUST be submitted as isolated patches, independent of other changes.
  2. All proposed dependency issues MUST declare themselves a dependency of HADOOP-9991. This makes tracking dependency update JIRAs easier.
  3. Patches SHOULD come with justifications, rather than just "a bit old".
  4. Patches against the high-risk dependencies SHOULD be accompanied by details of builds and tests of downstream applications, including Apache HBase, Apache Hive, and other major projects.
  5. Proposed patches which move to a pre-release version of any artifact SHALL NOT be accepted unless it's a critical CVE fix. Sorry.