commit c17cf2e (parent d33fa8c)

    HADOOP-13327 output stream spec.

    Review with more on 404 caching.

    Change-Id: Ib474a84e48556c6b76121427a026fa854b5bd9e0

3 files changed, 70 insertions(+), 34 deletions(-)

hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StoreImplementationUtils.java
Lines changed: 7 additions & 7 deletions

````diff
@@ -13,7 +13,7 @@
  * distributed under the License is distributed on an "AS IS" BASIS,
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
- * limitations under the License.
+ * limitations under the License
  */
 
 package org.apache.hadoop.fs.impl;
@@ -51,14 +51,14 @@ public static boolean supportsSyncable(String capability) {
 
   /**
    * Probe for an object having a capability; returns true
-   * iff the stream implements {@link StreamCapabilities} and its
+   * if the stream implements {@link StreamCapabilities} and its
    * {@code hasCapabilities()} method returns true for the capability.
    * This is a package private method intended to provided a common
    * implementation for input and output streams.
    * {@link StreamCapabilities#hasCapability(String)} call is for public use.
    * @param object object to probe.
    * @param capability capability to probe for
-   * @return true iff the object implements stream capabilities and
+   * @return true if the object implements stream capabilities and
    * declares that it supports the capability.
    */
   static boolean objectHasCapability(Object object, String capability) {
@@ -70,23 +70,23 @@ static boolean objectHasCapability(Object object, String capability) {
 
   /**
    * Probe for an output stream having a capability; returns true
-   * iff the stream implements {@link StreamCapabilities} and its
+   * if the stream implements {@link StreamCapabilities} and its
    * {@code hasCapabilities()} method returns true for the capability.
    * @param out output stream
    * @param capability capability to probe for
-   * @return true iff the stream declares that it supports the capability.
+   * @return true if the stream declares that it supports the capability.
    */
   public static boolean hasCapability(OutputStream out, String capability) {
     return objectHasCapability(out, capability);
   }
 
   /**
    * Probe for an input stream having a capability; returns true
-   * iff the stream implements {@link StreamCapabilities} and its
+   * if the stream implements {@link StreamCapabilities} and its
    * {@code hasCapabilities()} method returns true for the capability.
    * @param out output stream
    * @param capability capability to probe for
-   * @return true iff the stream declares that it supports the capability.
+   * @return true if the stream declares that it supports the capability.
    */
   public static boolean hasCapability(InputStream out, String capability) {
     return objectHasCapability(out, capability);
````
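The probe pattern in this utility class can be sketched outside Hadoop with a minimal sketch: the `StreamCapabilities` interface below is a reduced stand-in for `org.apache.hadoop.fs.StreamCapabilities`, and `CapabilityProbe` and `SyncableStream` are hypothetical names used for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

// Reduced stand-in for org.apache.hadoop.fs.StreamCapabilities.
interface StreamCapabilities {
  boolean hasCapability(String capability);
}

public class CapabilityProbe {

  // Mirrors the objectHasCapability() pattern above: true only when the
  // object implements StreamCapabilities AND declares the named capability.
  static boolean objectHasCapability(Object object, String capability) {
    if (object instanceof StreamCapabilities) {
      return ((StreamCapabilities) object).hasCapability(capability);
    }
    return false; // plain streams never declare capabilities
  }

  // Toy stream which declares support for "hsync" only.
  static class SyncableStream extends ByteArrayOutputStream
      implements StreamCapabilities {
    @Override
    public boolean hasCapability(String capability) {
      return "hsync".equals(capability);
    }
  }

  public static void main(String[] args) {
    OutputStream plain = new ByteArrayOutputStream();
    OutputStream syncable = new SyncableStream();
    System.out.println(objectHasCapability(plain, "hsync"));     // false
    System.out.println(objectHasCapability(syncable, "hsync"));  // true
    System.out.println(objectHasCapability(syncable, "hflush")); // false
  }
}
```

The same probe serves both input and output streams, which is why the shared helper takes a plain `Object`.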

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -603,7 +603,7 @@ and MAY be a `RuntimeException` or subclass. For instance, HDFS may raise a `Inv
 
     result = FSDataOutputStream
 
-A zero byte file must exist at the end of the specified path, visible to all
+A zero byte file must exist at the end of the specified path, visible to all.
 
 The updated (valid) FileSystem must contains all the parent directories of the path, as created by `mkdirs(parent(p))`.
 
````
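The postcondition amended in this hunk (a zero-byte file visible at the path, with parent directories as from `mkdirs(parent(p))`) can be illustrated with plain `java.nio.file` rather than the Hadoop `FileSystem` API; `CreatePostcondition` is a hypothetical name for this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CreatePostcondition {

  // Models create(p, overwrite=true): afterwards a zero-byte file exists at
  // the path and all parent directories exist, as if mkdirs(parent(p)) ran.
  public static Path create(Path p) throws IOException {
    Files.createDirectories(p.getParent()); // mkdirs(parent(p))
    Files.deleteIfExists(p);                // overwrite semantics
    return Files.createFile(p);             // zero-byte file, now visible
  }

  public static void main(String[] args) throws IOException {
    Path target = Files.createTempDirectory("fsspec")
        .resolve("a").resolve("b").resolve("file.txt");
    Path p = create(target);
    System.out.println(Files.size(p));                    // 0
    System.out.println(Files.isDirectory(p.getParent())); // true
  }
}
```

Local filesystems give these guarantees trivially; the point of the spec text is that object stores and HDFS must surface the same observable state.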

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md
Lines changed: 62 additions & 26 deletions

````diff
@@ -24,7 +24,7 @@ This document covers the Output Streams within the context of the
 It uses the filesystem model defined in [A Model of a Hadoop Filesystem](model.html)
 with the notation defined in [notation](Notation.md).
 
-The target audiences are
+The target audiences are:
 1. Users of the APIs. While `java.io.OutputStream` is a standard interfaces,
 this document clarifies how it is implemented in HDFS and elsewhere.
 The Hadoop-specific interfaces `Syncable` and `StreamCapabilities` are new;
@@ -60,8 +60,6 @@ A new interface: `StreamCapabilities`. This allows callers
 to probe the exact capabilities of a stream, even transitively
 through a chain of streams.
 
-
-
 ## Output Stream Model
 
 For this specification, an output stream can be viewed as a list of bytes
@@ -303,9 +301,7 @@ specifications of behaviour.
 
 #### Preconditions
 
-```python
-Stream.open else raise IOException
-```
+None.
 
 #### Postconditions
 
@@ -319,9 +315,23 @@ others"
 FS' = FS where data(FS', path) == buffer
 ```
 
-Some applications have been known to call `flush()` on a closed stream
-on the assumption that it is harmless. Implementations MAY choose to
-support this behaviour.
+When a stream is closed, `flush()` SHOULD downgrade to being a no-op, if it was not
+one already. This is to work with applications and libraries which can invoke
+it in exactly this way.
+
+
+*Issue*: Should `flush()` forward to `hflush()`?
+
+No. Or at least, make it optional.
+
+There's a lot of application code which assumes that `flush()` is low cost
+and should be invoked after writing every single line of output, after
+writing small 4KB blocks or similar.
+
+Forwarding this to a full flush across a distributed filesystem, or worse,
+a distant object store, is very underperformant
+
+See [HADOOP-16548](https://issues.apache.org/jira/browse/HADOOP-16548)
 
 ### <a name="close"></a>`close()`
 
@@ -372,14 +382,17 @@ may hide serious problems.
 delay in `close()` does not block the thread so long that the heartbeat times
 out.
 
+And for implementors: have a look at [HADOOP-16785](https://issues.apache.org/jira/browse/HADOOP-16785)
+to see examples of complications here.
+
 ### HDFS and `OutputStream.close()`
 
 HDFS does not immediately `sync()` the output of a written file to disk on
 `OutputStream.close()` unless configured with `dfs.datanode.synconclose`
 is true. This has caused [problems in some applications](https://issues.apache.org/jira/browse/ACCUMULO-1364).
 
 Applications which absolutely require the guarantee that a file has been persisted
-MUST call `Syncable.hsync()` before the file is closed.
+MUST call `Syncable.hsync()` *before* the file is closed.
 
 
 ## <a name="syncable"></a>`org.apache.hadoop.fs.Syncable`
@@ -530,7 +543,7 @@ From the javadocs of `DFSOutputStream.hsync(EnumSet<SyncFlag> syncFlags)`
 
 
 In virtual machines, the notion of "disk hardware" is really that of
-another software abstraction: there are guarantees.
+another software abstraction: there are few guarantees.
 
 
 ## <a name="streamcapabilities"></a>Interface `StreamCapabilities`
@@ -543,7 +556,6 @@ another software abstraction: there are guarantees.
 The `StreamCapabilities` interface exists to allow callers to dynamically
 determine the behavior of a stream.
 
-
 The reference implementation of this interface is
 `org.apache.hadoop.hdfs.DFSOutputStream`
 
@@ -562,7 +574,14 @@ The reference implementation of this interface is
 Where `HSYNC` and `HFLUSH` are items in the enumeration
 `org.apache.hadoop.fs.StreamCapabilities.StreamCapability`.
 
-## <a name="cansetdropbehind"></a>interface `CanSetDropBehind`
+Once a stream has been closed, th `hasCapability()` call MUST do one of
+
+* return the capabilities of the open stream.
+* return false.
+
+That is: it MUST NOT raise an exception about the file being closed;
+
+## <a name="cansetdropbehind"></a> interface `CanSetDropBehind`
 
 ```java
 @InterfaceAudience.Public
@@ -595,7 +614,7 @@ covered in this (very simplistic) filesystem model, but which are visible
 in production.
 
 
-### <a name="durability"></a>Durability
+### <a name="durability"></a> Durability
 
 1. `OutputStream.write()` MAY persist the data, synchronously or asynchronously
 1. `OutputStream.flush()` flushes data to the destination. There
@@ -618,7 +637,7 @@ Thus: `flush()` is often treated at most as a cue to flush data to the network
 buffers -but not commit to writing any data.
 It is only the `Syncable` interface which offers guarantees.
 
-### <a name="concurrency"></a>Concurrency
+### <a name="concurrency"></a> Concurrency
 
 1. The outcome of more than one process writing to the same file is undefined.
 
@@ -648,7 +667,7 @@ SHOULD be thread safe. *Note*: even the `DFSOutputStream` synchronization
 model permits the output stream to have `close()` invoked while awaiting an
 acknowledgement from datanode or namenode writes in an `hsync()` operation.
 
-### <a name="consistencyy"></a>Consistency and Visibility
+### <a name="consistencyy"></a> Consistency and Visibility
 
 There is no requirement for the data to be immediately visible to other applications
 —not until a specific call to flush buffers or persist it to the underlying storage
@@ -678,7 +697,7 @@ exists(FS''', path)
 getFileStatus(FS''', path).getLen() = len(data)
 ```
 
-HDFS does not do this except when the write crosses a block boundary; to do
+*HDFS does not do this except when the write crosses a block boundary*; to do
 otherwise would overload the Namenode. Other stores MAY copy this behavior.
 
 As a result, while a file is being written
@@ -710,15 +729,15 @@ which starts at server-side time `t1` and completes at time `t2` with a successf
 written file, then the last modification time SHOULD be a time `t` where
 `t1 <= t <= t2`
 
-## <a name="issues"></a>Issues with the Hadoop Output Stream model.
+## <a name="issues"></a> Issues with the Hadoop Output Stream model.
 
 There are some known issues with the output stream model as offered by Hadoop,
 specifically about the guarantees about when data is written and persisted
 and when the metadata is synchronized.
 These are where implementation aspects of HDFS and the "Local" filesystem
 do not follow the simple model of the filesystem used in this specification.
 
-### <a name="hdfs-issues"></a>HDFS
+### <a name="hdfs-issues"></a> HDFS
 
 That HDFS file metadata often lags the content of a file being written
 to is not something everyone expects, nor convenient for any program trying
@@ -751,7 +770,6 @@ empty.
 When an output stream in HDFS is closed; the newly written data is not immediately
 written to disk unless HDFS is deployed with `dfs.datanode.synconclose` set to
 true. Otherwise it is cached and written to disk later.
-
 
 ### <a name="local-issues"></a>Local Filesystem, `file:`
 
@@ -770,7 +788,7 @@ to the stream.
 For anyone thinking "this is a violation of this specification" —they are correct.
 The local filesystem was intended for testing, rather than production use.
 
-### <a name="checksummed-fs-issues"></a>Checksummed output streams
+### <a name="checksummed-fs-issues"></a> Checksummed output streams
 
 Because `org.apache.hadoop.fs.FSOutputSummer` and
 `org.apache.hadoop.fs.ChecksumFileSystem.ChecksumFSOutputSummer`
@@ -787,7 +805,7 @@ to close the stream more than once.
 Behaviors 1 and 2 really have to be considered bugs to fix, albeit with care.
 
 
-### <a name="object-store-issues"></a>Object Stores
+### <a name="object-store-issues"></a> Object Stores
 
 Object store streams MAY buffer the entire stream's output
 until the final `close()` operation triggers a single `PUT` of the data
@@ -861,8 +879,26 @@ is present: the act of instantiating the object, while potentially exhibiting
 create inconsistency, is atomic. Applications may be able to use that fact
 to their advantage.
 
-
-## <a name="implementors"></a>Implementors notes.
+There is a special troublespot in AWS S3 where it caches 404 responses returned
+by the service from probes for an object existing _before the file has been created_.
+A 404 record can remain in the load balancer's cache for some time -it seems to expire
+only after a "sufficient" interval of no probes for that path.
+This has been difficult to deal with within the Hadoop S3A code itself
+(HADOOP-16490, HADOOP-16635) -and if applications make their own probes for files
+before creating them, the problem will intermittently surface.
+
+1. If you look for an object on S3 and it is not there - The 404 MAY Be returned even
+after the object has been created.
+1. FS operations triggering such a probe include: `getFileStatus()`, `exists()`, `open()`
+and others.
+1. The S3A connector does not do a probe if a file is created through `create()` overwrite=true;
+it only makes sure that the path does not reference a directory. Applications SHOULD always
+create files with this option except when some form of exclusivity is needed on file
+creation -in which case, be aware, that with the non-atomic probe+create sequence which
+some object store connectors implement, the semantics of the creation are not sufficient
+to allow the filesystem to be used as an implicit coordination mechanism between processes.
+``
+## <a name="implementors"></a> Implementors notes.
 
 ### `StreamCapabilities`
 
@@ -875,8 +911,8 @@ they support the `hflush` and `hsync` capabilities on streams where this is not
 
 Sometimes streams pass their data to store, but the far end may not
 sync it all the way to disk. That is not something the client can determine.
-Here: if the client code is making the hflush/hsync calls to the distributed FS,
-it SHOULD declare that it supports them.
+Here: if the client code is making the hflush/hsync passes these requests
+on to the distributed FS, it SHOULD declare that it supports them.
 
 ### Metadata updates
````
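The closed-stream behaviours this commit specifies (`flush()` downgrading to a no-op once closed, `close()` being harmless to repeat, and `hasCapability()` returning rather than raising) can be sketched as follows. `CloseTolerantStream` is a hypothetical illustration which takes the "return false after close" option the spec permits; it is not a Hadoop class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CloseTolerantStream extends OutputStream {

  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private volatile boolean closed = false;

  @Override
  public void write(int b) throws IOException {
    if (closed) {
      throw new IOException("stream is closed"); // writes still fail
    }
    buffer.write(b);
  }

  @Override
  public void flush() {
    // SHOULD downgrade to a no-op once closed: return silently instead of
    // raising, so callers which flush() after close() keep working.
    if (closed) {
      return;
    }
    // a real stream would push buffered bytes to its destination here
  }

  @Override
  public void close() {
    closed = true; // idempotent: second and later calls are harmless
  }

  // Probe modelled on StreamCapabilities#hasCapability: after close it MUST
  // NOT raise; this sketch takes the "return false" option.
  public boolean hasCapability(String capability) {
    return !closed && "hflush".equals(capability);
  }
}
```

After `close()`, both `flush()` and a second `close()` return quietly, and the capability probe reports `false` instead of throwing.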
