@@ -24,7 +24,7 @@ This document covers the Output Streams within the context of the
2424It uses the filesystem model defined in [ A Model of a Hadoop Filesystem] ( model.html )
2525with the notation defined in [ notation] ( Notation.md ) .
2626
27- The target audiences are
27+ The target audiences are:
28281 . Users of the APIs. While ` java.io.OutputStream ` is a standard interface,
2929this document clarifies how it is implemented in HDFS and elsewhere.
3030The Hadoop-specific interfaces ` Syncable ` and ` StreamCapabilities ` are new;
@@ -60,8 +60,6 @@ A new interface: `StreamCapabilities`. This allows callers
6060to probe the exact capabilities of a stream, even transitively
6161through a chain of streams.
6262
63-
64-
6563## Output Stream Model
6664
6765For this specification, an output stream can be viewed as a list of bytes
@@ -303,9 +301,7 @@ specifications of behaviour.
303301
304302#### Preconditions
305303
306- ``` python
307- Stream.open else raise IOException
308- ```
304+ None.
309305
310306#### Postconditions
311307
@@ -319,9 +315,23 @@ others"
319315FS ' = FS where data(FS' , path) == buffer
320316```
321317
322- Some applications have been known to call ` flush() ` on a closed stream
323- on the assumption that it is harmless. Implementations MAY choose to
324- support this behaviour.
318+ When a stream is closed, ` flush() ` SHOULD downgrade to being a no-op, if it was not
319+ one already. This accommodates applications and libraries which call ` flush() `
320+ on a stream that has already been closed.
321+
322+
323+ * Issue* : Should ` flush() ` forward to ` hflush() ` ?
324+
325+ No. Or at least, make it optional.
326+
327+ There's a lot of application code which assumes that ` flush() ` is low cost
328+ and should be invoked after writing every single line of output, after
329+ writing small 4KB blocks or similar.
330+
331+ Forwarding this to a full flush across a distributed filesystem, or worse,
332+ a distant object store, performs very poorly.
333+
334+ See [ HADOOP-16548] ( https://issues.apache.org/jira/browse/HADOOP-16548 )
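
The behaviour described above can be sketched with nothing but `java.io`. This is a minimal illustration, not code from any Hadoop stream: the class name `LenientFlushStream` and its `flushCount()` and `size()` helpers are hypothetical. It assumes `flush()` does only cheap local work while the stream is open, becomes a no-op once the stream is closed, and never triggers an `hflush()`-style sync.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical buffering stream, for illustration only.
 * flush() is cheap while open and a no-op once closed;
 * write() after close still fails.
 */
final class LenientFlushStream extends OutputStream {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private boolean closed;
  private int flushCount; // counts only the flushes made while open

  @Override
  public synchronized void write(int b) throws IOException {
    if (closed) {
      throw new IOException("Stream is closed");
    }
    buffer.write(b);
  }

  @Override
  public synchronized void flush() {
    if (closed) {
      return; // no-op on a closed stream: never raise an exception here
    }
    // Local bookkeeping only; nothing is forwarded to an hflush()-style sync.
    flushCount++;
  }

  @Override
  public synchronized void close() {
    closed = true; // idempotent: closing twice is harmless
  }

  synchronized int flushCount() {
    return flushCount;
  }

  synchronized int size() {
    return buffer.size();
  }
}
```

A caller that invokes `flush()` after `close()` sees no exception and no extra work, while a late `write()` still fails with an `IOException`.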
325335
326336### <a name =" close " ></a >` close() `
327337
@@ -372,14 +382,17 @@ may hide serious problems.
372382delay in ` close() ` does not block the thread so long that the heartbeat times
373383out.
374384
385+ And for implementors: have a look at [ HADOOP-16785] ( https://issues.apache.org/jira/browse/HADOOP-16785 )
386+ to see examples of complications here.
387+
375388### HDFS and ` OutputStream.close() `
376389
377390HDFS does not immediately ` sync() ` the output of a written file to disk on
378391` OutputStream.close() ` unless ` dfs.datanode.synconclose `
379392is set to true. This has caused [ problems in some applications] ( https://issues.apache.org/jira/browse/ACCUMULO-1364 ) .
380393
381394Applications which absolutely require the guarantee that a file has been persisted
382- MUST call ` Syncable.hsync() ` before the file is closed.
395+ MUST call ` Syncable.hsync() ` * before* the file is closed.
383396
384397
385398## <a name =" syncable " ></a >` org.apache.hadoop.fs.Syncable `
@@ -530,7 +543,7 @@ From the javadocs of `DFSOutputStream.hsync(EnumSet<SyncFlag> syncFlags)`
530543
531544
532545In virtual machines, the notion of "disk hardware" is really that of
533- another software abstraction: there are guarantees.
546+ another software abstraction: there are few guarantees.
534547
535548
536549## <a name =" streamcapabilities " ></a >Interface ` StreamCapabilities `
@@ -543,7 +556,6 @@ another software abstraction: there are guarantees.
543556The ` StreamCapabilities ` interface exists to allow callers to dynamically
544557determine the behavior of a stream.
545558
546-
547559The reference implementation of this interface is
548560 ` org.apache.hadoop.hdfs.DFSOutputStream `
549561
@@ -562,7 +574,14 @@ The reference implementation of this interface is
562574Where ` HSYNC ` and ` HFLUSH ` are items in the enumeration
563575` org.apache.hadoop.fs.StreamCapabilities.StreamCapability ` .
564576
565- ## <a name =" cansetdropbehind " ></a >interface ` CanSetDropBehind `
577+ Once a stream has been closed, the ` hasCapability() ` call MUST do one of the following:
578+
579+ * return the capabilities of the open stream.
580+ * return false.
581+
582+ That is: it MUST NOT raise an exception about the file being closed.
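
A minimal sketch of the second option (return false after close) follows. To keep it self-contained it does not use the real Hadoop interface: `CapabilityProbe` is a hypothetical stand-in that only mirrors the shape of `hasCapability(String)`, and `ProbeableStream` is an invented class which is simply assumed to support `hflush` and `hsync` while open.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Locale;

/** Hypothetical stand-in for StreamCapabilities so the sketch compiles alone. */
interface CapabilityProbe {
  boolean hasCapability(String capability);
}

/** Illustrative stream: the probe never throws, even after close(). */
final class ProbeableStream extends OutputStream implements CapabilityProbe {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private volatile boolean closed;

  @Override
  public void write(int b) throws IOException {
    if (closed) {
      throw new IOException("Stream is closed");
    }
    buffer.write(b);
  }

  @Override
  public void close() {
    closed = true;
  }

  @Override
  public boolean hasCapability(String capability) {
    if (closed) {
      return false; // closed stream: answer false rather than raise
    }
    // Assumption for this sketch: the open stream supports both capabilities.
    switch (capability.toLowerCase(Locale.ENGLISH)) {
      case "hflush":
      case "hsync":
        return true;
      default:
        return false;
    }
  }
}
```

The other permitted behaviour, returning the capabilities of the open stream, would simply omit the `closed` check in `hasCapability()`.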
583+
584+ ## <a name =" cansetdropbehind " ></a > interface ` CanSetDropBehind `
566585
567586``` java
568587@InterfaceAudience . Public
@@ -595,7 +614,7 @@ covered in this (very simplistic) filesystem model, but which are visible
595614in production.
596615
597616
598- ### <a name =" durability " ></a >Durability
617+ ### <a name =" durability " ></a > Durability
599618
6006191 . ` OutputStream.write() ` MAY persist the data, synchronously or asynchronously
6016201 . ` OutputStream.flush() ` flushes data to the destination. There
@@ -618,7 +637,7 @@ Thus: `flush()` is often treated at most as a cue to flush data to the network
618637buffers -but not commit to writing any data.
619638It is only the ` Syncable ` interface which offers guarantees.
620639
621- ### <a name =" concurrency " ></a >Concurrency
640+ ### <a name =" concurrency " ></a > Concurrency
622641
6236421 . The outcome of more than one process writing to the same file is undefined.
624643
@@ -648,7 +667,7 @@ SHOULD be thread safe. *Note*: even the `DFSOutputStream` synchronization
648667model permits the output stream to have ` close() ` invoked while awaiting an
649668acknowledgement from datanode or namenode writes in an ` hsync() ` operation.
650669
651- ### <a name =" consistencyy " ></a >Consistency and Visibility
670+ ### <a name =" consistencyy " ></a > Consistency and Visibility
652671
653672There is no requirement for the data to be immediately visible to other applications
654673—not until a specific call to flush buffers or persist it to the underlying storage
@@ -678,7 +697,7 @@ exists(FS''', path)
678697getFileStatus(FS''' , path).getLen() = len (data)
679698```
680699
681- HDFS does not do this except when the write crosses a block boundary; to do
700+ * HDFS does not do this except when the write crosses a block boundary* ; to do
682701otherwise would overload the Namenode. Other stores MAY copy this behavior.
683702
684703As a result, while a file is being written
@@ -710,15 +729,15 @@ which starts at server-side time `t1` and completes at time `t2` with a successf
710729written file , then the last modification time SHOULD be a time `t` where
711730`t1 <= t <= t2`
712731
713- # # <a name="issues"></a>Issues with the Hadoop Output Stream model.
732+ # # <a name="issues"></a> Issues with the Hadoop Output Stream model.
714733
715734There are some known issues with the output stream model as offered by Hadoop,
716735specifically about the guarantees about when data is written and persisted
717736—and when the metadata is synchronized.
718737These are where implementation aspects of HDFS and the " Local" filesystem
719738do not follow the simple model of the filesystem used in this specification.
720739
721- # ## <a name="hdfs-issues"></a>HDFS
740+ # ## <a name="hdfs-issues"></a> HDFS
722741
723742That HDFS file metadata often lags the content of a file being written
724743to is not something everyone expects, nor convenient for any program trying
@@ -751,7 +770,6 @@ empty.
751770When an output stream in HDFS is closed, the newly written data is not immediately
752771written to disk unless HDFS is deployed with `dfs.datanode.synconclose` set to
753772true. Otherwise it is cached and written to disk later.
754-
755773
756774# ## <a name="local-issues"></a>Local Filesystem, `file:`
757775
@@ -770,7 +788,7 @@ to the stream.
770788For anyone thinking " this is a violation of this specification" —they are correct.
771789The local filesystem was intended for testing, rather than production use.
772790
773- # ## <a name="checksummed-fs-issues"></a>Checksummed output streams
791+ # ## <a name="checksummed-fs-issues"></a> Checksummed output streams
774792
775793Because `org.apache.hadoop.fs.FSOutputSummer` and
776794`org.apache.hadoop.fs.ChecksumFileSystem.ChecksumFSOutputSummer`
@@ -787,7 +805,7 @@ to close the stream more than once.
787805Behaviors 1 and 2 really have to be considered bugs to fix, albeit with care.
788806
789807
790- # ## <a name="object-store-issues"></a>Object Stores
808+ # ## <a name="object-store-issues"></a> Object Stores
791809
792810Object store streams MAY buffer the entire stream' s output
793811until the final `close()` operation triggers a single `PUT ` of the data
@@ -861,8 +879,26 @@ is present: the act of instantiating the object, while potentially exhibiting
861879create inconsistency, is atomic. Applications may be able to use that fact
862880to their advantage.
863881
864-
865- # # <a name="implementors"></a>Implementors notes.
882+ AWS S3 has a particular trouble spot: it caches 404 responses returned
883+ by the service to probes for an object made _before the file has been created_.
884+ A 404 record can remain in the load balancer's cache for some time; it seems to expire
885+ only after a "sufficient" interval of no probes for that path.
886+ This has been difficult to deal with within the Hadoop S3A code itself
887+ (HADOOP-16490, HADOOP-16635), and if applications make their own probes for files
888+ before creating them, the problem will intermittently surface.
889+
890+ 1 . If you look for an object on S3 and it is not there, the 404 MAY be returned even
891+ after the object has been created.
892+ 1 . FS operations triggering such a probe include: `getFileStatus()` , `exists()` , `open()`
893+ and others.
894+ 1 . The S3A connector does not do a probe if a file is created through `create()` with `overwrite = true` ;
895+ it only makes sure that the path does not reference a directory. Applications SHOULD always
896+ create files with this option except when some form of exclusivity is needed on file
897+ creation. In that case, be aware that the non-atomic probe+create sequence which
898+ some object store connectors implement does not provide creation semantics strong enough
899+ to allow the filesystem to be used as an implicit coordination mechanism between processes.
900+
901+ # # <a name="implementors"></a> Implementors notes.
866902
867903# ## `StreamCapabilities`
868904
@@ -875,8 +911,8 @@ they support the `hflush` and `hsync` capabilities on streams where this is not
875911
876912Sometimes streams pass their data to store, but the far end may not
877913sync it all the way to disk. That is not something the client can determine.
878- Here: if the client code is making the hflush/ hsync calls to the distributed FS ,
879- it SHOULD declare that it supports them.
914+ Here: if the client code passes the hflush/hsync requests
915+ on to the distributed FS, it SHOULD declare that it supports them.
880916
881917# ## Metadata updates
882918