Commit 658396a

fixes yetus errors
1 parent 125152e commit 658396a

1 file changed: 31 additions, 31 deletions

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md

Lines changed: 31 additions & 31 deletions
@@ -22,36 +22,36 @@ A high level overview of this feature was published in
 [Pinterest Engineering's blog post titled "Improving efficiency and reducing runtime using S3 read optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0).

 With prefetching, the input stream divides the remote file into blocks of a fixed size, associates
-buffers to these blocks and then reads data into these buffers asynchronously.
+buffers to these blocks and then reads data into these buffers asynchronously.
 It also potentially caches these blocks.

 ### Basic Concepts

 * **Remote File**: A binary blob of data stored on some storage device.
 * **Block File**: Local file containing a block of the remote file.
-* **Block**: A file is divided into a number of blocks.
+* **Block**: A file is divided into a number of blocks.
 The size of the first n-1 blocks is same, and the size of the last block may be same or smaller.
-* **Block based reading**: The granularity of read is one block.
-That is, either an entire block is read and returned or none at all.
+* **Block based reading**: The granularity of read is one block.
+That is, either an entire block is read and returned or none at all.
 Multiple blocks may be read in parallel.

 ### Configuring the stream

 |Property |Meaning |Default |
-|--- |--- |--- |
+|---|---|---|
 |`fs.s3a.prefetch.enabled` |Enable the prefetch input stream |`true` |
 |`fs.s3a.prefetch.block.size` |Size of a block |`8M` |
 |`fs.s3a.prefetch.block.count` |Number of blocks to prefetch |`8` |

 ### Key Components

 `S3PrefetchingInputStream` - When prefetching is enabled, S3AFileSystem will return an instance of
-this class as the input stream.
+this class as the input stream.
 Depending on the remote file size, it will either use
 the `S3InMemoryInputStream` or the `S3CachingInputStream` as the underlying input stream.

 `S3InMemoryInputStream` - Underlying input stream used when the remote file size < configured block
-size.
+size.
 Will read the entire remote file into memory.

 `S3CachingInputStream` - Underlying input stream used when remote file size > configured block size.
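
The block arithmetic and the size-based choice of underlying stream described in this hunk can be sketched roughly as follows. This is an illustrative sketch, not code from hadoop-aws: the class `BlockLayout`, its method names, and the `StreamKind` enum are hypothetical. The documentation only covers file sizes strictly smaller or strictly larger than the block size, so treating a file of exactly one block as in-memory is an assumption made here.

```java
/** Hypothetical helper for illustration only; not a class in hadoop-aws. */
final class BlockLayout {
    /** Which underlying stream the size-based rule would pick. */
    enum StreamKind { IN_MEMORY, CACHING }

    private BlockLayout() {}

    /** Number of fixed-size blocks needed to cover the file (ceiling division). */
    static long blockCount(long fileSize, long blockSize) {
        requireValid(fileSize, blockSize);
        return (fileSize + blockSize - 1) / blockSize;
    }

    /** Size of one block: blockSize for the first n-1 blocks, possibly less for the last. */
    static long sizeOfBlock(long fileSize, long blockSize, long blockNumber) {
        long count = blockCount(fileSize, blockSize);
        if (blockNumber < 0 || blockNumber >= count) {
            throw new IndexOutOfBoundsException("no such block: " + blockNumber);
        }
        return Math.min(blockSize, fileSize - blockNumber * blockSize);
    }

    /** Small files are read fully into memory; larger ones use prefetching and caching. */
    static StreamKind chooseStream(long fileSize, long blockSize) {
        requireValid(fileSize, blockSize);
        // Assumption: a file exactly one block long is treated as "small".
        return fileSize <= blockSize ? StreamKind.IN_MEMORY : StreamKind.CACHING;
    }

    private static void requireValid(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
    }
}
```

With the default 8 MB block size, a hypothetical 20 MB file would be split into three blocks of 8 MB, 8 MB and 4 MB, and would be served by the caching stream.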
@@ -61,30 +61,30 @@ Uses asynchronous prefetching of blocks and caching to improve performance.

 * Number of blocks in the remote file
 * Block size
-* State of each block (initially all blocks have state *NOT_READY*).
+* State of each block (initially all blocks have state *NOT_READY*).
 Other states are: Queued, Ready, Cached.

 `BufferData` - Holds the buffer and additional information about it such as:

 * The block number this buffer is for
-* State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done).
+* State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done).
 Initial state of a buffer is blank.

 `CachingBlockManager` - Implements reading data into the buffer, prefetching and caching.

-`BufferPool` - Manages a fixed sized pool of buffers.
+`BufferPool` - Manages a fixed sized pool of buffers.
 It’s used by `CachingBlockManager` to acquire buffers.

 `S3File` - Implements operations to interact with S3 such as opening and closing the input stream to
 the remote file in S3.

-`S3Reader` - Implements reading from the stream opened by `S3File`.
+`S3Reader` - Implements reading from the stream opened by `S3File`.
 Reads from this input stream in blocks of 64KB.

-`FilePosition` - Provides functionality related to tracking the position in the file.
+`FilePosition` - Provides functionality related to tracking the position in the file.
 Also gives access to the current buffer in use.

-`SingleFilePerBlockCache` - Responsible for caching blocks to the local file system.
+`SingleFilePerBlockCache` - Responsible for caching blocks to the local file system.
 Each cache block is stored on the local disk as a separate block file.

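
The per-block bookkeeping listed earlier in this hunk (number of blocks, block size, and a state per block that starts as *NOT_READY*) could be modelled roughly as below. This is a minimal sketch under stated assumptions: `BlockStateTable` and its methods are hypothetical names, the four states are taken from the text, and no transition rules are enforced because the text does not describe them.

```java
import java.util.Arrays;

/** Hypothetical sketch of per-block state tracking; not the hadoop-aws implementation. */
final class BlockStateTable {
    /** States named in the documentation. */
    enum State { NOT_READY, QUEUED, READY, CACHED }

    private final long blockSize;
    private final State[] states;

    BlockStateTable(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        this.blockSize = blockSize;
        int count = Math.toIntExact((fileSize + blockSize - 1) / blockSize);
        this.states = new State[count];
        // Every block starts out as NOT_READY.
        Arrays.fill(states, State.NOT_READY);
    }

    int blockCount() { return states.length; }

    long blockSize() { return blockSize; }

    /** Block that contains the given absolute file offset. */
    int blockNumberOf(long offset) { return Math.toIntExact(offset / blockSize); }

    State stateOf(int blockNumber) { return states[blockNumber]; }

    /** Records a new state; this sketch does not validate transitions. */
    void setState(int blockNumber, State state) { states[blockNumber] = state; }
}
```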
 ### Operation
@@ -101,8 +101,8 @@ in.read(buffer, 0, 3MB);
 in.read(buffer, 0, 2MB);
 ```

-When the first read is issued, there is no buffer in use yet.
-The `S3InMemoryInputStream` gets the data in this remote file by calling the `ensureCurrentBuffer()`
+When the first read is issued, there is no buffer in use yet.
+The `S3InMemoryInputStream` gets the data in this remote file by calling the `ensureCurrentBuffer()`
 method, which ensures that a buffer with data is available to be read from.

 The `ensureCurrentBuffer()` then:
@@ -117,7 +117,7 @@ The `ensureCurrentBuffer()` then:

 The read operation now just gets the required bytes from the buffer in `FilePosition`.

-When the second read is issued, there is already a valid buffer which can be used.
+When the second read is issued, there is already a valid buffer which can be used.
 Don’t do anything else, just read the required bytes from this buffer.

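
The fetch-once-then-reuse behaviour of the two reads can be illustrated with a small stand-in. This is only a sketch: `LazyInMemoryStream` is a hypothetical class, the `Supplier` stands in for the remote read, and the individual steps that the real `ensureCurrentBuffer()` performs are not reproduced here because they are not shown in this excerpt.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an in-memory stream; not the hadoop-aws class. */
final class LazyInMemoryStream {
    private final Supplier<byte[]> remoteFetch; // stands in for reading the whole remote file
    private byte[] buffer;                      // null until the first read
    private int position;
    private int fetchCount;

    LazyInMemoryStream(Supplier<byte[]> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    /** Loads the file into memory the first time it is needed, then does nothing. */
    private void ensureCurrentBuffer() {
        if (buffer == null) {
            buffer = remoteFetch.get();
            fetchCount++;
        }
    }

    /** Copies up to length bytes into dest; returns -1 once the file is exhausted. */
    int read(byte[] dest, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        ensureCurrentBuffer();
        int remaining = buffer.length - position;
        if (remaining <= 0) {
            return -1;
        }
        int n = Math.min(length, remaining);
        System.arraycopy(buffer, position, dest, offset, n);
        position += n;
        return n;
    }

    /** How many times the remote fetch ran; expected to stay at one. */
    int fetchCount() { return fetchCount; }
}
```

Two consecutive reads against such a stream trigger a single fetch: the first read populates the buffer and the second is served from it.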
 #### S3CachingInputStream
@@ -134,7 +134,7 @@ in.read(buffer, 0, 5MB)
 in.read(buffer, 0, 8MB)
 ```

-For the first read call, there is no valid buffer yet.
+For the first read call, there is no valid buffer yet.
 `ensureCurrentBuffer()` is called, and for the first `read()`, prefetch count is set as 1.

 The current block (block 0) is read synchronously, while the blocks to be prefetched (block 1) is
@@ -143,29 +143,29 @@ read asynchronously.
 The `CachingBlockManager` is responsible for getting buffers from the buffer pool and reading data
 into them. This process of acquiring the buffer pool works as follows:

-* The buffer pool keeps a map of allocated buffers and a pool of available buffers.
-The size of this pool is = prefetch block count + 1.
+* The buffer pool keeps a map of allocated buffers and a pool of available buffers.
+The size of this pool is = prefetch block count + 1.
 If the prefetch block count is 8, the buffer pool has a size of 9.
 * If the pool is not yet at capacity, create a new buffer and add it to the pool.
-* If its at capacity, check if any buffers with state = done can be released.
-Releasing a buffer means removing it from allocated and returning it back to the pool of available
+* If it's at capacity, check if any buffers with state = done can be released.
+Releasing a buffer means removing it from allocated and returning it back to the pool of available
 buffers.
 * If there are no buffers with state = done currently then nothing will be released, so retry the
 above step at a fixed interval a few times till a buffer becomes available.
-* If after multiple retries there are still no available buffers, release a buffer in the ready state.
+* If after multiple retries there are still no available buffers, release a buffer in the ready state.
 The buffer for the block furthest from the current block is released.

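
The acquisition steps in the list above can be sketched as follows. This is a simplified illustration, not the hadoop-aws `BufferPool`: the class `BufferPoolSketch`, its nested `Buf` type, the reduced three-state enum, and the `retries` and `retryIntervalMillis` parameters are all assumptions introduced for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical sketch of the acquisition steps; not the hadoop-aws BufferPool. */
final class BufferPoolSketch {
    enum State { BLANK, READY, DONE } // reduced set of states, enough for this sketch

    static final class Buf {
        final byte[] data;
        int blockNumber = -1;
        State state = State.BLANK;
        Buf(int size) { data = new byte[size]; }
    }

    private final int capacity; // prefetch block count + 1
    private final int bufferSize;
    private final Map<Integer, Buf> allocated = new HashMap<>(); // block number -> buffer
    private final Deque<Buf> available = new ArrayDeque<>();
    private int created;

    BufferPoolSketch(int prefetchBlockCount, int bufferSize) {
        this.capacity = prefetchBlockCount + 1;
        this.bufferSize = bufferSize;
    }

    int capacity() { return capacity; }

    synchronized boolean isAllocated(int blockNumber) { return allocated.containsKey(blockNumber); }

    synchronized void setState(int blockNumber, State state) {
        Buf b = allocated.get(blockNumber);
        if (b != null) b.state = state;
    }

    /** Retries a few times, then falls back to evicting the furthest READY buffer. */
    Buf acquire(int blockNumber, int currentBlock, int retries, long retryIntervalMillis) {
        for (int attempt = 0; ; attempt++) {
            Buf b = tryAcquire(blockNumber);
            if (b != null) return b;
            if (attempt >= retries) break;
            try {
                Thread.sleep(retryIntervalMillis); // wait for another thread to mark a buffer DONE
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a buffer", e);
            }
        }
        Buf b = evictFurthestReadyAndAcquire(blockNumber, currentBlock);
        if (b == null) throw new IllegalStateException("no buffer could be released");
        return b;
    }

    private synchronized Buf tryAcquire(int blockNumber) {
        Buf existing = allocated.get(blockNumber);
        if (existing != null) return existing;
        if (available.isEmpty() && created < capacity) { // not at capacity: create a buffer
            available.push(new Buf(bufferSize));
            created++;
        }
        if (available.isEmpty()) { // at capacity: release every DONE buffer
            Iterator<Buf> it = allocated.values().iterator();
            while (it.hasNext()) {
                Buf b = it.next();
                if (b.state == State.DONE) { it.remove(); recycle(b); }
            }
        }
        if (available.isEmpty()) return null;
        Buf b = available.pop();
        b.blockNumber = blockNumber;
        allocated.put(blockNumber, b);
        return b;
    }

    private synchronized Buf evictFurthestReadyAndAcquire(int blockNumber, int currentBlock) {
        Buf furthest = null;
        for (Buf b : allocated.values()) {
            if (b.state != State.READY) continue;
            if (furthest == null || Math.abs(b.blockNumber - currentBlock)
                    > Math.abs(furthest.blockNumber - currentBlock)) {
                furthest = b;
            }
        }
        if (furthest == null) return null;
        allocated.remove(furthest.blockNumber);
        recycle(furthest);
        return tryAcquire(blockNumber);
    }

    private void recycle(Buf b) {
        b.blockNumber = -1;
        b.state = State.BLANK;
        available.push(b);
    }
}
```

With a prefetch block count of 8 this sketch creates at most 9 buffers. Sleeping happens outside the lock so that other threads can mark buffers as done in the meantime.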
 Once a buffer has been acquired by `CachingBlockManager`, if the buffer is in a *READY* state, it is
-returned.
-This means that data was already read into this buffer asynchronously by a prefetch.
-If its state is *BLANK,* then data is read into it using
+returned.
+This means that data was already read into this buffer asynchronously by a prefetch.
+If it's state is *BLANK* then data is read into it using
 `S3Reader.read(ByteBuffer buffer, long offset, int size).`

 For the second read call, `in.read(buffer, 0, 8MB)`, since the block sizes are of 8MB and only 5MB
 of block 0 has been read so far, 3MB of the required data will be read from the current block 0.
 Once all data has been read from this block, `S3CachingInputStream` requests the next block (
-block 1), which will already have been prefetched and so it can just start reading from it.
-Also, while reading from block 1 it will also issue prefetch requests for the next blocks.
+block 1), which will already have been prefetched and so it can just start reading from it.
+Also, while reading from block 1 it will also issue prefetch requests for the next blocks.
 The number of blocks to be prefetched is determined by `fs.s3a.prefetch.block.count`.

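
The way the second read is split across block 0 and block 1 is plain offset arithmetic, sketched below. `ReadPlanner` and `Span` are hypothetical names used only for illustration, and the 20 MB file size in the usage note is an assumed value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits one read request into per-block pieces. */
final class ReadPlanner {
    /** One contiguous piece of a read that lies within a single block. */
    static final class Span {
        final long blockNumber;
        final long offsetInBlock;
        final long length;

        Span(long blockNumber, long offsetInBlock, long length) {
            this.blockNumber = blockNumber;
            this.offsetInBlock = offsetInBlock;
            this.length = length;
        }
    }

    private ReadPlanner() {}

    /** Returns the block-aligned pieces of a read of length bytes starting at position. */
    static List<Span> plan(long position, long length, long blockSize, long fileSize) {
        if (position < 0 || length < 0 || blockSize <= 0 || fileSize < 0) {
            throw new IllegalArgumentException("invalid read request");
        }
        List<Span> spans = new ArrayList<>();
        long end = Math.min(position + length, fileSize); // never read past end of file
        long pos = position;
        while (pos < end) {
            long block = pos / blockSize;
            long offsetInBlock = pos % blockSize;
            // Take whatever is left in this block, or what the request still needs.
            long n = Math.min(blockSize - offsetInBlock, end - pos);
            spans.add(new Span(block, offsetInBlock, n));
            pos += n;
        }
        return spans;
    }
}
```

For an 8 MB read starting at offset 5 MB with 8 MB blocks (file size assumed to be 20 MB), `plan` yields two spans: 3 MB from block 0 starting at offset 5 MB, then 5 MB from the start of block 1.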
 ##### Random Reads
@@ -180,13 +180,13 @@ in.seek(2MB)
 in.read(buffer, 0, 4MB)
 ```

-The `CachingInputStream` also caches prefetched blocks.
-This happens when a `seek()` is issued for outside the current block and the current block still has
+The `CachingInputStream` also caches prefetched blocks.
+This happens when a `seek()` is issued for outside the current block and the current block still has
 not been fully read.

 For the above read sequence, when the `seek(10MB)` call is issued, block 0 has not been read
 completely so cache it as the caller will probably want to read from it again.

-When `seek(2MB)` is called, the position is back inside block 0.
-The next read can now be satisfied from the locally cached block file, which is typically orders of
+When `seek(2MB)` is called, the position is back inside block 0.
+The next read can now be satisfied from the locally cached block file, which is typically orders of
 magnitude faster than a network based read.
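
The caching rule for seeks (cache the current block when the seek leaves it before it has been fully read) reduces to a small predicate. The sketch below is illustrative only: `SeekCachePolicy` and its parameter names are hypothetical, and the 5 MB consumed in the usage note is an assumed value.

```java
/** Hypothetical sketch of the "cache on seek" decision; not hadoop-aws code. */
final class SeekCachePolicy {
    private SeekCachePolicy() {}

    /**
     * Returns true when the current block should be written to the local cache:
     * the seek target lies in a different block and the current block still has unread bytes.
     */
    static boolean shouldCacheCurrentBlock(long currentBlock,
                                           long bytesConsumedInBlock,
                                           long currentBlockLength,
                                           long seekTarget,
                                           long blockSize) {
        if (blockSize <= 0 || seekTarget < 0) {
            throw new IllegalArgumentException("blockSize must be > 0 and seekTarget >= 0");
        }
        long targetBlock = seekTarget / blockSize;
        boolean leavesCurrentBlock = targetBlock != currentBlock;
        boolean notFullyRead = bytesConsumedInBlock < currentBlockLength;
        return leavesCurrentBlock && notFullyRead;
    }
}
```

With 8 MB blocks and 5 MB of block 0 consumed, a seek to 10 MB lands in block 1 and returns `true`, while a seek to 2 MB stays inside block 0 and returns `false`.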
