<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

# S3A Prefetching

This document explains the `S3PrefetchingInputStream` and the various components it uses.

This input stream implements prefetching and caching to improve read performance.
A high-level overview of this feature was published in
[Pinterest Engineering's blog post titled "Improving efficiency and reducing runtime using S3 read optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0).

With prefetching, the input stream divides the remote file into blocks of a fixed size, associates
buffers with these blocks and then reads data into these buffers asynchronously.
It may also cache these blocks.

### Basic Concepts

* **Remote File**: A binary blob of data stored on some storage device.
* **Block File**: A local file containing a block of the remote file.
* **Block**: A file is divided into a number of blocks.
The first n-1 blocks are all the same size; the last block may be the same size or smaller (see the
example below).
* **Block based reading**: Reads are performed at the granularity of one block.
That is, either an entire block is read and returned, or none of it is.
Multiple blocks may be read in parallel.

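For example, a 20MB remote file with the default 8MB block size splits into two 8MB blocks followed
by a final 4MB block. A minimal sketch of that arithmetic (the class and variable names are purely
illustrative and are not the actual `BlockData` API):

```
// Sketch of the block layout arithmetic; names are illustrative only.
public final class BlockLayout {
  public static void main(String[] args) {
    long fileSize = 20 * 1024 * 1024;   // 20MB remote file
    int blockSize = 8 * 1024 * 1024;    // fs.s3a.prefetch.block.size

    // Number of blocks: all but the last are full sized.
    int blockCount = (int) ((fileSize + blockSize - 1) / blockSize);            // 3

    // Size of the last block: may be smaller than blockSize.
    long lastBlockSize = fileSize - (long) (blockCount - 1) * blockSize;        // 4MB

    System.out.println(blockCount + " blocks, last block " + lastBlockSize + " bytes");
  }
}
```
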
### Configuring the stream

|Property |Meaning |Default |
|---|---|---|
|`fs.s3a.prefetch.enabled` |Enable the prefetch input stream |`true` |
|`fs.s3a.prefetch.block.size` |Size of a block |`8M` |
|`fs.s3a.prefetch.block.count` |Number of blocks to prefetch |`8` |

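These can be set like any other S3A option, for example programmatically on a Hadoop
`Configuration`; the values below are purely illustrative:

```
import org.apache.hadoop.conf.Configuration;

// Illustrative values only; the defaults in the table above apply otherwise.
Configuration conf = new Configuration();
conf.setBoolean("fs.s3a.prefetch.enabled", true);  // enable the prefetching stream
conf.set("fs.s3a.prefetch.block.size", "16M");     // size of each block
conf.setInt("fs.s3a.prefetch.block.count", 4);     // number of blocks to prefetch
```
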
### Key Components

`S3PrefetchingInputStream` - When prefetching is enabled, `S3AFileSystem` will return an instance of
this class as the input stream.
Depending on the remote file size, it will either use
the `S3InMemoryInputStream` or the `S3CachingInputStream` as the underlying input stream.

`S3InMemoryInputStream` - Underlying input stream used when the remote file size < configured block
size.
Will read the entire remote file into memory.

`S3CachingInputStream` - Underlying input stream used when remote file size > configured block size.
Uses asynchronous prefetching of blocks and caching to improve performance.

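A hypothetical helper mirroring that selection rule (not the actual `S3AFileSystem` code; how a file
of exactly one block is handled is an assumption here):

```
// Hypothetical sketch of the selection rule described above; not the actual
// S3AFileSystem code.
static boolean useInMemoryStream(long fileSize, long prefetchBlockSize) {
  // Smaller than one block: S3InMemoryInputStream reads it fully into memory.
  // Otherwise: S3CachingInputStream prefetches and caches blocks.
  return fileSize < prefetchBlockSize;
}
```
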
`BlockData` - Holds information about the blocks in a remote file, such as:

* Number of blocks in the remote file
* Block size
* State of each block (initially all blocks have state *NOT_READY*).
Other states are: Queued, Ready, Cached.

`BufferData` - Holds the buffer and additional information about it, such as:

* The block number this buffer is for
* State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done).
The initial state of a buffer is Blank; these states are sketched below.

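The buffer states can be read as a small state machine. The enum below is an illustrative sketch,
not the actual `BufferData` declaration, and the comments only restate the descriptions in this
document:

```
// Illustrative only; not the actual BufferData state declaration.
enum BufferState {
  UNKNOWN,      // state cannot be determined
  BLANK,        // initial state: buffer allocated but holds no data yet
  PREFETCHING,  // an asynchronous read into this buffer is in progress
  CACHING,      // the buffer's block is being written to the local cache
  READY,        // data is available and can be read from this buffer
  DONE          // the buffer is no longer needed and may be released
}
```
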
`CachingBlockManager` - Implements reading data into the buffer, prefetching and caching.

`BufferPool` - Manages a fixed-size pool of buffers.
It's used by `CachingBlockManager` to acquire buffers.

`S3File` - Implements operations to interact with S3 such as opening and closing the input stream to
the remote file in S3.

`S3Reader` - Implements reading from the stream opened by `S3File`.
Reads from this input stream in blocks of 64KB.

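A minimal sketch of such a 64KB read loop, assuming a plain `InputStream` opened by `S3File` and a
destination `ByteBuffer`; the helper name is invented:

```
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch of filling a block buffer in 64KB chunks, in the spirit
// of the S3Reader description above.
static int readInChunks(InputStream in, ByteBuffer buffer) throws IOException {
  final byte[] chunk = new byte[64 * 1024];     // 64KB read granularity
  int total = 0;
  while (buffer.hasRemaining()) {
    int toRead = Math.min(chunk.length, buffer.remaining());
    int n = in.read(chunk, 0, toRead);
    if (n < 0) {
      break;                                    // end of stream
    }
    buffer.put(chunk, 0, n);
    total += n;
  }
  return total;
}
```
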
`FilePosition` - Provides functionality related to tracking the position in the file.
Also gives access to the current buffer in use.

`SingleFilePerBlockCache` - Responsible for caching blocks to the local file system.
Each cached block is stored on the local disk as a separate block file.
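
A minimal sketch of writing one block to its own local file in that spirit; the file naming and the
helper below are assumptions for illustration, not the actual cache implementation:

```
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: persist one block's buffer as its own local block file.
static Path cacheBlock(Path cacheDir, int blockNumber, ByteBuffer block) throws IOException {
  Path blockFile = Files.createTempFile(cacheDir, "block-" + blockNumber + "-", ".bin");
  try (FileChannel channel = FileChannel.open(blockFile, StandardOpenOption.WRITE)) {
    channel.write(block.duplicate());           // one block per file on the local disk
  }
  return blockFile;
}
```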

### Operation

#### S3InMemoryInputStream

For a remote file of size 5MB with a block size of 8MB, the file is smaller than a single block, so
the `S3InMemoryInputStream` will be used.

If the caller makes the following read calls:

```
in.read(buffer, 0, 3MB);
in.read(buffer, 0, 2MB);
```

When the first read is issued, there is no buffer in use yet.
The `S3InMemoryInputStream` gets the data in this remote file by calling the `ensureCurrentBuffer()`
method, which ensures that a buffer with data is available to be read from.

`ensureCurrentBuffer()` then:

* Reads data into a buffer by calling `S3Reader.read(ByteBuffer buffer, long offset, int size)`.
* `S3Reader` uses `S3File` to open an input stream to the remote file in S3 by making
  a `getObject()` request with the range `(0, filesize)`.
* The `S3Reader` reads the entire remote file into the provided buffer, and once reading is complete
  closes the S3 stream and frees all underlying resources.
* Now that the entire remote file is in a buffer, this data is set in `FilePosition` so it can be
  accessed by the input stream.

The read operation now just gets the required bytes from the buffer in `FilePosition`.

When the second read is issued, there is already a valid buffer which can be used,
so nothing else needs to be done; the required bytes are simply read from this buffer.
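
In other words, after the initial `getObject()` request every read is a plain copy out of one
in-memory buffer. A small illustrative sketch (the helper stands in for the buffer access provided
by `FilePosition` and is not the real API):

```
import java.nio.ByteBuffer;

// Illustrative only: once the 5MB file is in memory, both read() calls above
// are simple copies out of the same buffer, with no further S3 requests.
static int readFromMemory(ByteBuffer fileBuffer, byte[] dest, int offset, int length) {
  if (!fileBuffer.hasRemaining()) {
    return -1;                                   // end of the remote file
  }
  int n = Math.min(length, fileBuffer.remaining());
  fileBuffer.get(dest, offset, n);               // copy from memory, no network I/O
  return n;
}
```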

#### S3CachingInputStream

For a remote file of size 40MB with a block size of 8MB, the `S3CachingInputStream` will be
used.

##### Sequential Reads

If the caller makes the following calls:

```
in.read(buffer, 0, 5MB)
in.read(buffer, 0, 8MB)
```

For the first read call, there is no valid buffer yet.
`ensureCurrentBuffer()` is called, and for the first `read()`, the prefetch count is set to 1.

The current block (block 0) is read synchronously, while the block to be prefetched (block 1) is
read asynchronously.

The `CachingBlockManager` is responsible for getting buffers from the buffer pool and reading data
into them. Acquiring a buffer from the pool works as follows, and is sketched in code further below:

* The buffer pool keeps a map of allocated buffers and a pool of available buffers.
The size of this pool is the prefetch block count + 1.
If the prefetch block count is 8, the buffer pool has a size of 9.
* If the pool is not yet at capacity, create a new buffer and add it to the pool.
* If it is at capacity, check whether any buffers with state *DONE* can be released.
Releasing a buffer means removing it from the allocated map and returning it to the pool of
available buffers.
* If there are currently no buffers with state *DONE* then nothing will be released, so retry the
  above step at a fixed interval a few times until a buffer becomes available.
* If after multiple retries there are still no available buffers, release a buffer in the *READY*
  state: the buffer for the block furthest from the current block is released.

Once a buffer has been acquired by `CachingBlockManager`, if the buffer is in the *READY* state, it
is returned.
This means that data was already read into this buffer asynchronously by a prefetch.
If its state is *BLANK* then data is read into it using
`S3Reader.read(ByteBuffer buffer, long offset, int size)`.

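The sketch below restates this acquisition policy in code; `BufferPoolSketch`, `BufferSlot` and the
retry values are invented for illustration and do not mirror the actual `BufferPool` and
`CachingBlockManager` classes:

```
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Invented for illustration; not the actual BufferPool/CachingBlockManager code.
class BufferPoolSketch {
  enum State { BLANK, PREFETCHING, CACHING, READY, DONE }

  static class BufferSlot {
    int blockNumber = -1;
    State state = State.BLANK;
  }

  static final int PREFETCH_BLOCK_COUNT = 8;                // fs.s3a.prefetch.block.count
  static final int POOL_SIZE = PREFETCH_BLOCK_COUNT + 1;    // 9 buffers in total
  static final int MAX_RETRIES = 3;                         // assumed retry limit

  final List<BufferSlot> allocated = new ArrayList<>();
  final Deque<BufferSlot> available = new ArrayDeque<>();

  BufferSlot acquire(int currentBlock) throws InterruptedException {
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
      if (!available.isEmpty()) {
        return available.poll();                            // reuse a released buffer
      }
      if (allocated.size() + available.size() < POOL_SIZE) {
        return new BufferSlot();                            // pool not yet at capacity
      }
      // At capacity: return any DONE buffers to the available pool, then retry.
      for (Iterator<BufferSlot> it = allocated.iterator(); it.hasNext();) {
        BufferSlot slot = it.next();
        if (slot.state == State.DONE) {
          it.remove();
          available.add(slot);
        }
      }
      Thread.sleep(100);                                    // fixed retry interval
    }
    // Still nothing free: release the READY buffer furthest from the current block.
    BufferSlot victim = allocated.stream()
        .filter(s -> s.state == State.READY)
        .max(Comparator.comparingInt(s -> Math.abs(s.blockNumber - currentBlock)))
        .orElseThrow(IllegalStateException::new);
    allocated.remove(victim);
    return victim;
  }
}
```
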
For the second read call, `in.read(buffer, 0, 8MB)`, since the block size is 8MB and only 5MB
of block 0 has been read so far, the remaining 3MB of block 0 is read first.
Once all data has been read from this block, `S3CachingInputStream` requests the next block
(block 1), which will already have been prefetched, so it can start reading from it immediately.
While reading from block 1 it will also issue prefetch requests for the following blocks.
The number of blocks to be prefetched is determined by `fs.s3a.prefetch.block.count`.
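
A rough sketch of how the asynchronous prefetches for the following blocks could be queued; the
class, the executor sizing and `readBlock()` below are invented stand-ins for the
`CachingBlockManager` read path, not the actual implementation:

```
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of queueing the asynchronous prefetches described above.
class PrefetchSketch {
  static final int PREFETCH_BLOCK_COUNT = 8;     // fs.s3a.prefetch.block.count
  final ExecutorService pool = Executors.newFixedThreadPool(PREFETCH_BLOCK_COUNT);

  void prefetchFollowing(int currentBlock, int lastBlock) {
    // Queue an asynchronous read for each of the next blocks, up to the limit.
    for (int b = currentBlock + 1;
         b <= Math.min(currentBlock + PREFETCH_BLOCK_COUNT, lastBlock); b++) {
      final int block = b;
      CompletableFuture.runAsync(() -> readBlock(block), pool);
    }
  }

  void readBlock(int block) {
    // Placeholder: in the real stream this reads the block into a pooled buffer.
  }
}
```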

##### Random Reads

If the caller makes the following calls:

```
in.read(buffer, 0, 5MB)
in.seek(10MB)
in.read(buffer, 0, 4MB)
in.seek(2MB)
in.read(buffer, 0, 4MB)
```

The `S3CachingInputStream` also caches prefetched blocks.
This happens when a `seek()` is issued to a position outside the current block and the current block
has not yet been fully read.

For the above read sequence, when the `seek(10MB)` call is issued, block 0 has not been read
completely, so it is cached, as the caller will probably want to read from it again.

When `seek(2MB)` is called, the position is back inside block 0.
The next read can now be satisfied from the locally cached block file, which is typically orders of
magnitude faster than a network-based read.
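
A condensed sketch of that caching decision; every name below is invented for illustration and does
not correspond to actual `S3CachingInputStream` fields or methods:

```
// Illustrative only; not the actual S3CachingInputStream code.
class SeekSketch {
  final long blockSize = 8L * 1024 * 1024;   // fs.s3a.prefetch.block.size
  long currentPos;
  boolean currentBlockFullyRead;

  void seek(long targetPos) {
    long currentBlock = currentPos / blockSize;
    long targetBlock = targetPos / blockSize;
    if (targetBlock != currentBlock && !currentBlockFullyRead) {
      // Seeking away from a partially read block: keep it on local disk, since
      // the caller may come back to it (as seek(2MB) does in the example above).
      cacheCurrentBlock();
    }
    currentPos = targetPos;
  }

  void cacheCurrentBlock() {
    // Placeholder: the real stream hands the block to SingleFilePerBlockCache.
  }
}
```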