-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a new storage specific Env API #5761
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic structure looks good to me! It's very hard to have confidence unless seeing an implementation. I suggest we hold the PR from being merged, and merge it together when the PR in which PosixEnv is moved to this new interface.
include/rocksdb/fs_env.h
Outdated
std::string file_path; | ||
|
||
// To be set by the FSEnv implementation | ||
std::string msg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something tricky to design. Ideally, we want something that can be added up for multiple calls. For the same type of latency, for example, they can go together. But there might be something that is not counter-based. So this is a little bit tricky. Maybe both of a map<string, uint64> and a string msg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think a map is a good idea to pass back counters that can be used for debugging. We could have a map<string, string> and require the env implementation to format counters as string?
include/rocksdb/fs_env.h
Outdated
static const char* Type() { return "Environment"; } | ||
|
||
// Loads the environment specified by the input value into the result | ||
static IOStatus LoadEnv(const std::string& value, FSEnv** result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who is supposed to implement this function? For example, if a user implements HDFSEnv, how it is supposed to return from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a generic implementation of Env::LoadEnv
that instantiates an Env object from the ObjectRegistry. I guess we need to do the same for FSEnv
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I think the LoadFSEnv should return a std:::shared_ptr *. The use of the static pointers is problematic from a memory management perspective, IMO.
include/rocksdb/fs_env.h
Outdated
// The returned file will only be accessed by one thread at a time. | ||
virtual IOStatus NewSequentialFile(const std::string& fname, | ||
const IOOptions& io_opts, | ||
std::unique_ptr<FSSequentialFile>* result, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output parameters after input ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks for pointing it out. I tried to follow that in general, but missed a few cases.
include/rocksdb/fs_env.h
Outdated
// These values match Linux definition | ||
// https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fcntl.h#n56 | ||
enum WriteLifeTimeHint { | ||
WLTH_NOT_SET = 0, // No hint information set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming as kWlthNotSet
include/rocksdb/fs_env.h
Outdated
// IO_LOW - Typically background reads/writes such as compaction/flush | ||
// IO_HIGH - Typically user reads/synchronous WAL writes | ||
enum class IOPriority : uint8_t { | ||
IO_LOW, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming as kIoLow
or kIOLow
.
include/rocksdb/status.h
Outdated
@@ -18,31 +18,16 @@ | |||
|
|||
#include <string> | |||
#include "rocksdb/slice.h" | |||
#ifdef OS_WIN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OS_WIN is a RocksDB build-time flag and may not be available when projects are built on top of RocksDB. This should not be used under include/rocksdb
.
include/rocksdb/status.h
Outdated
static Status CompactionTooLarge(const Slice& msg, | ||
const Slice& msg2 = Slice()) { | ||
return Status(kCompactionTooLarge, msg, msg2); | ||
static T CompactionTooLarge(const Slice& msg, const Slice& msg2 = Slice()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that one point of having a separate status is to have a separate list of messages. For example, this compaction-too-large is not applicable to IOStatus. Having them sharing the same template with the same set of messages defeat the purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason for introducing IOStatus
was to keep the metadata separate. Status
has severity
, which it doesn't make sense for the FSEnv implementation to try to interpret, and IOStatus
has its own metadata with doesn't apply to user facing RocksDB APIs. We could separate out the error codes too, but there will be a fair amount of overlap, and I was thinking that it could be patterned after errno
, which is a superset of all system call error codes.
include/rocksdb/status.h
Outdated
|
||
namespace rocksdb { | ||
|
||
class Status { | ||
template <class T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RocksDB code base has very limited use of template (which follows the guideline of Google C++ Style https://google.github.io/styleguide/cppguide.html#Template_metaprogramming ), and I don't remember any public interface uses template. This makes the public interface harder to understand. I suggest we think twice before using template here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I wanted to reuse the code/subcodes and template seemed to be the way to do that. But on further reflection, it allows us to reuse the values but the names have to be qualified by classname. Maybe we can use Status::Code
and Status::SubCode
directly in IOStatus
? That way, we can have a single set of error codes, but have a distinct set of named constructors in IOStatus
to indicate what's allowable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is hard to review without the corresponding changes to PosixEnv et al.
This is also listed as issue: #4746
include/rocksdb/fs_env.h
Outdated
|
||
virtual ~FSEnv(); | ||
|
||
static const char* Type() { return "Environment"; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be something other than "Environment", like "FSEnvironment", to prevent it from clashing with the Env value.
include/rocksdb/fs_env.h
Outdated
// These are hints and are not necessarily guaranteed to be honored. | ||
// More hints can be added here in the future to indicate things like | ||
// storage media (HDD/SSD) to be used, replication level etc. | ||
struct IOOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see where you might want this on a per-file basis, but why do you want them on a per-request basis? Can you give an example where one read/write from a file would be a different priority than another to the same file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reads, compaction vs user reads is one use case. The long term plan is to also allow users to specify a timeout when calling Get/MultiGet, and that can be passed through to the FSEnv.
For writes, we might want to assign different priorities/timeouts to background WAL flush vs a sync write requested by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout might be a per-operation parameter, but the IOType is not likely to be (are you suggesting the same file might be used for different types of IO? I also doubt that read or write operations to a single IO file would have different priorities (this read his high but the next one is low does not make sense to me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use a single TableReader object for user reads and compaction reads, so yes its possible to have different priority IOs to the same file. Even if we use a separate TableReader for compaction, I wonder if we might want to allow users to specify a priority. All user reads would go through the same file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use IOOptions for this though, rather than using the ReadOptions that are already present? Why add a new struct with new parameters rather than extending the one that already exists? It seems like there are an awful lot of places where you would need to update (SST reader/writer, trace methods, etc)
include/rocksdb/fs_env.h
Outdated
static const char* Type() { return "Environment"; } | ||
|
||
// Loads the environment specified by the input value into the result | ||
static IOStatus LoadEnv(const std::string& value, FSEnv** result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I think the LoadFSEnv should return a std:::shared_ptr *. The use of the static pointers is problematic from a memory management perspective, IMO.
include/rocksdb/fs_env.h
Outdated
const IOOptions& io_opts, | ||
std::unique_ptr<FSWritableFile>* result, | ||
const EnvOptions& env_opts, | ||
IODebugContext* dbg) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the IOOptions and EnvOptions be merged here? Why have both arguments?
Also, is there a reason to rename the file classes to FS* names. It seems like it will cause a lot of changes for not much gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOOptions is per-IO, whereas EnvOptions is for file open. There's no overlap between the two.
Also, is there a reason to rename the file classes to FS* names. It seems like it will cause a lot of changes for not much gain.
Do you mean change the APIs but leave the class name unchanged? That would cause code compatibility issues.
include/rocksdb/fs_env.h
Outdated
|
||
// A structure to pass back some debugging information from the FSEnv | ||
// implementation to RocksDB in case of an IO error | ||
struct IODebugContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand the use of this struct. Why would there be a different one for every API call? Who is managing this class (it is usually passed in as a pointer -- can it be null? what is the lifetime, etc?)
Would this be better as something that is a member of the FSEnv?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I envisioned this being similar to PerfContext or IOStatsContext, i.e it could keep debugging counters for a single IO operation. I think its definition will have to go through a couple of iterations before it becomes meaningful. I will update the comments to clarify allocation, lifetime etc.
include/rocksdb/io_status.h
Outdated
|
||
namespace rocksdb { | ||
|
||
class IOStatus : public BaseStatus<IOStatus> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the need for a new IOStatus class to add all of the extensions to Status. But would it be better to have IOStatus extend Status and possibly have a new "Code" in Status for IO errors? Then code could treat the return from a FS operation as either a Status or IOStatus if they cared about the specific sub-codes for retrying the errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting idea. I tried to do something similar here, except by defining a base class that both Status and IOStatus extend. I don't think either approach would alter the impact to the code. The IOStatus needs to be propagated up through the *FileReader/*FileWriter classes and converted to the appropriate Status by the caller based on the context. Status::severity would be set based on the scope and retryability of the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an IOStatus extends a Status, then code like:
Status s =
and
IOStatus s =
could both be valid. As you have it defined, the Status and IOStatus are not related, and therefore not interchangeable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree this would reduce the code changes required. I'm concerned that by implementing it this way, it would be very easy for developers to ignore IOStatus. Maintaining the error handling features would be difficult. @siying Any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrambacher Thanks for the review! I plan to port PosixEnv to this new interface as a next step. I just wanted to propose an interface first and get some early feedback on the direction.
include/rocksdb/fs_env.h
Outdated
// These are hints and are not necessarily guaranteed to be honored. | ||
// More hints can be added here in the future to indicate things like | ||
// storage media (HDD/SSD) to be used, replication level etc. | ||
struct IOOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reads, compaction vs user reads is one use case. The long term plan is to also allow users to specify a timeout when calling Get/MultiGet, and that can be passed through to the FSEnv.
For writes, we might want to assign different priorities/timeouts to background WAL flush vs a sync write requested by the user.
include/rocksdb/fs_env.h
Outdated
const IOOptions& io_opts, | ||
std::unique_ptr<FSWritableFile>* result, | ||
const EnvOptions& env_opts, | ||
IODebugContext* dbg) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOOptions is per-IO, whereas EnvOptions is for file open. There's no overlap between the two.
Also, is there a reason to rename the file classes to FS* names. It seems like it will cause a lot of changes for not much gain.
Do you mean change the APIs but leave the class name unchanged? That would cause code compatibility issues.
include/rocksdb/io_status.h
Outdated
|
||
namespace rocksdb { | ||
|
||
class IOStatus : public BaseStatus<IOStatus> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting idea. I tried to do something similar here, except by defining a base class that both Status and IOStatus extend. I don't think either approach would alter the impact to the code. The IOStatus needs to be propagated up through the *FileReader/*FileWriter classes and converted to the appropriate Status by the caller based on the context. Status::severity would be set based on the scope and retryability of the error.
include/rocksdb/fs_env.h
Outdated
|
||
// A structure to pass back some debugging information from the FSEnv | ||
// implementation to RocksDB in case of an IO error | ||
struct IODebugContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I envisioned this being similar to PerfContext or IOStatsContext, i.e it could keep debugging counters for a single IO operation. I think its definition will have to go through a couple of iterations before it becomes meaningful. I will update the comments to clarify allocation, lifetime etc.
include/rocksdb/fs_env.h
Outdated
// To be set by the FSEnv implementation | ||
std::string msg; | ||
|
||
IODebugContext(IOOptions& options) { opts = options; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a IODebugContext is created from an IOOptions, why does every API require both? Can we do with just one of these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @anand1976 for the PR. I have taken a pass with a few comments.
include/rocksdb/fs_env.h
Outdated
// of the APIs is of type IOStatus, which can indicate an error code/sub-code, | ||
// as well as metadata about the error such as its scope and whether its | ||
// retryable. | ||
class FSEnv { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name FSEnv
somehow tends to lead to me thinking of it as a kind of Env
, which is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that comment. I was wondering why not call it simply "FileSystem"
include/rocksdb/fs_env.h
Outdated
virtual IOStatus NewSequentialFile(const std::string& fname, | ||
const IOOptions& io_opts, | ||
std::unique_ptr<FSSequentialFile>* result, | ||
const EnvOptions& env_opts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
env_opts
should go before result
.
include/rocksdb/fs_env.h
Outdated
IODebugContext* dbg) = 0; | ||
// These values match Linux definition | ||
// https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fcntl.h#n56 | ||
enum WriteLifeTimeHint { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using enum class in our codebase, and converting to Linux-compatible C-style enums?
include/rocksdb/fs_env.h
Outdated
std::unique_ptr<FSRandomRWFile>* /*result*/, | ||
const EnvOptions& /*options*/, | ||
IODebugContext* /*dbg*/) { | ||
return IOStatus::NotSupported( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why NewRandomRWFile()
returnes NotImplemented
while NewWritableFile()
is a pure virtual function? Does it imply that NewWritableFile()
is something that an FS
must support, while NewRandomRWFile()
is not?
include/rocksdb/fs_env.h
Outdated
const EnvOptions& env_options, | ||
const ImmutableDBOptions& db_options) const; | ||
|
||
// This seems to clash with a macro on Windows, so #undef it here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably use ifndef
around this undef
.
include/rocksdb/fs_env.h
Outdated
|
||
private: | ||
// No copying allowed | ||
FSEnv(const FSEnv&); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we can use =delete
to disallow copy.
include/rocksdb/io_status.h
Outdated
void SetDataLoss(bool data_loss) { data_loss_ = data_loss; } | ||
void SetScope(IOErrorScope scope) { scope_ = scope; } | ||
|
||
bool retryable() { return retryable_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: bool retryable() const
@siying @mrambacher @riversand963 I have updated the PR with a port of PosixEnv. I've also incorporated your comments and feedback. The PR description has also been updated with the basic design. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How hard is it to modify the DB code to directly use the new interface?
The way is done now introduces extra wrapper for every file system call, which I am not sure we are OK with. Also still we don't know how effective the new interface is because we are not directly using them in DB.
The extra wrapper for every file system call is a fair point. I think this can be solved by inlining |
@siying Your other point about not knowing how effective is the new interface is more complicated. In this PR, I have focused on basic error recovery for retryable errors. The DB should be able to automatically recover (ignoring the scope of the error) from errors that have I don't think we will know the effectiveness of the other aspects of the new interface, such as timeouts and priority, until there is an end to end implementation (including a functioning FileSystem implementation from WS). So I think modifying the DB to specify timeout/priority for IO operations should be deferred to a follow-on PR. In the meantime, it should be made clear that the new interface is still experimental, in order to discourage developers from porting their Env implementations to it right away. Thoughts? |
@anand1976 I don't think we can easily inline virtual functions. More importantly, this is not how we want it to eventually look like. Eventually, we want the new API used everywhere, and legacy Env implementation will be wrapped as the new one. So I'm curious how hard it is to skip the reverse wrapped stage and just do what we want to do eventually now. |
@anand1976 I totally agree that we should delay the implementation of all the timeout, io priority, etc. I totally support this plan. Even the retryable can be delayed. My comment is not about having those implemented, but about which interface is used. If DB still uses the old Env interface, we do not really see how the new interface is used. I'm not talking about the new options like timeout, etc, but the functionalities already in use. I know it is very time consuming to go through all the code base and chase where they are used. But this is needed anyway. By going through it earlier, we can also look at whether the interface is a good one for all those places in longer term, which is helpful to avoid changing the interface again. |
The core RocksDB code reads/writes to files through RandomAccessFileReader/WritableFileWriter. It should be relatively straightforward to modify those two to use the new interfaces, as well as deal with the ripple effects (since they would need to be constructed with FSRandomAccessFile/FSWritableFile). This would make it easier in the future to add support in RocksDB for fine-grained error handling, timeouts and IO prioritization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anand1976 @siying I see a couple different things here:
-
The new methods take a few additional arguments (IOOptions, IODebugContext). I am not sure what the point is of these arguments -- why not extend EnvOptions for IOOptions? And if you need a debug context, why not make it part of the constructor rather than the individual methods? I do not see what the usage pattern is for this argument yet.
-
I do not see the need for the FSXXX classes and methods. If the goal is to replace the existing methods, why not just do that rather than create this additional wrapper layer?
-
I do not understand the relationship between Env and FileSystem. I would have thought that an Env has a FileSystem in it but this appears not to be the case. Additionally, I am concerned about the lifetime of a FileSystem. It seems like it needs to be a std:shared_ptr, which it is not currently (I think the same should be true of Env BTW).
include/rocksdb/file_system.h
Outdated
// implementation instead of relying on this default file_system | ||
// | ||
// The result of Default() belongs to rocksdb and must never be deleted. | ||
static FileSystem* Default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this a std:shared_ptr.
include/rocksdb/file_system.h
Outdated
const std::string& fname, const IOOptions& io_opts, | ||
std::unique_ptr<FSRandomAccessFile>* result, const EnvOptions& env_opts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not sure I understand why/how we need an IOOptions and an EnvOptions and an IODebugContext. It seems like one of these would do...
logging/posix_logger.h
Outdated
IOStatus status = IOError("Unable to close log file", "", ret); | ||
if (!status.ok()) { | ||
return Status::IOError("Unable to close log file", ""); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand this code. How does the IOError get converted to an IOStatus and how does it ever translate to ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted this change. The return from IOError() can be directly returned.
include/rocksdb/io_status.h
Outdated
void SetRetryable(bool retryable) { retryable_ = retryable; } | ||
void SetDataLoss(bool data_loss) { data_loss_ = data_loss; } | ||
void SetScope(IOErrorScope scope) { scope_ = scope; } | ||
|
||
bool retryable() const { return retryable_; } | ||
bool data_loss() const { return data_loss_; } | ||
IOErrorScope scope() const { return scope_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make these more symmetrical? It does not seem like SetDataLoss and data_loss are the equivalent setter and getter...
db/db_impl/db_impl_open.cc
Outdated
if (result.file_system != nullptr) { | ||
result.env = new CompositeEnvWrapper(result.env, result.file_system); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't an Env have a FileSystem member rather than the FileSystem bubbling up to a higher level?
@@ -0,0 +1,730 @@ | |||
// Copyright (c) 2019-present, Facebook, Inc. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this all necessary because you do not want to change the Env classes? Why not?
I am a little confused as to why you are not replacing the existing implementation of the File classes, rather than wrapping them. I understand there might be some potential API compatibility issues, but that seems like something that could be worked out in the design.
IOStatus GetChildren(const std::string& dir, const IOOptions& /*opts*/, | ||
std::vector<std::string>* result, | ||
IODebugContext* /*dbg*/) override { | ||
result->clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think every code path in the FS Posix does not use the IOOptions and IODebugContext, which makes it very hard to say that the function signatures you suggest are the ones that should be used. I am especially concerned of the IODebugContext * as I do not understand the expected behavior of this (is nullptr a valid value?) and it is not clear what the lifetime of this object is.
I would prefer having something that is specified when the FileSystem or File object is created, rather than on every function signature. Unless you can tell me why one would use a different argument for different calls for the same file object...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifying something when the File object is created would not work for retrieving fine grained statistics for a single operation. The IODebugContext is somewhat similar in concept to the IOStatsContext, which records some statistics for each IO call, but intended to provide more visibility into the Env which IOStatsContext does not. And unlike IOStatsContext, we want to move away from using thread local storage and explicitly pass a pointer to the object.
The scope of EnvOptions and IOOptions is different. EnvOptions is for file creation, whereas IOOptions is for individual file operations. For the New*File methods, I agree IOOptions and EnvOptions can be combined (maybe the latter can embed the former). IODebugContext is envisioned to be used for per file request tracing. It cannot be embedded in the file object since the file access is multi-threaded.
Replacing the existing methods would break backward compatibility. There are a number of users with custom Env implementations and requiring changes in them would complicate things. We could overload the methods, but we wouldn't be able to define the new ones as pure virtual (doing so would break compatibility).
Agree about FileSystem being a shared_ptr. There's technically no relationship between Env and FileSystem. Going forward, we'd like to have the ability to use multiple FileSystems in the same DB - perhaps for tiering or some other use case. |
@siying PTAL. The new commit ports most of the core RocksDB code to use the new API. The RandomAccessFileReader, SequenceFileReader and WritableFileWriter classes call the new API. Tools and tests still use the old API, either directly or through wrapper classes where needed. There may be some small cleanup needed, which I'll do once we agree on the basic structure of the code. |
I know I sound like a broken record here, but strongly believe that the additional arguments to the methods is going to be confusing and add little value. I would much prefer that the IOOptions and EnvOptions have some is-a relationship to one another, rather than passing two in. As far as the relationship between Env and FileSystem, I still think there is a relationship. At the moment, a large percentage of the methods of Env are FileSystem related. I think an Env has three components -- FileSystem, Threads, and Clock/System. I do not believe that having an Env has-a FileSystem causes any future problems with multiple FileSystems in a single DB. Finally, I also think that the IODebugContext should be a std::shared_ptr and should ONLY be an argument to the NewXXXFile methods. I have a hard time envisioning when you would like to debug specific calls to read for specific files and how you would implement that. I think making it a call to the File constructor could be useful however. But I also believe it should be a std::shared_ptr to guarantee its lifefime. I also think that all of the existing Env in the code base (HDFS, Memory, test ones) should be ported to this new API. That would help validate that the API is correct for more than a single FileSystem implementation. |
@mrambacher the main motivation of the separation of Env and FileSystem is that, we plan to support multiple file systems and sharing some other resource like thread pool, timing, etc. So separating file system related functionality feels something we have to do. I agree that there is going to be a painful transition, but we can't think of a better way. Do you have any suggestion to solve this problem? |
@siying I am not sure how the issues are related. If there are multiple FileSystems, there will need to be some mechanism of getting which one you want (like by name). So Env would probably have two methods: If you intend to support multiples, something must be the container for storing the multiples, so why not the Env? I guess one question is which layer knows which FileSystem to use. At a certain point, someone has to make a decision. In the places that both an Env and FileSystem are required arguments, what is needed from the Env? Another solution would be to invert it. Instead of an Env having a FileSystem, why not have a FileSystem have an Env? What from an Env is useful/important to a FileSystem? |
@mrambacher I agree that getting a file system from a string may be a good idea. It is useful for many use cases. Where to put it is interesting though. I don't have an opinion on that. I don't think there are many places where Env and FileSystem need to be passed in together. In some entrance of major functionality, maybe, but I don't expect it will be common. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic interface looks good to me. I'll try to do more detailed review today.
include/rocksdb/file_system.h
Outdated
// The returned file will only be accessed by one thread at a time. | ||
virtual IOStatus NewSequentialFile(const std::string& fname, | ||
const IOOptions& io_opts, | ||
const EnvOptions& env_opts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to put EnvOptions here? Ideally we don't. I know consolidating the two options may be painful, but fow now, can we build a two way converter so that the API can be clean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scope of EnvOptions is a file, for the duration of the file being open. The scope of IOOptions is the specific IOs that would happen as part of file open. How about having a FileOptions with IOOptions embedded in it? Other methods such as FSRandomAccessFile::Read() can take IOOptions as an argument, while methods such as FieSystem::NewRandomAccessFile() can take a FileOptions as argument.
include/rocksdb/io_status.h
Outdated
IO_ERROR_SCOPE_FILE_SYSTEM, | ||
IO_ERROR_SCOPE_FILE, | ||
IO_ERROR_SCOPE_RANGE, | ||
IO_ERROR_SCOPE_MAX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kIoErrorScopeFileSystem?
Can we overload the existing Env to encapsulate various combinations of file systems and operating systems.? Every Env is a combination of OperatingSystem + FileSystem + hardware. If we can do that elegantly, then we do not have to introduce the FileSystem class to propagate to all the db code. |
@dhruba eventually, we will need to allow one DB to store different files to different file systems. The list of file systems to use will belong to DB's configuration, not Env's. |
@siying agreed. For example, when we had to store files to local xfs as well as AWS S3 fs, we created an AwsEnv. The code that implements this AwsEnv can then decide where to store which file based on its own criteria. Also, we overloaded EnvOptions so that we can add more config parameters for this decision making. |
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…nv (#6414) Summary: In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller. A set of ErrorHandlerFSTests are introduced for testing Pull Request resolved: facebook/rocksdb#6414 Test Plan: pass make asan_check, pass error_handler_fs_test. Differential Revision: D20252421 Pulled By: zhichao-cao fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21 Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR facebook/rocksdb#5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: facebook/rocksdb#6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1 Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…nv (#6414) Summary: In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller. A set of ErrorHandlerFSTests are introduced for testing Pull Request resolved: facebook/rocksdb#6414 Test Plan: pass make asan_check, pass error_handler_fs_test. Differential Revision: D20252421 Pulled By: zhichao-cao fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21 Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…nv (#6414) Summary: In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller. A set of ErrorHandlerFSTests are introduced for testing Pull Request resolved: facebook/rocksdb#6414 Test Plan: pass make asan_check, pass error_handler_fs_test. Differential Revision: D20252421 Pulled By: zhichao-cao fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21 Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…nv (#6414) Summary: In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller. A set of ErrorHandlerFSTests are introduced for testing Pull Request resolved: facebook/rocksdb#6414 Test Plan: pass make asan_check, pass error_handler_fs_test. Differential Revision: D20252421 Pulled By: zhichao-cao fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21 Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: facebook/rocksdb#5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f Signed-off-by: Changlong Chen <levisonchen@live.cn>
…in BG jobs (#6487) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: facebook/rocksdb#6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both
options.env
andoptions.file_system
to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, aCompositeEnvWrapper
has been introduced that inherits fromEnv
and redirects individual methods to either anEnv
implementation or theFileSystem
as appropriate. When options are sanitized duringDB::Open
,options.env
is replaced with a newly allocatedCompositeEnvWrapper
instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The
CompositeEnvWrapper
translatesIOStatus
return code toStatus
, and sets the severity tokSoftError
if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.