Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve and formalize TensorBoard's Python filesystem abstraction #5286

Open
nfelt opened this issue Sep 2, 2021 · 1 comment
Open

Improve and formalize TensorBoard's Python filesystem abstraction #5286

nfelt opened this issue Sep 2, 2021 · 1 comment
Assignees
Labels
core:backend theme:performance Performance, scalability, large data sizes, slowness, etc. type:cleanup

Comments

@nfelt
Copy link
Contributor

nfelt commented Sep 2, 2021

This is a generic tracking issue for improvements to TensorBoard's Python filesystem abstraction. (This doesn't impact Rustboard, which today should be the read path for regular users, but there are still users who have to use the Python read path because they can't use Rustboard, and the write path for TensorBoard's summary APIs to write without TensorFlow is still entirely in Python.)

Currently, TensorBoard's Python code relies on the TensorFlow filesystem API via tf.io.gfile, which has built-in support for a number of protocols besides local disk. When TensorFlow is not available, we fall back to a stub implementation of the tf.io.gfile API in tensorboard/compat/tensorflow_stub/io/gfile.py, which today only supports local disk and S3. The stub implementation achieves some reuse between filesystem implementations by having an intermediate set of methods that both implementations expose, but this abstraction A) isn't even defined other than by virtue of common methods, i.e. there's no abstract base class or spec, and B) has a number of issues.

Here are a few of the issues with that abstraction:

  • underspecified behavior in a number of places (e.g. what characters glob supports), which makes it harder to guarantee much in the way of consistency across implementations
  • little clarity/documentation about how filenames are treated as bytes vs strings
  • no "file" object at all, which is mostly ok for remote filesystems where everything is an RPC anyway, but is significantly worse for filesystems where there actually might be local state between, say, multiple reads or writes to a single file (e.g. local file descriptors, buffers, caches, etc.). Right now we ignore all of that, so it's conceivable that in cases we end up with quadratic behavior where each read at offset N is independent and must first seek past all the previous data.
  • no walk() method for filesystems, just a function, as pointed out in gfile: add support for fsspec filesystems #5248 (comment), so filesystems aren't able to take advantage of fast paths they might support

We already have a goal (bottom of #3666) to ultimately migrate away from the stub approach and instead properly inject a filesystem abstraction into places that do I/O. As part of that, we should consider revisiting this abstraction and formalizing it properly. One possibility is to just adopt the existing fsspec abstraction, since it's a whole project around this idea: https://filesystem-spec.readthedocs.io/en/latest/

If we did that, we'd likely still want to be able to use the TF filesystem APIs when available (so that existing TensorBoard users who also have TensorFlow and thus benefit from built-in support for cloud filesystems don't experience a regression), so one open question would be determining whether it's possible to shoehorn the TF filesystem support into this setup.

@nfelt nfelt added core:backend type:cleanup theme:performance Performance, scalability, large data sizes, slowness, etc. labels Sep 2, 2021
@nfelt nfelt changed the title Streamline and improve TensorBoard's Python filesystem abstraction Improve and formalize TensorBoard's Python filesystem abstraction Sep 2, 2021
@sanatmpa1 sanatmpa1 assigned sanatmpa1 and nfelt and unassigned sanatmpa1 Sep 2, 2021
@nfelt
Copy link
Contributor Author

nfelt commented Jan 7, 2022

Some updates: we've now added fsspec filesystem support via #5248, but it's still optional (we don't depend on fsspec, so users have to install it) and it's also only accessible when tensorflow is not installed, since the current tensorboard.compat.tf logic ensures we always use TF's filesystem API when it's available.

Also, to clarify the nature of the abstraction discussed in this issue, it isn't really a filesystem per se, or at least not in the usual sense - we actually want it to represent a one-level-higher abstraction, a "filesystem proxy" that can resolve scheme-prefixed paths to multiple concrete filesystems. (That said, the ambiguity in terminology arises because it is sort of tempting to shoehorn this proxy object itself into a filesystem interface - like TB does with DataProvider and DispatchingDataProvider, more or less - but just like DispatchingDataProvider, this abstraction can get kind of leaky/unwieldy. And in practice with fsspec, it has already led to confusion around how things like globbing interact with schemes and prefixes.)

Furthermore, we actually have multiple "filesystem proxy" implementations in TensorBoard today:

  1. TensorFlow's tf.io.gfile (which has different sets of concrete filesystems aka supported schemes, depending on whether tensorflow-io has been imported or not; the tf.io.gfile.get_registered_schemes() API was added so it's possible to dynamically find out if a scheme is supported)
  2. fsspec (where analogously we can call fsspec.get_filesystem_class() to find out if a specific scheme is supported; some schemes require additional imports/dependencies, including S3 which requires s3fs)
  3. built-in support, aka the no-TF stub implementation - just the local filesystem support using Python's standard library (plus our S3 boto-based support, which probably should be removed in favor of the S3 support in TF I/O and fsspec since we don't really need 3 different ways to access S3...)

Right now we have fallback from 2 to 3, but the choice of 1 vs 2&3 is mutually exclusive. We should fix this so TF users aren't artificially limited to only the TF filesystems, and can benefit from the fsspec support. The end state we want is that individual I/O calls can be dispatched across all possible proxies. This could be thought of at a 4th filesystem proxy in its own right, the "automatic" one, that supports the union of all the various concrete filesystems/schemes supported by TF, fsspec, and our built-in support. If users don't actually want automatic dispatch for some reason, we can provide a flag to specify that only a single proxy should be used (e.g. just TF).

For the automatic dispatch we'd basically have a flow roughly like this pseudocode:

scheme = ... # any prefix up to "://" - don't just use ":" to avoid confusion w/ Windows paths
options = []

def can_import(module_name, pip_name=None):
  pip_name = pip_name or module_name
  try:
    __import__(package)
    return True
  except ImportError:
    options.append("pip install " + pip_name)
    return False

if can_import("tensorflow"):
  if scheme in tf.io.gfile.get_registered_schemes():
    return tf_proxy
  if can_import("tensorflow_io", "tensorflow-io"):
    if scheme in tf.io.gfile.get_registered_schemes():
      return tf_proxy

if scheme in BUILT_IN_SUPPORTED_SCHEMES:
  return builtin_proxy

if can_import("fsspec"):
  try:
    fsspec.get_filesystem_class(scheme)
    return fsspec_proxy
  except ImportError as e:
    options.append(str(e))
  except ValueError as e:
    # Unrecognized scheme, fall through
    pass

raise UnsupportedScheme(options)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core:backend theme:performance Performance, scalability, large data sizes, slowness, etc. type:cleanup
Projects
None yet
Development

No branches or pull requests

2 participants