

Allow concatenation of several DBN files in a single data stream #54

Open
schrodervictor opened this issue Jun 24, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@schrodervictor

Feature Request: Concatenation of DBN files in a single stream

When working with high-granularity data, it is common to have directories containing hundreds or even thousands of DBN files, where each file is the timeseries continuation of the previous one. Very often there is a need to combine the contents of these files into a single stream of data, e.g. for proper windowing of the data or to serve as input to a machine learning training task.

While experimenting with the dbn CLI tool and with the Python library, I wasn't able to find any command or helper function to achieve this easily. The workaround was to load the DBN files one by one, convert each to a Pandas DataFrame and use the pd.concat function to merge them all into a single DataFrame. However, this process is slow and memory intensive, and it creates multiple intermediary Pandas DataFrames just to obtain a single stream at the end. Also, because the data has to be converted into Pandas DataFrames, none of the benefits of the DBN format are available in this situation.
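For reference, the workaround described above can be sketched roughly as follows. `combine_to_dataframe` and `load_frame` are hypothetical names introduced here for illustration; in practice `load_frame` would be something like `lambda p: db.DBNStore.from_file(p).to_df()`:

```python
import pandas as pd

def combine_to_dataframe(paths, load_frame):
    # load_frame is a caller-supplied function that turns one file path
    # into a DataFrame (e.g. via databento's DBNStore). Every file becomes
    # its own intermediate DataFrame, and pd.concat then copies them all
    # into one new frame, which is what makes this approach slow and
    # memory-hungry for thousands of files.
    frames = [load_frame(path) for path in paths]
    return pd.concat(frames)
```

This illustrates why the approach scales poorly: peak memory holds every intermediate frame plus the concatenated copy.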

Current behavior

Passing multiple input files and a single output destination to the CLI is not supported:

$ dbn ./file-00.dbn.zst ./file-01.dbn.zst --out combined.dbn
error: unexpected argument './file-01.dbn.zst' found

Trying to load several files at once with the Python library is also not supported:

>>> import databento as db
>>> data = db.DBNStore.from_file('./*.dbn.zst')

FileNotFoundError                         Traceback (most recent call last)
Cell In[1], line 4
      1 import os
      2 import databento as db
----> 4 data = db.DBNStore.from_file('./*.dbn.zst')

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:649, in DBNStore.from_file(cls, path)
    627 @classmethod
    628 def from_file(cls, path: PathLike[str] | str) -> DBNStore:
    629     """
    630     Load the data from a DBN file at the given path.
    631 
   (...)
    647 
    648     """
--> 649     return cls(FileDataSource(path))

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:145, in FileDataSource.__init__(self, source)
    142 self._path = Path(source)
    144 if not self._path.is_file() or not self._path.exists():
--> 145     raise FileNotFoundError(source)
    147 if self._path.stat().st_size == 0:
    148     raise ValueError(
    149         f"Cannot create data source from empty file: {self._path.name}",
    150     )

FileNotFoundError: ./*.dbn.zst

The same happens with a list of files:

>>> import databento as db
>>> data = db.DBNStore.from_file(['./file-00.dbn.zst', './file-01.dbn.zst'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 4
      1 import os
      2 import databento as db
----> 4 data = db.DBNStore.from_file(['./file-00.dbn.zst', './file-01.dbn.zst'])

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:649, in DBNStore.from_file(cls, path)
    627 @classmethod
    628 def from_file(cls, path: PathLike[str] | str) -> DBNStore:
    629     """
    630     Load the data from a DBN file at the given path.
    631 
   (...)
    647 
    648     """
--> 649     return cls(FileDataSource(path))

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:142, in FileDataSource.__init__(self, source)
    141 def __init__(self, source: PathLike[str] | str):
--> 142     self._path = Path(source)
    144     if not self._path.is_file() or not self._path.exists():
    145         raise FileNotFoundError(source)

File /opt/conda/.../python3.11/pathlib.py:871, in Path.__new__(cls, *args, **kwargs)
    869 if cls is Path:
    870     cls = WindowsPath if os.name == 'nt' else PosixPath
--> 871 self = cls._from_parts(args)
    872 if not self._flavour.is_supported:
    873     raise NotImplementedError("cannot instantiate %r on your system"
    874                               % (cls.__name__,))

File /opt/conda/.../python3.11/pathlib.py:509, in PurePath._from_parts(cls, args)
    504 @classmethod
    505 def _from_parts(cls, args):
    506     # We need to call _parse_args on the instance, so as to get the
    507     # right flavour.
    508     self = object.__new__(cls)
--> 509     drv, root, parts = self._parse_args(args)
    510     self._drv = drv
    511     self._root = root

File /opt/conda/.../python3.11/pathlib.py:493, in PurePath._parse_args(cls, args)
    491     parts += a._parts
    492 else:
--> 493     a = os.fspath(a)
    494     if isinstance(a, str):
    495         # Force-cast str subclasses to str (issue #21127)
    496         parts.append(str(a))

TypeError: expected str, bytes or os.PathLike object, not list

Expected behavior

The commands and function calls above should work as intended, meaning:

  • The CLI tool should accept multiple positional arguments and stream the contents of the files sequentially, in the specified order, into the desired output
  • The DBNStore.from_file helper in Python should accept either a glob pattern or a list of filenames, returning a single stream of data over all the matching files in the provided order
  • Similar adjustments should be made to the Rust library
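One possible input-handling semantic for such an API is sketched below. `resolve_inputs` is a hypothetical helper, not part of databento; it only illustrates how a glob pattern, a list, or a single path could be normalized into an ordered list of files:

```python
import glob
from pathlib import Path

def resolve_inputs(path_or_paths):
    # Hypothetical resolution rules for the proposed API:
    # - a list/tuple is kept in the given order
    # - a glob pattern expands to its matches, sorted lexicographically
    # - anything else passes through as a single path
    if isinstance(path_or_paths, (list, tuple)):
        return [Path(p) for p in path_or_paths]
    matches = sorted(glob.glob(str(path_or_paths)))
    if matches:
        return [Path(m) for m in matches]
    return [Path(path_or_paths)]
```

Lexicographic sorting of glob matches keeps zero-padded, time-ordered file names (file-00, file-01, ...) in sequence.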

Added Value

If the dbn command line tool provided an easy way to merge multiple DBN files into a single one, the issue reported above could be solved by a simple preprocessing step in which all the necessary files are merged, so that they can later be loaded as a single stream (for example, in a Python application).

If the library functions were adapted to load multiple files at once, the benefits would be even greater, as the final result would be achievable directly from the programming language itself.

@schrodervictor schrodervictor added the enhancement New feature or request label Jun 24, 2024
@threecgreen
Contributor

Hi,
We have a roadmap item for supporting merging DBN files in the CLI and client libraries.

You can create a memory-efficient stream by chaining iterators over the separate DBNStores like so:

from itertools import chain
from databento import DBNStore

for record in chain(DBNStore.from_file('file1.dbn'), DBNStore.from_file('file2.dbn')):
    foo(record)

With glob.glob() and some file-name sorting, you can get most of what you suggested.
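Putting the suggestion together, that combination can be sketched as below. `stream_records` is an illustrative helper, not databento API; `load_store` stands in for `DBNStore.from_file` (each DBNStore is iterable over its decoded records):

```python
import glob
from itertools import chain

def stream_records(pattern, load_store):
    # Sort the matched paths so zero-padded, time-ordered file names
    # (file-00, file-01, ...) are consumed in sequence, then lazily
    # chain the record iterators into one stream.
    paths = sorted(glob.glob(pattern))
    return chain.from_iterable(load_store(p) for p in paths)
```

Because the generator inside chain.from_iterable is lazy, only one store's records need to be in flight at a time, avoiding the intermediate-DataFrame overhead of the pd.concat workaround.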
