Feature Request: Concatenation of DBN files in a single stream
When working with high-granularity data, it is common to have directories with hundreds or even thousands of DBN files, where each file is the time-series continuation of the previous one. Very often there is a need to combine the contents of these files into a single stream of data, e.g., for proper windowing of the data or as input to a machine-learning training task.
While experimenting with the `dbn` CLI tool and with the Python library, I wasn't able to find any command or helper function to achieve this easily. The workaround was to load the DBN files one by one, convert each one to a pandas DataFrame, and use `pd.concat` to merge them all into a single DataFrame. However, this process is slow and memory-intensive, and it creates multiple intermediary DataFrames just to end up with a single stream. Also, because the data has to be converted to a pandas DataFrame, the benefits of the DBN format are lost along the way.
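For reference, a minimal sketch of that workaround is below. The helper name `concat_frames` is hypothetical, and the per-file loader is passed in as a parameter so the sketch is self-contained; with databento installed, `load` would be `lambda p: db.DBNStore.from_file(p).to_df()`.

```python
# Sketch of the DataFrame-based workaround: load each matching file,
# convert it to a DataFrame, and concatenate everything in path order.
import glob
from typing import Callable

import pandas as pd


def concat_frames(pattern: str, load: Callable[[str], pd.DataFrame]) -> pd.DataFrame:
    """Load every file matching `pattern` and concatenate in sorted path order."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(pattern)
    # pd.concat materializes all intermediate frames at once,
    # which is exactly the memory cost described above.
    return pd.concat((load(p) for p in paths))
```

Sorting the glob results is what preserves the time-series order when the files are named sequentially.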
Current behavior
Trying to use the CLI with multiple input files and a single output destination is not supported:
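An invocation along these lines (filenames are placeholders; the exact flags used in the original session are not shown here):

```shell
# Attempted concatenation via the CLI; the tool takes a single
# input file, so additional positional arguments are rejected.
dbn file-00.dbn.zst file-01.dbn.zst --output combined.dbn.zst
```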
Trying to load several files at once from the Python library is also not supported:
>>> import databento as db
>>> data = db.DBNStore.from_file('./*.dbn.zst')
FileNotFoundError Traceback (most recent call last)
Cell In[1], line 4
1 import os
2 import databento as db
----> 4 data = db.DBNStore.from_file('./*.dbn.zst')
File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:649, in DBNStore.from_file(cls, path)
627 @classmethod
628 def from_file(cls, path: PathLike[str] | str) -> DBNStore:
629 """
630 Load the data from a DBN file at the given path.
631
(...)
647
648 """
--> 649 return cls(FileDataSource(path))
File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:145, in FileDataSource.__init__(self, source)
142 self._path = Path(source)
144 if not self._path.is_file() or not self._path.exists():
--> 145 raise FileNotFoundError(source)
147 if self._path.stat().st_size == 0:
148 raise ValueError(
149 f"Cannot create data source from empty file: {self._path.name}",
150 )
FileNotFoundError: ./*.dbn.zst
The same happens with a list of files:
>>> import databento as db
>>> data = db.DBNStore.from_file(['./file-00.dbn.zst', './file-01.dbn.zst'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 4
1 import os
2 import databento as db
----> 4 data = db.DBNStore.from_file(['./file-00.dbn.zst', './file-01.dbn.zst'])
File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:649, in DBNStore.from_file(cls, path)
627 @classmethod
628 def from_file(cls, path: PathLike[str] | str) -> DBNStore:
629 """
630 Load the data from a DBN file at the given path.
631
(...)
647
648 """
--> 649 return cls(FileDataSource(path))
File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:142, in FileDataSource.__init__(self, source)
141 def __init__(self, source: PathLike[str] | str):
--> 142 self._path = Path(source)
144 if not self._path.is_file() or not self._path.exists():
145 raise FileNotFoundError(source)
File /opt/conda/.../python3.11/pathlib.py:871, in Path.__new__(cls, *args, **kwargs)
869 if cls is Path:
870 cls = WindowsPath if os.name == 'nt' else PosixPath
--> 871 self = cls._from_parts(args)
872 if not self._flavour.is_supported:
873 raise NotImplementedError("cannot instantiate %r on your system"
874 % (cls.__name__,))
File /opt/conda/.../python3.11/pathlib.py:509, in PurePath._from_parts(cls, args)
504 @classmethod
505 def _from_parts(cls, args):
506 # We need to call _parse_args on the instance, so as to get the
507 # right flavour.
508 self = object.__new__(cls)
--> 509 drv, root, parts = self._parse_args(args)
510 self._drv = drv
511 self._root = root
File /opt/conda/.../python3.11/pathlib.py:493, in PurePath._parse_args(cls, args)
491 parts += a._parts
492 else:
--> 493 a = os.fspath(a)
494 if isinstance(a, str):
495 # Force-cast str subclasses to str (issue #21127)
496 parts.append(str(a))
TypeError: expected str, bytes or os.PathLike object, not list
Expected behavior
The commands and function calls above should work as intended, meaning:
- The CLI tool should accept multiple positional arguments and stream the content of each file, one by one, into the desired output in the specified order.
- The helper function `DBNStore.from_file` in Python should accept either a glob pattern or a list of filenames, returning a single stream of data from all the matching files in the provided sequence.
- Similar adjustments should be made to the Rust library.
Added Value
If the `dbn` command-line tool provides an easy way to combine multiple DBN files into a single one, the issue reported above can be solved by a simple preprocessing step that merges all the necessary files, so they can later be loaded as a single stream (for example, in a Python application).
If the library functions are adapted to load multiple files at once, the benefits are even greater, since the final result would be achievable from the programming language itself.
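In the meantime, a lazier alternative to the DataFrame workaround is to chain per-file record iterators. The sketch below is loader-agnostic and the helper name `chain_files` is hypothetical; with databento, `open_stream` could be `db.DBNStore.from_file` itself, since a `DBNStore` can be iterated over its decoded records (an assumption worth checking against the library docs).

```python
# Hypothetical helper: stream items from many files in sorted path
# order, without building any intermediate DataFrames.
import glob
from itertools import chain
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def chain_files(pattern: str, open_stream: Callable[[str], Iterable[T]]) -> Iterator[T]:
    """Lazily yield items from each file matching `pattern`, in path order."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(pattern)
    return chain.from_iterable(open_stream(p) for p in paths)
```

Because the result is an iterator, only one file's worth of records needs to be decoded at a time.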