Support read and metadata for BytesIO objects #173
Reviewer's Guide
This PR adds support for reading audio and querying metadata from file-like objects (e.g., BytesIO) for SND formats (WAV, FLAC, MP3, OGG). It extends the core IO and info utilities to detect and handle file-like inputs, preserve stream position, and raise clear errors for unsupported formats, and it adds comprehensive tests for the new behavior.
Sequence diagram for read handling of BytesIO file_like objects
sequenceDiagram
actor User
participant Buffer as "io.BytesIO"
participant Read as "audiofile.core.io.read"
participant Utils as "audiofile.core.utils"
participant Soundfile as "soundfile.read/info"
User->>Read: read(Buffer, duration, offset, always_2d, kwargs)
Read->>Utils: is_file_like(Buffer)
Utils-->>Read: True
Read->>Utils: file_extension(Buffer)
Utils-->>Read: None
Note over Read: file_like=True, file_ext=None
alt extension in SNDFORMATS or (file_like and file_ext is None)
Read->>Soundfile: read(Buffer, start, stop, dtype, always_2d)
Soundfile-->>Read: signal, sampling_rate
Read->>Buffer: seek(0)
else unsupported format
Read-->>User: RuntimeError
end
Read-->>User: signal, sampling_rate
Class diagram for updated IO and info functions supporting file_like objects
classDiagram
class Utils {
+bool is_file_like(obj)
+str~None file_extension(path)
MAX_CHANNELS
SNDFORMATS
}
class InfoModule {
+int~None bit_depth(file)
+int channels(file)
+float duration(file, sloppy)
+int samples(file)
+int sampling_rate(file)
}
class IOModule {
+tuple read(file, duration, offset, always_2d, **kwargs)
}
%% Dependencies between modules
InfoModule ..> Utils : uses
IOModule ..> Utils : uses
%% Indicate file or file_like union types via comments
%% InfoModule : file: str | io.IOBase
%% IOModule : file: str | io.IOBase
Codecov Report
✅ All modified and coverable lines are covered by tests.
Hey - I've found 6 issues, and left some high level feedback:
- The error message for unsupported file-like formats ("File-like objects are only supported ...") is duplicated across `read()`, `channels()`, `duration()`, `samples()`, and `sampling_rate()`; consider centralizing this in a small helper to avoid divergence if it ever needs to change.
- `bit_depth()` now accepts file-like objects but, unlike the other info functions, returns `None` instead of raising a `RuntimeError` for unsupported formats; consider aligning its behavior with `channels()`/`duration()`/`samples()`/`sampling_rate()` for consistency with file-like inputs.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The error message for unsupported file-like formats ("File-like objects are only supported ...") is duplicated across `read()`, `channels()`, `duration()`, `samples()`, and `sampling_rate()`; consider centralizing this in a small helper to avoid divergence if it ever needs to change.
- `bit_depth()` now accepts file-like objects but, unlike the other info functions, returns `None` instead of raising a `RuntimeError` for unsupported formats; consider aligning its behavior with `channels()/duration()/samples()/sampling_rate()` for consistency with file-like inputs.
## Individual Comments
### Comment 1
<location> `audiofile/core/utils.py:9-18` </location>
<code_context>
import audmath
+def is_file_like(obj) -> bool:
+ r"""Check if object is a file-like object.
+
+ A file-like object is an object with a ``read`` method,
+ such as ``io.BytesIO``.
+
+ Args:
+ obj: object to check
+
+ Returns:
+ ``True`` if object is file-like
+
+ """
+ return hasattr(obj, "read")
+
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Tighten `is_file_like` to match the actual expectations of callers (e.g. `seek` support).
Callers like `bit_depth`, `channels`, `duration`, `samples`, `sampling_rate`, and `read` unconditionally call `file.seek(0)` whenever `is_file_like(file)` is true. Since `is_file_like` only checks for `read`, non-seekable objects with `read` will pass and then fail at runtime when `seek` is called. Please either (a) update `is_file_like` to also require `seek` (and maybe `tell`), or (b) make those callers robust to non-seekable streams (e.g., check for `seek` before calling it or explicitly document that only seekable file-like objects are supported).
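One way to implement option (a) is sketched below. It assumes that requiring `read`, `seek`, and a truthy `seekable()` is an acceptable definition of "file-like" for this package; the name `is_seekable_file_like` and the `ReadOnlyStream` class are hypothetical and not part of the PR.

```python
import io


def is_seekable_file_like(obj) -> bool:
    """Return True if obj has read/seek methods and reports itself seekable.

    Hypothetical stricter variant of ``is_file_like``.
    """
    if not (hasattr(obj, "read") and hasattr(obj, "seek")):
        return False
    # io.IOBase subclasses expose seekable(); objects without it are
    # accepted on the strength of having a seek method.
    seekable = getattr(obj, "seekable", None)
    return bool(seekable()) if callable(seekable) else True


class ReadOnlyStream:
    """Minimal non-seekable object that still has a read method."""

    def read(self, size=-1):
        return b""


print(is_seekable_file_like(io.BytesIO(b"data")))  # True
print(is_seekable_file_like(ReadOnlyStream()))     # False
print(is_seekable_file_like("audio.wav"))          # False
```

With a check like this, a non-seekable stream would be rejected up front instead of failing later at `file.seek(0)`.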
</issue_to_address>
### Comment 2
<location> `tests/test_audiofile.py:1316` </location>
<code_context>
write_and_read("test.wav", np.zeros((65536, 100)), sampling_rate)
+
+
+class TestBytesIO:
+ """Tests for reading from file-like objects (BytesIO)."""
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests for MP3/OGG BytesIO support to cover all advertised formats.
The implementation and docstring claim BytesIO support for WAV, FLAC, MP3, and OGG, but `TestBytesIO` currently only covers WAV and FLAC. Please add MP3/OGG fixtures analogous to `wav_bytes`/`flac_bytes` and matching tests (e.g., `test_read_bytesio_mp3` / `test_read_bytesio_ogg`, including `always_2d` and possibly offset/duration) so all advertised formats are exercised from a `BytesIO` buffer.
Suggested implementation:
```python
class TestBytesIO:
    """Tests for reading from file-like objects (BytesIO)."""

    @pytest.fixture
    def mp3_bytes(self, tmpdir):
        """Create MP3 audio data as bytes."""
        sampling_rate = 8000
        signal = sine(
            duration=0.5,
            sampling_rate=sampling_rate,
        )
        tmp_file = tmpdir.join("bytesio_test.mp3")
        audiofile.write(str(tmp_file), signal, sampling_rate)
        with open(str(tmp_file), "rb") as f:
            return f.read()

    @pytest.fixture
    def ogg_bytes(self, tmpdir):
        """Create OGG/Vorbis audio data as bytes."""
        sampling_rate = 8000
        signal = sine(
            duration=0.5,
            sampling_rate=sampling_rate,
        )
        tmp_file = tmpdir.join("bytesio_test.ogg")
        audiofile.write(str(tmp_file), signal, sampling_rate)
        with open(str(tmp_file), "rb") as f:
            return f.read()

    def test_read_bytesio_mp3(self, mp3_bytes):
        """Reading MP3 from a BytesIO buffer works and honors always_2d."""
        buffer = io.BytesIO(mp3_bytes)
        data, sampling_rate = audiofile.read(buffer, always_2d=True)
        assert data.ndim == 2
        assert data.shape[1] > 0

    def test_read_bytesio_ogg(self, ogg_bytes):
        """Reading OGG from a BytesIO buffer works and honors always_2d."""
        buffer = io.BytesIO(ogg_bytes)
        data, sampling_rate = audiofile.read(buffer, always_2d=True)
        assert data.ndim == 2
        assert data.shape[1] > 0

    def test_read_bytesio_mp3_with_offset_and_duration(self, mp3_bytes):
        """Reading MP3 from BytesIO with offset/duration works."""
        buffer = io.BytesIO(mp3_bytes)
        data, sampling_rate = audiofile.read(
            buffer, offset=0.1, duration=0.2, always_2d=True
        )
        assert data.ndim == 2
        assert data.shape[1] > 0

    def test_read_bytesio_ogg_with_offset_and_duration(self, ogg_bytes):
        """Reading OGG from BytesIO with offset/duration works."""
        buffer = io.BytesIO(ogg_bytes)
        data, sampling_rate = audiofile.read(
            buffer, offset=0.1, duration=0.2, always_2d=True
        )
        assert data.ndim == 2
        assert data.shape[1] > 0

    @pytest.fixture
    def wav_bytes(self, tmpdir):
```
1. Ensure `io` is imported at the top of `tests/test_audiofile.py`, e.g. `import io`, if it is not already.
2. The above code assumes the module is imported as `import audiofile` and the helper `sine(...)` is available in this file (as it is for the existing WAV fixtures). If the existing tests use a different import style (e.g. `import audiofile as af` or `from audiofile import read, write`), adjust `audiofile.read` / `audiofile.write` accordingly to match the existing convention.
3. Align the MP3/OGG fixture implementations with how `wav_bytes` and `flac_bytes` are implemented in the same file (e.g. how `tmpdir` is used, whether I/O uses `open(str(tmp_file), "rb")` or `tmp_file.open()`, and any additional parameters passed to `audiofile.write` for MP3/OGG).
4. If the existing BytesIO tests also assert on sample values, channel count, or exact length, you may want to add similar, stricter assertions to the new MP3/OGG tests for consistency.
</issue_to_address>
### Comment 3
<location> `tests/test_audiofile.py:1483-1490` </location>
<code_context>
+
+ assert bit_depth == 16 # default bit depth
+
+ def test_bytesio_info_reusable(self, wav_bytes):
+ """Test that BytesIO can be reused after info calls."""
+ import io
+
+ audio_bytes, expected_signal, expected_sr = wav_bytes
+ buffer = io.BytesIO(audio_bytes)
+
+ # Call info functions
+ _ = af.channels(buffer)
+ _ = af.duration(buffer)
+ _ = af.sampling_rate(buffer)
+ _ = af.samples(buffer)
+
+ # Should still be able to read
+ signal, sr = af.read(buffer)
+ assert sr == expected_sr
+ # Use tolerance for 16-bit quantization
+ np.testing.assert_allclose(signal, expected_signal, atol=1e-4)
+
+ def test_bit_depth_bytesio_flac(self, flac_bytes):
</code_context>
<issue_to_address>
**suggestion (testing):** Include `bit_depth` in the BytesIO info reusability test.
Since `bit_depth` now has special handling for file-like objects (including seeking to position 0), it should also be invoked here to ensure the buffer remains reusable. Please add `af.bit_depth(buffer)` before the final `af.read(buffer)` so this test covers that behavior as well.
```suggestion
# Call info functions
_ = af.channels(buffer)
_ = af.duration(buffer)
_ = af.sampling_rate(buffer)
_ = af.samples(buffer)
_ = af.bit_depth(buffer)
# Should still be able to read
signal, sr = af.read(buffer)
```
</issue_to_address>
### Comment 4
<location> `tests/test_audiofile.py:1527-1536` </location>
<code_context>
+ def test_bytesio_unsupported_format_error(self, wav_bytes):
</code_context>
<issue_to_address>
**suggestion (testing):** Consider also testing unsupported BytesIO without a `name` attribute.
This test only covers the case where a file-like object has an unsupported extension (e.g. `name='test.m4a'`). There’s another edge case where a plain `io.BytesIO` (no `name`, or empty `name`) with unsupported data will go through `soundfile.info` and may raise a different error. Please add a separate test using arbitrary non‑audio bytes in a bare `BytesIO` and assert that an error is raised, to document and preserve the behavior for unsupported byte streams.
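The two cases can be illustrated without audio data. The sketch below assumes the extension of a file-like object is derived from an optional `name` attribute, which matches the `name='test.m4a'` case above and the `file_extension(Buffer)` step returning `None` in the sequence diagram. `extension_of` is a hypothetical stand-in, not the package's `file_extension`.

```python
import io
import os


def extension_of(obj):
    """Return the lowercase extension from obj.name, or None if unavailable.

    Hypothetical stand-in for illustration only.
    """
    name = getattr(obj, "name", None)
    if not isinstance(name, str) or not name:
        return None
    ext = os.path.splitext(name)[1]
    return ext[1:].lower() or None


bare = io.BytesIO(b"not audio")
named = io.BytesIO(b"not audio")
named.name = "test.m4a"  # BytesIO instances accept arbitrary attributes

print(extension_of(bare))   # None
print(extension_of(named))  # m4a
```

Under this assumption, the named buffer can be rejected by extension alone, while the bare buffer carries no extension and is only rejected once the decoder inspects its bytes, which is why the two cases may surface different errors.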
</issue_to_address>
### Comment 5
<location> `audiofile/core/info.py:21` </location>
<code_context>
-def bit_depth(file: str) -> int | None:
+def bit_depth(file: str | io.IOBase) -> int | None:
r"""Bit depth of audio file.
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the repeated file-normalization, soundfile access, validation, and bit-depth mapping logic into shared helpers to reduce duplication and branching across these functions.
You can keep all the new functionality while reducing duplication and branching by extracting the repeated patterns into small internal helpers.
### 1. Centralize file-like detection, path normalization, and extension
```python
def _normalize_file_input(file: str | io.IOBase):
    file_like = is_file_like(file)
    if not file_like:
        file = audeer.safe_path(file)
    file_ext = file_extension(file)
    return file, file_like, file_ext
```
Usage in `channels`, `duration`, `samples`, `sampling_rate`:
```python
def channels(file: str | io.IOBase) -> int:
    file, file_like, file_ext = _normalize_file_input(file)
    # ...
```
This removes the repeated `is_file_like` / `safe_path` / `file_extension` logic from every function.
### 2. Encapsulate `soundfile.info()` + `seek(0)`
```python
def _sf_info(file: str | io.IOBase, file_like: bool):
    info = soundfile.info(file)
    if file_like:
        file.seek(0)
    return info
```
Then all call sites become simpler:
```python
info = _sf_info(file, file_like)
return info.channels
```
This also simplifies `samples_as_int`, which can delegate the seek handling to `_sf_info` (in this sketch it still accepts `file_like` and forwards it):
```python
def samples_as_int(file, file_like: bool):
    info = _sf_info(file, file_like)
    return int(info.duration * info.samplerate)
```
And use it in other helpers:
```python
info = _sf_info(file, file_like)
return info.duration
```
### 3. Centralize file-like format validation and error
The SNDFORMATS/file-like checks and error message are repeated across several functions. You can extract them:
```python
FILE_LIKE_ERROR = (
    "File-like objects are only supported for WAV, FLAC, MP3, and OGG files."
)


def _ensure_soundfile_supported(
    file: str | io.IOBase,
    file_like: bool,
    file_ext: str | None,
) -> bool:
    # returns True if soundfile should be used, otherwise raises on invalid file-like
    if file_ext in SNDFORMATS or (file_like and file_ext is None):
        return True
    if file_like:
        raise RuntimeError(FILE_LIKE_ERROR)
    return False
```
Then e.g. in `duration`:
```python
def duration(file: str | io.IOBase, sloppy=False) -> float:
    file, file_like, file_ext = _normalize_file_input(file)
    if _ensure_soundfile_supported(file, file_like, file_ext):
        info = _sf_info(file, file_like)
        return info.duration
    if sloppy:
        ...
```
Same pattern can be applied to `channels`, `samples`, `sampling_rate` to remove duplicated conditionals and error messages.
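The four input combinations this rule distinguishes can be checked in isolation. The sketch below copies the condition from `_ensure_soundfile_supported` above and assumes `SNDFORMATS` contains `wav`, `flac`, `mp3`, and `ogg` (inferred from the error message, not taken from the package source). The `outcome` helper is hypothetical and exists only to label each branch.

```python
SNDFORMATS = ["wav", "flac", "mp3", "ogg"]  # assumed contents, see lead-in
FILE_LIKE_ERROR = (
    "File-like objects are only supported for WAV, FLAC, MP3, and OGG files."
)


def ensure_soundfile_supported(file_like: bool, file_ext) -> bool:
    """Return True for the soundfile path, False for the conversion path."""
    if file_ext in SNDFORMATS or (file_like and file_ext is None):
        return True
    if file_like:
        raise RuntimeError(FILE_LIKE_ERROR)
    return False


def outcome(file_like: bool, file_ext) -> str:
    """Describe which branch a given input combination takes."""
    try:
        supported = ensure_soundfile_supported(file_like, file_ext)
    except RuntimeError:
        return "error"
    return "soundfile" if supported else "convert"


for case in [(False, "wav"), (True, None), (False, "m4a"), (True, "m4a")]:
    print(case, "->", outcome(*case))
```

Only a file-like object with a known, unsupported extension raises; a path with an unsupported extension falls through to the conversion branch.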
### 4. Simplify `bit_depth` mappings and logic
You can remove the duplicated info + seek logic and two separate mappings by either:
**Option A: single mapping keyed by (format, subtype)**
```python
BIT_DEPTH_MAPPING = {
    ("wav", "PCM_16"): 16,
    ("wav", "PCM_24"): 24,
    ("wav", "PCM_32"): 32,
    ("wav", "PCM_U8"): 8,
    ("wav", "FLOAT"): 32,
    ("wav", "DOUBLE"): 64,
    ("wav", "ULAW"): 8,
    ("wav", "ALAW"): 8,
    ("wav", "IMA_ADPCM"): 4,
    ("wav", "MS_ADPCM"): 4,
    ("wav", "GSM610"): 16,
    ("wav", "G721_32"): 4,
    ("flac", "PCM_16"): 16,
    ("flac", "PCM_24"): 24,
    ("flac", "PCM_32"): 32,
    ("flac", "PCM_S8"): 8,
}


def bit_depth(file: str | io.IOBase) -> int | None:
    file, file_like, file_type = _normalize_file_input(file)
    if file_like and file_type is None:
        info = _sf_info(file, file_like=True)
        file_type = info.format.lower()
    if file_type not in ("wav", "flac"):
        return None
    info = _sf_info(file, file_like)
    return BIT_DEPTH_MAPPING.get((file_type, info.subtype))
```
**Option B: helper that hides info/seek and mapping selection**
```python
def _bit_depth_from_info(file_type: str, file, file_like: bool) -> int | None:
    info = _sf_info(file, file_like)
    if file_type == "wav":
        mapping = WAV_PRECISION_MAPPING
    elif file_type == "flac":
        mapping = FLAC_PRECISION_MAPPING
    else:
        return None
    return mapping.get(info.subtype)


def bit_depth(file: str | io.IOBase) -> int | None:
    file, file_like, file_type = _normalize_file_input(file)
    if file_like and file_type is None:
        info = _sf_info(file, file_like=True)
        file_type = info.format.lower()
    return _bit_depth_from_info(file_type, file, file_like)
```
Either approach reduces branching and ensures the `soundfile.info` + `seek` pattern is consistent and centralized.
</issue_to_address>
### Comment 6
<location> `audiofile/core/io.py:124` </location>
<code_context>
def read(
- file: str,
+ file: str | io.IOBase,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the file-like normalization/validation and soundfile read/seek-reset logic into small helpers so `read` remains focused on core audio processing.
You can reduce the added complexity by centralizing the file-like handling and `soundfile` interaction into small helpers instead of sprinkling conditionals and duplicated logic through `read`.
### 1. Normalize path vs file-like handling
Right now `read` does:
```python
file_like = is_file_like(file)
if not file_like:
    file = audeer.safe_path(file)
file_ext = file_extension(file)
...
if file_ext not in SNDFORMATS and not (file_like and file_ext is None):
    if file_like:
        raise RuntimeError(
            "File-like objects are only supported "
            "for WAV, FLAC, MP3, and OGG files."
        )
    with tempfile.TemporaryDirectory(...):
        ...
```
This mixes normalization, validation, format detection and error raising. You can extract this into a small internal helper that you can also reuse from `core/info.py`:
```python
_UNSUPPORTED_FILELIKE_MSG = (
    "File-like objects are only supported for WAV, FLAC, MP3, and OGG files."
)


def _normalize_audio_source(
    file: str | io.IOBase,
) -> tuple[io.IOBase | str, bool, str | None, bool]:
    """Return (file, is_file_like, file_ext, needs_conversion)."""
    file_like = is_file_like(file)
    if not file_like:
        file = audeer.safe_path(file)
    file_ext = file_extension(file)
    if file_like and file_ext not in SNDFORMATS and file_ext is not None:
        # or if you want to reject unknown-extension file-like objects too:
        # if file_like and file_ext not in SNDFORMATS:
        raise RuntimeError(_UNSUPPORTED_FILELIKE_MSG)
    needs_conversion = (
        (file_ext not in SNDFORMATS) and not (file_like and file_ext is None)
    )
    return file, file_like, file_ext, needs_conversion
```
Then `read` becomes much easier to follow:
```python
def read(...):
    file, file_like, file_ext, needs_conversion = _normalize_audio_source(file)
    ...
    tmpdir = None
    if needs_conversion:
        with tempfile.TemporaryDirectory(prefix="audiofile") as tmpdir:
            tmpfile = os.path.join(tmpdir, "tmp.wav")
            ...
            file = tmpfile  # so later logic can just use `file`
            file_like = False
```
This removes the complex condition and centralizes the error message and validation. The same helper (or the `_UNSUPPORTED_FILELIKE_MSG`) can be reused in `core/info.py` to deduplicate the error string and rule.
### 2. Wrap `soundfile.read` and reset seek in one place
Instead of:
```python
signal, sampling_rate = soundfile.read(
    file,
    start=start,
    stop=stop,
    dtype=dtype,
    always_2d=always_2d,
    **kwargs,
)
if file_like:
    file.seek(0)
```
and similar patterns elsewhere, you can wrap `soundfile.read` in a tiny helper:
```python
def _sf_read(
    file: str | io.IOBase,
    *,
    file_like: bool,
    **kwargs,
) -> tuple[np.ndarray, int]:
    signal, sampling_rate = soundfile.read(file, **kwargs)
    if file_like:
        file.seek(0)
    return signal, sampling_rate
```
Usage inside `read`:
```python
signal, sampling_rate = _sf_read(
    file,
    file_like=file_like,
    start=start,
    stop=stop,
    dtype=dtype,
    always_2d=always_2d,
    **kwargs,
)
```
If you have similar seek-reset logic in `core/info.py`, a shared helper like `_sf_read` / `_sf_info` keeps that behavior consistent and removes scattered `if file_like: file.seek(0)` branches.
These two small helpers keep `read` focused on offset/duration and channel logic, while consolidating the cross-cutting file-like rules and `soundfile` behaviors.
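A possible generalization of `_sf_info` and `_sf_read` is a single wrapper that rewinds the stream even when the wrapped call raises. The sketch below is written under that assumption: `call_and_rewind` is a hypothetical name, and the stand-in `read_header` function replaces the `soundfile` call so the snippet runs with the standard library only.

```python
import io


def call_and_rewind(func, file, *args, **kwargs):
    """Call func(file, ...) and seek file-like inputs back to position 0."""
    try:
        return func(file, *args, **kwargs)
    finally:
        # Paths (str) have no seek method and are left untouched.
        if hasattr(file, "seek"):
            file.seek(0)


def read_header(file, size=4):
    """Stand-in for a soundfile call: consumes bytes from the stream."""
    return file.read(size)


buffer = io.BytesIO(b"RIFFxxxxWAVE")
header = call_and_rewind(read_header, buffer)
print(header, buffer.tell())  # b'RIFF' 0
```

Using `try`/`finally` differs from the snippets above, which only rewind after a successful call; whether a buffer should also be rewound after a failed read is a design choice for the maintainers.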
</issue_to_address>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Closes #172
Add support for reading SND formats (wav, flac, ogg, mp3) from a `BytesIO` file-like object with `audiofile.read()` in the same way this is supported by `soundfile`. We also add support for the info functions (`channels`, `duration`, `sampling_rate`, `samples`, `bit_depth`).
If the bytes object has an unsupported format, we raise a `RuntimeError`.
I checked that the changes do not degrade any of the benchmarks when dealing with real files.
Usage example: