Introduce .bin.ndjson format #18

Open: wants to merge 2 commits into base: main

@FindHao (Member) commented Jun 24, 2025

Summary:

  • Added a new format: .bin.ndjson
  • Added support for reading compressed input files in extract_source_mappings.py using gzip.
  • Updated file handling logic to differentiate between compressed (.bin.ndjson) and uncompressed files.
  • Enhanced structured_logging.py to enable gzip compression for trace logs, allowing for efficient storage and retrieval.
  • Modified TritonTraceHandler to handle gzip compression for individual log records, ensuring compatibility with standard gzip readers.
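
The extension-based file handling described in the bullets above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper name `open_trace` is hypothetical, and it assumes the only signal for compression is the `.bin.ndjson` suffix.

```python
import gzip
from pathlib import Path
from typing import IO


def open_trace(path: str) -> IO[str]:
    """Open a trace file in text mode (hypothetical helper).

    Files ending in .bin.ndjson are treated as gzip-compressed;
    anything else is read as plain text.
    """
    if Path(path).name.endswith(".bin.ndjson"):
        # gzip.open in "rt" mode decompresses and decodes transparently.
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Callers can then iterate over lines the same way for both kinds of input, for example `with open_trace(p) as f: for line in f: ...`.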

New .bin.ndjson format summary:

  • Writing: Each JSON record is individually gzip-compressed into a separate gzip member, then sequentially appended to the same binary file, leveraging the gzip specification's support for member concatenation.
  • Reading: Standard gzip.open() automatically handles member concatenation, allowing line-by-line reading just like a regular text file, without requiring special parsing logic.
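
To illustrate the write and read paths above, here is a small sketch of the member-per-record idea. The function names `append_record` and `read_records` are hypothetical and not taken from this PR; the sketch only relies on `gzip.compress` producing one complete gzip member and on `gzip.open` reading concatenated members as a single stream.

```python
import gzip
import json


def append_record(path, record):
    """Compress one JSON record into its own gzip member and append it."""
    line = json.dumps(record) + "\n"
    member = gzip.compress(line.encode("utf-8"))  # one complete gzip member
    with open(path, "ab") as f:  # binary append keeps earlier members intact
        f.write(member)


def read_records(path):
    """Read all records back; gzip.open walks the concatenated members."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because every record is a self-contained member, everything written before an interrupted run is still a valid gzip stream. The cost is a somewhat lower compression ratio than compressing the whole file at once, since each member starts with an empty dictionary.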

For compression, gzip is fast, reasonably effective, and natively supported in Python. For example, a 950MB raw log file compresses to 117MB with gzip and to 110MB with lzma, but gzip is more than 10 times faster than lzma.

Test Plan:

TRITON_TRACE_GZIP=1 python test_add.py

By default, a new trace file is generated at ./logs/dedicated_log_triton_trace_yhao24_.bin.ndjson. The trace file size is reduced from 24K to 5.3K. In a separate, larger-scale experiment, the trace file was reduced from 950MB to 117MB.

% ll
total 32K
-rw-r--r--. 1 5.3K Jun 23 20:01 dedicated_log_triton_trace_yhao24_.bin.ndjson
-rw-r--r--. 1 24K Jun 23 20:01 dedicated_log_triton_trace_yhao24_.ndjson

@facebook-github-bot added the CLA Signed label Jun 24, 2025
@FindHao changed the title from "Enhance file handling in extract_source_mappings and structured_logging" to "Introduce .bin.ndjson format" Jun 24, 2025
Summary:
- Introduced a new script, decompress_bin_ndjson.py, to handle the decompression of .bin.ndjson files back to standard .ndjson format.
- The script validates input files, manages output file naming, and provides detailed output on the decompression process, including file sizes and compression ratios.
- It utilizes gzip for reading compressed files and includes error handling for various potential issues during decompression.
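
For readers who want a feel for what such a decompression step involves, the following is a rough sketch under stated assumptions. It is not the contents of decompress_bin_ndjson.py: the function name, its return value, and the output-naming rule (swap the `.bin.ndjson` suffix for `.ndjson`) are all illustrative guesses.

```python
import gzip
import os
import shutil


def decompress_bin_ndjson(src, dst=None):
    """Decompress src (.bin.ndjson) to dst and report sizes (hypothetical)."""
    suffix = ".bin.ndjson"
    if not src.endswith(suffix):
        raise ValueError(f"expected a {suffix} file, got: {src}")
    if not os.path.isfile(src):
        raise FileNotFoundError(src)
    if dst is None:
        # Assumed naming rule: foo.bin.ndjson -> foo.ndjson
        dst = src[: -len(suffix)] + ".ndjson"
    # Stream in chunks so large traces do not need to fit in memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    compressed_size = os.path.getsize(src)
    raw_size = os.path.getsize(dst)
    return dst, compressed_size, raw_size
```

The two sizes are enough to print a compression ratio. A corrupt input surfaces as `gzip.BadGzipFile` (or `EOFError` for a truncated stream), which a command-line wrapper would likely want to catch and report.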
@facebook-github-bot (Contributor) commented

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
