Conversation

johnpalsberg (Contributor):

So far I've created a system for writing and reading compressed sequence data from files. I've added a Rust unit test for this, as well as a command-line test case in a Bash script (running code in a new main.rs file). The current merge conflicts seem to stem from my making certain functions (and the Size struct) in file.rs public.

@sampsyo (Collaborator) left a comment:

Awesome; this is looking good already! I have a few suggestions around the edges—please let me know where I can clarify further.

sampsyo (Collaborator):

Let's leave the VS Code settings out of the repo, probably?
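
For example, assuming the settings live in the usual .vscode/ directory:

git rm -r --cached .vscode
echo '.vscode/' >> .gitignore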

sampsyo (Collaborator):

All these pubs and pub(crate)s seem fine! To resolve the conflict, maybe try rebasing on the current main branch and perhaps applying these access modifiers again?
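
For example, assuming the upstream repo is the origin remote:

git fetch origin
git rebase origin/main
# at each conflict, keep the new pub/pub(crate) modifiers, then:
git add -u
git rebase --continue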

sampsyo (Collaborator):

We actually already have a main.rs, under the src/cli directory. You'll notice that this is organized around subcommands: for example, fgfa paths or fgfa stats or fgfa bed. Instead of adding a new main.rs with a new main function, let's just add a new subcommand there! Or perhaps two—we could call them seq-import and seq-export, for example.
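
For illustration, here's a minimal sketch of the two subcommands. It uses a clap-style derive, though the real src/cli code may be built on a different argument crate, and the exact arguments are assumptions:

use clap::{Parser, Subcommand};

#[derive(Parser)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Read text nucleotides on stdin; write the packed binary file.
    SeqExport { out_file: String },
    /// Read a packed binary file; print the text sequence on stdout.
    SeqImport { in_file: String },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::SeqExport { out_file } => { /* pack stdin into out_file */ }
        Command::SeqImport { in_file } => { /* map in_file and print it */ }
    }
}

Clap derives kebab-case names automatically, so these would surface as fgfa seq-export and fgfa seq-import.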

#[derive(FromBytes, FromZeroes, AsBytes, Debug)]
#[repr(packed)]
pub struct PackedToc {
magic: u8,
sampsyo (Collaborator):

I know this is a silly thing, but you probably want the magic number to be an entire word, i.e., a u64. Here are two reasons:

  • There are just too many collisions in the space of a u8! That is, a u64 that you pick is likely to be unique; a u8 that you pick is likely to be the same as someone else's u8 magic number.
  • Using a single byte at the start of the file will make the rest of the data unaligned. If the magic number is a u64, then the next field will start on a word-aligned boundary.
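
Concretely, the fix is just widening the field:

#[repr(packed)]
pub struct PackedToc {
    magic: u64, // a full word: far fewer accidental collisions, and the next field starts word-aligned
    // ...remaining fields unchanged...
}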

pub struct PackedToc {
magic: u8,
data: Size,
high_nibble_end: Size,
sampsyo (Collaborator):

This strategy stores the high_nibble_end field after the actual sequence data, and it uses a "pointer/span" to allow for any number of bytes to be stored there. But, unlike for the sequence data, we know that we only ever need to store a single byte (or actually, a single bit)! So it would probably be more sensible to just store this right here, in the table of contents.

In other words, we could just make this field have type bool or u8 or u64 or something, instead of Size. And instead of using that Size to refer to data stored elsewhere in the file, we just store the flag right here, in the TOC. Would that make sense?
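
That might look something like this; exactly what the flag means (here, whether the sequence ends mid-byte) is an assumption:

#[derive(FromBytes, FromZeroes, AsBytes, Debug)]
#[repr(packed)]
pub struct PackedToc {
    magic: u64,
    data: Size,
    high_nibble_end: u8, // stored inline: nonzero if the final nibble is unused
}

Using u8 rather than bool also plays nicely with the zerocopy derives, since every u8 bit pattern is valid for FromBytes.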

slice.vec_ref.as_ref().get_range(slice.span)
}

pub fn total_bytes(num_elems: usize) -> usize {
sampsyo (Collaborator):

Not 100% sure about this one, but perhaps this API could be a tiny bit friendlier if it took a PackedSeqStore or PackedSeqView as an argument. It would of course just immediately use .len() to get the size, but it would make it a little clearer to callers that they're supposed to use this to get the file size for one of these data structures.
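
Something like this, assuming PackedSeqView exposes len() (total_bytes_for_len is a hypothetical stand-in for the current num_elems-based computation):

pub fn total_bytes(view: &PackedSeqView) -> usize {
    // callers hand over the data structure; we just ask it how long it is
    total_bytes_for_len(view.len())
}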

}

/// Check if the pool is empty.
/// Check if th e pool is empty.
sampsyo (Collaborator):

Looks like a typo!

if args.len() > 2 && args[1] == "import" {
let mmap = memfile::map_file(&args[2]); // args[2] is the filename
let seq = packedseq::view(&mmap);
println!("Sequence: {}", seq);
sampsyo (Collaborator):

I say let's just print the raw seq itself, without the Sequence: prefix. This will make the CLI friendlier for use in larger pipelines.
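
That is, something like:

println!("{}", seq);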

let vec = PackedSeqStore::create(vec![
Nucleotide::A,
Nucleotide::C,
Nucleotide::T,
Nucleotide::G,
]);
sampsyo (Collaborator):

Instead of constructing a specific sequence here, let's read a text file from standard input! That is, we'd like to do something like this:

$ cat myseq.txt | fgfa seq-export myseq.seq

where myseq.txt contains text like ACTGTGGACCAATG or whatever. In other words, let's make a CLI here that can convert between two file types: text files with one nucleotide per character, and our brand-new binary sequence data format.
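
A minimal sketch of the parsing side, reusing the Nucleotide and PackedSeqStore types from the snippet above (the return type is an assumption):

use std::io::Read;

fn read_seq_from_stdin() -> PackedSeqStore {
    let mut text = String::new();
    std::io::stdin().read_to_string(&mut text).expect("failed to read stdin");
    let nucs: Vec<Nucleotide> = text
        .trim()
        .chars()
        .map(|c| match c {
            'A' => Nucleotide::A,
            'C' => Nucleotide::C,
            'G' => Nucleotide::G,
            'T' => Nucleotide::T,
            other => panic!("unexpected character: {other}"),
        })
        .collect();
    PackedSeqStore::create(nucs)
}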

sampsyo (Collaborator):

A standalone shell script is perfectly fine, and maybe the right tool for the job here. The "next level up" for testing this that I had in mind was to set up Turnt, along the lines of what we already do for round-trips from GFA to FlatGFA:

pollen/tests/turnt.toml (lines 162 to 172 in 8a4fd75):

[envs.flatgfa_mem]
command = "../target/debug/fgfa < {filename}"
output.gfa = "-"
[envs.flatgfa_file]
command = "../target/debug/fgfa -o {base}.flatgfa < {filename} ; ../target/debug/fgfa -i {base}.flatgfa"
output.gfa = "-"
[envs.flatgfa_file_inplace]
command = "../target/debug/fgfa -m -p 128 -o {base}.inplace.flatgfa -I {filename} ; ../target/debug/fgfa -m -i {base}.inplace.flatgfa"
output.gfa = "-"

Basically, the idea would be to create a little tests directory here and to populate it with a few small text sequence files. Then, a Turnt config would look something like this:

command = "fgfa seq-import < {filename} | fgfa seq-export"
output.txt = "-"

This would test that, starting with the original text file, a round trip through importing and exporting (I forget which is which!) eventually produces exactly the same .txt file, byte for byte. Would that strategy make sense?

@sampsyo (Collaborator) commented Jul 17, 2025:

It seems like this PR is now outdated—can we close it?

@sampsyo (Collaborator) commented Sep 22, 2025:

In doing a little bit of prep for the Panorama presentation today, I thought I'd do some quick & dirty benchmarking of the current compressed version. Sadly, we're currently "underwater": the compressed version seems to be both larger and slower than the current main branch?

I used this sequence of commands to do a quick test:

cargo build --release
fgfa -I tests/DRB1-3123.gfa -o blarg
ll blarg
hyperfine -w3 -N 'fgfa -I tests/DRB1-3123.gfa extract -n 3 -c 3' 'fgfa -I blarg extract -n 3 -c 3'

And I did that for both branches. The sizes of the files are:

  • original GFA: 463k
  • main FlatGFA: 497k
  • john-zerocopy FlatGFA: 538k

And running that extract command goes from 7.9 ms to 8.4 ms, i.e., a slight slowdown.

I'm not shocked that the speed is a bit slower (there is overhead when accessing compressed data), but I think we should probably get to the bottom of why the file size is larger?
