Conversation

johnpalsberg (Contributor):

So far I've created a system for writing and reading compressed sequence data from files. I've added a Rust unit test for this, as well as a command-line test case in a Bash script (running code in a new main.rs file). The current merge conflicts seem to stem from my making certain functions (and the Size struct) in file.rs public.

@sampsyo (Collaborator) left a comment:

Awesome; this is looking good already! I have a few suggestions around the edges—please let me know where I can clarify further.

sampsyo (Collaborator):

Let's leave the VS Code settings out of the repo, probably?
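
For example, assuming the settings live in the usual .vscode/ directory:

git rm -r --cached .vscode
echo '.vscode/' >> .gitignore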

sampsyo (Collaborator):

All these pubs and pub(crate)s seem fine! To resolve the conflict, maybe try rebasing on the current main branch and perhaps applying these access modifiers again?
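
For example, assuming the upstream repo is the origin remote:

git fetch origin
git rebase origin/main
# at each conflict, keep the new pub/pub(crate) modifiers, then:
git add -u
git rebase --continue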

sampsyo (Collaborator):

We actually already have a main.rs, under the src/cli directory. You'll notice that this is organized around subcommands: for example, fgfa paths or fgfa stats or fgfa bed. Instead of adding a new main.rs with a new main function, let's just add a new subcommand there! Or perhaps two—we could call them seq-import and seq-export, for example.
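
For illustration, here's a minimal sketch of the two subcommands. It uses a clap-style derive, though the real src/cli code may be built on a different argument crate, and the exact arguments are assumptions:

use clap::{Parser, Subcommand};

#[derive(Parser)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Read text nucleotides on stdin; write the packed binary file.
    SeqExport { out_file: String },
    /// Read a packed binary file; print the text sequence on stdout.
    SeqImport { in_file: String },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::SeqExport { out_file } => { /* pack stdin into out_file */ }
        Command::SeqImport { in_file } => { /* map in_file and print it */ }
    }
}

Clap derives kebab-case names automatically, so these would surface as fgfa seq-export and fgfa seq-import.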

#[derive(FromBytes, FromZeroes, AsBytes, Debug)]
#[repr(packed)]
pub struct PackedToc {
magic: u8,
sampsyo (Collaborator):

I know this is a silly thing, but you probably want the magic number to be an entire word, i.e., a u64. Here are two reasons:

  • There are just too many collisions in the space of a u8! That is, a u64 that you pick is likely to be unique; a u8 that you pick is likely to be the same as someone else's u8 magic number.
  • Using a single byte at the start of the file will make the rest of the data unaligned. If the magic number is a u64, then the next field will start on a word-aligned boundary.
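
Concretely, the fix is just widening the field:

#[repr(packed)]
pub struct PackedToc {
    magic: u64, // a full word: far fewer accidental collisions, and the next field starts word-aligned
    // ...remaining fields unchanged...
}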

pub struct PackedToc {
magic: u8,
data: Size,
high_nibble_end: Size,
sampsyo (Collaborator):

This strategy stores the high_nibble_end field after the actual sequence data, and it uses a "pointer/span" to allow for any number of bytes to be stored there. But, unlike for the sequence data, we know that we only ever need to store a single byte (or actually, a single bit)! So it would probably be more sensible to just store this right here, in the table of contents.

In other words, we could just make this field have type bool or u8 or u64 or something, instead of Size. And instead of using that Size to refer to data stored elsewhere in the file, we just store the flag right here, in the TOC. Would that make sense?
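
That might look something like this; exactly what the flag means (here, whether the sequence ends mid-byte) is an assumption:

#[derive(FromBytes, FromZeroes, AsBytes, Debug)]
#[repr(packed)]
pub struct PackedToc {
    magic: u64,
    data: Size,
    high_nibble_end: u8, // stored inline: nonzero if the final nibble is unused
}

Using u8 rather than bool also plays nicely with the zerocopy derives, since every u8 bit pattern is valid for FromBytes.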

slice.vec_ref.as_ref().get_range(slice.span)
}

pub fn total_bytes(num_elems: usize) -> usize {
sampsyo (Collaborator):

Not 100% sure about this one, but perhaps this API could be a tiny bit friendlier if it took a PackedSeqStore or PackedSeqView as an argument. It would of course just immediately use .len() to get the size, but it would make it a little clearer to callers that they're supposed to use this to get the file size for one of these data structures.
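
Something like this, assuming PackedSeqView exposes len() (total_bytes_for_len is a hypothetical stand-in for the current num_elems-based computation):

pub fn total_bytes(view: &PackedSeqView) -> usize {
    // callers hand over the data structure; we just ask it how long it is
    total_bytes_for_len(view.len())
}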

}

/// Check if the pool is empty.
/// Check if th e pool is empty.
sampsyo (Collaborator):

Looks like a typo!

if args.len() > 2 && args[1] == "import" {
let mmap = memfile::map_file(&args[2]); // args[2] is the filename
let seq = packedseq::view(&mmap);
println!("Sequence: {}", seq);
sampsyo (Collaborator):

I say let's just print the raw seq itself, without the Sequence: prefix. This will make the CLI friendlier for use in larger pipelines.
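
That is, something like:

println!("{}", seq);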

let vec = PackedSeqStore::create(vec![
Nucleotide::A,
Nucleotide::C,
Nucleotide::T,
Nucleotide::G,
]);
sampsyo (Collaborator):

Instead of constructing a specific sequence here, let's read a text file from standard input! That is, we'd like to do something like this:

$ cat myseq.txt | fgfa seq-export myseq.seq

where myseq.txt contains text like ACTGTGGACCAATG or whatever. In other words, let's make a CLI here that can convert between two file types: text files with one nucleotide per character, and our brand-new binary sequence data format.
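
A minimal sketch of the parsing side, reusing the Nucleotide and PackedSeqStore types from the snippet above (the return type is an assumption):

use std::io::Read;

fn read_seq_from_stdin() -> PackedSeqStore {
    let mut text = String::new();
    std::io::stdin().read_to_string(&mut text).expect("failed to read stdin");
    let nucs: Vec<Nucleotide> = text
        .trim()
        .chars()
        .map(|c| match c {
            'A' => Nucleotide::A,
            'C' => Nucleotide::C,
            'G' => Nucleotide::G,
            'T' => Nucleotide::T,
            other => panic!("unexpected character: {other}"),
        })
        .collect();
    PackedSeqStore::create(nucs)
}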

sampsyo (Collaborator):

A standalone shell script is perfectly fine, and maybe the right tool for the job here. The "next level up" for testing this that I had in mind was to set up Turnt, along the lines of what we already do for round-trips from GFA to FlatGFA:

pollen/tests/turnt.toml (lines 162 to 172 in 8a4fd75):

[envs.flatgfa_mem]
command = "../target/debug/fgfa < {filename}"
output.gfa = "-"
[envs.flatgfa_file]
command = "../target/debug/fgfa -o {base}.flatgfa < {filename} ; ../target/debug/fgfa -i {base}.flatgfa"
output.gfa = "-"
[envs.flatgfa_file_inplace]
command = "../target/debug/fgfa -m -p 128 -o {base}.inplace.flatgfa -I {filename} ; ../target/debug/fgfa -m -i {base}.inplace.flatgfa"
output.gfa = "-"

Basically, the idea would be to create a little tests directory here and to populate it with a few small text sequence files. Then, a Turnt config would look something like this:

command = "fgfa seq-import < {filename} | fgfa seq-export"
output.txt = "-"

This would test that, starting with the original text file, a round trip through importing and exporting (I forget which is which!) eventually produces exactly the same .txt file, byte for byte. Would that strategy make sense?

@sampsyo (Collaborator) commented Jul 17, 2025:

It seems like this PR is now outdated—can we close it?

@sampsyo (Collaborator) commented Sep 22, 2025:

In doing a little bit of prep for the Panorama presentation today, I thought I'd do some quick & dirty benchmarking of the current compressed version. Sadly, we're currently "underwater": the compressed version seems to be both larger and slower than the current main branch?

I used this sequence of commands to do a quick test:

cargo build --release
fgfa -I tests/DRB1-3123.gfa -o blarg
ll blarg
hyperfine -w3 -N 'fgfa -I tests/DRB1-3123.gfa extract -n 3 -c 3' 'fgfa -I blarg extract -n 3 -c 3'

And I did that for both branches. The sizes of the files are:

  • original GFA: 463k
  • main FlatGFA: 497k
  • john-zerocopy FlatGFA: 538k

And running that extract command goes from 7.9 ms to 8.4 ms, i.e., a slight slowdown.

I'm not shocked that the speed is a bit slower (there is overhead when accessing compressed data), but I think we should probably get to the bottom of why the file size is larger?
