John zerocopy #214
Conversation
Awesome; this is looking good already! I have a few suggestions around the edges—please let me know where I can clarify further.
.vscode/settings.json
Let's leave the VS Code settings out of the repo, probably?
All these pub and pub(crate) modifiers seem fine! To resolve the conflict, maybe try rebasing on the current main branch and perhaps applying these access modifiers again?
flatgfa/src/main.rs
We actually already have a main.rs, under the src/cli directory. You'll notice that this is organized around subcommands: for example, fgfa paths or fgfa stats or fgfa bed. Instead of adding a new main.rs with a new main function, let's just add a new subcommand there! Or perhaps two—we could call them seq-import and seq-export, for example.
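Purely as a hypothetical sketch (this assumes the existing CLI uses argh-style subcommands; the struct and field names here are made up, not from the codebase), the new subcommand could look roughly like this:

use argh::FromArgs;

/// import a nucleotide text file into the packed binary sequence format
#[derive(FromArgs, PartialEq, Debug)]
#[argh(subcommand, name = "seq-import")]
pub struct SeqImport {
    /// output file for the packed sequence
    #[argh(positional)]
    pub output: String,
}

// ...plus a matching SeqExport, and new arms in the existing subcommand enum and dispatch.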
flatgfa/src/packedseq.rs
#[derive(FromBytes, FromZeroes, AsBytes, Debug)]
#[repr(packed)]
pub struct PackedToc {
    magic: u8,
I know this is a silly thing, but you probably want the magic number to be an entire word, i.e., a u64. Here are two reasons:
- There are just too many collisions in the space of a u8! That is, a u64 that you pick is likely to be unique; a u8 that you pick is likely to be the same as someone else's u8 magic number.
- Using a single byte at the start of the file will make the rest of the data unaligned. If the magic number is a u64, then the next field will start on a word-aligned boundary.
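For concreteness, here's a minimal sketch of what that could look like (the constant value below is just an arbitrary placeholder, not a suggestion for the real magic number):

// Sketch only: a word-sized magic number is far less collision-prone and
// keeps the data that follows it word-aligned in the file.
const MAGIC: u64 = 0x70616b5f73657131; // arbitrary placeholder (ASCII "pak_seq1")

#[derive(FromBytes, FromZeroes, AsBytes, Debug)]
#[repr(packed)]
pub struct PackedToc {
    magic: u64,
    // ...remaining fields as before...
}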
flatgfa/src/packedseq.rs
pub struct PackedToc {
    magic: u8,
    data: Size,
    high_nibble_end: Size,
This strategy stores the high_nibble_end field after the actual sequence data, and it uses a "pointer/span" to allow for any number of bytes to be stored there. But, unlike for the sequence data, we know that we only ever need to store a single byte (or actually, a single bit)! So it would probably be more sensible to just store this right here, in the table of contents.

In other words, we could just make this field have type bool or u8 or u64 or something, instead of Size. And instead of using that Size to refer to data stored elsewhere in the file, we just store the flag right here, in the TOC. Would that make sense?
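To make that concrete, here's a rough sketch (the u64 flag type is just one of the options above, and it folds in the word-sized magic from the earlier comment):

#[derive(FromBytes, FromZeroes, AsBytes, Debug)]
#[repr(packed)]
pub struct PackedToc {
    magic: u64,
    data: Size,
    // Stored inline in the TOC rather than as a Size pointing at data
    // elsewhere in the file; only the low bit is meaningful.
    high_nibble_end: u64,
}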
flatgfa/src/packedseq.rs
    slice.vec_ref.as_ref().get_range(slice.span)
}

pub fn total_bytes(num_elems: usize) -> usize {
Not 100% sure about this one, but perhaps this API could be a tiny bit friendlier if it took a PackedSeqStore or PackedSeqView as an argument. It would of course just immediately use .len() to get the size, but it would make it a little clearer to callers that they're supposed to use this to get the file size for one of these data structures.
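Something along these lines, as a sketch (the inner helper name is hypothetical; it just stands in for the current num_elems-based computation):

// Sketch: accept the data structure itself so callers don't have to know
// which element count to pass in.
pub fn total_bytes(view: &PackedSeqView) -> usize {
    // Delegate to the existing element-count-based computation.
    total_bytes_for_elems(view.len())
}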
flatgfa/src/pool.rs
}

/// Check if the pool is empty.
/// Check if th e pool is empty.
Looks like a typo!
flatgfa/src/main.rs
if args.len() > 2 && args[1] == "import" {
    let mmap = memfile::map_file(&args[2]); // args[2] is the filename
    let seq = packedseq::view(&mmap);
    println!("Sequence: {}", seq);
I say let's just print the raw seq itself, without the Sequence: prefix. This will make the CLI friendlier for use in larger pipelines.
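That is, the print would simply become:

println!("{}", seq);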
flatgfa/src/main.rs
let vec = PackedSeqStore::create(vec![
    Nucleotide::A,
    Nucleotide::C,
    Nucleotide::T,
    Nucleotide::G,
]);
Instead of constructing a specific sequence here, let's read a text file from standard input! That is, we'd like to do something like this:
$ cat myseq.txt | fgfa seq-export myseq.seq
where myseq.txt contains text like ACTGTGGACCAATG or whatever. In other words, let's make a CLI here that can convert between two file types: text files with one nucleotide per character, and our brand-new binary sequence data format.
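A rough sketch of the reading side, assuming the Nucleotide and PackedSeqStore types from this PR (error handling and tolerance for whitespace or lowercase left to you):

use std::io::Read;

// Read nucleotide characters from stdin and pack them into a store.
let mut text = String::new();
std::io::stdin().read_to_string(&mut text).expect("read stdin");
let nucs: Vec<Nucleotide> = text
    .trim()
    .chars()
    .map(|c| match c {
        'A' => Nucleotide::A,
        'C' => Nucleotide::C,
        'T' => Nucleotide::T,
        'G' => Nucleotide::G,
        other => panic!("unexpected character: {}", other),
    })
    .collect();
let store = PackedSeqStore::create(nucs);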
flatgfa/src/test_export_import.sh
A standalone shell script is perfectly fine, and maybe the right tool for the job here. The "next level up" for testing this that I had in mind was to set up Turnt, along the lines of what we already do for round-trips from GFA to FlatGFA:
Lines 162 to 172 in 8a4fd75
[envs.flatgfa_mem]
command = "../target/debug/fgfa < {filename}"
output.gfa = "-"

[envs.flatgfa_file]
command = "../target/debug/fgfa -o {base}.flatgfa < {filename} ; ../target/debug/fgfa -i {base}.flatgfa"
output.gfa = "-"

[envs.flatgfa_file_inplace]
command = "../target/debug/fgfa -m -p 128 -o {base}.inplace.flatgfa -I {filename} ; ../target/debug/fgfa -m -i {base}.inplace.flatgfa"
output.gfa = "-"
Basically, the idea would be to create a little tests directory here and to populate it with a few small text sequence files. Then, a Turnt config would look something like this:
command = "fgfa seq-import < {filename} | fgfa seq-export"
output.txt = "-"
This would test that, starting with the original text file, a round trip through importing and exporting (I forget which is which!) eventually produces exactly the same .txt file, byte for byte. Would that strategy make sense?
It seems like this PR is now outdated—can we close it?
In doing a little bit of prep for the Panorama presentation today, I thought I'd do some quick & dirty benchmarking of the current compressed version. Sadly, we're currently "underwater": the compressed version seems to be both larger and slower than the current main branch? I used this sequence of commands to do a quick test:
And I did that for both branches. The sizes of the files are:
And running that, I'm not shocked that the speed is a bit slower (there is overhead when accessing compressed data), but I think we should probably get to the bottom of why the file size is larger?
So far I've created a system for writing and reading compressed sequence data from files. I've also added a Rust unit test for this, as well as a command-line test case in a Bash script (which runs code in a new main.rs file). The current merge conflicts seem to stem from me making certain functions (and the Size struct) in file.rs public.