
Updating to use compositional generator/parser pairs #2


Merged
merged 19 commits into master on Mar 19, 2025

Conversation

@aryanjassal aryanjassal (Member) commented Feb 12, 2025

Description

Implement a custom solution for generating tar archives using compositional generator/parser pairs. The primary use case is transferring file tree data between the EFS and the local file system (and vice versa) for Polykey.

Issues Fixed

Tasks

  • 1. Update the library to new templating
  • 2. Create compositional generators for archiving files
  • 3. Create parsers to extract file tree data from archive
  • 4. Add options for deterministic tar
  • 5. Add options to provide custom implementation of fs
  • 6. Add rigorous tests for everything
  • 7. Add header for extended metadata

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@aryanjassal aryanjassal self-assigned this Feb 12, 2025
@aryanjassal
Member Author

When creating a file in the archive, if the header contains a nested path like /home/aryanj/Downloads/FILE.txt as the file name, then that is traditionally an incorrect archive. Some stricter archival tools (like perhaps WinRAR/7z) might fail to parse this. However, I tested this on File Roller (Ubuntu's default archival software) and GNU tar. Supposedly it's also meant to work on BSD tar, but I haven't tested that yet, and I haven't tested it on WinRAR/7z either.

How do I know it works on File Roller and GNU tar? Because I've tested it and it works!! For now, I have only written a simple implementation which takes in a file, chunks it into 512-byte blocks, then writes the headers and contents using composable async generators.

It's still basic in complexity, but it works and can be tested by using the system's archival tools to check the validity of the archives. This will also be covered as part of some tests.

@aryanjassal
Member Author

The default limit for tar file names is 255 bytes. I don't think such a limitation is imposed on file names in the EFS for Polykey, so this might be an issue.

This is solved by GNU tar (also one of the most widespread implementations) by using a special header for longer file names.
https://stackoverflow.com/questions/2078778/what-exactly-is-the-gnu-tar-longlink-trick

Basically, for file names longer than 255 characters, a new block is created just before the header block, which can store up to 8,589,934,591 bytes (approximately 8 GiB) of file path. This is achieved by the following structure.

[Long Filename Header]
[Long Filename Data Block]
...
[Actual File Header]
[File Content]
...

In the long file name header, most of the fields are ignored except the file name, which must be ././@LongLink; the size, which must be properly calculated; the magic values of ustar and 00; and the type, which must be L, signifying a long filename header. All other values can be null.

After this header block, the long file name follows, padded to 512 bytes. Note that this is the full path, and using a long file name header means the file name fields in the actual file header will be null, as the file name came just before the header. Everything else is the same.
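
For reference, here's a rough sketch of how such a long-filename header could be generated. The offsets are the standard ustar ones, but this helper is purely illustrative and not the actual implementation:

// Illustrative only: emit a GNU long-filename ('L') header block followed by
// the path data padded to a multiple of 512 bytes.
function generateLongLinkBlocks(filePath: string): Uint8Array {
  const encoder = new TextEncoder();
  const pathBytes = encoder.encode(filePath);

  const header = new Uint8Array(512);
  header.set(encoder.encode('././@LongLink'), 0); // name field (offset 0)
  // Size of the long-name data as octal ASCII (offset 124, 12 bytes)
  header.set(encoder.encode(pathBytes.length.toString(8).padStart(11, '0')), 124);
  header[156] = 'L'.charCodeAt(0); // typeflag: GNU long filename
  header.set(encoder.encode('ustar'), 257); // magic
  header.set(encoder.encode('00'), 263); // version
  // Checksum: sum of all header bytes with the checksum field read as spaces
  header.fill(0x20, 148, 156);
  let sum = 0;
  for (const byte of header) sum += byte;
  header.set(encoder.encode(sum.toString(8).padStart(6, '0') + '\0 '), 148);

  // Long filename data, padded up to the next 512-byte boundary
  const padded = new Uint8Array(Math.ceil(pathBytes.length / 512) * 512);
  padded.set(pathBytes, 0);

  const out = new Uint8Array(header.length + padded.length);
  out.set(header, 0);
  out.set(padded, header.length);
  return out;
}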

Should I also implement this, or make an issue for this and leave it for later @tegefaulkes @CMCDragonkai?

@aryanjassal
Member Author

I have also implemented support for relative paths now. This is mostly complete. With a parser and some updates to the generator code, this can be finished rather quickly now.

The performance is also very good. I have tested this on the local file system (as the EFS isn't available for running tests at the moment) and I was able to archive a 600 MiB directory within 50 seconds. This is without any optimisation and with a lot of debug code, so this number should drop further as this approaches completion.

@aryanjassal
Member Author

Should I write benches for this? I've never written them or looked into it so I'm not sure how I would approach it, or if I even need to for this repo.

@CMCDragonkai
Member

> Should I write benches for this? I've never written them or looked into it so I'm not sure how I would approach it, or if I even need to for this repo.

Just look at how it's done in js-logger and js-id or similar. Copy those into chatgpt and ask how best to bench the code you have.

Btw the new Copilot extension for VSCode supports operating the computer, so it can kind of inspect the full context and code up things like this.

@CMCDragonkai
Member

I suspect you need a utility function to deterministically operate over a filesystem to get a sorted list of files. Again, that might not exactly be a virtualtar problem, so keep it in mind for the PK side though.
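
Something like this is the kind of utility I mean; a quick sketch assuming Node's fs/promises (or a compatible fs like the EFS) is passed in. This would live on the PK side, not in virtualtar:

import * as nodeFs from 'node:fs/promises';
import * as path from 'node:path';

// Walk a directory and return a deterministically sorted list of file paths.
// The fs parameter defaults to Node's fs/promises but could be swapped for a
// compatible implementation.
async function listFilesSorted(
  dir: string,
  fs: typeof nodeFs = nodeFs,
): Promise<Array<string>> {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  // Sort by name so traversal order does not depend on the platform
  entries.sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  const files: Array<string> = [];
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      files.push(...(await listFilesSorted(fullPath, fs)));
    } else {
      files.push(fullPath);
    }
  }
  return files;
}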

@aryanjassal
Member Author

After discussing the structure of this project further, we have decided to pursue a functional style instead of a generator style. Rather than being provided a directory and performing all the operations ourselves, we will write small utilities.

For the generator, this utility will take in a path, a type, and optionally the file data. For directories, a single header will be returned. For files, a header will be returned populated with the data size and other metadata.

For the parser, a Uint8Array will be passed in by reference, then a 512-byte chunk will be sliced out of the array and a token will be generated. I can use a DataView here.
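
To make that concrete, here's a minimal sketch of what the parser-side utility could look like; the token shape and field handling are illustrative only, not the final API:

type HeaderToken = {
  type: 'file' | 'directory';
  path: string;
  size: number;
};

function parseHeaderBlock(block: Uint8Array): HeaderToken {
  if (block.length !== 512) throw new Error('Expected a 512-byte block');
  const decoder = new TextDecoder();
  // File name: offset 0, 100 bytes, NUL-padded
  const nameField = block.subarray(0, 100);
  const nameEnd = nameField.indexOf(0);
  const path = decoder.decode(nameField.subarray(0, nameEnd === -1 ? 100 : nameEnd));
  // Size: offset 124, 12 bytes, octal ASCII
  const sizeField = decoder.decode(block.subarray(124, 136)).replace(/\0.*$/, '');
  const size = parseInt(sizeField.trim() || '0', 8);
  // Type flag: offset 156 ('0' is a regular file, '5' is a directory)
  const typeFlag = String.fromCharCode(block[156]);
  return { type: typeFlag === '5' ? 'directory' : 'file', path, size };
}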

Basically, js-virtualtar will consist of two utility functions, one which generates headers and one which parses them into tokens. Is this the direction we want js-virtualtar to go in? It will become difficult to make changes to the structure down the line, so I'm making this clear in advance. What do you think, @tegefaulkes @CMCDragonkai?

@CMCDragonkai
Member

> Basically, js-virtualtar will consist of two utility functions, one which generates headers and one which parses them into tokens. Is this the direction we want js-virtualtar to go in? [...]

You should be able to export multiple functions not just 2. Plus their types.

@aryanjassal
Member Author

There are some operations which would require logging or communicating with the user. For example, a header can be of type directory but can have a non-zero size. While this breaks convention, it is technically allowed and is parsable by some tar software. In cases like this, I would like to log out a warning.

The idea I have is to pass in a parameter for logger. If it is unset, then all logging will be disabled. Otherwise, the logger will be used for logging out messages. Should I implement this?

@aryanjassal
Member Author

I have updated the structure for the generator to expose utility functions which generate the relevant blocks instead. I still haven't implemented proper CI tests, but I have tested this out by archiving a local directory on my system and viewing and unarchiving it using other utilities.

I ran into two major issues when doing this. First, I got some weird issues with the headers containing invalid data, but I then realised that instead of converting the values to radix-8, I was converting them based on the number of bytes the chunk would occupy. This was a typo introduced when extracting that method into its own utility.
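
For reference, the corrected behaviour is just writing the value as radix-8 ASCII, padded to the field width, regardless of how many bytes the value itself occupies; something like this hypothetical helper:

// Illustrative helper, not the library's actual utility: tar numeric fields
// are octal ASCII, zero-padded to the field width and NUL-terminated.
function encodeOctalField(value: number, fieldWidth: number): Uint8Array {
  const octal = value.toString(8).padStart(fieldWidth - 1, '0');
  const out = new Uint8Array(fieldWidth); // final byte stays NUL
  for (let i = 0; i < octal.length; i++) out[i] = octal.charCodeAt(i);
  return out;
}

// encodeOctalField(1024, 12) produces '00000002000' followed by a NUL byte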

Another issue was forgetting that a buffer is a reference to memory on the heap: I was adding buffers to an array (for later writing to a tar file) while reusing the same buffer to write multiple blocks, which overwrote data and corrupted the files.
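
A contrived reproduction of that second bug (this is not the actual generator code, just an illustration of the aliasing):

// Reusing one Uint8Array for every block means the array ends up holding many
// references to the same memory, so later writes clobber earlier "saved" blocks.
const block = new Uint8Array(512);
const blocks: Array<Uint8Array> = [];

block.fill(1); // pretend this is the first header
blocks.push(block);
block.fill(2); // writing the second header overwrites the first
blocks.push(block);
console.log(blocks[0][0]); // 2, not 1: the first block is corrupted

// Fix: copy the block (or allocate a fresh buffer) before saving it
blocks.length = 0;
block.fill(1);
blocks.push(block.slice());
block.fill(2);
blocks.push(block.slice());
console.log(blocks[0][0]); // 1 as expected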

After some debugging, I finally got this working perfectly. After some cleaning up of the Generator and the test, I will start work on the parser.

@aryanjassal
Member Author

After discussing with Brian, the conclusion we have now is that the generator/parser is too low-level to deal with logging, and it would also impact performance. So, if the structure deviates too much from expectation, the parser should throw; if it is a minor issue like the size field being set for directories, that is out of scope for the parser, which just generates tokens. It is the responsibility of whatever is consuming the tokens to analyse whether the tokens contain invalid data and deal with it accordingly. As such, no logger is required for virtualtar. It needs to be as performant as possible.
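
In other words, a check like the following belongs in the consumer, not in virtualtar (the token shape here is just an assumption for illustration):

// Hypothetical token shape; the real parser tokens may differ.
type DirectoryToken = { type: 'file' | 'directory'; path: string; size: number };

function handleToken(
  token: DirectoryToken,
  logger?: { warn: (msg: string) => void },
): void {
  if (token.type === 'directory' && token.size !== 0) {
    // Unconventional but parsable; it is the consumer that decides to warn
    logger?.warn(`Directory ${token.path} has non-zero size ${token.size}`);
  }
  // ...continue handling the token as usual
}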

This probably also means writing benchmarks. I will reference other repos and ask ChatGPT about how to write them.

@aryanjassal
Member Author

Just added simple parsing. I tested parsing by archiving a directory using the generator then parsing using the parser. It was able to do the conversion both ways pretty well.

Next steps include cleaning up the code and bringing it up to library standards, adding proper tests, and adding benchmarks.

@CMCDragonkai
Member

> After discussing with Brian, the conclusion we have now is that the generator/parser is too low-level to deal with logging, and it would also impact performance. [...] As such, no logger is required for virtualtar.

Parsers should just parse. It's not really stateful. Therefore logging is not required. Logging is best needed for stateful systems.

@CMCDragonkai
Member

We are ok using Buffer. Because of the feross/buffer. It's fine.

@CMCDragonkai
Member

In fact Buffer will be faster than Uint8Array.

Please talk to chatgpt about how to improve your buffer speeds with better concatenation and avoiding memory allocation.

@CMCDragonkai
Member

Reduce function calling and jumps in your hot path.

@CMCDragonkai
Member

Some feedback on type naming.

Prefer ObjectVerb. Or GenericSpecific over SpecificGeneric.

@aryanjassal
Member Author

> We are ok using Buffer. Because of the feross/buffer. It's fine.
>
> In fact Buffer will be faster than Uint8Array.

I have investigated feross/buffer and while it is an interesting project, I don't think it works for our use case.

The performance tests show that in the browser, Uint8Array is actually faster in most cases, up to 200% faster than the buffer implementation. For Node usage the difference increases further, with the buffer implementation up to 500% slower than other methods for some operations. Note that while Uint8Array is pretty fast, many operations were up to 200% faster with Node's native Buffer than with Uint8Array.

Due to these reasons, I think that using Uint8Array and DataView, which I have already been doing, is sufficient for this project, and switching to the web buffer won't give me much of a performance boost.

> Please talk to chatgpt about how to improve your buffer speeds with better concatenation and avoiding memory allocation.

I am doing only a single allocation for generating headers. There is no buffer concatenation going on. I wrap the raw bytestream in a DataView whenever I want to interact with it, specifying the underlying buffer and the offset/length, avoiding needless copying.
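
Concretely, the access pattern looks something like this (offsets are the standard ustar ones; the snippet just illustrates the no-copy view):

// One allocation for the header block; views are created over the same memory
// via buffer, byteOffset and byteLength, so no bytes are copied.
const header = new Uint8Array(512);

// View over just the 12-byte size field at offset 124
const sizeView = new DataView(header.buffer, header.byteOffset + 124, 12);
sizeView.setUint8(0, '0'.charCodeAt(0));

// Reading back through the underlying buffer shows the same byte
console.log(header[124] === '0'.charCodeAt(0)); // true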

I am yet to run performance tests or analyse hot paths; that will come when I write the benchmarks.

@CMCDragonkai
Member

> There are some operations which would require logging or communicating with the user. For example, a header can be of type directory but can have a non-zero size. While this breaks convention, it is technically allowed and is parsable by some tar software. In cases like this, I would like to log out a warning.
>
> The idea I have is to pass in a parameter for logger. If it is unset, then all logging will be disabled. Otherwise, the logger will be used for logging out messages. Should I implement this?

Generally for libraries like this, no logging should be done.

Remember there is NO IO here.

So you should be able to communicate purely through construction and deconstruction. It is pure and easy to test.

@CMCDragonkai
Member

> Due to these reasons, I think that using Uint8Array and DataView, which I have already been doing, is sufficient for this project, and switching to the web buffer won't give me much of a performance boost. [...]
>
> I am doing only a single allocation for generating headers. There is no buffer concatenation going on.

Weird, Buffer uses Uint8Array. But if you don't need the complex buffer concatenation and slicing operations, then you don't need it.

However, generally, if you were in fact operating on/mutating the same buffer and slicing out parts, then it would generally be easier. It might depend on how you coded this.

@aryanjassal
Member Author

I have temporarily disabled some tests as they were failing, and I wanted to knock out the higher-level VirtualTar API first.

On that note, I have added a proper API to VirtualTar (without documentation for the time being, but I will update it). Now, all that remains is to use it in actual tests. Once that is done, I will fix up the pending tests, update documentation, and mark it as ready for merging.

For now, I have not properly integrated streaming yet. I will probably use webstreams to do that, as it makes the most sense if I want to use streams. Currently, the API looks like the following.

import VirtualTar from '@matrixai/js-virtualtar';

const generator = new VirtualTar({ mode: 'generate' });
generator.addDirectory('dir');
generator.addFile('file1.txt', { size: 10 }, '1234567890');
generator.addFile('file2.txt', { size: 10 }, new Uint8Array(10));
generator.addFile('file3.txt', { size: 10 }, (write) => {
  // Note this currently doesn't use streams. I plan on changing
  // over to streams soon.
  write('12345');
  write(new Uint8Array(5));
});
const archive = generator.finalize();

const parser = new VirtualTar({ mode: 'parse' });
parser.push(chunk);
parser.push(chunk);

// A token is either a complete (unbuffered) file, a directory, or
// an empty token. A flag on the token signals either that a single
// token has been parsed and we can parse the next one, or that we
// don't have enough data to parse the next token.
const token: ParsedFile | ParsedDirectory | ParsedEmpty = parser.next();

if (token.type === 'empty' && token.awaitingData) parser.push(chunk);

// Parse all available blocks until we need more data
const tokens: Array<ParsedFile | ParsedDirectory> = parser.parseAvailable();

I will probably add streams to data input and output to make it more memory-efficient.

The longest task in this list is fixing the tests, as I need to investigate the raw header data every time a test fails to understand the reason for the failure. Everything else should be quick.

Let me know if this is a good API for virtualtar @CMCDragonkai @tegefaulkes

@CMCDragonkai
Member

> For now, I have not properly integrated streaming yet. I will probably use webstreams to do that, as it makes the most sense if I want to use streams. [...]
>
> Let me know if this is a good API for virtualtar @CMCDragonkai @tegefaulkes

Don't want web stream integration. Keep it to functions and generators for now.

@CMCDragonkai
Member

Can you check that there's no buffer accumulation anywhere?

@aryanjassal
Member Author

After having a discussion with Brian, this is the feedback I received.

Instead of queuing up all the chunks as I process them, I need to create an async generator which generates the chunks. Then, I need to add the generator to a queue which will be lazily processed when needed.

To support streaming, I will have an async generator which constantly consumes these generators as they are queued, yielding the chunks to be streamed by the user. There will be a semaphore to prevent eager execution of the loop and only trigger when new data is available.
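
As a rough sketch of that queueing idea (using a promise-based signal in place of a proper semaphore; all names here are illustrative, not the final implementation):

type ChunkGenerator = AsyncGenerator<Uint8Array, void, void>;

class GeneratorQueue {
  protected queue: Array<ChunkGenerator> = [];
  protected ended = false;
  protected signal?: () => void;

  // Enqueue a lazy producer of chunks; nothing is generated until consumed
  public add(gen: ChunkGenerator): void {
    this.queue.push(gen);
    this.wake();
  }

  // Mark the archive as finished so data() can terminate
  public finalize(): void {
    this.ended = true;
    this.wake();
  }

  protected wake(): void {
    if (this.signal != null) {
      this.signal();
      this.signal = undefined;
    }
  }

  // Consume queued generators in order, yielding chunks as they are produced
  public async *data(): AsyncGenerator<Uint8Array, void, void> {
    while (true) {
      const gen = this.queue.shift();
      if (gen != null) {
        yield* gen;
        continue;
      }
      if (this.ended) return;
      // Sleep until add() or finalize() wakes us up again
      await new Promise<void>((resolve) => {
        this.signal = resolve;
      });
    }
  }
}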

For the parser, callbacks will be provided in the constructor. These callbacks correspond to getting a file, getting a directory, or reaching the end of the archive. For files, the data will be returned as an async generator. The parser API will look like this in the end.

const parser = new VirtualTar({
  mode: 'parse',
  file: async (header, dataGen) => {
    // write header
    const gen = dataGen();
    for await (const chunk of gen) {
      // write each chunk as it is available
    }
  },
  directory: async (header) => {
    // write directory
  },
  end: async (error?) => {
    if (error != null) {
      // something went wrong
    }
    // otherwise, the archive ended just fine
  },
});

// Run this to enqueue more data as needed
await parser.write(data);

This will be the new API for the generator. The method of streaming the data in will remain mostly consistent, but instead of a callback, it will be a generator which yields chunks of data. It will still retain the option of passing a buffer or a string as the data, but the base case remains a generator yielding chunks.

const generator = new VirtualTar({ mode: 'generate' });

const p = (async () => {
  generator.addFile();
  generator.addDirectory();
  generator.addFile();
  // Wait for the queued operations to finish first
  await generator.settled();
  generator.finalize();
}());

for await (const chunk of generator.data()) {
  // each chunk can now be streamed as it becomes available
}
await p;

This aims to minimise buffering the data as much as possible, evaluating lazily.

@aryanjassal
Member Author

Now, the only pending task for this PR is to add and update documentation. Once the documentation has been updated, this PR can be marked as completed.

@aryanjassal
Member Author

This PR should now be all finished. After a final review, this can be merged.

@CMCDragonkai
Member

Why no merge? Just merge and start testing asap.

@CMCDragonkai
Member

Test the package release process. Ask @brynblack how it works. Or read it yourself, or ask the R1 AI to explain to you how it works in the .github repo.

@tegefaulkes tegefaulkes left a comment

I'm skipping test review for now until my comments have been addressed.

@aryanjassal
Member Author

To allow for rapid development, I will merge this PR if the CI passes. Then, any point fixes can be done later on with new releases as needed.

@aryanjassal aryanjassal merged commit b8bf47e into master Mar 19, 2025
5 checks passed

Successfully merging this pull request may close these issues.

Update implementation to use compositional generator/parser pairs
3 participants