feat: improve passthrough method to use an asynchronous pipeline #132

Open
wants to merge 1 commit into
base: master

Conversation

@Dafyh commented Aug 21, 2024

Improvement of the passthrough method.

  • Wait for the pipeline to finish executing before reading the results from ClamScan.
  • Listening to the socket events is now done in the _flush method instead of _transform, because the stream chunks are transmitted over TCP but are not analyzed immediately: ClamAV waits for the complete file before sending a response. In fact, the data event from the socket is emitted only once per connection, to deliver the scan results (a sketch of this is shown below). See the first paragraph on clamdscan in the documentation.
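
A rough sketch of that idea (assumed names such as _clamdSocket and _processResult are hypothetical; this is not the PR's exact code):

// The clamd reply arrives in a single "data" event after the whole file
// has been streamed, so the result is awaited in _flush rather than in
// _transform.
async _flush(cb) {
  try {
    const response = await new Promise((resolve, reject) => {
      this._clamdSocket.once("data", resolve);  // single reply per connection
      this._clamdSocket.once("error", reject);
    });
    this.result = this._processResult(response); // hypothetical helper
    cb();
  } catch (err) {
    cb(err);
  }
}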

New implementation:

const { pipeline } = require("node:stream/promises");

const av = await clamscan.passthrough();

// The pipeline resolves once the input has been fully streamed through the
// scanner and written to the output; the scan results are then available.
await pipeline(input, av, output);
const { isInfected, viruses, timeout } = av.result;

@kylefarris (Owner)

@Dafyh Thanks for the PR! It looks like you've put a lot of thought into the code and have done all the work necessary (updated docs, wrote tests, etc.).

I did a bit of research since the link you provided doesn't necessarily state that "the data event from the socket is emitted only once per connection to deliver the scan results" anywhere that I could see. It does allude to the fact that clamdscan (their client) behaves that way, but not necessarily clamd itself. But, alas, it does seem that the daemon indeed waits for the EOF before performing the actual scan, which is unfortunate but also understandable now that I have a more thorough understanding of how ClamAV actually scans files.

With all that being said... it does seem that your PR does away with the "forking" aspect of this code, which simultaneously piped data to the desired output and to ClamAV. In other words, what we had before was a parallel pipeline but now we have a serial pipeline. This won't matter in most cases (small files), but for large files, the data not being sent along to the output until after the file has been fully scanned could significantly increase the time to completion, which could degrade the end users' experience.
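
For reference, a minimal sketch of what that parallel/forking behaviour looks like (assumed names, not the library's exact code): each chunk is written to the ClamAV socket and pushed downstream at the same time, so the output (e.g. an S3 upload) receives data while the scan is still in progress.

// Inside a Transform stream: fork each chunk to clamd and to the next
// stream in the pipe, instead of waiting for the scan to finish first.
_transform(chunk, encoding, cb) {
  this._clamdSocket.write(chunk); // hypothetical socket to the clamd daemon
  this.push(chunk);               // forward the same chunk to the output
  cb();
}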

For example, in the case of, say, a 3 GB file being uploaded to your server and then off to S3, it would have to be fully scanned before being uploaded to S3, whereas in the current implementation, the file is uploaded to S3 at the same time and one would merely need to remove the file from S3 if a virus was detected. A 3 GB file takes quite a while to scan and there's only a very small chance of a virus being detected. It would be unfortunate to have to wait until that completed before then having to wait another decent chunk of time to upload the 3 GB to S3 (when it could have already been done).

Does the current implementation prevent the usage of the pipeline function?

@Dafyh (Author) commented Aug 21, 2024

Hello, thank you for your response 😊.

In reality, the functionality is roughly the same as before, because the file is sent both to ClamAV and to the output (S3 or disk I/O). By the time execution reaches _flush, the file has already been written by Node. For example, you can use timers/promises and add a setTimeout on the first line of the _flush method, and you'll see that, despite the delay, the file has already been written to the output:

const timers = require("timers/promises");
// ...
// ...
async _flush(cb) {
  // Artificial delay: even after 10 seconds, the output stream has
  // already received and written everything piped through this transform.
  await timers.setTimeout(10000);

  // ...
}

I implemented this because exceptions thrown in event callbacks cannot be caught in a try/catch block, which can cause issues, especially for server responses.
In the previous case, even using an asynchronous pipeline (await pipeline(...)), the only way to handle errors was within the event callbacks, not in the main execution script. In other words, I can't throw errors to the user.
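
A minimal illustration of that point (assumed stream names, not code from the library):

try {
  av.on("scan-complete", (result) => {
    // This throw happens on a later tick, so the surrounding
    // try/catch has already exited and cannot catch it.
    if (result.isInfected) throw new Error("File is infected");
  });
  input.pipe(av).pipe(output);
} catch (err) {
  // Never reached for the error thrown inside the callback above.
}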

However, there's nothing stopping you from using events to avoid waiting for the pipeline, because the finish event is triggered after _flush:

const av = await clamscan.passthrough();

output.on("finish", () => {
  console.log(av.result);
});

input.pipe(av).pipe(output);

What do you think?

@Dafyh (Author) commented Aug 22, 2024

For now, I am using clamscan like this in my code to handle exceptions in the main script:

const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 60_000);

const av = this.#clamav.passthrough();
const result: any = await new Promise((res, rej) => {
  // Resolve with the scan results once ClamAV has finished.
  av.on("scan-complete", (result) => {
    clearTimeout(timeoutId);
    res(result);
  });

  // Abort the scan if it takes longer than 60 seconds.
  controller.signal.addEventListener("abort", () => {
    data.destroy();
    rej(new Error("Scan aborted"));
  });

  data.pipe(av).pipe(createWriteStream(tempPath));
});

const { isInfected, viruses, timeout } = result;

But I wanted to improve this to make it more readable:

const { pipeline } = require("node:stream/promises");
const { createWriteStream } = require("node:fs");

const av = await this.#clamav.passthrough();
const signal = AbortSignal.timeout(60_000);

await pipeline(data, av, createWriteStream(tempPath), { signal });
const { isInfected, viruses, timeout } = av.result;

// OR

const av = await this.#clamav.passthrough();
const output = createWriteStream(tempPath);

output.on("finish", () => {
  const { isInfected, viruses, timeout } = av.result;
  // ...
});

data.pipe(av).pipe(output);
// Do something while you wait for the results
