feat: improve passthrough method to use an asynchronous pipeline #132

Open
wants to merge 1 commit into
base: master

Conversation

@Dafyh commented Aug 21, 2024

Improvement of the passthrough method.

  • Wait for the pipeline to finish executing before reading the results from ClamScan.
  • Listening to the socket events is now done in the _flush method instead of _transform, because the stream chunks are transmitted over TCP but are not analyzed immediately: ClamAV waits for the complete file before sending a response. In fact, the data event from the socket is emitted only once per connection, to deliver the scan results (a sketch of this is shown below). See the first paragraph on clamdscan in the documentation.
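
A rough sketch of that idea (assumed names such as _clamdSocket and _processResult are hypothetical; this is not the PR's exact code):

// The clamd reply arrives in a single "data" event after the whole file
// has been streamed, so the result is awaited in _flush rather than in
// _transform.
async _flush(cb) {
  try {
    const response = await new Promise((resolve, reject) => {
      this._clamdSocket.once("data", resolve);  // single reply per connection
      this._clamdSocket.once("error", reject);
    });
    this.result = this._processResult(response); // hypothetical helper
    cb();
  } catch (err) {
    cb(err);
  }
}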

New implementation:

const { pipeline } = require("node:stream/promises");

const av = await clamscan.passthrough();

// The pipeline resolves once the input has been fully streamed through the
// scanner and written to the output; the scan results are then available.
await pipeline(input, av, output);
const { isInfected, viruses, timeout } = av.result;

@kylefarris (Owner)

@Dafyh Thanks for the PR! It looks like you've put a lot of thought into the code and have done all the work necessary (updated docs, wrote tests, etc.).

I did a bit of research since the link you provided doesn't necessarily state that "the data event from the socket is emitted only once per connection to deliver the scan results" anywhere that I could see. It does allude to the fact that clamdscan (their client) behaves that way, but not necessarily clamd itself. But, alas, it does seem that the daemon indeed waits for the EOF before performing the actual scan, which is unfortunate but also understandable now that I have a more thorough understanding of how ClamAV actually scans files.

With all that being said... it does seem that your PR does away with the "forking" aspect of this code, which simultaneously piped data to the desired output and to ClamAV. In other words, what we had before was a parallel pipeline but now we have a serial pipeline. This won't matter in most cases (small files), but for large files, the data not being sent along to the output until after the file has been fully scanned could significantly increase the time to completion, which could degrade the end users' experience.
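
For reference, a minimal sketch of what that parallel/forking behaviour looks like (assumed names, not the library's exact code): each chunk is written to the ClamAV socket and pushed downstream at the same time, so the output (e.g. an S3 upload) receives data while the scan is still in progress.

// Inside a Transform stream: fork each chunk to clamd and to the next
// stream in the pipe, instead of waiting for the scan to finish first.
_transform(chunk, encoding, cb) {
  this._clamdSocket.write(chunk); // hypothetical socket to the clamd daemon
  this.push(chunk);               // forward the same chunk to the output
  cb();
}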

For example, in the case of, say, a 3 GB file being uploaded to your server and then off to S3, it would have to be fully scanned before being uploaded to S3, whereas in the current implementation, the file is uploaded to S3 at the same time and one would merely need to remove the file from S3 if a virus was detected. A 3 GB file takes quite a while to scan and there's only a very small chance of a virus being detected. It would be unfortunate to have to wait until that completed before then having to wait another decent chunk of time to upload the 3 GB to S3 (when it could have already been done).

Does the current implementation prevent the usage of the pipeline function?

@Dafyh (Author) commented Aug 21, 2024

Hello, thank you for your response 😊.

In reality, the functionality is roughly the same as before, because the file is sent both to ClamAV and to the output (S3 or disk I/O). By the time execution reaches _flush, the file has already been written by Node. For example, you can use timers/promises and add a setTimeout on the first line of the _flush method, and you'll see that, despite the delay, the file has already been written to the output:

const timers = require("timers/promises");
// ...
// ...
async _flush(cb) {
  // Artificial delay: even after 10 seconds, the output stream has
  // already received and written everything piped through this transform.
  await timers.setTimeout(10000);

  // ...
}

I implemented this because exceptions thrown in event callbacks cannot be caught in a try/catch block, which can cause issues, especially for server responses.
In the previous case, even using an asynchronous pipeline (await pipeline(...)), the only way to handle errors was within the event callbacks, not in the main execution script. In other words, I can't throw errors to the user.
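
A minimal illustration of that point (assumed stream names, not code from the library):

try {
  av.on("scan-complete", (result) => {
    // This throw happens on a later tick, so the surrounding
    // try/catch has already exited and cannot catch it.
    if (result.isInfected) throw new Error("File is infected");
  });
  input.pipe(av).pipe(output);
} catch (err) {
  // Never reached for the error thrown inside the callback above.
}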

However, there's nothing stopping you from using events to avoid waiting for the pipeline, because the finish event is triggered after _flush:

const av = await clamscan.passthrough();

output.on("finish", () => {
  console.log(av.result);
});

input.pipe(av).pipe(output);

What do you think?

@Dafyh (Author) commented Aug 22, 2024

For now, I am using clamscan like this in my code to handle exceptions in the main script:

const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 60_000);

const av = this.#clamav.passthrough();
const result: any = await new Promise((res, rej) => {
  // Resolve with the scan results once ClamAV has finished.
  av.on("scan-complete", (result) => {
    clearTimeout(timeoutId);
    res(result);
  });

  // Abort the scan if it takes longer than 60 seconds.
  controller.signal.addEventListener("abort", () => {
    data.destroy();
    rej(new Error("Scan aborted"));
  });

  data.pipe(av).pipe(createWriteStream(tempPath));
});

const { isInfected, viruses, timeout } = result;

But I wanted to improve this to make it more readable:

const { pipeline } = require("node:stream/promises");
const { createWriteStream } = require("node:fs");

const av = await this.#clamav.passthrough();
const signal = AbortSignal.timeout(60_000);

await pipeline(data, av, createWriteStream(tempPath), { signal });
const { isInfected, viruses, timeout } = av.result;

// OR

const av = await this.#clamav.passthrough();
const output = createWriteStream(tempPath);

output.on("finish", () => {
  const { isInfected, viruses, timeout } = av.result;
  // ...
});

data.pipe(av).pipe(output);
// Do something while you wait for the results
