Description
Is your feature request related to a problem? Please describe.
When uploading a large file to Google Drive via the Google Drive API, it is recommended to upload in chunks, so that sending can be resumed if transfer of a particular chunk fails (Google API docs).
After transfer of each chunk, the API returns the MD5 hash of all data transferred to date (i.e. of all chunks up to and including the last one). NB This is not documented, but it appears in an x-range-md5 HTTP response header.
It would be useful to be able to verify that hash after transfer of each chunk, in order to know if a chunk has been corrupted in transmission. It would then be possible to transfer the chunk again.
This is not feasible in Node at present. You can verify the final hash after the very last chunk and make sure it matches for the entire file; that ensures data integrity. But if it doesn't match, you don't know which chunk of the file was corrupted, and you have to start the upload again from the beginning (expensive when the file is 100 GB!).
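For context, this is the limitation today: once .digest() has been called, a Hash object is finalized and any further .update() call throws, so the only way to get an intermediate digest is to feed all the data into a second hash from scratch.

const crypto = require('crypto');

const hash = crypto.createHash('md5');
hash.update('chunk 1');
hash.digest('hex');     // intermediate digest of chunk 1
hash.update('chunk 2'); // throws: the hash has already been finalized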
Describe the solution you'd like
Some way to calculate "rolling" hashes, i.e. call hash.digest() but then still be able to do further calls to hash.update() and call hash.digest() again.
Possible ways to achieve this:
1. Keep the internal state of the hash after a call to .digest(), so it can be reused.
2. Add a hash.copy() method which clones the hash, so you can call .digest() on the clone and still retain a "live" hash which you can continue to .update().
3. As (1), but only enable this feature if crypto.createHash() is called with a reuseable option (see the sketch after this list).
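To make option (3) concrete, here is a rough sketch of how a reuseable option might look. This is purely hypothetical; the option name and behaviour are part of the proposal, not an existing Node API:

const crypto = require('crypto');

// Hypothetical: `reuseable` keeps the internal state alive across .digest() calls.
const hash = crypto.createHash('md5', { reuseable: true });

hash.update('chunk 1');
const digest1 = hash.digest('hex'); // MD5 of chunk 1

hash.update('chunk 2');             // allowed, because the hash was not finalized
const digest2 = hash.digest('hex'); // MD5 of chunk 1 + chunk 2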
@sam-github raised the possibility of a .copy() method in #25857 (comment).
Here's how that would work for my use case:
// NB Simplified code
const crypto = require('crypto');
const fs = require('fs');
const { Transform } = require('stream');

const CHUNK_SIZE = 256 * 1024; // 256 KiB

async function upload(path, size) {
  const hash = crypto.createHash('md5');
  for (let start = 0; start < size; start += CHUNK_SIZE) {
    const end = Math.min(start + CHUNK_SIZE, size) - 1;
    const chunkStream = fs.createReadStream(path, {start, end})
      .pipe(new Transform({
        transform(data, encoding, cb) {
          hash.update(data);
          this.push(data);
          cb();
        }
      }));
    const md5FromApi = await transferChunkToGoogleDrive(chunkStream, start, end);
    const md5Actual = hash.copy().digest('hex');
    if (md5FromApi !== md5Actual) {
      // Rather than throwing, we could transfer the last chunk again.
      // That logic is omitted to keep the example short.
      throw new Error(`Transfer failed on chunk ${start}-${end}`);
    }
  }
}
The hash.copy() call will no doubt have a performance penalty, but in this case that is outweighed by the cost of having to start an upload again from the beginning when the file is large.
Describe alternatives you've considered
An alternative is to use a JS implementation of MD5, where you can access the internal state of a hash and clone it. I suspect performance would be much worse than Node's native methods though.
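For illustration, a pure-JS approach might look roughly like the sketch below. JsMd5, getState() and setState() are hypothetical names; the exact API depends on the library chosen:

// Hypothetical pure-JS MD5 whose internal state can be read and restored.
const JsMd5 = require('some-js-md5-module'); // hypothetical module

const hash = new JsMd5();

function intermediateDigest(chunk) {
  hash.update(chunk);
  const clone = new JsMd5();
  clone.setState(hash.getState()); // copy the running state into a clone
  return clone.digest('hex');      // finalize only the clone
}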