
Check blob hash #942

Open · wants to merge 10 commits into main

Conversation

@Nicqu commented Nov 10, 2023

Purpose

This feature eliminates the need for local MD5 hash files. The hash of each local document is compared against the hash stored for that document in the remote blob storage (see the sketch after the bullets below). This also paves the way for automating document uploads later, since no MD5 files need to be committed.

  • Check for existing files with the hash from the blob storage instead of local files
  • Overwrite documents if the hash value has changed (so you always have the latest version of the document)
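
A minimal sketch of the hash comparison described above, assuming the azure-storage-blob SDK; the function and variable names are illustrative, not the actual PR code:

```python
import hashlib

from azure.storage.blob import BlobServiceClient


def needs_upload(service: BlobServiceClient, container: str, local_path: str, blob_name: str) -> bool:
    """Return True when the local file is new or its content differs from the remote blob."""
    hasher = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            hasher.update(chunk)

    blob_client = service.get_blob_client(container, local_path if blob_name is None else blob_name)
    if not blob_client.exists():
        return True  # never uploaded before

    # Content-MD5 stored by the service at upload time (may be None for chunked uploads).
    remote_md5 = blob_client.get_blob_properties().content_settings.content_md5
    return remote_md5 is None or bytes(remote_md5) != hasher.digest()
```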

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[x] Documentation content changes
[ ] Other... Please describe:

How to Test

  • run ./scripts/prepdocs.sh or ./scripts/prepdocs.ps1 once to upload the docs
  • run ./scripts/prepdocs.sh or ./scripts/prepdocs.ps1 again and verify that the unchanged docs are skipped

What to Check

Verify that the following is true:

  • No local hash files are created

@Nicqu (Author) commented Nov 10, 2023

@microsoft-github-policy-service agree

@pamelafox (Collaborator)

This seems like an overall better approach, as I've been running into issues with the md5 approach when I switch environments (since the md5 files remain). I'd love for @tonybaloney to take a look since he wrote the original md5 approach.

(Review thread on docs/data_ingestion.md, marked outdated and resolved.)
@pamelafox (Collaborator)

I checked this out last night and did a bunch of experiments with different docs. I've done a bit of reformatting/rewording of the print() statements, which I'll push to this branch shortly.

Unfortunately, I ran into an issue with a large document (55 MB): it had no MD5. Apparently this is a known issue/feature with documents that are uploaded in chunks:
Azure/azure-storage-python#411

I see a few approaches:

  1. For large documents, download the file and manually compute the MD5 hash locally (sketched after this list). The obvious drawback is that you have to download a large document, but download speeds are at least typically better than upload speeds.
  2. Compute the MD5 hash before uploading the document, and manually specify it. That is what the issue suggests, but it has the drawback that it will be harder for existing developers to use this code: they will get an error unless they re-upload the document.
  3. Fall back to a local MD5 hash for those documents. That seems confusing, though.
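
A hedged sketch of approach 1 with the azure-storage-blob API (the helper name and structure are my own, not the PR code), falling back to downloading the blob whenever the service did not store an MD5:

```python
import hashlib

from azure.storage.blob import BlobClient


def remote_md5(blob_client: BlobClient) -> bytes:
    """MD5 of the blob content, computed from a download when the service stored none."""
    props = blob_client.get_blob_properties()
    stored = props.content_settings.content_md5
    if stored:  # small blobs: the service computed and stored the MD5 at upload time
        return bytes(stored)
    # Large blobs uploaded in chunks have no stored MD5 (Azure/azure-storage-python#411),
    # so hash the downloaded content instead.
    hasher = hashlib.md5()
    for chunk in blob_client.download_blob().chunks():
        hasher.update(chunk)
    return hasher.digest()
```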


@Nicqu (Author) commented Nov 29, 2023

I think approach 1 makes the most sense. If you like, I can extend the PR.

@pamelafox (Collaborator)

I discussed this with @tonybaloney and we suggest a combination of the approaches, and also removing our reliance on the built-in MD5 given its flakiness. So that means:

  1. When first uploading a file, compute MD5 locally and set it as a custom property on the blob.

  2. When looking at blob properties to decide whether to re-upload, if it doesn't have the MD5, then download the file, compute the MD5 and re-upload it.

That way, this approach will work for developers that have checked out the repository before, and also won't incur unneeded downloads for developers using large files in the future.

That's a larger change though, so let us know if you don't have the time to take it on. Thank you so much!
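
A minimal sketch of that combined approach, assuming azure-storage-blob; the metadata key "md5" and the helper names are assumptions, and the backfill here sets blob metadata rather than re-uploading the whole file:

```python
import hashlib

from azure.storage.blob import BlobClient


def file_md5(path: str) -> str:
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def upload_with_md5(blob_client: BlobClient, path: str) -> None:
    # Step 1: compute the MD5 locally and store it as a custom metadata property on the blob.
    with open(path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True, metadata={"md5": file_md5(path)})


def stored_md5(blob_client: BlobClient) -> str:
    # Step 2: when deciding whether to re-upload, read our metadata property.
    props = blob_client.get_blob_properties()
    md5 = props.metadata.get("md5")
    if md5 is None:
        # Blob was uploaded before this change: download it, compute the MD5,
        # and backfill the metadata so future runs skip the download.
        hasher = hashlib.md5()
        for chunk in blob_client.download_blob().chunks():
            hasher.update(chunk)
        md5 = hasher.hexdigest()
        blob_client.set_blob_metadata({"md5": md5})
    return md5
```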
