retrieval issues: TreeHash doesn't match #115
Comments
Hello.
That is not too rare, yet nobody else has noticed it. Could this be a hardware issue (memory, network)? Did you test the memory, and did you try a different machine? Anyway, could you pls give
|
That's a good thought about testing the memory -- I'll schedule a Here are the details you requested: 11:12 AM [root@storage01 ~] more /etc/redhat-release Platform: Characteristics of this binary (from libperl): mt-aws-glacier version: 1.120
Tim Irvin |
I already tested/developed with that version of perl. Don't see anything suspicious in module versions too. Let's wait memory test. |
Hi Victor, Well, no joy on the memory check. The RAM checked out fine on the server. I have a theory, see if you think there could be any correlation. The machine that is doing the restores is also pushing new content to What we do when we schedule something for restore is create a new journal
using the new journal file. Then after 4 hours we run: mtglacier restore-completed .... on that journal file to download the archive. If the file fails to be It appears that the failure in downloading occurs mostly (perhaps solely, Tim
Tim Irvin |
ok.. let me think.. |
Would it be true to say that the failure happens when overall machine load is high (i.e. not much free memory, a lot of disk activity, etc.)? Also, the drive where you download files: is it a network drive? What technology? |
I think I can eliminate the theory about other mtglacier apps running at Looking at the machine load, it's not very high. The machine's load average is 0.33 Here is the current iostat on that partition (5 second intervals): Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
Tim Irvin |
I tried to simulate the error. I only get this error if data corruption happens.
Other errors, like data truncation, wrong HTTP headers, or a failure to create temp files during download, do not produce this error (i.e. everything works as expected). If you use HTTP, could you try HTTPS, and if you use HTTPS, could you try HTTP? Let's see how close the bug is to the networking library. |
We use HTTP, I had to install LWP::Protocol::https from CPAN, along with a The first test succeeded after switching, I'm running some more test Thanks for all the help! Tim
Tim Irvin |
Yes. Just in case, the deploy instructions for CentOS 6 are here: https://github.com/vsespb/mt-aws-glacier#rhelcentos-6 |
Thanks. Since we didn't use HTTPS I hadn't gone through those steps
Tim Irvin |
No errors after 8 restores. What does this tell us? Besides that I'll leave it on HTTPS for now. It would be nice to go back to unencrypted at some point, but this workaround is fine for now. Tim Irvin |
The bug is somewhere in the old versions of LWP::UserAgent shipped with RHEL 6 (not necessarily HTTP-only; when you installed HTTPS support you also updated the HTTP part). However, maybe I'll install CentOS 6 on a VM and try to stress-test it (I have a script for that). Let's leave the ticket open for now. |
In retrieving files from glacier using mtglacier, we often get an error that the TreeHash doesn't match. Our job is configured to retry the retrieval when it fails, so it may get this error 10 or 15 times, and then finally a retrieval will succeed. This happens about once every dozen or so retrieval requests.
The offset in the file where the TreeHash mismatch occurs is in a different location on each failure, and since the file eventually is restored it appears to not be corrupt at glacier. So, something is wrong with the transfer or the TreeHash calculation routine.
ERROR (child 26241): TreeHash for received segment of file "/data/glacier-stg/86346" (position 2550136832, size 134217728) does not match. TreeHash reported by server af0b2467868b783ab298faa03f58133cdff7b88309c0fcb928bae650420d6a8f, Calculated TreeHash bfed84480067e86258b3e14026a8f0d2362286459e31389674b632af9b66348e
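For readers unfamiliar with what this error compares, here is a minimal Python sketch of a Glacier-style SHA-256 tree hash: each 1 MiB chunk is hashed, and neighbouring digests are hashed together pairwise until one root remains. This is not mtglacier's code (mtglacier is written in Perl); the names `tree_hash` and `segment_matches` are made up for illustration. It only shows why a single corrupted byte anywhere in a received segment would make the calculated TreeHash differ from the one the server reports.

```python
import hashlib

CHUNK = 1024 * 1024  # tree-hash leaves are 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """Return the hex SHA-256 tree hash of `data` (illustrative helper)."""
    # Leaf level: SHA-256 of every 1 MiB chunk; empty input hashes as b"".
    level = [
        hashlib.sha256(data[i:i + CHUNK]).digest()
        for i in range(0, len(data), CHUNK)
    ] or [hashlib.sha256(b"").digest()]

    # Hash neighbouring digests pairwise until a single root is left.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                parents.append(hashlib.sha256(level[i] + level[i + 1]).digest())
            else:
                parents.append(level[i])  # unpaired node moves up unchanged
        level = parents
    return level[0].hex()


def segment_matches(segment: bytes, reported_hex: str) -> bool:
    """Compare the locally calculated tree hash with the reported one."""
    return tree_hash(segment) == reported_hex.lower()


if __name__ == "__main__":
    good = bytes(3 * CHUNK)           # stand-in for a downloaded segment
    reported = tree_hash(good)        # stand-in for the server's value
    bad = bytearray(good)
    bad[CHUNK + 5] ^= 0x01            # flip one bit, as in-transit corruption might
    print(segment_matches(good, reported))        # intact segment
    print(segment_matches(bytes(bad), reported))  # corrupted segment
```

Under these assumptions, the intact segment matches and the corrupted one does not, which is consistent with the observation in the thread that only data corruption reproduces the error.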