retrieval issues: TreeHash doesn't match #115

irvintim · 2015-09-15T19:16:08Z

In retrieving files from glacier using mtglacier, we often get an error that the TreeHash doesn't match. Our job is configured to retry the retrieval when it fails, so it may get this error 10 or 15 times, and then finally a retrieval will succeed. This happens about once every dozen or so retrieval requests.

The offset in the file where the TreeHash mismatch occurs is in a different location on each failure, and since the file eventually is restored it appears to not be corrupt at glacier. So, something is wrong with the transfer or the TreeHash calculation routine.

ERROR (child 26241): TreeHash for received segment of file "/data/glacier-stg/86346" (position 2550136832, size 134217728) does not match. TreeHash reported by server af0b2467868b783ab298faa03f58133cdff7b88309c0fcb928bae650420d6a8f, Calculated TreeHash bfed84480067e86258b3e14026a8f0d2362286459e31389674b632af9b66348e

vsespb · 2015-09-15T19:37:58Z

Hello.

This happens about once every dozen or so retrieval requests.

Not too rare and nobody else noticed that.

Could this be hardware issue (memory, network)? Did you test memory, did you tried different machine?

Anyway, could you pls give

your OS, distro and version
perl -V (notice capital "V")
mtglacier version
mtglacier --version output

irvintim · 2015-09-15T19:56:04Z

That's a good thought about testing the memory -- I'll schedule a
maintenance window.

Here are the details you requested:

11:12 AM [root@storage01 ~] more /etc/redhat-release
CentOS release 6.6 (Final)
12:54 PM [root@storage01 ~] perl -V
Summary of my perl5 (revision 5 version 10 subversion 1) configuration:

Platform:
osname=linux, osvers=2.6.32-220.el6.x86_64,
archname=x86_64-linux-thread-multi
uname='linux c6b9.bsys.dev.centos.org 2.6.32-220.el6.x86_64 #1 smp tue
dec 6 19:48:22 gmt 2011 x86_64 x86_64 x86_64 gnulinux '
config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
-mtune=generic -DDEBUGGING=-g -Dversion=5.10.1 -Dmyhostname=localhost
-Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr
-Dvendorprefix=/usr -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5
-Dprivlib=/usr/share/perl5 -Darchlib=/usr/lib64/perl5
-Dvendorlib=/usr/share/perl5/vendor_perl
-Dvendorarch=/usr/lib64/perl5/vendor_perl -Dinc_version_list=5.10.0
-Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64
/usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles
-Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog
-Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005
-Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto
-Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto
-Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto
-Dscriptdir=/usr/bin -Dusesitecustomize'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing
-pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic',
cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include'
ccversion='', gccversion='4.4.7 20120313 (Red Hat 4.4.7-16)',
gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='gcc', ldflags =' -fstack-protector'
libpth=/usr/local/lib64 /lib64 /usr/lib64
libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=, so=so, useshrplib=true, libperl=libperl.so
gnulibc_version='2.12'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E
-Wl,-rpath,/usr/lib64/perl5/CORE'
cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic'

Characteristics of this binary (from libperl):
Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
USE_64_BIT_ALL
USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
USE_PERLIO USE_REENTRANT_API USE_SITECUSTOMIZE
Built under linux
Compiled at Jul 24 2015 01:51:51
@inc:
/usr/local/lib64/perl5
/usr/local/share/perl5
/usr/lib64/perl5/vendor_perl
/usr/share/perl5/vendor_perl
/usr/lib64/perl5
/usr/share/perl5
.
12:54 PM [root@storage01 ~] uname -a
Linux storage01.sfo.metainterfaces.com 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed
Jan 28 21:11:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
12:54 PM [root@storage01 ~] mtglacier --version
MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/
Version 1.120

mt-aws-glacier version: 1.120
Perl Version: 5.010001
AutoLoader 5.68 /usr/share/perl5/AutoLoader.pm
Carp 1.11 /usr/share/perl5/Carp.pm
Carp::Heavy undef /usr/share/perl5/Carp/Heavy.pm
Class::Struct 0.63 /usr/share/perl5/Class/Struct.pm
Config undef /usr/lib64/perl5/Config.pm
Cwd 3.3 /usr/lib64/perl5/Cwd.pm
Digest::SHA 5.47 /usr/lib64/perl5/Digest/SHA.pm
Digest::base 1.16 /usr/share/perl5/Digest/base.pm
DynaLoader 1.10 /usr/lib64/perl5/DynaLoader.pm
Encode 2.35 /usr/lib64/perl5/Encode.pm
Encode::Alias 2.12 /usr/lib64/perl5/Encode/Alias.pm
Encode::Config 2.05 /usr/lib64/perl5/Encode/Config.pm
Encode::Encoding 2.05 /usr/lib64/perl5/Encode/Encoding.pm
Errno 1.11 /usr/lib64/perl5/Errno.pm
Exporter 5.63 /usr/share/perl5/Exporter.pm
Exporter::Heavy 5.63 /usr/share/perl5/Exporter/Heavy.pm
Fcntl 1.06 /usr/lib64/perl5/Fcntl.pm
File::Basename 2.77 /usr/share/perl5/File/Basename.pm
File::Find 1.14 /usr/share/perl5/File/Find.pm
File::Path 2.08 /usr/share/perl5/File/Path.pm
File::Spec 3.3 /usr/share/perl5/File/Spec.pm
File::Spec::Unix 3.3 /usr/share/perl5/File/Spec/Unix.pm
File::Temp 0.22 /usr/share/perl5/File/Temp.pm
File::stat 1.01 /usr/share/perl5/File/stat.pm
FileHandle 2.02 /usr/share/perl5/FileHandle.pm
Getopt::Long 2.38 /usr/share/perl5/Getopt/Long.pm
HTTP::Date 5.831 /usr/share/perl5/HTTP/Date.pm
HTTP::Headers 5.827 /usr/share/perl5/HTTP/Headers.pm
HTTP::Message 5.832 /usr/share/perl5/HTTP/Message.pm
HTTP::Request 5.827 /usr/share/perl5/HTTP/Request.pm
HTTP::Response 5.824 /usr/share/perl5/HTTP/Response.pm
HTTP::Status 5.817 /usr/share/perl5/HTTP/Status.pm
I18N::Langinfo 0.02 /usr/lib64/perl5/I18N/Langinfo.pm
IO 1.25 /usr/lib64/perl5/IO.pm
IO::File 1.14 /usr/lib64/perl5/IO/File.pm
IO::Handle 1.28 /usr/lib64/perl5/IO/Handle.pm
IO::Pipe 1.13 /usr/lib64/perl5/IO/Pipe.pm
IO::Seekable 1.1 /usr/lib64/perl5/IO/Seekable.pm
IO::Select 1.17 /usr/lib64/perl5/IO/Select.pm
JSON::XS 2.27 /usr/lib64/perl5/JSON/XS.pm
LWP 5.833 /usr/share/perl5/LWP.pm
LWP::MemberMixin undef /usr/share/perl5/LWP/MemberMixin.pm
LWP::Protocol 5.829 /usr/share/perl5/LWP/Protocol.pm
LWP::UserAgent 5.833 /usr/share/perl5/LWP/UserAgent.pm
List::Util 1.21 /usr/lib64/perl5/List/Util.pm
MIME::Base64 3.08 /usr/lib64/perl5/MIME/Base64.pm
POSIX 1.17 /usr/lib64/perl5/POSIX.pm
PerlIO::encoding 0.11 /usr/lib64/perl5/PerlIO/encoding.pm
Scalar::Util 1.21 /usr/lib64/perl5/Scalar/Util.pm
SelectSaver 1.02 /usr/share/perl5/SelectSaver.pm
Storable 2.20 /usr/lib64/perl5/Storable.pm
Symbol 1.07 /usr/share/perl5/Symbol.pm
Tie::Hash 1.03 /usr/share/perl5/Tie/Hash.pm
Time::Local 1.1901 /usr/share/perl5/Time/Local.pm
URI 1.40 /usr/share/perl5/URI.pm
URI::Escape 3.29 /usr/share/perl5/URI/Escape.pm
XSLoader 0.10 /usr/lib64/perl5/XSLoader.pm
attributes 0.09 /usr/share/perl5/attributes.pm
base 2.14 /usr/share/perl5/base.pm
bytes 1.03 /usr/share/perl5/bytes.pm
common::sense 3.5 /usr/share/perl5/vendor_perl/common/sense.pm
constant 1.17 /usr/share/perl5/constant.pm
integer 1.00 /usr/share/perl5/integer.pm
overload 1.07 /usr/share/perl5/overload.pm
re 0.09 /usr/lib64/perl5/re.pm
strict 1.04 /usr/share/perl5/strict.pm
utf8 1.07 /usr/share/perl5/utf8.pm
vars 1.01 /usr/share/perl5/vars.pm
warnings 1.06 /usr/share/perl5/warnings.pm
warnings::register 1.01 /usr/share/perl5/warnings/register.pm
OK DONE
12:54 PM [root@storage01 ~]

On Tue, Sep 15, 2015 at 12:38 PM, Victor Efimov notifications@github.com
wrote:

Hello.

This happens about once every dozen or so retrieval requests.

Not too rare and nobody else noticed that.

Could this be hardware issue (memory, network)? Did you test memory, did
you tried different machine?

Anyway, could you pls give

your OS, distro and version

perl -V (notice capital "V")

mtglacier version

mtglacier --version output

—
Reply to this email directly or view it on GitHub
#115 (comment)
.

Tim Irvin
NetTempo, Inc.
130 Battery St., Suite 500
San Francisco, CA 94111
+1 415 992-4902 voice/fax
http://www.nettempo.com

vsespb · 2015-09-15T20:03:04Z

I already tested/developed with that version of perl. Don't see anything suspicious in module versions too. Let's wait memory test.

irvintim · 2015-09-16T19:58:17Z

Hi Victor,

Well, no joy on the memory check. The RAM checked out fine on the server.

I have a theory, see if you think there could be any correlation.

The machine that is doing the restores is also pushing new content to
glacier -- so there can be multiple mtglacer commands running at the same
time, they would be operating on different journal files however.

What we do when we schedule something for restore is create a new journal
file with just the relevant "CREATED" line out of journal.log into a new
file, and then run:

mtglacier restore .....

using the new journal file.

Then after 4 hours we run:

mtglacier restore-completed ....

on that journal file to download the archive. If the file fails to be
placed on the disk when this command has finished running, then the
"mtglacier restore-completed ...." will be run again at the top of the next
hour -- this will repeat for 20 hours total and if it fails to download
then this archive will be placed back in the queue of things to download
and the whole process will start again.

It appears that the failure in downloading occurs mostly (perhaps solely,
but I haven't confirmed that yet) when a different process on the same
machine is uploading archives to glacier at the same time, also using
mtglacier (but to reiterate, it's working off a different journal file).

Tim

On Tue, Sep 15, 2015 at 1:03 PM, Victor Efimov notifications@github.com
wrote:

I already tested/developed with that version of perl. Don't see anything
suspicious in module versions too. Let's wait memory test.

—
Reply to this email directly or view it on GitHub
#115 (comment)
.

Tim Irvin
NetTempo, Inc.
130 Battery St., Suite 500
San Francisco, CA 94111
+1 415 992-4902 voice/fax
http://www.nettempo.com

vsespb · 2015-09-16T20:02:54Z

ok.. let me think..

vsespb · 2015-09-16T21:36:03Z

It appears that the failure in downloading occurs mostly (perhaps solely, but I haven't confirmed that yet) when a different process on the same machine is uploading archives

Would it be true if we say that failure happens when overall machine load is high (i.e. not too much free memory, pretty much of disk activity etc)?

Also - the drive where you donwload files - is this a network drive? What technology?

irvintim · 2015-09-16T22:00:40Z

I think I can eliminate the theory about other mtglacier apps running at
the same time, looking at the logs that job finished around 8:30am today
and I've had more restore failures since then.

Looking at the machine load, it's not very high.

The machine's load average is 0.33
It has 24 cores, noe are over 5% utililization.
The machine has 24 GB or ramand 8GB are in use, with 4GB of that being used
for disk caching.
The data drive used is a LUN from a local JBOD SCSI disk array, via an
Adaptec RAID controller -- it is a stripe of 2 3 TB drives drives into a
RAID0 LUN of 5.5 TB.
The filesystem is XFS.

Here is the current iostat on that partition (5 second intervals):

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 1.60 0.00 14.40 0 72

On Wed, Sep 16, 2015 at 2:36 PM, Victor Efimov notifications@github.com
wrote:

It appears that the failure in downloading occurs mostly (perhaps solely,
but I haven't confirmed that yet) when a different process on the same
machine is uploading archives

Would it be true if we say that failure happens when overall machine load
is high (i.e. not too much free memory, pretty much of disk activity etc)?

Also - the drive where you donwload files - is this a network drive? What
technology?

—
Reply to this email directly or view it on GitHub
#115 (comment)
.

Tim Irvin
NetTempo, Inc.
130 Battery St., Suite 500
San Francisco, CA 94111
+1 415 992-4902 voice/fax
http://www.nettempo.com

vsespb · 2015-09-17T12:26:27Z

I tried simulate the error. I only get this error if data corruption happens.
I simulated like this:

--- a/lib/App/MtAws/GlacierRequest.pm
+++ b/lib/App/MtAws/GlacierRequest.pm
@@ -522,6 +522,7 @@ sub perform_lwp
                                                die exception "unknow_error" => "Unknown error, probably LWP version is too old";
                                        }
                                }
+                               substr($_[0],1,1)='x' if rand() < 0.9;
                                $self->{writer}->add_data($_[0]);
                                1;
                        });

Other errors - like data truncation, wrong HTTP headers, error in creating temp files during download does not produce such error (i.e. all works as expected).

If you use HTTP could you try HTTPS, and if you use HTTPS could you try HTTP ? Let's see how close the bug to networking library.

irvintim · 2015-09-17T17:22:27Z

We use HTTP, I had to install LWP::Protocol::https from CPAN, along with a
bunch of other things, dev tools, Net::SSLeay, gcc, openssl-devel, etc...
in order to switch to HTTPS.

The first test succeeded after switching, I'm running some more test
restores today and will let you know how they go.

Thanks for all the help!

Tim

On Thu, Sep 17, 2015 at 5:26 AM, Victor Efimov notifications@github.com
wrote:

I tried simulate the error. I only get this error if data corruption
happens.
I simulated like this:

--- a/lib/App/MtAws/GlacierRequest.pm
+++ b/lib/App/MtAws/GlacierRequest.pm
@@ -522,6 +522,7 @@ sub perform_lwp
die exception "unknow_error" => "Unknown error, probably LWP version is too old";
}
}
                          substr($_[0],1,1)='x' if rand() < 0.9;
                        $self->{writer}->add_data($_[0]);
                        1;
                });
Other errors - like data truncation, wrong HTTP headers, error in creating
temp files during download does not produce such error (i.e. all works as
expected).

If you use HTTP could you try HTTPS, and if you use HTTPS could you try
HTTP ? Let's see how close the bug to networking library.

—
Reply to this email directly or view it on GitHub
#115 (comment)
.

Tim Irvin
NetTempo, Inc.
130 Battery St., Suite 500
San Francisco, CA 94111
+1 415 992-4902 voice/fax
http://www.nettempo.com

vsespb · 2015-09-17T17:26:01Z

yes, just in case, deploy instructions for centos6 here https://github.com/vsespb/mt-aws-glacier#rhelcentos-6

irvintim · 2015-09-17T18:24:13Z

Thanks. Since we didn't use HTTPS I hadn't gone through those steps
before. But it's all in place now.

On Thu, Sep 17, 2015 at 10:26 AM, Victor Efimov notifications@github.com
wrote:

yes, just in case, deploy instructions for centos6 here
https://github.com/vsespb/mt-aws-glacier#rhelcentos-6

—
Reply to this email directly or view it on GitHub
#115 (comment)
.

Tim Irvin
NetTempo, Inc.
130 Battery St., Suite 500
San Francisco, CA 94111
+1 415 992-4902 voice/fax
http://www.nettempo.com

irvintim · 2015-09-18T01:42:21Z

No errors after 8 restores. What does this tell us? Besides that I'll leave it on HTTPS for now. It would be nice to go back to unencrypted at some point, but this workaround is fine for now.
Thanks again for all your help.
Tim

---Tim IrvinNetTempo, Inc.130 Battery St., Suite 500San Francisco, CA 94111+1-415-992-4902 voice/faxhttp://www.nettempo.com
Sent via the Samsung Galaxy S® 6 edge, an AT&T 4G LTE smartphone-------- Original message --------
From: Victor Efimov notifications@github.com
Date: 09/17/2015 10:26 AM (GMT-08:00)
To: vsespb/mt-aws-glacier mt-aws-glacier@noreply.github.com
Cc: irvintim irvin@nettempo.com
Subject: Re: [mt-aws-glacier] retrieval issues: TreeHash doesn't match (#115)

yes, just in case, deploy instructions for centos6 here https://github.com/vsespb/mt-aws-glacier#rhelcentos-6

—
Reply to this email directly or view it on GitHub.

vsespb · 2015-09-18T07:22:27Z

What does this tell us?

Bug somewhere in old versionsof LWP::UserAgent shipped with Rhel6 (not necessary HTTP-only; when you installed HTTPS you also updated HTTP part).
However bug is 1) intermitent 2) only affect data content, but not data length (no truncation, just damage).
There are a lot of bug reports for LWP but I don't see anything like that. Also I checked diff with new versions - nothing like that too. So I still suspect maybe broken router or proxy on your side.

However maybe I'll install CentOS6 on VM and try to stresstest (I have script for that). Let's leave ticket open for now.

vsespb added the investigation label Sep 15, 2015

vsespb added the bug_in_3rd_party_software label Sep 21, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

retrieval issues: TreeHash doesn't match #115

retrieval issues: TreeHash doesn't match #115

irvintim commented Sep 15, 2015

vsespb commented Sep 15, 2015

irvintim commented Sep 15, 2015

vsespb commented Sep 15, 2015

irvintim commented Sep 16, 2015

vsespb commented Sep 16, 2015

vsespb commented Sep 16, 2015

irvintim commented Sep 16, 2015

vsespb commented Sep 17, 2015

irvintim commented Sep 17, 2015

vsespb commented Sep 17, 2015

irvintim commented Sep 17, 2015

irvintim commented Sep 18, 2015

vsespb commented Sep 18, 2015

retrieval issues: TreeHash doesn't match #115

retrieval issues: TreeHash doesn't match #115

Comments

irvintim commented Sep 15, 2015

vsespb commented Sep 15, 2015

irvintim commented Sep 15, 2015

vsespb commented Sep 15, 2015

irvintim commented Sep 16, 2015

vsespb commented Sep 16, 2015

vsespb commented Sep 16, 2015

irvintim commented Sep 16, 2015

vsespb commented Sep 17, 2015

irvintim commented Sep 17, 2015

vsespb commented Sep 17, 2015

irvintim commented Sep 17, 2015

irvintim commented Sep 18, 2015

vsespb commented Sep 18, 2015