Question on full PGA integrity verification #53
@bzz It must be 3TB, not 2.4. Either something went wrong during the download or the index misses some repos. We measured 3TB from our local HDFS copy. It is critical to find this out a few weeks before the paper presentation...
We could add a … Would that help, @bzz?
@vmarkovtsev Or there may have just been network issues while downloading on my side. Sounds like a great idea, @campoy! So, the approach would be to add … If we agree that it's the best approach, I'll be happy to look into submitting a patch for this early next week (after getting back to :flag-es: from a short vacation).
It seems that #58 should make …
Ok, so I've been thinking about this and I found a problem.

**Current limitation**

We do not keep track of how the list of downloaded …

**Side effect of the limitation**

So all we can tell is whether the files we have in a directory are valid or not. Say I download all repositories under my user name … When I do …

**Possible solutions**

These are a couple of possible improvements to Public Git Archive and …
An alternative to this is to simply read all the …
Unfortunately, this doesn't solve the main problem of a new repository being added to Public Git Archive, but it is a prerequisite for the following improvement.
The main idea is to keep the … We would somehow (format TBD) store those filters in some file in the destination directory.

**Personal opinion**

While I think providing the feature described in … What do you all think?
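The "store the filters in a file in the destination directory" idea could be sketched roughly like this (a Python illustration; the `.pga-manifest.json` file name and JSON layout are my own invention, not anything pga implements):

```python
import json
import os

def write_manifest(dest_dir, query, files):
    """Record which selection query produced this download, plus the
    name and size of every fetched file, so a later `status`-style
    check can distinguish "missing" from "never selected".
    The manifest name and layout here are illustrative only."""
    entries = [{"name": os.path.relpath(p, dest_dir),
                "size": os.path.getsize(p)}
               for p in sorted(files)]
    with open(os.path.join(dest_dir, ".pga-manifest.json"), "w") as f:
        json.dump({"query": query, "files": entries}, f, indent=2)
```

A checksum per file could be added to each entry later without changing the overall shape.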
@campoy I have the impression that we are reinventing a huge wheel here, but I cannot name any particular prior art. I note that the BitTorrent protocol could be handy here: same speed as HTTPS, optional offloading to other mirrors, checksum checks, partial downloads, tracking of the origin. It does not solve the problem of saving the selector query, though. This is getting really serious; I agree that it seems a bit too much at this point.
I think that for now a simpler solution can still be useful: a … That means that … In the case of …
I would assume that a different version of PGA would have a different index file, so status could accept a particular "version", or however generations of updates to PGA are called. I have put together a very simple … I agree that any solution more complex than that sounds like over-engineering at this point of the project.
Sounds interesting, but I would suggest keeping the discussion on …
HDFS, as part of its protocol, only provides an MD5 of rolling 512-byte CRC32s :/ so there is no way to get the MD5 sum of the whole file without streaming it to the client. Right now …
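For reference, a whole-file MD5 therefore has to be computed by streaming the content to the client in chunks, along these lines (a generic Python sketch, not pga code):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Whole-file MD5 computed incrementally over 1 MiB chunks, so the
    file is streamed rather than loaded into memory. HDFS itself only
    exposes an MD5-of-rolling-CRC32s checksum, not this value."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```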
Yeah, I saw that HDFS didn't provide MD5, so I decided to implement it by reading the whole file. The whole … If the download fails for any reason, they might need to re-download the whole dataset, though. But I'm thinking it might be a problem worth having in exchange for not getting into the business of building downloaders.
There seem to be problems with identifying the version of the downloaded dataset. The dataset should probably be versioned as a whole, and it would be a good idea that … For local downloads, maybe we want to provide … For any of the methods, I think we should not download files in place (#39), but use a temporary file and move it later. This ensures (or makes it very unlikely) that corrupt files are never present in the final dataset: each file should be either present and correct, or absent. By the way, this is what rsync and HDFS do.
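The temporary-file-then-rename approach can be sketched as follows (a Python illustration; the `write_atomic` helper is hypothetical, and the atomicity of `os.replace` assumes the temp file and the destination live on the same filesystem):

```python
import os
import shutil
import tempfile

def write_atomic(src, dest_path):
    """Stream src (any binary file-like object) into a temp file in the
    same directory, then rename it into place. Because os.replace is
    atomic when both paths share a filesystem, dest_path is either
    absent or complete -- never a truncated partial download."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out:
            shutil.copyfileobj(src, out)
        os.replace(tmp, dest_path)  # publish atomically
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # never leave a stray partial file behind
        raise
```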
I agree with using …
@campoy I think @rporres uses it for other internal stuff, so there should be no problem server-side. We'd probably need to get the …
You can use rsync over ssh to get files from the pga server. No problem at all.
In some previous discussions I've said that BitTorrent might not be able to handle this number of files, but after a quick test, I think it actually can.
Ok, second pass of …
I made the mistake of logging STDERR and STDOUT at the same time, which makes the logs unreadable, as it all looks like this
The command I used was: … but
which usually means a file has already been downloaded before
but it seems there should be many more :/
Will run it again to capture only STDERR.
@bzz I would collect the list of file names with sizes and compare it to the list retrieved from the server (you can ask Rafa to run any listing command on the server).
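That name-plus-size comparison could look roughly like this (a hypothetical Python helper, not an existing tool):

```python
def diff_listings(local, remote):
    """Compare {name: size} maps built from the local copy and from a
    server-side listing. Returns (missing, extra, size_mismatch):
    enough to spot both files dropped by a crashed download and
    truncated transfers."""
    missing = sorted(set(remote) - set(local))    # on server, not here
    extra = sorted(set(local) - set(remote))      # here, not on server
    mismatch = sorted(n for n in set(local) & set(remote)
                      if local[n] != remote[n])   # size differs
    return missing, extra, mismatch
```

The two maps can be produced from `hdfs dfs -ls -R` output locally and any `ls`-style listing on the server side.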
@vmarkovtsev Will do that. Meanwhile, I have updated the …
😕 Is that something expected, or am I doing something stupid? What do you guys think?
The number of lines in the index matches, and the number of siva files should be around 270k. This means 30k were not indexed, which is very, very bad.
So before moving forward, we need to index the siva files which were discarded. |
Ok, the .siva count of the index above is wrong; here are the updated numbers
@vmarkovtsev One more try 🥇 and now it seems that, actually, all the files from the index were downloaded on the second … Should we document this result as the current "best practice" for a first approximation of download integrity verification? As a user, I would very much prefer something more automated that also includes some archive integrity verification, etc.
@bzz This is great news! I am so happy you failed to download them two times and this is not an indexing issue! |
I keep on thinking about having this dataset in a git repository. If anyone in engineering wants to play with the idea, that'd be awesome! |
@mcuadros also proposed something similar in the past |
From the logs of the 3rd …
After another run of full PGA download:
More details in #69 (comment) |
Early next week I will try to 🔥 https://github.com/smola/checksum-spark by @smola and report which of the 2 runs is more complete
@bzz I ran it yesterday; there are generated checksums for the …
I've verified that … It seems part of the missing files are actually not in the index, so it is expected that they are not in the downloaded copy.
🎉 🙇 to #69
This was something @vmarkovtsev was worried about. Are all 42662 missing from the index? |
@bzz Not all of them are missing from the index; my guess is that part of them were missing just because of a crash in the download process?
I have downloaded the full PGA to HDFS using

```
pga get -v ...
```

and have full logs of the process. It took a day, eventually finished, and I can see that it's …

Question: how do I make sure that nothing is missing?

What would be the simplest way to verify consistency and completeness of the results of `pga get`? Options from the top of my head include

```
hdfs dfs -ls -R hdfs://hdfs-namenode/pga/ | grep "\.siva$" | wc -l
```

= 239807, and

```
pga list | wc -l
```

= 181481, but that is rooted repositories VS actual repositories :/

Would appreciate any recommendations.
@campoy I guess this might be something that is worth documenting eventually, as other users might have the same question. Would be happy to submit a PR.