Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use MD5 hashes when available #58

Merged
merged 1 commit into from
May 16, 2018
Merged

use MD5 hashes when available #58

merged 1 commit into from
May 16, 2018

Conversation

campoy
Copy link
Contributor

@campoy campoy commented May 10, 2018

Fixes #36
Fixes #37

Signed-off-by: Francesc Campoy campoy@golang.org

Signed-off-by: Francesc Campoy <campoy@golang.org>
@campoy campoy requested a review from mcuadros May 10, 2018 20:18
@campoy campoy added this to the pga-release milestone May 10, 2018
Copy link
Collaborator

@vmarkovtsev vmarkovtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know what's going on here but I like how it looks like.

@bzz
Copy link
Contributor

bzz commented May 14, 2018

Nice improvement @campoy !

I have tried runnning it with the corrupted (partial) and here is an output

$ go build -o pga.md5
$ ./pga.md5 list | wc -l

WARN[0000] could not check md5 hashes for latest.csv.gz, comparing timestamps instead: could not fetch hash at latest.csv.gz.md5: 404 Not Found
Error: unexpected EOF
Usage:
  pga list [flags]

Flags:
  -f, --format string      format of the output (url, csv, or json) (default "url")
  -h, --help               help for list
  -l, --lang stringSlice   list of languages that the repositories should have
  -u, --url string         regular expression that repo urls need to match

Global Flags:
  -v, --verbose   log more information

10589

I find it a bit confusing, although #57 should help, but still may be something can be improved here?
I.e catching and printing "Most probably you have a corrupt index file in ~/.pga/" instead of printing a generic CLI help, together with Error: unexpected EOF ?

Here is FS stat

stat ~/.pga/latest.csv.gz
  File: /Users/alex/.pga/latest.csv.gz
  Size: 1573649   	Blocks: 3080       IO Block: 4096   regular file
Device: 1000004h/16777220d	Inode: 23927156    Links: 1
Access: (0644/-rw-r--r--)  Uid: (  501/    alex)   Gid: (   20/   staff)
Access: 2018-05-14 11:35:50.000000000 +0200
Modify: 2018-05-14 11:16:57.000000000 +0200
Change: 2018-05-14 11:16:57.000000000 +0200
Birth: 2018-03-27 07:20:58.000000000 +0200

}
logrus.Warnf("could not check md5 hashes for %s, comparing timestamps instead: %v", name, err)

localTime, err := dest.ModTime(name)
Copy link
Contributor

@bzz bzz May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wandering, what is the reason to use ModTime in local FS?

As posted in #58 (comment) - it seems that modification time has been updated on my system for some reason (did not do anything with that file), although the index itself is old (and corrupted).

Copy link
Contributor

@bzz bzz May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent time educating myself about golang/go#3952, POSIX-compatibility and platform-dependent os.FileInfo.Sys(), which is even supported in HDFS but seems not to provide a creation time (for the same reasons HADOOP-1377).

This makes me wonder, if modification time should work fine in this case, but it just gets updated somewhere, where it should not..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the mod time only for file system types (or files) that do not support MD5 hashes.
The main goal is to abstract the behavior so we can add other file system backends easily later on.


remoteTime, err := source.ModTime(name)
if err != nil {
logrus.Warnf("could not check mod time in %s: %v", dest.Abs(name), err)
Copy link
Contributor

@bzz bzz May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if first-time user experience with pga get will be optimal or these 2 logs would be a bit too noisy, what do you think?

As there are no files yet, it will result in #of-siva-files x 2 warnings printed in CLI and that might not be something expected or obviously-safe-to-ignore for the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.
I think it's fine to show that we're downloading stuff, specially for slow connections, since otherwise it might seem like the program is stuck without doing anything.

Copy link
Contributor

@bzz bzz May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will result in #of-siva-files x 2

I think it's fine to show that we're

A bit confused, am I right, reading it that you mean that first-time user experience is fine as it is now?

@@ -66,11 +87,8 @@ func getIndex() (io.ReadCloser, error) {
dest := localFS(filepath.Join(usr.HomeDir, ".pga"))
source := urlFS(indexURL)

ok, err := copy(dest, source, indexName)
if !ok {
if err := updateCache(dest, source, indexName); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the experience running a workshop with people new to src-d stack, one common pitfall I noticed (2 out of 10 new users) is that after running pga for the first time, users do not realize it's actually downloading an index, as there is no CLI feedback and it takes some time - and just perceive it as hang and do Ctrl+C.

On the next run, pga list does not work for them, as local index is corrupted.

Ofcourse having index md5 will catch that later in #57 , but while we are here, it could be easy to mitigate this type of failure by just printing some CLI output like "Index downloading in progress, please wait."

For sure, this could be done in subesequent PRs, but with the design proposed here updateCache already has debug logging built-in but it is used for both, index and .siva files download, while the CLI output is desired only for the index file.

Wondering what would be the best way to do that? Pass isIndex to updateCache and conditionally logrus.Info from there or is there a better way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, that's exactly what the comment above does, right?
In case where the index is non existent or not up to date we warn that we're downloading a new version.

Copy link
Contributor

@bzz bzz May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, that's exactly what the comment above does, right?

only after #57
But for now I think it's still prone to the failure described above.

func copy(dest, source FileSystem, name string) (bool, error) {
// updateCache checks whether a new version of the file in url exists and downloads it
// to dest. It returns an error when it was not possible to update it.
func updateCache(dest, source FileSystem, name string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is naming like copyMaybe a thing in Golang? Might come handy explaining this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyMaybe doesn't sound idiomatic, I've never seen it before.
I think updateCache is pretty clear, but I'm happy to modify it if you think it's not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you! It's just a common naming convention in many programming languages that origins from functional programming, to signify a function that have conditional side-effects.

See examples in:

Quick search over stdlib shows some hits in both public and private APIs:

@campoy
Copy link
Contributor Author

campoy commented May 14, 2018

@bzz: regarding the failure when the index is corrupt there's probably some work to improve the usability.
Could you file an issue specifically for that case?

Thanks for the thorough review!

@campoy campoy merged commit 77d8f67 into master May 16, 2018
@campoy campoy deleted the md5 branch May 16, 2018 00:02
@bzz
Copy link
Contributor

bzz commented May 17, 2018

regarding the failure when the index is corrupt

I think #57 would avoid that. Suggestion in #58 (comment) was about adding user feedback on index file downloading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants