-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split manifest.NewBucket into storagemanifest package #2284
Conversation
@@ -136,24 +178,26 @@ func (m *Manifest) AddEntry(path string, digest Digest) error { | |||
return nil | |||
} | |||
|
|||
// Paths returns all unique paths in the manifest, order not guaranteed. If you want to iterate the | |||
// paths and their digests, consider using `Range` instead. | |||
// Paths returns all unique paths in the manifest by insertion order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not guaranteeing order is actually better, so the callers don't assume we have the blobs in the manifest ordered somehow (which we didn't before this PR). I guess I don't see what this gets us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same argument used in the protojson
package as to why they introduce random white space in their marshalling of Protobuf messages to JSON, and it causes similar issues downstream.
In this case, it caused Walk
to not guarantee iteration order, which all the other Bucket
implementations do, as the implementer of the manifest Bucket
did not take into account the non-deterministic iteration order. There was also documentation saying "use Range
if you want your ordering to be deterministic", however range was no deterministic either.
Even more to the point, one of the majority use cases of Manifests
, namely MarshalText
, was already guaranteeing ordering implicitly via sort.Strings
.
I've seen examples of code introducing the chaos monkey principle in various parts of the codebase (such as protojson
), and more often than not, it causes bugs and doesn't prevent other bugs. In this example, it caused an effective bug in Walk
, which could cause functions such as file listing to be non-deterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it caused Walk to not guarantee iteration order, which all the other Bucket implementations do
I didn't know that, probably we should add it to the Walk
interface func docs, so every other implementer knows this rule.
There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.
The previous Range documentation explicitly said the order was not guaranteed.
Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.
Correct, because that was the only place in which we were interested in sorting the paths, when returning the manifest as a blob, so the whole module's digest would be deterministic. Order adds overhead calculation, even though is not required anywhere else in the implementation.
In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.
You're right, I could see a scenario in which the Walk caller expects the files to be walked in a deterministic order, even though that is not explicitly stated in the docs, which is why we didn't consider it as a bug. I still don't think it is, but probably it could be seen as a subpar experience using the pkg. Tbh, in our usages of walking this manifest bucket, we hadn't seen the need of a deterministic order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a bit of philosophy on this, but I tend to default to making iteration deterministic if possible, and it's for this reason - engineers will just assume it is unless you tell them otherwise. You're totally right that it's not documented in Walk
, but had Walk
not been done deterministically, I guarantee you there'd be downstream bugs in users of it.
// This should be an error in the style of the rest of the codebase but | ||
// this was refactored and we didn't want to change the function signature. | ||
panic(fmt.Sprintf("path %q not present in pathToDigest", path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. Asking a manifest for it's digests array is a function that should never fail nor panic. The way we're iterating over the paths forces us to check an error that should never happen, and that's why we iterated over pathToDigest
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For there not to be a key here, there needs to be a bug within AddEntry
that is on a similar level to a panic - you'd have to seriously mess something up for there to be a value in paths
but not pathToDigest
. Even so, yes I usually return an error in this case, but I didn't want to put even more into this refactor. This isn't why the iteration was over pathToDigest
, it was because nothing like paths
existed.
// TODO: why is this validation in newBucket, why is this not somewhere else? | ||
// This should not be in storagemanifest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to allow partial manifests buckets, so we could build a bucket with just some of the blobs instead of all of them. I don't recall if we're actually using it somewhere, I can check. I guess we could move these validations where needed, somewhere else before building the bucket, but I don't see the harm, it's a different builder with its custom options, why not offering them as new bucket options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This validation has nothing to do with bucket construction - it has to do with the <Manifest, BlobSet>
pair, which, if they are so tightly correlated in such a way, should probably have a pair type that does such validation.
func NewReadBucket( | ||
m *manifest.Manifest, | ||
blobSet *manifest.BlobSet, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two being structs and not interfaces, I'm really uneasy to receive them as pointers because it forces the func to do nil-checking, instead of the caller making sure it's a value and not a pointer. Also, passing the struct gives peace of mind to both caller and function that we're passing a copy, and no changes to the original data will be made.
I know about the rule of function pointer receivers, but from what I understand, that doesn't apply to parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not do this style in Buf's codebases - this was resolved a while ago. Use pointers unless there is a strict reason not to. In this case, you're saying that by derefencing, you are passing a copy of the Manifest
, and no changes to the underlying data can be made. This isn't actually what you are doing here - you're passing a reference the the underlying map by value, meaning that i.e. an AddEntry
call on manifest.Manifest
will still result in a mutation to the dereferenced *manifest.Manifest
you passed to the function.
As a demonstration of this issue, run the following
package main
import (
"fmt"
"sort"
)
func main() {
foo := NewFoo()
AddToDereferencedFoo(*foo, "one")
AddToDereferencedFoo(*foo, "two")
fmt.Println(foo.Bars())
}
func AddToDereferencedFoo(foo Foo, bar string) {
foo.AddBar(bar)
}
type Foo struct {
bars map[string]struct{}
}
func NewFoo() *Foo {
return &Foo{
bars: make(map[string]struct{}),
}
}
func (f *Foo) AddBar(bar string) {
f.bars[bar] = struct{}{}
}
func (f *Foo) Bars() []string {
bars := make([]string, len(f.bars))
for bar := range f.bars {
bars = append(bars, bar)
}
sort.Strings(bars)
return bars
}
You'll see that it prints out both "one" and "two".
It's this reason (among many others) that as an organization, we use pointers unless there is a strict reason not to - subtle bugs get introduced where people are relying on pass-by-value semantics an but when there's actually pass-by-reference semantics under the hood, and developers do not fully consider the semantics, and get a sense of safety when they shouldn't.
If you want immutability to be a property, have an interface that doesn't expose the mutable methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use pointers unless there is a strict reason not to.
I understand that this applies to function receivers, not function parameters. But I see your underlying point, even passing them as values, inner maps are being passed as reference.
That's a very interesting finding btw, just tested and confirmed if you created a bucket from manifest and blobs, and then afterwards you add a new entry to the original manifest, the bucket will recognize the path and will try to find the associated blob.
Probably we'll want to to another small refactor to have a builder that returns immutable buckets/manifests/blobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably refactor Manifest
to be an interface, but I am not advocating that we need to do that. But yes of note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change of package for the storage related code, it lives more appropriately there. I'm still on the fence about the panics.
@@ -136,24 +178,26 @@ func (m *Manifest) AddEntry(path string, digest Digest) error { | |||
return nil | |||
} | |||
|
|||
// Paths returns all unique paths in the manifest, order not guaranteed. If you want to iterate the | |||
// paths and their digests, consider using `Range` instead. | |||
// Paths returns all unique paths in the manifest by insertion order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it caused Walk to not guarantee iteration order, which all the other Bucket implementations do
I didn't know that, probably we should add it to the Walk
interface func docs, so every other implementer knows this rule.
There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.
The previous Range documentation explicitly said the order was not guaranteed.
Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.
Correct, because that was the only place in which we were interested in sorting the paths, when returning the manifest as a blob, so the whole module's digest would be deterministic. Order adds overhead calculation, even though is not required anywhere else in the implementation.
In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.
You're right, I could see a scenario in which the Walk caller expects the files to be walked in a deterministic order, even though that is not explicitly stated in the docs, which is why we didn't consider it as a bug. I still don't think it is, but probably it could be seen as a subpar experience using the pkg. Tbh, in our usages of walking this manifest bucket, we hadn't seen the need of a deterministic order.
func NewReadBucket( | ||
m *manifest.Manifest, | ||
blobSet *manifest.BlobSet, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use pointers unless there is a strict reason not to.
I understand that this applies to function receivers, not function parameters. But I see your underlying point, even passing them as values, inner maps are being passed as reference.
That's a very interesting finding btw, just tested and confirmed if you created a bucket from manifest and blobs, and then afterwards you add a new entry to the original manifest, the bucket will recognize the path and will try to find the associated blob.
Probably we'll want to to another small refactor to have a builder that returns immutable buckets/manifests/blobs.
This moves
manifest.NewBucket
into a newstoragemanifest
package, putting the associated code as a sub-package ofstorage
similar to otherBucket
implementations. Along the way, it also cleans up the code to be in line with our other storage packages, does proper validation onBucket
methods, guaranteesWalk
iteration order (and all ordering operations on aManifest
), and uses pointers in the manner consistent with our style guide.