Split manifest.NewBucket into storagemanifest package #2284

bufdev · 2023-07-11T22:31:01Z

This moves manifest.NewBucket into a new storagemanifest package, putting the associated code as a sub-package of storage similar to other Bucket implementations. Along the way, it also cleans up the code to be in line with our other storage packages, does proper validation on Bucket methods, guarantees Walk iteration order (and all ordering operations on a Manifest), and uses pointers in the manner consistent with our style guide.

unmultimedio · 2023-07-11T22:36:31Z

private/pkg/manifest/manifest.go

@@ -136,24 +178,26 @@ func (m *Manifest) AddEntry(path string, digest Digest) error {
 	return nil
 }

-// Paths returns all unique paths in the manifest, order not guaranteed. If you want to iterate the
-// paths and their digests, consider using `Range` instead.
+// Paths returns all unique paths in the manifest by insertion order.


I think not guaranteeing order is actually better, so the callers don't assume we have the blobs in the manifest ordered somehow (which we didn't before this PR). I guess I don't see what this gets us.

This is the same argument used in the protojson package as to why they introduce random white space in their marshalling of Protobuf messages to JSON, and it causes similar issues downstream.

In this case, it caused Walk to not guarantee iteration order, which all the other Bucket implementations do, as the implementer of the manifest Bucket did not take into account the non-deterministic iteration order. There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.

Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.

I've seen examples of code introducing the chaos monkey principle in various parts of the codebase (such as protojson), and more often than not, it causes bugs and doesn't prevent other bugs. In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.

it caused Walk to not guarantee iteration order, which all the other Bucket implementations do

I didn't know that, probably we should add it to the Walk interface func docs, so every other implementer knows this rule.

There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.

The previous Range documentation explicitly said the order was not guaranteed.

Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.

Correct, because that was the only place in which we were interested in sorting the paths, when returning the manifest as a blob, so the whole module's digest would be deterministic. Order adds overhead calculation, even though is not required anywhere else in the implementation.

In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.

You're right, I could see a scenario in which the Walk caller expects the files to be walked in a deterministic order, even though that is not explicitly stated in the docs, which is why we didn't consider it as a bug. I still don't think it is, but probably it could be seen as a subpar experience using the pkg. Tbh, in our usages of walking this manifest bucket, we hadn't seen the need of a deterministic order.

I think there's a bit of philosophy on this, but I tend to default to making iteration deterministic if possible, and it's for this reason - engineers will just assume it is unless you tell them otherwise. You're totally right that it's not documented in Walk, but had Walk not been done deterministically, I guarantee you there'd be downstream bugs in users of it.

unmultimedio · 2023-07-11T22:40:14Z

private/pkg/manifest/manifest.go

+			// This should be an error in the style of the rest of the codebase but
+			// this was refactored and we didn't want to change the function signature.
+			panic(fmt.Sprintf("path %q not present in pathToDigest", path))


I disagree. Asking a manifest for it's digests array is a function that should never fail nor panic. The way we're iterating over the paths forces us to check an error that should never happen, and that's why we iterated over pathToDigest.

For there not to be a key here, there needs to be a bug within AddEntry that is on a similar level to a panic - you'd have to seriously mess something up for there to be a value in paths but not pathToDigest. Even so, yes I usually return an error in this case, but I didn't want to put even more into this refactor. This isn't why the iteration was over pathToDigest, it was because nothing like paths existed.

unmultimedio · 2023-07-11T22:46:08Z

private/pkg/storage/storagemanifest/bucket.go

+	// TODO: why is this validation in newBucket, why is this not somewhere else?
+	// This should not be in storagemanifest.


This was to allow partial manifests buckets, so we could build a bucket with just some of the blobs instead of all of them. I don't recall if we're actually using it somewhere, I can check. I guess we could move these validations where needed, somewhere else before building the bucket, but I don't see the harm, it's a different builder with its custom options, why not offering them as new bucket options?

This validation has nothing to do with bucket construction - it has to do with the <Manifest, BlobSet> pair, which, if they are so tightly correlated in such a way, should probably have a pair type that does such validation.

unmultimedio · 2023-07-11T22:58:05Z

private/pkg/storage/storagemanifest/storagemanifest.go

+func NewReadBucket(
+	m *manifest.Manifest,
+	blobSet *manifest.BlobSet,


These two being structs and not interfaces, I'm really uneasy to receive them as pointers because it forces the func to do nil-checking, instead of the caller making sure it's a value and not a pointer. Also, passing the struct gives peace of mind to both caller and function that we're passing a copy, and no changes to the original data will be made.

I know about the rule of function pointer receivers, but from what I understand, that doesn't apply to parameters.

We do not do this style in Buf's codebases - this was resolved a while ago. Use pointers unless there is a strict reason not to. In this case, you're saying that by derefencing, you are passing a copy of the Manifest, and no changes to the underlying data can be made. This isn't actually what you are doing here - you're passing a reference the the underlying map by value, meaning that i.e. an AddEntry call on manifest.Manifest will still result in a mutation to the dereferenced *manifest.Manifest you passed to the function.

As a demonstration of this issue, run the following

package main import ( "fmt" "sort" ) func main() { foo := NewFoo() AddToDereferencedFoo(*foo, "one") AddToDereferencedFoo(*foo, "two") fmt.Println(foo.Bars()) } func AddToDereferencedFoo(foo Foo, bar string) { foo.AddBar(bar) } type Foo struct { bars map[string]struct{} } func NewFoo() *Foo { return &Foo{ bars: make(map[string]struct{}), } } func (f *Foo) AddBar(bar string) { f.bars[bar] = struct{}{} } func (f *Foo) Bars() []string { bars := make([]string, len(f.bars)) for bar := range f.bars { bars = append(bars, bar) } sort.Strings(bars) return bars }

You'll see that it prints out both "one" and "two".

It's this reason (among many others) that as an organization, we use pointers unless there is a strict reason not to - subtle bugs get introduced where people are relying on pass-by-value semantics an but when there's actually pass-by-reference semantics under the hood, and developers do not fully consider the semantics, and get a sense of safety when they shouldn't.

If you want immutability to be a property, have an interface that doesn't expose the mutable methods.

Use pointers unless there is a strict reason not to.

I understand that this applies to function receivers, not function parameters. But I see your underlying point, even passing them as values, inner maps are being passed as reference.

That's a very interesting finding btw, just tested and confirmed if you created a bucket from manifest and blobs, and then afterwards you add a new entry to the original manifest, the bucket will recognize the path and will try to find the associated blob.

Probably we'll want to to another small refactor to have a builder that returns immutable buckets/manifests/blobs.

I'd probably refactor Manifest to be an interface, but I am not advocating that we need to do that. But yes of note.

unmultimedio

Thanks for the change of package for the storage related code, it lives more appropriately there. I'm still on the fence about the panics.

unmultimedio · 2023-07-12T15:31:05Z

private/pkg/manifest/manifest.go

@@ -136,24 +178,26 @@ func (m *Manifest) AddEntry(path string, digest Digest) error {
 	return nil
 }

-// Paths returns all unique paths in the manifest, order not guaranteed. If you want to iterate the
-// paths and their digests, consider using `Range` instead.
+// Paths returns all unique paths in the manifest by insertion order.


it caused Walk to not guarantee iteration order, which all the other Bucket implementations do

I didn't know that, probably we should add it to the Walk interface func docs, so every other implementer knows this rule.

There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.

The previous Range documentation explicitly said the order was not guaranteed.

Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.

Correct, because that was the only place in which we were interested in sorting the paths, when returning the manifest as a blob, so the whole module's digest would be deterministic. Order adds overhead calculation, even though is not required anywhere else in the implementation.

In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.

You're right, I could see a scenario in which the Walk caller expects the files to be walked in a deterministic order, even though that is not explicitly stated in the docs, which is why we didn't consider it as a bug. I still don't think it is, but probably it could be seen as a subpar experience using the pkg. Tbh, in our usages of walking this manifest bucket, we hadn't seen the need of a deterministic order.

unmultimedio · 2023-07-12T15:53:45Z

private/pkg/storage/storagemanifest/storagemanifest.go

+func NewReadBucket(
+	m *manifest.Manifest,
+	blobSet *manifest.BlobSet,


Use pointers unless there is a strict reason not to.

I understand that this applies to function receivers, not function parameters. But I see your underlying point, even passing them as values, inner maps are being passed as reference.

That's a very interesting finding btw, just tested and confirmed if you created a bucket from manifest and blobs, and then afterwards you add a new entry to the original manifest, the bucket will recognize the path and will try to find the associated blob.

Probably we'll want to to another small refactor to have a builder that returns immutable buckets/manifests/blobs.

commit

81d06cc

bufdev requested review from pkwarren and unmultimedio July 11, 2023 22:31

unmultimedio reviewed Jul 11, 2023

View reviewed changes

unmultimedio approved these changes Jul 12, 2023

View reviewed changes

bufdev merged commit 2ce1126 into main Jul 12, 2023
6 checks passed

bufdev deleted the fix-manifest-storage branch July 12, 2023 20:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split manifest.NewBucket into storagemanifest package #2284

Split manifest.NewBucket into storagemanifest package #2284

bufdev commented Jul 11, 2023

unmultimedio Jul 11, 2023

bufdev Jul 11, 2023

unmultimedio Jul 12, 2023

bufdev Jul 12, 2023

unmultimedio Jul 11, 2023

bufdev Jul 11, 2023

unmultimedio Jul 11, 2023

bufdev Jul 11, 2023

unmultimedio Jul 11, 2023

bufdev Jul 11, 2023

unmultimedio Jul 12, 2023 •

edited

Loading

bufdev Jul 12, 2023

unmultimedio left a comment

unmultimedio Jul 12, 2023

unmultimedio Jul 12, 2023 •

edited

Loading

		// TODO: why is this validation in newBucket, why is this not somewhere else?
		// This should not be in storagemanifest.

Split manifest.NewBucket into storagemanifest package #2284

Split manifest.NewBucket into storagemanifest package #2284

Conversation

bufdev commented Jul 11, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unmultimedio Jul 12, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unmultimedio left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unmultimedio Jul 12, 2023 • edited Loading

Choose a reason for hiding this comment

unmultimedio Jul 12, 2023 •

edited

Loading

unmultimedio Jul 12, 2023 •

edited

Loading