Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split manifest.NewBucket into storagemanifest package #2284

Merged
merged 1 commit into from
Jul 12, 2023

Conversation

bufdev
Copy link
Member

@bufdev bufdev commented Jul 11, 2023

This moves manifest.NewBucket into a new storagemanifest package, putting the associated code as a sub-package of storage similar to other Bucket implementations. Along the way, it also cleans up the code to be in line with our other storage packages, does proper validation on Bucket methods, guarantees Walk iteration order (and all ordering operations on a Manifest), and uses pointers in the manner consistent with our style guide.

@@ -136,24 +178,26 @@ func (m *Manifest) AddEntry(path string, digest Digest) error {
return nil
}

// Paths returns all unique paths in the manifest, order not guaranteed. If you want to iterate the
// paths and their digests, consider using `Range` instead.
// Paths returns all unique paths in the manifest by insertion order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not guaranteeing order is actually better, so the callers don't assume we have the blobs in the manifest ordered somehow (which we didn't before this PR). I guess I don't see what this gets us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same argument used in the protojson package as to why they introduce random white space in their marshalling of Protobuf messages to JSON, and it causes similar issues downstream.

In this case, it caused Walk to not guarantee iteration order, which all the other Bucket implementations do, as the implementer of the manifest Bucket did not take into account the non-deterministic iteration order. There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.

Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.

I've seen examples of code introducing the chaos monkey principle in various parts of the codebase (such as protojson), and more often than not, it causes bugs and doesn't prevent other bugs. In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it caused Walk to not guarantee iteration order, which all the other Bucket implementations do

I didn't know that, probably we should add it to the Walk interface func docs, so every other implementer knows this rule.

There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.

The previous Range documentation explicitly said the order was not guaranteed.

Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.

Correct, because that was the only place in which we were interested in sorting the paths, when returning the manifest as a blob, so the whole module's digest would be deterministic. Order adds overhead calculation, even though is not required anywhere else in the implementation.

In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.

You're right, I could see a scenario in which the Walk caller expects the files to be walked in a deterministic order, even though that is not explicitly stated in the docs, which is why we didn't consider it as a bug. I still don't think it is, but probably it could be seen as a subpar experience using the pkg. Tbh, in our usages of walking this manifest bucket, we hadn't seen the need of a deterministic order.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bit of philosophy on this, but I tend to default to making iteration deterministic if possible, and it's for this reason - engineers will just assume it is unless you tell them otherwise. You're totally right that it's not documented in Walk, but had Walk not been done deterministically, I guarantee you there'd be downstream bugs in users of it.

Comment on lines +197 to +199
// This should be an error in the style of the rest of the codebase but
// this was refactored and we didn't want to change the function signature.
panic(fmt.Sprintf("path %q not present in pathToDigest", path))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. Asking a manifest for it's digests array is a function that should never fail nor panic. The way we're iterating over the paths forces us to check an error that should never happen, and that's why we iterated over pathToDigest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For there not to be a key here, there needs to be a bug within AddEntry that is on a similar level to a panic - you'd have to seriously mess something up for there to be a value in paths but not pathToDigest. Even so, yes I usually return an error in this case, but I didn't want to put even more into this refactor. This isn't why the iteration was over pathToDigest, it was because nothing like paths existed.

Comment on lines +41 to +42
// TODO: why is this validation in newBucket, why is this not somewhere else?
// This should not be in storagemanifest.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to allow partial manifests buckets, so we could build a bucket with just some of the blobs instead of all of them. I don't recall if we're actually using it somewhere, I can check. I guess we could move these validations where needed, somewhere else before building the bucket, but I don't see the harm, it's a different builder with its custom options, why not offering them as new bucket options?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation has nothing to do with bucket construction - it has to do with the <Manifest, BlobSet> pair, which, if they are so tightly correlated in such a way, should probably have a pair type that does such validation.

Comment on lines +24 to +26
func NewReadBucket(
m *manifest.Manifest,
blobSet *manifest.BlobSet,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two being structs and not interfaces, I'm really uneasy to receive them as pointers because it forces the func to do nil-checking, instead of the caller making sure it's a value and not a pointer. Also, passing the struct gives peace of mind to both caller and function that we're passing a copy, and no changes to the original data will be made.

I know about the rule of function pointer receivers, but from what I understand, that doesn't apply to parameters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not do this style in Buf's codebases - this was resolved a while ago. Use pointers unless there is a strict reason not to. In this case, you're saying that by derefencing, you are passing a copy of the Manifest, and no changes to the underlying data can be made. This isn't actually what you are doing here - you're passing a reference the the underlying map by value, meaning that i.e. an AddEntry call on manifest.Manifest will still result in a mutation to the dereferenced *manifest.Manifest you passed to the function.

As a demonstration of this issue, run the following

package main

import (
	"fmt"
	"sort"
)

func main() {
	foo := NewFoo()
	AddToDereferencedFoo(*foo, "one")
	AddToDereferencedFoo(*foo, "two")
	fmt.Println(foo.Bars())
}

func AddToDereferencedFoo(foo Foo, bar string) {
	foo.AddBar(bar)
}

type Foo struct {
	bars map[string]struct{}
}

func NewFoo() *Foo {
	return &Foo{
		bars: make(map[string]struct{}),
	}
}

func (f *Foo) AddBar(bar string) {
	f.bars[bar] = struct{}{}
}

func (f *Foo) Bars() []string {
	bars := make([]string, len(f.bars))
	for bar := range f.bars {
		bars = append(bars, bar)
	}
	sort.Strings(bars)
	return bars
}

You'll see that it prints out both "one" and "two".

It's this reason (among many others) that as an organization, we use pointers unless there is a strict reason not to - subtle bugs get introduced where people are relying on pass-by-value semantics an but when there's actually pass-by-reference semantics under the hood, and developers do not fully consider the semantics, and get a sense of safety when they shouldn't.

If you want immutability to be a property, have an interface that doesn't expose the mutable methods.

Copy link
Member

@unmultimedio unmultimedio Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use pointers unless there is a strict reason not to.

I understand that this applies to function receivers, not function parameters. But I see your underlying point, even passing them as values, inner maps are being passed as reference.

That's a very interesting finding btw, just tested and confirmed if you created a bucket from manifest and blobs, and then afterwards you add a new entry to the original manifest, the bucket will recognize the path and will try to find the associated blob.

Probably we'll want to to another small refactor to have a builder that returns immutable buckets/manifests/blobs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably refactor Manifest to be an interface, but I am not advocating that we need to do that. But yes of note.

Copy link
Member

@unmultimedio unmultimedio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change of package for the storage related code, it lives more appropriately there. I'm still on the fence about the panics.

@@ -136,24 +178,26 @@ func (m *Manifest) AddEntry(path string, digest Digest) error {
return nil
}

// Paths returns all unique paths in the manifest, order not guaranteed. If you want to iterate the
// paths and their digests, consider using `Range` instead.
// Paths returns all unique paths in the manifest by insertion order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it caused Walk to not guarantee iteration order, which all the other Bucket implementations do

I didn't know that, probably we should add it to the Walk interface func docs, so every other implementer knows this rule.

There was also documentation saying "use Range if you want your ordering to be deterministic", however range was no deterministic either.

The previous Range documentation explicitly said the order was not guaranteed.

Even more to the point, one of the majority use cases of Manifests, namely MarshalText, was already guaranteeing ordering implicitly via sort.Strings.

Correct, because that was the only place in which we were interested in sorting the paths, when returning the manifest as a blob, so the whole module's digest would be deterministic. Order adds overhead calculation, even though is not required anywhere else in the implementation.

In this example, it caused an effective bug in Walk, which could cause functions such as file listing to be non-deterministic.

You're right, I could see a scenario in which the Walk caller expects the files to be walked in a deterministic order, even though that is not explicitly stated in the docs, which is why we didn't consider it as a bug. I still don't think it is, but probably it could be seen as a subpar experience using the pkg. Tbh, in our usages of walking this manifest bucket, we hadn't seen the need of a deterministic order.

Comment on lines +24 to +26
func NewReadBucket(
m *manifest.Manifest,
blobSet *manifest.BlobSet,
Copy link
Member

@unmultimedio unmultimedio Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use pointers unless there is a strict reason not to.

I understand that this applies to function receivers, not function parameters. But I see your underlying point, even passing them as values, inner maps are being passed as reference.

That's a very interesting finding btw, just tested and confirmed if you created a bucket from manifest and blobs, and then afterwards you add a new entry to the original manifest, the bucket will recognize the path and will try to find the associated blob.

Probably we'll want to to another small refactor to have a builder that returns immutable buckets/manifests/blobs.

@bufdev bufdev merged commit 2ce1126 into main Jul 12, 2023
6 checks passed
@bufdev bufdev deleted the fix-manifest-storage branch July 12, 2023 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants