Skip to content

Conversation

@alansy-nscale
Copy link
Collaborator

Description

This PR improves error reporting and introduces general code enhancements across the codebase. Error responses now include additional metadata in both the JSON body and HTTP headers, giving clients more context. The error codes have been completely redesigned, including new codes such as token_expired and quota_exhausted, to support automatic token refresh and clearer handling of resource limits. A helper method has also been added to extract the extended metadata into the Error struct, making it easier for clients to work with errors programmatically. In addition, general refactoring has been applied to improve code consistency across the codebase.

@squaremo
Copy link
Contributor

Going on the PR comment alone to start with: please make separate commits for

  • introducing the new machinery and any tests that go with it;
  • adding generic error codes;
  • porting existing code to the new machinery (can reasonably be in the same change as the previous if they go together);
  • refactors (before the other commits, if it doesn't depend on the new machinery; otherwise after)

@alansy-nscale alansy-nscale force-pushed the alansy/inst-45 branch 2 times, most recently from 40c396b to b5f0a7d Compare November 19, 2025 12:29
Copy link
Contributor

@spjmurray spjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some fairly high level musings to consider.

type: string
enum:
# Defined by OAuth 2.02
# Common error codes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common across what? It's in the oauth2.1 spec... why not leave as is. The other problem here is reordering things unnecessarily which makes it incredibly hard to work out if anything has been added or removed. What's the justification?


// Prefixed adds the given Prefix to the underlying Cause of an error.
// It should only be called at the end of the error construction chain.
func (e *Error) Prefixed() *Error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eww :D Can we not make the errors themselves self referential, so we can maintain some semblance of structure e.g.

error: foo
error_description: bar
cause:
  error: foo
  error_description: bar
  cause:
    error: foo
    error_description: bar

Then you can communicate error wrapping across APIs, rather than having a string and this naturally lends itself to "was the root cause blah?" and you can potentially do more graceful error handling.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not self-referential. It just calls fmt.Errorf() to wrap the root cause with an error prefix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package main

import (
	"errors"
	"fmt"
	"io"

	errorsv2 "github.com/unikorn-cloud/core/pkg/server/v2/errors"
)

func main() {
	var err error

	err = errorsv2.NewInternalError().
		WithCause(io.EOF).
		Prefixed()

	fmt.Println(err) // internal: EOF

	err = errorsv2.NewConflictError().
		WithCausef("something went wrong with another layer: %w", err).
		Prefixed()

	fmt.Println(err)                    // conflict: something went wrong with another layer: internal: EOF
	fmt.Println(errors.Is(err, io.EOF)) // true
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the point is if identity errors, then it's passed to region, then it's passed back to compute, then you are left with a pretty useless string with no real programmatic parental relationships. I'm more interested about crossing service boundaries.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't errors be localised within the service? If a specific error needs to be handled programmatically, the service can provide a unique error code that the caller can use to respond correspondingly.

@codecov
Copy link

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 0.34247% with 291 lines in your changes missing coverage. Please review.
✅ Project coverage is 25.42%. Comparing base (d2484bd) to head (aebf26c).
⚠️ Report is 33 commits behind head on main.

Files with missing lines Patch % Lines
pkg/server/v2/errors/errors.go 0.00% 145 Missing ⚠️
pkg/server/v2/httputil/httputil.go 0.00% 69 Missing ⚠️
pkg/openapi/helpers.go 0.00% 34 Missing ⚠️
pkg/server/handler/error.go 0.00% 12 Missing ⚠️
pkg/server/util/rbac.go 0.00% 12 Missing ⚠️
pkg/server/conversion/conversion.go 12.50% 7 Missing ⚠️
pkg/server/middleware/cors/cors.go 0.00% 6 Missing ⚠️
pkg/server/util/tags.go 0.00% 5 Missing ⚠️
pkg/messaging/consumer/cascadingdelete.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #202      +/-   ##
==========================================
- Coverage   27.76%   25.42%   -2.35%     
==========================================
  Files          63       68       +5     
  Lines        4337     3808     -529     
==========================================
- Hits         1204      968     -236     
+ Misses       3044     2749     -295     
- Partials       89       91       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@spjmurray spjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are some good ideas in here, especially the idea of a backtrace, but there are some concerns, feel free to chime in too @squaremo.

API errors

These have two purposes:

  • Tell the user something went wrong, and what remedial action can be taken
    • This is why I like the backtraces, service A can say here's the fix, and service B can augment it with more context if necessary or just relay it verbatim back to the user
    • Service B can actually use the information from the backtrace to potentially do some graceful handling as it's got some form of type.
  • Tell the user they can't do anything and to contact support with a w3c trace ID.

Security

Error messages are deliberately vague. This is by design. The last thing we want is to say "you tried to access resource A but that failed, it's owned by acme.com, contact owner@acme.com". Likewise we shouldn't be relaying things like, "I had an error accessing service.cloud.com". Both of these examples are exactly what a hacker would use to probe the system for users, cloud services, resources in general and use them for granting access.

Debugging

Back to the point on w3c trace IDs. Where we need to see more detail to debug something, we have OpenTelemetry that joins everything up into a trace across services.

Next Steps

I got a little concerned when I saw errors v2. Presumably that means you are going to migrate everything over piecemeal, remove v1 and rename everything back to just plain errors again.

So what I would like to see is:

  • A well thought out specification that takes into account everything I have discussed above.
  • Why the current version doesn't fully implement the specification.
  • What simple steps can be done to get from A to B that don't require everything to be torn up 😄

Add the new errorv2 package, including the unified Error type and
associated helper functions for working with structured error data.
Update all code paths to use the errorv2 package instead of the original
errors package to ensure consistent structured error handling.
@alansy-nscale alansy-nscale force-pushed the alansy/inst-45 branch 4 times, most recently from d19fff3 to aebf26c Compare November 23, 2025 21:50
@alansy-nscale
Copy link
Collaborator Author

alansy-nscale commented Nov 24, 2025

@alansy-nscale alansy-nscale force-pushed the alansy/inst-45 branch 2 times, most recently from 7dbae85 to 70be6a4 Compare November 25, 2025 10:35
Perform general refactoring to improve consistency, readability, and
maintainability across the codebase.
@spjmurray
Copy link
Contributor

Right @alansy-nscale we're ready! So, what's in pkg/server/errors/errors.go is now solely for API errors, so we will not need a V2. Don't change any names, function signatures or try to rename existing error message types yet! I think we should start with adding the authentication errors stuff and header propagation and see what that looks like in isolation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants