-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for presigned URLs in the lakeFS API #4985
Conversation
Yay, this is a great direction! One worry is how to use this from lakeFSFS, because I would really like to use this from lakeFSFS. While this is certainly not the only client, it's a pretty important one. Here I am not sure how to use this and get the optimizations from S3A and other underlying FileSystems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat stuff! Can you open an issue optionally to rewrite lakectl fs {upload,cat} --direct
to use this redirecting mode? It will make it so much simpler and more natural!
Approving as a satisfied future user of this, obviously not as a versioning engine team member.
(Still not sure how to use it efficiently in lakeFSFS, but that's a different matter!)
pkg/api/controller.go
Outdated
writeError(w, r, http.StatusInternalServerError, err) | ||
return | ||
} | ||
response.PresignedUrl = StringPtr(preSignedURL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think
response.PresignedUrl = StringPtr(preSignedURL) | |
response.PresignedUrl = &preSignedURL |
is more readable as well as more efficient.
if authResponse.Allowed { | ||
objStat.PhysicalAddress, err = c.BlockAdapter.GetPreSignedURL(ctx, block.ObjectPointer{ | ||
StorageNamespace: repo.StorageNamespace, | ||
Identifier: entry.PhysicalAddress, | ||
IdentifierType: entry.AddressType.ToIdentifierType(), | ||
}, block.PreSignModeRead) | ||
if c.handleAPIError(ctx, w, r, err) { | ||
return | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I think of it this is really neat: I will be able to read all contents of an entire directory with a single listObjects on lakeFS.
parameters: | ||
- in: query | ||
name: presign | ||
required: false | ||
schema: | ||
type: boolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should be a POST endpoint that redirects!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! Added some comments.
And general question - do we want the pre sign duration configuration controlled in one location / settings?
pkg/block/azure/adapter.go
Outdated
Permissions: permissions.String(), | ||
} | ||
|
||
keyCredentials, ok := a.credentials.(*azblob.SharedKeyCredential) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be done once at NewAdater
and leave here just pointer check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pkg/block/azure/adapter.go
Outdated
// Since this is a blob SAS, the URL is to the Azure storage blob. | ||
qp := sasQueryParams.Encode() | ||
return fmt.Sprintf(preSignedBlobPattern, | ||
a.credentials, qualifiedKey.ContainerName, qualifiedKey.BlobURL, qp), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We depends on credentials
- string format of a type we don't control? Just making sure we do not need to call any explicit function to format the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, it's supposed to be the account name extracted from the credentials! Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is it safe to return a string containing the credentials?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@itaiad200 you caught this too, changed :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - I think this can bring value to our users!
return | ||
} | ||
if authResponse.Allowed { | ||
objStat.PhysicalAddress, err = c.BlockAdapter.GetPreSignedURL(ctx, block.ObjectPointer{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the default is 1000 objects in ListObjects. It means you're adding 1000 sequential calls to aws in a single lakeFS call. Doesn't seem like it can really scale. We've seen cases of users performing a >300 RPS of ListObjects, if we're planning to continue supporting this scale, we need to find other solutions for ListObjects. Can we get all 1000 links in a single API call to aws?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also consider the error rate. If AWS has 4 9s', the probability of 1000 calls to succeed is ~90%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetPreSignedURL
Doesn't do any network calls.
Pre-signed URLs are constructed locally by the adapter
pkg/block/azure/adapter.go
Outdated
@@ -29,21 +29,37 @@ const ( | |||
defaultMaxRetryRequests = 0 | |||
AuthMethodAccessKey = "access-key" | |||
AuthMethodMSI = "msi" | |||
|
|||
defaultPreSignedURLDuration = time.Minute * 15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a block
package-level const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call - moved
pkg/api/controller.go
Outdated
@@ -333,6 +333,21 @@ func (c *Controller) GetPhysicalAddress(w http.ResponseWriter, r *http.Request, | |||
PhysicalAddress: StringPtr(qk.Format()), | |||
Token: StringValue(token), | |||
} | |||
|
|||
if params.Presign != nil && *params.Presign { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice that the above was changed with the merge of #4900
return a.getPreSignedURL(ctx, obj, permissions) | ||
} | ||
|
||
func (a *Adapter) getPreSignedURL(ctx context.Context, obj block.ObjectPointer, permissions azblob.BlobSASPermissions) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/block/azure/adapter.go
Outdated
// Since this is a blob SAS, the URL is to the Azure storage blob. | ||
qp := sasQueryParams.Encode() | ||
return fmt.Sprintf(preSignedBlobPattern, | ||
a.credentials, qualifiedKey.ContainerName, qualifiedKey.BlobURL, qp), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is it safe to return a string containing the credentials?
pkg/block/gs/adapter.go
Outdated
@@ -21,6 +21,8 @@ const ( | |||
delimiter = "/" | |||
partSuffix = ".part_" | |||
markerSuffix = ".multipart" | |||
|
|||
defaultPreSignedURLDuration = time.Minute * 15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - if we define it equally in 2 different places, let's move it up the packages stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
pkg/block/s3/adapter.go
Outdated
@@ -29,6 +29,9 @@ const ( | |||
|
|||
// Chunks smaller than that are only allowed for the last chunk upload | |||
minChunkSize = 8 * 1024 | |||
|
|||
// DefaultPreSignedURLDuration determines how long should pre-signed URLs be valid for | |||
DefaultPreSignedURLDuration = time.Minute * 15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
"github.com/treeverse/lakefs/pkg/api" | ||
) | ||
|
||
func TestPreSign(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding tests for the Stat operation! Can we also add tests to all other new operations?
Also - let's validate that the links work as expected too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right - I'm still trying to get this test to pass via Esti, no luck yet :( will add the additional tests for links as well once I do.
api/swagger.yml
Outdated
@@ -3286,6 +3318,11 @@ paths: | |||
schema: | |||
type: string | |||
description: path to the table data | |||
- in: query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need it in symlinks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! We do not. removed
esti/presign_test.go
Outdated
blockStoreType := viper.GetViper().GetString("blockstore_type") | ||
expected := "" | ||
switch blockStoreType { | ||
case block.BlockstoreTypeS3: | ||
expected = "X-Amz-Signature=" | ||
case block.BlockstoreTypeGS: | ||
expected = "X-Goog-Signature=" | ||
case block.BlockstoreTypeAzure: | ||
expected = "sv=" | ||
default: | ||
t.Skip("Only GS, S3 and Azure Blob supported for pre-signed urls") | ||
} | ||
|
||
ctx, _, repo := setupTest(t) | ||
defer tearDownTest(repo) | ||
_, _ = uploadFileRandomData(ctx, t, repo, mainBranch, "foo/bar", true) | ||
|
||
// Get Object | ||
preSign := true | ||
response, err := client.StatObjectWithResponse(ctx, repo, mainBranch, &api.StatObjectParams{ | ||
Path: "foo/bar", | ||
Presign: &preSign, | ||
}) | ||
require.NoError(t, err, "failed to stat object with presign=true") | ||
signedUrl := response.JSON200.PhysicalAddress | ||
if !strings.HasPrefix(signedUrl, "http") { | ||
t.Errorf("exected an HTTP(s) URL instead of an object store URI, got %s", signedUrl) | ||
} | ||
|
||
if !strings.Contains(signedUrl, expected) { | ||
t.Errorf("expected a signed URL, got %s", signedUrl) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blockStoreType := viper.GetViper().GetString("blockstore_type") | |
expected := "" | |
switch blockStoreType { | |
case block.BlockstoreTypeS3: | |
expected = "X-Amz-Signature=" | |
case block.BlockstoreTypeGS: | |
expected = "X-Goog-Signature=" | |
case block.BlockstoreTypeAzure: | |
expected = "sv=" | |
default: | |
t.Skip("Only GS, S3 and Azure Blob supported for pre-signed urls") | |
} | |
ctx, _, repo := setupTest(t) | |
defer tearDownTest(repo) | |
_, _ = uploadFileRandomData(ctx, t, repo, mainBranch, "foo/bar", true) | |
// Get Object | |
preSign := true | |
response, err := client.StatObjectWithResponse(ctx, repo, mainBranch, &api.StatObjectParams{ | |
Path: "foo/bar", | |
Presign: &preSign, | |
}) | |
require.NoError(t, err, "failed to stat object with presign=true") | |
signedUrl := response.JSON200.PhysicalAddress | |
if !strings.HasPrefix(signedUrl, "http") { | |
t.Errorf("exected an HTTP(s) URL instead of an object store URI, got %s", signedUrl) | |
} | |
if !strings.Contains(signedUrl, expected) { | |
t.Errorf("expected a signed URL, got %s", signedUrl) | |
} | |
blockStoreType := viper.GetString("blockstore.type") | |
expectedKey := "" | |
switch blockStoreType { | |
case block.BlockstoreTypeS3: | |
expectedKey = "X-Amz-Signature" | |
case block.BlockstoreTypeGS: | |
expectedKey = "X-Goog-Signature" | |
case block.BlockstoreTypeAzure: | |
expectedKey = "sv" | |
default: | |
t.Skip("Only GS, S3 and Azure Blob supported for pre-signed urls") | |
} | |
ctx, _, repo := setupTest(t) | |
defer tearDownTest(repo) | |
_, _ = uploadFileRandomData(ctx, t, repo, mainBranch, "foo/bar", true) | |
response, err := client.StatObjectWithResponse(ctx, repo, mainBranch, &api.StatObjectParams{ | |
Path: "foo/bar", | |
Presign: swag.Bool(true), | |
}) | |
require.NoError(t, err, "failed to stat object with presign=true") | |
require.NotNil(t, response.JSON200, "successful response") | |
signedURL := response.JSON200.PhysicalAddress | |
parsedSignedURL, err := url.Parse(signedURL) | |
require.NoErrorf(t, err, "failed to parse url - %s", signedURL) | |
require.Truef(t, strings.HasPrefix(parsedSignedURL.Scheme, "http"), "URL scheme http(s) - %s", signedURL) | |
require.NotEmptyf(t, parsedSignedURL.Query().Get(expectedKey), "signature expected in key '%s' - %s", expectedKey, signedURL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! - additional comments for your consideration
@ozkatz Still missing tests for most of the APIs to get the links, including validations that links actually work. |
You're right @itaiad200 - Merged too early. Tests added in #5035 |
This one requires discussion :)
Currently, lakeFS users have 2 options when reading and writing data:
This requires a pretty hard tradeoff: (1) has implications regarding security (data going through the lakeFS server) as well as added operational complexity (i.e. the need to size a lakeFS installation for the required bandwidth) and potentially cost implications if the lakeFS installation is outside the AWS account and VPC used by its consumers (such as in lakeFS Cloud).
As for (2), it has a different set of problems: Once the metadata operation is done, direct access to the object store requires authentication and authorization. This means managing 2 different auth* systems: one for lakeFS and one for the underlying object store. Keeping both of them in sync is very hard (i.e. how do I give every user permissions to read only the exact underlying objects in a repository?)
This PR introduces pre-signed URLs as a middle ground. It allows receiving short-lived pre-signed URLs from
GetObject
,StatObject
andListObjects
with the optionalpresign=true
query parameter, as well as limited write support usingGetPhysicalAddress
where the address returned is pre-signed for a write operation.This PR currently does not handle Multipart uploads, this could be added later (and is probably a requirement before it resolves #3160)
This is currently in draft and requires discussion. I don't plan on merging this unless we agree this is desired.