
glob support for local, s3 and gcs - 2 #1592

Merged

merged 11 commits into main from csvdelimitter on Jan 17, 2023

Conversation

Member

@k-anshul commented Jan 11, 2023

This PR does the following:

  • Adds support for the csv.delimiter option for cloud connectors
  • Fixes the ? pattern in the S3 connector
  • Adds unit tests

Comment on lines 100 to 106
// Config is common config for all connectors
// Different connectors may add more configs
type Config struct {
    Path         string `mapstructure:"path"`
    Format       string `mapstructure:"format"`
    CSVDelimiter string `mapstructure:"csv.delimiter"`
}
Contributor

Let's keep this decentralized in the separate connectors, even if it means duplication. It may diverge quite soon (we'll probably need SQL and Kafka connectors soon, where these options do not apply)
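For illustration, a minimal sketch of what keeping the options decentralized might look like; the type and field names below are hypothetical, not the merged code:

// Each connector declares only the options it understands, at the cost of a few duplicated lines.
type S3Config struct {
    Path         string `mapstructure:"path"`
    Format       string `mapstructure:"format"`
    CSVDelimiter string `mapstructure:"csv.delimiter"`
}

// A future Kafka connector would simply omit the file-parsing options.
type KafkaConfig struct {
    Brokers string `mapstructure:"brokers"`
    Topic   string `mapstructure:"topic"`
}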

Member Author

With SQL and Kafka connectors in place, this entire code may need restructuring (consuming via files may no longer apply).
If we don't create these configs, the only option is to read configs from the source.Properties map. In my opinion it's never a good idea to read data (like configs) from a map: it makes things easier at the start, but later it becomes hard to keep track of where and what data was read.
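To make the trade-off concrete, a rough sketch of the two access styles being weighed, assuming the github.com/mitchellh/mapstructure package and a source.Properties-style map; all names and values are illustrative:

package main

import (
    "fmt"

    "github.com/mitchellh/mapstructure"
)

type Config struct {
    Path         string `mapstructure:"path"`
    Format       string `mapstructure:"format"`
    CSVDelimiter string `mapstructure:"csv.delimiter"`
}

func main() {
    props := map[string]any{"path": "gs://bucket/*.csv", "csv.delimiter": "|"}

    // Map-based access: easy to start with, but the keys consumed here end up
    // scattered across call sites and are hard to track later.
    delim, _ := props["csv.delimiter"].(string)
    fmt.Println(delim)

    // Typed decode: every option the connector consumes is visible in one struct.
    var cfg Config
    if err := mapstructure.Decode(props, &cfg); err != nil {
        panic(err)
    }
    fmt.Println(cfg.Path, cfg.CSVDelimiter)
}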

Member Author

Another thing we can do is let connectors expose fileParsingOptions which we can consume here. Where such options do not apply, connectors will not return valid options.

Contributor

I understand the problem around needing access to this info. I thought some more about this. Some thoughts:

  • I agree that reading config from maps in general is not good, though it can be acceptable in some cases if it prevents tight coupling of packages.
  • My main worry here is about using struct embedding (inheritance?) for a simple config object; it feels like too much complexity versus simply having a few duplicated lines. The idea of having FileParseOptions seems nicer, although maybe premature for so few options.
  • About the worry of exposing Format and CSVDelimiter to the DuckDB driver, I think the solution is to have these available on the FileIterator previously discussed instead (see the sketch after this list). The DuckDB driver should not need to re-parse the source properties; the iterator it gets from the connector should contain any info necessary to correctly consume the files. This also ensures looser coupling.
  • For SQL and Kafka, file-based ingestion might still be a nice naive solution. Buffering in files (for X seconds from Kafka, or for X size when loading a SQL query) and then loading the file is a pretty common way to ingest into OLAP DBs – often, INSERTs have poor performance for columnar data. (E.g. Druid doesn't even support INSERT, the main approaches are load from file in object storage or direct connect to Kafka).
    • But I agree more flexibility might also be needed at some point. One idea is to also offer a ConsumeAsArrow option. We will need to scope it when that happens (it will also largely depend on underlying DB capabilities when we get to cloud scale).
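A hypothetical sketch of how such a FileIterator could carry the parsing options to the consumer; the interface and method names are assumptions, not the repo's actual API:

// FileIterator is a hypothetical iterator a connector would return. It hands the
// consumer (e.g. the DuckDB driver) batches of locally downloaded files together
// with the metadata needed to parse them, so the driver never re-parses source
// properties itself.
type FileIterator interface {
    // HasNext reports whether another batch of files is available.
    HasNext() bool
    // NextBatch downloads and returns local paths for up to limit files.
    NextBatch(limit int) ([]string, error)
    // Format and CSVDelimiter describe how the files should be read.
    Format() string
    CSVDelimiter() string
    // Close removes any temporary files created while iterating.
    Close() error
}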

Member Author

Thanks for the detailed comments.
I agree that with a FileIterator we should be able to remove the map-based access. I will go ahead and implement it with a map itself for now.

Comment on lines 101 to 108
func gcsURLParts(path string) (string, string, error) {
    trimmedPath := strings.Replace(path, "gs://", "", 1)
    bucket, glob, found := strings.Cut(trimmedPath, "/")
    if !found {
        return "", "", fmt.Errorf("failed to parse path %s", path)
    }
    return bucket, glob, nil
}
Contributor

At this point, we don't know if path is well-formed, so if we're not using url.Parse, it needs to be parsed more defensively.

For example, if the input is whoops/bags://d/match.csv, it will actually fetch bad/match.csv from the whoops bucket!

Also, let's add a comment describing why we're not using url.Parse

Contributor

Same applies for s3. Maybe we should add a globurl util package in runtime/pkg with a function like url.Parse that works on URLs containing globs?
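For illustration, a rough sketch of what such a helper could look like; the package that was eventually added lives in runtime/pkg/globutil (see the resolved threads below), but the code here is an approximation, not the merged implementation:

package globutil

import (
    "fmt"
    "strings"
)

// URL is a parsed object-storage URL whose path may contain glob characters
// ("*", "?", "[", "{"), which net/url does not handle well.
type URL struct {
    Scheme string // e.g. "gs" or "s3"
    Host   string // bucket name
    Path   string // object key or glob pattern, without the leading slash
}

// ParseBucketURL splits URLs like "gs://bucket/y=*/month=*/*.csv" defensively:
// the scheme must sit at the very start of the string, so inputs such as
// "whoops/bags://d/match.csv" are rejected instead of silently misparsed.
func ParseBucketURL(path string) (*URL, error) {
    scheme, rest, found := strings.Cut(path, "://")
    if !found || scheme == "" || strings.ContainsAny(scheme, "/*?[{") {
        return nil, fmt.Errorf("invalid bucket URL %q", path)
    }
    bucket, glob, found := strings.Cut(rest, "/")
    if !found || bucket == "" {
        return nil, fmt.Errorf("failed to parse path %q", path)
    }
    return &URL{Scheme: scheme, Host: bucket, Path: glob}, nil
}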

Member Author

Makes sense. Will change.

runtime/drivers/duckdb/connectors.go — outdated (5 resolved threads)
runtime/connectors/blob/blobdownloader_test.go — outdated (resolved)
@nishantmonu51 added the blocker label (A release blocker issue that should be resolved before a new release) on Jan 16, 2023
Contributor

@begelundmuller left a comment

Thanks, mainly just nits remaining.

Let's move the docs changes into a separate PR and tag @magorlick for review on it.

runtime/connectors/blob/blobdownloader_test.go — outdated (resolved)

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        got, err := FetchFileNames(tt.args.ctx, tt.args.bucket, tt.args.config, tt.args.globPattern, tt.args.bucketPath)
Contributor

ctx is nil. Use context.Background() instead and remove the args.ctx param. Passing a nil context can lead to nil pointer errors.

Member Author

@k-anshul Jan 17, 2023

The context is not nil; it is set to context.Background() as part of the args initialisation.
The entire args struct and test code is auto-generated (apart from the values), so I don't want to make many changes there.

Contributor

Got it, missed the place it was created
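For reference, a minimal sketch of the generated table-test layout described in the thread above; the package, field names, and values are illustrative, not the actual test file:

package blob

import (
    "context"
    "testing"

    "github.com/stretchr/testify/require"
)

func TestArgsContextIsSet(t *testing.T) {
    // Mirrors the generated layout: ctx is filled in when args is built,
    // so the call site never passes a nil context.
    type args struct {
        ctx         context.Context
        globPattern string
    }
    tests := []struct {
        name string
        args args
    }{
        {name: "glob match", args: args{ctx: context.Background(), globPattern: "2020/**/*.csv"}},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            require.NotNil(t, tt.args.ctx)
        })
    }
}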

runtime/connectors/blob/blobdownloader_test.go — outdated (resolved)
runtime/connectors/gcs/gcs.go — outdated (resolved)
runtime/connectors/s3/s3.go — outdated (resolved)
}
}

func sourceReaderWithDelimiter(paths []string, delimitter string) string {
Contributor

  1. Spelling mistake in delimitter
  2. The function name doesn't match getSourceReader. I also don't like the get so much (I know it was there before). Maybe makeSourceReader and makeSourceReaderCSV (implying the recursive relationship)?

Member Author

Changed getSourceReader to sourceReader (I don't like the get myself either, just didn't change the existing code).

I believe we shouldn't fret too much about the names here; the functions are small and easy to understand.
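For illustration, a rough sketch of how the renamed helpers might relate. The read_csv_auto/read_parquet calls follow DuckDB's documented syntax, but the function bodies and the convertToStatementParamsStr helper are assumptions, not the merged code:

package duckdb

import (
    "fmt"
    "strings"
)

// sourceReader builds the DuckDB table function used to scan the downloaded files.
func sourceReader(paths []string, format, csvDelimiter string) (string, error) {
    switch {
    case strings.Contains(format, ".csv") && csvDelimiter != "":
        // Delegate to the CSV-specific variant when a custom delimiter is configured.
        return sourceReaderWithDelimiter(paths, csvDelimiter), nil
    case strings.Contains(format, ".csv") || strings.Contains(format, ".tsv"):
        return fmt.Sprintf("read_csv_auto(%s)", convertToStatementParamsStr(paths)), nil
    case strings.Contains(format, ".parquet"):
        return fmt.Sprintf("read_parquet(%s)", convertToStatementParamsStr(paths)), nil
    default:
        return "", fmt.Errorf("file type not supported: %s", format)
    }
}

// sourceReaderWithDelimiter is the CSV-specific variant that forwards the delimiter option.
func sourceReaderWithDelimiter(paths []string, delimiter string) string {
    return fmt.Sprintf("read_csv_auto(%s, delim='%s')", convertToStatementParamsStr(paths), delimiter)
}

// convertToStatementParamsStr quotes and joins local file paths: ['a.csv','b.csv'].
func convertToStatementParamsStr(paths []string) string {
    quoted := make([]string, len(paths))
    for i, p := range paths {
        quoted[i] = fmt.Sprintf("'%s'", p)
    }
    return "[" + strings.Join(quoted, ",") + "]"
}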

Comment on lines 32 to 38
t.Run(tt.name, func(t *testing.T) {
    got, got1, got2, err := ParseURL(tt.args)
    if (err != nil) != tt.wantErr {
        t.Errorf("ParseURL() error = %v, wantErr %v", err, tt.wantErr)
        return
    }
    if got != tt.want {
        t.Errorf("ParseURL() got = %v, want %v", got, tt.want)
    }
    if got1 != tt.want1 {
        t.Errorf("ParseURL() got1 = %v, want %v", got1, tt.want1)
    }
    if got2 != tt.want2 {
        t.Errorf("ParseURL() got2 = %v, want %v", got2, tt.want2)
    }
})
Contributor

  1. Use require.NotEqual and similar
  2. Maybe consider not doing a matrix test for so few cases – it comes with a readability cost

Member Author

I am not aware of the term "matrix test", so I'm not sure what it means here. I changed the struct field names for better readability. Anyway, both the actual code and the test code are small and easy to read and understand.

Contributor

Not sure it's called a matrix test – it's just what popped into my mind for the pattern of looping over a list of test-case params. Is your IDE generating these for you?

When there are few cases, I sometimes find them harder to read than direct assertions. But they're very common and this is just a personal preference, so no need to change.
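For illustration, one way the non-table style could read with testify's require. This is a sketch, not the reviewer's original suggestion, and it assumes ParseURL takes a URL string and returns three string components plus an error, matching the assertions in the generated test; the package name and expected values are also assumptions:

package globutil

import (
    "testing"

    "github.com/stretchr/testify/require"
)

func TestParseURLDirect(t *testing.T) {
    // Direct assertions read linearly when there are only a few cases.
    bucket, glob, ext, err := ParseURL("gs://my-bucket/y=2023/*.csv")
    require.NoError(t, err)
    require.Equal(t, "my-bucket", bucket)
    require.Equal(t, "y=2023/*.csv", glob)
    require.Equal(t, ".csv", ext)

    // Malformed input should fail rather than be silently misparsed.
    _, _, _, err = ParseURL("whoops/bags://d/match.csv")
    require.Error(t, err)
}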

runtime/pkg/globutil/globutil.go — outdated (2 resolved threads)
Contributor

@begelundmuller left a comment

Looks good, thanks!

@begelundmuller merged commit 2c47b42 into main on Jan 17, 2023
@begelundmuller deleted the csvdelimitter branch on January 17, 2023 09:40
@begelundmuller removed the request for review from magorlick on January 17, 2023 09:40
bcolloran pushed a commit that referenced this pull request Mar 7, 2023
* fixed ? in glob pattern

* initial commit

* loading local file connector

* adding uts

* liniting issues

* review comments

* ut fix

* review nits
djbarnwal pushed a commit that referenced this pull request Aug 3, 2023
* fixed ? in glob pattern

* initial commit

* loading local file connector

* adding uts

* liniting issues

* review comments

* ut fix

* review nits