
glob support for local, s3 and gcs - 2 #1592

Merged

merged 11 commits into main from csvdelimitter on Jan 17, 2023

Conversation

Member

@k-anshul commented Jan 11, 2023

This PR does the following:

  • Adds support for the csv.delimiter option for cloud connectors
  • Fixes the ? pattern in the S3 connector
  • Adds unit tests

Comment on lines 100 to 106
// Config is common config for all connectors
// Different connectors may add more configs
type Config struct {
    Path         string `mapstructure:"path"`
    Format       string `mapstructure:"format"`
    CSVDelimiter string `mapstructure:"csv.delimiter"`
}
Contributor

Let's keep this decentralized in the separate connectors, even if it means duplication. It may diverge quite soon (we'll probably need SQL and Kafka connectors soon, where these options do not apply)
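For illustration, a minimal sketch of what keeping the options decentralized might look like; the type and field names below are hypothetical, not the merged code:

// Each connector declares only the options it understands, at the cost of a few duplicated lines.
type S3Config struct {
    Path         string `mapstructure:"path"`
    Format       string `mapstructure:"format"`
    CSVDelimiter string `mapstructure:"csv.delimiter"`
}

// A future Kafka connector would simply omit the file-parsing options.
type KafkaConfig struct {
    Brokers string `mapstructure:"brokers"`
    Topic   string `mapstructure:"topic"`
}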

Member Author

With SQL and Kafka connectors in place, this entire code may need restructuring (consuming via files may no longer apply).
If we don't create these configs, the only option is to read configs from the source.Properties map. In my opinion it's never a good idea to read data (like configs) from a map: it makes things easier at the start, but later it becomes hard to keep track of where and what data was read.
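To make the trade-off concrete, a rough sketch of the two access styles being weighed, assuming the github.com/mitchellh/mapstructure package and a source.Properties-style map; all names and values are illustrative:

package main

import (
    "fmt"

    "github.com/mitchellh/mapstructure"
)

type Config struct {
    Path         string `mapstructure:"path"`
    Format       string `mapstructure:"format"`
    CSVDelimiter string `mapstructure:"csv.delimiter"`
}

func main() {
    props := map[string]any{"path": "gs://bucket/*.csv", "csv.delimiter": "|"}

    // Map-based access: easy to start with, but the keys consumed here end up
    // scattered across call sites and are hard to track later.
    delim, _ := props["csv.delimiter"].(string)
    fmt.Println(delim)

    // Typed decode: every option the connector consumes is visible in one struct.
    var cfg Config
    if err := mapstructure.Decode(props, &cfg); err != nil {
        panic(err)
    }
    fmt.Println(cfg.Path, cfg.CSVDelimiter)
}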

Member Author

Another thing we can do is let connectors expose fileParsingOptions which we can consume here. Where such options do not apply, connectors will not return valid options.

Contributor

I understand the problem around needing access to this info. I thought some more about this. Some thoughts:

  • I agree that reading config from maps in general is not good, though it can be acceptable in some cases if it prevents tight coupling of packages.
  • My main worry here is about using struct embedding (inheritance?) for a simple config object; it feels like too much complexity versus simply having a few duplicated lines. The idea of having FileParseOptions seems nicer, although maybe premature for so few options.
  • About the worry of exposing Format and CSVDelimiter to the DuckDB driver, I think the solution is to have these available on the FileIterator previously discussed instead (see the sketch after this list). The DuckDB driver should not need to re-parse the source properties; the iterator it gets from the connector should contain any info necessary to correctly consume the files. This also ensures looser coupling.
  • For SQL and Kafka, file-based ingestion might still be a nice naive solution. Buffering in files (for X seconds from Kafka, or for X size when loading a SQL query) and then loading the file is a pretty common way to ingest into OLAP DBs – often, INSERTs have poor performance for columnar data. (E.g. Druid doesn't even support INSERT, the main approaches are load from file in object storage or direct connect to Kafka).
    • But I agree more flexibility might also be needed at some point. One idea is to also offer a ConsumeAsArrow option. We will need to scope it when that happens (it will also largely depend on underlying DB capabilities when we get to cloud scale).
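A hypothetical sketch of how such a FileIterator could carry the parsing options to the consumer; the interface and method names are assumptions, not the repo's actual API:

// FileIterator is a hypothetical iterator a connector would return. It hands the
// consumer (e.g. the DuckDB driver) batches of locally downloaded files together
// with the metadata needed to parse them, so the driver never re-parses source
// properties itself.
type FileIterator interface {
    // HasNext reports whether another batch of files is available.
    HasNext() bool
    // NextBatch downloads and returns local paths for up to limit files.
    NextBatch(limit int) ([]string, error)
    // Format and CSVDelimiter describe how the files should be read.
    Format() string
    CSVDelimiter() string
    // Close removes any temporary files created while iterating.
    Close() error
}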

Member Author

Thanks for the detailed comments.
I agree that with a FileIterator we should be able to remove the map-based access. I will go ahead and implement it with a map itself for now.

Comment on lines 101 to 108
func gcsURLParts(path string) (string, string, error) {
    trimmedPath := strings.Replace(path, "gs://", "", 1)
    bucket, glob, found := strings.Cut(trimmedPath, "/")
    if !found {
        return "", "", fmt.Errorf("failed to parse path %s", path)
    }
    return bucket, glob, nil
}
Contributor

At this point, we don't know if path is well-formed, so if we're not using url.Parse, it needs to be parsed more defensively.

For example, if the input is whoops/bags://d/match.csv, it will actually fetch bad/match.csv from the whoops bucket!

Also, let's add a comment describing why we're not using url.Parse

Contributor

Same applies for s3. Maybe we should add a globurl util package in runtime/pkg with a function like url.Parse that works on URLs containing globs?
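For illustration, a rough sketch of what such a helper could look like; the package that was eventually added lives in runtime/pkg/globutil (see the resolved threads below), but the code here is an approximation, not the merged implementation:

package globutil

import (
    "fmt"
    "strings"
)

// URL is a parsed object-storage URL whose path may contain glob characters
// ("*", "?", "[", "{"), which net/url does not handle well.
type URL struct {
    Scheme string // e.g. "gs" or "s3"
    Host   string // bucket name
    Path   string // object key or glob pattern, without the leading slash
}

// ParseBucketURL splits URLs like "gs://bucket/y=*/month=*/*.csv" defensively:
// the scheme must sit at the very start of the string, so inputs such as
// "whoops/bags://d/match.csv" are rejected instead of silently misparsed.
func ParseBucketURL(path string) (*URL, error) {
    scheme, rest, found := strings.Cut(path, "://")
    if !found || scheme == "" || strings.ContainsAny(scheme, "/*?[{") {
        return nil, fmt.Errorf("invalid bucket URL %q", path)
    }
    bucket, glob, found := strings.Cut(rest, "/")
    if !found || bucket == "" {
        return nil, fmt.Errorf("failed to parse path %q", path)
    }
    return &URL{Scheme: scheme, Host: bucket, Path: glob}, nil
}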

Member Author

Makes sense. Will change.

runtime/drivers/duckdb/connectors.go — outdated (5 resolved threads)
runtime/connectors/blob/blobdownloader_test.go — outdated (resolved)
@nishantmonu51 added the blocker label (A release blocker issue that should be resolved before a new release) on Jan 16, 2023
Contributor

@begelundmuller left a comment

Thanks, mainly just nits remaining.

Let's move the docs changes into a separate PR and tag @magorlick for review on it.

runtime/connectors/blob/blobdownloader_test.go — outdated (resolved)

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        got, err := FetchFileNames(tt.args.ctx, tt.args.bucket, tt.args.config, tt.args.globPattern, tt.args.bucketPath)
Contributor

ctx is nil. Use context.Background() instead and remove the args.ctx param. Passing a nil context can lead to nil pointer errors.

Member Author

@k-anshul Jan 17, 2023

The context is not nil; it is set to context.Background() as part of the args initialisation.
The entire args struct and test code is auto-generated (apart from the values), so I don't want to make many changes there.

Contributor

Got it, missed the place it was created
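For reference, a minimal sketch of the generated table-test layout described in the thread above; the package, field names, and values are illustrative, not the actual test file:

package blob

import (
    "context"
    "testing"

    "github.com/stretchr/testify/require"
)

func TestArgsContextIsSet(t *testing.T) {
    // Mirrors the generated layout: ctx is filled in when args is built,
    // so the call site never passes a nil context.
    type args struct {
        ctx         context.Context
        globPattern string
    }
    tests := []struct {
        name string
        args args
    }{
        {name: "glob match", args: args{ctx: context.Background(), globPattern: "2020/**/*.csv"}},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            require.NotNil(t, tt.args.ctx)
        })
    }
}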

runtime/connectors/blob/blobdownloader_test.go — outdated (resolved)
runtime/connectors/gcs/gcs.go — outdated (resolved)
runtime/connectors/s3/s3.go — outdated (resolved)
}
}

func sourceReaderWithDelimiter(paths []string, delimitter string) string {
Contributor

  1. Spelling mistake in delimitter
  2. The function name doesn't match getSourceReader. I also don't like the get so much (I know it was there before). Maybe makeSourceReader and makeSourceReaderCSV (implying the recursive relationship)?

Member Author

Changed getSourceReader to sourceReader (I don't like the get myself either, just didn't change the existing code).

I believe we shouldn't fret too much about the names here; the functions are small and easy to understand.
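For illustration, a rough sketch of how the renamed helpers might relate. The read_csv_auto/read_parquet calls follow DuckDB's documented syntax, but the function bodies and the convertToStatementParamsStr helper are assumptions, not the merged code:

package duckdb

import (
    "fmt"
    "strings"
)

// sourceReader builds the DuckDB table function used to scan the downloaded files.
func sourceReader(paths []string, format, csvDelimiter string) (string, error) {
    switch {
    case strings.Contains(format, ".csv") && csvDelimiter != "":
        // Delegate to the CSV-specific variant when a custom delimiter is configured.
        return sourceReaderWithDelimiter(paths, csvDelimiter), nil
    case strings.Contains(format, ".csv") || strings.Contains(format, ".tsv"):
        return fmt.Sprintf("read_csv_auto(%s)", convertToStatementParamsStr(paths)), nil
    case strings.Contains(format, ".parquet"):
        return fmt.Sprintf("read_parquet(%s)", convertToStatementParamsStr(paths)), nil
    default:
        return "", fmt.Errorf("file type not supported: %s", format)
    }
}

// sourceReaderWithDelimiter is the CSV-specific variant that forwards the delimiter option.
func sourceReaderWithDelimiter(paths []string, delimiter string) string {
    return fmt.Sprintf("read_csv_auto(%s, delim='%s')", convertToStatementParamsStr(paths), delimiter)
}

// convertToStatementParamsStr quotes and joins local file paths: ['a.csv','b.csv'].
func convertToStatementParamsStr(paths []string) string {
    quoted := make([]string, len(paths))
    for i, p := range paths {
        quoted[i] = fmt.Sprintf("'%s'", p)
    }
    return "[" + strings.Join(quoted, ",") + "]"
}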

Comment on lines 32 to 38
t.Run(tt.name, func(t *testing.T) {
    got, got1, got2, err := ParseURL(tt.args)
    if (err != nil) != tt.wantErr {
        t.Errorf("ParseURL() error = %v, wantErr %v", err, tt.wantErr)
        return
    }
    if got != tt.want {
        t.Errorf("ParseURL() got = %v, want %v", got, tt.want)
    }
    if got1 != tt.want1 {
        t.Errorf("ParseURL() got1 = %v, want %v", got1, tt.want1)
    }
    if got2 != tt.want2 {
        t.Errorf("ParseURL() got2 = %v, want %v", got2, tt.want2)
    }
})
Contributor

  1. Use require.NotEqual and similar
  2. Maybe consider not doing a matrix test for so few cases – it comes with a readability cost

Member Author

I am not aware of the term "matrix test", so I'm not sure what it means here. I changed the struct field names for better readability. Anyway, both the actual code and the test code are small and easy to read and understand.

Contributor

Not sure it's called a matrix test – it's just what popped into my mind for the pattern of looping over a list of test-case params. Is your IDE generating these for you?

When there are few cases, I sometimes find them harder to read than direct assertions. But they're very common and this is just a personal preference, so no need to change.
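For illustration, one way the non-table style could read with testify's require. This is a sketch, not the reviewer's original suggestion, and it assumes ParseURL takes a URL string and returns three string components plus an error, matching the assertions in the generated test; the package name and expected values are also assumptions:

package globutil

import (
    "testing"

    "github.com/stretchr/testify/require"
)

func TestParseURLDirect(t *testing.T) {
    // Direct assertions read linearly when there are only a few cases.
    bucket, glob, ext, err := ParseURL("gs://my-bucket/y=2023/*.csv")
    require.NoError(t, err)
    require.Equal(t, "my-bucket", bucket)
    require.Equal(t, "y=2023/*.csv", glob)
    require.Equal(t, ".csv", ext)

    // Malformed input should fail rather than be silently misparsed.
    _, _, _, err = ParseURL("whoops/bags://d/match.csv")
    require.Error(t, err)
}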

runtime/pkg/globutil/globutil.go — outdated (2 resolved threads)
Contributor

@begelundmuller left a comment

Looks good, thanks!

@begelundmuller merged commit 2c47b42 into main on Jan 17, 2023
@begelundmuller deleted the csvdelimitter branch on January 17, 2023 09:40
@begelundmuller removed the request for review from magorlick on January 17, 2023 09:40
bcolloran pushed a commit that referenced this pull request Mar 7, 2023
* fixed ? in glob pattern

* initial commit

* loading local file connector

* adding uts

* liniting issues

* review comments

* ut fix

* review nits
djbarnwal pushed a commit that referenced this pull request Aug 3, 2023
* fixed ? in glob pattern

* initial commit

* loading local file connector

* adding uts

* liniting issues

* review comments

* ut fix

* review nits