glob support for local, s3 and gcs - 2 #1592
Conversation
runtime/connectors/connectors.go
Outdated
```go
// Config is common config for all connectors
// Different connectors may add more configs
type Config struct {
	Path         string `mapstructure:"path"`
	Format       string `mapstructure:"format"`
	CSVDelimiter string `mapstructure:"csv.delimiter"`
}
```
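As an illustration of the typed-config approach discussed below, here is a sketch of a duplicated per-connector config that reads the untyped properties map once at the boundary. The names `localConfig` and `parseConfig` are hypothetical, not the actual Rill code:

```go
package main

import "fmt"

// Hypothetical per-connector config, duplicated per connector rather
// than shared or embedded.
type localConfig struct {
	Path         string
	Format       string
	CSVDelimiter string
}

// parseConfig reads the untyped properties map once at the boundary,
// so the rest of the connector only sees typed fields.
func parseConfig(props map[string]any) (*localConfig, error) {
	c := &localConfig{}
	path, ok := props["path"].(string)
	if !ok {
		return nil, fmt.Errorf(`missing required property "path"`)
	}
	c.Path = path
	// Optional properties: zero value if absent or mistyped.
	c.Format, _ = props["format"].(string)
	c.CSVDelimiter, _ = props["csv.delimiter"].(string)
	return c, nil
}

func main() {
	c, err := parseConfig(map[string]any{"path": "data/*.csv", "format": "csv"})
	fmt.Printf("%+v %v\n", c, err)
}
```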
Let's keep this decentralized in the separate connectors, even if it means duplication. It may diverge quite soon (we'll probably need SQL and Kafka connectors soon, where these options do not apply)
With SQL and Kafka connectors in place, this entire code may need restructuring (consuming via files may no longer apply).
If we don't create these configs, the only option is to read configs from the source.Properties map. In my opinion it's never a good idea to read data (like configs) from a map. It makes things easier at the start, but it becomes tough later on to keep track of where and what data was read.
Another thing we can do is let connectors expose fileParsingOptions, which we can consume here. Where such options do not apply, connectors will not return valid options.
I understand the problem around needing access to this info. I thought some more about this. Some thoughts:
- I agree that reading config from maps in general is not good, though it can be acceptable in some cases if it prevents tight coupling of packages.
- My main worry here is about using struct embedding (inheritance?) for a simple config object; it feels like too much complexity versus simply having a few duplicated lines. The idea of having `FileParseOptions` seems nicer, although maybe premature for so few options.
- About the worry of exposing `Format` and `CSVDelimiter` to the DuckDB driver, I think the solution is to have these be available on the `FileIterator` previously discussed instead. So the DuckDB driver should not need to re-parse the source properties; instead, the iterator it gets from the connector should contain any info necessary to correctly consume the files. This also ensures looser coupling.
- For SQL and Kafka, file-based ingestion might still be a nice naive solution. Buffering in files (for X seconds from Kafka, or for X size when loading a SQL query) and then loading the file is a pretty common way to ingest into OLAP DBs – often, `INSERT`s have poor performance for columnar data. (E.g. Druid doesn't even support `INSERT`; the main approaches are load from a file in object storage or direct connect to Kafka.)
- But I agree more flexibility might also be needed at some point. One idea is to also offer a `ConsumeAsArrow` option. We will need to scope it when that happens (it will also largely depend on underlying DB capabilities when we get to cloud scale).
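To make the `FileIterator` idea concrete, here is a hypothetical sketch: the connector hands the driver an iterator that carries the parsing info (format, delimiter), so the DuckDB driver never re-parses the raw source properties. All names are illustrative, not the real API:

```go
package main

import "fmt"

// FileIterator is a hypothetical sketch of the interface discussed
// in this thread.
type FileIterator interface {
	HasNext() bool
	// NextBatch returns the next batch of locally available file paths.
	NextBatch() ([]string, error)
	// Parsing info travels with the iterator instead of being
	// re-read from the raw properties map by the driver.
	Format() string
	Delimiter() string
}

// sliceIterator is a trivial in-memory implementation for illustration.
type sliceIterator struct {
	batches   [][]string
	format    string
	delimiter string
}

func (s *sliceIterator) HasNext() bool { return len(s.batches) > 0 }

func (s *sliceIterator) NextBatch() ([]string, error) {
	b := s.batches[0]
	s.batches = s.batches[1:]
	return b, nil
}

func (s *sliceIterator) Format() string    { return s.format }
func (s *sliceIterator) Delimiter() string { return s.delimiter }

func main() {
	var it FileIterator = &sliceIterator{
		batches:   [][]string{{"a.csv"}, {"b.csv", "c.csv"}},
		format:    "csv",
		delimiter: "|",
	}
	for it.HasNext() {
		batch, _ := it.NextBatch()
		fmt.Println(it.Format(), it.Delimiter(), batch)
	}
}
```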
Thanks for the detailed comments.
I agree that with a `FileIterator` we should be able to remove the map-based access. I will go ahead and implement it with a map itself for now.
runtime/connectors/gcs/gcs.go
Outdated
```go
func gcsURLParts(path string) (string, string, error) {
	trimmedPath := strings.Replace(path, "gs://", "", 1)
	bucket, glob, found := strings.Cut(trimmedPath, "/")
	if !found {
		return "", "", fmt.Errorf("failed to parse path %s", path)
	}
	return bucket, glob, nil
}
```
At this point, we don't know if `path` is well-formed, so if not using `url.Parse`, it needs to be more defensively parsed.
For example, if the input is `whoops/bags://d/match.csv`, it will actually fetch `bad/match.csv` from the `whoops` bucket!
Also, let's add a comment describing why we're not using `url.Parse`.
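The mis-parse is easy to reproduce with the function from the diff above, as a standalone sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// gcsURLParts as in the diff above: naive strings.Replace parsing.
func gcsURLParts(path string) (string, string, error) {
	trimmedPath := strings.Replace(path, "gs://", "", 1)
	bucket, glob, found := strings.Cut(trimmedPath, "/")
	if !found {
		return "", "", fmt.Errorf("failed to parse path %s", path)
	}
	return bucket, glob, nil
}

func main() {
	// "gs://" matches inside "bags://", so the malformed input is
	// silently accepted as bucket "whoops", object "bad/match.csv".
	bucket, glob, err := gcsURLParts("whoops/bags://d/match.csv")
	fmt.Println(bucket, glob, err) // whoops bad/match.csv <nil>
}
```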
Same applies for `s3`. Maybe we should add a `globurl` util package in `runtime/pkg` with a function like `url.Parse` that works on URLs containing globs?
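A possible shape for that helper (a hypothetical sketch, not the actual `globurl` package): it tolerates glob characters but requires the scheme prefix at the start of the string, so malformed inputs like the one above are rejected instead of mis-split:

```go
package main

import (
	"fmt"
	"strings"
)

// parseGlobURL is a sketch of a url.Parse-like function for glob URLs.
// Unlike naive strings.Replace parsing, it anchors the scheme to the
// start of the string.
func parseGlobURL(path string) (scheme, bucket, glob string, err error) {
	for _, prefix := range []string{"gs://", "s3://"} {
		if !strings.HasPrefix(path, prefix) {
			continue
		}
		rest := strings.TrimPrefix(path, prefix)
		b, g, found := strings.Cut(rest, "/")
		if !found || b == "" || g == "" {
			return "", "", "", fmt.Errorf("invalid object path in %q", path)
		}
		return strings.TrimSuffix(prefix, "://"), b, g, nil
	}
	return "", "", "", fmt.Errorf("unsupported or missing scheme in %q", path)
}

func main() {
	fmt.Println(parseGlobURL("gs://my-bucket/data/*.csv"))
	// The malformed example from above is now rejected:
	fmt.Println(parseGlobURL("whoops/bags://d/match.csv"))
}
```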
Makes sense. Will change.
Thanks, mainly just nits remaining.
Let's move the docs changes into a separate PR and tag @magorlick for review on it.
```go
for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		got, err := FetchFileNames(tt.args.ctx, tt.args.bucket, tt.args.config, tt.args.globPattern, tt.args.bucketPath)
```
`ctx` is `nil`. Use `context.Background()` instead and remove the `args.ctx` param. Passing a `nil` context can lead to nil pointer errors.
The context is not nil; it's set to `context.Background()` as part of the args initialisation.
The entire args struct and test code is auto-generated (apart from the values), so I don't want to make many changes there.
Got it, I missed the place it was created.
runtime/drivers/duckdb/connectors.go
Outdated
```go
func sourceReaderWithDelimiter(paths []string, delimitter string) string {
```
- Spelling mistake in `delimitter`.
- The function name doesn't match `getSourceReader`. I also don't like the `get` so much (I know it was there before). Maybe `makeSourceReader` and `makeSourceReaderCSV` (implying the recursive relationship)?
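A sketch of the naming suggested here, illustrative only: the real functions build more elaborate SQL, and I'm assuming DuckDB's `read_csv_auto` with a `delim` argument:

```go
package main

import (
	"fmt"
	"strings"
)

// makeSourceReader dispatches on format; makeSourceReaderCSV handles
// the delimiter (spelled correctly). Hypothetical sketch, not the
// actual connectors.go code.
func makeSourceReader(paths []string, format, delimiter string) string {
	if format == "csv" && delimiter != "" {
		return makeSourceReaderCSV(paths, delimiter)
	}
	return fmt.Sprintf("read_parquet(['%s'])", strings.Join(paths, "','"))
}

func makeSourceReaderCSV(paths []string, delimiter string) string {
	return fmt.Sprintf("read_csv_auto(['%s'], delim='%s')",
		strings.Join(paths, "','"), delimiter)
}

func main() {
	fmt.Println(makeSourceReader([]string{"a.csv", "b.csv"}, "csv", "|"))
	// read_csv_auto(['a.csv','b.csv'], delim='|')
}
```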
Changed `getSourceReader` to `sourceReader` (I don't like `get` myself; I just didn't change the existing code).
Let's not fret too much about the names here; the functions are small and easy to understand.
```go
t.Run(tt.name, func(t *testing.T) {
	got, got1, got2, err := ParseURL(tt.args)
	if (err != nil) != tt.wantErr {
		t.Errorf("ParseURL() error = %v, wantErr %v", err, tt.wantErr)
		return
	}
	if got != tt.want {
		t.Errorf("ParseURL() got = %v, want %v", got, tt.want)
	}
	if got1 != tt.want1 {
		t.Errorf("ParseURL() got1 = %v, want %v", got1, tt.want1)
	}
	if got2 != tt.want2 {
		t.Errorf("ParseURL() got2 = %v, want %v", got2, tt.want2)
	}
})
```
- Use `require.NotEqual` and similar.
- Maybe consider not doing a matrix test for so few cases – it has a readability loss.
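For so few cases, direct per-case assertions can read better than a table-driven loop. A sketch with a stand-in `parseURL` of the same shape as `ParseURL` (three values plus an error; the body is illustrative, and the panics stand in for testify's `require.NoError`/`require.Equal`):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// parseURL is a hypothetical stand-in returning bucket, glob, and
// file extension, mirroring the shape of ParseURL in the diff.
func parseURL(path string) (string, string, string, error) {
	trimmed := strings.TrimPrefix(path, "gs://")
	if trimmed == path {
		return "", "", "", fmt.Errorf("unsupported scheme in %s", path)
	}
	bucket, glob, found := strings.Cut(trimmed, "/")
	if !found {
		return "", "", "", fmt.Errorf("failed to parse path %s", path)
	}
	return bucket, glob, filepath.Ext(glob), nil
}

func main() {
	// Direct assertions, one case at a time:
	bucket, glob, ext, err := parseURL("gs://my-bucket/data/*.csv")
	if err != nil {
		panic(err) // in the real test: require.NoError(t, err)
	}
	if bucket != "my-bucket" || glob != "data/*.csv" || ext != ".csv" {
		panic("unexpected parse result") // require.Equal(t, ...)
	}
	if _, _, _, err := parseURL("my-bucket/data/*.csv"); err == nil {
		panic("expected error") // require.Error(t, err)
	}
	fmt.Println("ok")
}
```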
I am not aware of the term matrix test, so I'm not sure what it means here. I changed the struct field names for better readability. Anyway, both the actual code and the test code are small and easy to read/understand.
Not sure it's called a matrix test – it's just what popped into my mind for the pattern of looping over a list of test case params. Is your IDE generating these for you?
When there are few cases, I sometimes find them harder to read than direct assertions. But they're very common and this is just a personal preference, so no need to change.
Force-pushed from 20fcd4c to 8317ed2
Looks good, thanks!
* fixed `?` in glob pattern
* initial commit
* loading local file connector
* adding uts
* linting issues
* review comments
* ut fix
* review nits
This PR fixes the following: `?` pattern handling in the S3 connector.