feat(r): Generic datasources #28
base: main
Conversation
...instead of requiring explicit DataSource subclass creation
…provements

Plus some improvements:
- Cleaner .md file reading code in example apps
- Use GPT-4.1 by default, not GPT-4 😬
- Make sqlalchemy required
…-datasource-improvements
fix: No longer need to manually call session$ns() with shinychat (#1…
R generic datasource
Thanks for this, @npelikan! My main feedback is pretty fundamental: this PR is currently using an antipattern called "external polymorphism" to allow both in-memory data frames and databases to be supported; querychat.R contains conditionals everywhere to do one thing if data frame, another thing if database.

The Python package takes an approach that I would duplicate here: make DataSource a generic concept that we implement twice, once for data frames and once for databases. We can do this in R with either S3 or R7 classes. Then the querychat.R code is greatly simplified; it just asks the data source to generate a schema, or to apply a SQL statement, or whatever.

(I would normally write this out in a lot more detail, or offer to walk through this together, but I actually think you're better off asking Claude Code to elaborate--I'm guessing it will do a better job--or even perform the refactoring itself.)

Or alternatively, we could try what you said and have it be all databases, all the time, and we just wrap in-memory data frames with DuckDB pretty early.
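To make the suggestion concrete, here is a minimal sketch of the S3 approach (the constructor and generic names here are hypothetical; the actual code in the PR may differ):

```r
# Two constructors sharing a common "querychat_data_source" parent class
df_source <- function(df, table_name) {
  structure(
    list(df = df, table_name = table_name),
    class = c("dataframe_source", "querychat_data_source")
  )
}

dbi_source <- function(conn, table_name) {
  structure(
    list(conn = conn, table_name = table_name),
    class = c("dbi_source", "querychat_data_source")
  )
}

# One generic; querychat.R only ever calls this, with no conditionals
execute_query <- function(source, query, ...) UseMethod("execute_query")

execute_query.dataframe_source <- function(source, query, ...) {
  # In-memory data frames are queried through an embedded DuckDB connection
  conn <- DBI::dbConnect(duckdb::duckdb())
  on.exit(DBI::dbDisconnect(conn, shutdown = TRUE))
  duckdb::duckdb_register(conn, source$table_name, source$df)
  DBI::dbGetQuery(conn, query)
}

execute_query.dbi_source <- function(source, query, ...) {
  # Database-backed sources run the query on their own connection
  DBI::dbGetQuery(source$conn, query)
}
```

Each additional operation (schema generation, cleanup, etc.) becomes another generic with one method per source type.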
update to use s3 classes to simplify the code
Thanks Joe! Great suggestion -- I've updated the code to use S3 classes and it's definitely much simpler now.
(Changing the target of this to main so it merges cleanly)
I'm really sorry, this is still waiting on me, isn't it? I'm going to book some time for us to discuss in real time if that's cool with you.
#' @param ... Additional arguments passed to methods
#' @return NULL (invisibly)
#' @export
cleanup_source <- function(source, ...) {
Should this be `close`?
pkg-r/R/data_source.R (Outdated)
select_parts <- c(
  select_parts,
  glue::glue_sql("MIN({`col`}) as {`col`}_min", .con = conn),
  glue::glue_sql("MAX({`col`}) as {`col`}_max", .con = conn)
Let's use more underscores (`__min`) to reduce the possibility of collision.
Done! Added this for the py package as well.
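For reference, a sketch of the change against the excerpt above (the merged code may differ):

```r
select_parts <- c(
  select_parts,
  glue::glue_sql("MIN({`col`}) as {`col`}__min", .con = conn),
  glue::glue_sql("MAX({`col`}) as {`col`}__max", .con = conn)
)
```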
pkg-r/R/data_source.R (Outdated)
distinct_count_key <- paste0(col, "_distinct_count")
if (distinct_count_key %in% names(column_stats) && !is.na(column_stats[[distinct_count_key]])) {
  count <- column_stats[[distinct_count_key]]
  cat_info <- glue::glue(" Categorical values: {count} unique values (exceeds threshold of {categorical_threshold})")
Let's not call it "categorical" data in this case; it's just strings. It might or might not be helpful to still say how many unique values. Actually, including the count of unique values would be inaccurate as soon as the data changes. Maybe just leave this line off.
Done!
# TODO: Provide nicer looking errors here

# Check that data_source is a querychat_data_source object
if (!inherits(data_source, "querychat_data_source")) {
If `data_source` is a data frame, let's automatically turn it into the correct `querychat_data_source` for convenience. (Including using the right default table name based on the variable.)
Done!
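A sketch of what this coercion might look like (helper names are hypothetical; `df_source` refers to the constructor sketched earlier):

```r
as_querychat_data_source <- function(x, table_name) {
  if (inherits(x, "querychat_data_source")) {
    x
  } else if (is.data.frame(x)) {
    df_source(x, table_name)
  } else {
    stop("`data_source` must be a data frame or a querychat_data_source object")
  }
}

# The caller's variable name makes a sensible default table name:
querychat_init <- function(data_source,
                           table_name = deparse(substitute(data_source))) {
  data_source <- as_querychat_data_source(data_source, table_name)
  # ... rest of initialization ...
}
```

With this, calling `querychat_init(mtcars)` would wrap the data frame in a source named "mtcars" automatically.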
pkg-r/R/querychat.R (Outdated)
} else {
  DBI::dbGetQuery(conn, current_query())
}
querychat::get_lazy_data(data_source, query = dplyr::sql(current_query()))
Let's continue to have `querychat$df()` be an eager data frame, and add a new property `querychat$tbl()` for the lazy version.
Done!
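A sketch of how the eager/lazy split might be wired up inside the module server (illustrative names; assumes a dbplyr-compatible DBI connection on the data source):

```r
# Lazy: a dbplyr tbl built from the current SQL, chainable with more dplyr verbs
tbl_reactive <- shiny::reactive({
  dplyr::tbl(data_source$conn, dplyr::sql(current_query()))
})

# Eager: collect the lazy tbl into a plain in-memory data frame
df_reactive <- shiny::reactive({
  dplyr::collect(tbl_reactive())
})
```

The module's return value can then expose both, e.g. `list(df = df_reactive, tbl = tbl_reactive, ...)`.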
Alright, I think this is ready to merge @jcheng5
This is a WIP but seems to basically work.
One question -- I retained the basic functionality for local data.frames (that is, df -> querychat -> df), where remote data sources instead return a dbplyr lazy `tbl()`, meant for chaining. Is this too confusing of a behavior split? Should local data.frames also return a `tbl()`, just now connected to duckdb?

A few immediate TODOs: