Skip to content

feat(r): Generic datasources #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 39 commits into
base: main
Choose a base branch
from
Open

Conversation

npelikan
Copy link

@npelikan npelikan commented Jun 7, 2025

This is a WIP but seems to basically work.

One question -- I retained the basic functionality for local data.frames (that is, df -> querychat -> df), where remote data sources instead return a dbplyr lazy tbl(), meant for chaining. Is this too confusing of a behavior split? Should local data.frames also return a tbl(), just now connected to duckdb?

A few immediate TODOs:

  • add more documentation
  • generate some examples
  • create shinytests
  • validate this works on more than just sqlite

jcheng5 and others added 23 commits April 3, 2025 22:47
...instead of requiring explicit DataSource subclass creation
…provements

Plus some improvements:
- Cleaner .md file reading code in example apps
- Use GPT-4.1 by default, not GPT-4 😬
- Make sqlalchemy required
fix: No longer need to manually calls session$ns() with shinychat (#1
@npelikan npelikan marked this pull request as draft June 10, 2025 02:24
@npelikan npelikan marked this pull request as ready for review June 10, 2025 02:25
@schloerke schloerke marked this pull request as draft June 10, 2025 13:42
@schloerke schloerke changed the title DRAFT: R generic datasources feat(r): Generic datasources Jun 10, 2025
@jcheng5
Copy link
Collaborator

jcheng5 commented Jun 17, 2025

Thanks for this, @npelikan! My main feedback is pretty fundamental: this PR is currently using an antipattern called "external polymorphism" to allow both in-memory data frames and databases to be supported; querychat.R contains conditionals everywhere to do one thing if data frame, another thing if database. The Python package takes an approach that I would duplicate here: make DataSource a generic concept that we implement twice, once for data frames and once for databases. We can do this in R with either S3 or R7 classes. Then the querychat.R code is greatly simplified, it just asks the data source to generate a schema, or to apply a SQL statement, or whatever.

(I would normally write this out in a lot more detail, or offer to walk through this together, but I actually think you're better off asking Claude Code to elaborate--I'm guessing it will do a better job--or even perform the refactoring itself.)

Or alternatively, we could try what you said and have it be all databases, all the time, and we just wrap in-memory data frames with DuckDB pretty early.

@npelikan
Copy link
Author

Thanks Joe! Great suggestion -- I've updated the code to use s3 classes and it's definitely much simplified now

@npelikan npelikan marked this pull request as ready for review June 20, 2025 12:03
@npelikan
Copy link
Author

(Changing the target of this to main so it merges cleanly)

@npelikan npelikan changed the base branch from generic-datasource-improvements to main June 25, 2025 22:53
@jcheng5
Copy link
Collaborator

jcheng5 commented Jun 26, 2025

I'm really sorry, this is still waiting on me, isn't it? I'm going to book some time for us to discuss in realtime if that's cool with you.

#' @param ... Additional arguments passed to methods
#' @return NULL (invisibly)
#' @export
cleanup_source <- function(source, ...) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be close?

select_parts <- c(
select_parts,
glue::glue_sql("MIN({`col`}) as {`col`}_min", .con = conn),
glue::glue_sql("MAX({`col`}) as {`col`}_max", .con = conn)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use more underscores (__min) to reduce the possibility of collision.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Added this for the py package as well.

distinct_count_key <- paste0(col, "_distinct_count")
if (distinct_count_key %in% names(column_stats) && !is.na(column_stats[[distinct_count_key]])) {
count <- column_stats[[distinct_count_key]]
cat_info <- glue::glue(" Categorical values: {count} unique values (exceeds threshold of {categorical_threshold})")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not call it "categorical" data in this case, it's just strings. It might or might not be helpful to still say how many unique values. Actually, including the count of unique values would be inaccurate as soon as the data changes. Maybe just leave this line off.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

# TODO: Provide nicer looking errors here

# Check that data_source is a querychat_data_source object
if (!inherits(data_source, "querychat_data_source")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If data_source is a data frame, let's automatically turn it into the correct querychat_data_source for convenience. (Including using the right default table name based on variable)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

} else {
DBI::dbGetQuery(conn, current_query())
}
querychat::get_lazy_data(data_source, query = dplyr::sql(current_query()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue to have querychat$df() be an eager data frame, and add a new property querychat$tbl() for the lazy version.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@npelikan
Copy link
Author

Alright, I think this is ready to merge @jcheng5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants