feat(r): Generic datasources #28

npelikan · 2025-06-07T21:23:12Z

This is a WIP but seems to basically work.

One question -- I retained the basic functionality for local data.frames (that is, df -> querychat -> df), whereas remote data sources return a lazy dbplyr tbl(), meant for chaining. Is this behavior split too confusing? Should local data.frames also return a tbl(), just backed by duckdb?

A few immediate TODOs:

- add more documentation
- generate some examples
- create shinytests
- validate this works on more than just sqlite

jcheng5 · 2025-06-17T23:30:59Z

Thanks for this, @npelikan! My main feedback is pretty fundamental: this PR is currently using an antipattern called "external polymorphism" to support both in-memory data frames and databases; querychat.R contains conditionals everywhere to do one thing if data frame, another thing if database. The Python package takes an approach that I would duplicate here: make DataSource a generic concept that we implement twice, once for data frames and once for databases. We can do this in R with either S3 or R7 classes. Then the querychat.R code is greatly simplified: it just asks the data source to generate a schema, or to apply a SQL statement, or whatever.

(I would normally write this out in a lot more detail, or offer to walk through this together, but I actually think you're better off asking Claude Code to elaborate--I'm guessing it will do a better job--or even perform the refactoring itself.)

Or alternatively, we could try what you said and have it be all databases, all the time, and we just wrap in-memory data frames with DuckDB pretty early.
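
For readers following along, here is a minimal sketch of what the S3 suggestion above could look like. All names (`get_schema`, `execute_query`, `df_source`, `dbi_source`) are hypothetical and chosen for illustration; they are not taken from the PR or from the package's actual API. The point is only that calling code dispatches on the class of the source instead of branching on its type.

```r
# Hypothetical generics; illustrative names, not the package's real API.
get_schema <- function(source, ...) UseMethod("get_schema")
execute_query <- function(source, query, ...) UseMethod("execute_query")

# Constructor for an in-memory data frame source.
df_source <- function(df, table_name) {
  structure(
    list(df = df, table_name = table_name),
    class = c("df_source", "data_source")
  )
}

# Constructor for a DBI-backed source (conn is any DBI connection).
dbi_source <- function(conn, table_name) {
  structure(
    list(conn = conn, table_name = table_name),
    class = c("dbi_source", "data_source")
  )
}

# Data frame method: read column names and classes directly.
get_schema.df_source <- function(source, ...) {
  types <- vapply(source$df, function(col) class(col)[1], character(1))
  paste0(
    "Table: ", source$table_name, "\n",
    paste0("- ", names(types), " (", types, ")", collapse = "\n")
  )
}

# Database method: ask the connection for the field names.
get_schema.dbi_source <- function(source, ...) {
  fields <- DBI::dbListFields(source$conn, source$table_name)
  paste0(
    "Table: ", source$table_name, "\n",
    paste0("- ", fields, collapse = "\n")
  )
}

# Database method: run the SQL as-is. A df_source method is omitted here;
# it would need an embedded SQL engine (for example DuckDB) to run queries.
execute_query.dbi_source <- function(source, query, ...) {
  DBI::dbGetQuery(source$conn, query)
}

# Calling code never checks what kind of source it was given.
src <- df_source(data.frame(x = 1:3, y = c(1.5, 2.5, 3.5)), "demo")
cat(get_schema(src), "\n")
```

With this shape, supporting a new backend means adding methods for a new class rather than adding another branch to every conditional in querychat.R.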

update to use s3 classes to simplify the code

npelikan · 2025-06-19T11:30:33Z

Thanks Joe! Great suggestion -- I've updated the code to use s3 classes and it's definitely much simplified now

npelikan · 2025-06-25T22:52:57Z

(Changing the target of this to main so it merges cleanly)

jcheng5 · 2025-06-26T01:47:04Z

I'm really sorry, this is still waiting on me, isn't it? I'm going to book some time for us to discuss in realtime if that's cool with you.

jcheng5 · 2025-06-26T17:48:50Z

pkg-r/R/data_source.R

+#' @param ... Additional arguments passed to methods
+#' @return NULL (invisibly)
+#' @export
+cleanup_source <- function(source, ...) {


Should this be close?

jcheng5 · 2025-06-26T17:55:03Z

pkg-r/R/data_source.R

+      select_parts <- c(
+        select_parts,
+        glue::glue_sql("MIN({`col`}) as {`col`}_min", .con = conn),
+        glue::glue_sql("MAX({`col`}) as {`col`}_max", .con = conn)


Let's use more underscores (__min) to reduce the possibility of collision.

Done! Added this for the py package as well.
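
To illustrate the collision concern, here is a small base R sketch. The helper `stat_aliases` and the column names are hypothetical; the real code builds its aliases with `glue::glue_sql`, which this sketch does not reproduce.

```r
# Build the MIN/MAX aliases for each column using a given separator.
stat_aliases <- function(cols, sep) {
  c(paste0(cols, sep, "min"), paste0(cols, sep, "max"))
}

# Hypothetical table that happens to contain a column named "price_min".
cols <- c("price", "price_min")

single <- stat_aliases(cols, "_")
double <- stat_aliases(cols, "__")

# With one underscore, the alias for MIN(price) equals an existing column name.
print(intersect(single, cols))

# With two underscores, no alias matches a column name in this table.
print(intersect(double, cols))
```

Double underscores do not rule out collisions entirely (a column could still be named `price__min`), but they make one much less likely.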

jcheng5 · 2025-06-26T18:05:25Z

pkg-r/R/data_source.R

+      distinct_count_key <- paste0(col, "_distinct_count")
+      if (distinct_count_key %in% names(column_stats) && !is.na(column_stats[[distinct_count_key]])) {
+        count <- column_stats[[distinct_count_key]]
+        cat_info <- glue::glue("  Categorical values: {count} unique values (exceeds threshold of {categorical_threshold})")


Let's not call it "categorical" data in this case, it's just strings. It might or might not be helpful to still say how many unique values. Actually, including the count of unique values would be inaccurate as soon as the data changes. Maybe just leave this line off.
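
A rough sketch of the behavior suggested above, assuming a hypothetical helper `describe_text_column` and a made-up `(TEXT)` type label: values are listed only when the number of distinct values is at or below the threshold, and otherwise the extra line is simply left off.

```r
# Hypothetical helper: describe one text column for the schema prompt.
describe_text_column <- function(name, values, categorical_threshold = 10) {
  lines <- paste0("- ", name, " (TEXT)")
  distinct <- unique(values[!is.na(values)])
  if (length(distinct) <= categorical_threshold) {
    # Few enough distinct values: list them so the model can use exact matches.
    quoted <- paste0("'", sort(distinct), "'", collapse = ", ")
    lines <- c(lines, paste0("  Categorical values: ", quoted))
  }
  # Above the threshold nothing is added, so no count can go stale.
  lines
}

print(describe_text_column("size", c("S", "M", "L", "M")))
print(describe_text_column("size", c("S", "M", "L", "M"), categorical_threshold = 2))
```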

jcheng5 · 2025-06-26T18:13:36Z

pkg-r/R/querychat.R

-  # TODO: Provide nicer looking errors here
+
+  # Check that data_source is a querychat_data_source object
+  if (!inherits(data_source, "querychat_data_source")) {


If data_source is a data frame, let's automatically turn it into the correct querychat_data_source for convenience. (Including using the right default table name based on variable)
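
One way this convenience could be sketched in base R is shown below. The function `as_data_source` and the `data_frame_source` class are hypothetical names; only the `querychat_data_source` class name comes from the diff above. The default table name is taken from the expression the caller passed, via `substitute()`.

```r
# Hypothetical coercion helper; not the package's actual implementation.
as_data_source <- function(x, table_name = NULL) {
  # Capture the caller's expression so it can serve as a default table name.
  expr <- substitute(x)

  # Already a data source: pass it through unchanged.
  if (inherits(x, "querychat_data_source")) {
    return(x)
  }

  if (is.data.frame(x)) {
    if (is.null(table_name)) {
      # Only a plain variable name makes a sensible table name.
      if (!is.name(expr)) {
        stop("`table_name` is required when the data frame is not a plain variable")
      }
      table_name <- as.character(expr)
    }
    return(structure(
      list(df = x, table_name = table_name),
      class = c("data_frame_source", "querychat_data_source")
    ))
  }

  stop("`data_source` must be a data frame or a querychat_data_source")
}

src <- as_data_source(mtcars)
print(src$table_name)
```

Note that `substitute()` sees only the expression at the immediate call site, so if this helper were called from inside another function, the name would have to be captured in that outer function and passed down as `table_name`.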

jcheng5 · 2025-06-26T18:19:20Z

pkg-r/R/querychat.R

-      } else {
-        DBI::dbGetQuery(conn, current_query())
-      }
+      querychat::get_lazy_data(data_source, query = dplyr::sql(current_query()))


Let's continue to have querychat$df() be an eager data frame, and add a new property querychat$tbl() for the lazy version.
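
The eager/lazy split could be sketched as follows, assuming the DBI, dplyr, dbplyr, and RSQLite packages are installed. The `result` list and the fixed `current_query()` function are stand-ins for the real reactive objects and are not the package's actual code.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database standing in for a remote source.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "mtcars", mtcars)

# Stand-in for the reactive holding the SQL produced by the chat.
current_query <- function() "SELECT * FROM mtcars WHERE cyl = 4"

result <- list(
  # Lazy: a dbplyr table that can be chained further before any rows are fetched.
  tbl = function() tbl(conn, sql(current_query())),
  # Eager: run the query now and return an ordinary data frame.
  df = function() as.data.frame(collect(tbl(conn, sql(current_query()))))
)

# Further verbs on the lazy table are translated to SQL, not run in R.
lazy <- dplyr::filter(result$tbl(), mpg > 30)
eager <- result$df()

print(class(eager))
print(nrow(eager))
```

Keeping `df()` eager preserves the existing behavior for current users, while `tbl()` gives database users a way to push additional filtering or aggregation down to the database.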

npelikan · 2025-06-27T16:07:18Z

Alright, I think this is ready to merge @jcheng5

jcheng5 and others added 23 commits April 3, 2025 22:47

First attempt at genericizing data source

973a433

Unify prompts by adding chevron Python dependency

8de0ac7

Make prompt aware of what engine is being used

53c7df3

Replace SQLite support with SQLAlchemy support

a2122f2

Don't fail when given table name's case differs from SQLAlchemy Inspe…

a218fb9

…ctor

Forgot import

dc0814e

Have server() return proper class with typed methods, instead of dict

9d95d1d

Auto-create sqlite database for example

aeb87dd

Have init() take data frame or sqlalchemy engine directly

c38b567

...instead of requiring explicit DataSource subclass creation

Merge remote-tracking branch 'origin/main' into generic-datasource-im…

e7972e8

…provements Plus some improvements: - Cleaner .md file reading code in example apps - Use GPT-4.1 by default, not GPT-4 😬 - Make sqlalchemy required

Use GPT-4.1 by default, not GPT-4, yuck

57922b3

Merge remote-tracking branch 'origin/generic-datasource' into generic…

84d30ad

…-datasource-improvements

Update README

a08764b

this should significantly speed up schema generation

374bdfb

another speedup

e294b1b

ruff formatting

b179ea6

updating so formatting checks pass

2cbe199

adding a generic r datasource

8f59aa7

critical change: should return a lazy table rather than executing by …

2ececf5

…default

edits to test suite and devtools::check() passing

f4ca445

Merge pull request #1 from posit-dev/main

c9b03da

fix: No longer need to manually calls session$ns() with shinychat (#1…

example update

48503f0

error message for a footgun

4809615

npelikan marked this pull request as draft June 10, 2025 02:24

npelikan marked this pull request as ready for review June 10, 2025 02:25

schloerke marked this pull request as draft June 10, 2025 13:42

schloerke changed the title ~~DRAFT: R generic datasources~~ feat(r): Generic datasources Jun 10, 2025

npelikan mentioned this pull request Jun 12, 2025

feat(py): generic datasources improvements re-submit #26 from Nick #32

Draft

npelikan force-pushed the main branch from 41f3238 to c1297db Compare June 12, 2025 20:52

Merge branch 'main' into r-generic-datasource

a1ae3b6

Merge pull request #4 from npelikan/r-generic-datasource

24ef182

R generic datasource

npelikan added 2 commits June 19, 2025 12:20

update to use s3 classes to simplify the code

3b289c7

Merge pull request #5 from npelikan/r-generic-datasource

7052d6e

update to use s3 classes to simplify the code

npelikan added 3 commits June 19, 2025 12:37

README update

146777a

added injection of SQL dialect into prompt. Also cleaned up test naming

9911965

more simplification

8d05d7f

npelikan marked this pull request as ready for review June 20, 2025 12:03

npelikan changed the base branch from generic-datasource-improvements to main June 25, 2025 22:53

npelikan added 2 commits June 25, 2025 15:54

Merge branch 'main' into main

b18b570

merge fix

41c9e1e

small dep edit

e347110

jcheng5 reviewed Jun 26, 2025

jcheng5 and others added 6 commits June 26, 2025 11:35

Code review

753c5af

more tests, and code review edits

1ee065b

testing changes

5492b0f

more test passing

1ff4fe5

cleaning up gitignores

eb9104c

updating python datasource to prevent collisions

09231fa
