Regex support for str detect #77

Merged: 17 commits, Oct 18, 2024
Changes from all commits
25 changes: 18 additions & 7 deletions NEWS.md
@@ -1,12 +1,23 @@
# TidierDB.jl updates
## v0.4.2 - 2024-10-
- add support for performing more than two joins using TidierDB queries in a single chain, plus additional tests
- add `dmy`, `mdy`, `ymd` support for DuckDB, Postgres, GBQ, ClickHouse, MySQL, MsSQL, and Athena
- add date related tests
- adds `copy_to` for MsSQL to write dataframe to database
- improve Google Big Query type mapping when collecting to df
## v0.5.0 - 2024-10-15
Breaking Changes:
- All join syntax now matches TidierData's `(table1, table2, t1_col = t2_col)`
Additions:
- `@compute` for DuckDB, MySQL, Postgres, GBQ to write a table to the database at the end of a query
- expands `@create_view` to MySQL, Postgres, GBQ
- Support for performing multiple joins of TidierDB queries in a single chain with further tests
- `dmy`, `mdy`, `ymd` support for DuckDB, Postgres, GBQ, ClickHouse, MySQL, MsSQL, and Athena
- Date related tests
- `copy_to` for MySQL to write a dataframe to a MySQL database
Improvements:
- improve Google Big Query type mapping when collecting to dataframe
- change `gbq()`'s `connect()` to accept `location` as second argument
- `str_detect` now supports regex for all backends except MsSQL, plus additional tests (see the sketch below)
- `@select(!table.name)` now works to deselect a column

Docs:
- Add duckplyr/duckdb reproducible example to docs
- Improve interpolation docs

## v0.4.1 - 2024-10-02
- Adds 50 tests comparing TidierDB to TidierData to ensure accuracy across complex chains of operations, including combinations of `@mutate`, `@summarize`, `@filter`, `@select`, `@group_by`, and `@join` operations.
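
To make the headline change of this PR concrete, here is a minimal sketch of regex matching with `str_detect`; the connection, the table name `dfm`, and the column `groups` are illustrative and assume a local DuckDB database.

```julia
using TidierDB, DataFrames

# Illustrative setup: a small in-memory DuckDB table (table and column names are hypothetical).
db = connect(duckdb())
df = DataFrame(id = ["AA", "AB", "AC", "AD"],
               groups = ["aa", "bb", "aa", "bb"])
copy_to(db, df, "dfm")

# Keep only rows whose `groups` value starts with "a", using a regex literal.
@chain db_table(db, "dfm") begin
    @filter(str_detect(groups, r"^a"))
    @collect
end
```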
5 changes: 3 additions & 2 deletions Project.toml
@@ -1,7 +1,7 @@
name = "TidierDB"
uuid = "86993f9b-bbba-4084-97c5-ee15961ad48b"
authors = ["Daniel Rizk <rizk.daniel.12@gmail.com> and contributors"]
version = "0.4.2"
version = "0.5.0"

[deps]
Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
@@ -37,10 +37,11 @@ SQLiteExt = "SQLite"
[compat]
AWS = "1.9"
Arrow = "2.7"
CSV = "0.10.1"
CSV = "0.10"
Chain = "0.6"
ClickHouse = "0.2"
DataFrames = "1.5"
Dates = "1.9"
Documenter = "0.27, 1"
DuckDB = "1.0"
GZip = "0.6"
81 changes: 81 additions & 0 deletions docs/examples/UserGuide/duckplyr_reprex.jl
@@ -0,0 +1,81 @@
# In this example, we will reproduce a DuckDB and duckplyr blog post example to demonstrate TidierDB's v0.5.0 capability.

# The [example by Hannes](https://duckdb.org/2024/10/09/analyzing-open-government-data-with-duckplyr.html) being reproduced explores open data from the New Zealand government that is roughly 1 GB.

# ## Set up
# First we will set up the local DuckDB database and pull in the metadata for the files. Notice we are not reading this data into memory, only the paths, column names, and table names.
# To follow along, download the data, copy the setup code below, and change the directory to point to your local data.
# ```julia
# import TidierDB as DB
# db = DB.connect(DB.duckdb())

# dir = "/Downloads/nzcensus/"
# data = dir * "Data8277.csv"
# age = dir * "DimenLookupAge8277.csv"
# area = dir * "DimenLookupArea8277.csv"
# ethnic = dir * "DimenLookupEthnic8277.csv"
# sex = dir * "DimenLookupSex8277.csv"
# year = dir * "DimenLookupYear8277.csv"

# data = DB.db_table(db, data);
# age = DB.db_table(db, age);
# area = DB.db_table(db, area);
# ethnic = DB.db_table(db, ethnic);
# sex = DB.db_table(db, sex);
# year = DB.db_table(db, year);
# ```
# ## Exploration
# While this long chain could be broken up into multiple smaller chains, let's reproduce the duckplyr code from the example and demonstrate how TidierDB also supports multiple joins after filtering, mutating, etc. the joining tables. Six different tables are joined together through sequential inner joins.
# ```julia
# @chain DB.t(data) begin
# DB.@filter(str_detect(count, r"^\d+$"))
# DB.@mutate(count_ = "TRY_CAST(count AS INT)")
# DB.@filter(count_ > 0)
# DB.@inner_join(
# (@chain DB.t(age) begin
# DB.@filter(str_detect(Description, r"^\d+ years$"))
# DB.@mutate(age_ = as_integer(str_remove(Code, "years"))) end),
# Age = Code
# )
# DB.@inner_join((@chain DB.t(year) DB.@mutate(year_ = Description)), year = Code)
# DB.@inner_join((@chain DB.t(area) begin
# DB.@mutate(area_ = Description)
# DB.@filter(!str_detect(area_, r"^Total"))
# end)
# , Area = Code)
# DB.@inner_join((@chain DB.t(ethnic) begin
# DB.@mutate(ethnic_ = Description)
# DB.@filter(!str_detect(ethnic_, r"^Total")) end), Ethnic = Code)
# DB.@inner_join((@chain DB.t(sex) begin
# DB.@mutate(sex_ = Description)
# DB.@filter(!str_detect( sex_, r"^Total"))
# end)
# , Sex = Code)
# DB.@inner_join((@chain DB.t(year) DB.@mutate(year_ = Description)), Year = Code)
# @aside DB.@show_query _
# DB.@create_view(joined_up)
# end;

# @chain DB.db_table(db, "joined_up") begin
# DB.@filter begin
# age_ >= 20
# age_ <= 40
# str_detect(area_, r"^Auckland")
# year_ == "2018"
# ethnic_ != "European"
# end
# DB.@group_by sex_
# DB.@summarise(group_count = sum(count_))
# DB.@collect
# end
# ```
# ## Results
# When we collect this to a local dataframe, we can see that the results match the duckplyr/DuckDB example.
# ```
# 2×2 DataFrame
# Row │ sex_ group_count
# │ String Int128
# ─────┼─────────────────────
# 1 │ Female 398556
# 2 │ Male 397326
# ```
165 changes: 87 additions & 78 deletions docs/examples/UserGuide/functions_pass_to_DB.jl
@@ -1,78 +1,87 @@
# How can functions pass arguments to a TidierDB chain?

# In short, you have to use a macro instead, in conjunction with `@interpolate`

# ## Setting up the macro
# To write a macro that will take arguments and pass them to a TidierDB chain, there are 3 steps:
# 1. Write macro with the desired argument(s), and, after the quote, add the chain. Arguments to be changed/interpolated must be prefixed with `!!`
# 2. Use `@interpolate` to make these arguments accessible to the chain. `@interpolate` takes tuples as arguments (one for the `!!` name, and one for the actual content you want the chain to use)
# 3. Run `@interpolate` and then the chain macro sequentially

# ```
# using TidierDB
# db = connect(duckdb())
# path = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
# copy_to(db, path, "mtcars");
#
# # STEP 1
# macro f1(conditions, columns) # The argument names will be the names of the `!!` values
# return quote
# # add chain here
# @chain db_table(db, :mtcars) begin
# @filter(!!conditions > 3)
# @select(!!columns)
# @aside @show_query _
# @collect
# end # ends the chain
# end # ends the quote.
# end # ends the macro
# ```
# ```julia
# # STEP 2
# variable = :gear;
# cols = [:model, :mpg, :gear, :wt];
# @interpolate((conditions, variable), (columns, cols));
# @f1(variable, cols)
# ```
# ```
# 17×4 DataFrame
# Row │ model mpg gear wt
# │ String? Float64? Int32? Float64?
# ─────┼────────────────────────────────────────────
# 1 │ Mazda RX4 21.0 4 2.62
# 2 │ Mazda RX4 Wag 21.0 4 2.875
# 3 │ Datsun 710 22.8 4 2.32
# ⋮ │ ⋮ ⋮ ⋮ ⋮
# 15 │ Ferrari Dino 19.7 5 2.77
# 16 │ Maserati Bora 15.0 5 3.57
# 17 │ Volvo 142E 21.4 4 2.78
# 11 rows omitted
# ```

# Let's say you wanted to filter on a new variable with a different name and select new columns:
# ```julia
# new_condition = :wt;
# new_cols = [:model, :drat]
# @interpolate((conditions, new_condition), (columns, new_cols));
# @f1(new_condition, new_cols)
# ```
# ```
# 20×2 DataFrame
# Row │ model drat
# │ String? Float64?
# ─────┼─────────────────────────────
# 1 │ Hornet 4 Drive 3.08
# 2 │ Hornet Sportabout 3.15
# 3 │ Valiant 2.76
# ⋮ │ ⋮ ⋮
# 18 │ Pontiac Firebird 3.08
# 19 │ Ford Pantera L 4.22
# 20 │ Maserati Bora 3.54
# 14 rows omitted
# ```

# You can also interpolate vectors of strings into a `@filter(col in (values))` as well by using the following syntax `@filter(col in [!!values])`

# In short, the first argument in `@interpolate` must be the name of the macro argument it refers to, and the second argument is what you would like to replace it.

# We recognize this adds friction and that it is not ideal, but given the TidierDB macro expressions/string interplay, this is currently the most graceful and functional option available and hopefully a temporary solution to better interpolation that mirrors TidierData.jl.
# On this page, we'll briefly explore how to use TidierDB macros and `$` with `@eval` to build a function

# For a more in-depth explanation, please check out the TidierData page on interpolation

using TidierDB, DataFrames;

db = connect(duckdb());
df = DataFrame(id = [string('A' + i ÷ 26, 'A' + i % 26) for i in 0:9],
groups = [i % 2 == 0 ? "aa" : "bb" for i in 1:10],
value = repeat(1:5, 2),
percent = 0.1:0.1:1.0);
copy_to(db, df, "dfm");
df_mem = db_table(db, "dfm");

# ## Interpolation
# Variables are interpolated using `@eval` and `$`. Place `@eval` before you begin the chain or call a TidierDB macro.
# Why use `@eval`? In Julia, macros like `@filter` are expanded at parse time, before runtime variables like `vals` are available. By using `@eval`, we force the expression to be evaluated at runtime, allowing us to interpolate the variable into the macro.

num = [3];
column = :id;
@eval @chain t(df_mem) begin
@filter(value in $num)
@select($column)
@collect
end

# ## Function set up
# Begin by defining your function as you normally would, but before `@chain` you need to use `@eval`. Variables to be interpolated need to be prefixed with `$`.
function test(vals, cols)
@eval @chain t(df_mem) begin
@filter(value in $vals)
@select($cols)
@collect
end
end;

vals = [1, 2, 3, 3];
test(vals, [:groups, :value, :percent])

# Now with a new variable
other_vals = [1];
cols = [:value, :percent];
test(other_vals, cols)


# Defining a new function
function gs(groups, aggs, new_name, threshold)
@eval @chain t(df_mem) begin
@group_by($groups)
@summarize($new_name = mean($aggs))
@filter($new_name > $threshold)
@collect
end
end;

gs(:groups, :percent, :mean_percent, .5)

# Change the column and threshold
gs(:groups, :value, :mean_value, 2)


# ## Write pipeline function to use inside of chains
# Let's say there is a particular sequence of macros that you want to reuse repeatedly. Wrap the series into a function that accepts a `t(query)` as its first argument and returns a `SQLquery`, and you can easily reuse it.
function moving_aggs(table, start, stop, group, order, col)
qry = @eval @chain $table begin
@group_by $group
@window_frame $start $stop
@window_order $order
@mutate(across($col, (minimum, maximum, mean)))
end
return qry
end;

@chain t(df_mem) begin
moving_aggs(-2, 1, :groups, :percent, :value)
@filter value_mean > 2.75
@aside @show_query _
@collect
end

# Filtering before the window functions
@chain t(df_mem) begin
@filter(value >= 2)
moving_aggs(-1, 1, :groups, :percent, :value)
@aside @show_query _
@collect
end
6 changes: 2 additions & 4 deletions docs/examples/UserGuide/getting_started.jl
@@ -61,13 +61,11 @@
# Compute costs are relevant to backends such as AWS, Databricks, and Snowflake.

# To do this, save the results of `db_table` and use them with `t`. Using `t` pulls the relevant information (metadata, con, etc.) from the mutable SQLquery struct, allowing you to repeatedly query and collect the table without re-querying for the metadata each time.
# > !Tip:
# > `t()` is an alias for `from_query` This means after saving the results of `db_table`, use `t(table)` to refer to the table or prior query
# ```julia
# table = DB.db_table(con, "path")
# @chain DB.t(table) begin
# ## data wrangling here
# end
# ```
# ---
# Tip: `t()` is an alias for `from_query`.
# This means that after saving the results of `db_table`, you can use `t(table)` to refer to the table or to a prior query.
# ---
51 changes: 2 additions & 49 deletions docs/examples/UserGuide/key_differences.jl
@@ -44,23 +44,6 @@ copy_to(db, df, "df_mem"); # copying over the data frame to an in-memory database
@collect
end

# ## Joining

# There is one key difference for joining:

# The column on both the new and old table must be specified. They do not need to be the same, and given SQL behavior where both columns are kept when joining two tables, it is preferable if they have different names. This avoids "ambiguous reference" errors that would otherwise come up and complicate the use of tidy selection for columns.
# If the table being newly joined exists on a database, it must be written as a string or Symbol. If it is an existing query, it must be wrapped with `t(query)`. Visit the docstrings for more examples.
df2 = DataFrame(id2 = ["AA", "AC", "AE", "AG", "AI", "AK", "AM"],
category = ["X", "Y", "X", "Y", "X", "Y", "X"],
score = [88, 92, 77, 83, 95, 68, 74]);

copy_to(db, df2, "df_join");

@chain db_table(db, :df_mem) begin
@left_join("df_join", id2, id)
@collect
end

# ## Differences in `case_when()`

# In TidierDB, after the clause is completed, the result for the new column is separated by a comma `,`
@@ -73,35 +56,5 @@
@collect
end
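
# As a quick, hedged sketch of this comma-separated syntax (the `value`
# column comes from the `df_mem` table above; the cutoffs and labels are
# purely illustrative):
@chain db_table(db, :df_mem) begin
    @mutate(value_label = case_when(value > 3, "high",
                                    value > 1, "medium",
                                    "low"))
    @collect
end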

# ## Interpolation

# To use !! Interpolation, instead of being able to define the alternate names/value in the global context, the user has to use `@interpolate`. This will hopefully be fixed in future versions. Otherwise, the behavior is generally the same, although this creates friction around calling functions.

# Also, when using interpolation with exponents, the interpolated value must go inside of parentheses.
# ```julia
# @interpolate((test, :percent)); # this still supports strings, vectors of names, and values

# @chain db_table(db, :df_mem) begin
# @mutate(new_col = case_when((!!test)^2 > .5, "Pass",
# (!!test)^2 < .5, "Try Again",
# "middle"))
# @collect
# end
# ```
# ```
# 10×5 DataFrame
# Row │ id groups value percent new_col
# │ String? String? Int64? Float64? String?
# ─────┼───────────────────────────────────────────────
# 1 │ AA bb 1 0.1 Try Again
# 2 │ AB aa 2 0.2 Try Again
# 3 │ AC bb 3 0.3 Try Again
# ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
# 8 │ AH aa 3 0.8 Pass
# 9 │ AI bb 4 0.9 Pass
# 10 │ AJ aa 5 1.0 Pass
# 4 rows omitted
# ```
# ## Slicing ties

# `@slice_min()` and `@slice_max()` will always return ties due to SQL behavior.
# ## Joining Tables
# When joining tables, the join column from both tables will be present in the result, in contrast to TidierData, which keeps only one of them.
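
# A hedged sketch of this behavior (the `df_join` table and the `id = id2`
# key mapping are illustrative and assume the v0.5.0 `t1_col = t2_col`
# join syntax):
df2 = DataFrame(id2 = ["AA", "AC", "AE"], score = [88, 92, 77]);
copy_to(db, df2, "df_join");

@chain db_table(db, :df_mem) begin
    @left_join("df_join", id = id2)   # both `id` and `id2` appear in the output
    @collect
end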
3 changes: 2 additions & 1 deletion docs/mkdocs.yml
@@ -124,8 +124,9 @@ nav:
- "Using Snowflake" : "examples/generated/UserGuide/Snowflake.md"
- "Using Databricks" : "examples/generated/UserGuide/databricks.md"
- "Joining Tables" : "examples/generated/UserGuide/ex_joining.md"
- "Writing Functions/Macros with TidierDB Chains" : "examples/generated/UserGuide/functions_pass_to_DB.md"
- "Writing Functions with TidierDB Chains" : "examples/generated/UserGuide/functions_pass_to_DB.md"
- "Working With Larger than RAM Datasets" : "examples/generated/UserGuide/outofmemex.md"
- "TidierDB.jl vs Ibis" : "examples/generated/UserGuide/ibis_comp.md"
- "Reproduce a duckplyr example" : "examples/generated/UserGuide/duckplyr_reprex.md"
- "Flexible Syntax and UDFs" : "examples/generated/UserGuide/udfs_ex.md"
- "Reference" : "reference.md"