Skip to content
This repository has been archived by the owner on May 24, 2022. It is now read-only.

Groupby + Lag #7

Open
oxinabox opened this issue Mar 9, 2020 · 0 comments
Open

Groupby + Lag #7

oxinabox opened this issue Mar 9, 2020 · 0 comments
Labels
Time-Related Relating to Times, (including intervals)

Comments

@oxinabox
Copy link
Member

oxinabox commented Mar 9, 2020

(For my reference this was feature number 11 on initial list)

julia> using DataFrames

julia> dayahead = DataFrame(time=[1,1,1,2,2,2,3,3,3], node = string.(repeat('a':'c', 3)), load=rand(9))
9×3 DataFrame
│ Row │ time  │ node   │ load     │
│     │ Int64 │ String │ Float64   │
├─────┼───────┼────────┼───────────┤
│ 11     │ a      │ 0.644285  │
│ 21     │ b      │ 0.0710845 │
│ 31     │ c      │ 0.451687  │
│ 42     │ a      │ 0.0336323 │
│ 52     │ b      │ 0.835294  │
│ 62     │ c      │ 0.535693  │
│ 73     │ a      │ 0.614499  │
│ 83     │ b      │ 0.472906  │
│ 93     │ c      │ 0.597165  │

julia> groups = groupby(dayahead, :node)
GroupedDataFrame with 3 groups based on key: node
First Group (3 rows): node = "a"
│ Row │ time  │ node   │ load     │
│     │ Int64 │ String │ Float64   │
├─────┼───────┼────────┼───────────┤
│ 11     │ a      │ 0.644285  │
│ 22     │ a      │ 0.0336323 │
│ 33     │ a      │ 0.614499
Last Group (3 rows): node = "c"
│ Row │ time  │ node   │ load    │
│     │ Int64 │ String │ Float64  │
├─────┼───────┼────────┼──────────┤
│ 11     │ c      │ 0.451687 │
│ 22     │ c      │ 0.535693 │
│ 33     │ c      │ 0.597165 │

julia> mapreduce(vcat, groups) do grp
       lagged_grp = DataFrame(time = grp.time .+ 1, load_lag1 = grp.load)
       join(DataFrame(grp), lagged_grp; on=:time, kind=:left)
       end
9×4 DataFrame
│ Row │ time  │ node   │ load     │ load_lag1 │
│     │ Int64 │ String │ Float64   │ Float64⍰   │
├─────┼───────┼────────┼───────────┼────────────┤
│ 11     │ a      │ 0.644285missing    │
│ 22     │ a      │ 0.03363230.644285   │
│ 33     │ a      │ 0.6144990.0336323  │
│ 41     │ b      │ 0.0710845missing    │
│ 52     │ b      │ 0.8352940.0710845  │
│ 63     │ b      │ 0.4729060.835294   │
│ 71     │ c      │ 0.451687missing    │
│ 82     │ c      │ 0.5356930.451687   │
│ 93     │ c      │ 0.5971650.535693

@cbdavis proposes alternative:

Ugly one-liner:

using ShiftedArrays
hcat(dayahead, select(by(dayahead, :node, load_lag1 = :load => lag), Not(:node)))

The key part is this: by(dayahead, :node, load_lag1 = :load => lag) but I haven't been able to figure out a way to insert additional columns into that statement. The Not(:node) part is included since it is assumed that the order of the results are the same as the original dataframe, and we can't just join the results on the node column since the time column is relevant too, but not outputted by the by statement.

generic function:

function lag_df(df::DataFrame, groupcol::Symbol, valuecol::Symbol)
  return hcat(df, select(by(df, groupcol, lag1 = valuecol => lag), Not(groupcol)))
end

lag_df(dayahead, :node, :load)
@nickrobinson251 nickrobinson251 added the Time-Related Relating to Times, (including intervals) label Aug 29, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Time-Related Relating to Times, (including intervals)
Projects
None yet
Development

No branches or pull requests

2 participants