Question: Lead and lag operators? #791

sglyon · 2015-03-27T03:06:08Z

I do a lot of work with pandas and really appreciate the shift method on a DataFrame. It simply applies a lead or lag (depending on the sign of the argument) to the DataFrame, retaining the index. A simple working example:

In [345]: df = pd.DataFrame(np.arange(12).reshape(3,4))

In [346]: df
Out[346]:
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [348]: df.shift(1)
Out[348]:
    0   1   2   3
0 NaN NaN NaN NaN
1   0   1   2   3
2   4   5   6   7

In [349]: df.shift(-1)
Out[349]:
    0   1   2   3
0   4   5   6   7
1   8   9  10  11
2 NaN NaN NaN NaN

Notice how shift(1) moves the row originally indexed by 0 to the 1 index an fills the new index with missing data. The shift happens in the other direction with a negative argument.

I know this is achievable in part by doing things like df[1:end-1, :] and df[2:end, :] for lead and lag respectively, but the killer thing about the python routines is that it applies the shift relative to the original index. This makes things like arithmetic or building autoregressive models dead simple because the row index makes sure data is always aligned. Things like constructing the model for x_t = a + b + x_{t-1} + c y_t + d z_t + eps_t becomes very easy:

df["lx"]  = df["x"].shift(1)
mod = "x ~ lx + y + z"

and all data alignment happens automagically!

I know that the core developers of this package have chosen not to have an intrinsic bind between data and a row index (meaning the index is actually considered part of the data and this bind won't be broken unless the user explicitly indicates that it should) so this exact same functionality may not be quite as straightforward. That being said, I thought I'd ask if anyone was aware of a workflow or set of methods that would allow me to easily achieve the same functionality.

The text was updated successfully, but these errors were encountered:

tshort · 2015-03-27T11:30:46Z

One way to do this is to define a new AbstractDataVector (or AbstractNullableVector when it's ready) that is a view into a column vector. The view would automatically account for the shift when indexing. It shouldn't be that hard to write. You just need a few getindex methods to make the shifting work. Then, you don't need to do anything in the DataFrames package. This new type could go into a standalone package or into the DataArrays package (or its successor).

johnmyleswhite · 2015-03-28T01:11:51Z

This is a rough one since I don't see a way to do this for generic databases, since it's based on assumptions about the ordering of rows.

I suspect we ultimately need to fork DataFrames into a library that's useful to people who work with RDBMS and a second library that's useful to people who work with time series data.

johansigfrids · 2015-03-28T11:55:45Z

Does not TimeSeries.jl already provide a separate tabular structure for working with time series data?

tshort · 2015-03-28T12:55:52Z

Yes, but it's different; a TimeSeries is a matrix representation like a zoo object in R. DataFrames are also useful for time series data, especially for irregular data or for cases where time data is mixed with other attributes.

nalimilan · 2015-03-28T13:55:50Z

DataFrames already support sort!, which wouldn't work for RDBMSs. I don't see that as a problem: it can simply raise an error when the chosen backend does not support the operation. Same for lead/lag.

Anyway, I think people who work with time series data may well need a RDBMS too. I taught SAS with SQL to deal with stock prices data, which can easily become large enough that storing everything in memory is not a good idea. And it can be made to work, since SQL supports LEAD and LAG in queries. It just needs an ORDER BY statement to give a meaning to "next row" and "previous row". Maybe a syntax can be found in DataFramesMeta.jl, which would work for both in-memory and database-backed DataFrames?

johnmyleswhite · 2015-03-28T15:43:49Z

It would be totally reasonable to make "inherently ordered" into a trait and have things fail for DataFrame-like objects (like databases) that don't implement that trait. My main worry there is that it requires a lot of discipline to not get lazy and assume ordering when it's not really necessary, since you then write code that's much less portable than possible. I think using sort! is a pretty good example of that: I don't think a data store should be assumed to exist in a permanently sorted state since so many operations don't depend upon sorting and those that do should make their sorting explicit.

I don't think standard SQL supports LEAD and LAG, although it's true that you could make it work by hand using an ORDER BY.

My broader concern is that I'd like us to make sure we think through which operations require which assumptions to make work. Pandas makes a lot of assumptions you can't make about a system like Hive: that gives it a lot of extra power, but at the expense of generality (and also applicability to the kinds of work I do). Ideally, we'd find a way to expose both sets of operations, but I can imagine that some of the behavior you'd want from Pandas would require a change in architecture to match the assumptions that Pandas makes.

nalimilan · 2015-03-28T17:20:59Z

LEAD and LAG appear to have been introduced by ANSI SQL 99 as part of "analytic functions"/window functions. They are supported by several major implementations (among which apparently CUBRID, DB2, Oracle, PostgreSQL, Microsoft SQL Server, and Sybase SQL Anywhere -- there are convoluted workarounds for e.g. MySQL). Hive supports them since 2013: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Not supporting this would make Julia lag (pun intended) behind other programs.

The way dplyr supports this is interesting: http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html Basically, it translates many commands that are valid R code into SQL statements. Its lead and lag functions take an order_by argument. Also, the arrange function defines an order which is automatically passed as ORDER BY in subsequent SQL statements when not overridden. This is a very clever strategy, which effectively makes the table behave as if it was physically sorted (which might indeed be the case if the data frame is stored in memory). Generally, dplyr is a great source of inspiration.

sglyon · 2015-03-29T01:12:32Z

@johansigfrids TimeSeries.jl does do this, but I prefer working with DataFrames. Also, the behavior of TimeSeries with lead and lag is simply to drop observations completely from the object (so lag(ta, 1) with have 1 less row than the original ta) rather than mark them as missing. This puts the burden of data alignment back on the user, especially when using the data to construct statistical models in formulas. At that point I think I'd be better off using a DataFrame from the start.

@johnmyleswhite I like the goal of DataFrames being a general front-end into many data RDBMSs , but I also like the quick and powerful functionality I am used to with pandas. I realize that much of this comes as a result of pandas making some assumptions for me, but most of the time I'm ok with that. Do you think there is a way we can support both "modes" of operation -- one fully flexible making little or no assumptions for you and the other making some pandas-esque assumptions to offer more out of the box flexibility?

dplyr is great. I have been happy to watch DataFramesMeta implement many of the dplyr ideas.

sglyon · 2015-03-29T01:14:46Z

Other comment about TimeSeries.jl -- often my "time-series" data is just data I have simulated. There are no real-world dates or times associated with any of my data so I don't want to go through the extra work to create real date/datetime indices for them.

I'm looking for a way to get classic time-series functions (like moving windows, lead/lag), without the hassle of making up dates for my data.

garborg · 2015-03-30T15:35:25Z

@nalimilan's suggestion of dplyr-/DataFramesMeta-like translation of Base Julia syntax to the backend language sounds right to me.

Overall, I've slowly been coming over to @johnmyleswhite's side -- I think that the core DataFrames API should align with RDBMS ops, and these special operations that we find convenient on in-memory tables should more explicitly belong to abstract subtypes of AbstractDataFrame.

milktrader · 2015-04-11T18:45:35Z

@spencerlyon2 once we move to v0.4, TimeSeries.jl will likely mark as missing the observations consumed by a computation such as lag or lead and use the Nullable type as a sentinel. JuliaStats/TimeSeries.jl#179

nalimilan · 2018-01-29T21:12:49Z

See also https://github.com/piever/ShiftedArrays.jl#shifting-the-data.

bkamins · 2020-07-31T07:32:32Z

The recommended way to do it now is:

julia> using DataFrames, ShiftedArrays

julia> df = DataFrame(reshape(1:20, 5,4))
5×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 6     │ 11    │ 16    │
│ 2   │ 2     │ 7     │ 12    │ 17    │
│ 3   │ 3     │ 8     │ 13    │ 18    │
│ 4   │ 4     │ 9     │ 14    │ 19    │
│ 5   │ 5     │ 10    │ 15    │ 20    │

julia> mapcols(x -> lag(x, -1), df)
5×4 DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │
│     │ Int64?  │ Int64?  │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 2       │ 7       │ 12      │ 17      │
│ 2   │ 3       │ 8       │ 13      │ 18      │
│ 3   │ 4       │ 9       │ 14      │ 19      │
│ 4   │ 5       │ 10      │ 15      │ 20      │
│ 5   │ missing │ missing │ missing │ missing │

julia> mapcols(lag, df)
5×4 DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │
│     │ Int64?  │ Int64?  │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ missing │ missing │ missing │ missing │
│ 2   │ 1       │ 6       │ 11      │ 16      │
│ 3   │ 2       │ 7       │ 12      │ 17      │
│ 4   │ 3       │ 8       │ 13      │ 18      │
│ 5   │ 4       │ 9       │ 14      │ 19      │

I think it is easy enough so I am closing this issue.

If you feel otherwise please reopen.

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020

bkamins closed this as completed Jul 31, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: Lead and lag operators? #791

Question: Lead and lag operators? #791

sglyon commented Mar 27, 2015

tshort commented Mar 27, 2015

johnmyleswhite commented Mar 28, 2015

johansigfrids commented Mar 28, 2015

tshort commented Mar 28, 2015

nalimilan commented Mar 28, 2015

johnmyleswhite commented Mar 28, 2015

nalimilan commented Mar 28, 2015

sglyon commented Mar 29, 2015

sglyon commented Mar 29, 2015

garborg commented Mar 30, 2015

milktrader commented Apr 11, 2015

nalimilan commented Jan 29, 2018

bkamins commented Jul 31, 2020

Question: Lead and lag operators? #791

Question: Lead and lag operators? #791

Comments

sglyon commented Mar 27, 2015

tshort commented Mar 27, 2015

johnmyleswhite commented Mar 28, 2015

johansigfrids commented Mar 28, 2015

tshort commented Mar 28, 2015

nalimilan commented Mar 28, 2015

johnmyleswhite commented Mar 28, 2015

nalimilan commented Mar 28, 2015

sglyon commented Mar 29, 2015

sglyon commented Mar 29, 2015

garborg commented Mar 30, 2015

milktrader commented Apr 11, 2015

nalimilan commented Jan 29, 2018

bkamins commented Jul 31, 2020