Question: Lead and lag operators? #791
Comments
One way to do this is to define a new AbstractDataVector (or AbstractNullableVector when it's ready) that is a view into a column vector. The view would automatically account for the shift when indexing. It shouldn't be that hard to write. You just need a few |
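For concreteness, here is a minimal sketch of the "shifted view" idea, written in plain Python rather than Julia. The class name `ShiftedView` and its `fill` parameter are hypothetical and not part of DataFrames.jl or pandas. The point is only that the view stores an offset and applies it at indexing time, so no data is copied.

```python
from collections.abc import Sequence


class ShiftedView(Sequence):
    """Read-only lagged/led view over a column (hypothetical illustration)."""

    def __init__(self, parent, shift, fill=None):
        self.parent = parent  # underlying column, never copied
        self.shift = shift    # shift > 0 lags, shift < 0 leads
        self.fill = fill      # returned where the shift runs off either end

    def __len__(self):
        return len(self.parent)

    def __getitem__(self, i):
        if isinstance(i, slice):
            return [self[k] for k in range(*i.indices(len(self)))]
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("index out of range")
        j = i - self.shift  # position in the parent that lands on row i
        return self.parent[j] if 0 <= j < len(self.parent) else self.fill


x = [10, 20, 30, 40]
lagged = ShiftedView(x, 1)   # row i sees x[i - 1]
led = ShiftedView(x, -1)     # row i sees x[i + 1]
```

Because the view only holds a reference, later changes to `x` show up in `lagged` and `led` as well.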
This is a rough one: I don't see a way to do this for generic databases, because it relies on assumptions about the ordering of rows. I suspect we ultimately need to fork DataFrames into a library that's useful to people who work with RDBMS and a second library that's useful to people who work with time series data. |
Does not TimeSeries.jl already provide a separate tabular structure for working with time series data? |
Yes, but it's different; a TimeSeries is a matrix representation like a zoo object in R. DataFrames are also useful for time series data, especially for irregular data or for cases where time data is mixed with other attributes. |
Anyway, I think people who work with time series data may well need an RDBMS too. I taught SAS with SQL to deal with stock price data, which can easily become large enough that storing everything in memory is not a good idea. And it can be made to work, since SQL supports |
It would be totally reasonable to make "inherently ordered" into a trait and have things fail for DataFrame-like objects (like databases) that don't implement that trait. My main worry there is that it requires a lot of discipline to not get lazy and assume ordering when it's not really necessary, since you then write code that's much less portable than possible. I think using I don't think standard SQL supports My broader concern is that I'd like us to make sure we think through which operations require which assumptions to make work. Pandas makes a lot of assumptions you can't make about a system like Hive: that gives it a lot of extra power, but at the expense of generality (and also applicability to the kinds of work I do). Ideally, we'd find a way to expose both sets of operations, but I can imagine that some of the behavior you'd want from Pandas would require a change in architecture to match the assumptions that Pandas makes. |
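One way to picture the "inherently ordered" trait is sketched below in plain Python. All names here (`Table`, `OrderedTable`, the `ordered` flag, and the `lag` helper) are hypothetical and only illustrate the idea that order-dependent operations fail loudly on table-like objects that do not opt in.

```python
class Table:
    """Minimal column store; row order carries no meaning by default."""

    ordered = False  # trait flag: is the row order meaningful?

    def __init__(self, **columns):
        self.columns = {name: list(vals) for name, vals in columns.items()}


class OrderedTable(Table):
    """A table that opts in to the 'inherently ordered' trait."""

    ordered = True


def lag(table, column, n=1):
    """Return `column` lagged by n >= 0 rows, padding the start with None."""
    if not table.ordered:
        raise TypeError("lag requires a table with a meaningful row order")
    if n < 0:
        raise ValueError("n must be non-negative")
    vals = table.columns[column]
    k = min(n, len(vals))
    return [None] * k + vals[: len(vals) - k]
```

Operations that do not need ordering would simply never check the flag, which keeps them portable to database-like backends.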
The way dplyr supports this is interesting: http://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html Basically, it translates many commands that are valid R code into SQL statements. Its |
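To make the window-function route concrete, here is a small sketch using Python's built-in `sqlite3` module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions), and the table and column names are made up for illustration. It is not claimed to be the SQL that dplyr generates. Note that the ordering is stated explicitly in the `OVER` clause instead of being assumed from row position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (t INTEGER, x REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(1, 10.0), (2, 11.5), (3, 9.0)],
)

# LAG(x, 1) returns the previous row's x under the ordering given in OVER;
# the first row has no predecessor and gets NULL (None in Python).
rows = conn.execute(
    "SELECT t, x, LAG(x, 1) OVER (ORDER BY t) AS x_lag "
    "FROM prices ORDER BY t"
).fetchall()
conn.close()
```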
@johansigfrids TimeSeries.jl does do this, but I prefer working with DataFrames. Also, the behavior of TimeSeries with lead and lag is simply to drop observations completely from the object (so

@johnmyleswhite I like the goal of DataFrames being a general front-end to many RDBMSs, but I also like the quick and powerful functionality I am used to with pandas. I realize that much of this comes as a result of pandas making some assumptions for me, but most of the time I'm OK with that. Do you think there is a way we can support both "modes" of operation: one fully flexible, making few or no assumptions for you, and the other making some pandas-esque assumptions to offer more out-of-the-box flexibility?

dplyr is great. I have been happy to watch DataFramesMeta implement many of the dplyr ideas. |
Other comment about TimeSeries.jl -- often my "time-series" data is just data I have simulated. There are no real-world dates or times associated with any of my data so I don't want to go through the extra work to create real date/datetime indices for them. I'm looking for a way to get classic time-series functions (like moving windows, lead/lag), without the hassle of making up dates for my data. |
@nalimilan's suggestion of dplyr-/DataFramesMeta-like translation of Base Julia syntax to the backend language sounds right to me. Overall, I've slowly been coming over to @johnmyleswhite's side -- I think that the core DataFrames API should align with RDBMS ops, and these special operations that we find convenient on in-memory tables should more explicitly belong to abstract subtypes of AbstractDataFrame. |
@spencerlyon2 once we move to |
The recommended way to do it now is:
I think it is easy enough so I am closing this issue. If you feel otherwise please reopen. |
I do a lot of work with pandas and really appreciate the `shift` method on a `DataFrame`. It simply applies a lead or lag (depending on the sign of the argument) to the DataFrame, retaining the index. A simple working example:

Notice how `shift(1)` moves the row originally indexed by `0` to the `1` index and fills the new index with missing data. The shift happens in the other direction with a negative argument.

I know this is achievable in part by doing things like
`df[1:end-1, :]` and `df[2:end, :]` for lead and lag respectively, but the killer thing about the Python routines is that they apply the shift relative to the original index. This makes things like arithmetic or building autoregressive models dead simple, because the row index makes sure the data is always aligned. Constructing the model for `x_t = a + b x_{t-1} + c y_t + d z_t + eps_t` becomes very easy:

and all data alignment happens automagically!
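The alignment behaviour described above can be sketched in plain Python without pandas. The `shift` helper below is a hypothetical stand-in that mimics the positional semantics (same length, missing values filled in at the vacated end), and the numbers are made up. It shows how lagging `x` while keeping the original length lets each row line up as `(x_t, x_{t-1}, y_t, z_t)`.

```python
def shift(values, n, fill=None):
    """Shift positionally and keep the length: n > 0 lags, n < 0 leads."""
    size = len(values)
    if n >= 0:
        k = min(n, size)
        return [fill] * k + list(values[: size - k])
    k = min(-n, size)
    return list(values[k:]) + [fill] * k


# Made-up series, all indexed by the same positions t = 0..3.
x = [1.0, 2.0, 4.0, 8.0]
y = [0.5, 0.6, 0.7, 0.8]
z = [3.0, 2.0, 1.0, 0.0]

x_lag = shift(x, 1)    # x_{t-1}, aligned with x_t
x_lead = shift(x, -1)  # x_{t+1}, aligned with x_t

# One regression row per t: (x_t, x_{t-1}, y_t, z_t).
# Rows where the lag is missing are dropped before fitting.
rows = [
    (xt, xl, yt, zt)
    for xt, xl, yt, zt in zip(x, x_lag, y, z)
    if xl is not None
]
```

Slicing with `df[1:end-1, :]` or `df[2:end, :]` produces the same values but shortens the column, so the caller has to trim every other column by hand to keep rows aligned.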
I know that the core developers of this package have chosen not to have an intrinsic bind between data and a row index (meaning the index is actually considered part of the data and this bind won't be broken unless the user explicitly indicates that it should) so this exact same functionality may not be quite as straightforward. That being said, I thought I'd ask if anyone was aware of a workflow or set of methods that would allow me to easily achieve the same functionality.