Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add expression evaluation functionality via eval #4162

Merged
merged 0 commits into from
Jul 8, 2013
Merged

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Jul 8, 2013

closes #3393.

cc @jreback

eval Roadmap

PyTables

allows natural syntax for queries, with in-line variables allowed

some examples
"ts>=Timestamp('2012-02-01') & users=['a','b','c']"

['A > 0']

dates=pd.date_range('20130101',periods=5)
'index>dates'

Todos:

  • update docs / need new examples
  • API changes (disallow dict, show deprecations for separated expressions)
  • tests for | and ~ operators, and invalid filter expressions
  • code clean up, maybe create new Expr base class to opt in/out of allowed operations
    (e.g. Numexpr doesn't want to allow certain operations, but PyTables needs them,
    so maybe define in the base class for simplicity, and just not allow them (checked in visit),
    which right now doesn't allow a generic_visitor)
  • can ops.Value be equiv of ops.Constant?

Documentation

  • enhancing performance section
  • eval docstring
  • Warn users that this is not a magic bullet
    • E.g., x = 3; y = 2; pd.eval('x // y', engine='python') is 1000 times slower than the same operation in actual Python.

Engines

  • python engine for testing (basically a restricted eval, should make sure nothing dangerous a la eval('os.system("sudo rm -rf /")')) is possible here
  • numexpr engine

Should python be the default engine for small frames?

Functions

  • toplevel isexpr
  • move eval to top-level pandas

Error Handling

  • syntax error handling
  • name error handling

Alignment

  • Algorithm
    1. flatten the tree of terms
    2. resolve the variables, special casing only numpy arrays/scalars/single unary operations (probably room for more abstraction here)
    3. align by joining on the indices using the current broadcasting behavior
    4. update the original namespace with the newly aligned version of the underlying numpy array
    5. reconstruct the final object with the aforementioned joined indices and type
  • punt to 'python' engine when a Series and DataFrame both have DatetimeIndexes (I would rather not do this and just enforce the to-be-deprecated index-aligning-if-both-indices-are-datetimeindex behavior)
  • PeriodIndexes don't work (well, they won't pass my tests) because of a stack overflow bug when joining a DatetimeIndex and a PeriodIndex (that bug is slated for 0.13, but a design decision needs to be made)
  • alignment for NDFrame (is this really necessary?)
  • unary alignment
  • binary alignment

Scope

  • deal with global scoping issues (not sure if this is tested well enough...which means probably not)
  • allow Expr objects to be passed to eval (Expr will pull in the local and global variables from the calling stack frame)

Miscellaneous

  • replace current evaluate function
  • Expr in NDFrame.__getitem__ cc @jreback
  • reconstruct the final object to return the expected output type, e.g., df + df2 should return a DataFrame if df and df2 are both DataFrame objects.
  • change Expression to Expr
  • add truediv keyword to Expr, default to True

Operators

  • %
  • **
  • unary ops don't play well when they are operands of binary ops

Tests

  • DatetimeIndex and PeriodIndex objects are untested because of a few join and alignment issues.
  • rewrite a few tests in the python evaluator to make sure the // operator works as it should (it works, but there may be edge cases that I haven't though of yet)
  • // numexpr only supports scalar evaluation here: variables are not allowed so this will always eval in python space (half-assed it a bit in the tests here need to fix that)
  • truediv keyword tests
  • add tests for nans
  • fix ** associativity issue in testing

Near Future

  • allow and, or and not aliases for the operators &, | and ~, respectively
  • attribute accessing e.g., df.A syntax
  • reductions (sum and prod)
  • math ops (sin, cos, tan, sqrt, etc.)

Far Future

  • non-python indexing (or just punt to python)

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

excellent. can i rebase and push ... not sure if that's going to cause a bunch of merge conflicts when you pull down...?

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

go ahead and try.....

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

looks ok to me

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

actually it interleaved our commits (which is actually correct)

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

u could pull down without any pain?

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

yeah i know that was nice

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

and travis working on it.....and i think you rebased off of current master...as using thew travis scheme :)

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

i did git rebase --interactive upstream/master; i do that so i don't have to remember where the merge parent is by SHA1

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

well actually grbi upstream/master cuz who wants to type out that thing all the time?!

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

this (hard case) will be needed i think....sorry. i guess there's really no way to recover gracefully if you've squashed or fixed up things. won't do anymore....maybe we should just wait until we're finished to do that...

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

I think we can rebase our own commits without too much problems?

not a big deal to rebase at the end (or as long as we coordinate)

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

hmm.....actuall my history is now different that the main branch, I don't see the reorderd commits.....?? (mine are in the original order)

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

did u get a big merge commit when you pulled? that's the issue...i rewrote history so i think then git finds the nearest parent and tries to merge since it can't know that i deleted N commits in between ... probably least confusing to just wait until the end to squash

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

hm you're right i see the same thing my original order

* e039ce5 - (HEAD, upstream/eval-3393, eval-3393) COMPAT: allow prior 0.12 query syntax for terms, e.g. Term('index','>',5) (and show deprecation warning) (17 minutes ago) <jreback>
* 1029dc1 - BUG: py3 fixes; revise scoping rules to be more broad      record variable states from the furthermost part of the stack      to the most recent, overwriting if shadow variables occur (17 minutes ago) <jreback>
* 2f87264 - TST: fixed remaining tests (17 minutes ago) <jreback>
* 4abe3c5 - BUG: fixed scoping issues by _ensure_term at the top-level (17 minutes ago) <jreback>
* bb7d620 - ENH: use non_implemented function call in ExprVisitor (17 minutes ago) <jreback>
* 2faeeac - BUG: process visit_Index (17 minutes ago) <jreback>
* de28b79 - BUG: added HDFStore to inherit from Stringmixin (17 minutes ago) <jreback>
* 7830af8 - TST: more test changes (17 minutes ago) <jreback>
* 1f208b4 - WIP: conditions working now, filtering still only ok      good parsing of attributes, subscripting, e.g df.index[0:4] works! (17 minutes ago) <jreback>
* 706276d - WIP: still some debugging statements in (17 minutes ago) <jreback>
* 7ba3aaa - ENH: initial commit for adding Expr based terms for pytables support (17 minutes ago) <jreback>
* e117f1a - ENH: rewrite assignment as equal comparison (17 minutes ago) <Phillip Cloud>
* 16cb733 - ENH: add modest type inference (17 minutes ago) <Phillip Cloud>
* 48719b3 - CLN: generlize operator/expression printing (17 minutes ago) <Phillip Cloud>
* 73415c4 - CLN/TST: fail with a NotImplementedError on Python boolean operators (17 minutes ago) <Phillip Cloud>

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

http://git-scm.com/book/en/Distributed-Git-Contributing-to-a-Project

I don't think we can rebase at all...just pull/push/merge

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

why don't you create a commit and try pushing...then i'll do the same

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

it's annoying that the embedded HTML travis tag on github says failing when master is passing fine, but this isn't :(

pushed a commit...

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

arg it's a stupid network yahoo test failure...

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

i pulled down your commit....what am I donig wrong?

[sheep-jreback-~/pandas] git push https://github.com/pydata/pandas.git origin/eval-3393
Username: 
Password: 
To https://github.com/pydata/pandas.git
 ! [rejected]        origin/eval-3393 -> origin/eval-3393 (non-fast-forward)
error: failed to push some refs to 'https://github.com/pydata/pandas.git'
To prevent you from losing history, non-fast-forward updates were rejected
Merge the remote changes (e.g. 'git pull') before pushing again.  See the
'Note about fast-forwards' section of 'git push --help' for details.

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

force push it

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

I have something wrong it updated to the eval4 branch and not eval-3393....?

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

hm why not just do git branch --set-upstream-to=origin/eval-3393 and avoid having to type that long command?

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

I did that....

git push doesn't do anything then....hmmm

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

ok how about git stash && git checkout -b eval-N --track origin/eval-3393 && git stash pop...not sure why that should be different than just manually setting the upstream ...

@jreback jreback merged commit f5d673f into master Jul 8, 2013
@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

merged into master? that doesn't seem like a great idea just yet...

@cpcloud
Copy link
Member Author

cpcloud commented Jul 8, 2013

oh nevermind something else is going on

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

no....it didn't actually merge

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

yeh...that's weird...nothing showed up...hmmm

(I didn't mean to do that)...but I am not sure what actually happened

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

in my eval5 branch this is what I did

git push -u https://github.com/pydata/pandas.git origin:eval-3393 -f

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

it looks like it merged master back into this branch somehow....

see if you can resurrect somehow (maybe create a new branch)..

I don't think master actually changed..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: eval function
2 participants