WIP: use eval expression parsing as replacement for Term in HDFStore #4155

Closed
wants to merge 48 commits into from


jreback
Contributor

@jreback jreback commented Jul 7, 2013

extension of @cpcloud's #4037

allows natural syntax for queries, with in-line variables allowed

some examples
"ts>=Timestamp('2012-02-01') & users=['a','b','c']"

['A > 0']

dates=pd.date_range('20130101',periods=5)
'index>dates'
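
To make the "in-line variables" idea concrete, here is a minimal sketch of how the names in a query string could be split into columns and variables taken from the caller's scope. It is not the code in this PR: `free_names`, `resolve_names`, and the `columns`/`scope` arguments are hypothetical names used only for illustration, and plain Python lists stand in for the pandas objects.

```python
import ast


def free_names(expr):
    """Return the set of bare names referenced in a query expression."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def resolve_names(expr, columns, scope):
    """Split the names in `expr` into known columns and in-line variables.

    `columns` and `scope` are assumptions of this sketch: the first is the
    set of queryable column names, the second a dict of caller variables.
    Raises NameError for a name found in neither.
    """
    cols, variables = set(), {}
    for name in sorted(free_names(expr)):
        if name in columns:
            cols.add(name)
        elif name in scope:
            variables[name] = scope[name]
        else:
            raise NameError(f"name {name!r} is not a column or a variable in scope")
    return cols, variables


# Stand-in for pd.date_range('20130101', periods=5)
dates = ["2013-01-01", "2013-01-02"]
cols, variables = resolve_names(
    "index > dates", columns={"index", "A"}, scope={"dates": dates}
)
```

Here `index` is recognised as a column and `dates` is picked up from the supplied scope, which is roughly what `'index>dates'` above relies on.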

Todos:

  • update docs / need new examples
  • API changes (disallow dict, and separated expressions)
  • tests for | and ~ operators, and invalid filter expressions
  • code clean up, maybe create new Expr base class to opt in/out of allowed operations
    (e.g. Numexpr doesn't want to allow certain operations, but PyTables needs them,
    so maybe define in the base class for simplicity, and just not allow them (checked in visit),
    which right now doesn't allow a generic_visitor)
  • can ops.Value be equiv of ops.Constant?
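
One way the "opt in/out of allowed operations" item could look, as a rough sketch only: a base visitor that checks every node against a whitelist inside `visit`, with each backend declaring what it accepts. The class names (`BaseExprVisitor`, `ArithmeticVisitor`, `FilterVisitor`) and the particular node lists are assumptions for illustration, not the classes in this branch, and unlike the current code this sketch does fall back to the generic visitor for child traversal. It assumes Python 3.8+ for `ast.Constant`.

```python
import ast


class BaseExprVisitor(ast.NodeVisitor):
    """Hypothetical base class: subclasses list the AST node types they accept."""

    allowed = (ast.Expression, ast.Name, ast.Load, ast.Constant)

    def visit(self, node):
        # Reject anything the subclass has not opted in to.
        if not isinstance(node, self.allowed):
            raise NotImplementedError(
                f"{type(node).__name__} is not supported by {type(self).__name__}"
            )
        # Default dispatch; generic_visit re-enters visit() for every child.
        return super().visit(node)


class ArithmeticVisitor(BaseExprVisitor):
    """Accepts arithmetic only (an assumed numexpr-like restriction)."""

    allowed = BaseExprVisitor.allowed + (
        ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
    )


class FilterVisitor(BaseExprVisitor):
    """Accepts comparisons combined with &, | and ~ (an assumed filter backend)."""

    allowed = BaseExprVisitor.allowed + (
        ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
        ast.BinOp, ast.BitAnd, ast.BitOr, ast.UnaryOp, ast.Invert, ast.List,
    )


def check(expr, visitor_cls):
    """Return True if every node in `expr` is allowed, else raise."""
    visitor_cls().visit(ast.parse(expr, mode="eval"))
    return True
```

With this layout `check("~(a > 0) | (b < 1)", FilterVisitor)` passes, while `check("a > 0", ArithmeticVisitor)` raises `NotImplementedError`, so invalid filter expressions fail in one place.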

cpcloud added 30 commits July 6, 2013 11:40
@cpcloud
Member

cpcloud commented Jul 8, 2013

That should wait for meta data though. Would be nice if you could encapsulate a bunch of related frames of different size and columns with some overlapping without having to merge them to make a frame with all of them. That's a bit out of scope here, but you can raise an issue maybe. Maybe there's a way to simulate this currently?

@alvorithm

I think what one needs for that is a Schema object that points to in-memory, and eventually offline, columns in data frames.

@jreback
Contributor Author

jreback commented Jul 8, 2013

@cpcloud ok....i am sold on the query, where you can specify column names and have them picked up.....i think pretty useful

so instead of

df[(df.a>0) & (df.b>0)]

this

df.query('a>0 & b>0') and as a bonus it uses ne (numexpr)
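
A side note on the plain-indexing form: in ordinary Python `&` binds tighter than `>`, so each comparison needs its own parentheses there. The sketch below only inspects how Python itself parses the two spellings, using the standard `ast` module; how a query-string parser chooses to treat `&` is a separate matter and is not shown.

```python
import ast

# Parse both spellings as plain Python expressions.
unparenthesized = ast.parse("a > 0 & b > 0", mode="eval").body
parenthesized = ast.parse("(a > 0) & (b > 0)", mode="eval").body

# Without parentheses: a single chained comparison, a > (0 & b) > 0
print(type(unparenthesized).__name__, len(unparenthesized.ops))
# -> Compare 2

# With parentheses: a bitwise-and of two separate comparisons
print(type(parenthesized).__name__, type(parenthesized.op).__name__)
# -> BinOp BitAnd
```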

@cpcloud
Member

cpcloud commented Jul 8, 2013

Exactly.

@alvorithm

Excellent! Just to make sure, indexing cannot support this syntax because it would be hard to disambiguate column names from query expressions?

@cpcloud
Member

cpcloud commented Jul 8, 2013

Should be ok since the parsed expression of a single column name would just return the column. It's either a column name or an expression so we could support that.
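
A rough sketch of that disambiguation, under the assumption that an existing column always wins over an expression: look the key up as a column first, and only then try to parse it. `classify_key` is a hypothetical helper written for this comment, not something in pandas, and a list of strings stands in for the frame's columns.

```python
import ast


def classify_key(key, columns):
    """Return ("column", key) or ("expression", tree) for a string key.

    Assumption of this sketch: exact column matches take priority, so a
    column whose name looks like an expression is still found.
    """
    if key in columns:
        return "column", key
    try:
        tree = ast.parse(key, mode="eval")
    except SyntaxError:
        # Neither a column nor something that parses as an expression.
        raise KeyError(key) from None
    if isinstance(tree.body, ast.Name):
        # A bare name that is not a column: nothing to evaluate.
        raise KeyError(key)
    return "expression", tree


columns = ["a", "b", "å < ∑"]
kind_plain, _ = classify_key("a", columns)
kind_odd, _ = classify_key("å < ∑", columns)
kind_expr, _ = classify_key("a > 0 & b > 0", columns)
```

In this sketch `"a"` and the unconventional name `"å < ∑"` both resolve as columns, `"a > 0 & b > 0"` is treated as an expression, and an unknown bare name such as `"c"` raises `KeyError`.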

@alvorithm

As soon as a special character appears it will /not/ be a column.

Where would be a good place to argue for KeyErrors on columns to check named indexes first before giving up? I could also try implementing it if there is interest.

@cpcloud
Member

cpcloud commented Jul 8, 2013

@Meteore
not sure what you mean exactly about the special character...

it's perfectly reasonable to have unconventional column names...in fact I sometimes name my columns with latex markup if i'm going to plot it later.

e.g., with df['å < ∑'], df['ß'], etc., then if you have columns with those names that would work, but maybe this isn't what you're talking about...unicode column names would be fine...

@jreback
how can we coordinate our commits with minimal fuss? can we set up a branch on master and push to that? and then just push and pull (no rebase for now)?

@cpcloud
Member

cpcloud commented Jul 8, 2013

@Meteore

re raising KeyError: wait until we refactor the parsers' code. i think it should go in the name resolution method of Terms...i haven't fully grokked @jreback's stuff yet so that may change a bit...been busy setting up the faster travis stuff. should be able to get the refactor going tonight or later today.

@jreback
Contributor Author

jreback commented Jul 8, 2013

@cpcloud
why don't u set up an eval branch in master, then we can both push/pull
I think we can then track and rebase to master independently

@cpcloud
Member

cpcloud commented Jul 8, 2013

@jreback done

pydata@f6feae7785f1331b01bd140653f8853d21bade1a

@jreback
Contributor Author

jreback commented Jul 8, 2013

hmm....i must be doing something wrong....

i created a local branch which tracks eval-3393, then my branch eval3 tracks that
how do i push to update eval-3393 though? (i just created a new branch eval3 on the main repo, which is wrong)

@cpcloud
Member

cpcloud commented Jul 8, 2013

assuming you're on eval3

git branch --set-upstream-to=upstream/eval-3393 # track the remote from eval3
git branch working-branch --track eval3 # working-branch tracks eval3
git fetch upstream
git rebase upstream/eval-3393
git checkout working-branch
git pull --rebase # <- rebase maybe not necessary

should do what you want and leaves you in the working branch

@jreback
Contributor Author

jreback commented Jul 8, 2013

I think i am tracking origin/eval-3393

# On branch eval4
# Your branch is ahead of 'origin/eval-3393' by 11 commits.
#
nothing to commit (working directory clean)

and my commits are there
but when I

git push https://github.com/pydata/pandas.git origin/eval-3393

nothing gets updated (even if I -f)

??

@cpcloud
Member

cpcloud commented Jul 8, 2013

what is the remote when u do git branch -vv?

@jreback
Contributor Author

jreback commented Jul 8, 2013

This should be a direct tracking branch, not an indirect

* eval4           200b59f [origin/eval-3393: ahead 11] COMPAT: allow prior 0.12 query syntax for terms, e.g. Term('index','>',5) (and show deprecation warning)

@cpcloud
Member

cpcloud commented Jul 8, 2013

what is origin? is that yours or pandas master?

@cpcloud
Member

cpcloud commented Jul 8, 2013

i pushed your changes...

@cpcloud
Member

cpcloud commented Jul 8, 2013

you should be able to checkout a branch now that tracks upstream/eval-3393 assuming upstream is git@github.com:pydata/pandas.git or https://github.com/pydata/pandas.git if you don't want to use the git protocol

@cpcloud
Member

cpcloud commented Jul 8, 2013

@jreback
Contributor Author

jreback commented Jul 8, 2013

ok..let me start with that....origin is https://github.com/pydata/pandas.git

@jreback
Contributor Author

jreback commented Jul 8, 2013

ok...all set, do we need to rebase? how do we act on this?

e.g. I make a commit and push...no prob

but I should git pull before (in case you pushed?)

what about rebasing?

@cpcloud
Member

cpcloud commented Jul 8, 2013

everything is set to go. we can both push/pull, and we should both git pull before we git push every time just to make sure we have the latest

@cpcloud
Member

cpcloud commented Jul 8, 2013

i'm not sure about rebasing though, i think we should be able to rebase at will...but i could be wrong there...e.g., if you squash a bunch and then i pull, will git want to keep the commits that i haven't squashed?

@jreback
Contributor Author

jreback commented Jul 8, 2013

ok...cool.....should we do a PR on this just to have a central place?

@jreback
Contributor Author

jreback commented Jul 8, 2013

and then close other ones?

@jreback
Contributor Author

jreback commented Jul 8, 2013

closing in favor of #4162

3 participants