
PyTables enhancements for selection #1996

Closed
jreback opened this issue Sep 30, 2012 · 7 comments

@jreback
Contributor

jreback commented Sep 30, 2012

now

changes to pandas.io.pytables to support more natural selection (from tables):

  1. rename column -> major, index -> minor (to be more consistent with panel nomenclature)
  2. provide parsable string selection methodology - pretty easy to do - and can be backwards compatible

store.select('mypanel', where = [ 'major>=20120103', 'major<=20120401', dict(minor = ['A','B','C']) ])

rather than the existing:

store.select('mypanel', where = [ 
dict(field = 'column', op = '>=', value = datetime.datetime(2012,1,3)), 
dict(field = 'column', op = '<=', value = datetime.datetime(2012,4,1)), 
dict(field = 'index', value = ['A','B','C'])  ])
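
For illustration, a minimal sketch of how a string term like 'major>=20120103' could be parsed back into the existing field/op/value dict form (the regex and helper below are hypothetical, not pandas internals):

import re

# order matters: match '>=' / '<=' before '>' / '<'
_TERM_RE = re.compile(r'^(?P<field>\w+)\s*(?P<op>>=|<=|=|>|<)\s*(?P<value>.+)$')

def parse_term(term):
    # turn 'major>=20120103' into dict(field='major', op='>=', value='20120103')
    m = _TERM_RE.match(term)
    if m is None:
        raise ValueError('could not parse term: %r' % term)
    return dict(field=m.group('field'), op=m.group('op'), value=m.group('value'))

parse_term('major>=20120103')
# {'field': 'major', 'op': '>=', 'value': '20120103'}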

future

not sure that pandas should get really fancy just yet with operations (e.g. 'or' operations, and actual value selection)

where = [ ('major>20120901' & dict(minor = ['A','B','C'])) | dict(minor = ['D']) ]
where = [ item['foo']>2.0 ]

but probably necessary once pandas supports 'chunking' type operations on PyTables

need to build a full-fledged selection parser to translate to numexpr-type operations (maybe with a patsy backend????)
BUT this may actually be useful to support generic operations in this way on in-memory panels/frames

not sure of use cases here though - I usually just read in 'about' what data I need and sub-select from there
unless you have hundreds of millions of rows I don't know if it's necessary to optimize more (in which case it is!)

jreback added a commit to jreback/pandas that referenced this issue Nov 15, 2012
  1. added __str__ (to do __repr__)
  2. row removal in tables is much faster if rows are consecutive
  3. added Term class, refactored Selection (this is backwards compatible)
     Term is a concise way of specifying conditions for queries, e.g.

        Term(dict(field = 'index', op = '>', value = '20121114'))
        Term('index', '20121114')
        Term('index', '>', '20121114')
        Term('index', ['20121114','20121114'])
        Term('index', datetime(2012,11,14))
        Term('index>20121114')

     updated tests for same

  this should close GH pandas-dev#1996
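
A sketch of how those Terms plug into a query (the store key and values are illustrative; the import path is from pandas of that era):

import pandas as pd
from pandas.io.pytables import Term

store = pd.HDFStore('store.h5')
# any of the Term spellings above are accepted in `where`
result = store.select('df', where=[Term('index', '>', '20121114'),
                                   Term('column', ['A', 'B', 'C'])])
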
jreback added a commit to jreback/pandas that referenced this issue Nov 24, 2012
      1. added __str__ (to do __repr__)
      2. added __delitem__ to support store deletion syntactic sugar
      3. row removal in tables is much faster if rows are consecutive
      4. added Term class, refactored Selection (this is backwards compatible)
         Term is a concise way of specifying conditions for queries, e.g.

            Term(dict(field = 'index', op = '>', value = '20121114'))
            Term('index', '20121114')
            Term('index', '>', '20121114')
            Term('index', ['20121114','20121114'])
            Term('index', datetime(2012,11,14))
            Term('index>20121114')

            added alias to the Term class; you can specify the nominal indexers (e.g. index in DataFrame, major_axis/minor_axis or an alias in Panel)

	 this should close GH pandas-dev#1996

      5. added Col class to manage the column conversions
      6. added min_itemsize parameter and checks in pytables to allow setting of indexer columns minimum size
      7. added indexing support via method create_table_index (requires 2.3 in PyTables)
         btw now works quite well as Int64 indices are used (as opposed to the Time64Col, which has a bug); includes a check on the PyTables version requirement

	 this should close GH pandas-dev#698

      8. significantly updated docs for pytables to reflect all changes; added docs for Table sections
      9. BUG: a store would fail if appending but a put had not been done before (see test_append);
         this was the result of incompatibility testing on the index_kind
     10. BUG: minor change to select and remove: require a table ONLY if where is also provided (and not None)

      all tests pass; tests added for new features
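
Brief sketches of items 6 and 7 above (the key, column, and sizes are illustrative):

import pandas as pd

df = pd.DataFrame({'A': range(5)}, index=['foo', 'bar', 'baz', 'qux', 'quux'])
store = pd.HDFStore('store.h5')
# item 6: reserve a minimum string width for the indexer column when appending
store.append('df', df, min_itemsize={'index': 30})
# item 7: create a PyTables index on the table (requires PyTables >= 2.3)
store.create_table_index('df')
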
@jreback jreback closed this as completed Nov 24, 2012
@hayd
Contributor

hayd commented May 14, 2013

I wonder if it would be useful to extend this notation to regular DataFrames... has it been discussed before? (I think it may have been.)

Someone was trying to roll their own DSL for this on SO...

@jreback
Contributor Author

jreback commented May 14, 2013

absolutely, hopefully in #3202, #3393; going to implement df.eval() which will allow things like:

df['df>0 & df<5'] (which .eval will actually execute); also see the expressions module, which has some beginnings of support for this as well

The theory is to accept a numpy-like DSL (but with frames/series/constants that potentially need alignment) and then pass the numpified expression to numexpr for evaluation

Which is also similar to the expressions in HDFStore queries (via Terms)

a bit non-trivial, as you have to take the string expression, compile/parse it, walk the AST to find the sections that need alignment, then repackage it for numexpr
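
Roughly the pipeline being described, as a hand-rolled sketch (not the eventual pandas implementation; the helper name is made up):

import ast
import numexpr as ne
import pandas as pd

def eval_expr(expr, env):
    # find the names the string expression refers to by walking its AST
    names = {n.id for n in ast.walk(ast.parse(expr)) if isinstance(n, ast.Name)}
    series = {k: v for k, v in env.items() if k in names and isinstance(v, pd.Series)}
    # align everything on a common index before dropping to flat ndarrays
    aligned = pd.DataFrame(series)
    local_dict = {k: aligned[k].values for k in series}
    # hand the flat arrays to numexpr for the actual evaluation
    result = ne.evaluate(expr, local_dict=local_dict)
    return pd.Series(result, index=aligned.index)

s1 = pd.Series([1, 2, 3, 4], index=list('abcd'))
s2 = pd.Series([4, 3, 2, 1], index=list('abcd'))
mask = eval_expr('(s1 > 1) & (s2 > 1)', {'s1': s1, 's2': s2})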

@hayd up for it????

@hayd
Contributor

hayd commented May 14, 2013

Do you think going via Terms is a good solution:

df.select(where=[Term(...), Term(..)])  # just like HDFStore

and then the eval'd string would be parsed into that.

That way, we could first get select working with Terms (which shouldn't be too bad), and then write the parser for the DSL (we'd have to come to a consensus on the grammar...)?

@jreback
Contributor Author

jreback commented May 14, 2013

That is definitely a good start on it. The thoughts I had were:

  1. move Term to the expressions module and generalize it
  2. a list of Terms is an and, so we need an or; maybe something like:

import expressions as e
_or = e.OR
_and = e.AND
_not = e.NOT
_t = e.Term

expr = e.Expression(_or(_and(_t('foo','=',2), _t('bar','=',1)), _t('foobar','>',5)))
mask = expr.eval(df)

Roughly equivalent to:

mask = ((df['foo'] == 2) & (df['bar'] == 1)) | (df['foobar']>5)

which is an easy way to compose (not that user friendly though);
essentially anything can be built up into Expressions

This then could replace the syntax in HDFStore for queries, and be used (until we do parsing)
in pretty much any expression
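
A minimal sketch of that composition (none of these classes exist in pandas as written; they just show the shape of the idea):

import operator
import pandas as pd

_OPS = {'=': operator.eq, '>': operator.gt, '>=': operator.ge,
        '<': operator.lt, '<=': operator.le}

class Term:
    def __init__(self, field, op, value):
        self.field, self.op, self.value = field, op, value
    def eval(self, df):
        # a Term reduces to a boolean mask over one column
        return _OPS[self.op](df[self.field], self.value)

class AND:
    def __init__(self, *terms):
        self.terms = terms
    def combine(self, a, b):
        return a & b
    def eval(self, df):
        masks = [t.eval(df) for t in self.terms]
        out = masks[0]
        for m in masks[1:]:
            out = self.combine(out, m)
        return out

class OR(AND):
    def combine(self, a, b):
        return a | b

df = pd.DataFrame({'foo': [1, 2, 2], 'bar': [1, 1, 0], 'foobar': [0, 10, 2]})
mask = OR(AND(Term('foo', '=', 2), Term('bar', '=', 1)),
          Term('foobar', '>', 5)).eval(df)
# same as ((df['foo'] == 2) & (df['bar'] == 1)) | (df['foobar'] > 5)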

want to give it a try?

@jreback
Contributor Author

jreback commented May 14, 2013

@hayd so we don't lose this thread....why don't we continue on #3393

@hayd
Contributor

hayd commented May 14, 2013

Happy to give this a try. Will thrash this out to expressions later in the week, and ping back on the other thread. :)

@jreback
Contributor Author

jreback commented May 14, 2013

great!

FYI the numexpr package does basically this (unfortunately we can't use it directly because we have to preprocess / align and such); you might get some inspiration from it (we in fact call it via expressions.evaluate).
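
For reference, the core of what numexpr does over plain ndarrays (the alignment/preprocessing is the part pandas adds on top):

import numexpr as ne
import numpy as np

foo = np.array([1, 2, 2])
bar = np.array([1, 1, 0])
# numexpr compiles the string and evaluates it over the arrays in scope
mask = ne.evaluate('(foo == 2) & (bar == 1)')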
