- 
                Notifications
    
You must be signed in to change notification settings  - Fork 835
 
Text parser optimization (~4.5x perf) #282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
          
 Those aspects of the text format are not optional, they must be implemented to have a correct parser.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance improvements would be great, however the format has many ways it can be represented and this code only parses a subset of the potential valid input. If you can manage to make it work with that, that'd be great.
        
          
                prometheus_client/core.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A __str__ would make more sense I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed it to __str__ and having repr call str because it's useful for comparing unit test output:
E       First differing element 0:
E       <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>
E       <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>
E
E       - [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>]
E       ?                                                                                                                  -
E
E       + [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>]
E       ?
instead of
E       First differing element 0:
E       <prometheus_client.core.CounterMetricFamily object at 0x108018c50>
E       <prometheus_client.core.Metric object at 0x108018cd0>
E
E       - [<prometheus_client.core.CounterMetricFamily object at 0x108018c50>]
E       ?                          -------      ------                    ^
E
E       + [<prometheus_client.core.Metric object at 0x108018cd0>]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually str calls repr rather than the other way around. repr should usually be an instantiatable version of object, while str is more human readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified repr to fit what it's supposed to be, still way more readable for tests outputs 👍
        
          
                prometheus_client/parser.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Help and label values have different escaping rules (double quote is the difference), you need two functions for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added another function specific for Help
        
          
                prometheus_client/parser.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep comments as full sentances
        
          
                prometheus_client/parser.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any mix of any number of spaces and tabs is permitted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some additional unit tests around this 👍
        
          
                prometheus_client/parser.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there can be whitespace after the brace, and basically everywhere else between things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be covered, I added one additional test case in test_spaces
        
          
                prometheus_client/parser.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A label value could contain a }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This already taken into account with the use of rindex, added a test case to validate this point.
        
          
                prometheus_client/parser.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be a trailing comma after the last "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be already covered, test_commas is validating this
41edcb8    to
    a25d444      
    Compare
  
    Signed-off-by: Pierre Margueritte <mfpierre@gmail.com>
a25d444    to
    1d7190c      
    Compare
  
    | i = 0 | ||
| while i < len(value_substr): | ||
| i = value_substr.index('"', i) | ||
| if value_substr[i - 1] != "\\": | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if if you have x="" as the label? i - 1 will be -1, which might have unexpected results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a unit-test around empty labels, but this works fine 👍
Signed-off-by: Pierre <mfpierre@gmail.com>
| 
           Thanks!  | 
    
| 
           Hey @brian-brazil, any plans to do a release soon? 🙇  | 
    
| 
           I've added it to my todo list  | 
    
Hi,
As we extensively use this library for parsing prometheus text format, we have observed that for big payloads (like https://github.com/kubernetes/kube-state-metrics for a big cluster) that can contains 400k+ lines the text parsing can be quite slow. (up to 27secs)
We tried to optimize the parser by dropping the state machine and leveraging native python functions such as
indexorfindand it gives us an average of x5 performances.Here are some benchmark using
timeit:For the KSM payload (400k lines) the parsing goes from ~27sec to ~4.7sec
Note: We could go up to almost 10x performance if we dropped some edge-cases treatment (like escaping, tab/space, etc...) could we consider a "strict" parsing mode that we could optionally use for "good citizens"?