Define simple search patterns in bulk to perform advanced matching on any string

ReBulk

Latest Version MIT License Build Status Coveralls semantic-release

ReBulk is a Python library that performs advanced searches in strings that would be hard to implement using the re module or string methods alone.

It includes features such as Patterns, Match and Rule that allow developers to build a custom and complex string matcher using a readable and extendable API.

This project is hosted on GitHub: https://github.com/Toilal/rebulk

Install

$ pip install rebulk

Usage

Regular expression, string and function based patterns are declared in a Rebulk object. It uses a fluent API to chain the string, regex and functional methods and define various pattern types.

>>> from rebulk import Rebulk
>>> bulk = Rebulk().string('brown').regex(r'qu\w+').functional(lambda s: (20, 25))

When the Rebulk object is fully configured, you can call its matches method with an input string to retrieve all Match objects found by the registered patterns.

>>> bulk.matches("The quick brown fox jumps over the lazy dog")
[<brown:(10, 15)>, <quick:(4, 9)>, <jumps:(20, 25)>]

If multiple Match objects are found at the same position, only the longer one is kept.

>>> bulk = Rebulk().string('lakers').string('la')
>>> bulk.matches("the lakers are from la")
[<lakers:(4, 10)>, <la:(20, 22)>]
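The "longer one is kept" rule can be pictured with plain tuples. The sketch below is not rebulk's conflict solver: keep_longer is a hypothetical helper that assumes matches are simple (start, end) spans and drops any span overlapping a strictly longer one.

```python
def keep_longer(spans):
    """Drop every span that overlaps a strictly longer span (hypothetical helper)."""
    kept = []
    for span in spans:
        start, end = span
        shadowed = any(
            other != span
            and other[0] < end and start < other[1]      # the two spans overlap
            and (other[1] - other[0]) > (end - start)    # and the other one is longer
            for other in spans
        )
        if not shadowed:
            kept.append(span)
    return kept


text = "the lakers are from la"
# 'lakers' at (4, 10), 'la' at (4, 6) and 'la' at (20, 22)
candidates = [(4, 10), (4, 6), (20, 22)]
result = keep_longer(candidates)
values = [text[start:end] for start, end in result]
```

Here the 'la' found inside 'lakers' is discarded, while the standalone 'la' at the end of the string survives.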

String Patterns

String patterns use the str.find method to find matches, but return all matches found in the string. Set ignore_case=True to ignore case.

>>> Rebulk().string('la').matches("lalalilala")
[<la:(0, 2)>, <la:(2, 4)>, <la:(6, 8)>, <la:(8, 10)>]

>>> Rebulk().string('la').matches("LalAlilAla")
[<la:(8, 10)>]

>>> Rebulk().string('la', ignore_case=True).matches("LalAlilAla")
[<La:(0, 2)>, <lA:(2, 4)>, <lA:(6, 8)>, <la:(8, 10)>]
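As a rough picture of how repeated str.find calls can yield every occurrence, here is a minimal sketch. It is not rebulk's code: find_all is a hypothetical helper, it resumes the search after the end of each hit (an assumption), and its ignore_case handling assumes that lowercasing does not change the string length, which holds for ASCII input.

```python
def find_all(needle, haystack, ignore_case=False):
    """Return (start, end, value) for each occurrence of needle (hypothetical helper)."""
    if ignore_case:
        needle, search_in = needle.lower(), haystack.lower()
    else:
        search_in = haystack
    spans = []
    start = search_in.find(needle)
    while start != -1:
        end = start + len(needle)
        # Report the value as it appears in the original input
        spans.append((start, end, haystack[start:end]))
        start = search_in.find(needle, end)
    return spans


exact = find_all("la", "LalAlilAla")
relaxed = find_all("la", "LalAlilAla", ignore_case=True)
```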

You can define several patterns with a single string method call.

>>> Rebulk().string('Winter', 'coming').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]

Regular Expression Patterns

Regular expression patterns are based on a compiled regular expression. The re.finditer method is used to find matches.

If the regex module is available, rebulk can use it instead of the default re module. Enable it with the REBULK_REGEX_ENABLED=1 environment variable.

>>> Rebulk().regex(r'l\w').matches("lolita")
[<lo:(0, 2)>, <li:(2, 4)>]

You can define several patterns with a single regex method call.

>>> Rebulk().regex(r'Wint\wr', r'com\w{3}').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]

All keyword arguments from re.compile are supported.

>>> import re  # import required for flags constant
>>> Rebulk().regex('L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<LaKeRs:(4, 10)>]

>>> Rebulk().regex('L[A-Z]', 'L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]

>>> Rebulk().regex(('L[A-Z]', re.IGNORECASE), ('L[a-z]KeRs')) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]

If the regex module is available, repeated captures are supported automatically.

>>> # If regex module is available, repeated_captures is True by default.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+').matches("01-02-03-04")
>>> matches[0].children # doctest:+SKIP
[<01:(0, 2)>, <02:(3, 5)>, <03:(6, 8)>, <04:(9, 11)>]

>>> # If regex module is not available, or if repeated_captures is forced to False.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+', repeated_captures=False) \
...                   .matches("01-02-03-04")
>>> matches[0].children
[<01:(0, 2)+initiator=01-02-03-04>, <04:(9, 11)+initiator=01-02-03-04>]
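The two-children result above mirrors how the standard re module treats a repeated group: only the last repetition is retained. The following sketch uses plain re (no rebulk) to show it.

```python
import re

found = re.match(r"(\d+)(?:-(\d+))+", "01-02-03-04")

# Group 2 sits inside a repeated group, so only its last capture is kept
groups = found.groups()
spans = [found.span(1), found.span(2)]
```

The intermediate values 02 and 03 are lost here, which is why the regex module is needed to recover every repetition.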
  • abbreviations

    Defined as a list of 2-tuples, where each tuple is an abbreviation. Every occurrence of tuple[0] in the expression is simply replaced with tuple[1].

    >>> Rebulk().regex(r'Custom-separators', abbreviations=[("-", r"[\W_]+")]) \
    ...         .matches("Custom_separators using-abbreviations")
    [<Custom_separators:(0, 17)>]
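The substitution performed by abbreviations can be approximated with str.replace before compiling the expression. This is only a sketch of the idea described above: expand_abbreviations is a hypothetical helper, not part of rebulk.

```python
import re


def expand_abbreviations(expression, abbreviations):
    """Replace each abbreviation key with its expansion (hypothetical helper)."""
    for key, replacement in abbreviations:
        expression = expression.replace(key, replacement)
    return expression


pattern = expand_abbreviations(r"Custom-separators", [("-", r"[\W_]+")])
found = re.search(pattern, "Custom_separators using-abbreviations")
```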

Functional Patterns

Functional Patterns are based on the evaluation of a function.

The function should take the same parameters as the Rebulk.matches method, that is the input string, and must return at least the start index and end index of the Match object.

>>> def func(string):
...     index = string.find('?')
...     if index > -1:
...         return 0, index - 11
>>> Rebulk().functional(func).matches("Why do simple ? Forget about it ...")
[<Why:(0, 3)>]

You can also return a dict of keyword arguments for the Match object.

You can define several patterns with a single functional method call, and function used can return multiple matches.

Chain Patterns

Chain Patterns are ordered compositions of string, functional and regex patterns. A repeater can be set on each part of the chain to define how many times it repeats.

>>> r = Rebulk().regex_defaults(flags=re.IGNORECASE)\
...             .defaults(children=True, formatter={'episode': int, 'version': int})\
...             .chain()\
...             .regex(r'e(?P<episode>\d{1,4})').repeater(1)\
...             .regex(r'v(?P<version>\d+)').repeater('?')\
...             .regex(r'[ex-](?P<episode>\d{1,4})').repeater('*')\
...             .close() # .repeater(1) could be omitted as it's the default behavior
>>> r.matches("This is E14v2-15-16-17").to_dict()  # converts matches to dict
MatchesDict([('episode', [14, 15, 16, 17]), ('version', 2)])
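For comparison, here is a sketch of the same extraction with the standard re module only. It is an illustration, not how rebulk evaluates chains: the extra group name and the parse helper are made up for this example, and the repeated part has to be captured as a whole and split afterwards because re keeps only the last repetition of a group.

```python
import re

CHAIN = re.compile(
    r"e(?P<episode>\d{1,4})"           # exactly one episode marker
    r"(?:v(?P<version>\d+))?"          # optional version
    r"(?P<extra>(?:[ex-]\d{1,4})*)",   # zero or more additional episodes
    re.IGNORECASE,
)


def parse(text):
    """Return a dict of episodes and version, or None (hypothetical helper)."""
    found = CHAIN.search(text)
    if found is None:
        return None
    episodes = [int(found.group("episode"))]
    # Split the repeated part by hand, since re keeps only the last capture
    episodes += [int(number) for number in re.findall(r"\d{1,4}", found.group("extra"))]
    version = found.group("version")
    return {"episode": episodes, "version": int(version) if version else None}


result = parse("This is E14v2-15-16-17")
```

The chain version keeps each part as a separate, individually configurable pattern, whereas the single expression above grows harder to read with every added part.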

Patterns parameters

All patterns have options that can be given as keyword arguments.

  • validator

    Function to validate Match value given by the pattern. Can also be a dict, to use validator with pattern named with key.

    >>> def check_leap_year(match):
    ...     return int(match.value) in [1980, 1984, 1988]
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1982 ...")
    >>> len(matches)
    0
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1984 ...")
    >>> len(matches)
    1

Some base validator functions are available in the rebulk.validators module. Most of those functions have to be configured using functools.partial to turn them into a function accepting a single match argument.
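The functools.partial technique mentioned above can be sketched without rebulk. In this example value_in and FakeMatch are hypothetical stand-ins (not functions from rebulk.validators): the validator takes a configuration argument first, and partial binds it so that only the match argument remains.

```python
from collections import namedtuple
from functools import partial

# Stand-in for a Match object; only the value attribute is needed here
FakeMatch = namedtuple("FakeMatch", ["value"])


def value_in(allowed, match):
    """Two-argument validator: accept the match if its value is allowed."""
    return match.value in allowed


# Bind the first argument to obtain a function of a single match argument
leap_year_validator = partial(value_in, {"1980", "1984", "1988"})

accepted = leap_year_validator(FakeMatch("1984"))
rejected = leap_year_validator(FakeMatch("1982"))
```

The resulting leap_year_validator has the single-argument shape expected by the validator parameter.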

  • formatter

    Function to convert Match value given by the pattern. Can also be a dict, to use formatter with matches named with key.

    >>> def year_formatter(value):
    ...     return int(value)
    >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
    ...                   .matches("In year 1982 ...")
    >>> isinstance(matches[0].value, int)
    True
  • pre_match_processor / post_match_processor

    Function to mutate or invalidate a match generated by a pattern.

    Function has a single parameter which is the Match object. If function returns False, it will be considered as an invalid match. If function returns a match instance, it will replace the original match with this instance in the process.

  • post_processor

    Function to change the default output of the pattern. Function parameters are Matches list and Pattern object.

  • name

    The name of the pattern. It is automatically passed to Match objects generated by this pattern.

  • tags

    A list of strings that qualify this pattern.

  • value

    Override value property for generated Match objects. Can also be a dict, to use value with pattern named with key.

  • validate_all

    By default, validator is called for returned Match objects only. Enable this option to validate them all, parent and children included.

  • format_all

    By default, formatter is called for returned Match values only. Enable this option to format them all, parent and children included.

  • disabled

    A function(context) to disable the pattern if returning True.

  • children

    If True, all children Match objects will be retrieved instead of a single parent Match object.

  • private

    If True, Match objects generated from this pattern are available internally only. They will be removed at the end of Rebulk.matches method call.

  • private_parent

    Force parent matches to be returned and flag them as private.

  • private_children

    Force children matches to be returned and flag them as private.

  • private_names

    Match names that will be declared as private.

  • ignore_names

    Matches names that will be ignored from the pattern output, after validation.

  • marker

    If True, Match objects generated from this pattern will be marker matches instead of standard matches. They won't be included in the Matches sequence, but will be available in the Matches.markers sequence (see Markers section).

Match

A Match object is the result created by a registered pattern.

It has a value property defined, and position indices are available through start, end and span properties.

In some cases, it contains child Match objects in its children property, and each child Match object references its parent in its parent property. A name property can also be defined for the match.

If groups are defined in a Regular Expression pattern, each group match will be converted to a single Match object. If a group has a name defined ((?P<name>group)), it is set as name property in a child Match object. The whole regexp match (re.group(0)) will be converted to the main Match object, and all subgroups (1, 2, ... n) will be converted to children matches of the main Match object.

>>> matches = Rebulk() \
...         .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)") \
...         .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<One, 1, Two, 2, Three, 3:(9, 33)>]
>>> for child in matches[0].children:
...     '%s = %s' % (child.name, child.value)
'one = 1'
'two = 2'
'three = 3'
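The parent/children structure maps closely onto what the standard re module exposes for groups. The sketch below uses plain re (no rebulk) to list the whole match and its named groups with their spans.

```python
import re

pattern = re.compile(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)")
found = pattern.search("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")

# Group 0 plays the role of the parent match
parent = (found.group(0), found.span(0))

# Each named group plays the role of a child match, ordered by group index
names = sorted(pattern.groupindex, key=pattern.groupindex.get)
children = [(name, found.group(name), found.span(name)) for name in names]
```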

It's possible to retrieve only the children by using the children parameter. You can also customize the way the structure is generated with the every, private_parent and private_children parameters.

>>> matches = Rebulk() \
...         .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)", children=True) \
...         .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<1:(14, 15)+name=one+initiator=One, 1, Two, 2, Three, 3>, <2:(22, 23)+name=two+initiator=One, 1, Two, 2, Three, 3>, <3:(32, 33)+name=three+initiator=One, 1, Two, 2, Three, 3>]

A Match object has the following properties that can be given to Pattern objects:

  • formatter

    Function to convert Match value given by the pattern. Can also be a dict, to use formatter with matches named with key.

    >>> def year_formatter(value):
    ...     return int(value)
    >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
    ...                   .matches("In year 1982 ...")
    >>> isinstance(matches[0].value, int)
    True
  • format_all

    By default, formatter is called for returned Match values only. Enable this option to format them all, parent and children included.

  • conflict_solver

    A function(match, conflicting_match) used to solve conflicts. The returned object will be removed from matches by the ConflictSolver default rule. If the string __default__ is returned, it falls back to the default behavior of keeping the longer match.
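A conflict_solver function following the contract described above could look like the sketch below. The policy (prefer named matches) and the prefer_named name are invented for illustration, and SimpleNamespace objects stand in for real Match objects.

```python
from types import SimpleNamespace


def prefer_named(match, conflicting_match):
    """Return the match to remove, or '__default__' to keep the longer one."""
    if match.name and not conflicting_match.name:
        return conflicting_match
    if conflicting_match.name and not match.name:
        return match
    # No preference: defer to the default behavior
    return "__default__"


# Stand-ins for Match objects; only the name attribute matters here
named = SimpleNamespace(name="year", value="1984")
unnamed = SimpleNamespace(name=None, value="84")

to_remove = prefer_named(named, unnamed)
fallback = prefer_named(named, named)
```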

Matches

A Matches object holds the result of Rebulk.matches method call. It's a sequence of Match objects and it behaves like a list.

All methods accept a predicate function to filter Match objects using a callable, and an index int to retrieve a single element from the matches returned by default.

It has the following additional methods and properties on it.

  • starting(start, predicate=None, index=None)

    Retrieves a list of Match objects that start at the given start position.

  • ending(end, predicate=None, index=None)

    Retrieves a list of Match objects that end at the given end position.

  • previous(match, predicate=None, index=None)

    Retrieves a list of Match objects that are previous and nearest to match.

  • next(match, predicate=None, index=None)

    Retrieves a list of Match objects that are next and nearest to match.

  • tagged(tag, predicate=None, index=None)

    Retrieves a list of Match objects that have the given tag defined.

  • named(name, predicate=None, index=None)

    Retrieves a list of Match objects that have the given name.

  • range(start=0, end=None, predicate=None, index=None)

    Retrieves a list of Match objects for given range, sorted from start to end.

  • holes(start=0, end=None, formatter=None, ignore=None, predicate=None, index=None)

    Retrieves a list of hole Match objects for given range. A hole match is created for each range where no match is available.

  • conflicting(match, predicate=None, index=None)

    Retrieves a list of Match objects that conflict with the given match.

  • chain_before(position, seps, start=0, predicate=None, index=None)

    Retrieves a list of chained matches, before position, matching predicate and separated by characters from seps only.

  • chain_after(position, seps, end=None, predicate=None, index=None)

    Retrieves a list of chained matches, after position, matching predicate and separated by characters from seps only.

  • at_match(match, predicate=None, index=None)

    Retrieves a list of Match objects at the same position as match.

  • at_span(span, predicate=None, index=None)

    Retrieves a list of Match objects from given (start, end) tuple.

  • at_index(pos, predicate=None, index=None)

    Retrieves a list of Match objects from given position.

  • names

    Retrieves a sequence of all Match.name properties.

  • tags

    Retrieves a sequence of all Match.tags properties.

  • to_dict(details=False, first_value=False, enforce_list=False)

    Convert to an ordered dict, with Match.name as key and Match.value as value.

    It's a subclass of OrderedDict, that contains a matches property which is a dict with Match.name as key and list of Match objects as value.

    If first_value is False (the default) and distinct values are found for the same name, the values are wrapped in a list. If True, only the first value is kept; the complete lists of values remain available through values_list, which is a dict with Match.name as key and a list of Match.value as value.

    If enforce_list is True, all values will be wrapped in a list, even if a single value is found.

    If details is True, Match.value objects are replaced with complete Match object.

  • markers

    A custom Matches sequence specialized for marker matches (see below).
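The idea behind the holes method listed above, one hole per range where no match is available, can be sketched on plain (start, end) spans. This is not rebulk's implementation: find_holes is a hypothetical helper that ignores the formatter, ignore, predicate and index options.

```python
def find_holes(spans, start, end):
    """Return the (start, end) ranges of [start, end) not covered by any span."""
    holes = []
    cursor = start
    for span_start, span_end in sorted(spans):
        if span_start >= end:
            break
        if span_start > cursor:
            # Nothing covers the range between the cursor and this span
            holes.append((cursor, span_start))
        cursor = max(cursor, span_end)
    if cursor < end:
        holes.append((cursor, end))
    return holes


text = "The quick brown fox jumps over the lazy dog"
# Spans of 'brown', 'quick' and 'jumps' from the Usage section
matched = [(10, 15), (4, 9), (20, 25)]
holes = find_holes(matched, 0, len(text))
hole_values = [text[s:e] for s, e in holes]
```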

Markers

If you have defined some patterns with the marker property, then Matches.markers points to a special Matches sequence that contains only marker matches. This sequence supports all methods from Matches.

Marker matches are not intended to be used in the final result, but can be used to implement a Rule.

Rules

Rules are a convenient and readable way to implement advanced conditional logic involving several Match objects. When a rule is triggered, it can perform an action on Matches object, like filtering out, adding additional tags or renaming.

Rules are implemented by extending the abstract Rule class. They are registered using the Rebulk.rules method by giving either a Rule instance, a Rule class or a module containing Rule classes only.

For a rule to be triggered, Rule.when method must return True, or a non empty list of Match objects, or any other truthy object. When triggered, Rule.then method is called to perform the action with when_response parameter defined as the response of Rule.when call.

Instead of implementing the Rule.then method, you can define the consequence class property with a Consequence class or instance, like RemoveMatch, RenameMatch or AppendMatch. You can also use a list of consequences when required: when_response must then be iterable, and the elements of this iterable will be given to each consequence in the same order.

When many rules are registered, it can be useful to set priority class variable to define a priority integer between all rule executions (higher priorities will be executed first). You can also define dependency to declare another Rule class as dependency for the current rule, meaning that it will be executed before.

For all rules sharing the same priority value, when is called on every rule first, and then is called afterwards on the rules that were triggered.

>>> from rebulk import Rule, RemoveMatch

>>> class FirstOnlyRule(Rule):
...     consequence = RemoveMatch
...
...     def when(self, matches, context):
...         grabbed = matches.named("grabbed", 0)
...         if grabbed and matches.previous(grabbed):
...             return grabbed

>>> rebulk = Rebulk()

>>> rebulk.regex("This match(.*?)grabbed", name="grabbed")
<...Rebulk object ...>
>>> rebulk.regex("if it's(.*?)first match", private=True)
<...Rebulk object at ...>
>>> rebulk.rules(FirstOnlyRule)
<...Rebulk object at ...>

>>> rebulk.matches("This match is grabbed only if it's the first match")
[<This match is grabbed:(0, 21)+name=grabbed>]
>>> rebulk.matches("if it's NOT the first match, This match is NOT grabbed")
[]
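The ordering described above (higher priority first; within one priority, every when before any then) can be modeled with a toy runner. Everything in this sketch is hypothetical and independent of rebulk: ToyRule, run_rules and the two example rules exist only to illustrate the ordering, matches are plain strings, and dependency handling is left out.

```python
from itertools import groupby


class ToyRule:
    """Toy stand-in for a rule; not rebulk's Rule class."""
    priority = 0

    def when(self, matches):
        raise NotImplementedError

    def then(self, matches, when_response):
        raise NotImplementedError


class DropShort(ToyRule):
    priority = 10  # runs before rules with a lower priority

    def when(self, matches):
        return [m for m in matches if len(m) < 3]

    def then(self, matches, when_response):
        for m in when_response:
            matches.remove(m)


class Upper(ToyRule):
    def when(self, matches):
        return bool(matches)

    def then(self, matches, when_response):
        matches[:] = [m.upper() for m in matches]


def run_rules(rules, matches):
    # Stable sort: higher priorities first, registration order otherwise
    ordered = sorted(rules, key=lambda rule: rule.priority, reverse=True)
    for _, group in groupby(ordered, key=lambda rule: rule.priority):
        # Evaluate every when() of the priority group before any then()
        responses = [(rule, rule.when(matches)) for rule in group]
        for rule, response in responses:
            if response:  # any truthy response triggers the rule
                rule.then(matches, response)
    return matches


result = run_rules([Upper(), DropShort()], ["la", "lakers", "fox"])
```

Although Upper is registered first, DropShort runs first because of its higher priority, so the short value is removed before the remaining ones are uppercased.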
