Provision refs resolver #1832

Merged: 49 commits, Nov 1, 2023
Changes from 38 commits
Commits
7f606e3
WIP for internal refs resolver
longhotsummer Sep 8, 2023
35bfbb9
use resolver
longhotsummer Sep 9, 2023
96f4885
lookup remote docs
longhotsummer Sep 9, 2023
8f2a613
temporarily get new internal refs working; popups
longhotsummer Sep 10, 2023
812f8b4
put popovers above, allow html
longhotsummer Sep 10, 2023
67054b1
refactoring
longhotsummer Sep 10, 2023
c45717c
canopy grammar and parser
longhotsummer Sep 10, 2023
fe88aa0
begin integrating grammer into resolver
longhotsummer Sep 11, 2023
c9bb97c
re-work grammar
longhotsummer Sep 11, 2023
e616bd6
recursive reference resolver
longhotsummer Sep 11, 2023
96d7dd0
adjust tests
longhotsummer Sep 12, 2023
12aa6a9
Merge remote-tracking branch 'origin/master' into refs-resolver
longhotsummer Sep 12, 2023
40d0b40
targets
longhotsummer Sep 12, 2023
16bb0c6
simplify object model
longhotsummer Sep 12, 2023
275766b
translations
longhotsummer Sep 13, 2023
a9c6485
correct refs
longhotsummer Sep 13, 2023
822890b
refactor; pull in tests from old refs resolver
longhotsummer Sep 13, 2023
54c0a5d
repeating top-level references
longhotsummer Sep 13, 2023
e485492
todos
longhotsummer Sep 13, 2023
e2546d6
remove old internal refs finder
longhotsummer Sep 13, 2023
23f2286
ambiguous levels
longhotsummer Sep 13, 2023
014bd07
improved ranges support
longhotsummer Sep 14, 2023
e7c8660
ignore refs in embeddedStructure etc.
longhotsummer Sep 14, 2023
804e19a
typo
longhotsummer Sep 14, 2023
c9a2694
fix annoying css issue that causes very small scrolling
longhotsummer Sep 14, 2023
661a213
support paragraph (a), etc.
longhotsummer Sep 14, 2023
0568233
internal ref popups (WIP)
longhotsummer Sep 14, 2023
e79f130
Update indigo/analysis/refs/provision_refs.peg
longhotsummer Sep 20, 2023
1459c71
Update indigo/analysis/refs/provision_refs.peg
longhotsummer Sep 20, 2023
59d492f
Update indigo/analysis/refs/provisions.py
longhotsummer Sep 20, 2023
e717620
Update indigo/analysis/refs/provisions.py
longhotsummer Sep 20, 2023
a8d3327
fix popups, compile script
longhotsummer Sep 20, 2023
a68eccb
build parser grammar
longhotsummer Sep 20, 2023
b2d914a
artikel can be section or article
longhotsummer Sep 23, 2023
0b3b619
correct breadth-first-search
longhotsummer Oct 11, 2023
1e44098
rename test
longhotsummer Oct 11, 2023
eb30303
sub-section, sub-paragraph
longhotsummer Oct 11, 2023
2393070
mark test as expected failure
longhotsummer Oct 16, 2023
d24e16d
Update indigo/analysis/refs/provisions.py
longhotsummer Oct 17, 2023
1d4ae57
Update indigo/tests/test_provision_refs.py
longhotsummer Oct 17, 2023
b7f5e52
synonyms, regulations, subregulations
longhotsummer Oct 17, 2023
c337da9
markup all references, right to left
longhotsummer Oct 18, 2023
ff078bb
support multiple relative internal targets
longhotsummer Oct 18, 2023
0ef3c3c
support term lookups
longhotsummer Oct 18, 2023
ffe5270
support "of this Act" for subleg
longhotsummer Oct 18, 2023
d2b240e
fix bugs
longhotsummer Oct 20, 2023
aa9d29b
Update indigo/analysis/refs/provisions.py
longhotsummer Oct 31, 2023
786b2d8
Update indigo/analysis/refs/provisions.py
longhotsummer Oct 31, 2023
60f7236
crossheading; tests for ignored tags
longhotsummer Oct 31, 2023
7 changes: 5 additions & 2 deletions .github/workflows/build.yml
@@ -22,9 +22,12 @@ jobs:
- name: Install dependencies
run: npm ci --no-audit --prefer-offline --ignore-scripts

- name: Build
- name: Build javascript
run: npx webpack

- name: Build provision grammar
run: canopy indigo/analysis/refs/provision_refs.peg --lang python

- name: Copy node dependencies
# Keep the local copy of some node dependencies up to date
# indigo-web: only used for PDF exports
@@ -36,5 +39,5 @@ jobs:
- name: Push
uses: EndBug/add-and-commit@v7
with:
add: 'indigo_app/static/javascript/indigo/bluebell-monaco.js indigo_app/static/lib/external-imports.js indigo_app/static/javascript/indigo-app.js indigo_app/static/lib/indigo-web/ --force'
add: 'indigo_app/static/javascript/indigo/bluebell-monaco.js indigo_app/static/lib/external-imports.js indigo_app/static/javascript/indigo-app.js indigo_app/static/lib/indigo-web/ indigo/analysis/refs/provision_refs.py --force'
message: 'Update compiled bluebell-monaco.js, external-imports.js and indigo-app.js'
1 change: 1 addition & 0 deletions indigo/analysis/refs/__init__.py
@@ -1 +1,2 @@
import indigo.analysis.refs.base # noqa
import indigo.analysis.refs.provisions # noqa
139 changes: 1 addition & 138 deletions indigo/analysis/refs/base.py
@@ -1,9 +1,8 @@
from lxml import etree
import re

from indigo.analysis.markup import TextPatternMarker, MultipleTextPatternMarker
from indigo.analysis.markup import TextPatternMarker
from indigo.plugins import LocaleBasedMatcher, plugins
from indigo.xmlutils import closest
from indigo_api.models import Subtype, Work


@@ -181,139 +180,3 @@ def is_valid(self, node, match):

def make_href(self, match):
return self.cap_numbers[match.group('num')]


class BaseInternalRefsFinder(LocaleBasedMatcher, MultipleTextPatternMarker):
""" Finds internal references in documents, such as to sections.

The item_re and pattern_re patterns must both have a named capture group
called 'ref', which is the full reference to be marked up.
"""
marker_tag = 'ref'

# the ancestor elements that can contain references
ancestors = ['body', 'mainBody', 'conclusions']

def find_references_in_document(self, document):
""" Find references in +document+, which is an Indigo Document object.
"""
# we need to use etree, not objectify, so we can't use document.doc.root,
# we have to re-parse it
root = etree.fromstring(document.content.encode('utf-8'))
self.setup(root)
self.markup_patterns(root)
document.content = etree.tostring(root, encoding='unicode')

def is_valid(self, node, match):
return self.find_target(node, match) is not None

def is_item_valid(self, node, match):
return self.is_valid(node, match)

def markup_match(self, node, match):
ref = etree.Element(self.marker_tag)
ref.text = match.group('ref')
ref.set('href', self.make_href(node, match))
return ref, match.start('ref'), match.end('ref')

def find_target(self, node, match):
""" Return the target element that this reference targets.
"""
raise NotImplementedError()

def make_href(self, node, match):
""" Return the target href for this match.
"""
target = self.find_target(node, match)
return '#' + target.get('eId')


@plugins.register('internal-refs')
class SectionRefsFinderENG(BaseInternalRefsFinder):
""" Finds internal references to sections in documents, of the form:

# singletons
section 26
section 26B

# lists
sections 22 and 32
and sections 19, 22 and 23, unless it appears to him
and sections 19, 22, and 23 (oxford comma)
and sections 19,22 and 23 (incorrect spacing)
Sections 24, 26, 28, 36, 42(2), 46, 48, 49(2), 52, 53, 54 and 56 shall mutatis mutandis
sections 23, 24, 25, 26 and 28;
sections 22(1) and 25(3)(b);
sections 18, 61 and 62(1).
in terms of section 2 or 7
A person who contravenes sections 4(1) and (2), 6(3), 10(1) and (2), 11(1), 12(1), 19(1), 19(3), 20(1), 20(2), 21(1), 22(1), 24(1), 25(3), (4) , (5) and (6) , 26(1), (2), (3) and (5), 28(1), (2) and (3) is guilty of an offence.

TODO: match subsections
TODO: match paragraphs
TODO: match ranges of sections
"""

# country, language, locality
locale = (None, 'eng', None)

pattern_re = re.compile(
r'''\b
(
(?P<ref>
(?<!-)sections?\s+
(?P<num>\d+[A-Z0-9]*) # first section number, including subsections
)
(
(\s*(,|and|or))* # list separators
(\s*\([A-Z0-9]+\))+ # bracketed subsections of first number
)*
(\s* # optional list of sections
(\s*(,|and|or))* # list separators
(
\s*\d+[A-Z0-9]*(
(\s*(,|and|or))* # list separators
(\s*\([A-Z0-9]+\))+
)*
)
)*
)
(\s+of\s+(this)?|\s+thereof)?
''',
re.X | re.IGNORECASE)

# individual numbers in the list grouping above
# we use <ref> and <num> named captures so that the is_valid and make_href
# methods can handle matches from both pattern_re and this re.
# negative lookaround for parentheses around each number in the run guards against subsections being picked up as section numbers, e.g.
# sections 4(1) and (2), 25(3), (4), (5) and (6), etc
item_re = re.compile(r'(?P<ref>(?P<num>(?<!\()\d+[A-Z0-9]*(?!\))))(\s*\([A-Z0-9]+\))*', re.IGNORECASE)

candidate_xpath = ".//text()[contains(translate(., 'S', 's'), 'section') and not(ancestor::a:ref)]"
match_cache = {}

def setup(self, root):
super().setup(root)
self.ancestor_tags = set(f'{{{self.ns}}}{t}' for t in self.ancestors)

def is_valid(self, node, match):
# check that it's not an external reference
ref = match.group(0)
if ref.endswith('of ') or ref.endswith('thereof'):
return False
return True

def is_item_valid(self, node, match):
return self.find_target(node, match) is not None

def find_target(self, node, match):
num = match.group('num')
# find the closest ancestor to scope the lookups to
ancestor = closest(node, lambda e: e.tag in self.ancestor_tags)
candidate_elements = ancestor.xpath(f".//a:section[a:num[text()='{num}.']]", namespaces=self.nsmap)
if candidate_elements:
self.match_cache[num] = candidate_elements[0]
return candidate_elements[0]

def make_href(self, node, match):
target = self.match_cache[match.group('num')]
return '#' + target.get('eId')
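
Note on the removed item_re: the comment above it says the lookarounds stop bracketed subsection numbers from being picked up as section numbers. A minimal sketch of that behaviour follows, using only the regex copied from the deleted SectionRefsFinderENG. The helper name section_numbers is hypothetical and is not part of Indigo.

```python
import re

# Copied verbatim from the removed SectionRefsFinderENG.item_re.
item_re = re.compile(
    r'(?P<ref>(?P<num>(?<!\()\d+[A-Z0-9]*(?!\))))(\s*\([A-Z0-9]+\))*',
    re.IGNORECASE,
)


def section_numbers(text):
    """Return the top-level section numbers in a run of references.

    Hypothetical helper, for illustration only.
    """
    return [m.group('num') for m in item_re.finditer(text)]


# The bracketed (1), (2), (3), ... are skipped; only 4 and 25 are reported.
print(section_numbers("sections 4(1) and (2), 25(3), (4), (5) and (6)"))
```

Only the numbers that are not wrapped in parentheses come back, which is why the old finder linked "4" and "25" but left the subsections unlinked.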
116 changes: 116 additions & 0 deletions indigo/analysis/refs/provision_refs.peg
@@ -0,0 +1,116 @@
grammar ProvisionRefs
# This grammar matches runs of references, such as:
# Section 32(a), (b) and (f)(ii), 33 and chapter 4 of the Act
#
# The main references are "section 32" and "chapter 4"; each forms the root used by
# the subsequent references.

root <- references (to_and_or references)* target? .* %root

# section 32
# section (a)
# section 32(a)(c)
# section 32.1a.2
references <- unit WS+ main_ref (to_and_or main_ref)* %references

unit <- unit_en / unit_af

main_ref <- (main_num / num) (WS* sub_refs)? %main_ref

# section 32
# section 32A
# section 32.1a.2
main_num <- digit alpha_num_dot* %main_num

# (a)(ii) to (iv)
# (a), (b) and (f)(ii)
# (a)(ii), (b)(iii) and (v), and (d)
sub_refs <- sub_ref (to_and_or sub_ref)* %sub_refs

# (a)
# (a) (ii)
sub_ref <- num (WS* num)* %sub_ref

# (a)
# (a1)
# (a-bis)
# (iv)
num <- "(" alpha_num_dot+ ")" %num

to_and_or <- range / and_or

# (a) to (b)
# (a), to (b)
range <- (WS* comma)? WS+ to WS+ %range

# (a) and (b)
# (a), or (b)
# (a), (b)
and_or <- ((WS* comma)? WS* _and WS+) %and_or
/ ((WS* comma)? WS* _or WS+) %and_or
/ (WS* comma WS*) %and_or

target <- of_this / of / thereof

# of this Act
of_this <- comma? WS* (of_this_en / of_this_af) WS+ %of_this

# of the Act
of <- comma? WS* (of_en / of_af) WS+ %of

# thereof
thereof <- comma? WS* (thereof_en / thereof_af) %thereof

# --------
# terminals
# --------

# english
unit_en <- `articles` / `article` /
`chapters` / `chapter` /
`items` / `item` /
`paragraphs` / `paragraph` /
`parts` / `part` /
`points` / `point` /
`sections` / `section` /
`subparagraphs` / `subparagraph` /
`sub-paragraphs` / `sub-paragraph` /
`subsections` / `subsection` /
`sub-sections` / `sub-section`

# afrikaans
unit_af <- `afdelings` / `afdeling` /
`artikels` / `artikel` /
`dele` / `deel` /
`hoofstukke` / `hoofstuk` /
`paragrawe` / `paragraaf` /
`punte` / `punt` /
`subafdelings` / `subafdeling` /
`subparagrawe` / `subparagraaf`

_and <- and_en / and_af
and_en <- `and`
and_af <- `en`

_or <- or_en / or_af
or_en <- `or`
or_af <- `of`

# note that "tot" must come before "to" because the latter is a substring of the former
to <- to_af / to_en
Review comment (Contributor): we might want to add - and (and maybe even ) to this at some point

to_en <- `to`
to_af <- `tot`

of_en <- `of`
of_af <- `van`

of_this_en <- `of this`
of_this_af <- `van hierdie`

thereof_en <- `thereof`
thereof_af <- `daarvan`

comma <- [,;]
digit <- [0-9]
alpha_num_dot <- [a-zA-Z0-9.-]
WS <- " " / "\t"
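
Note on the ordering in the `to` rule: PEG choice is ordered and commits to the first alternative that succeeds, which is why the grammar lists `tot` before `to`. The sketch below shows this in plain Python. It is not the parser that canopy generates, every function name in it is hypothetical, and it assumes backtick literals match case-insensitively.

```python
def match_literal(text, pos, literal):
    """Return the end offset if `literal` matches at `pos`, else None."""
    end = pos + len(literal)
    return end if text[pos:end].lower() == literal else None


def ordered_choice(text, pos, literals):
    """PEG-style choice: the first alternative that matches wins."""
    for literal in literals:
        end = match_literal(text, pos, literal)
        if end is not None:
            return end
    return None


def match_range_word(text, pos, literals):
    """Mimic the tail of the `range` rule: `to WS+`.

    Once the choice has succeeded, a failure in the following WS+ does
    not cause the remaining alternatives to be retried.
    """
    end = ordered_choice(text, pos, literals)
    if end is None or end >= len(text) or text[end] not in " \t":
        return None
    return end


# Longest literal first: "tot" is consumed whole, then whitespace follows.
print(match_range_word("tot (b)", 0, ("tot", "to")))
# Shortest first: "to" wins, the next character is "t", so the rule fails.
print(match_range_word("tot (b)", 0, ("to", "tot")))
```

With the alternatives swapped, Afrikaans ranges such as "(a) tot (b)" would never match, while English "to" is unaffected by either order.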