
Label offset not accurate in case of table #92

Open
@ShakedAharonn

Hi,
I encountered this bug while trying to scrape a specific site:

`
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

page = """
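<table><tr>
<td><ul><li>item1</li><li>item2</li><li>item3</li><li>item4</li></ul></td>
<td><ul><li>item5</li><li>item6</li><li>item7</li><li>item8</li></ul></td>
</tr></table>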
"""

rules = {'ul':['ul'], 'table':['table']}

output = get_annotated_text(page, ParserConfig(annotation_rules=rules))
# {'text': ' * item1 * item5\n * item2 * item6\n * item3 * item7\n * item4 * item8\n', 'label': [(0, 85, 'table'), (0, 40, 'ul'), (11, 51, 'ul')]}

start_index, end_index, annotation = output['label'][1]
print(output['text'][start_index:end_index])  # ' * item1 * item5\n * item2 * item'
`

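As can be seen, slicing the text with a label's offsets does not return the labeled content: the offsets are not accurate when the annotated elements are inside a table. Here the (0, 40, 'ul') label returns items from both lists instead of only the first one.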
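
To illustrate why a single (start, end) pair may not be able to describe a list inside a table cell, here is a minimal sketch that does not use inscriptis at all. `render_side_by_side` is a hypothetical helper written only for this example, and the column separator is an assumption. It lays two lists out next to each other, the way the table output above does, and shows that the smallest contiguous span covering the left list also picks up items from the right list.

```python
# Hypothetical helper, NOT part of inscriptis: render two lists as table columns.
def render_side_by_side(left, right, sep="  "):
    width = max(len(cell) for cell in left)
    lines = [f"{l.ljust(width)}{sep}{r}" for l, r in zip(left, right)]
    return "\n".join(lines) + "\n"


left = [f" * item{i}" for i in range(1, 5)]   # first <ul>
right = [f" * item{i}" for i in range(5, 9)]  # second <ul>
text = render_side_by_side(left, right)

# Smallest contiguous span that contains every item of the left list.
start = text.index(left[0])
end = text.index(left[-1]) + len(left[-1])
span = text[start:end]

# Items of the right list that end up inside that span anyway.
leaked = [cell.strip() for cell in right if cell in span]
print(leaked)  # ['* item5', '* item6', '* item7']
```

In this layout the items of one column are interleaved with the other column's items in the flat text, so accurate labels would probably need one span per table row (or per cell line) rather than one span per list.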
