
Singular Visual Line Should Be Identified as a Single TextElement #78

Open
@deenaawny-github-account

Description

Problem

For MSFT 0000950170-23-014423, the top section title "PART I. FINANCIAL INFORMATION" sits on a single visual line but is split into two semantic elements:
{
    "cls_name": "TopSectionTitle",
    "level": 0,
    "section_type": "part1",
    "text_content": "PART I. FINANCI"
},
{
    "cls_name": "TitleElement",
    "level": 0,
    "text_content": "AL INFORMATION"
}

This should be:
{
    "cls_name": "TopSectionTitle",
    "level": 0,
    "section_type": "part1",
    "text_content": "PART I. FINANCIAL INFORMATION"
}

Ideas about a possible solution

Adjust the text element merger so that it keeps merging adjacent elements until it reaches a new visual line.
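
To make the idea concrete, here is a minimal Python sketch of merging until a new visual line. It is not the parser's actual implementation: the `Fragment` class, the `starts_new_line` flag, and the `merge_by_visual_line` function are hypothetical names introduced only for illustration, and the sketch assumes that an earlier step has already decided where each visual line begins.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical stand-in for a parsed text fragment (not a real parser class)."""
    text: str
    # Assumed flag: True when this fragment begins a new visual line.
    starts_new_line: bool


def merge_by_visual_line(fragments: List[Fragment]) -> List[str]:
    """Concatenate consecutive fragments until one starts a new visual line."""
    lines: List[str] = []
    for fragment in fragments:
        if fragment.starts_new_line or not lines:
            # Start a new merged line (the first fragment always starts one).
            lines.append(fragment.text)
        else:
            # Same visual line: append to the line currently being built.
            lines[-1] += fragment.text
    return lines


# The first two fragments mirror the split reported above; the third is a
# made-up placeholder standing in for whatever follows on the next line.
fragments = [
    Fragment("PART I. FINANCI", starts_new_line=True),
    Fragment("AL INFORMATION", starts_new_line=False),
    Fragment("(next visual line)", starts_new_line=True),
]
merged = merge_by_visual_line(fragments)
print(merged)
```

With this approach the two halves of the title would end up in one string, so the merged text could then be classified once as a single element. How a visual line boundary is detected in the source HTML is left open here.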


Labels

for-internal-team (Intended for completion by the internal team)
status:deferred (Deferred for future consideration)
