Make HighlightedTextClassifier work with <b> tags #61

Open
Elijas opened this issue Dec 22, 2023 · 5 comments
Labels
contributions-welcome (Intended for completion by you, the contributor)
feature:elements (Parsing all the other elements correctly)

Comments

@Elijas
Member

Elijas commented Dec 22, 2023

Discussed in https://github.com/orgs/alphanome-ai/discussions/56

Originally posted by Elijas November 24, 2023

Example document

https://www.sec.gov/Archives/edgar/data/1675149/000119312518236766/d828236d10q.htm

 <p style="margin-top:9pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman">
  Options to purchase 1 million shares of common stock at a weighted average exercise price of $36.28 were
outstanding as of June 30, 2017, but were not included in the computation of diluted EPS because they were anti-dilutive, as the exercise prices of the options were greater than the average market price of Alcoa Corporation’s common stock.
 </p>
 <p style="margin-top:13pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">
  <b>
   G. Accumulated Other Comprehensive Loss
  </b>
 </p>
 <p style="margin-top:6pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman">
  The following table details the activity of the three components that comprise Accumulated other comprehensive loss for both Alcoa
Corporation’s shareholders and Noncontrolling interest:
 </p>

Goal

The "G. Accumulated Other Comprehensive Loss" should be recognized as HighlightedTextElement (and therefore, TitleElement).

Most likely, you will have to compute the percentage of the text that sits inside the <b> tag, reusing the parts already implemented for HighlightedTextElement. This will help you avoid situations where text text text <b>bold</b> text text is recognized as highlighted.
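
To make the percentage idea concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. It does not use sec-parser's internals: the names `BoldCoverageParser`, `bold_text_percentage`, and `is_mostly_bold`, and the 80% threshold, are all assumptions made for illustration, and the real fix would presumably reuse the existing HighlightedTextElement logic instead.

```python
from html.parser import HTMLParser


class BoldCoverageParser(HTMLParser):
    """Counts non-whitespace characters inside and outside bold tags."""

    BOLD_TAGS = {"b", "strong"}  # assumption: treat <strong> like <b>

    def __init__(self):
        super().__init__()
        self.bold_depth = 0   # how many bold tags are currently open
        self.bold_chars = 0   # non-whitespace characters inside bold tags
        self.total_chars = 0  # all non-whitespace characters

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS:
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in self.BOLD_TAGS and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        count = sum(1 for ch in data if not ch.isspace())
        self.total_chars += count
        if self.bold_depth > 0:
            self.bold_chars += count


def bold_text_percentage(html: str) -> float:
    """Return the share (0-100) of non-whitespace text wrapped in bold tags."""
    parser = BoldCoverageParser()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return 100.0 * parser.bold_chars / parser.total_chars


def is_mostly_bold(html: str, threshold: float = 80.0) -> bool:
    """Hypothetical decision rule: the threshold value is an assumption."""
    return bold_text_percentage(html) >= threshold
```

With this sketch, the `<p><b> G. Accumulated Other Comprehensive Loss </b></p>` paragraph from the example is fully covered, while `text text text <b>bold</b> text text` comes out at roughly 17% and would not be treated as highlighted.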

@Elijas Elijas added the contributions-welcome Intended for completion by you, the contributor label Dec 22, 2023
@Elijas Elijas transferred this issue from alphanome-ai/sec-ai Dec 22, 2023
@john0isaac
Contributor

I would like to work on this issue.

@Elijas Elijas added the status:in-progress Work underway. Reach out if you're interested in helping! label Dec 22, 2023
@Elijas Elijas removed their assignment Dec 27, 2023
@Elijas Elijas added the feature:elements Parsing all the other elements correctly label Dec 27, 2023
@Elijas
Member Author

Elijas commented Dec 30, 2023


I noticed that there is a need for a middle ground between synthetic unit tests and entire-document end-to-end tests. Let's call them integration tests.

So I propose a special type of unit test, where the input is an HTML snippet and the expected output is stored in a JSON file.

This makes creating a test very easy: paste the snippet from a document of interest, automatically generate a JSON file, then manually edit it into the expected output.

As for the fully annotated documents (used in the "accuracy tests"): writing a few of these integration tests and then fixing them will get us to a point where creating a fully annotated document becomes much easier, because the major issues will already be fixed.

Otherwise, manually annotating every issue in a full document takes a lot of time, so we annotate them in these small integration tests instead.

Let me know if this makes sense!

TL;DR

  • In the folder tests/integration, create an .html file containing the problematic HTML source code snippet
  • Run task unit-tests -- --create-missing to generate the .json with the current (problematic) output from sec-parser
  • Manually modify the .json to the desired state (running task unit-tests will now start failing)
  • Improve sec-parser until running task unit-tests succeeds
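
The workflow above can be sketched roughly as follows. This is not how `task unit-tests` is actually implemented: `run_snippet_tests` and its `parse` argument are hypothetical names, where `parse` stands in for whatever turns an HTML snippet into JSON-serializable output.

```python
import json
from pathlib import Path
from typing import Callable, List


def run_snippet_tests(
    folder: Path,
    parse: Callable[[str], list],  # placeholder for the real parsing step
    create_missing: bool = False,
) -> List[str]:
    """Compare parser output for each .html snippet with its .json file.

    Returns a list of failure messages; an empty list means all passed.
    """
    failures = []
    for html_path in sorted(folder.glob("*.html")):
        json_path = html_path.with_suffix(".json")
        actual = parse(html_path.read_text(encoding="utf-8"))

        if not json_path.exists():
            if create_missing:
                # Record the current output so it can be edited by hand
                # into the desired expected output.
                json_path.write_text(
                    json.dumps(actual, indent=2), encoding="utf-8"
                )
            else:
                failures.append(f"{html_path.name}: missing expected output")
            continue

        expected = json.loads(json_path.read_text(encoding="utf-8"))
        if actual != expected:
            failures.append(
                f"{html_path.name}: output differs from {json_path.name}"
            )
    return failures
```

Keeping each input and expected output as a pair of files means that adding a new case only requires dropping a snippet into the folder, with no changes to test code.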

This avoids wasting time annotating entire documents when a single bug recurs hundreds of times in one document.

So we just take one instance (or a few instances) of it and put it in these little tests.

And the file-oriented structure makes this much easier to manage than keeping the inputs and outputs in the source code itself (as would be the case in regular unit tests).

@john0isaac
Contributor

Sorry, I was too busy to notify you that I will no longer be able to work on this issue due to my obligations.

@john0isaac john0isaac removed their assignment Feb 14, 2024
@Elijas Elijas removed the status:in-progress Work underway. Reach out if you're interested in helping! label Feb 14, 2024
@Elijas
Member Author

Elijas commented Feb 14, 2024

> Sorry, I was too busy to notify you that I will no longer be able to work on this issue due to my obligations.

No worries, thanks for letting us know!

@JeliHacker
Contributor

I'd like to work on this
