-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve remove_markup handling of Wikipedia headers #2622
base: develop
Are you sure you want to change the base?
Conversation
… fix it does. Merge branch 'fixing_2561' into develop
The link to the issue in the PR description is good - but ideally the title will also describe the matters affected. But also: see my comment on #2561 – despite the original reporter's preference that headings be stripped, I doubt that's the right default for everyone. |
I've taken the liberty of adjusting the title. @gojomo Perhaps we can include a |
@mpenkov A parameter-controlled option would certainly support the needs of the reporter, @0x0badc0de, and address my concern about inconveniencing others. But this need might be narrow enough that the right advice, to those who need it, is to run the ~1-liner to strip headers yourself if you need it. (Roughly just a |
See the further discussion on #2561 for clarification on what kind of fix the original reporter was expecting, and would be best. |
@@ -256,6 +262,9 @@ def remove_markup(text, promote_remaining=True, simplify_links=True): | |||
text = re.sub(RE_P13, '\n', text) # leave only cell content | |||
text = re.sub(RE_P17, '\n', text) # remove formatting lines | |||
|
|||
if not retain_heading_markup: | |||
text = re.sub(RE_P18, '', text) # remove headings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this retain the words of the heading? (I don't think so.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @gojomo,
The re.sub
function using RE_P18
does retain the words in the heading. Please refer to the screenshot attached where I test if this works on a jupyter notebook.
A comment below asks for unit tests; If you could guide me, I could do that and fix any issues that could have been introduced due to the changes. I would appreciate any pointers and your expectations from the tests.
Please add unit tests for your new functionaity. |
Fix #2561