Description
What steps will reproduce the problem?
1. Parse a page that contains anchor tags inside list items. Some of the text
is dropped from the output. For example, the text in the HTML shown below (from
the page for the year 1979) is not extracted correctly. The problem seems to be
in the handling of certain anchor tags (possibly a faulty regular expression?).
What is the expected output? What do you see instead?
For this HTML:
<li><a href="/wiki/May_27" title="May 27">May 27</a> – <a
href="/wiki/1979_Indianapolis_500" title="1979 Indianapolis 500">Indianapolis
500</a>: <a href="/wiki/Rick_Mears" title="Rick Mears">Rick Mears</a> wins the
race for the first time, and car owner <a href="/wiki/Roger_Penske"
title="Roger Penske">Roger Penske</a> for the second time.</li>
The extracted text is: * wins the race for the first time, and car owner
Roger Penske for the second time.
Instead of: * May 27 – Indianapolis 500: Rick Mears wins the race for the
first time, and car owner Roger Penske for the second time.
And for this HTML:
...
<li>The <a href="/wiki/United_States" title="United States">United States</a>
and the <a href="/wiki/People%27s_Republic_of_China" title="People's Republic
of China">People's Republic of China</a> establish full <a
href="/wiki/Sino-American_relations" title="Sino-American relations">diplomatic
relations</a>.</li>
...
The extracted text is: diplomatic relations.
Instead of: * The United States and the People's Republic of China establish
full diplomatic relations.
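For reference, the expected output for both snippets can be reproduced with a small sketch that keeps every text node and simply ignores the anchor tags, instead of matching them with a regular expression. This is not the project's code: it uses Python's standard html.parser module, the names TextExtractor and extract_text are made up for illustration, and rendering <li> as "* " is an assumption based on the expected output quoted above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores tags such as <a ...>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: a list item is rendered with a leading "* ".
        if tag == "li":
            self.parts.append("* ")

    def handle_data(self, data):
        # Text inside and outside anchors arrives here; keep all of it.
        self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the line breaks and repeated spaces left by wrapped markup.
    return " ".join("".join(parser.parts).split())


sample = ('<li>The <a href="/wiki/United_States"\n'
          'title="United States">United States</a> and more.</li>')
print(extract_text(sample))
```

Because the parser tokenizes the markup, an anchor whose attributes wrap onto the next line (as in both snippets above) is treated the same as one written on a single line.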
Cheers
Original issue reported on code.google.com by andre.bi...@gmail.com
on 25 May 2011 at 2:21