HTMLSentenceTokenizer (HTMLST)

A library which extracts sentences from HTML

HTMLST breaks down HTML documents into paragraph-like sections by taking into account the HTML5 Specification and web developers' typical usage of HTML tags and tokenizes the sentences in these sections using NLTK's Punkt Sentence Tokenizer to disambiguate sentence boundaries.

Example:

HTMLSentenceTokenizer can parse the following HTML in one line.

<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
<div>
    <h1>Header One</h1>
    Hello, my <span id="foo">name</span> is Geronimo. What's yours?
    <div>
        Here is a cool picture:
        <br>
        <img src="http://www.sourcecertain.com/img/Example.png">

        Now, isn't that nice?
    </div>

    He<span>r<b>e</b></span>'s a wei<mark>rd</mark> sen<b>t<i>e<span>n</span>c</i>e wit</b>h <i>lots</i> of inline tags.
</div>
</body>
</html>

The following code prints out the sentences of this HTML, which is stored in example_html_one.html.

from htmlst import HTMLSentenceTokenizer

example_html_one = open('example_html_one.html', 'r').read()
parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one)
print(parsed_sentences)

The output is ['Hello, my name is Geronimo.', "What's yours?", 'Here is a cool picture:', "Now, isn't that nice?", "Here's a weird sentence with lots of inline tags."].

Installation:

HTMLST is available on pip. To install, enter the following into terminal:

pip install htmlst

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
examples		examples
htmlst		htmlst
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

HTMLSentenceTokenizer (HTMLST)

Example:

Installation:

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

wardbradt/HTMLST

Folders and files

Latest commit

History

Repository files navigation

HTMLSentenceTokenizer (HTMLST)

Example:

Installation:

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages