Commit
feat: add ability to pass headers in partition_html (Unstructured-IO#397)

Also adds pytest-mock requirement, those fixtures are nice to have!

Implements issue/feature Unstructured-IO#396.
cragwolfe authored Mar 24, 2023
1 parent a4394f6 commit ce9fc26
Showing 21 changed files with 247 additions and 147 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,8 +1,9 @@
## 0.5.7-dev2
## 0.5.7-dev3

### Enhancements

* Refactored codebase using `exactly_one`
* Adds ability to pass headers when passing a url in partition_html()

### Features

14 changes: 6 additions & 8 deletions docs/requirements.txt
@@ -8,7 +8,7 @@ alabaster==0.7.13
# via sphinx
babel==2.12.1
# via sphinx
beautifulsoup4==4.11.2
beautifulsoup4==4.12.0
# via furo
certifi==2022.12.7
# via
@@ -20,13 +20,13 @@ docutils==0.18.1
# via
# sphinx
# sphinx-rtd-theme
furo==2022.12.7
furo==2023.3.23
# via -r requirements/build.in
idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.0.0
importlib-metadata==6.1.0
# via sphinx
jinja2==3.1.2
# via sphinx
@@ -52,6 +52,7 @@ sphinx==6.1.3
# furo
# sphinx-basic-ng
# sphinx-rtd-theme
# sphinxcontrib-jquery
sphinx-basic-ng==1.0.0b1
# via furo
sphinx-rtd-theme==1.2.0rc3
@@ -62,18 +63,15 @@ sphinxcontrib-devhelp==1.0.2
# via sphinx
sphinxcontrib-htmlhelp==2.0.1
# via sphinx
sphinxcontrib-jquery==3.0.0
sphinxcontrib-jquery==4.1
# via sphinx-rtd-theme
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-qthelp==1.0.3
# via sphinx
sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.14
urllib3==1.26.15
# via requests
zipp==3.15.0
# via importlib-metadata

# The following packages are considered to be unsafe in a requirements file:
# setuptools
21 changes: 20 additions & 1 deletion docs/source/bricks.rst
@@ -210,10 +210,13 @@ Examples:

The ``partition_html`` function partitions an HTML document and returns a list
of document ``Element`` objects. ``partition_html`` can take a filename, file-like
object, or string as input. The three examples below all produce the same output.
object, string, or url as input.

Examples:

These three invocations of partition_html() are essentially equivalent:


.. code:: python
from unstructured.partition.html import partition_html
@@ -228,6 +231,22 @@ Examples:
elements = partition_html(text=text)
The following illustrates fetching a url and partitioning the response content.

.. code:: python

  from unstructured.partition.html import partition_html

  elements = partition_html(url="https://python.org/")

  # you can also provide custom headers:

  elements = partition_html(url="https://python.org/",
                            headers={"User-Agent": "YourScriptName/1.0 ..."})

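As a rough illustration of what passing headers along with a url amounts to, the sketch below builds a GET request with optional custom headers using only the standard library. It is not the implementation from this commit: ``build_request`` and ``fetch_html`` are hypothetical names, and the use of ``urllib.request`` is an assumption made to keep the example dependency-free.

```python
from typing import Dict, Optional
from urllib.request import Request, urlopen


def build_request(url: str, headers: Optional[Dict[str, str]] = None) -> Request:
    """Hypothetical helper: build a GET request carrying optional custom headers."""
    # An empty dict is used when no headers are supplied.
    return Request(url, headers=headers or {})


def fetch_html(url: str, headers: Optional[Dict[str, str]] = None,
               timeout: float = 10.0) -> str:
    """Hypothetical helper: fetch a page and return its body as text."""
    with urlopen(build_request(url, headers), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch_html() does.
request = build_request(
    "https://python.org/",
    headers={"User-Agent": "YourScriptName/1.0"},
)
```

The string returned by ``fetch_html`` could then be handed to ``partition_html(text=...)``, which is the same entry point used in the earlier examples.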
``partition_pdf``
---------------------

2 changes: 1 addition & 1 deletion examples/sec-sentiment-analysis/fetch.py
@@ -65,7 +65,7 @@ def get_forms_by_cik(session: requests.Session, cik: Union[str, int]) -> dict:
response.raise_for_status()
content = json.loads(response.content)
recent_forms = content["filings"]["recent"]
form_types = {k: v for k, v in zip(recent_forms["accessionNumber"], recent_forms["form"])}
form_types = dict(zip(recent_forms["accessionNumber"], recent_forms["form"]))
return form_types


17 changes: 7 additions & 10 deletions requirements/base.txt
@@ -4,12 +4,9 @@
#
# pip-compile --output-file=requirements/base.txt
#
--extra-index-url https://pypi.ngc.nvidia.com
--trusted-host pypi.ngc.nvidia.com

anyio==3.6.2
# via httpcore
argilla==1.4.0
argilla==1.5.0
# via unstructured (setup.py)
backoff==2.2.1
# via argilla
@@ -40,7 +37,7 @@ idna==3.4
# anyio
# requests
# rfc3986
importlib-metadata==6.0.0
importlib-metadata==6.1.0
# via markdown
joblib==1.2.0
# via nltk
@@ -49,7 +46,7 @@ lxml==4.9.2
# python-docx
# python-pptx
# unstructured (setup.py)
markdown==3.4.1
markdown==3.4.3
# via unstructured (setup.py)
monotonic==1.6
# via argilla
@@ -59,7 +56,7 @@ numpy==1.23.5
# via
# argilla
# pandas
openpyxl==3.1.1
openpyxl==3.1.2
# via unstructured (setup.py)
packaging==23.0
# via argilla
@@ -71,7 +68,7 @@ pillow==9.4.0
# via
# python-pptx
# unstructured (setup.py)
pydantic==1.10.6
pydantic==1.10.7
# via argilla
pygments==2.14.0
# via rich
@@ -87,7 +84,7 @@ python-pptx==0.6.21
# via unstructured (setup.py)
pytz==2022.7.1
# via pandas
regex==2022.10.31
regex==2023.3.23
# via nltk
requests==2.28.2
# via unstructured (setup.py)
@@ -110,7 +107,7 @@ typing-extensions==4.5.0
# via
# pydantic
# rich
urllib3==1.26.14
urllib3==1.26.15
# via requests
wrapt==1.14.1
# via
14 changes: 6 additions & 8 deletions requirements/build.txt
@@ -8,7 +8,7 @@ alabaster==0.7.13
# via sphinx
babel==2.12.1
# via sphinx
beautifulsoup4==4.11.2
beautifulsoup4==4.12.0
# via furo
certifi==2022.12.7
# via
@@ -20,13 +20,13 @@ docutils==0.18.1
# via
# sphinx
# sphinx-rtd-theme
furo==2022.12.7
furo==2023.3.23
# via -r requirements/build.in
idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.0.0
importlib-metadata==6.1.0
# via sphinx
jinja2==3.1.2
# via sphinx
@@ -52,6 +52,7 @@ sphinx==6.1.3
# furo
# sphinx-basic-ng
# sphinx-rtd-theme
# sphinxcontrib-jquery
sphinx-basic-ng==1.0.0b1
# via furo
sphinx-rtd-theme==1.2.0rc3
@@ -62,18 +63,15 @@ sphinxcontrib-devhelp==1.0.2
# via sphinx
sphinxcontrib-htmlhelp==2.0.1
# via sphinx
sphinxcontrib-jquery==3.0.0
sphinxcontrib-jquery==4.1
# via sphinx-rtd-theme
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-qthelp==1.0.3
# via sphinx
sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.14
urllib3==1.26.15
# via requests
zipp==3.15.0
# via importlib-metadata

# The following packages are considered to be unsafe in a requirements file:
# setuptools