Merged
2 changes: 1 addition & 1 deletion docs/api.md
@@ -281,7 +281,7 @@ from textxtract.core.utils import FileInfo
 | `.docx` | `DOCXHandler` | `python-docx` |
 | `.doc` | `DOCHandler` | `antiword` |
 | `.md` | `MDHandler` | `markdown`, `beautifulsoup4` |
-| `.rtf` | `RTFHandler` | `pyrtf-ng` |
+| `.rtf` | `RTFHandler` | `striprtf` |
 | `.html`, `.htm` | `HTMLHandler` | `beautifulsoup4`, `lxml` |
 | `.csv` | `CSVHandler` | None |
 | `.json` | `JSONHandler` | None |
2 changes: 1 addition & 1 deletion docs/index.md
@@ -71,7 +71,7 @@ text = asyncio.run(extract_text())
 | PDF | `.pdf` | `pip install textxtract[pdf]` | PyMuPDF |
 | Word | `.docx` | `pip install textxtract[docx]` | python-docx |
 | Word Legacy | `.doc` | `pip install textxtract[doc]` | antiword |
-| Rich Text | `.rtf` | `pip install textxtract[rtf]` | pyrtf-ng |
+| Rich Text | `.rtf` | `pip install textxtract[rtf]` | striprtf |
 | HTML | `.html`, `.htm` | `pip install textxtract[html]` | beautifulsoup4 |
 | CSV | `.csv` | Built-in | stdlib |
 | JSON | `.json` | Built-in | stdlib |
4 changes: 2 additions & 2 deletions docs/installation.md
@@ -63,15 +63,15 @@ pip install textxtract[all]
 | `docx` | `python-docx` | `.docx` |
 | `doc` | `antiword` | `.doc` |
 | `md` | `markdown`, `beautifulsoup4` | `.md` |
-| `rtf` | `pyrtf-ng` | `.rtf` |
+| `rtf` | `striprtf` | `.rtf` |
 | `html` | `beautifulsoup4`, `lxml` | `.html`, `.htm` |
 | `xml` | `lxml` | `.xml` |
 | `all` | All of the above | All supported types |

 ## 🐍 Python Version Requirements

 - **Python 3.9 or higher** is required
-- Tested on Python 3.9, 3.10, 3.11, and 3.12
+- Tested on Python 3.9, 3.10, 3.11, 3.12, and 3.13

 ## 🔄 Upgrading

2 changes: 1 addition & 1 deletion docs/usage.md
@@ -87,7 +87,7 @@ text = await extractor.extract(file_bytes, "document.pdf")
 | .doc | [doc] | antiword |
 | .txt | | stdlib |
 | .md | [md] | markdown |
-| .rtf | [rtf] | pyrtf-ng |
+| .rtf | [rtf] | striprtf |
 | .html/.htm| [html] | beautifulsoup4 |
 | .csv | | stdlib |
 | .json | | stdlib |
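Reviewer note: the docs now name `striprtf` as the `.rtf` backend. A minimal sketch of what the underlying call might look like, assuming the handler relies on `striprtf`'s `rtf_to_text` function. The helper name `rtf_to_plain_text` is hypothetical and not taken from `RTFHandler`.

```python
def rtf_to_plain_text(rtf_source: str) -> str:
    """Convert RTF markup to plain text (hypothetical helper, not RTFHandler)."""
    # Imported lazily so this module still loads when the [rtf] extra
    # (striprtf) is not installed; the ImportError surfaces only on use.
    from striprtf.striprtf import rtf_to_text

    return rtf_to_text(rtf_source)


if __name__ == "__main__":
    try:
        print(rtf_to_plain_text(r"{\rtf1\ansi Hello}"))
    except ImportError:
        print("striprtf is not installed; try: pip install textxtract[rtf]")
```

Deferring the import keeps the dependency optional, which matches how the `[rtf]` extra is described in the tables above.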
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -24,15 +24,15 @@ pdf = ["pymupdf"]
 docx = ["python-docx"]
 doc = ["antiword"]
 md = ["markdown"]
-rtf = ["pyrtf-ng"]
+rtf = ["striprtf"]
 html = ["beautifulsoup4", "lxml"]
 xml = ["lxml"]
 all = [
 "pymupdf",
 "python-docx",
 "antiword",
 "markdown",
-"pyrtf-ng",
+"striprtf",
 "beautifulsoup4",
 "lxml"
 ]
3 changes: 1 addition & 2 deletions textxtract/core/registry.py
@@ -3,7 +3,6 @@
 import logging
 from typing import Dict, Type, Optional, List
 from functools import lru_cache
-from pathlib import Path

 from textxtract.core.base import FileTypeHandler
 from textxtract.core.exceptions import FileTypeNotSupportedError
@@ -73,7 +72,7 @@ def _load_default_handlers(self):

             self._handlers[".rtf"] = RTFHandler
         except ImportError:
-            logger.debug("RTF handler not available - pyrtf-ng not installed")
+            logger.debug("RTF handler not available - striprtf not installed")

         try:
             from textxtract.handlers.html import HTMLHandler
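Reviewer note: the hunk above follows an optional-dependency pattern, where a handler is registered only if its import succeeds and a debug message is logged otherwise. A minimal, self-contained sketch of that pattern follows. The function `load_optional_handlers` and the `(module, class, package)` triples are hypothetical illustrations, not names from `textxtract.core.registry`.

```python
import importlib
import logging
from typing import Dict, Tuple

logger = logging.getLogger(__name__)


def load_optional_handlers(
    candidates: Dict[str, Tuple[str, str, str]],
) -> Dict[str, type]:
    """Register one class per extension, skipping entries whose import fails.

    `candidates` maps an extension to a hypothetical
    (module name, class name, package to install) triple.
    """
    handlers: Dict[str, type] = {}
    for ext, (module_name, class_name, package) in candidates.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Same spirit as the debug message in the diff: report and move on.
            logger.debug("%s handler not available - %s not installed", ext, package)
            continue
        handlers[ext] = getattr(module, class_name)
    return handlers


# Stand-in entries: `json` always imports, the second module never does.
registered = load_optional_handlers(
    {
        ".json": ("json", "JSONDecoder", "stdlib"),
        ".rtf": ("module_that_does_not_exist_xyz", "RTFHandler", "striprtf"),
    }
)
```

Because the log message names the package to install, keeping it in sync with `pyproject.toml` (as this PR does) is what makes the message actionable.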
13 changes: 12 additions & 1 deletion textxtract/handlers/doc.py
@@ -42,10 +42,21 @@ def _extract_with_antiword(self, file_path: Path) -> str:
             raise ExtractionError("antiword extraction timed out")
         except subprocess.CalledProcessError as e:
             error_msg = e.stderr.decode() if e.stderr else str(e)
+            # Check if the error is due to missing libreoffice dependency
+            if (
+                "libreoffice" in error_msg.lower()
+                or "no such file or directory" in error_msg.lower()
+            ):
+                # Trigger fallback by raising FileNotFoundError
+                raise FileNotFoundError(
+                    "antiword requires libreoffice which is not available"
+                )
             raise ExtractionError(f"antiword extraction failed: {error_msg}")

     def _extract_with_fallback(
-        self, file_path: Path, config: Optional[dict] = None
+        self,
+        file_path: Path,
+        config: Optional[dict] = None,
     ) -> str:
         """Fallback extraction methods when antiword is not available."""


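Reviewer note: the new branch in `_extract_with_antiword` reclassifies some antiword failures as `FileNotFoundError` so that the fallback path runs. The sketch below isolates that classification so it can be exercised without antiword installed. It assumes the caller catches `FileNotFoundError` to trigger the fallback, which the diff comment implies but does not show. `ExtractionError` here is a local stand-in, and `classify_antiword_failure` and `extract_doc_demo` are hypothetical names.

```python
class ExtractionError(Exception):
    """Local stand-in for textxtract's ExtractionError."""


def classify_antiword_failure(stderr: bytes) -> None:
    """Raise the exception type that the diff's logic would pick for `stderr`."""
    error_msg = stderr.decode(errors="replace") if stderr else "unknown error"
    lowered = error_msg.lower()
    if "libreoffice" in lowered or "no such file or directory" in lowered:
        # Missing-dependency symptoms: signal the caller to use the fallback.
        raise FileNotFoundError("antiword requires libreoffice which is not available")
    raise ExtractionError(f"antiword extraction failed: {error_msg}")


def extract_doc_demo(stderr: bytes) -> str:
    """Assumed caller behaviour: fall back on FileNotFoundError only."""
    try:
        classify_antiword_failure(stderr)
    except FileNotFoundError:
        return "fallback"
    return "unreachable"  # classify_antiword_failure always raises
```

One thing worth checking in review: the substring `"no such file or directory"` is broad and would presumably also match a missing input document, in which case the fallback would run on a file that does not exist.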
1,397 changes: 1,397 additions & 0 deletions uv.lock

Large diffs are not rendered by default.