Merged
30 commits
e43632b
Initial work updating signatures.
afourney Mar 3, 2025
7bc6d82
Experimenting with new signatures.
afourney Mar 4, 2025
4129f30
More progress.
afourney Mar 4, 2025
df372fa
Progress on HTML converter.
afourney Mar 4, 2025
4d09a4c
Updating converters.
afourney Mar 4, 2025
7879028
Added Outlook messages.
afourney Mar 5, 2025
4a034da
Stream exiftool.
afourney Mar 5, 2025
c426cb8
Most converters are now working.
afourney Mar 5, 2025
a9ceb13
Added support for various audio files.
afourney Mar 5, 2025
736e0ae
Fixed exif warning test.
afourney Mar 5, 2025
b3d6009
Small cleanup.
afourney Mar 5, 2025
36a4980
Updated sample plugin to new Converter interface.
afourney Mar 5, 2025
aa57757
Updated plugin README.
afourney Mar 5, 2025
5f0b63b
Remove stale comments.
afourney Mar 5, 2025
cc38144
Updated project readme with notes about changes, and use-cases.
afourney Mar 5, 2025
4d097aa
Updated markdownify dependency.
afourney Mar 5, 2025
c281844
ported over unit tests from prev branch
KennyZhang1 Mar 5, 2025
30e5189
removed dupe priority setting
KennyZhang1 Mar 5, 2025
8c3dd01
black formatting
KennyZhang1 Mar 5, 2025
a96a6a0
more formatting
KennyZhang1 Mar 5, 2025
1ce769e
Fixed formatting.
afourney Mar 5, 2025
1eb8b92
Add type hint, resolving circular import.
afourney Mar 5, 2025
fe1d57a
Updated DocumentConverter documentation.
afourney Mar 5, 2025
aa94bce
Bumped version.
afourney Mar 5, 2025
84f8198
Fixed many mypy errors.
afourney Mar 6, 2025
a7ae7c5
Move priority to outside DocumentConverter, allowing them to be repri…
afourney Mar 6, 2025
ae5fd74
Updated README
afourney Mar 6, 2025
ce792ec
Fixed typos.
afourney Mar 6, 2025
aa86ae9
Fixed flow of README.
afourney Mar 6, 2025
3a58865
Update .gitattributes
afourney Mar 6, 2025
3 changes: 2 additions & 1 deletion .gitattributes
@@ -1 +1,2 @@
tests/test_files/** linguist-vendored
packages/markitdown/tests/test_files/** linguist-vendored
packages/markitdown-sample-plugin/tests/test_files/** linguist-vendored
17 changes: 15 additions & 2 deletions README.md
@@ -7,9 +7,11 @@
> [!IMPORTANT]
> Breaking changes between 0.0.1 and 0.0.2:
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install markitdown[all]` to have backward-compatible behavior.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.

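To illustrate the second breaking change, the sketch below converts in-memory bytes through a file-like stream, so nothing is written to disk. `convert_stream_to_text` is a hypothetical stand-in written for this note and is not part of MarkItDown; only the idea of reading from a binary stream instead of a path comes from the README text above.

```python
import io
from typing import BinaryIO


def convert_stream_to_text(file_stream: BinaryIO, charset: str = "utf-8") -> str:
    """Hypothetical stand-in for a stream-based converter (not MarkItDown API)."""
    # Read bytes from any binary file-like object; no path or temp file is needed.
    return file_stream.read().decode(charset).strip()


# In-memory bytes can be wrapped in a stream and passed in directly.
buffer = io.BytesIO(b"# Title\n\nBody text.\n")
in_memory_result = convert_stream_to_text(buffer)
print(in_memory_result)
```

The same function would accept a file opened with `open(path, "rb")`, which is why a stream-based signature covers the old path-based use case as well.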
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
It supports:
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including headings, lists, tables, links, etc.). While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools and may not be the best option for high-fidelity document conversions for human consumption.

At present, MarkItDown supports:

- PDF
- PowerPoint
@@ -23,6 +25,17 @@ It supports:
- Youtube URLs
- ... and more!

## Why Markdown?

Markdown is extremely close to plain text, with minimal markup or formatting, but still
provides a way to represent important document structure. Mainstream LLMs, such as
OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their
responses unprompted. This suggests that they have been trained on vast amounts of
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
are also highly token-efficient.

## Installation

To install MarkItDown, use pip: `pip install markitdown[all]`. Alternatively, you can install it from the source:

```bash
51 changes: 33 additions & 18 deletions packages/markitdown-sample-plugin/README.md
@@ -10,23 +10,38 @@ This project shows how to create a sample plugin for MarkItDown. The most import
Next, implement your custom DocumentConverter:

```python
from typing import Union
from markitdown import DocumentConverter, DocumentConverterResult
from typing import BinaryIO, Any
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo

class RtfConverter(DocumentConverter):
    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not an RTF file
        extension = kwargs.get("file_extension", "")
        if extension.lower() != ".rtf":
            return None

        # Implement the conversion logic here ...

        # Return the result
        return DocumentConverterResult(
            title=title,
            text_content=text_content,
        )

    def __init__(
        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
    ):
        super().__init__(priority=priority)

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:

        # Implement logic to check if the file stream is an RTF file
        # ...
        raise NotImplementedError()


    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:

        # Implement logic to convert the file stream to Markdown
        # ...
        raise NotImplementedError()
```
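The two-method interface above suggests a check-then-convert flow. The sketch below shows one way a caller could drive it. `FakeStreamInfo`, `UppercaseTxtConverter`, and `run_first_accepting` are hypothetical stand-ins written for this example, and rewinding the stream after `accepts()` is an assumption of the sketch, not something this README states.

```python
import io
from dataclasses import dataclass
from typing import Any, BinaryIO, List, Optional


@dataclass
class FakeStreamInfo:
    """Hypothetical stand-in for StreamInfo; only the fields used here."""

    mimetype: Optional[str] = None
    extension: Optional[str] = None


class UppercaseTxtConverter:
    """Hypothetical converter following the accepts()/convert() shape."""

    def accepts(
        self, file_stream: BinaryIO, stream_info: FakeStreamInfo, **kwargs: Any
    ) -> bool:
        # Decide from metadata alone, without consuming the stream.
        return (stream_info.extension or "").lower() == ".txt"

    def convert(
        self, file_stream: BinaryIO, stream_info: FakeStreamInfo, **kwargs: Any
    ) -> str:
        return file_stream.read().decode("utf-8").upper()


def run_first_accepting(
    converters: List[Any], file_stream: BinaryIO, stream_info: FakeStreamInfo
) -> Optional[str]:
    """Return the output of the first converter that accepts the stream, else None."""
    start = file_stream.tell()
    for converter in converters:
        accepted = converter.accepts(file_stream, stream_info)
        # Assumption: rewind in case accepts() peeked at the stream contents.
        file_stream.seek(start)
        if accepted:
            return converter.convert(file_stream, stream_info)
    return None


converted = run_first_accepting(
    [UppercaseTxtConverter()], io.BytesIO(b"hello"), FakeStreamInfo(extension=".txt")
)
skipped = run_first_accepting(
    [UppercaseTxtConverter()], io.BytesIO(b"hello"), FakeStreamInfo(extension=".rtf")
)
print(converted, skipped)
```

Splitting the check from the conversion lets a converter decline a stream with a plain `False`, where the old signature signalled this by returning `None` from `convert`.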

Next, make sure your package implements and exports the following:
@@ -71,10 +86,10 @@ Once the plugin package is installed, verify that it is available to MarkItDown
markitdown --list-plugins
```

To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert a PDF:
To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert an RTF file:

```bash
markitdown --use-plugins path-to-file.pdf
markitdown --use-plugins path-to-file.rtf
```

In Python, plugins can be enabled as follows:
@@ -83,7 +98,7 @@ In Python, plugins can be enabled as follows:
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("path-to-file.pdf")
result = md.convert("path-to-file.rtf")
print(result.text_content)
```

2 changes: 1 addition & 1 deletion packages/markitdown-sample-plugin/pyproject.toml
@@ -24,7 +24,7 @@ classifiers = [
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
"markitdown",
"markitdown>=0.0.2a2",
"striprtf",
]

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.0.1a2"
__version__ = "0.0.1a3"
Original file line number Diff line number Diff line change
@@ -1,12 +1,26 @@
from typing import Union
import locale
from typing import BinaryIO, Any
from striprtf.striprtf import rtf_to_text

from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult
from markitdown import (
MarkItDown,
DocumentConverter,
DocumentConverterResult,
StreamInfo,
)


__plugin_interface_version__ = (
1 # The version of the plugin interface that this plugin uses
)

ACCEPTED_MIME_TYPE_PREFIXES = [
"text/rtf",
"application/rtf",
]

ACCEPTED_FILE_EXTENSIONS = [".rtf"]


def register_converters(markitdown: MarkItDown, **kwargs):
"""
@@ -22,18 +36,41 @@ class RtfConverter(DocumentConverter):
    Converts an RTF file to text in the simplest possible way.
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
        # Bail if not a RTF
        extension = kwargs.get("file_extension", "")
        if extension.lower() != ".rtf":
            return None
    def __init__(
        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
    ):
        super().__init__(priority=priority)

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> bool:
        mimetype = (stream_info.mimetype or "").lower()
        extension = (stream_info.extension or "").lower()

        if extension in ACCEPTED_FILE_EXTENSIONS:
            return True

        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
            if mimetype.startswith(prefix):
                return True

        return False

        # Read the RTF file
        with open(local_path, "r") as f:
            rtf = f.read()
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        # Read the file stream into a str using the provided charset encoding, or using the system default
        encoding = stream_info.charset or locale.getpreferredencoding()
        stream_data = file_stream.read().decode(encoding)

        # Return the result
        return DocumentConverterResult(
            title=None,
            text_content=rtf_to_text(rtf),
            markdown=rtf_to_text(stream_data),
        )
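The new `convert` method above decodes the stream with `stream_info.charset`, falling back to the system's preferred encoding. The sketch below isolates that fallback so it can be tried without MarkItDown installed. `decode_stream` is a hypothetical helper written for this note, not part of the plugin.

```python
import io
import locale
from typing import BinaryIO, Optional


def decode_stream(file_stream: BinaryIO, charset: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the charset fallback shown in the diff."""
    # Prefer the declared charset; otherwise use the system default encoding.
    encoding = charset or locale.getpreferredencoding()
    return file_stream.read().decode(encoding)


# With an explicit charset the result does not depend on the host's locale.
sample = "caf\u00e9"
latin1_text = decode_stream(io.BytesIO(sample.encode("latin-1")), charset="latin-1")
utf8_text = decode_stream(io.BytesIO(sample.encode("utf-8")), charset="utf-8")
print(latin1_text == utf8_text)
```

When no charset is supplied, the outcome depends on the host's locale, so the same bytes may decode differently on different machines.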
20 changes: 12 additions & 8 deletions packages/markitdown-sample-plugin/tests/test_sample_plugin.py
@@ -2,7 +2,7 @@
import os
import pytest

from markitdown import MarkItDown
from markitdown import MarkItDown, StreamInfo
from markitdown_sample_plugin import RtfConverter

TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
@@ -15,18 +15,22 @@

def test_converter() -> None:
    """Tests the RTF converter directly."""
    converter = RtfConverter()
    result = converter.convert(
        os.path.join(TEST_FILES_DIR, "test.rtf"), file_extension=".rtf"
    )
    with open(os.path.join(TEST_FILES_DIR, "test.rtf"), "rb") as file_stream:
        converter = RtfConverter()
        result = converter.convert(
            file_stream=file_stream,
            stream_info=StreamInfo(
                mimetype="text/rtf", extension=".rtf", filename="test.rtf"
            ),
        )

    for test_string in RTF_TEST_STRINGS:
        assert test_string in result.text_content
        for test_string in RTF_TEST_STRINGS:
            assert test_string in result.text_content


def test_markitdown() -> None:
    """Tests that MarkItDown correctly loads the plugin."""
    md = MarkItDown()
    md = MarkItDown(enable_plugins=True)
    result = md.convert(os.path.join(TEST_FILES_DIR, "test.rtf"))

    for test_string in RTF_TEST_STRINGS:
7 changes: 5 additions & 2 deletions packages/markitdown/pyproject.toml
@@ -26,7 +26,7 @@ classifiers = [
dependencies = [
"beautifulsoup4",
"requests",
"markdownify~=0.14.1",
"markdownify",
"puremagic",
"pathvalidate",
"charset-normalizer",
@@ -78,11 +78,14 @@ extra-dependencies = [
]

[tool.hatch.envs.types]
features = ["all"]
extra-dependencies = [
"openai",
"mypy>=1.0.0",
]

[tool.hatch.envs.types.scripts]
check = "mypy --install-types --non-interactive {args:src/markitdown tests}"
check = "mypy --install-types --non-interactive --ignore-missing-imports {args:src/markitdown tests}"

[tool.coverage.run]
source_pkgs = ["markitdown", "tests"]
2 changes: 1 addition & 1 deletion packages/markitdown/src/markitdown/__about__.py
@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.0.2a1"
__version__ = "0.0.2a2"
12 changes: 10 additions & 2 deletions packages/markitdown/src/markitdown/__init__.py
@@ -3,15 +3,20 @@
# SPDX-License-Identifier: MIT

from .__about__ import __version__
from ._markitdown import MarkItDown
from ._markitdown import (
MarkItDown,
PRIORITY_SPECIFIC_FILE_FORMAT,
PRIORITY_GENERIC_FILE_FORMAT,
)
from ._base_converter import DocumentConverterResult, DocumentConverter
from ._stream_info import StreamInfo
from ._exceptions import (
MarkItDownException,
MissingDependencyException,
FailedConversionAttempt,
FileConversionException,
UnsupportedFormatException,
)
from .converters import DocumentConverter, DocumentConverterResult

__all__ = [
"__version__",
@@ -23,4 +28,7 @@
"FailedConversionAttempt",
"FileConversionException",
"UnsupportedFormatException",
"StreamInfo",
"PRIORITY_SPECIFIC_FILE_FORMAT",
"PRIORITY_GENERIC_FILE_FORMAT",
]
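With the priority constants now exported at package level, converters can in principle be ordered independently of their class definitions. The sketch below illustrates priority-ordered registration in isolation. The numeric values and the rule that lower values are tried first are assumptions made for this example, since the diff does not show them, and `ToyRegistry` is a hypothetical stand-in, not MarkItDown's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed values for illustration only; the diff does not show the real ones.
PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0


@dataclass
class ToyRegistry:
    """Hypothetical registry that keeps (priority, name) pairs."""

    _entries: List[Tuple[float, str]] = field(default_factory=list)

    def register(
        self, name: str, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT
    ) -> None:
        self._entries.append((priority, name))

    def ordered_names(self) -> List[str]:
        # sorted() is stable, so equal priorities keep registration order.
        # Assumption: lower priority values are tried first.
        return [name for _, name in sorted(self._entries, key=lambda e: e[0])]


registry = ToyRegistry()
registry.register("plain_text", priority=PRIORITY_GENERIC_FILE_FORMAT)
registry.register("rtf")  # defaults to the specific-format priority
registry.register("docx")
order = registry.ordered_names()
print(order)
```

Keeping the priority in the registry entry, not on the converter object, means the same converter could be registered again at a different priority without subclassing it.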