Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 26 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,8 @@
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)

> [!IMPORTANT]
> MarkItDown 0.0.2 alpha 1 (0.0.2a1) introduces a plugin-based architecture. As much as was possible, command-line and Python interfaces have remained the same as 0.0.1a3 to support backward compatibility. Please report any issues you encounter. Some interface changes may yet occur as we continue to refine MarkItDown to a first non-alpha release.
> Breaking changes between 0.0.1 to 0.0.2:
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install markitdown[all]` to have backward-compatible behavior.

MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
It supports:
Expand All @@ -22,12 +23,12 @@ It supports:
- Youtube URLs
- ... and more!

To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source:
To install MarkItDown, use pip: `pip install markitdown[all]`. Alternatively, you can install it from the source:

```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown
pip install -e packages/markitdown[all]
```

## Usage
Expand All @@ -50,6 +51,28 @@ You can also pipe content:
cat path-to-file.pdf | markitdown
```

### Optional Dependencies
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:

```bash
pip install markitdown[pdf, docx, pptx]
```

will install only the dependencies for PDF, DOCX, and PPTX files.

At the moment, the following optional dependencies are available:

* `[all]` Installs all optional dependencies
* `[pptx]` Installs dependencies for PowerPoint files
* `[docx]` Installs dependencies for Word files
* `[xlsx]` Installs dependencies for Excel files
* `[xls]` Installs dependencies for older Excel files
* `[pdf]` Installs dependencies for PDF files
* `[outlook]` Installs dependencies for Outlook messages
* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
Comment on lines +73 to +74

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference...

Suggested change
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
* `[audio]` Installs dependencies for audio transcription of wav and mp3 files
* `[youtube]` Installs dependencies for fetching YouTube video transcription

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree shorter is better. However, we can still get metadata for YouTube (title, video descriptions etc.) even if the transcription library is not installed. Likewise we can get metadata for audio files (runtime, track title, artist, album, etc) even when transcription is not enabled. To this end, I felt it was worth the added characters for precision.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly disagree; developer precision has nothing to do with user experience.


### Plugins

MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:
Expand Down
4 changes: 2 additions & 2 deletions packages/markitdown/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,15 +10,15 @@
From PyPI:

```bash
pip install markitdown
pip install markitdown[all]
```

From source:

```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown
pip install -e packages/markitdown[all]
```

## Usage
Expand Down
36 changes: 28 additions & 8 deletions packages/markitdown/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -26,25 +26,36 @@ classifiers = [
dependencies = [
"beautifulsoup4",
"requests",
"mammoth",
"markdownify~=0.14.1",
"numpy",
"puremagic",
"pathvalidate",
"charset-normalizer",
]

[project.optional-dependencies]
all = [
"python-pptx",
"mammoth",
"pandas",
"openpyxl",
"xlrd",
"pdfminer.six",
"puremagic",
"pydub",
"olefile",
"youtube-transcript-api",
"pydub",
"SpeechRecognition",
"pathvalidate",
"charset-normalizer",
"openai",
"youtube-transcript-api",
"azure-ai-documentintelligence",
"azure-identity"
]
pptx = ["python-pptx"]
docx = ["mammoth"]
xlsx = ["pandas", "openpyxl"]
xls = ["pandas", "xlrd"]
pdf = ["pdfminer.six"]
outlook = ["olefile"]
audio-transcription = ["pydub", "SpeechRecognition"]
youtube-transcription = ["youtube-transcript-api"]
az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]

[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
Expand All @@ -57,6 +68,15 @@ path = "src/markitdown/__about__.py"
[project.scripts]
markitdown = "markitdown.__main__:main"

[tool.hatch.envs.default]
features = ["all"]

[tool.hatch.envs.hatch-test]
features = ["all"]
extra-dependencies = [
"openai",
]

[tool.hatch.envs.types]
extra-dependencies = [
"mypy>=1.0.0",
Expand Down
4 changes: 2 additions & 2 deletions packages/markitdown/src/markitdown/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
from ._markitdown import MarkItDown
from ._exceptions import (
MarkItDownException,
ConverterPrerequisiteException,
MissingDependencyException,
FailedConversionAttempt,
FileConversionException,
UnsupportedFormatException,
Expand All @@ -19,7 +19,7 @@
"DocumentConverter",
"DocumentConverterResult",
"MarkItDownException",
"ConverterPrerequisiteException",
"MissingDependencyException",
"FailedConversionAttempt",
"FileConversionException",
"UnsupportedFormatException",
Expand Down
24 changes: 16 additions & 8 deletions packages/markitdown/src/markitdown/_exceptions.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,12 @@
from typing import Optional, List, Any

MISSING_DEPENDENCY_MESSAGE = """{converter} recognized the input as a potential {extension} file, but the dependencies needed to read {extension} files have not been installed. To resolve this error, include the optional dependency [{feature}] or [all] when installing MarkItDown. For example:

* pip install markitdown[{feature}]
* pip install markitdown[all]
* pip install markitdown[{feature}, ...]
* etc."""


class MarkItDownException(Exception):
"""
Expand All @@ -9,15 +16,16 @@ class MarkItDownException(Exception):
pass


class ConverterPrerequisiteException(MarkItDownException):
class MissingDependencyException(MarkItDownException):
"""
Thrown when instantiating a DocumentConverter in cases where
a required library or dependency is not installed, an API key
is not set, or some other prerequisite is not met.

This is not necessarily a fatal error. If thrown during
MarkItDown's plugin loading phase, the converter will simply be
skipped, and a warning will be issued.
Converters shipped with MarkItDown may depend on optional
dependencies. This exception is thrown when a converter's
convert() method is called, but the required dependency is not
installed. This is not necessarily a fatal error, as the converter
will simply be skipped (an error will bubble up only if no other
suitable converter is found).

Error messages should clearly indicate which dependency is missing.
"""

pass
Expand Down
1 change: 0 additions & 1 deletion packages/markitdown/src/markitdown/_markitdown.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,6 @@
from ._exceptions import (
FileConversionException,
UnsupportedFormatException,
ConverterPrerequisiteException,
FailedConversionAttempt,
)

Expand Down
Original file line number Diff line number Diff line change
@@ -1,16 +1,24 @@
from typing import Any, Union
import re

# Azure imports
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
AnalyzeDocumentRequest,
AnalyzeResult,
DocumentAnalysisFeature,
)
from azure.identity import DefaultAzureCredential
import sys

from ._base import DocumentConverter, DocumentConverterResult
from .._exceptions import MissingDependencyException

# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
AnalyzeDocumentRequest,
AnalyzeResult,
DocumentAnalysisFeature,
)
from azure.identity import DefaultAzureCredential
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()


# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
Expand All @@ -30,6 +38,16 @@ def __init__(
):
super().__init__(priority=priority)

# Raise an error if the dependencies are not available.
# This is different than other converters since this one isn't even instantiated
# unless explicitly requested.
if _dependency_exc_info is not None:
raise MissingDependencyException(
"DocumentIntelligenceConverter requires the optional dependency [az-doc-intel] (or [all]) to be installed. E.g., `pip install markitdown[az-doc-intel]`"
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # Restore the original traceback

self.endpoint = endpoint
self.api_version = api_version
self.doc_intel_client = DocumentIntelligenceClient(
Expand Down
26 changes: 24 additions & 2 deletions packages/markitdown/src/markitdown/converters/_docx_converter.py
Original file line number Diff line number Diff line change
@@ -1,13 +1,23 @@
from typing import Union
import sys

import mammoth
from typing import Union

from ._base import (
DocumentConverterResult,
)

from ._base import DocumentConverter
from ._html_converter import HtmlConverter
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE

# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import mammoth
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()


class DocxConverter(HtmlConverter):
Expand All @@ -26,6 +36,18 @@ def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
if extension.lower() != ".docx":
return None

# Check: the dependencies
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".docx",
feature="docx",
)
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # Restore the original traceback

result = None
with open(local_path, "rb") as docx_file:
style_map = kwargs.get("style_map", None)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

class ImageConverter(MediaConverter):
"""
Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an llm_client is configured).
Converts images to markdown via extraction of metadata (if `exiftool` is installed), and description via a multimodal LLM (if an llm_client is configured).
"""

def __init__(
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,16 @@
import olefile
import sys
from typing import Any, Union
from ._base import DocumentConverter, DocumentConverterResult
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE

# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import olefile
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()


class OutlookMsgConverter(DocumentConverter):
Expand All @@ -24,6 +34,18 @@ def convert(
if extension.lower() != ".msg":
return None

# Check: the dependencies
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".msg",
feature="outlook",
)
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # Restore the original traceback

try:
msg = olefile.OleFileIO(local_path)
# Extract email metadata
Expand Down Expand Up @@ -59,10 +81,12 @@ def convert(
f"Could not convert MSG file '{local_path}': {str(e)}"
)

def _get_stream_data(
self, msg: olefile.OleFileIO, stream_path: str
) -> Union[str, None]:
def _get_stream_data(self, msg: Any, stream_path: str) -> Union[str, None]:
"""Helper to safely extract and decode stream data from the MSG file."""
assert isinstance(
msg, olefile.OleFileIO
) # Ensure msg is of the correct type (type hinting is not possible with the optional olefile package)

try:
if msg.exists(stream_path):
data = msg.openstream(stream_path).read()
Expand Down
26 changes: 24 additions & 2 deletions packages/markitdown/src/markitdown/converters/_pdf_converter.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,17 @@
import pdfminer
import pdfminer.high_level
import sys
from typing import Union
from ._base import DocumentConverter, DocumentConverterResult
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE

# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import pdfminer
import pdfminer.high_level
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()


class PdfConverter(DocumentConverter):
Expand All @@ -20,6 +30,18 @@ def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
if extension.lower() != ".pdf":
return None

# Check the dependencies
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".pdf",
feature="pdf",
)
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # Restore the original traceback

return DocumentConverterResult(
title=None,
text_content=pdfminer.high_level.extract_text(local_path),
Expand Down
Loading