Extract text from any document, with more power and a wider range of supported file extensions. No more muss. No more fuss.
- Install Package -
pip install textract-plus
Import and Extract:
import textractplus as tp
text = tp.process('/path/to/document')
print(text)
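When several documents need processing, it can help to collect failures instead of stopping at the first unreadable file. The following is a minimal sketch under stated assumptions: `extract_many` and its `extractor` argument are hypothetical names introduced here for illustration, not part of the package, and only the `tp.process` call comes from the example above. The result is decoded only if it happens to be bytes, since the return type is not stated here.

```python
def _default_extractor(path):
    # Imported lazily so this helper can be loaded (and tested with a
    # stand-in extractor) even where textract-plus is not installed.
    import textractplus as tp
    return tp.process(path)


def extract_many(paths, extractor=None):
    """Run an extractor over several paths.

    Returns a (texts, errors) pair of dicts keyed by the path string.
    Hypothetical helper, not part of textract-plus.
    """
    extractor = extractor or _default_extractor
    texts = {}
    errors = {}
    for path in paths:
        key = str(path)
        try:
            result = extractor(key)
            # Assumption: the extractor may return bytes or str; handle both.
            if isinstance(result, bytes):
                result = result.decode("utf-8", errors="replace")
            texts[key] = result
        except Exception as exc:  # keep going; record what failed and why
            errors[key] = str(exc)
    return texts, errors
```

Calling `extract_many(['/path/to/a', '/path/to/b'])` would use `tp.process` for each path; passing a custom `extractor` makes the helper easy to exercise without any real documents.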
Textract Plus supports a growing list of file types for text extraction, more than textract itself. If you don't see your favorite file type here, please recommend it by either mentioning it on the issue tracker or by :ref:`contributing a pull request <contributing>`.
- .csv via python builtins
- .tsv and .tab via python builtins
- .doc via antiword
- .docx via python-docx2txt
- .eml via python builtins
- .epub via ebooklib
- .gif via tesseract-ocr
- .jpg and .jpeg via tesseract-ocr
- .json via python builtins
- .html and .htm via beautifulsoup4
- .mp3 via sox, SpeechRecognition, and pocketsphinx
- .msg via msg-extractor
- .odt via python builtins
- .ogg via sox, SpeechRecognition, and pocketsphinx
- .pdf via pdftotext (default) or pdfminer.six
- .png via tesseract-ocr
- .pptx via python-pptx
- .ps via ps2ascii
- .rtf via unrtf
- .tiff and .tif via tesseract-ocr
- .txt via python builtins
- .wav via SpeechRecognition and pocketsphinx
- .xlsx via xlrd
- .xls via xlrd
- .dotx via docx2python
- .docm via docx2python
- .pptm via python-pptx
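The file types listed above can be turned into a quick pre-flight check before calling the extractor. This is a minimal sketch, assuming the file type is judged by its extension alone; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, not part of the textract-plus API, and the set is copied by hand from the lists above, so it may lag behind the package.

```python
from pathlib import Path

# Hand-copied from the supported file types listed above (assumption:
# this mirrors the package's own list only as of this document).
SUPPORTED_EXTENSIONS = frozenset({
    ".csv", ".tsv", ".tab", ".doc", ".docx", ".eml", ".epub", ".gif",
    ".jpg", ".jpeg", ".json", ".html", ".htm", ".mp3", ".msg", ".odt",
    ".ogg", ".pdf", ".png", ".pptx", ".ps", ".rtf", ".tiff", ".tif",
    ".txt", ".wav", ".xlsx", ".xls",
    # additional types from the second list
    ".dotx", ".docm", ".pptm",
})


def is_supported(path):
    """Return True if the path's final extension is in the list above.

    The comparison is case-insensitive; files without an extension are
    reported as unsupported. Hypothetical helper for illustration.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only looks at the file name; it says nothing about whether the external tool behind a given type (for example antiword or tesseract-ocr) is installed.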