The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
Updated Apr 15, 2025 - Java
Elasticsearch File System Crawler (FS Crawler)
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
A cross-platform command line tool for parallelised content extraction and analysis.
ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika, and Apache OODT to ingest tens of millions of files in place (images, but it could be extended to other file types) and to extract metadata and OCR information from those files using Tika and Tesseract OCR.
Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.
Distributed, fault tolerant batch processing for Natural Language Applications and Search, using remote partitioning
Apache NiFi Custom Processor Extracting Text From Files with Apache Tika
📄🚀 Unleash a powerful Document Search Engine with Apache NiFi for lightning-fast, comprehensive text indexing and search.
Search Engine projects
Incremental crawling capabilities for Apache Tika. Crawl content from, for example, file systems, HTTP(S) sources (web crawling), IMAP(S) servers, or your own arbitrary data sources. LeechCrawler offers additional Tika parsers that provide these crawling capabilities.
Tika per-page PDF extractor server returning content as JSON.
Content-Type and language recognition library
Automatically parses PDF results provided by Pune University into a database, generates reports, and notifies students of their results by email.
Tika detector for MKV and WebM