forked from langchain4j/langchain4j
Implementing RAG applications is hard, especially for those who are just getting started exploring LLMs and RAG. This PR introduces an "Easy RAG" feature that should help developers get started with RAG as easily as possible. With it, there is no need to learn about chunking/splitting/segmentation, embeddings, embedding models, vector databases, retrieval techniques, and other RAG-related concepts. This is similar to how one can simply upload one or multiple files into the [OpenAI Assistants API](https://platform.openai.com/docs/assistants/overview) and the LLM will automagically know about their contents when answering questions. Easy RAG uses a local embedding model running on your CPU (GPU support can be added later). Your files are ingested into an in-memory embedding store. Please note that "Easy RAG" will not replace manual RAG setups, and especially not [advanced RAG techniques](langchain4j#538), but it provides an easier way to get started with RAG. The quality of "Easy RAG" should be sufficient for demos, proofs of concept, and getting started.
To use "Easy RAG", simply import the `langchain4j-easy-rag` dependency, which includes everything needed to do RAG:

- Apache Tika document loader (to parse all document types automatically)
- Quantized [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) in-process embedding model, which has an impressive (for its size) 51.68 retrieval [score](https://huggingface.co/spaces/mteb/leaderboard)

Here is the proposed API:

```java
List<Document> documents = FileSystemDocumentLoader.loadDocuments(directoryPath); // one can also load documents recursively and filter with glob/regex

EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>(); // we will use an in-memory embedding store for simplicity

EmbeddingStoreIngestor.ingest(documents, embeddingStore);

Assistant assistant = AiServices.builder(Assistant.class)
        .chatLanguageModel(model)
        .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
        .build();

String answer = assistant.chat("Who is Charlie?"); // Charlie is a carrot...
```

`FileSystemDocumentLoader` in the above code loads documents using a `DocumentParser` available on the classpath via SPI, in this case the `ApacheTikaDocumentParser` imported with the `langchain4j-easy-rag` dependency.

The `EmbeddingStoreIngestor` in the above code:

- splits documents into smaller text segments using a `DocumentSplitter` loaded via SPI from the `langchain4j-easy-rag` dependency. Currently it uses `DocumentSplitters.recursive(300, 30, new HuggingFaceTokenizer())`
- embeds text segments using an `AllMiniLmL6V2QuantizedEmbeddingModel` loaded via SPI from the `langchain4j-easy-rag` dependency
- stores text segments and their embeddings in the specified embedding store

When using `InMemoryEmbeddingStore`, one can serialize/persist it into a JSON string or into a file. This way, one can skip loading and embedding documents on each application run.
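The 300/30 arguments above are the segment size and overlap. As a rough illustration only (not the actual `DocumentSplitters.recursive` implementation, which works on tokens and tries to respect paragraph and sentence boundaries), a character-based splitter with overlap can be sketched like this:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitterSketch {

    // Illustrative only: splits text into windows of at most maxChars characters,
    // where consecutive windows share `overlap` characters. Overlap helps keep
    // context that would otherwise be cut at a segment boundary.
    static List<String> split(String text, int maxChars, int overlap) {
        List<String> segments = new ArrayList<>();
        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            segments.add(text.substring(start, end));
            if (end == text.length()) break;
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(split("abcdefghij", 4, 2)); // [abcd, cdef, efgh, ghij]
    }
}
```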
It is easy to customize the ingestion in the above code; just change

```java
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
```

into

```java
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        //.documentTransformer(...)    // optionally transform (clean, enrich, etc.) documents before splitting
        //.documentSplitter(...)       // optionally specify another splitter
        //.textSegmentTransformer(...) // optionally transform (clean, enrich, etc.) segments before embedding
        //.embeddingModel(...)         // optionally specify another embedding model to use for embedding
        .embeddingStore(embeddingStore)
        .build();
ingestor.ingest(documents);
```

Over time, we can add an auto-eval feature that will find the most suitable hyperparameters for the given documents (e.g. which embedding model to use, which splitting method, possibly advanced RAG techniques, etc.) so that "Easy RAG" can be comparable to "advanced RAG".

Related: langchain4j/langchain4j-embeddings#16

---------

Co-authored-by: dliubars <dliubars@redhat.com>
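Under the hood, the retrieval step performed by `EmbeddingStoreContentRetriever` boils down to ranking stored segment embeddings by similarity to the query embedding. A minimal sketch of the cosine-similarity scoring (illustrative only; the real store also handles IDs, metadata, and max-results/min-score settings):

```java
public class CosineSketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|). An embedding store scores
    // each stored segment vector against the query vector with a measure
    // like this and returns the top matches.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};
        System.out.println(cosine(query, new double[]{1.0, 0.0})); // same direction -> 1.0
        System.out.println(cosine(query, new double[]{0.0, 1.0})); // orthogonal -> 0.0
    }
}
```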
1 parent 6af51a5 · commit 2f425da · Showing 28 changed files with 1,006 additions and 61 deletions.
61 changes: 61 additions & 0 deletions
document-parsers/langchain4j-document-parser-apache-tika/pom.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-parent</artifactId>
        <version>0.29.0-SNAPSHOT</version>
        <relativePath>../../langchain4j-parent/pom.xml</relativePath>
    </parent>

    <artifactId>langchain4j-document-parser-apache-tika</artifactId>
    <name>LangChain4j :: Document parser :: Apache Tika</name>
    <packaging>jar</packaging>

    <properties>
        <apache-tika.version>2.9.1</apache-tika.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-core</artifactId>
        </dependency>

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>${apache-tika.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers-standard-package</artifactId>
            <version>${apache-tika.version}</version>
        </dependency>

        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-engine</artifactId>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-params</artifactId>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.assertj</groupId>
            <artifactId>assertj-core</artifactId>
            <scope>test</scope>
        </dependency>

    </dependencies>

</project>
```
71 changes: 71 additions & 0 deletions
.../main/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java

```java
package dev.langchain4j.data.document.parser.apache.tika;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.InputStream;

import static dev.langchain4j.internal.Utils.getOrDefault;

/**
 * Parses files into {@link Document}s using the Apache Tika library, automatically detecting the file format.
 * This parser supports various file formats, including PDF, DOC, PPT, XLS.
 * For detailed information on supported formats,
 * please refer to the <a href="https://tika.apache.org/2.9.1/formats.html">Apache Tika documentation</a>.
 */
public class ApacheTikaDocumentParser implements DocumentParser {

    private static final int NO_WRITE_LIMIT = -1;

    private final Parser parser;
    private final ContentHandler contentHandler;
    private final Metadata metadata;
    private final ParseContext parseContext;

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     */
    public ApacheTikaDocumentParser() {
        this(null, null, null, null);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided Tika components.
     * If some of the components are not provided ({@code null}), the defaults will be used.
     *
     * @param parser         Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandler Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadata       Tika metadata. Default: empty {@link Metadata}
     * @param parseContext   Tika parse context. Default: empty {@link ParseContext}
     */
    public ApacheTikaDocumentParser(Parser parser,
                                    ContentHandler contentHandler,
                                    Metadata metadata,
                                    ParseContext parseContext) {
        this.parser = getOrDefault(parser, AutoDetectParser::new);
        this.contentHandler = getOrDefault(contentHandler, () -> new BodyContentHandler(NO_WRITE_LIMIT));
        this.metadata = getOrDefault(metadata, Metadata::new);
        this.parseContext = getOrDefault(parseContext, ParseContext::new);
    }

    // TODO allow automatically extracting metadata (e.g. creator, last-author, created/modified timestamp, etc.)

    @Override
    public Document parse(InputStream inputStream) {
        try {
            parser.parse(inputStream, contentHandler, metadata, parseContext);
            String text = contentHandler.toString();
            return Document.from(text);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```
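The constructor above relies on `Utils.getOrDefault` from `langchain4j-core`. Its supplier-based behavior can be sketched as follows (an illustrative re-implementation, not the library source):

```java
import java.util.function.Supplier;

public class GetOrDefaultSketch {

    // Returns the given value if non-null; otherwise builds the default lazily.
    // The Supplier matters: a default like new BodyContentHandler(-1) is only
    // constructed when no custom component was provided.
    static <T> T getOrDefault(T value, Supplier<T> defaultSupplier) {
        return value != null ? value : defaultSupplier.get();
    }

    public static void main(String[] args) {
        System.out.println(getOrDefault("custom", () -> "default")); // custom
        System.out.println(getOrDefault(null, () -> "default"));     // default
    }
}
```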
12 changes: 12 additions & 0 deletions
...ava/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserFactory.java

```java
package dev.langchain4j.data.document.parser.apache.tika;

import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.spi.data.document.parser.DocumentParserFactory;

public class ApacheTikaDocumentParserFactory implements DocumentParserFactory {

    @Override
    public DocumentParser create() {
        return new ApacheTikaDocumentParser();
    }
}
```
1 change: 1 addition & 0 deletions
...esources/META-INF/services/dev.langchain4j.spi.data.document.parser.DocumentParserFactory

```
dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParserFactory
```
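This `META-INF/services` entry is what lets `FileSystemDocumentLoader` discover the parser at runtime without a compile-time dependency. The discovery pattern, sketched with a hypothetical `GreeterFactory` interface standing in for `DocumentParserFactory` (here no provider is registered, so the fallback is used):

```java
import java.util.Optional;
import java.util.ServiceLoader;

interface GreeterFactory {
    String create();
}

public class SpiDemo {

    // Mirrors the SPI lookup: take the first factory found on the classpath
    // (registered via META-INF/services), otherwise fall back to a default.
    static String loadGreeting() {
        Optional<GreeterFactory> factory = ServiceLoader.load(GreeterFactory.class).findFirst();
        return factory.map(GreeterFactory::create).orElse("default");
    }

    public static void main(String[] args) {
        System.out.println(loadGreeting()); // no provider registered here -> "default"
    }
}
```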
50 changes: 50 additions & 0 deletions
...t/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java

```java
package dev.langchain4j.data.document.parser.apache.tika;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import org.apache.tika.parser.AutoDetectParser;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import java.io.InputStream;

import static org.assertj.core.api.Assertions.assertThat;

class ApacheTikaDocumentParserTest {

    @ParameterizedTest
    @ValueSource(strings = {
            "test-file.doc",
            "test-file.docx",
            "test-file.ppt",
            "test-file.pptx",
            "test-file.pdf"
    })
    void should_parse_doc_ppt_and_pdf_files(String fileName) {

        DocumentParser parser = new ApacheTikaDocumentParser();
        InputStream inputStream = getClass().getClassLoader().getResourceAsStream(fileName);

        Document document = parser.parse(inputStream);

        assertThat(document.text()).isEqualToIgnoringWhitespace("test content");
        assertThat(document.metadata().asMap()).isEmpty();
    }

    @ParameterizedTest
    @ValueSource(strings = {
            "test-file.xls",
            "test-file.xlsx"
    })
    void should_parse_xls_files(String fileName) {

        DocumentParser parser = new ApacheTikaDocumentParser(new AutoDetectParser(), null, null, null);
        InputStream inputStream = getClass().getClassLoader().getResourceAsStream(fileName);

        Document document = parser.parse(inputStream);

        assertThat(document.text())
                .isEqualToIgnoringWhitespace("Sheet1\ntest content\nSheet2\ntest content");
        assertThat(document.metadata().asMap()).isEmpty();
    }
}
```
Binary files added (not shown):

- document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.doc (+22 KB)
- document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.docx (+11.9 KB)
- document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.pdf (+22.4 KB)
- document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.ppt (+40.5 KB)
- document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.pptx (+32.3 KB)
- document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.xls (+25.5 KB)
- document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.xlsx (+9.3 KB)
15 changes: 15 additions & 0 deletions
...4j-core/src/main/java/dev/langchain4j/spi/data/document/parser/DocumentParserFactory.java

```java
package dev.langchain4j.spi.data.document.parser;

import dev.langchain4j.data.document.DocumentParser;

/**
 * A factory for creating {@link DocumentParser} instances through SPI.
 * <br>
 * Available implementations: {@code ApacheTikaDocumentParserFactory}
 * in the {@code langchain4j-document-parser-apache-tika} module.
 * For the "Easy RAG", import the {@code langchain4j-easy-rag} module.
 */
public interface DocumentParserFactory {

    DocumentParser create();
}
```