POC: Easy RAG (langchain4j#686)
Implementing RAG applications is hard, especially for those who are just
getting started exploring LLMs and RAG.

This PR introduces an "Easy RAG" feature that should help developers
get started with RAG as easily as possible.

With it, there is no need to learn about
chunking/splitting/segmentation, embeddings, embedding models, vector
databases, retrieval techniques and other RAG-related concepts.

This is similar to how one can simply upload one or multiple files into
the [OpenAI Assistants
API](https://platform.openai.com/docs/assistants/overview) and the LLM
will automagically know about their contents when answering questions.

Easy RAG uses a local embedding model running on your CPU (GPU support
can be added later).
Your files are ingested into an in-memory embedding store.

Please note that "Easy RAG" will not replace manual RAG setups and
especially [advanced RAG
techniques](langchain4j#538), but it
provides an easier way to get started with RAG.
The quality of "Easy RAG" should be sufficient for demos, proofs of
concept and getting started.


To use "Easy RAG", simply import `langchain4j-easy-rag` dependency that
includes everything needed to do RAG:
- Apache Tika document loader (to parse all document types
automatically)
- Quantized [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) in-process embedding model which has an impressive (for it's size) 51.68 [score](https://huggingface.co/spaces/mteb/leaderboard) for retrieval
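
For reference, the dependency can be declared in a Maven build roughly as
follows (a sketch; the version shown is a placeholder, use the langchain4j
release that matches the rest of your setup):

```xml
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-easy-rag</artifactId>
    <!-- placeholder version: align with your langchain4j version -->
    <version>0.29.0</version>
</dependency>
```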


Here is the proposed API:

```java
List<Document> documents = FileSystemDocumentLoader.loadDocuments(directoryPath); // one can also load documents recursively and filter with glob/regex

EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>(); // we will use an in-memory embedding store for simplicity

EmbeddingStoreIngestor.ingest(documents, embeddingStore);

Assistant assistant = AiServices.builder(Assistant.class)
                .chatLanguageModel(model)
                .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
                .build();

String answer = assistant.chat("Who is Charlie?"); // Charlie is a carrot...
```
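
The `Assistant` interface above is user-defined rather than provided by the
library. A minimal sketch, assuming the standard `AiServices` pattern:

```java
interface Assistant {

    // AiServices generates an implementation that retrieves relevant
    // segments from the embedding store and feeds them to the LLM
    String chat(String userMessage);
}
```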

`FileSystemDocumentLoader` in the above code loads documents using a
`DocumentParser` available on the classpath via SPI, in this case the
`ApacheTikaDocumentParser` imported with the `langchain4j-easy-rag`
dependency.
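
As the code comment above notes, documents can also be loaded recursively
and filtered. A minimal sketch, assuming the loader's recursive and
`PathMatcher`-based overloads:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;

// walk the directory tree recursively, keeping only PDFs
PathMatcher onlyPdfs = FileSystems.getDefault().getPathMatcher("glob:**.pdf");
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively(directoryPath, onlyPdfs);
```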

The `EmbeddingStoreIngestor` in the above code:
- splits documents into smaller text segments using a `DocumentSplitter`
loaded via SPI from the `langchain4j-easy-rag` dependency. Currently it
uses `DocumentSplitters.recursive(300, 30, new HuggingFaceTokenizer())`
- embeds text segments using the quantized bge-small-en-v1.5 embedding
model (see above) loaded via SPI from the `langchain4j-easy-rag` dependency
- stores text segments and their embeddings into the specified embedding
store

When using `InMemoryEmbeddingStore`, one can serialize/persist it into a
JSON string or into a file.
This way one can skip loading and embedding documents on each
application run.
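
A minimal sketch, assuming the store's `serializeToFile`/`fromFile` helpers
(the file name is illustrative):

```java
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, embeddingStore);

// persist the ingested store
embeddingStore.serializeToFile("easy-rag-store.json");

// on a later run, restore it instead of re-loading and re-embedding
InMemoryEmbeddingStore<TextSegment> restored = InMemoryEmbeddingStore.fromFile("easy-rag-store.json");
```

Note that `serializeToFile` is declared on `InMemoryEmbeddingStore` itself,
so the variable must be typed as `InMemoryEmbeddingStore` rather than the
`EmbeddingStore` interface.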

It is easy to customize the ingestion in the above code: just change
```java
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
```
into
```java
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                //.documentTransformer(...) // you can optionally transform (clean, enrich, etc) documents before splitting
                //.documentSplitter(...) // you can optionally specify another splitter
                //.textSegmentTransformer(...) // you can optionally transform (clean, enrich, etc) segments before embedding
                //.embeddingModel(...) // you can optionally specify another embedding model to use for embedding
                .embeddingStore(embeddingStore)
                .build();

ingestor.ingest(documents);
```
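
For example, a sketch that swaps in a different splitter (the values are
illustrative; without a `Tokenizer`, segment sizes are measured in characters):

```java
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        // segments of up to 500 characters with a 50-character overlap
        .documentSplitter(DocumentSplitters.recursive(500, 50))
        .embeddingStore(embeddingStore)
        .build();

ingestor.ingest(documents);
```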

Over time, we can add an auto-eval feature that will find the most
suitable hyperparameters for the given documents (e.g. which embedding
model to use, which splitting method, possibly advanced RAG techniques,
etc.) so that "Easy RAG" can be comparable to advanced RAG.

Related:
langchain4j/langchain4j-embeddings#16

---------

Co-authored-by: dliubars <dliubars@redhat.com>
langchain4j and dliubars authored Mar 21, 2024
1 parent 6af51a5 commit 2f425da
Showing 28 changed files with 1,006 additions and 61 deletions.
61 changes: 61 additions & 0 deletions document-parsers/langchain4j-document-parser-apache-tika/pom.xml
@@ -0,0 +1,61 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-parent</artifactId>
        <version>0.29.0-SNAPSHOT</version>
        <relativePath>../../langchain4j-parent/pom.xml</relativePath>
    </parent>

    <artifactId>langchain4j-document-parser-apache-tika</artifactId>
    <name>LangChain4j :: Document parser :: Apache Tika</name>
    <packaging>jar</packaging>

    <properties>
        <apache-tika.version>2.9.1</apache-tika.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-core</artifactId>
        </dependency>

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>${apache-tika.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers-standard-package</artifactId>
            <version>${apache-tika.version}</version>
        </dependency>

        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-engine</artifactId>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-params</artifactId>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.assertj</groupId>
            <artifactId>assertj-core</artifactId>
            <scope>test</scope>
        </dependency>

    </dependencies>

</project>
document-parsers/langchain4j-document-parser-apache-tika/src/main/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java
@@ -0,0 +1,71 @@
package dev.langchain4j.data.document.parser.apache.tika;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.InputStream;

import static dev.langchain4j.internal.Utils.getOrDefault;

/**
 * Parses files into {@link Document}s using the Apache Tika library, automatically detecting the file format.
 * This parser supports various file formats, including PDF, DOC, PPT, XLS.
 * For detailed information on supported formats,
 * please refer to the <a href="https://tika.apache.org/2.9.1/formats.html">Apache Tika documentation</a>.
 */
public class ApacheTikaDocumentParser implements DocumentParser {

    private static final int NO_WRITE_LIMIT = -1;

    private final Parser parser;
    private final ContentHandler contentHandler;
    private final Metadata metadata;
    private final ParseContext parseContext;

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     */
    public ApacheTikaDocumentParser() {
        this(null, null, null, null);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided Tika components.
     * If some of the components are not provided ({@code null}), the defaults will be used.
     *
     * @param parser         Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandler Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadata       Tika metadata. Default: empty {@link Metadata}
     * @param parseContext   Tika parse context. Default: empty {@link ParseContext}
     */
    public ApacheTikaDocumentParser(Parser parser,
                                    ContentHandler contentHandler,
                                    Metadata metadata,
                                    ParseContext parseContext) {
        this.parser = getOrDefault(parser, AutoDetectParser::new);
        this.contentHandler = getOrDefault(contentHandler, () -> new BodyContentHandler(NO_WRITE_LIMIT));
        this.metadata = getOrDefault(metadata, Metadata::new);
        this.parseContext = getOrDefault(parseContext, ParseContext::new);
    }

    // TODO: allow automatically extracting metadata (e.g. creator, last-author, created/modified timestamps, etc.)

    @Override
    public Document parse(InputStream inputStream) {
        try {
            parser.parse(inputStream, contentHandler, metadata, parseContext);
            String text = contentHandler.toString();
            return Document.from(text);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
document-parsers/langchain4j-document-parser-apache-tika/src/main/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserFactory.java
@@ -0,0 +1,12 @@
package dev.langchain4j.data.document.parser.apache.tika;

import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.spi.data.document.parser.DocumentParserFactory;

public class ApacheTikaDocumentParserFactory implements DocumentParserFactory {

    @Override
    public DocumentParser create() {
        return new ApacheTikaDocumentParser();
    }
}
document-parsers/langchain4j-document-parser-apache-tika/src/main/resources/META-INF/services/dev.langchain4j.spi.data.document.parser.DocumentParserFactory
@@ -0,0 +1 @@
dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParserFactory
document-parsers/langchain4j-document-parser-apache-tika/src/test/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java
@@ -0,0 +1,50 @@
package dev.langchain4j.data.document.parser.apache.tika;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import org.apache.tika.parser.AutoDetectParser;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import java.io.InputStream;

import static org.assertj.core.api.Assertions.assertThat;

class ApacheTikaDocumentParserTest {

    @ParameterizedTest
    @ValueSource(strings = {
            "test-file.doc",
            "test-file.docx",
            "test-file.ppt",
            "test-file.pptx",
            "test-file.pdf"
    })
    void should_parse_doc_ppt_and_pdf_files(String fileName) {

        DocumentParser parser = new ApacheTikaDocumentParser();
        InputStream inputStream = getClass().getClassLoader().getResourceAsStream(fileName);

        Document document = parser.parse(inputStream);

        assertThat(document.text()).isEqualToIgnoringWhitespace("test content");
        assertThat(document.metadata().asMap()).isEmpty();
    }

    @ParameterizedTest
    @ValueSource(strings = {
            "test-file.xls",
            "test-file.xlsx"
    })
    void should_parse_xls_files(String fileName) {

        DocumentParser parser = new ApacheTikaDocumentParser(new AutoDetectParser(), null, null, null);
        InputStream inputStream = getClass().getClassLoader().getResourceAsStream(fileName);

        Document document = parser.parse(inputStream);

        assertThat(document.text())
                .isEqualToIgnoringWhitespace("Sheet1\ntest content\nSheet2\ntest content");
        assertThat(document.metadata().asMap()).isEmpty();
    }
}
Binary test resource files added (test-file.doc, test-file.docx, test-file.ppt, test-file.pptx, test-file.pdf, test-file.xls, test-file.xlsx) — not shown.
33 changes: 32 additions & 1 deletion langchain4j-bom/pom.xml
@@ -31,6 +31,12 @@
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-easy-rag</artifactId>
            <version>${project.version}</version>
        </dependency>

        <!-- model providers -->

        <dependency>
@@ -228,36 +234,55 @@
            <artifactId>langchain4j-embeddings-all-minilm-l6-v2</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-all-minilm-l6-v2-q</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-bge-small-en</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-bge-small-en-q</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-bge-small-v15-en</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-bge-small-en-v15-q</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-bge-small-zh</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-bge-small-zh-q</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-e5-small-v2</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-embeddings-e5-small-v2-q</artifactId>
@@ -300,6 +325,12 @@

        <!-- document parsers -->

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-document-parser-apache-pdfbox</artifactId>
            <version>${project.version}</version>
        </dependency>

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-document-parser-apache-poi</artifactId>
@@ -308,7 +339,7 @@

        <dependency>
            <groupId>dev.langchain4j</groupId>
-           <artifactId>langchain4j-document-parser-apache-pdfbox</artifactId>
+           <artifactId>langchain4j-document-parser-apache-tika</artifactId>
            <version>${project.version}</version>
        </dependency>

2 changes: 1 addition & 1 deletion langchain4j-core/pom.xml
@@ -163,7 +163,7 @@
                        <limit>
                            <counter>INSTRUCTION</counter>
                            <value>COVEREDRATIO</value>
-                           <minimum>0.80</minimum>
+                           <minimum>0.75</minimum>
                        </limit>
                    </limits>
                </rule>
langchain4j-core/src/main/java/dev/langchain4j/rag/content/retriever/EmbeddingStoreContentRetriever.java
@@ -5,18 +5,21 @@
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.content.Content;
import dev.langchain4j.rag.query.Query;
import dev.langchain4j.spi.model.embedding.EmbeddingModelFactory;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.filter.Filter;
import lombok.Builder;

import java.util.Collection;
import java.util.List;
import java.util.function.Function;

import static dev.langchain4j.internal.Utils.getOrDefault;
import static dev.langchain4j.internal.ValidationUtils.*;
import static dev.langchain4j.spi.ServiceHelper.loadFactories;
import static java.util.stream.Collectors.toList;

/**
@@ -104,12 +107,29 @@ private EmbeddingStoreContentRetriever(EmbeddingStore<TextSegment> embeddingStore,
                                                Function<Query, Double> dynamicMinScore,
                                                Function<Query, Filter> dynamicFilter) {
        this.embeddingStore = ensureNotNull(embeddingStore, "embeddingStore");
-       this.embeddingModel = ensureNotNull(embeddingModel, "embeddingModel");
+       this.embeddingModel = ensureNotNull(
+               getOrDefault(embeddingModel, EmbeddingStoreContentRetriever::loadEmbeddingModel),
+               "embeddingModel"
+       );
        this.maxResultsProvider = getOrDefault(dynamicMaxResults, DEFAULT_MAX_RESULTS);
        this.minScoreProvider = getOrDefault(dynamicMinScore, DEFAULT_MIN_SCORE);
        this.filterProvider = getOrDefault(dynamicFilter, DEFAULT_FILTER);
    }

    private static EmbeddingModel loadEmbeddingModel() {
        Collection<EmbeddingModelFactory> factories = loadFactories(EmbeddingModelFactory.class);
        if (factories.size() > 1) {
            throw new RuntimeException("Conflict: multiple embedding models have been found in the classpath. " +
                    "Please explicitly specify the one you wish to use.");
        }

        for (EmbeddingModelFactory factory : factories) {
            return factory.create();
        }

        return null;
    }

    public static class EmbeddingStoreContentRetrieverBuilder {

        public EmbeddingStoreContentRetrieverBuilder maxResults(Integer maxResults) {
@@ -134,6 +154,14 @@ public EmbeddingStoreContentRetrieverBuilder filter(Filter filter) {
        }
    }

    /**
     * Creates an instance of an {@code EmbeddingStoreContentRetriever} from the specified {@link EmbeddingStore}
     * and {@link EmbeddingModel} found through SPI (see {@link EmbeddingModelFactory}).
     */
    public static EmbeddingStoreContentRetriever from(EmbeddingStore<TextSegment> embeddingStore) {
        return builder().embeddingStore(embeddingStore).build();
    }

    @Override
    public List<Content> retrieve(Query query) {

langchain4j-core/src/main/java/dev/langchain4j/spi/data/document/parser/DocumentParserFactory.java
@@ -0,0 +1,15 @@
package dev.langchain4j.spi.data.document.parser;

import dev.langchain4j.data.document.DocumentParser;

/**
* A factory for creating {@link DocumentParser} instances through SPI.
* <br>
* Available implementations: {@code ApacheTikaDocumentParserFactory}
* in the {@code langchain4j-document-parser-apache-tika} module.
* For the "Easy RAG", import {@code langchain4j-easy-rag} module.
*/
public interface DocumentParserFactory {

    DocumentParser create();
}
