19 changes: 13 additions & 6 deletions libs/core/langchain_core/caches.py
@@ -2,8 +2,8 @@
Distinct from provider-based [prompt caching](https://docs.langchain.com/oss/python/langchain/models#prompt-caching).
!!! warning
This is a beta feature! Please be wary of deploying experimental code to production
!!! warning "Beta feature"
This is a beta feature. Please be wary of deploying experimental code to production
unless you've taken appropriate precautions.
A cache is useful for two reasons:
@@ -49,17 +49,18 @@ def lookup(self, prompt: str, llm_string: str) -> RETURN_VAL_TYPE | None:
"""Look up based on `prompt` and `llm_string`.
A cache implementation is expected to generate a key from the 2-tuple
of prompt and llm_string (e.g., by concatenating them with a delimiter).
of `prompt` and `llm_string` (e.g., by concatenating them with a delimiter).
Args:
prompt: A string representation of the prompt.
In the case of a chat model, the prompt is a non-trivial
serialization of the prompt into the language model.
llm_string: A string representation of the LLM configuration.
This is used to capture the invocation parameters of the LLM
(e.g., model name, temperature, stop tokens, max tokens, etc.).
These invocation parameters are serialized into a string
representation.
These invocation parameters are serialized into a string representation.
Returns:
On a cache miss, return `None`. On a cache hit, return the cached value.
@@ -78,8 +79,10 @@ def update(self, prompt: str, llm_string: str, return_val: RETURN_VAL_TYPE) -> N
In the case of a chat model, the prompt is a non-trivial
serialization of the prompt into the language model.
llm_string: A string representation of the LLM configuration.
This is used to capture the invocation parameters of the LLM
(e.g., model name, temperature, stop tokens, max tokens, etc.).
These invocation parameters are serialized into a string
representation.
return_val: The value to be cached. The value is a list of `Generation`
@@ -94,15 +97,17 @@ async def alookup(self, prompt: str, llm_string: str) -> RETURN_VAL_TYPE | None:
"""Async look up based on `prompt` and `llm_string`.
A cache implementation is expected to generate a key from the 2-tuple
of prompt and llm_string (e.g., by concatenating them with a delimiter).
of `prompt` and `llm_string` (e.g., by concatenating them with a delimiter).
Args:
prompt: A string representation of the prompt.
In the case of a chat model, the prompt is a non-trivial
serialization of the prompt into the language model.
llm_string: A string representation of the LLM configuration.
This is used to capture the invocation parameters of the LLM
(e.g., model name, temperature, stop tokens, max tokens, etc.).
These invocation parameters are serialized into a string
representation.
@@ -125,8 +130,10 @@ async def aupdate(
In the case of a chat model, the prompt is a non-trivial
serialization of the prompt into the language model.
llm_string: A string representation of the LLM configuration.
This is used to capture the invocation parameters of the LLM
(e.g., model name, temperature, stop tokens, max tokens, etc.).
These invocation parameters are serialized into a string
representation.
return_val: The value to be cached. The value is a list of `Generation`
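To make the `lookup`/`update` contract concrete, here is a minimal in-memory sketch of the interface (illustrative only; `langchain_core.caches` ships a real `InMemoryCache` that works this way). The key is derived from the `(prompt, llm_string)` 2-tuple exactly as the docstrings describe, and the async variants fall back to these sync methods in current `langchain_core`.

```python
from langchain_core.caches import RETURN_VAL_TYPE, BaseCache


class TupleKeyedCache(BaseCache):
    """Minimal sketch: cache keyed on the (prompt, llm_string) 2-tuple."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], RETURN_VAL_TYPE] = {}

    def lookup(self, prompt: str, llm_string: str) -> RETURN_VAL_TYPE | None:
        # Cache hit returns the cached generations; a miss returns None.
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, return_val: RETURN_VAL_TYPE) -> None:
        self._store[(prompt, llm_string)] = return_val

    def clear(self, **kwargs) -> None:
        self._store.clear()
```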
27 changes: 24 additions & 3 deletions libs/core/langchain_core/documents/__init__.py
@@ -1,7 +1,28 @@
"""Documents module.
"""Documents module for data retrieval and processing workflows.
**Document** module is a collection of classes that handle documents
and their transformations.
This module provides core abstractions for handling data in retrieval-augmented
generation (RAG) pipelines, vector stores, and document processing workflows.
!!! warning "Documents vs. message content"
This module is distinct from `langchain_core.messages.content`, which provides
multimodal content blocks for **LLM chat I/O** (text, images, audio, etc. within
messages).
**Key distinction:**
- **Documents** (this module): For **data retrieval and processing workflows**
- Vector stores, retrievers, RAG pipelines
- Text chunking, embedding, and semantic search
- Example: Chunks of a PDF stored in a vector database
- **Content Blocks** (`messages.content`): For **LLM conversational I/O**
- Multimodal message content sent to/from models
- Tool calls, reasoning, citations within chat
- Example: An image sent to a vision model in a chat message (via
[`ImageContentBlock`][langchain.messages.ImageContentBlock])
While both can represent similar data types (text, files), they serve different
architectural purposes in LangChain applications.
"""

from typing import TYPE_CHECKING
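A short sketch of the distinction above (content and metadata are illustrative): a `Document` headed for a vector store versus a chat message headed for a model.

```python
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage

# Retrieval workflow: a chunk to be embedded and stored in a vector store.
chunk = Document(
    page_content="LangChain provides abstractions for LLM applications.",
    metadata={"source": "intro.pdf", "page": 1},
)

# Chat I/O: conversational input sent to a model.
message = HumanMessage(content="Summarize the intro document for me.")
```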
105 changes: 59 additions & 46 deletions libs/core/langchain_core/documents/base.py
@@ -1,4 +1,16 @@
"""Base classes for media and documents."""
"""Base classes for media and documents.
This module contains core abstractions for **data retrieval and processing workflows**:
- `BaseMedia`: Base class providing `id` and `metadata` fields
- `Blob`: Raw data loading (files, binary data) - used by document loaders
- `Document`: Text content for retrieval (RAG, vector stores, semantic search)
!!! note "Not for LLM chat messages"
These classes are for data processing pipelines, not LLM I/O. For multimodal
content in chat messages (images, audio in conversations), see
`langchain.messages` content blocks instead.
"""

from __future__ import annotations

@@ -19,15 +31,13 @@


class BaseMedia(Serializable):
"""Use to represent media content.
Media objects can be used to represent raw data, such as text or binary data.
"""Base class for content used in retrieval and data processing workflows.
LangChain Media objects allow associating metadata and an optional identifier
with the content.
Provides common fields for content that needs to be stored, indexed, or searched.
The presence of an ID and metadata make it easier to store, index, and search
over the content in a structured way.
!!! note
For multimodal content in **chat messages** (images, audio sent to/from LLMs),
use `langchain.messages` content blocks instead.
"""

# The ID field is optional at the moment.
@@ -45,61 +55,60 @@ class BaseMedia(Serializable):


class Blob(BaseMedia):
"""Blob represents raw data by either reference or value.
"""Raw data abstraction for document loading and file processing.
Provides an interface to materialize the blob in different representations, and
help to decouple the development of data loaders from the downstream parsing of
the raw data.
Represents raw bytes or text, either in-memory or by file reference. Used
primarily by document loaders to decouple data loading from parsing.
Inspired by [Mozilla's `Blob`](https://developer.mozilla.org/en-US/docs/Web/API/Blob)
Example: Initialize a blob from in-memory data
???+ example "Initialize a blob from in-memory data"
```python
from langchain_core.documents import Blob
```python
from langchain_core.documents import Blob
blob = Blob.from_data("Hello, world!")
blob = Blob.from_data("Hello, world!")
# Read the blob as a string
print(blob.as_string())
# Read the blob as a string
print(blob.as_string())
# Read the blob as bytes
print(blob.as_bytes())
# Read the blob as bytes
print(blob.as_bytes())
# Read the blob as a byte stream
with blob.as_bytes_io() as f:
print(f.read())
```
# Read the blob as a byte stream
with blob.as_bytes_io() as f:
print(f.read())
```
Example: Load from memory and specify mime-type and metadata
??? example "Load from memory and specify MIME type and metadata"
```python
from langchain_core.documents import Blob
```python
from langchain_core.documents import Blob
blob = Blob.from_data(
data="Hello, world!",
mime_type="text/plain",
metadata={"source": "https://example.com"},
)
```
blob = Blob.from_data(
data="Hello, world!",
mime_type="text/plain",
metadata={"source": "https://example.com"},
)
```
Example: Load the blob from a file
??? example "Load the blob from a file"
```python
from langchain_core.documents import Blob
```python
from langchain_core.documents import Blob
blob = Blob.from_path("path/to/file.txt")
blob = Blob.from_path("path/to/file.txt")
# Read the blob as a string
print(blob.as_string())
# Read the blob as a string
print(blob.as_string())
# Read the blob as bytes
print(blob.as_bytes())
# Read the blob as bytes
print(blob.as_bytes())
# Read the blob as a byte stream
with blob.as_bytes_io() as f:
print(f.read())
```
# Read the blob as a byte stream
with blob.as_bytes_io() as f:
print(f.read())
```
"""

data: bytes | str | None = None
@@ -213,7 +222,7 @@ def from_path(
encoding: Encoding to use if decoding the bytes into a string
mime_type: If provided, will be set as the MIME type of the data
guess_type: If `True`, the MIME type will be guessed from the file
extension, if a mime-type was not provided
extension, if a MIME type was not provided
metadata: Metadata to associate with the `Blob`
Returns:
@@ -274,6 +283,10 @@ def __repr__(self) -> str:
class Document(BaseMedia):
"""Class for storing a piece of text and associated metadata.
!!! note
`Document` is for **retrieval workflows**, not chat I/O. For sending text
to an LLM in a conversation, use message types from `langchain.messages`.
Example:
```python
from langchain_core.documents import Document
document = Document(
    page_content="Hello, world!",
    metadata={"source": "https://example.com"},
)
```
12 changes: 6 additions & 6 deletions libs/core/langchain_core/documents/compressor.py
@@ -21,14 +21,14 @@ class BaseDocumentCompressor(BaseModel, ABC):
This abstraction is primarily used for post-processing of retrieved documents.
Documents matching a given query are first retrieved.
`Document` objects matching a given query are first retrieved.
Then the list of documents can be further processed.
For example, one could re-rank the retrieved documents using an LLM.
!!! note
Users should favor using a RunnableLambda instead of sub-classing from this
Users should favor using a `RunnableLambda` instead of sub-classing from this
interface.
"""
@@ -43,9 +43,9 @@ def compress_documents(
"""Compress retrieved documents given the query context.
Args:
documents: The retrieved documents.
documents: The retrieved `Document` objects.
query: The query context.
callbacks: Optional callbacks to run during compression.
callbacks: Optional `Callbacks` to run during compression.
Returns:
The compressed documents.
@@ -61,9 +61,9 @@ async def acompress_documents(
"""Async compress retrieved documents given the query context.
Args:
documents: The retrieved documents.
documents: The retrieved `Document` objects.
query: The query context.
callbacks: Optional callbacks to run during compression.
callbacks: Optional `Callbacks` to run during compression.
Returns:
The compressed documents.
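As the note above suggests, a `RunnableLambda` can stand in for a subclass. A minimal sketch follows; the dict input shape and the keyword-match heuristic are assumptions for illustration, not a library convention.

```python
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda


def rerank(inputs: dict) -> list[Document]:
    # Hypothetical post-processing: keep only documents mentioning the query.
    query, docs = inputs["query"], inputs["documents"]
    return [d for d in docs if query.lower() in d.page_content.lower()]


compressor = RunnableLambda(rerank)
docs = [Document(page_content="LLM caching basics"), Document(page_content="Unrelated")]
print(compressor.invoke({"query": "caching", "documents": docs}))
```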
4 changes: 2 additions & 2 deletions libs/core/langchain_core/documents/transformers.py
@@ -16,8 +16,8 @@
class BaseDocumentTransformer(ABC):
"""Abstract base class for document transformation.

A document transformation takes a sequence of Documents and returns a
sequence of transformed Documents.
A document transformation takes a sequence of `Document` objects and returns a
sequence of transformed `Document` objects.

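A minimal sketch of a conforming transformer (the class and its whitespace-collapsing behavior are illustrative assumptions, not part of the library):

```python
from collections.abc import Sequence
from typing import Any

from langchain_core.documents import BaseDocumentTransformer, Document


class WhitespaceNormalizer(BaseDocumentTransformer):
    """Hypothetical transformer that collapses runs of whitespace."""

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        return [
            Document(
                page_content=" ".join(doc.page_content.split()),
                metadata=doc.metadata,
            )
            for doc in documents
        ]

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        # Simple delegation; real implementations may do true async work.
        return self.transform_documents(documents, **kwargs)
```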
4 changes: 2 additions & 2 deletions libs/core/langchain_core/embeddings/fake.py
@@ -18,7 +18,7 @@ class FakeEmbeddings(Embeddings, BaseModel):
This embedding model creates embeddings by sampling from a normal distribution.
!!! warning
!!! danger "Toy model"
Do not use this outside of testing, as it is not a real embedding model.
Instantiate:
@@ -73,7 +73,7 @@ class DeterministicFakeEmbedding(Embeddings, BaseModel):
This embedding model creates embeddings by sampling from a normal distribution
with a seed based on the hash of the text.
!!! warning
!!! danger "Toy model"
Do not use this outside of testing, as it is not a real embedding model.
Instantiate:
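A quick test-time usage sketch (assuming `size` sets the embedding dimensionality, as in current releases):

```python
from langchain_core.embeddings import DeterministicFakeEmbedding, FakeEmbeddings

random_embedder = FakeEmbeddings(size=8)
assert len(random_embedder.embed_query("hello")) == 8  # values differ per call

stable_embedder = DeterministicFakeEmbedding(size=8)
# The seed is derived from a hash of the text, so repeated calls agree.
assert stable_embedder.embed_query("hello") == stable_embedder.embed_query("hello")
```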
11 changes: 6 additions & 5 deletions libs/core/langchain_core/language_models/__init__.py
@@ -6,12 +6,13 @@
**Chat models**
Language models that use a sequence of messages as inputs and return chat messages
as outputs (as opposed to using plain text). Chat models support the assignment of
distinct roles to conversation messages, helping to distinguish messages from the AI,
users, and instructions such as system messages.
as outputs (as opposed to using plain text).
The key abstraction for chat models is `BaseChatModel`. Implementations
should inherit from this class.
Chat models support the assignment of distinct roles to conversation messages, helping
to distinguish messages from the AI, users, and instructions such as system messages.
The key abstraction for chat models is `BaseChatModel`. Implementations should inherit
from this class.
See existing [chat model integrations](https://docs.langchain.com/oss/python/integrations/chat).
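To illustrate role-tagged message I/O without a provider, a sketch using the in-repo fake (assuming `FakeListChatModel` is exported from `langchain_core.language_models`, as in current releases):

```python
from langchain_core.language_models import FakeListChatModel
from langchain_core.messages import HumanMessage, SystemMessage

# A fake chat model that replays canned responses; handy in tests.
model = FakeListChatModel(responses=["Hello! How can I help?"])

result = model.invoke(
    [
        SystemMessage(content="You are a helpful assistant."),
        HumanMessage(content="Hi there!"),
    ]
)
print(type(result).__name__, result.content)  # AIMessage Hello! How can I help?
```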
@@ -1,4 +1,4 @@
"""Fake chat model for testing purposes."""
"""Fake chat models for testing purposes."""

import asyncio
import re
5 changes: 4 additions & 1 deletion libs/core/langchain_core/language_models/llms.py
@@ -1,4 +1,7 @@
"""Base interface for large language models to expose."""
"""Base interface for traditional large language models (LLMs) to expose.
These are traditionally older models (newer models generally are chat models).
"""

from __future__ import annotations

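For contrast with chat models, traditional LLMs are plain text in, plain text out. A sketch with the in-repo fake (assuming `FakeListLLM` is exported as in current releases):

```python
from langchain_core.language_models import FakeListLLM

llm = FakeListLLM(responses=["LangChain is a framework for LLM apps."])

# String in, string out; no message roles involved.
answer = llm.invoke("What is LangChain?")
print(answer)
```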
3 changes: 3 additions & 0 deletions libs/core/langchain_core/load/serializable.py
@@ -97,11 +97,14 @@ class Serializable(BaseModel, ABC):
by default. This is to prevent accidental serialization of objects that should
not be serialized.
- `get_lc_namespace`: Get the namespace of the LangChain object.
During deserialization, this namespace is used to identify
the correct class to instantiate.
Please see the `Reviver` class in `langchain_core.load.load` for more details.
During deserialization, an additional mapping is used to handle classes that have
moved or been renamed across package versions.
- `lc_secrets`: A map of constructor argument names to secret ids.
- `lc_attributes`: List of additional attribute names that should be included
as part of the serialized representation.
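A small round-trip sketch of the namespace in action (assuming `Document` is serializable and `dumpd`/`load` behave as in current `langchain_core`; `load` may emit a beta warning):

```python
from langchain_core.documents import Document
from langchain_core.load import dumpd, load

doc = Document(page_content="hello", metadata={"source": "test"})
data = dumpd(doc)
# The "id" entry records the namespace path the Reviver uses to find the class,
# e.g. ["langchain", "schema", "document", "Document"].
print(data["id"])
restored = load(data)
assert restored == doc
```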
@@ -368,7 +368,7 @@ def _convert_to_v1_from_genai(message: AIMessage) -> list[types.ContentBlock]:
else:
# Assume it's raw base64 without data URI
try:
# Validate base64 and decode for mime type detection
# Validate base64 and decode for MIME type detection
decoded_bytes = base64.b64decode(url, validate=True)

image_url_b64_block = {
@@ -379,7 +379,7 @@
try:
import filetype # type: ignore[import-not-found] # noqa: PLC0415

# Guess mime type based on file bytes
# Guess MIME type based on file bytes
mime_type = None
kind = filetype.guess(decoded_bytes)
if kind: