Commit 882c816

Implement document indexing pipeline
Add core components for document processing pipeline that supports load → transform → vectorize → store workflow.

Core Components:
- DocumentProcessorInterface and DocumentProcessor for orchestrating the pipeline
- ReplaceTextTransformer for text replacement with validation
- Enhanced TextSplitTransformer with constructor parameters
- withContent() method for TextDocument for immutable content updates

Features:
- Support for multiple transformers in processing chain
- Array casting for flexible source input (string|array)
- Movie fixtures (gladiator, inception, jurassic-park) in Markdown format
- Example script demonstrating OpenAI embeddings with the pipeline

Implementation Details:
- Uses readonly classes for immutability
- Proper validation in transformers
- Comprehensive test coverage for withContent() method
- Clean separation of concerns between loading, transforming, and indexing

1 parent 6d17be2 commit 882c816

10 files changed: +342 -2 lines changed

examples/indexer/movies.php

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+<?php
+
+/*
+ * This file is part of the Symfony package.
+ *
+ * (c) Fabien Potencier <fabien@symfony.com>
+ *
+ * For the full copyright and license information, please view the LICENSE
+ * file that was distributed with this source code.
+ */
+
+use Symfony\AI\Platform\Bridge\OpenAi\Embeddings;
+use Symfony\AI\Platform\Bridge\OpenAi\PlatformFactory;
+use Symfony\AI\Store\Bridge\Local\InMemoryStore;
+use Symfony\AI\Store\Document\DocumentProcessor;
+use Symfony\AI\Store\Document\Loader\TextFileLoader;
+use Symfony\AI\Store\Document\Transformer\ReplaceTextTransformer;
+use Symfony\AI\Store\Document\Transformer\TextSplitTransformer;
+use Symfony\AI\Store\Document\Vectorizer;
+use Symfony\AI\Store\Indexer;
+
+require_once dirname(__DIR__).'/bootstrap.php';
+
+$platform = PlatformFactory::create(env('OPENAI_API_KEY'), http_client());
+$store = new InMemoryStore();
+$processor = new DocumentProcessor(
+    new TextFileLoader(),
+    [
+        new ReplaceTextTransformer('## Plot', '## Synopsis'),
+        new TextSplitTransformer(500, 100),
+    ],
+    new Indexer(new Vectorizer($platform, new Embeddings('text-embedding-3-small')), $store)
+);
+
+$movies = [
+    dirname(__DIR__, 2).'/fixtures/movies/gladiator.md',
+    dirname(__DIR__, 2).'/fixtures/movies/inception.md',
+    dirname(__DIR__, 2).'/fixtures/movies/jurassic-park.md',
+];
+
+$processor->process($movies);
+
+$results = $store->search($platform->invoke(new Embeddings('text-embedding-3-small'), 'Roman gladiator revenge')->asVectors()[0], 2);
+foreach ($results as $i => $result) {
+    echo sprintf("%d. [%.0f%%] %s\n", $i + 1, $result['similarity'] * 100, substr($result['document']->id, 0, 40).'...');
+}

fixtures/movies/gladiator.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# Gladiator (2000)
+
+**IMDB**: https://www.imdb.com/title/tt0172495/
+
+**Director:** Ridley Scott
+
+## Cast
+
+- **Russell Crowe** as Maximus Decimus Meridius
+- **Joaquin Phoenix** as Emperor Commodus
+- **Connie Nielsen** as Lucilla
+- **Oliver Reed** as Proximo
+- **Derek Jacobi** as Senator Gracchus
+- **Djimon Hounsou** as Juba
+- **Richard Harris** as Marcus Aurelius
+- **Ralf Möller** as Hagen
+- **Tommy Flanagan** as Cicero
+- **David Schofield** as Falco
+
+## Plot
+
+A former Roman General sets out to exact vengeance against the corrupt emperor who murdered his family and sent him into slavery.
+
+**Maximus Decimus Meridius** is a powerful Roman general beloved by the people and the aging Emperor **Marcus Aurelius**. As Marcus Aurelius lies dying, he makes known his wish that Maximus should succeed him and return Rome to the former glory of the Republic rather than the corrupt Empire it has become.
+
+However, Marcus Aurelius's son **Commodus** learns of his father's plan and murders him before he can publicly name Maximus as his successor. Commodus then orders the execution of Maximus and his family. Maximus escapes the execution but arrives at his farm too late to save his wife and son.
+
+Wounded and devastated, Maximus is captured by slave traders and forced to become a gladiator. Under the training of **Proximo**, a former gladiator, Maximus becomes a skilled fighter and eventually makes his way to the **Colosseum** in Rome, where he gains fame and the crowd's favor.
+
+Using his newfound popularity with the people, Maximus seeks to avenge the murder of his family and fulfill his promise to Marcus Aurelius to restore Rome to a republic. The film culminates in a final confrontation between Maximus and Commodus in the arena.
+
+The film explores themes of *honor*, *revenge*, *political corruption*, and the struggle between personal desires and duty to the greater good.

fixtures/movies/inception.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+# Inception (2010)
+
+**IMDB**: https://www.imdb.com/title/tt1375666/
+
+**Director:** Christopher Nolan
+
+## Cast
+
+- **Leonardo DiCaprio** as Dom Cobb
+- **Marion Cotillard** as Mal Cobb
+- **Tom Hardy** as Eames
+- **Elliot Page** as Ariadne
+- **Ken Watanabe** as Saito
+- **Dileep Rao** as Yusuf
+- **Cillian Murphy** as Robert Fischer Jr.
+- **Tom Berenger** as Peter Browning
+- **Michael Caine** as Professor Stephen Miles
+- **Lukas Haas** as Nash
+
+## Plot
+
+A skilled thief is given a chance at redemption if he can successfully perform inception, the act of planting an idea in someone's subconscious.
+
+**Dom Cobb** is a skilled thief who specializes in *extraction* - stealing secrets from people's subconscious minds while they dream. This unique skill has made him a valuable player in the world of corporate espionage, but it has also cost him everything he loves. Cobb's rare ability has made him a coveted player in this treacherous new world of corporate espionage, but it has also made him an international fugitive and cost him everything he has ever loved.
+
+Now Cobb is being offered a chance at redemption. One last job could give him his life back but only if he can accomplish the impossible - **inception**. Instead of the perfect heist, Cobb and his team of specialists have to pull off the reverse: their task is not to steal an idea but to plant one. If they succeed, it could be the perfect crime.
+
+The film explores themes of *reality*, *dreams*, *memory*, and the nature of consciousness through multiple layers of dream states, creating a complex narrative structure that challenges both characters and audience to question what is real.

fixtures/movies/jurassic-park.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# Jurassic Park (1993)
+
+**IMDB**: https://www.imdb.com/title/tt0107290/
+
+**Director:** Steven Spielberg
+
+## Cast
+
+- **Sam Neill** as Dr. Alan Grant
+- **Laura Dern** as Dr. Ellie Sattler
+- **Jeff Goldblum** as Dr. Ian Malcolm
+- **Richard Attenborough** as John Hammond
+- **Bob Peck** as Robert Muldoon
+- **Martin Ferrero** as Donald Gennaro
+- **BD Wong** as Dr. Henry Wu
+- **Joseph Mazzello** as Tim Murphy
+- **Ariana Richards** as Lex Murphy
+- **Wayne Knight** as Dennis Nedry
+
+## Plot
+
+During a preview tour, a theme park suffers a major power breakdown that allows its cloned dinosaur exhibits to run amok.
+
+Billionaire **John Hammond** has created a theme park on a remote island where he has successfully cloned dinosaurs from ancient DNA found in prehistoric mosquitoes preserved in amber. Before opening to the public, Hammond invites a select group of people to tour the park, including paleontologist **Dr. Alan Grant**, paleobotanist **Dr. Ellie Sattler**, and mathematician **Dr. Ian Malcolm**.
+
+The tour begins smoothly, but things quickly go wrong when the park's computer systems are sabotaged by the disgruntled programmer **Dennis Nedry**, who is attempting to steal dinosaur embryos. The security systems fail, and the dinosaurs break free from their enclosures.
+
+As the island descends into chaos, the visitors must survive encounters with various dangerous dinosaurs, including the intelligent and deadly **Velociraptors** and the massive **Tyrannosaurus Rex**. Dr. Grant finds himself responsible for Hammond's grandchildren, Tim and Lex, as they attempt to reach safety.
+
+The film explores themes of *scientific ethics*, the *hubris of trying to control nature*, and the *unintended consequences of genetic engineering*. It questions whether humans have the right to resurrect extinct species and whether scientific advancement should be pursued without considering the potential risks and moral implications.
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+<?php
+
+/*
+ * This file is part of the Symfony package.
+ *
+ * (c) Fabien Potencier <fabien@symfony.com>
+ *
+ * For the full copyright and license information, please view the LICENSE
+ * file that was distributed with this source code.
+ */
+
+namespace Symfony\AI\Store\Document;
+
+use Psr\Log\LoggerInterface;
+use Psr\Log\NullLogger;
+use Symfony\AI\Store\Indexer;
+
+/**
+ * Default implementation of DocumentProcessorInterface that orchestrates
+ * the complete document processing pipeline: load → transform → vectorize → store.
+ *
+ * @author Oskar Stark <oskarstark@googlemail.com>
+ */
+final readonly class DocumentProcessor implements DocumentProcessorInterface
+{
+    /**
+     * @param TransformerInterface[] $transformers
+     */
+    public function __construct(
+        private LoaderInterface $loader,
+        private array $transformers,
+        private Indexer $indexer,
+        private LoggerInterface $logger = new NullLogger(),
+    ) {
+    }
+
+    public function process(string|array $source, array $options = []): void
+    {
+        $this->logger->debug('Starting document processing', [
+            'source' => $source,
+            'options' => $options,
+        ]);
+
+        $sources = (array) $source;
+        $allDocuments = [];
+
+        // Load documents from all sources
+        foreach ($sources as $singleSource) {
+            $documents = ($this->loader)($singleSource, $options['loader'] ?? []);
+            foreach ($documents as $document) {
+                $allDocuments[] = $document;
+            }
+        }
+
+        // Transform documents through all transformers
+        $transformedDocuments = $allDocuments;
+        foreach ($this->transformers as $transformer) {
+            $transformedDocuments = ($transformer)($transformedDocuments, $options['transformer'] ?? []);
+        }
+
+        // Vectorize and store documents
+        $this->indexer->index($transformedDocuments, $options['chunk_size'] ?? 50);
+
+        $this->logger->debug('Document processing completed', [
+            'source' => $source,
+        ]);
+    }
+}
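
The load and transform stages of `process()` can be sketched in isolation as follows. This is a minimal illustration, not the library code: `loadSource()`, `runPipeline()`, and the two closures are hypothetical stand-ins, and documents are plain strings rather than `TextDocument` objects. It shows how the `(array)` cast accepts either a single path or a list, and how each transformer wraps the iterable returned by the previous one.

```php
<?php

// Hypothetical stand-in for a loader: yields one "document" (a string) per source.
function loadSource(string $source): iterable
{
    yield "content of {$source}";
}

// Hypothetical transformers: callables that take and return an iterable of documents.
$uppercase = static function (iterable $documents): iterable {
    foreach ($documents as $document) {
        yield strtoupper($document);
    }
};
$exclaim = static function (iterable $documents): iterable {
    foreach ($documents as $document) {
        yield $document.'!';
    }
};

function runPipeline(string|array $source, array $transformers): array
{
    // (array) wraps a single string in a one-element array; arrays pass through unchanged.
    $documents = [];
    foreach ((array) $source as $singleSource) {
        foreach (loadSource($singleSource) as $document) {
            $documents[] = $document;
        }
    }

    // Each transformer wraps the previous iterable. Generators are lazy,
    // so no document is transformed until the final result is consumed.
    $transformed = $documents;
    foreach ($transformers as $transformer) {
        $transformed = $transformer($transformed);
    }

    // Consume the generator chain; false discards the generators' overlapping keys.
    return is_array($transformed) ? $transformed : iterator_to_array($transformed, false);
}

$single = runPipeline('a.md', [$uppercase, $exclaim]);
$many = runPipeline(['a.md', 'b.md'], [$uppercase, $exclaim]);
```

In this sketch the final `iterator_to_array()` plays the role that the indexer plays in the commit: it is the consumer that pulls documents through the chain.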
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+<?php
+
+/*
+ * This file is part of the Symfony package.
+ *
+ * (c) Fabien Potencier <fabien@symfony.com>
+ *
+ * For the full copyright and license information, please view the LICENSE
+ * file that was distributed with this source code.
+ */
+
+namespace Symfony\AI\Store\Document;
+
+
+/**
+ * High-level interface for processing documents through the complete pipeline:
+ * load → transform → vectorize → store.
+ *
+ * @author Oskar Stark <oskarstark@googlemail.com>
+ */
+interface DocumentProcessorInterface
+{
+    /**
+     * Process a source through the complete indexing pipeline.
+     *
+     * @param string|array<string> $source Source identifier (file path, URL, etc.) or array of sources
+     * @param array<string, mixed> $options Processing options
+     */
+    public function process(string|array $source, array $options = []): void;
+
+}

src/store/src/Document/TextDocument.php

Lines changed: 5 additions & 0 deletions
@@ -28,4 +28,9 @@ public function __construct(
             throw new InvalidArgumentException('The content shall not be an empty string.');
         }
     }
+
+    public function withContent(string $content): self
+    {
+        return new self($this->id, $content, $this->metadata);
+    }
 }
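
The "wither" pattern added here can be tried out with a small stand-in class. The `Note` class below is hypothetical and much simpler than `TextDocument` (a string ID, no metadata), and its `trim()`-based check is an assumption inferred from the exception message and the tests in this commit, since the diff does not show the constructor's condition. Readonly classes require PHP 8.2 or later.

```php
<?php

// Hypothetical stand-in for TextDocument; the real class carries a Uuid and Metadata.
final readonly class Note
{
    public function __construct(
        public string $id,
        public string $content,
    ) {
        // Assumed validation: reject empty or whitespace-only content.
        if ('' === trim($content)) {
            throw new InvalidArgumentException('The content shall not be an empty string.');
        }
    }

    public function withContent(string $content): self
    {
        // Going through the constructor re-runs the validation above.
        return new self($this->id, $content);
    }
}

$original = new Note('doc-1', 'Original content');
$updated = $original->withContent('Updated content');

// Invalid content is rejected by the constructor, so no invalid copy can exist.
try {
    $original->withContent('   ');
    $rejected = false;
} catch (InvalidArgumentException) {
    $rejected = true;
}
```

Because the copy is built with `new self(...)`, the original instance is never modified, which is what lets transformers rewrite content without side effects on documents held elsewhere.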
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+<?php
+
+/*
+ * This file is part of the Symfony package.
+ *
+ * (c) Fabien Potencier <fabien@symfony.com>
+ *
+ * For the full copyright and license information, please view the LICENSE
+ * file that was distributed with this source code.
+ */
+
+namespace Symfony\AI\Store\Document\Transformer;
+
+use Symfony\AI\Store\Document\Metadata;
+use Symfony\AI\Store\Document\TextDocument;
+use Symfony\AI\Store\Document\TransformerInterface;
+use Symfony\Component\Uid\Uuid;
+
+/**
+ * Replaces specified text within document content.
+ *
+ * @author Oskar Stark <oskarstark@googlemail.com>
+ */
+final readonly class ReplaceTextTransformer implements TransformerInterface
+{
+    public const OPTION_SEARCH = 'search';
+    public const OPTION_REPLACE = 'replace';
+
+    public function __construct(
+        private string $search = '',
+        private string $replace = '',
+    ) {
+        self::validate($search, $replace);
+    }
+
+    /**
+     * @param array{search?: string, replace?: string} $options
+     */
+    public function __invoke(iterable $documents, array $options = []): iterable
+    {
+        $search = $options[self::OPTION_SEARCH] ?? $this->search;
+        $replace = $options[self::OPTION_REPLACE] ?? $this->replace;
+
+        self::validate($search, $replace);
+
+        foreach ($documents as $document) {
+            yield $document->withContent(str_replace($search, $replace, $document->content));
+        }
+    }
+
+    private static function validate(string $search, string $replace): void
+    {
+        if ($search === $replace) {
+            throw new \InvalidArgumentException('Search and replace strings must be different.');
+        }
+    }
+}
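
The option handling in this transformer can be exercised with a self-contained stand-in. `ReplaceSketch` below is a hypothetical class that copies the constructor defaults, the `??` fallback, and the validation rule from the diff, but operates on plain strings instead of `TextDocument` objects. Read this way, two behaviours follow from the code as written: per-call options override the constructor values, and constructing the class with both defaults left as `''` fails validation because the two strings are equal.

```php
<?php

// Hypothetical stand-in mirroring the option handling in the diff; documents are strings.
final readonly class ReplaceSketch
{
    public function __construct(
        private string $search = '',
        private string $replace = '',
    ) {
        self::validate($search, $replace);
    }

    public function __invoke(iterable $documents, array $options = []): iterable
    {
        // Per-call options take precedence over the constructor values.
        $search = $options['search'] ?? $this->search;
        $replace = $options['replace'] ?? $this->replace;
        self::validate($search, $replace);

        foreach ($documents as $document) {
            yield str_replace($search, $replace, $document);
        }
    }

    private static function validate(string $search, string $replace): void
    {
        if ($search === $replace) {
            throw new InvalidArgumentException('Search and replace strings must be different.');
        }
    }
}

$transformer = new ReplaceSketch('## Plot', '## Synopsis');
$default = iterator_to_array($transformer(['## Plot']), false);
$overridden = iterator_to_array($transformer(['## Plot'], ['replace' => '## Story']), false);

// With both defaults left as '', validation fails inside the constructor.
try {
    new ReplaceSketch();
    $defaultsAccepted = true;
} catch (InvalidArgumentException) {
    $defaultsAccepted = false;
}
```

One further consequence of `__invoke()` being a generator: its validation runs only when iteration starts, so invalid per-call options surface when the documents are consumed, not when the transformer is called.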

src/store/src/Document/Transformer/TextSplitTransformer.php

Lines changed: 11 additions & 2 deletions
@@ -29,13 +29,22 @@
     public const OPTION_CHUNK_SIZE = 'chunk_size';
     public const OPTION_OVERLAP = 'overlap';
 
+    public function __construct(
+        private int $chunkSize = 1000,
+        private int $overlap = 200,
+    ) {
+        if ($this->overlap < 0 || $this->overlap >= $this->chunkSize) {
+            throw new InvalidArgumentException(sprintf('Overlap must be non-negative and less than chunk size. Got chunk size: %d, overlap: %d', $this->chunkSize, $this->overlap));
+        }
+    }
+
     /**
      * @param array{chunk_size?: int, overlap?: int} $options
      */
     public function __invoke(iterable $documents, array $options = []): iterable
     {
-        $chunkSize = $options[self::OPTION_CHUNK_SIZE] ?? 1000;
-        $overlap = $options[self::OPTION_OVERLAP] ?? 200;
+        $chunkSize = $options[self::OPTION_CHUNK_SIZE] ?? $this->chunkSize;
+        $overlap = $options[self::OPTION_OVERLAP] ?? $this->overlap;
 
         if ($overlap < 0 || $overlap >= $chunkSize) {
             throw new InvalidArgumentException('Overlap must be non-negative and less than chunk size.');
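
The hunk above shows only the validation, not the splitting loop itself. As a rough illustration of why the overlap must be smaller than the chunk size, here is one common way to split text into overlapping windows. The function `splitWithOverlap()` is hypothetical and is not taken from `TextSplitTransformer`; it assumes character-based windows and needs the mbstring extension.

```php
<?php

/**
 * Splits text into windows of $chunkSize characters, each starting
 * ($chunkSize - $overlap) characters after the previous one.
 * Hypothetical helper for illustration only.
 *
 * @return list<string>
 */
function splitWithOverlap(string $text, int $chunkSize = 1000, int $overlap = 200): array
{
    if ($overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Overlap must be non-negative and less than chunk size.');
    }

    $length = mb_strlen($text);
    $step = $chunkSize - $overlap; // at least 1 thanks to the check above
    $chunks = [];

    for ($start = 0; $start < $length; $start += $step) {
        $chunks[] = mb_substr($text, $start, $chunkSize);
        if ($start + $chunkSize >= $length) {
            break; // this window already reached the end of the text
        }
    }

    return $chunks;
}

$chunks = splitWithOverlap('abcdefghij', 4, 2);
```

If the overlap were equal to or larger than the chunk size, the step would be zero or negative and the loop would never advance, which is the situation the constructor check rules out up front.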

src/store/tests/Document/TextDocumentTest.php

Lines changed: 34 additions & 0 deletions
@@ -247,4 +247,38 @@ public function testExceptionMessageIsCorrect()
 
         new TextDocument(Uuid::v4(), ' ');
     }
+
+    #[TestDox('withContent() creates new instance with updated content')]
+    public function testWithContent()
+    {
+        $id = Uuid::v4();
+        $originalContent = 'Original content';
+        $newContent = 'Updated content';
+        $metadata = new Metadata(['title' => 'Test Document']);
+
+        $originalDocument = new TextDocument($id, $originalContent, $metadata);
+        $updatedDocument = $originalDocument->withContent($newContent);
+
+        // Original document is unchanged
+        $this->assertSame($originalContent, $originalDocument->content);
+
+        // New document has updated content but same ID and metadata
+        $this->assertSame($newContent, $updatedDocument->content);
+        $this->assertSame($id, $updatedDocument->id);
+        $this->assertSame($metadata, $updatedDocument->metadata);
+
+        // Different instances
+        $this->assertNotSame($originalDocument, $updatedDocument);
+    }
+
+    #[TestDox('withContent() validates new content')]
+    public function testWithContentValidatesContent()
+    {
+        $document = new TextDocument(Uuid::v4(), 'Valid content');
+
+        $this->expectException(InvalidArgumentException::class);
+        $this->expectExceptionMessage('The content shall not be an empty string.');
+
+        $document->withContent(' ');
+    }
 }
