The Learning Lab Module is a RESTful API service that allows users to upload a variety of document types (up to 1GB), process them (text extraction, conversion, indexing, summarization, and moderation), and generate answers using MongoDB Vector Search and LLM integration. It stores document metadata, text, and vector embeddings in MongoDB Atlas and original files in AWS S3.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Document Upload │     │ Text Extraction │     │ Vector Embedding│
│                 │     │                 │     │                 │
│ ┌─────────────┐ │     │ ┌─────────────┐ │     │ ┌─────────────┐ │
│ │ Client      │ │     │ │ AWS Textract│ │     │ │ LLM Service │ │
│ │ HTTP Request│─┼────►│ │ Transcribe  │─┼────►│ │ Embedding   │ │
│ │ Multipart   │ │     │ │ PDF/Office  │ │     │ │ Generation  │ │
│ └─────────────┘ │     │ └─────────────┘ │     │ └─────────────┘ │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Answer          │     │ Vector Search   │     │ MongoDB Atlas   │
│ Generation      │     │                 │     │                 │
│ ┌─────────────┐ │     │ ┌─────────────┐ │     │ ┌─────────────┐ │
│ │ LLM         │ │     │ │ MongoDB     │ │     │ │ Documents   │ │
│ │ Context     │◄┼─────│ │ vectorSearch│◄┼─────│ │ Embeddings  │ │
│ │ Integration │ │     │ │ Aggregation │ │     │ │ Vector Index│ │
│ └─────────────┘ │     │ └─────────────┘ │     │ └─────────────┘ │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
- Project Structure - Overview of the codebase organization
- S3 Vector Store Guide - Guide to the S3-based vector store (legacy)
- RAG Testing Guide - How to test the RAG pipeline with real documents
- AWS Deployment Guide - Guide for deploying to AWS
- API Documentation - Comprehensive API documentation
- Code Refactoring - Details about the code reorganization
- Document Upload & Storage: Securely upload files and store them in AWS S3.
- Processing Pipeline: Extract text using AWS Textract (images), AWS Transcribe (audio/video), PDF parsing (pdf-parse), Excel (xlsx), Word (mammoth), and CSV processing.
- Metadata & Tagging: Store and manage document metadata and tags in MongoDB.
- Asynchronous Processing: Utilize Bull and Redis for handling asynchronous processing with advanced error recovery.
- Content Moderation: Uses AWS Rekognition to detect inappropriate content in images and videos before storing them.
- API Endpoints: Endpoints for uploading, checking status, tagging, searching, and deleting documents.
- MongoDB Vector Search: Uses MongoDB Atlas vector search capabilities for semantic similarity matching.
- LLM Integration: Generate answers to queries using LLM models with document context.
- Error Recovery: Advanced retry mechanism for failed processing jobs with exponential backoff.
- Performance Tests: Comprehensive performance testing suite to ensure system scalability.
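The retry mechanism described above follows a standard exponential-backoff schedule. A minimal sketch of the delay computation (base and cap values are illustrative; Bull can apply an equivalent schedule via job options such as `{ attempts: 5, backoff: { type: 'exponential', delay: 1000 } }`):

```javascript
// Delay before retry `attempt` (0-based): doubles each time, capped at maxMs.
// Illustrative values; the real queue configuration may differ.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// First few retries: 1s, 2s, 4s, 8s, ... capped at 60s.
console.log([0, 1, 2, 3].map(a => backoffDelay(a))); // [ 1000, 2000, 4000, 8000 ]
```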
The system leverages MongoDB Atlas Vector Search for efficient similarity searches:
1. **Document Processing**
   - Documents are uploaded to S3 (original files)
   - Text is extracted and stored in MongoDB
   - Embeddings are generated using the LLM service
   - Both text and embeddings are stored in the MongoDB document

2. **Vector Indexing**
   - MongoDB Atlas creates and maintains the vector index
   - The index is configured for efficient similarity searches
   - Document embeddings are indexed automatically

3. **Similarity Search**
   - User queries are converted to embeddings
   - MongoDB's `$vectorSearch` aggregation finds similar documents
   - Results are ranked by similarity score

4. **Answer Generation**
   - Most relevant documents form the context for the LLM
   - The LLM generates a response using this context
   - References to source documents are included in the response
```javascript
const searchResults = await documentsCollection.aggregate([
  {
    // Vector similarity search against the Atlas index
    $vectorSearch: {
      index: 'vector_index',
      queryVector: queryEmbedding,
      path: 'embedding',
      numCandidates: 100,
      limit: 5
    }
  },
  {
    // Note: this $match filters *after* the 5-result limit, so fewer than
    // 5 documents may be returned. $vectorSearch's own `filter` option
    // (with these fields indexed as type "filter") avoids that.
    $match: {
      userId: userId,
      status: 'processed'
    }
  },
  {
    $project: {
      _id: 1,
      textS3Key: 1,
      name: 1,
      cleanedText: 1,
      score: { $meta: 'vectorSearchScore' }
    }
  }
]).toArray();
```
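The top-ranked results then become the LLM context. A hedged sketch of that assembly step; `buildContext`, the score threshold, and the character budget are illustrative, not part of the codebase:

```javascript
// Assemble an LLM context string from vector-search results.
// Field names (name, cleanedText, score) mirror the $project stage above;
// minScore and maxChars are hypothetical tuning knobs.
function buildContext(searchResults, { minScore = 0.7, maxChars = 8000 } = {}) {
  const parts = [];
  let used = 0;
  for (const doc of searchResults) {
    if (doc.score < minScore) continue;          // skip weak matches
    const snippet = `[${doc.name}]\n${doc.cleanedText}`;
    if (used + snippet.length > maxChars) break; // respect the context budget
    parts.push(snippet);
    used += snippet.length;
  }
  return parts.join('\n\n');
}

// Only the high-scoring document survives the threshold:
const context = buildContext([
  { name: 'mars.pdf', cleanedText: 'Mars mission timelines...', score: 0.91 },
  { name: 'notes.txt', cleanedText: 'Unrelated notes.', score: 0.42 },
]);
```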
The document is uploaded to S3 under the `docs/` folder (e.g., `docs/sample.pdf`).
- AWS Textract extracts text from images.
- AWS Transcribe processes audio/video files into text.
- Rekognition moderates image and video content for inappropriate materials before allowing storage.
- Extracted text is cleaned and stored in MongoDB
- LLM service generates embeddings for the text
- Embeddings and cleaned text are stored in the document record
- MongoDB Atlas vector index makes the embeddings searchable
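The cleaning step above might look like the following; the exact rules (which characters to strip, how to collapse whitespace) are illustrative, not taken from the codebase:

```javascript
// Normalize extracted text before embedding: drop control characters
// and collapse runs of whitespace. Rules here are illustrative.
function cleanText(raw) {
  return raw
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, '') // control chars
    .replace(/\s+/g, ' ')                                     // collapse whitespace
    .trim();
}
```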
- User query is converted to embedding vector
- Vector search finds relevant documents
- Document text is combined to form context
- LLM generates answer using context and user query
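The four steps above can be sketched end to end. `embed`, `search`, and `llm` are injected placeholders for the real embedding, vector-search, and LLM services, not actual function names from the codebase:

```javascript
// End-to-end /generate flow, with service calls injected so the
// orchestration logic stands alone.
async function generateAnswer(prompt, { embed, search, llm }) {
  const queryVector = await embed(prompt);                   // 1. query -> embedding
  const docs = await search(queryVector);                    // 2. vector search
  const context = docs.map(d => d.cleanedText).join('\n\n'); // 3. build context
  const answer = await llm(`Context:\n${context}\n\nQuestion: ${prompt}`); // 4. answer
  return { answer, sources: docs.map(d => d.name) };         // include source references
}
```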
Before running the application, you must set up MongoDB Atlas with vector search capability:
1. **Create an Atlas Cluster**
   - Sign up for MongoDB Atlas (https://www.mongodb.com/cloud/atlas)
   - Create a new cluster (M10 or higher for vector search)
   - Enable vector search capability in cluster settings

2. **Create a Vector Index**
   - Go to "Atlas Search" in your cluster
   - Create a new index on your lessons collection
   - Index name: `vector_index`
   - Configuration:

     ```json
     {
       "fields": [
         {
           "path": "embedding",
           "type": "vector",
           "numDimensions": 1536,
           "similarity": "cosine"
         }
       ]
     }
     ```

3. **Set Connection String**
   - Get your MongoDB Atlas connection string
   - Update your `.env` file with the connection string
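Since the index above expects 1536-dimension vectors, it can help to validate embeddings before writing them; this guard is a suggestion, not part of the codebase:

```javascript
// Reject embeddings that don't match the Atlas index definition
// (1536 finite numbers); mismatched vectors won't satisfy the index
// and would be missing from search results.
function assertEmbeddingShape(embedding, dimensions = 1536) {
  if (!Array.isArray(embedding) || embedding.length !== dimensions) {
    throw new Error(`embedding must be an array of ${dimensions} numbers`);
  }
  if (!embedding.every(Number.isFinite)) {
    throw new Error('embedding must contain only finite numbers');
  }
  return embedding;
}
```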
Configure your AWS credentials and set up an IAM user with the necessary policies.
```bash
# Configure the AWS CLI (replace the placeholders with your actual credentials)
aws configure set aws_access_key_id YOUR_ACCESS_KEY_ID
aws configure set aws_secret_access_key YOUR_SECRET_ACCESS_KEY
aws configure set default.region YOUR_DEFAULT_REGION
```
This application requires several AWS services to be properly configured:
- AWS S3 - For document storage
- AWS Textract - For OCR and document text extraction
- AWS Transcribe - For audio/video transcription
- AWS Rekognition - For content moderation
```bash
# Verify AWS credentials
aws sts get-caller-identity

# Create a new IAM user (replace testUser with your desired username)
aws iam create-user --user-name testUser

# Attach AWS managed policies to the new user
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AdministratorAccess-AWSElasticBeanstalk
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AmazonRekognitionFullAccess
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AmazonTextractFullAccess
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AmazonTranscribeFullAccess
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AWSElasticBeanstalkWebTier
aws iam attach-user-policy --user-name testUser --policy-arn arn:aws:iam::aws:policy/AWSElasticBeanstalkWorkerTier
```
- Node.js (v14+)
- MongoDB Atlas (M10 or higher cluster with vector search)
- Redis
- AWS account with credentials for S3, Textract, Transcribe, and Rekognition
1. **Clone the Repository:**

   ```bash
   git clone <repository-url>
   cd <repository-directory>
   ```

2. **Install Dependencies:**

   ```bash
   npm install
   ```

3. **Create a `.env` File:**

   ```
   PORT=3000
   MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/learninglab
   AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
   AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
   AWS_REGION=us-east-1
   S3_BUCKET=YOUR_S3_BUCKET_NAME
   REDIS_HOST=127.0.0.1
   REDIS_PORT=6379
   ACCESS_TOKEN_SECRET=test-secret
   LLM_API_KEY=your-api-key-for-embedding-generation
   ```

4. **Start Redis:**

   ```bash
   redis-server
   ```

5. **Run the Application:**

   ```bash
   # Production mode
   npm start

   # Development mode with auto-reload
   npm run dev
   ```
Run the Jest test suite:

```bash
npm test
```
Upload a document:

```bash
curl -X POST http://localhost:3000/documents/upload \
  -F "file=@/path/to/your/file.pdf" \
  -F "name=My Document" \
  -F "tags=tag1,tag2" \
  -H "Authorization: Bearer <your-jwt-token>"
```
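Server-side, the upload endpoint enforces the 1 GB cap mentioned at the top of this document. A sketch of that validation; the extension allow-list is illustrative, not the service's actual list:

```javascript
// Basic upload validation: the size cap mirrors the documented 1 GB limit;
// the extension list is a hypothetical subset of supported formats.
const MAX_BYTES = 1024 * 1024 * 1024; // 1 GB
const ALLOWED_EXTENSIONS = new Set([
  '.pdf', '.docx', '.xlsx', '.csv', '.png', '.jpg', '.mp3', '.mp4',
]);

function validateUpload(filename, sizeBytes) {
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot).toLowerCase();
  if (!ALLOWED_EXTENSIONS.has(ext)) return { ok: false, reason: 'unsupported file type' };
  if (sizeBytes > MAX_BYTES) return { ok: false, reason: 'file exceeds 1 GB limit' };
  return { ok: true };
}
```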
Generate an answer from your documents:

```bash
curl -X POST http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-jwt-token>" \
  -d '{"prompt": "What information is in my documents about space exploration?"}'
```
Log in to obtain a JWT token:

```bash
curl -X POST http://localhost:3000/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin123"}'
```
For more API details and examples, see the API Documentation.