This project aims to compare different context extraction strategies from text for use in artificial intelligence agents.
The goal is to evaluate and demonstrate how each technique identifies, organizes, and retrieves relevant information for reasoning, answering, and decision-making tasks in LLM-based agents.
The focus of this repository is to explore different context extraction methods, analyzing their advantages, limitations, and ideal use cases.
The approaches evaluated include both lexical search techniques and semantic or embedding-based methods, allowing a deeper understanding of how each one impacts the quality of responses generated by AI agents.
Each technique has been implemented in a modular way, making it easier to study, compare, and integrate with other applications.
If this project has been helpful to you or taught you something new, consider leaving a ⭐ star on the repository!
Your support helps increase the project's visibility and encourages the development of new improvements and comparisons. 🙌
1. Regex + Keywords
- Description: direct pattern search (`my name is X`) plus fixed lists of diets and preferences.
- Pros: fast, easy to implement, no heavy dependencies.
- Cons: fragile to linguistic variations, typos, and implicit context.
- Example: already implemented in `extractor_regex.py`.
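The approach above can be sketched in a few lines. The pattern and the diet list below are illustrative only, not the repository's actual implementation in `extractor_regex.py`:

```python
import re

# Illustrative fixed list of diets; a real extractor would maintain richer,
# per-language lists.
DIETS = {"vegetarian", "vegan", "keto", "paleo"}

def extract_name(text):
    """Direct pattern search for 'my name is X'."""
    match = re.search(r"\bmy name is\s+(\w+)", text, re.IGNORECASE)
    return match.group(1) if match else None

def extract_diet(text):
    """Keyword lookup against the fixed diet list."""
    for word in re.findall(r"\w+", text.lower()):
        if word in DIETS:
            return word
    return None
```

Note how the fragility shows up immediately: a phrasing like "People call me Lucas" slips past the `my name is` pattern entirely.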
2. Classical NLP (spaCy / NER)
- Description: uses Named Entity Recognition (NER) to extract names, and token analysis to identify diets/preferences.
- Pros: more robust for names and entities; handles plural/singular, lemmatization, and POS tagging.
- Cons: requires pre-trained models; does not understand more subjective context (e.g., conversational tone).
- Example: spaCy with `en_core_web_sm` to identify `PERSON`, `FOOD`, etc.
3. Semantic Rules (spaCy Matcher)
- Description: combines classical NLP with semantic matchers (spaCy Matcher or PhraseMatcher), creating rules like "diet related to food" or "preference associated with verbs like enjoy/love".
- Pros: balances flexibility and accuracy; easier to maintain than pure regex.
- Cons: still relies on manual rules; may fail on ambiguous sentences.
- Example: "I enjoy barbecue" → matcher detects verb enjoy + food noun.
4. Embeddings + Similarity
- Description: converts sentences into vectors and compares them with vectors representing "diets," "preferences," or "conversation tones."
- Pros: understands linguistic variations (“I love steak” ≈ “barbecue”), works well with synonyms.
- Cons: requires vector search infrastructure (e.g., FAISS, Redis Vector, Weaviate), more computationally expensive.
- Example: user says "I’m into grilled meat" → high similarity with the "barbecue" vector.
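The mechanics can be shown with hand-written toy vectors. In a real system the vectors would come from an embedding model (e.g. SBERT or the OpenAI embeddings API) and live in a vector store such as FAISS or Weaviate; the 3-dimensional vectors below are purely illustrative:

```python
import math

# Toy 3-dimensional concept vectors, written by hand for illustration only.
# Real embeddings have hundreds of dimensions and come from a trained model.
CONCEPT_VECTORS = {
    "barbecue":   [0.9, 0.8, 0.1],
    "vegetarian": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def closest_concept(sentence_vector):
    """Return the concept whose vector is most similar to the sentence's."""
    return max(CONCEPT_VECTORS,
               key=lambda c: cosine_similarity(sentence_vector, CONCEPT_VECTORS[c]))
```

A sentence like "I'm into grilled meat" would be embedded near the "barbecue" vector, so the nearest-neighbor lookup returns that concept even though the word never appears in the text.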
5. Language Models / Supervised IE (LLMs, Fine-tuning, or In-Context Learning) (very advanced level)
- Description: uses LLMs (GPT, LLaMA, etc.) with prompts for structured extraction, or models specifically trained for Information Extraction (IE).
- Pros: extremely flexible and accurate; captures nuances such as conversational tone, intentions, and implicit context.
- Cons: cost, latency, dependency on external models (if using API).
- Example: send a prompt such as:

```
Extract structured info: {"name": ..., "diet_type": ..., "preferences": ..., "conversation_tone": ...}
Text: "Hi, I’m Lucas. I’m vegetarian and I prefer pizza and desserts. Please explain in detail."
```
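The prompt above can be wrapped in a small helper. The actual model call (OpenAI, LLaMA, etc.) is omitted here, and the sample reply in the test is invented for illustration — real LLM output must be validated, since models can hallucinate or return malformed JSON:

```python
import json

PROMPT_TEMPLATE = (
    'Extract structured info: {"name": ..., "diet_type": ..., '
    '"preferences": ..., "conversation_tone": ...}\n'
    'Text: "<TEXT>"'
)

def build_prompt(text):
    # Insert the user text; the call to the chat model itself is omitted.
    return PROMPT_TEMPLATE.replace("<TEXT>", text)

def parse_reply(raw):
    """Parse the model's JSON reply; raise if required fields are missing."""
    data = json.loads(raw)
    missing = {"name", "diet_type", "preferences", "conversation_tone"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data
```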
Detailed analyses and comparative results are documented in the repository, allowing evaluation of efficiency, relevance of the extracted contexts, and suitability of each approach for different scenarios.
| Method | Robustness | Ease of Use | Implicit Context | Ideal Example |
|---|---|---|---|---|
| Regex + Keywords | ⭐ | ⭐⭐⭐⭐ | ❌ | Best for quick starts and prototypes, but limited |
| Classical NLP (spaCy / NER) | ⭐⭐ | ⭐⭐⭐ | ❌ | Great trade-off between simplicity and robustness, but language-dependent |
| Semantic Rules (spaCy Matcher) | ⭐⭐⭐ | ⭐⭐ | ⚠️ | More powerful than regex, but still manual |
| Embeddings + Similarity | ⭐⭐⭐⭐ | ⭐⭐ | ✅ | Understands synonyms and semantics, great for implicit preferences |
| LLM / Supervised IE | ⭐⭐⭐⭐⭐ | ⭐ | ✅✅ | Most robust and flexible, but higher cost; ideal for complex, multilingual cases |
| Approach | Language Support | Supported Text Length | Accuracy / Reliability | Technological Support | Ease of Use / Learning Curve |
|---|---|---|---|---|---|
| Regex + Keywords | Works in any language but relies on manual lists → fragile to linguistic variations | Ideal for short to medium texts; regex on large texts can be heavy | Low to medium; good for fixed patterns, poor for varied natural language | Supported in almost all programming languages. Cons: regex and lists must be rewritten for each language | Very easy to understand; great for prototypes and initial teaching |
| Classical NLP (spaCy / NER) | Pretrained models for English, Spanish, German, Portuguese, etc. (limited by model availability) | Medium-length texts (up to a few pages); not ideal for very large documents | Medium; good for common names and entities, weak for subjective context | Mainly Python; other languages have equivalent libs (Stanza, CoreNLP, etc.), but integration varies | Relatively simple; good documentation and active community |
| Semantic Rules (spaCy Matcher) | Support depends on the spaCy base model → multiple languages already supported | Medium-length texts; the matcher works well on sentences and paragraphs | Medium to high if rules are well constructed; still fragile for unexpected structures | Python-first; difficult to replicate in other languages | Slightly more technical than regex, but educational for those familiar with NLP |
| Embeddings + Similarity | Multilingual (depends on model: SBERT has multilingual versions; OpenAI embeddings support dozens of languages) | Supports large texts, but embedding long texts can be costly/slow | High for identifying concepts and synonyms; may fail on ambiguous sentences | Available in Python, JavaScript, Java, etc. (Python is the most mature). Cons: depends on vector search infrastructure | Harder than regex/spaCy but still educational; requires understanding of vectors and ML |
| LLM / Supervised IE | Multilingual (large models understand dozens of languages) | Supports very large texts (subject to model token limits) | Very high (captures context, tone, intentions); may still hallucinate | APIs available in multiple languages. Cons: cost, latency, external dependency | Very easy to use (prompt → response), but less educational for learning fundamentals, as it abstracts the complexity |
The `extractors/` folder contains the individual implementations of each extraction method.
Each file implements a specific technique with its own logic.

To run a comparison, install the dependencies with `pip install -r requirements.txt` and execute the desired extractor with `python extractors/{file}.py`.