
🧠 Context Extraction Comparisons


This project aims to compare different context extraction strategies from text for use in artificial intelligence agents.

The goal is to evaluate and demonstrate how each technique identifies, organizes, and retrieves relevant information for reasoning, answering, and decision-making tasks in LLM-based agents.

🎯 Project Scope

The focus of this repository is to explore different context extraction methods, analyzing their advantages, limitations, and ideal use cases.

The approaches evaluated include both lexical search techniques and semantic or embedding-based methods, allowing a deeper understanding of how each one impacts the quality of responses generated by AI agents.

Each technique has been implemented in a modular way, making it easier to study, compare, and integrate with other applications.


💫 Support the Project

If this project was helpful to you or taught you something new, consider leaving a ⭐ star on the repository!

Your support helps increase the project's visibility and encourages the development of new improvements and comparisons. 🙌


🧩 Compared Techniques

1. Heuristics with Regex + Keyword Lists (basic level)

  • Description: direct pattern matching (my name is X) plus fixed lists of diets and preferences.
  • Pros: fast, easy to implement, no heavy dependencies.
  • Cons: fragile to linguistic variations, typos, and implicit context.
  • Example: already implemented in extractor_regex.py.
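A minimal sketch of this approach (the patterns and keyword list below are illustrative stand-ins; the actual extractor_regex.py may differ):

```python
import re

# Hypothetical pattern and keyword list mirroring the regex approach.
NAME_PATTERN = re.compile(r"\bmy name is (\w+)", re.IGNORECASE)
DIET_KEYWORDS = {"vegetarian", "vegan", "keto"}  # fixed, manually curated list

def extract(text: str) -> dict:
    """Extract a name and diet via direct pattern/keyword matching."""
    name_match = NAME_PATTERN.search(text)
    words = {w.lower() for w in re.findall(r"\w+", text)}
    return {
        "name": name_match.group(1) if name_match else None,
        "diet": next(iter(words & DIET_KEYWORDS), None),
    }

print(extract("Hi, my name is Lucas and I am vegetarian."))
# {'name': 'Lucas', 'diet': 'vegetarian'}
```

Note how the fixed keyword list fails silently on anything outside it ("I avoid meat" returns no diet), which is exactly the fragility listed above.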

2. Classical NLP with spaCy / NLTK (basic-intermediate level)

  • Description: uses Named Entity Recognition (NER) to extract names, and token analysis to identify diets/preferences.
  • Pros: more robust for names and entities; handles plural/singular, lemmatization, and POS tagging.
  • Cons: requires pre-trained models; does not understand more subjective context (e.g., conversational tone).
  • Example: spaCy with en_core_web_sm to identify PERSON, FOOD, etc.
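A runnable sketch of NER-style extraction. To avoid downloading a pretrained model, this uses spaCy's EntityRuler with hand-written patterns; a real setup would call spacy.load("en_core_web_sm") and rely on its statistical NER instead (the PERSON/FOOD patterns here are illustrative):

```python
import spacy

# Blank pipeline + EntityRuler mimics NER output without a model download;
# in practice: nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Lucas"},  # stand-in for statistical NER
    {"label": "FOOD", "pattern": "pizza"},    # FOOD is not a default spaCy label
])

doc = nlp("Hi, I'm Lucas and I love pizza.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Lucas', 'PERSON'), ('pizza', 'FOOD')]
```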

3. Semantic Rule-Based Extraction (spaCy Matcher or advanced RegEx) (intermediate level)

  • Description: combines classical NLP with semantic matchers (spaCy Matcher or PhraseMatcher), creating rules like "diet related to food" or "preference associated with verbs like enjoy/love".
  • Pros: balances flexibility and accuracy; easier to maintain than pure regex.
  • Cons: still relies on manual rules; may fail on ambiguous sentences.
  • Example: "I enjoy barbecue" → matcher detects verb enjoy + food noun.
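The "I enjoy barbecue" rule can be sketched with spaCy's Matcher. A blank English pipeline suffices for token-attribute rules; the verb and food vocabularies below are small illustrative stand-ins:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; enough for attribute-based rules
matcher = Matcher(nlp.vocab)

PREF_VERBS = ["enjoy", "love", "like"]   # preference verbs (illustrative)
FOODS = ["barbecue", "pizza", "sushi"]   # food terms (illustrative)

# Rule: a preference verb immediately followed by a food term.
matcher.add("PREFERENCE", [[
    {"LOWER": {"IN": PREF_VERBS}},
    {"LOWER": {"IN": FOODS}},
]])

doc = nlp("I enjoy barbecue")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)  # ['enjoy barbecue']
```

Because rules match token attributes rather than raw strings, adding lemma or POS constraints is a one-line change, which is what makes this easier to maintain than pure regex.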

4. Embeddings + Similarity (SBERT, OpenAI Embeddings, etc.) (advanced level)

  • Description: converts sentences into vectors and compares them with vectors representing "diets," "preferences," or "conversation tones."
  • Pros: understands linguistic variations (“I love steak” ≈ “barbecue”), works well with synonyms.
  • Cons: requires vector search infrastructure (e.g., FAISS, Redis Vector, Weaviate), more computationally expensive.
  • Example: user says "I’m into grilled meat" → high similarity with the "barbecue" vector.
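The comparison step can be sketched with plain cosine similarity. The 4-dimensional vectors below are hand-made stand-ins; in practice they would come from an embedding model such as SBERT (sentence-transformers) or the OpenAI embeddings API, and the search would run in a vector store:

```python
import math

# Stand-in vectors; real ones come from an embedding model.
concept_vectors = {
    "barbecue":   [0.9, 0.1, 0.0, 0.2],
    "vegetarian": [0.0, 0.8, 0.6, 0.1],
    "desserts":   [0.1, 0.2, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_concept(sentence_vector):
    """Return the concept whose vector is most similar to the input."""
    return max(concept_vectors, key=lambda c: cosine(sentence_vector, concept_vectors[c]))

# Pretend this vector was produced by embedding "I'm into grilled meat":
user_vector = [0.85, 0.15, 0.05, 0.25]
print(closest_concept(user_vector))  # barbecue
```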

5. Language Models / Supervised IE (LLMs, Fine-tuning, or In-Context Learning) (very advanced level)

  • Description: uses LLMs (GPT, LLaMA, etc.) with prompts for structured extraction, or models specifically trained for Information Extraction (IE).
  • Pros: extremely flexible and accurate; captures nuances such as conversational tone, intentions, and implicit context.
  • Cons: cost, latency, dependency on external models (if using API).
  • Example: send a prompt:
    Extract structured info: {"name": ..., "diet_type": ..., "preferences": ..., "conversation_tone": ...}
    Text: "Hi, I’m Lucas. I’m vegetarian and I prefer pizza and desserts. Please explain in detail."
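The prompt-building and response-parsing sides of this flow can be sketched with the standard library alone. The actual model call (e.g. via an LLM provider's API client) is deliberately replaced by a hypothetical hard-coded reply, so no network access is assumed:

```python
import json

# Fields we expect the model to fill, matching the prompt above.
SCHEMA = ("name", "diet_type", "preferences", "conversation_tone")

def build_prompt(text: str) -> str:
    """Build a structured-extraction prompt for the given text."""
    fields = ", ".join(f'"{k}": ...' for k in SCHEMA)
    return f'Extract structured info: {{{fields}}}\nText: "{text}"'

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, keeping only the expected fields."""
    data = json.loads(raw)
    return {k: data.get(k) for k in SCHEMA}

prompt = build_prompt("Hi, I'm Lucas. I'm vegetarian and I prefer pizza and desserts.")
# The LLM API call would go here; we parse a hypothetical reply instead:
reply = ('{"name": "Lucas", "diet_type": "vegetarian", '
         '"preferences": ["pizza", "desserts"], "conversation_tone": "friendly"}')
print(parse_response(reply)["name"])  # Lucas
```

Restricting the parsed output to a fixed schema is a simple guard against the hallucination risk noted above: extra or misspelled keys in the model's reply are dropped rather than propagated.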

📊 Results and Comparisons

Detailed analyses and comparative results are documented in the repository, allowing evaluation of efficiency, relevance of the extracted contexts, and suitability of each approach for different scenarios.

📌 Summary

| Method | Robustness | Ease of Use | Implicit Context | Ideal Use |
| --- | --- | --- | --- | --- |
| Regex + Keywords | | ⭐⭐⭐⭐ | | Best for quick starts and prototypes, but limited |
| Classical NLP (spaCy / NER) | ⭐⭐ | ⭐⭐⭐ | ⚠️ | Great trade-off between simplicity and robustness, but language-dependent |
| Semantic Rules (spaCy Matcher) | ⭐⭐⭐ | ⭐⭐ | ⚠️ | More powerful than regex, but still manual |
| Embeddings + Similarity | ⭐⭐⭐⭐ | ⭐⭐ | ✅ | Understands synonyms and semantics, great for implicit preferences |
| LLM / Supervised IE | ⭐⭐⭐⭐⭐ | ✅ | ✅ | Most robust and flexible, higher cost; ideal for complex, multilingual cases |

Detailed Comparison

| Approach | Language Support | Supported Text Length | Accuracy / Reliability | Technological Support | Ease of Use / Learning Curve |
| --- | --- | --- | --- | --- | --- |
| Regex + Keywords | Works in any language but relies on manual lists → fragile to linguistic variations | Ideal for short to medium texts; regex on large texts can be heavy | Low to medium; good for fixed patterns, poor for varied natural language | Supported in almost all programming languages. Cons: regex and lists need to be rewritten for each language | Very easy to understand, great for prototypes and initial teaching |
| Classical NLP (spaCy / NER) | Pretrained models for English, Spanish, German, Portuguese, etc. (limited by model availability) | Medium-length texts (up to a few pages); not ideal for very large documents | Medium; good for common names and entities, weak for subjective context | Mainly Python; other languages have equivalent libs (Stanza, CoreNLP, etc.), but integration varies | Relatively simple, good documentation and active community |
| Semantic Rules (spaCy Matcher) | Support depends on the spaCy base model → multiple languages already supported | Medium-length texts; matcher works well on sentences and paragraphs | Medium to high if rules are well-constructed; still fragile for unexpected structures | Python-first; difficult to replicate in other languages | Slightly more technical than regex, but educational for those familiar with NLP |
| Embeddings + Similarity | Multilingual (depends on model: SBERT has multilingual versions; OpenAI embeddings support dozens of languages) | Supports large texts, but long embeddings can be costly/slow | High for identifying concepts and synonyms; may fail on ambiguous sentences | Available in Python, JavaScript, Java, etc. (Python is more mature). Cons: dependency on vector search infrastructure | Harder than regex/spaCy but still educational; requires understanding of vectors and ML |
| LLM / Supervised IE | Multilingual (large models understand dozens of languages) | Supports very large texts (depending on model token limits) | Very high (captures context, tone, intentions); may still hallucinate | APIs available in multiple languages. Cons: cost, latency, external dependency | Very easy to use (prompt → response), but not very educational for learning fundamentals, as it abstracts the complexity |

📂 Repository Structure

The extractors/ folder contains the individual implementations of each extraction method.

Each file represents a specific technique, with its own logic.

🚀 How to Use

Install the dependencies with pip install -r requirements.txt, then run each comparison with python extractors/{file}.py.
