
🧠 Context Extraction Comparisons


This project aims to compare different context extraction strategies from text for use in artificial intelligence agents.

The goal is to evaluate and demonstrate how each technique identifies, organizes, and retrieves relevant information for reasoning, answering, and decision-making tasks in LLM-based agents.

🎯 Project Scope

The focus of this repository is to explore different context extraction methods, analyzing their advantages, limitations, and ideal use cases.

The approaches evaluated include both lexical search techniques and semantic or embedding-based methods, allowing a deeper understanding of how each one impacts the quality of responses generated by AI agents.

Each technique has been implemented in a modular way, making it easier to study, compare, and integrate with other applications.


💫 Support the Project

If this project was helpful to you or taught you something new, consider leaving a ⭐ star on the repository!

Your support helps increase the project's visibility and encourages the development of new improvements and comparisons. 🙌


🧩 Compared Techniques

1. Heuristics with Regex + Keyword Lists (basic level)

  • Description: direct pattern matching (my name is X) plus fixed lists of diets and preferences.
  • Pros: fast, easy to implement, no heavy dependencies.
  • Cons: fragile to linguistic variations, typos, and implicit context.
  • Example: already implemented in extractor_regex.py.
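A minimal sketch of this approach (the patterns and keyword list below are illustrative stand-ins; the actual extractor_regex.py may differ):

```python
import re

# Hypothetical pattern and keyword list mirroring the regex approach.
NAME_PATTERN = re.compile(r"\bmy name is (\w+)", re.IGNORECASE)
DIET_KEYWORDS = {"vegetarian", "vegan", "keto"}  # fixed, manually curated list

def extract(text: str) -> dict:
    """Extract a name and diet via direct pattern/keyword matching."""
    name_match = NAME_PATTERN.search(text)
    words = {w.lower() for w in re.findall(r"\w+", text)}
    return {
        "name": name_match.group(1) if name_match else None,
        "diet": next(iter(words & DIET_KEYWORDS), None),
    }

print(extract("Hi, my name is Lucas and I am vegetarian."))
# {'name': 'Lucas', 'diet': 'vegetarian'}
```

Note how the fixed keyword list fails silently on anything outside it ("I avoid meat" returns no diet), which is exactly the fragility listed above.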

2. Classical NLP with spaCy / NLTK (basic-intermediate level)

  • Description: uses Named Entity Recognition (NER) to extract names, and token analysis to identify diets/preferences.
  • Pros: more robust for names and entities; handles plural/singular, lemmatization, and POS tagging.
  • Cons: requires pre-trained models; does not understand more subjective context (e.g., conversational tone).
  • Example: spaCy with en_core_web_sm to identify PERSON, FOOD, etc.
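A runnable sketch of NER-style extraction. To avoid downloading a pretrained model, this uses spaCy's EntityRuler with hand-written patterns; a real setup would call spacy.load("en_core_web_sm") and rely on its statistical NER instead (the PERSON/FOOD patterns here are illustrative):

```python
import spacy

# Blank pipeline + EntityRuler mimics NER output without a model download;
# in practice: nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Lucas"},  # stand-in for statistical NER
    {"label": "FOOD", "pattern": "pizza"},    # FOOD is not a default spaCy label
])

doc = nlp("Hi, I'm Lucas and I love pizza.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Lucas', 'PERSON'), ('pizza', 'FOOD')]
```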

3. Semantic Rule-Based Extraction (spaCy Matcher or advanced RegEx) (intermediate level)

  • Description: combines classical NLP with semantic matchers (spaCy Matcher or PhraseMatcher), creating rules like "diet related to food" or "preference associated with verbs like enjoy/love".
  • Pros: balances flexibility and accuracy; easier to maintain than pure regex.
  • Cons: still relies on manual rules; may fail on ambiguous sentences.
  • Example: "I enjoy barbecue" → matcher detects verb enjoy + food noun.
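The "I enjoy barbecue" rule can be sketched with spaCy's Matcher. A blank English pipeline suffices for token-attribute rules; the verb and food vocabularies below are small illustrative stand-ins:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; enough for attribute-based rules
matcher = Matcher(nlp.vocab)

PREF_VERBS = ["enjoy", "love", "like"]   # preference verbs (illustrative)
FOODS = ["barbecue", "pizza", "sushi"]   # food terms (illustrative)

# Rule: a preference verb immediately followed by a food term.
matcher.add("PREFERENCE", [[
    {"LOWER": {"IN": PREF_VERBS}},
    {"LOWER": {"IN": FOODS}},
]])

doc = nlp("I enjoy barbecue")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)  # ['enjoy barbecue']
```

Because rules match token attributes rather than raw strings, adding lemma or POS constraints is a one-line change, which is what makes this easier to maintain than pure regex.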

4. Embeddings + Similarity (SBERT, OpenAI Embeddings, etc.) (advanced level)

  • Description: converts sentences into vectors and compares them with vectors representing "diets," "preferences," or "conversation tones."
  • Pros: understands linguistic variations (“I love steak” ≈ “barbecue”), works well with synonyms.
  • Cons: requires vector search infrastructure (e.g., FAISS, Redis Vector, Weaviate), more computationally expensive.
  • Example: user says "I’m into grilled meat" → high similarity with the "barbecue" vector.
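The comparison step can be sketched with plain cosine similarity. The 4-dimensional vectors below are hand-made stand-ins; in practice they would come from an embedding model such as SBERT (sentence-transformers) or the OpenAI embeddings API, and the search would run in a vector store:

```python
import math

# Stand-in vectors; real ones come from an embedding model.
concept_vectors = {
    "barbecue":   [0.9, 0.1, 0.0, 0.2],
    "vegetarian": [0.0, 0.8, 0.6, 0.1],
    "desserts":   [0.1, 0.2, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_concept(sentence_vector):
    """Return the concept whose vector is most similar to the input."""
    return max(concept_vectors, key=lambda c: cosine(sentence_vector, concept_vectors[c]))

# Pretend this vector was produced by embedding "I'm into grilled meat":
user_vector = [0.85, 0.15, 0.05, 0.25]
print(closest_concept(user_vector))  # barbecue
```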

5. Language Models / Supervised IE (LLMs, Fine-tuning, or In-Context Learning) (very advanced level)

  • Description: uses LLMs (GPT, LLaMA, etc.) with prompts for structured extraction, or models specifically trained for Information Extraction (IE).
  • Pros: extremely flexible and accurate; captures nuances such as conversational tone, intentions, and implicit context.
  • Cons: cost, latency, dependency on external models (if using API).
  • Example: send a prompt:
    Extract structured info: {"name": ..., "diet_type": ..., "preferences": ..., "conversation_tone": ...}
    Text: "Hi, I’m Lucas. I’m vegetarian and I prefer pizza and desserts. Please explain in detail."
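The prompt-building and response-parsing sides of this flow can be sketched with the standard library alone. The actual model call (e.g. via an LLM provider's API client) is deliberately replaced by a hypothetical hard-coded reply, so no network access is assumed:

```python
import json

# Fields we expect the model to fill, matching the prompt above.
SCHEMA = ("name", "diet_type", "preferences", "conversation_tone")

def build_prompt(text: str) -> str:
    """Build a structured-extraction prompt for the given text."""
    fields = ", ".join(f'"{k}": ...' for k in SCHEMA)
    return f'Extract structured info: {{{fields}}}\nText: "{text}"'

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, keeping only the expected fields."""
    data = json.loads(raw)
    return {k: data.get(k) for k in SCHEMA}

prompt = build_prompt("Hi, I'm Lucas. I'm vegetarian and I prefer pizza and desserts.")
# The LLM API call would go here; we parse a hypothetical reply instead:
reply = ('{"name": "Lucas", "diet_type": "vegetarian", '
         '"preferences": ["pizza", "desserts"], "conversation_tone": "friendly"}')
print(parse_response(reply)["name"])  # Lucas
```

Restricting the parsed output to a fixed schema is a simple guard against the hallucination risk noted above: extra or misspelled keys in the model's reply are dropped rather than propagated.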

📊 Results and Comparisons

Detailed analyses and comparative results are documented in the repository, allowing evaluation of efficiency, relevance of the extracted contexts, and suitability of each approach for different scenarios.

📌 Summary

| Method | Robustness | Ease of Use | Implicit Context | Ideal Use |
| --- | --- | --- | --- | --- |
| Regex + Keywords | | ⭐⭐⭐⭐ | | Best for quick starts and prototypes, but limited |
| Classical NLP (spaCy / NER) | ⭐⭐ | ⭐⭐⭐ | ⚠️ | Great trade-off between simplicity and robustness, but language-dependent |
| Semantic Rules (spaCy Matcher) | ⭐⭐⭐ | ⭐⭐ | ⚠️ | More powerful than regex, but still manual |
| Embeddings + Similarity | ⭐⭐⭐⭐ | ⭐⭐ | ✅ | Understands synonyms and semantics, great for implicit preferences |
| LLM / Supervised IE | ⭐⭐⭐⭐⭐ | ✅ | ✅ | Most robust and flexible, higher cost; ideal for complex, multilingual cases |

Detailed Comparison

| Approach | Language Support | Supported Text Length | Accuracy / Reliability | Technological Support | Ease of Use / Learning Curve |
| --- | --- | --- | --- | --- | --- |
| Regex + Keywords | Works in any language but relies on manual lists → fragile to linguistic variations | Ideal for short to medium texts; regex on large texts can be heavy | Low to medium; good for fixed patterns, poor for varied natural language | Supported in almost all programming languages. Cons: regex and lists need to be rewritten for each language | Very easy to understand, great for prototypes and initial teaching |
| Classical NLP (spaCy / NER) | Pretrained models for English, Spanish, German, Portuguese, etc. (limited by model availability) | Medium-length texts (up to a few pages); not ideal for very large documents | Medium; good for common names and entities, weak for subjective context | Mainly Python; other languages have equivalent libs (Stanza, CoreNLP, etc.), but integration varies | Relatively simple, good documentation and active community |
| Semantic Rules (spaCy Matcher) | Support depends on the spaCy base model → multiple languages already supported | Medium-length texts; matcher works well on sentences and paragraphs | Medium to high if rules are well-constructed; still fragile for unexpected structures | Python-first; difficult to replicate in other languages | Slightly more technical than regex, but educational for those familiar with NLP |
| Embeddings + Similarity | Multilingual (depends on model: SBERT has multilingual versions; OpenAI embeddings support dozens of languages) | Supports large texts, but long embeddings can be costly/slow | High for identifying concepts and synonyms; may fail on ambiguous sentences | Available in Python, JavaScript, Java, etc. (Python is more mature). Cons: dependency on vector search infrastructure | Harder than regex/spaCy but still educational; requires understanding of vectors and ML |
| LLM / Supervised IE | Multilingual (large models understand dozens of languages) | Supports very large texts (depending on model token limits) | Very high (captures context, tone, intentions); may still hallucinate | APIs available in multiple languages. Cons: cost, latency, external dependency | Very easy to use (prompt → response), but not very educational for learning fundamentals, as it abstracts the complexity |

📂 Repository Structure

The extractors/ folder contains the individual implementations of each extraction method.

Each file represents a specific technique, with its own logic.

🚀 How to Use

Install the dependencies with pip install -r requirements.txt, then run each comparison with python extractors/{file}.py.
