Releases: DataFog/datafog-python

🚧 Beta Release 4.3.0b1

05 Jun 02:39
Pre-release

Beta Release Notes

Beta Release: 2025-06-05

⚠️ This is a beta release for testing purposes.

🚀 New Features

  • fix(ci): add diagnostics and plugin verification for benchmark tests
  • Merge pull request #104 from DataFog/feature/sample-notebooks
  • Merge branch 'dev' into feature/sample-notebooks
  • Fix segmentation fault in beta-release workflow and add sample notebook
  • Merge pull request #103 from DataFog/feature/sample-notebooks
  • Merge pull request #102 from DataFog/feature/gliner-integration-v420
  • Merge branch 'dev' into feature/gliner-integration-v420
  • Merge branch 'feature/gliner-integration-v420' of github.com:DataFog/datafog-python into feature/gliner-integration-v420
  • Merge pull request #101 from DataFog/feature/gliner-integration-v420
  • Merge branch 'dev' into feature/gliner-integration-v420
  • Merge pull request #100 from DataFog/feature/gliner-integration-v420
  • docs: add release guidelines to Claude.md

🐛 Bug Fixes

  • Merge pull request #108 from DataFog/fix/beta-workflow-changelog-v2
  • Merge branch 'dev' into fix/beta-workflow-changelog-v2
  • Merge pull request #107 from DataFog/fix/beta-workflow-changelog-v2
  • Merge branch 'fix/performance-regression' into dev
  • fix(ci): improve beta version detection to check existing git tags
  • Merge branch 'fix/performance-regression' of github.com:DataFog/datafog-python into fix/performance-regression
  • fix(ci): improve beta versioning logic and use GH_PAT token
  • fix(ci): replace invalid --benchmark-skip flag with simple performance test
  • Merge pull request #106 from DataFog/fix/performance-regression
  • Merge branch 'dev' into fix/performance-regression
  • Merge pull request #105 from DataFog/fix/performance-regression
  • fix(ci): reset benchmark baseline to resolve false regression alerts
  • fix(performance): eliminate memory debugging overhead from benchmarks
  • fix(performance): eliminate redundant regex calls in structured output mode
  • fix(ci): handle segfault gracefully while preserving test validation
  • fix(tests): make spaCy address detection test more robust
  • fix(ci): improve GLiNER validation to confirm PyTorch exclusion
  • fix(ci): exclude PyTorch dependencies entirely to prevent segfault
  • fix(ci): eliminate PyTorch segfaults and enhance README with GLiNER examples
  • fix(ci): workaround for PyTorch segfault in CI environments
  • fix(ci): split test execution to prevent memory segfault
  • fix(ci): reduce coverage reporting to prevent segmentation fault
  • fix(tests): resolve final GLiNER test failures
  • fix(tests): update GLiNER test mocking for proper import paths
  • fix(tests): resolve GLiNER dependency mocking for CI environments
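The beta version detection fix listed above is described only by its commit message. As a rough illustration of the idea (checking existing git tags before choosing the next beta number), a minimal sketch follows. The function name `next_beta_tag`, the optional `v` prefix, and the tag list passed in are assumptions made for this example, not the workflow's actual code; in a real pipeline the tags would typically come from `git tag --list`.

```python
import re
from typing import Iterable


def next_beta_tag(base_version: str, existing_tags: Iterable[str]) -> str:
    """Return the next unused beta version for base_version.

    Hypothetical helper: scans tags such as "v4.3.0b1" or "4.3.0b1" and
    returns base_version followed by the next beta number.
    """
    pattern = re.compile(rf"^v?{re.escape(base_version)}b(\d+)$")
    beta_numbers = []
    for tag in existing_tags:
        match = pattern.match(tag)
        if match:
            beta_numbers.append(int(match.group(1)))
    # With no matching tags, start at b1; otherwise go one past the highest.
    return f"{base_version}b{max(beta_numbers, default=0) + 1}"


if __name__ == "__main__":
    tags = ["v4.2.0b3", "v4.3.0b1"]  # sample data, not the repository's real tag list
    print(next_beta_tag("4.3.0", tags))
```

Taking the highest existing number instead of counting tags keeps the result correct even if an earlier beta tag was deleted.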

🔧 Other Changes

  • chore: bump version to 4.3.0 for next development cycle
  • chore: clean up test changelog file after merge
  • chore: set version to 4.2.0b1 for beta testing of unreleased 4.2.0
  • resolve: merge conflicts with enhanced segfault detection

📥 Installation

# Core package (lightweight)
pip install datafog

# With all features
pip install "datafog[all]"

📊 Metrics

  • Package size: ~2MB (core)
  • Install time: ~10 seconds
  • Tests passing: ✅
  • Commits this week: 46

🚧 Beta Release 4.2.0b3

31 May 03:15
Pre-release

Beta Release Notes

Beta Release: 2025-05-31

⚠️ This is a beta release for testing purposes.

🚀 New Features

  • fix(ci): add diagnostics and plugin verification for benchmark tests
  • Merge pull request #104 from DataFog/feature/sample-notebooks
  • Merge branch 'dev' into feature/sample-notebooks
  • Fix segmentation fault in beta-release workflow and add sample notebook
  • Merge pull request #103 from DataFog/feature/sample-notebooks
  • Merge pull request #102 from DataFog/feature/gliner-integration-v420
  • Merge branch 'dev' into feature/gliner-integration-v420
  • Merge branch 'feature/gliner-integration-v420' of github.com:DataFog/datafog-python into feature/gliner-integration-v420
  • Merge pull request #101 from DataFog/feature/gliner-integration-v420
  • Merge branch 'dev' into feature/gliner-integration-v420
  • Merge pull request #100 from DataFog/feature/gliner-integration-v420
  • docs: add release guidelines to Claude.md
  • feat(nlp): add GLiNER integration with smart cascading engine
  • fix(deps): add pydantic-settings to cli and all extras
  • Merge pull request #92 from DataFog/feature/automated-release-pipeline
  • feat(ci): configure release workflows for 4.2.0 minor version bump
  • feat(ci): add comprehensive alpha→beta→stable release cycle
  • feat(ci): add nightly alpha builds for Monday-Thursday
  • Merge pull request #91 from DataFog/feature/implement-weekly-release-plan
  • feat(release): implement weekly release plan infrastructure

🐛 Bug Fixes

  • Merge pull request #107 from DataFog/fix/beta-workflow-changelog-v2
  • Merge branch 'fix/performance-regression' into dev
  • fix(ci): improve beta version detection to check existing git tags
  • Merge branch 'fix/performance-regression' of github.com:DataFog/datafog-python into fix/performance-regression
  • fix(ci): improve beta versioning logic and use GH_PAT token
  • fix(ci): replace invalid --benchmark-skip flag with simple performance test
  • Merge pull request #106 from DataFog/fix/performance-regression
  • Merge branch 'dev' into fix/performance-regression
  • Merge pull request #105 from DataFog/fix/performance-regression
  • fix(ci): reset benchmark baseline to resolve false regression alerts
  • fix(performance): eliminate memory debugging overhead from benchmarks
  • fix(performance): eliminate redundant regex calls in structured output mode
  • fix(ci): handle segfault gracefully while preserving test validation
  • fix(tests): make spaCy address detection test more robust
  • fix(ci): improve GLiNER validation to confirm PyTorch exclusion
  • fix(ci): exclude PyTorch dependencies entirely to prevent segfault
  • fix(ci): eliminate PyTorch segfaults and enhance README with GLiNER examples
  • fix(ci): workaround for PyTorch segfault in CI environments
  • fix(ci): split test execution to prevent memory segfault
  • fix(ci): reduce coverage reporting to prevent segmentation fault
  • fix(tests): resolve final GLiNER test failures
  • fix(tests): update GLiNER test mocking for proper import paths
  • fix(tests): resolve GLiNER dependency mocking for CI environments
  • Merge pull request #99 from DataFog/fix/github-actions-workflow-fixes
  • Merge branch 'dev' into fix/github-actions-workflow-fixes
  • fix(deps): move pydantic-settings to core dependencies
  • fix(ci): install all extras and configure pytest-asyncio in workflows
  • Merge pull request #98 from DataFog/fix/github-actions-workflow-fixes
  • fix(ci): resolve YAML syntax errors in GitHub Actions workflows
  • Merge pull request #96 from DataFog/codex/fix-failing-github-actions-in-workflows
  • fix release workflows
  • Merge pull request #95 from DataFog/hotfix/readme-fix
  • Merge branch 'dev' into hotfix/readme-fix
  • fix(ci): remove indentation from Python code in workflow commands
  • fix(text): resolve missing Span import for structured output
  • fix(ci): resolve YAML syntax issues in workflow files
  • fix(ci): resolve prettier pre-commit hook configuration
  • fix(ci): resolve YAML syntax issues in release workflows
  • fix(lint): resolve flake8 string formatting warnings
  • fix(ci): restore expected job names and consolidate workflows
  • fix(imports): resolve flake8 E402 import order issues

📚 Documentation

  • docs: streamline Claude.md development guide for v4.2.0
  • fixed readme

🔧 Other Changes

  • chore: clean up test changelog file after merge
  • chore: set version to 4.2.0b1 for beta testing of unreleased 4.2.0
  • resolve: merge conflicts with enhanced segfault detection
  • release: prepare v4.2.0 with GLiNER integration
  • updated workflows
  • Merge pull request #94 from DataFog/hotfix/beta-workflow-yaml-syntax
  • Merge branch 'dev' into hotfix/beta-workflow-yaml-syntax
  • Merge pull request #93 from DataFog/hotfix/beta-workflow-yaml-syntax

📥 Installation

# Core package (lightweight)
pip install datafog

# With all features
pip install "datafog[all]"

📊 Metrics

  • Package size: ~2MB (core)
  • Install time: ~10 seconds
  • Tests passing: ✅
  • Commits this week: 75

release: prepare v4.2.0 with GLiNER integration

31 May 03:36

DataFog 4.2.0 - GLiNER Integration Release

Released: 2025-05-30

🚀 Major Features

GLiNER Integration

  • Modern NER Engine: Added GLiNER (Generalist Named Entity Recognition) support
  • Smart Cascading: Intelligent progression from regex → GLiNER → spaCy
  • 32x Performance: GLiNER provides 32x faster NER compared to spaCy baseline
  • PII-Specialized Models: Support for urchade/gliner_multi_pii-v1 and other models

Engine Selection

from datafog.services.text_service import TextService

# New GLiNER engine
service = TextService(engine="gliner")

# Smart cascading (recommended)
service = TextService(engine="smart")  # regex → GLiNER → spaCy
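The cascading idea (try the cheapest engine first, fall through to heavier ones only when needed) can be sketched without any DataFog dependency. Everything below is an illustrative assumption: the engine functions are stand-ins built on the standard library, and the stopping rule (return as soon as one engine finds anything) is a guess at one reasonable policy, not a description of how `engine="smart"` decides internally.

```python
import re
from typing import Callable, Dict, List

# An "engine" here is any callable that maps text to {label: [matches]}.
Engine = Callable[[str], Dict[str, List[str]]]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def regex_engine(text: str) -> Dict[str, List[str]]:
    """Cheap first stage: structured patterns only."""
    emails = EMAIL_RE.findall(text)
    return {"EMAIL": emails} if emails else {}


def fallback_engine(text: str) -> Dict[str, List[str]]:
    """Stand-in for a heavier ML stage (GLiNER or spaCy in the notes above)."""
    names = NAME_RE.findall(text)
    return {"PERSON": names} if names else {}


def cascade(text: str, engines: List[Engine]) -> Dict[str, List[str]]:
    """Run engines in order and stop at the first one that finds entities."""
    for engine in engines:
        found = engine(text)
        if found:
            return found
    return {}


if __name__ == "__main__":
    stages = [regex_engine, fallback_engine]
    print(cascade("Contact jane@example.com", stages))
    print(cascade("please ask Jane Doe", stages))
```

Ordering the stages from cheapest to most expensive means the heavier model only runs on text the regex stage could not handle.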

Performance Improvements

  • 190x faster regex engine for structured PII (emails, phones, SSNs)
  • Lightweight core: <2MB package with optional ML extras
  • Memory optimization: Enhanced segfault handling and performance validation
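For context on why a regex engine suits structured PII: emails, phone numbers, and SSNs follow fixed shapes, so precompiled patterns can find them in a single pass with no model loading. The patterns and the `find_structured_pii` helper below are simplified assumptions for illustration (US-style formats only) and are not DataFog's actual patterns.

```python
import re
from typing import Dict, List

# Compiled once at import time so repeated calls do not recompile.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_structured_pii(text: str) -> Dict[str, List[str]]:
    """Return every match for each pattern, keyed by entity label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


if __name__ == "__main__":
    sample = "Mail bob@example.org, SSN 123-45-6789, phone 555-867-5309."
    for label, matches in find_structured_pii(sample).items():
        print(label, matches)
```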

🐛 Bug Fixes

  • Fixed CI segmentation faults in test environments
  • Resolved benchmark regression detection
  • Improved dependency management for optional ML features
  • Enhanced test stability across platforms

🔧 Infrastructure

  • Comprehensive CI/CD improvements
  • Enhanced GitHub Actions workflows
  • Better error handling and diagnostics
  • Sample notebooks and examples

📥 Installation

# Core package (lightweight)
pip install datafog

# With GLiNER support
pip install "datafog[nlp-advanced]"

# Everything included
pip install "datafog[all]"
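Because the ML engines are optional extras, calling code may want to check which ones are importable before selecting an engine. A minimal sketch using only the standard library follows. The helper name `available_engines` and the engine-to-module mapping are assumptions for this example; the notes above do not list the modules each extra installs.

```python
from importlib.util import find_spec
from typing import Dict, List


def available_engines(optional_modules: Dict[str, str]) -> List[str]:
    """List engines whose backing module can be imported.

    "regex" is always included because it needs no optional dependency.
    optional_modules maps an engine name to the top-level module it needs.
    """
    engines = ["regex"]
    for engine, module in optional_modules.items():
        # find_spec returns None when a top-level module is not installed.
        if find_spec(module) is not None:
            engines.append(engine)
    return engines


if __name__ == "__main__":
    # Assumed mapping: engine name -> importable module name.
    print(available_engines({"gliner": "gliner", "spacy": "spacy"}))
```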

📊 Performance Comparison

Engine  Speed vs spaCy  Accuracy           Use Case
regex   190x faster     High (structured)  Emails, phones, SSNs
gliner  32x faster      Very High          Modern NER
spacy   1x (baseline)   Good               Traditional NLP
smart   60x faster      Highest            Best balance

v4.1.1

25 May 19:49
a9eaffd

What's Changed

Full Changelog: v4.1.0...v4.1.1

v4.1.0

25 May 19:33
8cc6ad9

What's Changed

Full Changelog: v4.0.0...v4.1.0

v4.0.0

30 Aug 20:49
4be5015

What's Changed

Full Changelog: v3.4.0...v4.0.0

v3.4.0

06 Aug 21:30
492ab5c

What's Changed

New Contributors

Full Changelog: v3.3.0...v3.4.0

v3.3.0

14 Jul 19:55
1e9a024

What's Changed

New Contributors

Full Changelog: v3.2.1...v3.3.0

v3.2.1

28 May 04:01
4311508

Made PySpark an optional dependency (faster install)
Added an OpenTelemetry implementation (logging to support dev priorities)

v3.2.0: Improved OCR, streamlined functions, and more

14 May 16:39

First, thanks to everyone for bearing with us as we've made some notable architectural changes over the past several releases.
A big part of the motivation was to orient the package toward better long-term development and toward where DataFog is used today, and will likely be used in the future: within API services.

  • Implemented Pytesseract: significantly faster and more accurate OCR text extraction compared with Donut!
    • Allows for better image and PDF extraction
  • Enhanced test suite coverage
  • Refactored definitions to support async (for API integration)
  • Refactored classes/functions around ImageService, TextService, SparkService
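As a rough picture of what async support buys an API service, the sketch below wraps a blocking function so several documents can be processed concurrently. The names `extract_text_sync`, `extract_text`, and `extract_many` are hypothetical placeholders, not DataFog functions, and `asyncio.to_thread` (Python 3.9+) is just one way to expose blocking work to async callers.

```python
import asyncio
from typing import List


def extract_text_sync(document: str) -> str:
    """Placeholder for blocking work such as OCR or text annotation."""
    return document.strip().upper()


async def extract_text(document: str) -> str:
    """Run the blocking function in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(extract_text_sync, document)


async def extract_many(documents: List[str]) -> List[str]:
    """Process documents concurrently; results keep the input order."""
    return await asyncio.gather(*(extract_text(doc) for doc in documents))


if __name__ == "__main__":
    print(asyncio.run(extract_many([" first page ", "second page"])))
```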