CV Parser

A Ruby gem for parsing and extracting structured information from CVs/resumes using LLM providers.

Features

Multiple file format support: PDF, DOCX, TXT, and Markdown files
Smart file processing: Converts DOCX to PDF, processes text files directly (no upload required)
Extract structured data from CVs using leading LLM providers
Multiple LLM providers: OpenAI, Anthropic, and Faker (for testing)
Customizable output schema using JSON Schema format
Command-line interface for quick parsing and analysis
Performance optimized: Text files bypass upload for faster processing
Robust error handling and validation

Installation

Add this line to your application's Gemfile:

gem 'cv-parser'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install cv-parser

Usage

Using in Rails

You can use CV Parser directly in your Ruby or Rails application to extract structured data from CVs.

Basic Configuration

You can configure the gem for different providers:

require 'cv_parser'

# OpenAI
CvParser.configure do |config|
  config.provider = :openai
  config.api_key = ENV['OPENAI_API_KEY']
  config.model = 'gpt-4.1-mini'
  config.output_schema = schema
end

# Anthropic
CvParser.configure do |config|
  config.provider = :anthropic
  config.api_key = ENV['ANTHROPIC_API_KEY']
  config.model = 'claude-3-sonnet-20240229'
  config.output_schema = schema
end

# Faker (for testing/development)
CvParser.configure do |config|
  config.provider = :faker
  config.output_schema = schema
end

Defining an Output Schema

Define the schema for the data you want to extract using JSON Schema format:

schema = {
  type: "json_schema",
  name: "cv_parsing",
  description: "Schema for a CV or resume document",
  properties: {
    personal_info: {
      type: "object",
      description: "Personal and contact information for the candidate",
      properties: {
        name: {
          type: "string",
          description: "Full name of the individual"
        },
        email: {
          type: "string",
          description: "Email address of the individual"
        },
        phone: {
          type: "string",
          description: "Phone number of the individual"
        },
        location: {
          type: "string",
          description: "Geographic location or city of residence"
        }
      },
      required: %w[name email]
    },
    experience: {
      type: "array",
      description: "List of professional experience entries",
      items: {
        type: "object",
        description: "A professional experience entry",
        properties: {
          company: {
            type: "string",
            description: "Name of the company or organization"
          },
          position: {
            type: "string",
            description: "Job title or position held"
          },
          start_date: {
            type: "string",
            description: "Start date of employment (e.g. '2020-01')"
          },
          end_date: {
            type: "string",
            description: "End date of employment or 'present'"
          },
          description: {
            type: "string",
            description: "Description of responsibilities and achievements"
          }
        },
        required: %w[company position start_date]
      }
    },
    education: {
      type: "array",
      description: "List of educational qualifications",
      items: {
        type: "object",
        description: "An education entry",
        properties: {
          institution: {
            type: "string",
            description: "Name of the educational institution"
          },
          degree: {
            type: "string",
            description: "Degree or certification received"
          },
          field: {
            type: "string",
            description: "Field of study"
          },
          graduation_date: {
            type: "string",
            description: "Graduation date (e.g. '2019-06')"
          }
        },
        required: %w[institution degree]
      }
    },
    skills: {
      type: "array",
      description: "List of relevant skills",
      items: {
        type: "string",
        description: "A single skill"
      }
    }
  },
  required: %w[personal_info experience education skills]
}

Set the output schema in the configuration block:

CvParser.configure do |config|
  config.output_schema = schema
end

You can also set the output schema in the extractor method which will override the configuration block:

extractor = CvParser::Extractor.new
extractor.extract(
  output_schema: schema
)

Extracting Data from a CV

extractor = CvParser::Extractor.new

# Extract from PDF (uploaded to LLM)
result = extractor.extract(
  file_path: "path/to/resume.pdf"
)

# Extract from text file (fast, no upload)
result = extractor.extract(
  file_path: "path/to/resume.txt"
)

# Extract from markdown file (fast, no upload)
result = extractor.extract(
  file_path: "path/to/resume.md"
)

puts "Name: #{result['personal_info']['name']}"
puts "Email: #{result['personal_info']['email']}"
result['skills'].each { |skill| puts "- #{skill}" }

Error Handling

begin
  result = extractor.extract(
    file_path: "path/to/resume.txt"  # Works with any supported format
  )
rescue CvParser::FileNotFoundError, CvParser::FileNotReadableError => e
  puts "File error: #{e.message}"
rescue CvParser::EmptyTextFileError => e
  puts "Text file is empty: #{e.message}"

rescue CvParser::TextFileEncodingError => e
  puts "Text file encoding error: #{e.message}"
rescue CvParser::ParseError => e
  puts "Error parsing the response: #{e.message}"
rescue CvParser::APIError => e
  puts "LLM API error: #{e.message}"
rescue CvParser::ConfigurationError => e
  puts "Configuration error: #{e.message}"
end

Command-Line Interface

CV Parser also provides a CLI for quick analysis:

# Process different file formats
cv-parser path/to/resume.pdf
cv-parser path/to/resume.docx
cv-parser path/to/resume.txt
cv-parser path/to/resume.md

# Use different providers
cv-parser --provider anthropic path/to/resume.pdf
cv-parser --provider openai path/to/resume.txt

# Output options
cv-parser --format yaml --output result.yaml path/to/resume.md
cv-parser --schema custom-schema.json path/to/resume.txt
cv-parser --help

You can use environment variables for API keys and provider selection:

export OPENAI_API_KEY=your-openai-key
export ANTHROPIC_API_KEY=your-anthropic-key
export CV_PARSER_PROVIDER=openai
export CV_PARSER_API_KEY=your-api-key
cv-parser resume.pdf

Supported File Formats

CV Parser supports multiple file formats with optimized processing:

File Format Support

Format	Extension	Processing Method	Upload Required	Performance
PDF	`.pdf`	Direct upload	Yes	Standard
DOCX	`.docx`	Convert to PDF → Upload	Yes	Standard
Text	`.txt`	Direct text processing	No	Fast
Markdown	`.md`	Direct text processing	No	Fast

Performance Benefits of Text Files

Text files (.txt and .md) offer significant performance advantages:

No file upload overhead: Content is included directly in the API request
Faster processing: Eliminates the upload → reference workflow
Reduced API calls: Single request instead of upload + process
Lower bandwidth usage: Direct text inclusion vs binary file transfer
Better for automation: Simpler integration in automated workflows

File Size Limits

PDF/DOCX files: Limited by LLM provider (typically 20MB)
Text files: No explicit size limits (limited only by LLM provider)

File Processing Examples

# Fast text processing (no upload)
extractor.extract(file_path: "resume.txt", output_schema: schema)
extractor.extract(file_path: "resume.md", output_schema: schema)

# Standard file processing (with upload)
extractor.extract(file_path: "resume.pdf", output_schema: schema)
extractor.extract(file_path: "resume.docx", output_schema: schema)

Advanced Configuration

You can further customize CV Parser by setting advanced options in the configuration block. For example:

CvParser.configure do |config|
  # Configure OpenAI with organization ID
  config.provider = :openai
  config.api_key = ENV['OPENAI_API_KEY']
  config.model = 'gpt-4.1-mini'
  
  # Set timeout for file uploads (important for larger files)
  config.timeout = 120  # TODO - not yet implemented
  config.max_retries = 2  # TODO - not yet implemented
  
  # Provider-specific options
  config.provider_options[:organization_id] = ENV['OPENAI_ORG_ID']

  # You can also set custom prompts for the LLM:
  config.prompt = "Extract the following fields from the CV..."
  config.system_prompt = "You are a CV parsing assistant."

  # Set the output schema (JSON Schema format)
  config.output_schema = schema

  # Set the max tokens and temperature
  config.max_tokens = 4000
  config.temperature = 0.1
end

Testing and Development

Using the Faker Provider

The Faker provider generates realistic-looking fake data based on your schema without making API calls. This is useful for:

Writing tests (RSpec, Rails, etc.)
Developing UI components
Demonstrating functionality without API keys
Avoiding API costs and rate limits
Tests run faster without external API calls
Consistent, predictable results
No need for API keys in CI/CD environments

Basic Test Setup

Here's how to use the faker provider in your RSpec tests:

# spec/your_resume_processor_spec.rb
require 'spec_helper'

RSpec.describe YourResumeProcessor do
  # Define a JSON Schema format schema for testing
  let(:test_schema) do
    {
      type: "json_schema",
      name: "cv_parsing_test",
      description: "Test schema for CV parsing",
      properties: {
        personal_info: {
          type: "object",
          description: "Personal information",
          properties: {
            name: {
              type: "string",
              description: "Full name"
            },
            email: {
              type: "string",
              description: "Email address"
            }
          },
          required: %w[name email]
        },
        skills: {
          type: "array",
          description: "List of skills",
          items: {
            type: "string",
            description: "A skill"
          }
        }
      },
      required: %w[personal_info skills]
    }
  end

  before do
    # Configure CV Parser to use the faker provider
    CvParser.configure do |config|
      config.provider = :faker
    end
  end

  after do
    # Reset configuration after tests
    CvParser.reset
  end

  it "processes a resume and extracts relevant fields" do
    processor = YourResumeProcessor.new
    result = processor.process_resume("spec/fixtures/sample_resume.pdf", test_schema)
    
    # The faker provider will return consistent test data
    expect(result.personal_info.name).to eq("John Doe")
    expect(result.personal_info.email).to eq("john.doe@example.com")
    expect(result.skills).to be_an(Array)
    expect(result.skills).not_to be_empty
  end
end

Simple Faker Example

# Configure with Faker provider
CvParser.configure do |config|
  config.provider = :faker
end

# Use the extractor as normal
extractor = CvParser::Extractor.new
result = extractor.extract(
  file_path: "path/to/resume.pdf",  # Path will be ignored by faker
  output_schema: schema  # Using the JSON Schema format defined above
)

# Faker will generate structured data based on your schema
puts result.inspect

Data Generation Behavior

The faker provider generates realistic-looking data based on your schema. The data is deterministic for fields like name, email, and phone, but randomized for arrays and collections. You can write tests that check for structure without relying on specific content for variable fields.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
bin		bin
exe		exe
lib		lib
spec		spec
.gitignore		.gitignore
.rspec		.rspec
.rubocop.yml		.rubocop.yml
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
LICENSE.txt		LICENSE.txt
README.md		README.md
Rakefile		Rakefile
cv-parser.gemspec		cv-parser.gemspec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CV Parser

Features

Installation

Usage

Using in Rails

Basic Configuration

Defining an Output Schema

Extracting Data from a CV

Error Handling

Command-Line Interface

Supported File Formats

File Format Support

Performance Benefits of Text Files

File Size Limits

File Processing Examples

Advanced Configuration

Testing and Development

Using the Faker Provider

Basic Test Setup

Simple Faker Example

Data Generation Behavior

About

Uh oh!

Releases 1

Packages

Languages

License

gysmuller/cv-parser

Folders and files

Latest commit

History

Repository files navigation

CV Parser

Features

Installation

Usage

Using in Rails

Basic Configuration

Defining an Output Schema

Extracting Data from a CV

Error Handling

Command-Line Interface

Supported File Formats

File Format Support

Performance Benefits of Text Files

File Size Limits

File Processing Examples

Advanced Configuration

Testing and Development

Using the Faker Provider

Basic Test Setup

Simple Faker Example

Data Generation Behavior

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages