Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: #59691

Closed
1 of 3 tasks
Bratet opened this issue Sep 3, 2024 · 1 comment
Closed
1 of 3 tasks

ENH: #59691

Bratet opened this issue Sep 3, 2024 · 1 comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@Bratet
Copy link

Bratet commented Sep 3, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I believe adding a new similarity_merge method to the Pandas DataFrame class to merge DataFrames based on similar, but not necessarily identical, string values. This feature would be useful for merging data where exact matches are impractical due to typographical errors or variations in string formatting. I encountered this problem several times, i developed a function to solve it and i said to myself maybe it should become a feature in pandas, so here i am :)

Feature Description

Similarity Merge Feature for pandas

Key Components

  1. Similarity Metric: The function will use a string similarity metric to compare values. Initially, we'll implement Levenshtein distance, but the function will be designed to allow for other metrics in the future.

  2. Threshold: Users can specify a similarity threshold to determine when strings are considered a match.

  3. Multiple Matches: The function will handle cases where multiple potential matches exceed the threshold.

  4. Performance Optimization: To improve performance on large datasets, we'll implement some optimization strategies.

Pseudocode

def similarity_merge(left_df, right_df, left_on, right_on, threshold=0.8, method='levenshtein'):
    """
    Merge two DataFrames based on string similarity.
    
    Parameters:
    - left_df, right_df: DataFrames to merge
    - left_on, right_on: Columns to join on
    - threshold: Similarity threshold (0 to 1)
    - method: Similarity method ('levenshtein', 'jaccard', etc.)
    
    Returns:
    - Merged DataFrame
    """
    
    # Initialize similarity metric
    similarity_func = get_similarity_function(method)
    
    # Create dictionaries for faster lookup
    left_dict = create_lookup_dict(left_df, left_on)
    right_dict = create_lookup_dict(right_df, right_on)
    
    # Initialize result DataFrame
    result = []
    
    # Iterate through left DataFrame
    for left_key, left_rows in left_dict.items():
        best_matches = find_best_matches(left_key, right_dict, similarity_func, threshold)
        
        for right_key, similarity in best_matches:
            for left_row in left_rows:
                for right_row in right_dict[right_key]:
                    merged_row = merge_rows(left_row, right_row)
                    merged_row['similarity'] = similarity
                    result.append(merged_row)
    
    return pd.DataFrame(result)

def get_similarity_function(method):
    if method == 'levenshtein':
        return levenshtein_distance
    elif method == 'jaccard':
        return jaccard_similarity
    # Add more similarity methods as needed

def create_lookup_dict(df, column):
    return df.groupby(column).apply(lambda x: x.to_dict('records')).to_dict()

def find_best_matches(key, lookup_dict, similarity_func, threshold):
    matches = []
    for other_key in lookup_dict:
        similarity = similarity_func(key, other_key)
        if similarity >= threshold:
            matches.append((other_key, similarity))
    return sorted(matches, key=lambda x: x[1], reverse=True)

def merge_rows(left_row, right_row):
    # Implement row merging logic
    pass

def levenshtein_distance(s1, s2):
    # Implement Levenshtein distance calculation
    pass

def jaccard_similarity(s1, s2):
    # Implement Jaccard similarity calculation
    pass

Usage Example

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'name': ['John Doe', 'Jane Smith', 'Bob Johnson'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['John Doe', 'Jane Smth', 'Bobby Johnson'], 'score': [A, B, C]})

# Perform similarity merge
result = similarity_merge(df1, df2, left_on='name', right_on='name', threshold=0.8)

print(result)

Alternative Solutions

Alternative Solutions and Benefits of pandas Implementation

Alternative Solutions

There are several existing solutions that partially address the need for similarity-based merging:

  1. Third-party packages:

    • fuzzymatcher: Provides fuzzy matching capabilities for pandas DataFrames.
    • recordlinkage: Offers various methods for record linkage, including string similarity.
    • pandas-dedupe: Uses machine learning for deduplication and entity resolution.
  2. Custom functions:

    • It's possible to create a custom function using libraries like Levenshtein to perform similarity-based merging. Here's an example implementation:
import pandas as pd
from Levenshtein import ratio

def similarity_merge(left_df, right_df, left_on, right_on, threshold=0.8):
    """
    Merge two DataFrames based on string similarity.
    
    Parameters:
    - left_df, right_df: DataFrames to merge
    - left_on, right_on: Columns to join on
    - threshold: Similarity threshold (0 to 1)
    
    Returns:
    - Merged DataFrame
    """
    
    def find_matches(value, candidates, threshold):
        similarities = [ratio(value, candidate) for candidate in candidates]
        matches = [i for i, sim in enumerate(similarities) if sim >= threshold]
        return matches, [similarities[i] for i in matches]
    
    # Dictionaries for faster lookup
    left_dict = left_df.set_index(left_on).to_dict('index')
    right_dict = right_df.set_index(right_on).to_dict('index')
    
    right_keys = list(right_dict.keys())
    
    results = []
    
    for left_key, left_row in left_dict.items():
        matches, similarities = find_matches(left_key, right_keys, threshold)
        
        for match, similarity in zip(matches, similarities):
            right_key = right_keys[match]
            right_row = right_dict[right_key]
            
            merged_row = {**left_row, **right_row, 'similarity': similarity}
            merged_row[f'{left_on}_left'] = left_key
            merged_row[f'{right_on}_right'] = right_key
            
            results.append(merged_row)
    
    if not results:
        return pd.DataFrame()
    
    return pd.DataFrame(results)

This custom function provides a basic implementation of similarity-based merging using the Levenshtein ratio for string comparison.

  1. SQL-based solutions:
    • Some databases (e.g., PostgreSQL with pg_trgm) offer fuzzy matching capabilities that could be used in conjunction with pandas.

Benefits of Implementing in pandas

While these alternatives exist, implementing a similarity_merge function directly in pandas would offer several advantages:

  1. Native Integration: As a built-in pandas function, it would seamlessly integrate with existing pandas workflows, maintaining consistency in API and performance optimizations.

  2. Wider Adoption: Being part of the core pandas library would make it more accessible to users, encouraging broader adoption and community support.

  3. Comprehensive Documentation: Official pandas documentation would ensure clear, standardized usage guidelines and examples.

  4. Ongoing Maintenance: The pandas core team would maintain and improve the feature over time, ensuring its reliability and performance.

  5. Enhanced Functionality: A pandas implementation could handle multiple scenarios not covered by the current custom function:

    a. Multiple column matching: Allow similarity comparison across multiple columns simultaneously.

    b. Customizable similarity metrics: Support various similarity metrics (Levenshtein, Jaccard, cosine similarity, etc.) and allow users to provide custom metrics.

    c. Handling of non-string data: Extend similarity matching to numerical or categorical data with appropriate metrics.

    d. Asymmetric thresholds: Allow different thresholds for left and right DataFrames or even row-specific thresholds.

    e. Parallelization: Implement parallel processing for improved performance on large datasets.

    f. Memory efficiency: Optimize for memory usage, crucial for very large DataFrames.

    g. Handling of multi-index DataFrames: Extend functionality to work with multi-index DataFrames.

    h. Incremental merging: Allow for incremental updates to merged results as new data comes in.

THANK YOU FOR YOUR TIME !!

Additional Context

No response

@Bratet Bratet added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2024
@mroeschke
Copy link
Member

Thanks for the request. This request is similar to #34543 or #10309 - joining based on some condition applied on the join keys so closing to keep the discussion in those issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

2 participants