Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparing two separate columns #10

Closed
thomasdebeus opened this issue May 3, 2018 · 3 comments
Closed

Comparing two separate columns #10

thomasdebeus opened this issue May 3, 2018 · 3 comments

Comments

@thomasdebeus
Copy link

thomasdebeus commented May 3, 2018

Hi Chris,

Great package! I'm a journalist and a lot of my time is spend in merging two data frames on the country, municipality, name or party column. These columns often contain different spellings for the same entity.

Now your package comes in handy, only I haven't figured out yet how to compare two of 'the same' columns and name the strings like I use refinr on a single vector. I'm not that experienced in R so maybe this sounds a little bit vague. Maybe my examples make things a bit clearer.

library(tidyverse)
library(refinr)

# I would like to add the values (and the right name's) of this example df...
df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
  )

# ...to the following df
df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "blu",
  "C", "YellowPurple",
  "C", "green"
  )

# The following code of course produces NA values
df3 <- left_join(df1, df2, by = c("uid", "name"))

# While the following is the desired outcome

# A tibble: 4 x 3
  uid   name             value
  <chr> <chr>          <dbl>
1 A     Red                    13 
2 B     Blue                     3
3 C     yellow-purple  789   
4 C     green                 17 

If this is possible, it would safe me so much time!

Thanks in advance.

@ChrisMuir
Copy link
Owner

Hi @thomasdebeus , glad to hear you've found the package useful!

For your question, yes this is possible to do. Here's an example:

library(tidyverse)
library(refinr)

df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
)

df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "bluee",
  "C", "Yellow Purple",
  "C", "green"
)

# Run the values from both "name" variables through the refinr functions (as a 
# single vector).
names_r <- c(df1$name, df2$name) %>% 
  key_collision_merge(dict = df1$name) %>% 
  n_gram_merge()

# Create vector of labels, then split up "names_r" by those labels, and assign
# the refined values back to their original data frames.
names_labels <- c(rep("df1", nrow(df1)), rep("df2", nrow(df2)))
df1$name <- names_r[names_labels == "df1"]
df2$name <- names_r[names_labels == "df2"]

# Perform join.
left_join(df2, df1, by = c("uid", "name"))
#> # A tibble: 4 x 3
#>   uid   name          value
#>  <chr> <chr>         <dbl>
#> 1 A     Red             13.
#> 2 B     Blue             3.
#> 3 C     yellow-purple  789.
#> 4 C     green           17.

There's probably a cleaner, more elegant way to achieve this using tidyverse functions and the pipe %>% that wouldn't require creating intermediate variables on the way to the solution, but this example should give the basic idea.

I must point out that, for this example, I edited two of the str values in variable name in the input data frames. Here's a quick explanation:

  • Blue and blu: These are not treated as a cluster and thus won't get merged together. The reason for this is because the approximate string matching used in function n_gram_merge() is only applied to str values that have an identical unigram (unigram values are derived via function get_fingerprint_ngram()). The algorithm only considers a set of strings to be suitable for clustering if their unigram values are identical. Blue and blu return different unigram values, versus the initial strings I used my example (Blue and bluee). Here's a quick demo:
refinr:::get_fingerprint_ngram(c("Blue", "blu"), numgram = 1)
#> "belu" "blu" 
refinr:::get_fingerprint_ngram(c("Blue", "bluee"), numgram = 1)
#> "belu" "belu"
  • yellow-purple and YellowPurple: These two values would be clustered and merged together by function n_gram_merge(), but NOT by function key_collision_merge(). n_gram_merge() does not feature the dict argument, while key_collision_merge() does (this arg allows the user to dictate merge values for specific clusters). You specified in your example that you wanted the merged values to take the form of the names in df1, however for me just running the names from both df's through n_gram_merge(), the "yellow purple" cluster was taking the form of the string from df2:
df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
)

df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "blu",
  "C", "YellowPurple",
  "C", "green"
)

n_gram_merge(c(df1$name, df2$name))
#> [1] "Red"          "violet"       "Blue"         "yellow"       "YellowPurple" "green"       
#> [7] "Red"          "blu"         "YellowPurple" "green"

Each algorithm has it's quirks and pros/cons, unfortunately they aren't perfect....I've tried to strike a balance between merging true-positives and ignoring false-positives. If you run into an example of specific text combinations that seem to "stump" the functions, I encourage you to open issues for those (something similar to issue #6).

Thanks!

@thomasdebeus
Copy link
Author

Wow thanks a lot for your thorough reply! Other than the practical code examples, the theory behind this helps me a lot as well. Thanks again!

@ChrisMuir
Copy link
Owner

No problem, happy to help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants