Modelling and data in Tricot #9475
Replies: 3 comments 1 reply
-
Now to compare our three models:
In addition:
-
I am working on a summary of tricot models here.
-
@lilyclements I have commented on the script and data in the current Day 3 script.
Now to the data. I suggest we include these data in our teaching, starting from Day 1! So I wonder how we can do that most easily. The ideal would be if they added the data into one of their packages as another dataset. Ideally it would produce all 4 (or 5) data frames described below.

The first dataset, called trial, has 4 variables and 15,039 rows. It is a nice example of very clean data, and I made all 4 variables into factors. But it could load as in the script, so that would be the first step. @lilyclements I realise I don't yet understand the data. The measurement is the ranks: 1, 2, and 3. What's the "winner"? Is it 1, as the first rank, or 3, as the highest number?

The second is called covar. It has 11 variables and 557 rows. This provides potential covariates for the ranks in the trial data.

There are 10 varieties, and the next dataset provides variety-level data. (I don't know from where, but it is very nice to have this level of information. I hope the background can be included in the documentation about these data.)

The data frame called dat is just a pivot-wider of the trial data. (We don't need that, as it is easy to produce.)

Next is a dataset at the trait level called kendall-rankings. I don't know why the overall results are absent, and I am not clear what we learn from these data. (I assume this is produced from their analyses, so we would not include it in the data.)

Finally, so far, we get to the climatic data, and they are in an odd shape. There are 577 rows - presumably one for each of the trials. There are 581 columns, one for each day of the record, to span the planting dates of each trial. I assume the rows are in the same order as the trials, but I would have much welcomed an ID variable as an indication of good practice. I was relieved that we will be handling the climatic covariates differently. However, it is probably worth reshaping these data and including the resulting data frame.

I suggest there are just 77 different pixels of data, i.e. on average several trials make use of the same climatic data, and I suggest we might include this information in our version of the climatic dataset. They won't necessarily have the same climatic summary, because their planting dates could be different, but we should still include that variable for completeness. I wonder how it relates to the 6 different trials? I have now included a reshaped chirps dataframe, together with the additional pixel variable. I am assuming we would not have both shapes of climatic data.
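The reshaping and pixel-grouping described above could be sketched in R as follows. This is only an illustration: the object name `chirps` and the column names `id`, `pixel`, `day`, and `rain` are assumptions, not names from the shared scripts.

```r
library(dplyr)
library(tidyr)

# Sketch, assuming `chirps` is the 577 x 581 wide data frame of daily rainfall.
# Trials falling in the same pixel have identical rainfall series, so a pixel
# ID can be derived by matching rows against the unique rows.
key <- apply(chirps, 1, paste, collapse = "|")
chirps$pixel <- match(key, unique(key))

chirps_long <- chirps |>
  mutate(id = row_number()) |>   # assumes rows follow the trial order
  pivot_longer(-c(id, pixel), names_to = "day", values_to = "rain")
```

The long shape gives one row per trial per day, which is usually easier to summarise over each trial's growing window than the wide matrix.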
-
I can see modelling in Scripts 2 and 3 shared by Kaue.
From Script 2:
Model 1. Plackett-Luce Model
Fitting Plackett-Luce models to a list of rankings (`rankings_list`). This means we are fitting a separate Plackett-Luce model for each ranking in the list. These models estimate a worth parameter for each item being ranked.
We look at `summary`, `coef`, `qvcalc(itempar(.x))`, `reliability`, `worth_map`, and `plot_logworth`.
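A minimal sketch of this step with the PlackettLuce and qvcalc packages; the structure of `rankings_list` (a named list of rankings objects, one per trait) is assumed from the script:

```r
library(PlackettLuce)
library(qvcalc)

# Fit one Plackett-Luce model per set of rankings (e.g. one per trait)
mods <- lapply(rankings_list, PlackettLuce)

summary(mods[["overall"]])            # worth estimates on the log scale
coef(mods[["overall"]], log = FALSE)  # worth parameters on the untransformed scale
qvcalc(itempar(mods[["overall"]]))    # quasi-variances for pairwise comparisons
```

(`reliability`, `worth_map`, and `plot_logworth` come from the gosset package and take fitted models like these as input.)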
Model 2: Tree-Based Model for Rankings & Chocolate Consumption
We look at `plot`, `top_items`, `node_rules`, and `regret`.
This shows the decision tree structure, top items in each split, and the rules defining the tree splits.
regret?
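A sketch of such a tree with `pltree()` from PlackettLuce. The rankings object `R`, the use of `covar` as the covariate data, and the control settings are assumptions for illustration, not the script's actual call:

```r
library(PlackettLuce)

# Group the rankings so each group matches one row of covariates in `covar`
G <- group(R, index = seq_len(nrow(covar)))

# Model-based recursive partitioning: split on covariates where the
# Plackett-Luce worth estimates differ significantly
tree <- pltree(G ~ ., data = covar, minsize = 30, alpha = 0.05)
plot(tree)  # tree structure with item worths in the terminal nodes
```

Plotting the fitted tree shows the splits and the estimated worths within each terminal node, which is what the `plot`/`node_rules` output summarises.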
Models 3 and 4
Model 3: PLADMM Model Using Overall Rankings (`rankings_list[["overall"]]`). We look at the `summary`. We compare to our `overall` model from Model 1 to check differences in parameter estimates between PLADMM and PL.

Model 4: PLADMM Model Using Trait Features (`rankings_list[["overall"]]`).
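The PLADMM fit might look like the sketch below, using `pladmm()` from PlackettLuce. The data frame `features` and the columns `trait1` and `trait2` are hypothetical stand-ins for the trait features in Script 3:

```r
library(PlackettLuce)

# Hypothetical `features`: one row per variety, columns of trait features.
# PLADMM expresses the log-worths as a linear function of these features.
fit <- pladmm(rankings_list[["overall"]], ~ trait1 + trait2, data = features)
summary(fit)

# Standard PL fit on the same rankings, for the comparison mentioned above
coef(PlackettLuce(rankings_list[["overall"]]))
```

Comparing the two sets of estimates shows how much is lost (or gained) by constraining the worths through the trait features.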