Gosset / Tricot - Describe. #9442

lilyclements · 2025-02-13T14:12:03Z

lilyclements
Feb 13, 2025
Maintainer

@rdstern here is a full script for the Gosset vignette (1) up to the modelling section. For the first iteration of the menu, I think we would like to be able to run this in R-Instat. I've attached the full script at the end in case you'd rather have it all in one file. Otherwise, here it is broken up, with comments throughout.
I've also amended this to use the scripts Kaue shared with us for the trainings.

Importing and Rearranging the Data

This is all very straightforward and do-able in R-Instat. This rearrangement is relevant for this data type, but not for all tricot data shapes.

Some data formats used in tricot analyses require much more manipulation to get them into a consistent shape. That is not relevant for this vignette, but the shape the data is rearranged to here is one that David suggested, and I strongly agree with it. I'm not sure yet how we want to offer that rearrangement in R-Instat. I should discuss that with David; he might see it as something that happens automatically in the defining part of the dialog (if you define certain columns, the data is clearly in the X format, and so we can rearrange it to the Y format).

# Dialog: Import From Library
utils::data(package="gosset", X=nicabean)

data_book$import_data(data_tables=lapply(X=nicabean, FUN=data.frame))

# Right click menu: Convert Column(s) To Factor
data_book$convert_column_to_type(data_name="trial", col_names="trait", to_type="factor")

# Dialog: Unstack (Pivot Wider)
trial <- data_book$get_data_frame(data_name="trial")
trial_unstacked <- tidyr::pivot_wider(data=trial, names_from=trait, values_from=rank)
data_book$import_data(data_tables=list(trial_unstacked=trial_unstacked))

rm(list=c("trial_unstacked", "trial"))
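To make the rearrangement concrete, here is a minimal sketch on made-up data (the ids, varieties and ranks are invented, not taken from nicabean). It uses base R's stats::reshape() as a stand-in for tidyr::pivot_wider() so that it runs without extra packages; the column names mirror the trial data (id, item, trait, rank).

```r
# Toy long-format data: one row per id x item x trait (values are invented)
trial_long <- data.frame(
  id    = rep(c(1, 1, 1, 2, 2, 2), times = 2),
  item  = rep(c("A", "B", "C", "A", "C", "D"), times = 2),
  trait = rep(c("Vigor", "Yield"), each = 6),
  rank  = c(1, 2, 3, 2, 1, 3,   3, 1, 2, 1, 2, 3)
)

# Unstack: one column per trait, one row per id x item
trial_wide <- stats::reshape(
  trial_long,
  idvar     = c("id", "item"),
  timevar   = "trait",
  v.names   = "rank",
  direction = "wide"
)

# reshape() names the new columns "rank.<trait>"; strip that prefix
names(trial_wide) <- sub("^rank\\.", "", names(trial_wide))
rownames(trial_wide) <- NULL
trial_wide
```

Each farmer (id) keeps three rows, one per variety they tested, and each trait becomes its own column of ranks.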

There is also the option to import from ClimMob using their ClimMobTools package. Kaue has shared this in script 2 of his training materials (02_fetch_and_merge_data.R). I've written up an issue in #9400 on how we can incorporate that, and added amendments now that we have these scripts from Kaue.

Defining the Data

As I said above, some data formats used in tricot analyses require much more manipulation to get them into a consistent shape. If we want that rearrangement to happen at this stage, so that the data ends up in a consistent format, I will get the names of the columns in that shape.

Currently from this vignette and from the second training script, we want to define:

ID - single receiver
Traits - multiple receiver
Variety - single receiver
Longitude - single receiver
Latitude - single receiver
Elevation - single receiver
Planting Date - single receiver

For our Traits, we have a multiple receiver. I discussed this part with David, and we agreed that adding a "select" for these trait columns can be a function that automatically occurs here when defining:

# Column selection subdialog: Created new column selection
data_book$add_column_selection(data_name="trial_unstacked", name="all_trait_vars", column_selection=list(C0=list(operation="base::match", parameters=list(x=c("Vigor","Architecture","ResistanceToPests","ResistanceToDiseases","ToleranceToDrought","Yield","Marketability","Taste","OverallAppreciation")))), and_or="|")

Create Rankings Object

A ranking object is created for the tricot analyses. I can see two main types: ranking and grouped_ranking. This is then used in other dialogs throughout. I see this as a bit like the "Create Survival Object" for survival data. So we have a dialog where you create the rankings (or grouped_rankings) object (and we should add ranking and grouped_ranking as objects in R-Instat, like how "surv" is a survival object).

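# Create Rankings Object:
# %>% comes from magrittr (also attached by dplyr), so one of them must be loaded
library(magrittr)
trial_unstacked <- data_book$get_data_frame("trial_unstacked")
# Retrieve the trait names stored in the "all_trait_vars" column selection
traits <- data_book$get_column_selection(data_name = "trial_unstacked", name = "all_trait_vars") #get_object (all_trial_vars)
traits <- traits$conditions$C0$parameters$x
# Stack the trait columns back into long format: one row per id x item x trait
trial_unstacked <- trial_unstacked %>% tidyr::pivot_longer(cols = dplyr::all_of(traits), names_to = "trait", values_to = "rank")
# Drop any id x trait block that contains a missing rank
trial_unstacked <- trial_unstacked %>% dplyr::group_by(id, trait) %>% dplyr::filter(!anyNA(rank))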

rankings_list <- traits %>%
  purrr::map(~ {
    trial_unstacked %>%
      dplyr::filter(trait == .x) %>%
      gosset::rank_numeric(data = ., 
                           items = "item", 
                           input = "rank", 
                           id = "id", 
                           ascending = TRUE)
  })
names(rankings_list) <- traits
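The group-wise filter in the script above drops a whole id/trait block as soon as one of its ranks is missing, rather than dropping only the row with the NA. Here is a base R sketch of that rule on invented data (stats::ave() stands in for the dplyr group_by/filter pair):

```r
# Invented example: farmer 2 has one missing rank for Yield
ranks_long <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  trait = "Yield",
  item  = c("A", "B", "C", "A", "C", "D"),
  rank  = c(1, 2, 3, 2, NA, 1)
)

# TRUE for every row whose id x trait block contains at least one NA rank
block_has_na <- as.logical(
  ave(is.na(ranks_long$rank), ranks_long$id, ranks_long$trait, FUN = any)
)

# Keep only the blocks with a complete set of ranks
complete_blocks <- ranks_long[!block_has_na, ]
complete_blocks
```

Only farmer 1 survives here, because a partial set of ranks for farmer 2 would not give a usable ranking for that trait.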

I'll put a draft idea for this dialog here:

R-Instat Control	Corresponding Function	Corresponding Parameter	Default
Data Selector	rank_numeric	data	required
ID receiver	rank_numeric	id	NULL
Variety receiver	rank_numeric	items	required
Trait receiver	data_book$get_column_selection	name	name of select for receiver
Ascending checkbox	rank_numeric	ascending	FALSE
Group checkbox	rank_numeric	group	FALSE
Store as DF checkbox			An option, default unchecked, to store as a data frame with two columns: ID and the rankings object (of class rankings). Default name: <data_frame>_<ucrSave_name>
ucrSave

I've added the option to "Store as DF checkbox", since looking at the training materials there is an instance where they View the rankings object as a data frame. It can be accessed as just a column this way when modelling. Otherwise, it can be an object you don't see - like a key or link.

In addition, for "Traits", can we have an option for it to be either a column (e.g., the "Overall" column) or a select object (e.g., the traits select)? It will return a list of rankings objects if it is a select, which we could perhaps save as a "list_rankings" object (or list_grouped_rankings if grouped is checked).
It will return just a single ranking if it is a single column, which we could perhaps save as a "rankings" object (or grouped_rankings if grouped is checked).

I can see four objects being returned:

ranking: single column, grouped unchecked.
grouped_ranking: single column, grouped checked.
list_ranking: select, grouped unchecked.
list_grouped_ranking: select, grouped checked.

(Unless we always return a list, even if it's a list of size one. But what if it's a column? Then it's a single ranking object. Otherwise we have a select, or multiple ranking objects, i.e., our "list".)
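The four cases could be decided by a small helper like the one below. This is only a sketch of the naming rule proposed here: the function name rankings_object_type and its two arguments are hypothetical, and nothing like it exists in R-Instat yet.

```r
# Hypothetical helper: maps the two dialog choices onto the proposed object type.
#   is_select: TRUE if the Traits receiver holds a column selection, FALSE for a single column
#   grouped:   TRUE if the Group checkbox is checked
rankings_object_type <- function(is_select, grouped) {
  base_type <- if (grouped) "grouped_ranking" else "ranking"
  if (is_select) paste0("list_", base_type) else base_type
}

rankings_object_type(is_select = FALSE, grouped = FALSE)
rankings_object_type(is_select = TRUE,  grouped = TRUE)
```

If we decide to always return a list instead, the is_select argument disappears and only the grouped choice matters.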

Prepare Parts

The "trial" data is of course clean, as it is one of their package data sets. I have asked Kaue about the data used in the training (at the very least it would be nice to know what sort of data the participants come with so we can clean it). I know they've previously said that ClimMob is quite good at the cleaning stage.

From the training data scripts I can see a few dialogs which are relevant:

Rename Dialog: They rename variables (e.g., by removing all instances of "registration_" from the start of variable names). This is already easily achievable in R-Instat.
Frequencies Dialog: They look at frequencies of the varieties used. With our proposed data format, the varieties will be as one column so this works great.
Convert: Set variables as different classes - e.g., age to be an integer.
Factors: Levels/Labels: Viewing levels and frequencies in the variety column, and later for the gender column (covariate in modelling)
Filter: Check for unlikely age values (e.g., age == 0, age > 80), and filtering to !na values when modelling.
Replace Values: Replace unlikely age values with NA
Factors: Reorder Levels: Reordering levels to have a different baseline
Merge: Merging data frames
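As a rough sketch of the Convert, Replace Values, Reorder Levels and Filter steps listed above, here is a base R version on invented data. The column names (age, gender) and the values are assumptions based on the list above, not the actual training data.

```r
# Invented participant data; age arrives as text, as it often does after import
participants <- data.frame(
  id     = 1:6,
  age    = c("34", "0", "52", "95", NA, "41"),
  gender = c("F", "M", "F", "F", "M", "M")
)

# Convert: set age to be an integer
participants$age <- as.integer(participants$age)

# Replace Values: unlikely ages (0 or over 80) become NA
unlikely <- !is.na(participants$age) &
  (participants$age == 0 | participants$age > 80)
participants$age[unlikely] <- NA

# Factors: Reorder Levels, so that "M" is the baseline level
participants$gender <- stats::relevel(factor(participants$gender), ref = "M")

# Filter: keep only rows with a non-missing age before modelling
model_data <- participants[!is.na(participants$age), ]
model_data
```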

Other dialogs, for describe, perhaps:

Density Plot, Boxplot

Describe: Correlations

This dialog gives the Kendall correlation between overall appreciation (baseline level) and the other traits in the trial. There is also an additional plot option.

# 1. Kendall correlation between overall appreciation and other traits
baseline <- "OverallAppreciation"    # set OverallAppreciation variable as the baseline variable
baseline_trait <- rankings_list[["OverallAppreciation"]]
kendall_rankings <- rankings_list %>%
  purrr::keep(names(.) != baseline) %>%
  purrr::map_dfr(~ gosset::kendallTau(x = .x, y = baseline_trait), .id = "trait")
kendall_rankings 
# "The Kendall correlation indicates that farmers prioritized the traits yield, taste, and marketability when assessing overall appreciation."

# 2. Distances and the distribution of the Kendall correlation coefficients.
# For that we use the function kendallTau_bootstrap() which resamples the data using a
# bootstrapping approach to draw a uniform distribution in the data.
kendall <- rankings_list %>%
  purrr::keep(names(.) != baseline) %>%
  purrr::map_dfr(~ gosset::kendallTau_bootstrap(x = .x, y = rankings_list[[baseline]], nboot = 50, seed = 1206), .id = "trait") %>%
  tidyr::pivot_longer(cols = dplyr::everything(), names_to = "trait", values_to = "kendallTau")
# not sure how useful visualisation in a table is, but here you go:
kendall

# we can visualise in a plot, which is much more useful presumably!
ggplot2::ggplot(kendall, ggplot2::aes(y = trait, x = kendallTau)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(y = "", x = "Correlation with the 'Overall appreciation'") +
  ggplot2::theme_minimal()

Note that this is from the Gosset vignette. The same correlations are run in the third R script provided by Kaue for the training. I have tried this script with that data, and it gives the same results (I checked because I've slightly amended the R code to fit the tidyverse style).
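For anyone less familiar with the statistic, here is a small base R illustration of Kendall's tau and of bootstrapping it. The rankings are invented and complete, and the resampling is a generic bootstrap; this is not gosset's implementation, which works on rankings objects and may resample differently.

```r
# Invented complete rankings of 10 varieties for three traits
overall <- 1:10
yield   <- c(2, 1, 3, 4, 6, 5, 7, 8, 10, 9)  # three adjacent swaps vs overall
vigor   <- 10:1                              # exactly the reverse of overall

# Kendall's tau = (concordant pairs - discordant pairs) / total pairs
tau_yield <- cor(overall, yield, method = "kendall")  # 3 of 45 pairs discordant
tau_vigor <- cor(overall, vigor, method = "kendall")  # every pair discordant

# Generic bootstrap: resample varieties with replacement, recompute tau each time
set.seed(1206)
boot_tau <- replicate(50, {
  idx <- sample(seq_along(overall), replace = TRUE)
  cor(overall[idx], yield[idx], method = "kendall")
})
summary(boot_tau)
```

The spread of boot_tau is what the boxplot shows for each trait: a trait whose box sits close to 1 is ranked almost the same way as overall appreciation.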

To add this into R-Instat, could we amend the Correlations dialog to have a third tab? This new tab gives the correlation of a rankings object:

We add a "Rankings" tab to the correlations dialog
We have two receivers: "Baseline Trait" which takes a column from the data (e.g., "OverallAppreciation") that we compare against, and "Rankings Object" which takes a rankings object
There is a checkbox to return our bootstrapping graph, i.e. the kendallTau_bootstrap graphic. It should have a much better name than just "Give graph"!
In terms of fitting this into the current Correlations dialog, we can reuse some of the options under "Display Options", except we can't have method (Rearrange checkbox), and it doesn't make sense to have "Display on Diagonal" since there's no diagonal here.

corrr::fashion(kendall_rankings, decimals = 2, leading_zeros = FALSE, na_print = "")

The dialog runs the kendallTau function and returns a table.
I also suggest we have additional options, like to remove the p-values by default. Maybe remove N_effective too? "Effective N, which is the equivalent N needed if all items were compared to all items"

kendall_rankings %>% dplyr::select(-c(`Pr(>|z|)`, `Zvalue`))

Side notes:

In the package, there is also kendallTau_permute which isn't covered in the vignette. I get an error with this stating that it is not supported on Windows. I will check with Kaue at a later date on this, but I assume this is not used. Update: I cannot see it in the training scripts shared by Kaue.
For handling missing options: We can still have the Missing options here, and just read that into gosset::kendallTau. Except I'm not sure if there can be missing things in a rankings object! I should talk to Kaue and explore that a bit more.

EDIT: This next bit is for modelling. I will move this over to modelling when I've opened a discussion on it!

Describe: Performance of Varieties across Traits

This is a function to visualise the performance of the different varieties across the multiple traits. The values represented in a worth map are log-worth estimates. In this, we fit a model to each of our traits, giving a list of models. But we don't use the models or look at them - like in an ANOVA how we make a linear model, but look at the descriptive side of it in the Describe menu.

This outputs a graph where we can visualise how much each variety impacts that trait's ranking of "1-2-3".

## Performance of Varieties across Traits
# Fit a model to see which Varieties are affecting that Trait 
mod <- rankings_list %>% purrr::map(PlackettLuce::PlackettLuce)

# We can then visualise how much each variety impacts that trait's ranking of 1-2-3. So if it is likely to give a lower rank, it is a darker brown colour, and if it is likely to give a higher rank it is a bluer colour.
gosset::worth_map(mod,
                  labels = traits,
                  labels.order = rev(traits)) +
  ggplot2::labs(x = "Variety",
                y = "Trait")
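To give a feel for what a log-worth means, here is a tiny base R sketch with invented estimates. Under a Plackett-Luce model, exponentiating the log-worths and normalising them gives each variety's probability of being ranked first among the set. The numbers below are made up, and worth_map may centre or scale its values differently.

```r
# Invented log-worth estimates for three varieties (not from the nicabean models)
log_worth <- c(A = 0.0, B = 0.7, C = -0.4)

# Plackett-Luce: probability of being ranked first = exp(log-worth) / sum of exp(log-worth)
worth <- exp(log_worth) / sum(exp(log_worth))
round(worth, 3)

# Variety with the highest chance of being ranked first
names(which.max(worth))
```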

If it is likely to give a lower rank, it is a darker brown colour, and if it is likely to give a higher rank it is a bluer colour. For example, if we run percentages, look at the trial_by_trait_item_rank data frame, and look just at OverallAppreciation, we can see that "SX" tends to be ranked 1 out of its three rankings (and hence is a dark blue in the plot), and INT F is ranked 1 the fewest times (and hence is a dark brown).

data_book$calculate_summary(data_name="trial", store_results=TRUE, return_output=FALSE, factors=c("trait","item","rank"), drop=FALSE, j=1, summaries=c("summary_count_all"), silent=TRUE, percentage_type="factors", perc_total_factors=c("item","trait"))
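The kind of table this produces can be sketched in base R on invented data. I am assuming here that the percentages are taken within each variety (so each variety's row sums to 100); the varieties and ranks below are made up.

```r
# Invented ranks for one trait: four farmers ranked each of three varieties
overall <- data.frame(
  item = rep(c("A", "B", "C"), each = 4),
  rank = c(1, 1, 1, 2,   2, 3, 3, 3,   3, 2, 2, 1)
)

# Count how often each variety received each rank
counts <- table(overall$item, overall$rank)

# Percentage of each rank within a variety (rows sum to 100)
pct <- round(100 * prop.table(counts, margin = 1), 1)
pct
```

Variety A is ranked 1 in 75% of its appearances, so it would sit at the blue end of the worth map, while B is ranked 3 most of the time and would sit at the brown end.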

I'm not sure how best to fit this in yet. It's just a simple graph, but not correlations. It just helps show the relationship between the different varieties and traits. I haven't seen anything similar in other vignettes that it could tie in with. Could it be an option in a correlation dialog or a describe/summarise dialog? We could also offer other tabular statistics alongside. Let me know what you think.

I will look into how this fits into the Training scripts. That might provide more insights into how and where this can fit in! Especially as I can see some other functions that they look at alongside it -- it might be that it is appropriate in the modelling side.

Full Script (Gosset Vignette, for the R Code run in here)

vignette_1_start.zip
