# Critical Appraisal
```{block, type='rmdcomment'}
In this chapter, we will discuss how to critically appraise an article that analyzes data using machine learning methods.
```
- Watch the video describing this chapter [![video](https://img.youtube.com/vi/Y4q2kKNcOSo/maxresdefault.jpg)](https://youtu.be/Y4q2kKNcOSo)
```{r setup11, include=FALSE}
require(Publish)
```
```{r, echo=FALSE}
tweetrmd::tweet_embed("1401588817694392325")
```
```{block, type='rmdcomment'}
Given the popularity of ML and AI methods, it is important to be able to critically appraise a paper that reports the results of ML / AI analyses.
```
## Existing guidelines
@liu2020reporting and @rivera2020guidelines provided the CONSORT-AI and SPIRIT-AI Extensions. @vinny2021critical reported a number of ways to critically appraise an article that analyzed data using an ML approach.
## Key considerations
Below is a summary of the general considerations in the critical appraisal
of an ML research article.
| Issues to consider | Details |
|---|---|
| Clinical utility | Was an explanation or rationale given for why these prediction models are being built or developed? Was the study aim clear? Is it prediction of an outcome, or identification of important features? |
| Data source and study description | How were the data collected? What was the study design: RCT, observational, cross-sectional, longitudinal, nationally representative survey? Were study start and end dates reported? What was the baseline? Are the data routinely measurable in a clinical setting, or are they measured irregularly? |
| Target population | Was it clear in which target population the model was developed, and to which populations it can be generalized? |
| Analytic data | How were the data pre-processed? Were inclusion and exclusion criteria implemented so as to properly target the intended population? Were clinicians consulted about the appropriateness of the inclusion and exclusion criteria? Was a protocol published a priori? |
| Data dimension and split ratio | What were the total data size, the analytic data size, and the training, tuning, and test data sizes? (See the data-splitting sketch after this table.) |
| Outcome label | How was the gold standard determined, and what was its quality? Is the prediction of this outcome clinically relevant? |
| Features | How many covariates or features were used? How were these variables selected? Were subject-area experts consulted in the selection and identification of some or all of these variables? Were any of these variables transformed, dichotomized, categorized, or combined? Was a table of baseline characteristics of the subjects, stratified by the outcome labels, presented? |
| Missing data | Was the amount of missing observations reported? Was there any explanation of why they were missing? How were the missing values handled: complete-case analysis or multiple imputation? (See the imputation sketch after this table.) |
| ML model choice | Was a rationale given for the ML model choice (logistic, LASSO, CART or extensions, ensemble, or others)? Was the model specification reported? Is the model additive and linear, or not? Was the amount of data adequate given the model complexity (number of parameters)? |
| ML model details | Were details about the ML model and its implementation reported? Was the model fine-tuned or otherwise customized? Were the hyperparameters provided? |
| Optimism or overfitting | What method was used to address these issues? What measures of performance were used? Was there a performance gap between the tuned model and the internally validated model? Was model performance reasonable, or unrealistically good? (See the cross-validation sketch after this table.) |
| Generalizability | Were external validation data present? Was the model tested in a real-world clinical setting? |
| Reproducibility | Is the work repeatable and reproducible? This can be assessed at three levels, (i) model, (ii) code, and (iii) data, or their combinations. Was software code provided? Which software and version were used? Was the computing time reported? |
| Interpretability | Were clinicians consulted? Were the results interpreted in collaboration with clinicians and subject-area experts? Are the model results believable and interpretable? |
| Subgroup | Were clinically important subgroups considered? |
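To make the split-ratio question concrete, below is a minimal sketch of a 70/15/15 train/tune/test split in base R. The data frame `dat`, the proportions, and the seed are hypothetical placeholders, not taken from any article discussed in this chapter.
```{r split11, eval=FALSE}
# Hypothetical 70/15/15 split of a data frame `dat` into
# training, tuning, and test sets (proportions illustrative only).
set.seed(123)                 # fix the seed for reproducibility
n   <- nrow(dat)
idx <- sample(seq_len(n))     # shuffle the row indices
train <- dat[idx[seq_len(floor(0.70 * n))], ]
tune  <- dat[idx[(floor(0.70 * n) + 1):floor(0.85 * n)], ]
test  <- dat[idx[(floor(0.85 * n) + 1):n], ]
```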
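For the missing-data row, the sketch below contrasts a complete-case analysis with multiple imputation using the `mice` package; the data frame `dat` and the formula `y ~ x1 + x2` are again hypothetical placeholders.
```{r missing11, eval=FALSE}
library(mice)
# Complete-case analysis: keep only rows with no missing values.
cc.fit <- glm(y ~ x1 + x2, family = binomial,
              data = dat[complete.cases(dat), ])
# Multiple imputation: create 5 imputed datasets, fit the same
# model in each, and pool the estimates using Rubin's rules.
imp    <- mice(dat, m = 5, printFlag = FALSE)
mi.fit <- with(imp, glm(y ~ x1 + x2, family = binomial))
summary(pool(mi.fit))
```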
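For the optimism row, one common internal-validation strategy is k-fold cross-validation. The sketch below uses the `caret` package with the hypothetical `train` and `test` sets from the splitting sketch above; the learner (`glmnet`) and the outcome `y` are illustrative assumptions.
```{r optimism11, eval=FALSE}
library(caret)
set.seed(123)
# 10-fold cross-validation during tuning helps quantify optimism.
cv.fit <- train(y ~ ., data = train, method = "glmnet",
                trControl = trainControl(method = "cv", number = 10))
cv.fit$results                      # cross-validated performance
# Compare against the held-out test set; a large gap between the
# two suggests an overfitted (optimistic) model.
pred <- predict(cv.fit, newdata = test)
postResample(pred, test$y)
```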
## Example
1. Download the article by @antman2000timi ([link](https://jamanetwork.com/journals/jama/articlepdf/192996/JOC00458.pdf)). Try to identify how many of the key considerations above were reported in the process of developing the risk score.
2. [OpenSAFELY article](https://www.nature.com/articles/s41586-020-2521-4) in Nature: let's discuss the research goal.
```{r, echo=FALSE}
tweetrmd::tweet_embed("1315757533643051010")
```
## Exercise
Find an article in the medical literature (published in a peer-reviewed journal; it could be related to an area that you work in or are interested in) that used a machine learning method to build a clinical prediction model (here is an [example](https://www.google.com/search?q=machine+learning+clinical+prediction+model+tuberculosis)). Critically appraise that article.