index.Rmd

---
title: '**Unveiling the Structure of Heart Rate Variability (HRV) Indices: A Data-driven Meta-clustering Approach**'
subtitle: "Supplementary Materials"
output:
  bookdown::html_document2:
    self_contained: false
    theme: cerulean
    highlight: pygments
    toc: yes
    toc_depth: 3
    toc_float: yes
    number_sections: no
    df_print: default
    code_folding: show
    code_download: yes
  word_document:
    reference_docx: utils/Template_Word.docx
    highlight: pygments
    toc: no
    toc_depth: 3
    df_print: default
    number_sections: yes
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 3
  html_document:
    toc: yes
    toc_depth: '2'
    latex_engine: xelatex
editor_options:
  chunk_output_type: console
bibliography: references.bib
csl: utils/apa.csl
---


<!-- 
!!!! IMPORTANT: run `source("utils/render.R")` to publish instead of clicking on 'Knit'
-->

```{r setup, warning=FALSE, message=TRUE, include=FALSE}
fast <- FALSE  # Make this false to skip the chunks

# Set up the environment (or use local alternative `source("utils/config.R")`)
source("https://raw.githubusercontent.com/RealityBending/TemplateResults/main/utils/config.R")  

# Set theme
ggplot2::theme_set(see::theme_modern())

source_rmd <- function(x){
  file <- knitr::purl(x, quiet=TRUE)
  source(file)
  if (file.exists(file)) file.remove(file)
}
```

# Introduction
```{r badges, echo=FALSE, message=TRUE, warning=FALSE, results='asis'}
# This chunk is a bit complex so don't worry about it: it's made to add badges to the HTML versions
# NOTE: You have to replace the links accordingly to have working "buttons" on the HTML versions
if (!knitr::is_latex_output() && knitr::is_html_output()) {
  cat("[![Website](https://img.shields.io/badge/repo-Readme-2196F3)](https://github.com/Tam-Pham/HRVStructure)")
}
```

<!-- ABSTRACT -->

Heart Rate Variability (HRV) can be estimated using a myriad of mathematical indices, but the lack of systematic comparison between these indices renders the interpretation and evaluation of results tedious. In this study, we assessed the relationship between 57 HRV metrics collected from 302 human recordings using a variety of structure-analysis algorithms. We then applied a meta-clustering approach that combines their results to obtain a robust and reliable view of the observed relationships. We found that HRV metrics can be clustered into 3 groups, representing the distribution-related features, harmony-related features and frequency/complexity features. From there, we described and discussed their associations, and derived recommendations on which indices to prioritize for parsimonious, yet comprehensive HRV-related data analysis and reporting.

<!-- introduce hrv -->

Heart Rate Variability (HRV), reflecting the heart's ability to effectively regulate and adapt to internal and external environmental changes, has been linked with many physical and mental health outcomes [e.g., cardiac complications, @laitio2007role, diabetes, @kudat2006heart, mood disorders, @bassett2016literature, cognitive functioning, @forte2019heart]. Conventionally, the various indices used in the assessment of HRV were broadly categorized based on the nature of their mathematical approach, in categories including *time-domain*, *frequency-domain*, and *nonlinear dynamics*.


<!-- main categories -->

Time-domain analysis presents the simplest and most straightforward method of quantifying HRV from the original normal (i.e., excluding abnormal beats such as ectopic beats) heartbeat intervals or NN intervals. Some commonly derived indices include *SDNN*, the standard deviation of all NN intervals, *RMSSD*, the root mean square of the sum of successive differences of NN intervals, and *pNN50*, the percentage of adjacent NN intervals separated by more than 50ms. While time-domain methods offers computational ease, they are unable to distinguish between the contributions of sympathetic and parasympathetic branches. Frequency-domain analysis, on the other hand, targets the assessment of these different regulatory mechanisms by investigating how the HRV power spectrum distributes across different frequency bands (e.g, low frequency, *LF* or high frequency, *HF*). Other indices that fall under the frequency domain include derivatives of the aforementioned components, such as the ratio of *LF* to *HF* (*LF/HF*) power and their normalized (e.g., *LFn*, *HFn*) and natural logarithmic variants (e.g., *LnHF*). Finally, drawn from concepts of non-linear dynamics and chaos theory [@golberger1996non], non-linear analysis was later introduced to better characterize the complex physiological mechanisms underlying HRV regulation. Prominent indices include measures obtained from a Poincaré plot where an ellipse is fitted to a scatterplot of each NN interval against its preceding one [e.g., the standard deviation of the short-term, *SD1* and long-term, *SD2* NN interval variability, as well as its corresponding ratio, *SD1/SD2*, @brennan2001existing]. Other non-linear indices that fall under this category, such as Detrended Fluctuation Analysis (*DFA*), multi-fractal *DFA* (*MF-DFA*) and correlation dimension (*CD*), account for the fractal properties of HRV, while entropy measures like approximate entropy (*ApEn*), sample entropy (*SampEn*), and multiscale entropy (*MSE*) quantify the amount of regularity in the HR time series [@voss2009methods]. For a more comprehensive description of all HRV indices, see @pham2021heart.


<!-- overlap and duplicates in indices, aim of study -->

Despite the rising popularity of HRV analysis as a real-time, noninvasive technique for investigating health and disease, there are some shared similarities (and even overlaps) between the multitude of HRV indices that are not yet well understood. Early studies have investigated the relationships between time-domain and frequency-domain indices, showing that not only were *RMSSD* and *pNN50* strongly correlated with each other [above 0.9, @bigger1989comparison], they were also highly associated with *HF* power [@bigger1989comparison; @kleiger2005heart; @otzenberger1998dynamic], suggesting that these measures could be treated as surrogates for each other in assessing the parasympathetic modulation of HRV. This observation is warranted given that the former is computed from the differences across consecutive NN intervals, and hence, they reflect mainly high frequency oscillatory patterns in HR and are independent of long-term changes. On the other hand, *SDNN*, which has been thought to reflect both sympathetic and parasympathetic activity, is correlated to total power (*TP*) in HRV power spectrum [@bigger1989comparison]. Recent years also witnessed the emergence of debates regarding the traditional conceptualization of *SD1* and *SD2* as non-linear indices, particularly when @ciccone2017reminder proposed that *RMSSD* and *SD1* were mathematically equivalent. Many studies that report both of these short-term HRV indices independently often arrive at identical statistical results without addressing this equivalence [e.g., @rossi2015impact, @peng2015extraction, @leite2015correlation]. Additionally, other studies have also drawn similarities between *SD1/SD2* and *LF/HF* in their indexing of the balance between short- and long-term HRV [@brennan2002poincare; @guzik2007correlations]. 

Overall, there is a need for a greater data-driven understanding of the relationship between the multitude of HRV indices and their respective groupings. While there exist different approaches to assign data to different groups based on their level of associations [see @nguyen2019improving], there is no gold standard or clear guidelines to determine the most appropriate method for grouping of physiological indices. As such, choosing one method and presenting its solution as a definitive one can be uninformative or even misleading. The aim of this study is to explore the structure of HRV indices by using a consensus-based methodology [@kuncheva2014combining; @monti2003consensus; @bhattacharjee2001classification], hereafter referred to as *meta-clustering*, where the results of multiple structure analysis approaches are systematically combined to highlight the most robust associations between the indices. 


# Methods

In total, the electrocardiogram (ECG) data of 302 participants were extracted from 6 databases. The raw data as well as the entire analysis script (including details and additional analyses) can be found at this GitHub repository (<https://github.com/Tam-Pham/HRVStructure>).

## Databases


The Glasgow University Database (GUDB) database [@howell2018high] contains ECG recordings from 25 healthy participants (\> 18 years old) performing five different two-minute tasks (sitting, doing a maths test on a tablet, walking on a treadmill, running on a treadmill, using a handbike). All recordings were sampled at 250 Hz.


The MIT-BIH Arrhythmia Database (MIT-Arrhythmia and MIT-Arrhythmia-x) database [@moody2001impact] contains 48 ECG recordings (25 men, 32-89 years old; 22 women, 23-89 years old) from a mixed population of patients. All recordings were sampled at 360 Hz and lasted for 30 minutes.

The Fantasia database [@iyengar1996age] contains ECG recordings from 20 young (21-34 years old) and 20 elderly (68-85 years old) healthy participants. All participants remained in a resting state in sinus rhythm while watching the movie Fantasia (Disney, 1940) that helped to maintain wakefulness. All recordings were sampled at 250 Hz and lasted for 120 minutes.


The MIT-BIH Normal Sinus Rhythm Database (MIT-Normal) database [@goldberger2000physiobank] contains long-term ($\approx$. 24h) ECG recordings from 18 participants (5 men, 26-45 years old; 13 women, 20-50 years old). All recordings were sampled at 128 Hz and due to memory limits, we kept only the second and third hours of each recording (with the loose assumption that the first hour might be less representative of the rest of the recording and a duration of 120 minutes to match those from Fantasia database).

The MIT-BIH Long-term ECG Database (MIT-Long-term) database [@goldberger2000physiobank] contains long-term (14 to 22 hours each) ECG recordings from 7 participants (6 men, 46-88 years old; 1 woman, 71 years old). All recordings were sampled at 128 Hz and due to memory limits, we kept only the second and third hours of each recording.

The last dataset came from resting-state <https://github.com/neuropsychology/RestingState> recordings of authors' other empirical studies. This dataset contains ECG recordings sampled at 4000 Hz from 43 healthy participants (\> 18 years old) that underwent 8 minutes of eyes-closed, seated position, resting state.

## Data Processing

The default processing pipeline of NeuroKit2 [@makowski2021neurokit2] was used for cleaning and processing of raw ECG recordings as well as for the computation of all HRV indices. Note that except for the resting-state dataset, data from online database was not in the form of ECG recordings but sample locations of annotated heartbeats (R-peaks).

## Data Analysis

One of the core "issues" of statistical clustering is that, in many cases, different methods will give different results. The **meta-clustering** approach proposed by *easystats* [that finds echoes in *consensus clustering*; see @monti2003consensus] consists of treating the unique clustering solutions as a ensemble, from which we can derive a probability matrix. This matrix contains, for each pair of observations, the probability of being in the same cluster. For instance, if the 6th and the 9th row of a dataframe has been assigned to a similar cluster by 5 out of 10 clustering methods, then its probability of being grouped together is 0.5. Essentially, *meta-clustering* approach is based on the hypothesis that, as each clustering algorithm embodies a different prism by which it sees the data, running an infinite amount of algorithms would result in the emergence of the "true" clusters. As the number of algorithms and parameters is finite, the probabilistic perspective is a useful proxy. This method is interesting where there is no obvious reasons to prefer one over another clustering method, as well as to investigate how robust some clusters are under different algorithms.

In this analysis, we first correlated the HRV indices [using the `correlations` function from the *correlation* package, see @makowski2020methods] and removed indices that were perfectly correlated with each other, before removing outliers based on the median absolute deviation from the median [see the `check_outliers` function in the *performance* R package; @ludecke2019performance]. Following the *meta-clustering* approach, multiple clustering methods were used to analyze the associations between the HRV indices, namely, factor analysis, k-means clustering, k-meloids clustering, hierarchical clustering, density-based spatial clustering of applications with noise (DBSCAN), hierarchical density-based spatial clustering of applications with noise (HBSCAN), mixture model approach, and exploratory graph analysis (EGA). We then jointly considered the results obtained from these multiple methods by estimating the probability for each pair of indices to be clustered together. Ultimately, the analysis aimed to to provide a more robust overview of the clustering results and facilitate a more thorough conceptual understanding of these patterns.

Data processing was carried out with R [@R-base] and the *easystats* ecosystem [@ludecke2019insight; @makowski2019bayestestr]. The raw data, as well as the full reproducible analysis script, including additional description of each approach and the solutions of each individual clustering method, are available at this GitHub repository (<https://github.com/Tam-Pham/HRVStructure>).

# Results

```{r, message=FALSE, warning=FALSE, results='hide'}
library(tidyverse)
library(easystats)

data <- read.csv("data/data_combined.csv", stringsAsFactors = FALSE) %>% 
  select(-HRV_ULF, -HRV_VLF) %>%  # insufficient recording length to compute
  select(-HRV_SDANN1, -HRV_SDANN2, -HRV_SDANN5, 
         -HRV_SDNNI1, -HRV_SDNNI2, -HRV_SDNNI5) %>% # insufficient recording length to compute
  setNames(stringr::str_remove(names(.), "HRV_")) %>% 
  mutate_all(function(x) {
    x[is.infinite(x)] <- NA
    return(x) })
```


```{r warning=FALSE, message=TRUE, results='asis'}
cat("This study includes", 
    nrow(data), 
    "participants from", 
    length(unique(data$Database)), 
    "databases.")
```


```{r convenience_chunk_for_local_running, include = FALSE, eval = FALSE}
library(patchwork)
library(tidyverse)
library(easystats)
library(GPArotation)


# To run things locally: CTRL + ALT + SHIFT + P, and then the next two lines
source_rmd("0_ConvenienceFunctions.Rmd")
source_rmd("1_Preprocessing.Rmd")
source_rmd("3_Dimensions.Rmd")
source_rmd("4_Clustering.Rmd")
source_rmd("5_Network.Rmd")
source_rmd("6_Summary.Rmd")
```


## Preprocessing

```{r child='0_ConvenienceFunctions.Rmd'}
```


```{r child='1_Preprocessing.Rmd'}
```


## Dimensional Structure

```{r child='3_Dimensions.Rmd'}
```

## Cluster Structure

```{r child='4_Clustering.Rmd'}
```

## Network Structure

```{r child='5_Network.Rmd'}
```

## Summary

```{r child='6_Summary.Rmd'}
```