Deriving Pseudo-HRV Features from Intermittent Heart Rate Samples: Bridging the Gap Between Smartwatches and Clinical ECG
Abstract
Heart rate variability (HRV) is a well-established biomarker for autonomic nervous system activity, stress, and emotional regulation. Clinical HRV analysis requires continuous electrocardiogram (ECG) recordings with beat-to-beat (R-R interval) precision—a requirement that off-the-shelf smartwatches fundamentally cannot meet. Consumer wearables such as Garmin devices provide photoplethysmography (PPG)-derived heart rate samples at intervals of 1–5 seconds, yielding an intermittent, low-resolution signal. This article presents a systematic methodology for deriving pseudo-HRV features from these sparse heart rate streams, evaluates their correspondence with ECG-derived HRV metrics, and discusses the practical implications for stress estimation research. We introduce time-domain approximations (pseudo-RMSSD, pseudo-SDNN, HR slope variability), frequency-domain surrogates via Lomb-Scargle periodograms, and nonlinear entropy measures adapted for irregular sampling. A Python-based pipeline illustrates each step, from artifact rejection and interpolation through feature extraction and validation against ECG ground truth using Bland-Altman analysis and intraclass correlation coefficients.
1. Introduction
1.1 The HRV Gold Standard
Heart rate variability—the variation in time intervals between consecutive heartbeats—is one of the most widely used non-invasive markers of autonomic function (Task Force of the European Society of Cardiology, 1996). In stress research, decreased HRV (particularly reduced parasympathetic indices such as RMSSD and high-frequency power) reliably tracks increased sympathetic activation and psychological stress (Thayer et al., 2012). The gold standard for HRV measurement is the clinical ECG, which captures the electrical activity of the heart at sampling rates of 250–1000 Hz, enabling precise detection of R-peaks and calculation of R-R intervals (also called inter-beat intervals, or IBIs) with millisecond accuracy.
1.2 The Smartwatch Reality
Consumer smartwatches use optical photoplethysmography (PPG) sensors to estimate heart rate. While PPG can theoretically capture pulse-to-pulse intervals (pulse rate variability, or PRV), most off-the-shelf devices do not expose raw PPG waveforms or beat-to-beat intervals to developers. Instead, they report aggregated heart rate values at intervals determined by firmware—typically every 1 to 5 seconds for Garmin devices in active recording mode, and as infrequently as every 15 minutes during passive background monitoring.
This creates a fundamental mismatch: clinical HRV requires ~1000 Hz ECG data yielding individual beat timings, while smartwatches provide ~0.2–1 Hz heart rate estimates that average over multiple beats.
1.3 Why “Pseudo-HRV”?
We use the term pseudo-HRV to distinguish features derived from intermittent HR samples from true HRV computed from beat-to-beat intervals. Pseudo-HRV features are surrogate metrics designed to capture the same underlying autonomic dynamics but computed from a fundamentally different (and lower-quality) input signal. The central question of this article is: can these surrogates provide meaningful stress-related information despite their limitations?
1.4 Scope and Contribution
This article contributes:
- A formal taxonomy of pseudo-HRV features computable from 1–5 second HR samples
- Python implementations for each feature category
- A discussion of interpolation, resampling, and artifact handling strategies
- A validation framework comparing pseudo-HRV against ECG-derived HRV
- Practical guidelines for when pseudo-HRV is (and is not) appropriate
2. From R-R Intervals to Heart Rate Samples: What Is Lost?
2.1 Information Content Comparison
To understand what pseudo-HRV can and cannot capture, we must first understand what information is discarded when moving from ECG to aggregated HR.
| Property | ECG R-R Intervals | Smartwatch HR Samples |
|---|---|---|
| Temporal resolution | ~1 ms (beat-to-beat) | 1,000–5,000 ms (sample interval) |
| Signal origin | Electrical (QRS complex) | Optical (PPG blood volume) |
| Individual beat timing | Yes | No (averaged) |
| Respiratory sinus arrhythmia | Fully captured | Partially captured (aliased) |
| Ectopic beat detection | Possible | Not possible |
| Frequency content preserved | Up to ~0.5 Hz | Up to ~0.1–0.5 Hz (Nyquist) |
| Typical recording context | Clinical, controlled | Ambulatory, daily life |
Key losses:
- Beat-to-beat resolution: Each HR sample aggregates multiple beats, smoothing out rapid fluctuations
- High-frequency HRV: With 1 Hz sampling, the Nyquist frequency is 0.5 Hz—just barely covering the HF band (0.15–0.4 Hz). At 0.2 Hz sampling (5-second intervals), HF content is entirely aliased (the simulation after this list illustrates the attenuation)
- Ectopic beat information: Premature beats and artifacts visible in ECG are invisible in aggregated HR
- Phase information: The precise timing of autonomic modulation within each respiratory cycle is lost
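These losses can be made concrete with a short simulation. The sketch below (all parameters illustrative) generates a synthetic IBI series with a 0.25 Hz respiratory modulation, aggregates it into 1-second and 5-second averaged HR samples, and compares pseudo-RMSSD against the true beat-to-beat RMSSD:

import numpy as np

# Illustrative simulation of how beat averaging attenuates fast variability.
rng = np.random.default_rng(42)
n_beats = 600
mean_ibi = 0.8  # seconds (~75 bpm)
approx_times = np.arange(n_beats) * mean_ibi
hf_mod = 0.05 * np.sin(2 * np.pi * 0.25 * approx_times)  # respiratory component
ibi = mean_ibi + hf_mod + rng.normal(0, 0.01, n_beats)   # true IBIs (s)
beat_times = np.cumsum(ibi)

def averaged_hr(beat_times, ibi, interval_s):
    """Mean HR (bpm) over consecutive windows of length interval_s."""
    edges = np.arange(0, beat_times[-1], interval_s)
    hr = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_win = (beat_times >= lo) & (beat_times < hi)
        if in_win.any():
            hr.append(60.0 / np.mean(ibi[in_win]))
    return np.array(hr)

true_rmssd = np.sqrt(np.mean(np.diff(ibi * 1000) ** 2))
for dt in (1.0, 5.0):
    hr = averaged_hr(beat_times, ibi, dt)
    pseudo = np.sqrt(np.mean(np.diff(60_000.0 / hr) ** 2))
    print(f"{dt:.0f} s sampling: pseudo-RMSSD = {pseudo:.1f} ms "
          f"(true RMSSD = {true_rmssd:.1f} ms)")

The pseudo-RMSSD values shrink as the sampling interval grows, which is exactly the attenuation discussed for the time-domain features in Section 4.1.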
2.2 What Is Preserved
Despite these losses, important information survives:
- Low-frequency trends: Sympathetic activation and withdrawal operate on timescales of tens of seconds to minutes, well within the capture range of 1–5 second sampling
- Overall variability: The spread of HR values across a window reflects total autonomic modulation
- Directional changes: HR increases/decreases associated with stress onset and recovery are clearly visible
- Relative comparisons: Within-subject, within-session comparisons (baseline vs. stress) remain valid even if absolute HRV values are biased
3. Preprocessing: From Raw Samples to Analysis-Ready Signal
3.1 Loading Smartwatch Data
Garmin devices export data in the proprietary .fit (Flexible and Interoperable Data Transfer) format. Heart rate records contain a timestamp and a heart rate value in beats per minute (bpm).
from fitparse import FitFile
import pandas as pd
import numpy as np
def load_garmin_hr(fit_path: str) -> pd.DataFrame:
"""
Extract heart rate time series from a Garmin .fit file.
Returns DataFrame with columns: timestamp (UTC datetime), hr (bpm)
"""
fitfile = FitFile(fit_path)
records = []
for record in fitfile.get_messages('record'):
ts = None
hr = None
for field in record:
if field.name == 'timestamp':
ts = field.value
elif field.name == 'heart_rate':
hr = field.value
if ts is not None and hr is not None:
records.append({'timestamp': ts, 'hr': hr})
df = pd.DataFrame(records)
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
df = df.sort_values('timestamp').reset_index(drop=True)
return df
3.2 Artifact Detection and Removal
Even aggregated HR values can contain artifacts caused by sensor lift-off, motion, or firmware glitches. We apply a three-stage filter:
Stage 1 — Physiological range gating: Remove samples outside 30–220 bpm.
Stage 2 — Successive difference threshold: Flag samples where |HR(t) - HR(t-1)| exceeds a fixed threshold. A sudden jump of 40+ bpm between consecutive 1-second samples almost certainly reflects artifact, not physiology.
Stage 3 — Accelerometer-based motion gating (when available): Exclude windows where accelerometer magnitude exceeds a threshold, indicating movement that degrades PPG signal quality.
def clean_hr_signal(df: pd.DataFrame,
hr_min: float = 30.0,
hr_max: float = 220.0,
max_successive_diff: float = 40.0) -> pd.DataFrame:
"""
Remove artifacts from HR time series.
Args:
df: DataFrame with 'timestamp' and 'hr' columns
hr_min: Minimum plausible HR (bpm)
hr_max: Maximum plausible HR (bpm)
max_successive_diff: Maximum allowed HR change between consecutive samples
Returns:
Cleaned DataFrame with artifact rows removed
"""
# Stage 1: Physiological range
mask = (df['hr'] >= hr_min) & (df['hr'] <= hr_max)
# Stage 2: Successive difference
diffs = df['hr'].diff().abs()
mask &= (diffs <= max_successive_diff) | diffs.isna()
cleaned = df[mask].copy().reset_index(drop=True)
pct_removed = 100 * (1 - len(cleaned) / len(df))
print(f"Artifact removal: {pct_removed:.1f}% of samples removed")
return cleaned
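Stage 3 requires accelerometer channels that are not present in every export. A minimal sketch, assuming hypothetical accel_x/accel_y/accel_z columns (in g) aligned with the HR timestamps; real exports may name or structure these differently:

def motion_gate(df: pd.DataFrame,
                accel_threshold_g: float = 1.3,
                window_s: float = 5.0) -> pd.DataFrame:
    """
    Stage 3 (sketch): drop HR samples from high-motion windows.
    Assumes hypothetical 'accel_x', 'accel_y', 'accel_z' columns (in g)
    on the same timestamps as 'hr'.
    """
    indexed = df.set_index('timestamp')
    mag = np.sqrt(indexed['accel_x'] ** 2 +
                  indexed['accel_y'] ** 2 +
                  indexed['accel_z'] ** 2)
    # Flag samples whose trailing window contains high-motion readings
    high_motion = mag.rolling(f'{int(window_s)}s').max() > accel_threshold_g
    return indexed[~high_motion].reset_index()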
3.3 Gap Detection and Interpolation
After artifact removal, gaps may appear in the time series. We distinguish between:
- Short gaps (< 10 seconds): Filled via cubic spline interpolation
- Medium gaps (10–60 seconds): Filled via linear interpolation with a quality flag
- Long gaps (> 60 seconds): Not interpolated; the segment is split into separate analysis windows
from scipy.interpolate import CubicSpline
def interpolate_hr(df: pd.DataFrame,
target_interval_s: float = 1.0,
max_gap_short_s: float = 10.0,
max_gap_medium_s: float = 60.0) -> pd.DataFrame:
"""
Resample HR to uniform intervals with gap-aware interpolation.
Returns DataFrame with uniform timestamps and interpolated HR.
Gaps longer than max_gap_medium_s produce NaN values.
"""
t_seconds = (df['timestamp'] - df['timestamp'].iloc[0]).dt.total_seconds().values
hr_values = df['hr'].values
# Create uniform time grid
t_uniform = np.arange(t_seconds[0], t_seconds[-1], target_interval_s)
# Cubic spline on original data
cs = CubicSpline(t_seconds, hr_values, extrapolate=False)
hr_interp = cs(t_uniform)
# Identify gaps in original data and mask accordingly
result = pd.DataFrame({
'timestamp': df['timestamp'].iloc[0] + pd.to_timedelta(t_uniform, unit='s'),
'hr': hr_interp,
'interpolated': True
})
# Mark original sample positions
for _, row in df.iterrows():
t_orig = (row['timestamp'] - df['timestamp'].iloc[0]).total_seconds()
idx = np.argmin(np.abs(t_uniform - t_orig))
if abs(t_uniform[idx] - t_orig) < target_interval_s:
result.loc[idx, 'interpolated'] = False
# Null out long gaps
for i in range(1, len(t_seconds)):
gap = t_seconds[i] - t_seconds[i - 1]
if gap > max_gap_medium_s:
gap_mask = (t_uniform > t_seconds[i - 1]) & (t_uniform < t_seconds[i])
result.loc[gap_mask, 'hr'] = np.nan
elif gap > max_gap_short_s:
# Linear interpolation for medium gaps (override cubic)
gap_mask = (t_uniform > t_seconds[i - 1]) & (t_uniform < t_seconds[i])
gap_indices = np.where(gap_mask)[0]
if len(gap_indices) > 0:
result.loc[gap_indices, 'hr'] = np.interp(
t_uniform[gap_indices],
[t_seconds[i - 1], t_seconds[i]],
[hr_values[i - 1], hr_values[i]]
)
return result
3.4 Conversion to Pseudo-IBI
A critical intermediate step is converting HR (bpm) to approximate inter-beat intervals (ms). While each HR sample represents an average over multiple beats, the instantaneous IBI approximation provides a bridge to standard HRV formulas:
IBI_approx(t) = 60,000 / HR(t) [milliseconds]
This conversion is exact only if HR is computed from a single beat interval—which it is not for smartwatch data. Nevertheless, it provides a useful proxy, especially when HR is relatively stable within the averaging window.
def hr_to_pseudo_ibi(hr_series: pd.Series) -> pd.Series:
"""Convert HR (bpm) to approximate IBI (ms)."""
return 60_000.0 / hr_series
4. Pseudo-HRV Feature Taxonomy
We organize pseudo-HRV features into four categories: time-domain, frequency-domain, nonlinear, and trend-based. Each feature is defined formally, with a discussion of how it relates to (and diverges from) its ECG-HRV counterpart.
4.1 Time-Domain Features
4.1.1 Pseudo-SDNN (Standard Deviation of NN Intervals)
ECG definition: Standard deviation of all normal-to-normal R-R intervals in a window.
Pseudo-HRV adaptation: Standard deviation of pseudo-IBI values in a window.
def pseudo_sdnn(hr_window: np.ndarray) -> float:
"""Compute pseudo-SDNN from HR samples (bpm)."""
ibi = 60_000.0 / hr_window
return np.std(ibi, ddof=1)
Expected bias: Pseudo-SDNN underestimates true SDNN because the averaging inherent in HR computation smooths beat-to-beat variability. The degree of underestimation depends on the ratio of HR sampling interval to mean IBI.
4.1.2 Pseudo-RMSSD (Root Mean Square of Successive Differences)
ECG definition: Square root of the mean of squared successive differences between adjacent R-R intervals.
Pseudo-HRV adaptation: RMSSD computed on pseudo-IBI values. This is the most commonly used parasympathetic index and the most challenging to approximate from low-frequency data.
def pseudo_rmssd(hr_window: np.ndarray) -> float:
"""Compute pseudo-RMSSD from HR samples (bpm)."""
ibi = 60_000.0 / hr_window
successive_diffs = np.diff(ibi)
return np.sqrt(np.mean(successive_diffs ** 2))
Expected bias: Substantial underestimation, particularly at low sampling rates. At 1 Hz, successive differences capture only slow fluctuations; the rapid beat-to-beat changes that dominate RMSSD are lost.
4.1.3 Pseudo-pNN50
ECG definition: Percentage of successive R-R intervals differing by more than 50 ms.
Pseudo-HRV adaptation: Same computation on pseudo-IBI, but with an adjusted threshold. Since averaging compresses variability, a lower threshold (e.g., 20 ms) may be more appropriate for smartwatch data.
def pseudo_pnn(hr_window: np.ndarray, threshold_ms: float = 50.0) -> float:
"""Compute pseudo-pNNxx from HR samples."""
ibi = 60_000.0 / hr_window
successive_diffs = np.abs(np.diff(ibi))
return 100.0 * np.sum(successive_diffs > threshold_ms) / len(successive_diffs)
4.1.4 HR Range and Coefficient of Variation
These simpler metrics are less affected by the smoothing problem:
def hr_range(hr_window: np.ndarray) -> float:
"""Max HR - Min HR in window."""
return np.max(hr_window) - np.min(hr_window)
def hr_cv(hr_window: np.ndarray) -> float:
"""Coefficient of variation of HR: SD / mean × 100."""
return 100.0 * np.std(hr_window, ddof=1) / np.mean(hr_window)
4.2 Frequency-Domain Features
4.2.1 The Sampling Problem
Standard frequency-domain HRV uses the power spectral density (PSD) of the R-R interval time series, typically computed via FFT after resampling to uniform 4 Hz. The canonical bands are:
- VLF: 0.003–0.04 Hz (thermoregulation, hormonal)
- LF: 0.04–0.15 Hz (mixed sympathetic/parasympathetic)
- HF: 0.15–0.40 Hz (parasympathetic, respiratory)
With smartwatch HR sampled at 1 Hz, the Nyquist frequency is 0.5 Hz—barely covering the HF band. At 0.2 Hz (5-second intervals), only VLF and part of LF are accessible. Furthermore, the non-uniform sampling of real smartwatch data makes FFT inappropriate without resampling.
4.2.2 Lomb-Scargle Periodogram
The Lomb-Scargle method estimates spectral power from unevenly sampled data without requiring interpolation to uniform intervals, making it well-suited for smartwatch HR:
from scipy.signal import lombscargle
def pseudo_frequency_features(timestamps_s: np.ndarray,
hr_values: np.ndarray) -> dict:
"""
Compute frequency-domain pseudo-HRV features using Lomb-Scargle.
Args:
timestamps_s: Time in seconds (not necessarily uniform)
hr_values: HR in bpm
Returns:
Dictionary with VLF, LF, HF power and LF/HF ratio
"""
# Convert to pseudo-IBI
ibi = 60_000.0 / hr_values
# Remove mean (detrend)
ibi_centered = ibi - np.mean(ibi)
# Define frequency grid
freq = np.linspace(0.003, 0.5, 1000)
angular_freq = 2 * np.pi * freq
    # Compute Lomb-Scargle periodogram (expects angular frequencies)
    pgram = lombscargle(timestamps_s, ibi_centered, angular_freq, normalize=True)
    # Integrate power in bands via trapezoidal rule
vlf_mask = (freq >= 0.003) & (freq < 0.04)
lf_mask = (freq >= 0.04) & (freq < 0.15)
hf_mask = (freq >= 0.15) & (freq <= 0.40)
vlf_power = np.trapz(pgram[vlf_mask], freq[vlf_mask])
lf_power = np.trapz(pgram[lf_mask], freq[lf_mask])
hf_power = np.trapz(pgram[hf_mask], freq[hf_mask])
total_power = vlf_power + lf_power + hf_power
return {
'vlf_power': vlf_power,
'lf_power': lf_power,
'hf_power': hf_power,
'total_power': total_power,
'lf_hf_ratio': lf_power / hf_power if hf_power > 0 else np.nan,
'lf_norm': lf_power / (lf_power + hf_power) if (lf_power + hf_power) > 0 else np.nan,
'hf_norm': hf_power / (lf_power + hf_power) if (lf_power + hf_power) > 0 else np.nan
}
Caveat: HF power estimates from 1 Hz data should be interpreted cautiously. The Lomb-Scargle method can estimate power near the Nyquist limit, but the estimates are noisy and potentially biased. For smartwatch data sampled at intervals > 2 seconds, we recommend reporting only VLF and LF features and noting that HF estimates are unreliable.
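A quick usage sketch on synthetic, unevenly sampled data (timestamps jittered to mimic real smartwatch logging; the injected 0.1 Hz oscillation should surface as LF power):

rng = np.random.default_rng(1)
t = np.cumsum(rng.uniform(0.8, 1.2, 300))  # ~5 minutes at ~1 Hz with jitter
hr = 70 + 3 * np.sin(2 * np.pi * 0.1 * t) + rng.normal(0, 0.5, 300)
feats = pseudo_frequency_features(t, hr)
print(f"LF norm: {feats['lf_norm']:.2f}  (expect most power in the LF band)")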
4.3 Nonlinear Features
4.3.1 Sample Entropy (SampEn)
Sample entropy quantifies the complexity/regularity of a time series. Lower SampEn indicates more regularity, which in HRV analysis is associated with stress and reduced autonomic flexibility (Richman & Moorman, 2000).
def sample_entropy(signal: np.ndarray, m: int = 2, r_factor: float = 0.2) -> float:
"""
Compute sample entropy of a signal.
Args:
signal: Input time series
m: Embedding dimension
r_factor: Tolerance as fraction of signal SD
"""
N = len(signal)
r = r_factor * np.std(signal)
def count_matches(template_len):
count = 0
templates = np.array([signal[i:i + template_len]
for i in range(N - template_len)])
for i in range(len(templates)):
for j in range(i + 1, len(templates)):
if np.max(np.abs(templates[i] - templates[j])) < r:
count += 1
return count
A = count_matches(m + 1)
B = count_matches(m)
if B == 0:
return np.nan
return -np.log(A / B) if A > 0 else np.nan
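As a sanity check, SampEn should be low for a highly regular signal and substantially higher for white noise:

rng = np.random.default_rng(2)
regular = np.sin(np.linspace(0, 20 * np.pi, 300))  # predictable oscillation
noise = rng.normal(0, 1, 300)                      # unpredictable
print(f"SampEn (regular): {sample_entropy(regular):.2f}")  # low
print(f"SampEn (noise):   {sample_entropy(noise):.2f}")    # markedly higher

Note that this reference implementation is O(N²) in the window length; for long windows, a vectorized or tree-based implementation is preferable.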
4.3.2 Detrended Fluctuation Analysis (DFA) — Short-Term (α1)
DFA quantifies the fractal scaling properties of a time series. The short-term exponent α1 (computed over scales 4–16 beats) is a sensitive stress marker (Penttilä et al., 2001). For pseudo-IBI from smartwatch data, the “beat” axis is replaced by sample indices, which means α1 reflects scaling at the sample resolution rather than the beat resolution.
def dfa_alpha1(signal: np.ndarray, scales: list = None) -> float:
"""
Compute DFA short-term scaling exponent (α1).
Args:
signal: Input time series (pseudo-IBI or HR)
scales: Box sizes for DFA (default: 4 to 16)
"""
if scales is None:
max_scale = min(16, len(signal) // 4)
scales = list(range(4, max_scale + 1))
if len(scales) < 2:
return np.nan
# Integrate the mean-subtracted signal
y = np.cumsum(signal - np.mean(signal))
    fluctuations = []
    used_scales = []
    for scale in scales:
        n_segments = len(y) // scale
        if n_segments < 1:
            continue
        rms_values = []
        for seg in range(n_segments):
            segment = y[seg * scale:(seg + 1) * scale]
            x = np.arange(scale)
            coeffs = np.polyfit(x, segment, 1)
            trend = np.polyval(coeffs, x)
            rms_values.append(np.sqrt(np.mean((segment - trend) ** 2)))
        fluctuations.append(np.mean(rms_values))
        used_scales.append(scale)
    if len(fluctuations) < 2:
        return np.nan
    # Pair each fluctuation with the scale actually used; indexing scales by
    # position would misalign the log-log regression if any scale was skipped
    log_scales = np.log(used_scales)
    log_fluct = np.log(fluctuations)
    slope, _ = np.polyfit(log_scales, log_fluct, 1)
return slope
4.4 Trend-Based Features
These features exploit what smartwatch data does capture well: HR dynamics over seconds to minutes.
def trend_features(hr_window: np.ndarray,
timestamps_s: np.ndarray) -> dict:
"""
Compute trend-based pseudo-HRV features.
Args:
hr_window: HR values (bpm)
timestamps_s: Corresponding timestamps in seconds
Returns:
Dictionary of trend features
"""
# HR slope (linear regression)
slope, intercept = np.polyfit(timestamps_s - timestamps_s[0], hr_window, 1)
# Residual variability after detrending
trend_line = slope * (timestamps_s - timestamps_s[0]) + intercept
residuals = hr_window - trend_line
residual_std = np.std(residuals, ddof=1)
# First differences of HR
hr_diffs = np.diff(hr_window)
# Acceleration (second differences)
hr_accel = np.diff(hr_diffs)
return {
'hr_slope': slope, # bpm/sec
'hr_residual_std': residual_std, # bpm
'hr_mean_abs_diff': np.mean(np.abs(hr_diffs)), # bpm
'hr_diff_std': np.std(hr_diffs, ddof=1), # bpm
'hr_accel_std': np.std(hr_accel, ddof=1) if len(hr_accel) > 1 else np.nan,
'hr_mean': np.mean(hr_window), # bpm
'hr_median': np.median(hr_window), # bpm
'hr_iqr': np.percentile(hr_window, 75) - np.percentile(hr_window, 25)
}
5. Complete Feature Extraction Pipeline
5.1 Windowed Extraction
Features are computed over sliding windows, typically 60–300 seconds with 50% overlap. Shorter windows capture acute changes; longer windows provide more stable estimates.
def extract_all_features(df: pd.DataFrame,
window_sec: float = 120.0,
step_sec: float = 60.0,
min_coverage: float = 0.7) -> pd.DataFrame:
"""
Extract pseudo-HRV features from cleaned, interpolated HR data.
Args:
df: DataFrame with 'timestamp' and 'hr' columns (uniform sampling)
window_sec: Window size in seconds
step_sec: Step size in seconds
min_coverage: Minimum fraction of non-NaN samples required
Returns:
DataFrame with one row per window and all features
"""
t0 = df['timestamp'].iloc[0]
t_sec = (df['timestamp'] - t0).dt.total_seconds().values
hr = df['hr'].values
features_list = []
window_start = t_sec[0]
while window_start + window_sec <= t_sec[-1]:
mask = (t_sec >= window_start) & (t_sec < window_start + window_sec)
hr_w = hr[mask]
ts_w = t_sec[mask]
        # Check coverage (guard against empty windows)
        valid = ~np.isnan(hr_w)
        if len(hr_w) == 0 or np.sum(valid) / len(hr_w) < min_coverage:
            window_start += step_sec
            continue
hr_clean = hr_w[valid]
ts_clean = ts_w[valid]
# Time-domain
feats = {
'window_start': t0 + pd.Timedelta(seconds=window_start),
'window_center': t0 + pd.Timedelta(seconds=window_start + window_sec / 2),
'coverage': np.sum(valid) / len(hr_w),
'pseudo_sdnn': pseudo_sdnn(hr_clean),
'pseudo_rmssd': pseudo_rmssd(hr_clean),
'pseudo_pnn50': pseudo_pnn(hr_clean, threshold_ms=50.0),
'pseudo_pnn20': pseudo_pnn(hr_clean, threshold_ms=20.0),
'hr_range': hr_range(hr_clean),
'hr_cv': hr_cv(hr_clean),
}
# Frequency-domain
if len(hr_clean) >= 30:
freq_feats = pseudo_frequency_features(ts_clean, hr_clean)
feats.update(freq_feats)
# Trend-based
feats.update(trend_features(hr_clean, ts_clean))
# Nonlinear
ibi_w = 60_000.0 / hr_clean
if len(ibi_w) >= 30:
feats['sample_entropy'] = sample_entropy(ibi_w)
feats['dfa_alpha1'] = dfa_alpha1(ibi_w)
features_list.append(feats)
window_start += step_sec
return pd.DataFrame(features_list)
5.2 Baseline Normalization
Within-subject, within-session normalization improves comparability across individuals with different resting HR levels:
def normalize_to_baseline(features_df: pd.DataFrame,
baseline_duration_sec: float = 120.0,
step_sec: float = 60.0) -> pd.DataFrame:
"""
Express each feature as a percentage change from the baseline window.
Assumes the first window(s) correspond to the resting baseline.
"""
n_baseline_windows = max(1, int(baseline_duration_sec / step_sec))
baseline = features_df.iloc[:n_baseline_windows]
numeric_cols = features_df.select_dtypes(include=[np.number]).columns
normalized = features_df.copy()
for col in numeric_cols:
bl_mean = baseline[col].mean()
if bl_mean != 0 and not np.isnan(bl_mean):
normalized[col] = 100.0 * (features_df[col] - bl_mean) / abs(bl_mean)
else:
normalized[col] = 0.0
return normalized
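Putting the pipeline together, a minimal end-to-end run might look like this (the file path is a placeholder):

raw = load_garmin_hr('session.fit')  # placeholder path
clean = clean_hr_signal(raw)
uniform = interpolate_hr(clean, target_interval_s=1.0)
features = extract_all_features(uniform, window_sec=120.0, step_sec=60.0)
normalized = normalize_to_baseline(features, baseline_duration_sec=120.0,
                                   step_sec=60.0)
print(normalized[['window_center', 'pseudo_sdnn', 'hr_slope']].head())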
6. Validation Against ECG Ground Truth
6.1 Study Design
Validation requires simultaneous recording of clinical ECG and smartwatch HR during a controlled stress protocol. The protocol used in this project consists of:
- Baseline (5 minutes): Seated rest, neutral video
- Stress induction (variable): Cognitive task with time pressure (mental arithmetic, bomb defusal) or emotionally charged video
- Recovery (5 minutes): Seated rest, calming video
ECG is recorded at 500 Hz using a clinical-grade monitor. The smartwatch (Garmin Forerunner/Fenix series) records HR in foreground activity mode (~1 Hz).
6.2 ECG-HRV Reference Pipeline
import neurokit2 as nk
def compute_ecg_hrv(ecg_signal: np.ndarray,
                    sampling_rate: int = 500) -> dict:
    """
    Compute gold-standard HRV from ECG signal.
    Returns a dictionary of standard HRV metrics.
    """
# Detect R-peaks
_, info = nk.ecg_process(ecg_signal, sampling_rate=sampling_rate)
rpeaks = info['ECG_R_Peaks']
# Compute R-R intervals (ms)
rr_intervals = np.diff(rpeaks) / sampling_rate * 1000
# Time-domain
sdnn = np.std(rr_intervals, ddof=1)
rmssd = np.sqrt(np.mean(np.diff(rr_intervals) ** 2))
pnn50 = 100.0 * np.sum(np.abs(np.diff(rr_intervals)) > 50) / len(np.diff(rr_intervals))
# Frequency-domain (via NeuroKit2)
hrv_freq = nk.hrv_frequency(
{'ECG_R_Peaks': rpeaks},
sampling_rate=sampling_rate
)
return {
'ecg_sdnn': sdnn,
'ecg_rmssd': rmssd,
'ecg_pnn50': pnn50,
'ecg_lf_power': hrv_freq['HRV_LF'].values[0],
'ecg_hf_power': hrv_freq['HRV_HF'].values[0],
'ecg_lf_hf_ratio': hrv_freq['HRV_LFHF'].values[0]
}
6.3 Agreement Analysis
6.3.1 Bland-Altman Analysis
The Bland-Altman plot is the standard method for assessing agreement between two measurement methods (Bland & Altman, 1986). For each pseudo-HRV metric and its ECG counterpart, we compute:
- Bias: Mean difference (pseudo - ECG)
- Limits of agreement: Bias ± 1.96 × SD of differences
- Proportional bias: Regression of differences on means
import matplotlib.pyplot as plt
def bland_altman_plot(ecg_values: np.ndarray,
pseudo_values: np.ndarray,
metric_name: str):
"""Generate Bland-Altman agreement plot."""
means = (ecg_values + pseudo_values) / 2
diffs = pseudo_values - ecg_values
bias = np.mean(diffs)
sd = np.std(diffs, ddof=1)
upper_loa = bias + 1.96 * sd
lower_loa = bias - 1.96 * sd
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(means, diffs, alpha=0.5, s=20)
ax.axhline(bias, color='red', linestyle='-', label=f'Bias: {bias:.2f}')
ax.axhline(upper_loa, color='gray', linestyle='--',
label=f'Upper LoA: {upper_loa:.2f}')
ax.axhline(lower_loa, color='gray', linestyle='--',
label=f'Lower LoA: {lower_loa:.2f}')
ax.set_xlabel(f'Mean of ECG and Pseudo {metric_name}')
    ax.set_ylabel('Difference (Pseudo - ECG)')
ax.set_title(f'Bland-Altman: {metric_name}')
ax.legend()
return fig, {'bias': bias, 'upper_loa': upper_loa, 'lower_loa': lower_loa}
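The proportional-bias check listed above can be implemented by regressing the differences on the means; a slope significantly different from zero indicates that disagreement scales with the magnitude of the measurement:

from scipy.stats import linregress

def proportional_bias(ecg_values: np.ndarray,
                      pseudo_values: np.ndarray) -> dict:
    """Regress (pseudo - ECG) differences on pairwise means."""
    means = (ecg_values + pseudo_values) / 2
    diffs = pseudo_values - ecg_values
    fit = linregress(means, diffs)
    return {'slope': fit.slope, 'p_value': fit.pvalue, 'r': fit.rvalue}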
6.3.2 Intraclass Correlation Coefficient (ICC)
ICC quantifies the consistency between ECG-HRV and pseudo-HRV across windows and participants. ICC(2,1) (two-way random, single measures, absolute agreement) is appropriate for this comparison:
def compute_icc(ecg_values: np.ndarray,
                pseudo_values: np.ndarray) -> dict:
    """
    Compute ICC(2,1) for agreement between ECG-HRV and pseudo-HRV.
    Returns the ICC point estimate and the number of paired observations.
    """
n = len(ecg_values)
data = np.column_stack([ecg_values, pseudo_values])
# Two-way ANOVA components
grand_mean = np.mean(data)
row_means = np.mean(data, axis=1)
col_means = np.mean(data, axis=0)
ss_total = np.sum((data - grand_mean) ** 2)
ss_rows = 2 * np.sum((row_means - grand_mean) ** 2)
ss_cols = n * np.sum((col_means - grand_mean) ** 2)
ss_error = ss_total - ss_rows - ss_cols
k = 2 # number of raters (ECG, pseudo)
ms_rows = ss_rows / (n - 1)
ms_cols = ss_cols / (k - 1)
ms_error = ss_error / ((n - 1) * (k - 1))
icc = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error +
k * (ms_cols - ms_error) / n)
return {'icc': icc, 'n_observations': n}
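A quick synthetic check of the estimator, assuming the surrogate tracks the reference with attenuation and noise (all values illustrative):

rng = np.random.default_rng(0)
ecg_vals = rng.normal(50, 15, 100)                    # simulated ECG SDNN (ms)
pseudo_vals = 0.7 * ecg_vals + rng.normal(0, 5, 100)  # attenuated, noisy surrogate
print(compute_icc(ecg_vals, pseudo_vals))
# Absolute-agreement ICC penalizes the systematic attenuation, so the value
# is lower than a pure correlation would suggest.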
6.3.3 Responsiveness Concordance
Beyond absolute agreement, we assess whether pseudo-HRV and ECG-HRV change in the same direction across experimental phases. This is often more relevant for stress detection than absolute accuracy:
def responsiveness_concordance(ecg_baseline: float, ecg_stress: float,
pseudo_baseline: float, pseudo_stress: float) -> bool:
"""
Check if both methods detect the same direction of change
from baseline to stress.
"""
ecg_delta = ecg_stress - ecg_baseline
pseudo_delta = pseudo_stress - pseudo_baseline
return np.sign(ecg_delta) == np.sign(pseudo_delta)
7. Expected Results and Interpretation Guidelines
7.1 Feature Reliability Hierarchy
Based on the literature and theoretical analysis, we propose the following reliability hierarchy for pseudo-HRV features derived from 1 Hz smartwatch HR:
| Feature | Expected ICC vs ECG | Stress Sensitivity | Recommended? |
|---|---|---|---|
| HR mean | > 0.95 | High (direction) | Yes |
| HR slope | > 0.85 | High (acute stress onset) | Yes |
| HR range / IQR | 0.70–0.85 | Moderate | Yes |
| HR CV | 0.65–0.80 | Moderate | Yes |
| Pseudo-SDNN | 0.50–0.75 | Moderate (underestimates) | With caution |
| LF power (Lomb-Scargle) | 0.40–0.65 | Low–Moderate | With caution |
| Pseudo-RMSSD | 0.30–0.55 | Low (substantial bias) | Report bias |
| Trend features | 0.70–0.90 | High | Yes |
| Sample entropy | 0.35–0.60 | Moderate | Exploratory |
| DFA α1 | 0.30–0.55 | Low–Moderate | Exploratory |
| HF power | < 0.40 | Low (aliased at 1 Hz) | Not recommended |
| Pseudo-pNN50 | < 0.35 | Low (threshold too coarse) | Not recommended |
7.2 When Pseudo-HRV Is Sufficient
Pseudo-HRV features are appropriate when:
- Detecting stress direction: Baseline vs. stress classification (binary), where relative changes matter more than absolute values
- Within-subject comparisons: Tracking how an individual’s autonomic state changes over time
- Low-frequency dynamics: Studying sympathetic activation patterns that unfold over minutes
- Large-sample studies: When ECG is impractical and statistical power compensates for individual measurement noise
- Screening and triage: Identifying individuals who may benefit from more detailed clinical assessment
7.3 When ECG Is Required
Pseudo-HRV is not a substitute for clinical ECG-HRV when:
- Absolute HRV values are needed (e.g., clinical risk stratification)
- High-frequency analysis is required (respiratory sinus arrhythmia, vagal tone)
- Beat-to-beat precision matters (arrhythmia detection, ectopic beat analysis)
- Short windows (< 60 seconds) are used, where few HR samples provide insufficient data
- Regulatory or clinical contexts demand validated measurement instruments
8. Practical Recommendations
8.1 Data Collection
- Maximize sampling rate: Use foreground recording on the smartwatch (activity mode) rather than background monitoring. Garmin devices typically provide 1 Hz HR in activity mode vs. irregular sampling in background mode
- Minimize motion: For controlled experiments, keep participants seated during critical measurement windows. Use accelerometer data to flag and exclude high-motion periods
- Record long windows: Use at least 120-second analysis windows to ensure sufficient samples. For frequency-domain analysis, 300 seconds is preferable
- Collect simultaneously: When possible, record ECG in parallel for at least a subset of participants to calibrate pseudo-HRV against ground truth
8.2 Analysis
- Always report the source: Clearly label features as “pseudo-HRV” or “HR-derived” to distinguish from ECG-based HRV
- Use robust features first: Prioritize HR mean, slope, range, and CV before attempting RMSSD or frequency-domain approximations
- Apply baseline normalization: Within-subject percent-change from baseline improves comparability and reduces the impact of systematic bias
- Combine features: A multi-feature approach (HR trend + variability + nonlinear) outperforms any single metric for stress classification
- Validate with ground truth: When ECG data are available (even from prior studies with the same protocol), quantify agreement and report Bland-Altman statistics
8.3 Reporting
When publishing results based on pseudo-HRV:
- State the smartwatch model, firmware version, and recording mode
- Report the actual sampling interval distribution (mean, SD, range)
- Describe artifact rejection criteria and the percentage of data removed
- Specify interpolation method and parameters
- Report all pseudo-HRV features with clear notation (e.g., “pseudo-RMSSD” rather than “RMSSD”)
- Include agreement statistics against ECG if available
- Discuss limitations of the measurement approach
9. Discussion
9.1 The Pragmatic Case for Pseudo-HRV
The clinical HRV literature spans decades and thousands of studies, nearly all using ECG. Moving to smartwatch-derived approximations inevitably sacrifices fidelity. However, this tradeoff enables something ECG cannot: continuous, unobtrusive monitoring in daily life at scale. A participant wearing a Garmin watch for two weeks generates thousands of HR samples across dozens of contexts—sleeping, commuting, working, exercising—providing ecological validity that no clinical session can match.
The pseudo-HRV framework acknowledges this tradeoff explicitly. Rather than claiming equivalence with ECG-HRV, it defines a parallel set of features with known biases and documented validity ranges, allowing researchers to make informed decisions about which analyses are defensible.
9.2 The Role of Machine Learning
An important alternative to computing individual pseudo-HRV features is to feed raw HR time series (or simple statistical summaries) directly into machine learning models trained to predict stress labels. In this approach, the model implicitly learns whatever information about autonomic state is available in the HR signal, without the intermediate step of approximating traditional HRV metrics. Our feature taxonomy remains valuable as an interpretability tool—understanding why a model classifies a window as stressed is easier when the features have physiological meaning.
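A minimal sketch of that approach, assuming the window-level feature table from Section 5 and a hypothetical binary stress label per window (scikit-learn used for illustration; in practice, cross-validation should group by participant to avoid leakage):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 'features' is assumed to come from extract_all_features();
# 'labels' is an assumed array of 0/1 stress annotations per window.
feature_cols = ['hr_mean', 'hr_slope', 'hr_residual_std', 'hr_cv',
                'pseudo_sdnn', 'pseudo_rmssd']
X = features[feature_cols].fillna(features[feature_cols].median()).values
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, labels, cv=5, scoring='roc_auc')
print(f"Stress classification AUC: {scores.mean():.2f} ± {scores.std():.2f}")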
9.3 Future Directions
- Device-specific calibration: Different smartwatch models introduce different smoothing and averaging. Transfer functions mapping pseudo-HRV to ECG-HRV could be device-specific
- IBI-capable devices: Some newer Garmin models (Fenix 7+, Forerunner 965) expose inter-beat intervals via the .fit file's hrv_status or stress messages. When available, these should be used instead of HR-derived pseudo-IBI, dramatically improving feature quality
- Multi-sensor fusion: Combining HR with accelerometer, skin temperature, and electrodermal activity (on supported devices) may compensate for the limited HRV information in HR-only data
- Normative databases: Building population-level norms for pseudo-HRV features, stratified by device and recording context, would facilitate clinical translation
10. Conclusion
Pseudo-HRV features derived from intermittent smartwatch heart rate samples occupy a pragmatic middle ground between no physiological measurement and clinical ECG-HRV. They cannot replace the gold standard, but they can extend the reach of stress research into daily life, enable large-scale studies, and provide meaningful within-subject stress tracking.
The methodology presented here—artifact rejection, gap-aware interpolation, a four-category feature taxonomy, and systematic validation against ECG ground truth—provides a principled framework for extracting and evaluating these features. By being transparent about the biases and limitations inherent in the approach, researchers can make defensible use of consumer wearable data while clearly communicating the boundaries of their measurements.
For the StressSmartWatch project specifically, pseudo-HRV features form the core of the smartwatch-only stress estimation pipeline, complemented by behavioral data from joystick interactions and self-reported stress. The validation phase will quantify exactly which features survive the transition from ECG to smartwatch and how effectively they support stress detection in controlled and daily-life settings.
References
Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. (1996). Heart rate variability: Standards of measurement, physiological interpretation, and clinical use. Circulation, 93(5), 1043–1065.
Thayer, J. F., Åhs, F., Fredrikson, M., Sollers, J. J., & Wager, T. D. (2012). A meta-analysis of heart rate variability and neuroimaging studies: Implications for heart rate variability as a marker of stress and health. Neuroscience & Biobehavioral Reviews, 36(2), 747–756.
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307–310.
Richman, J. S., & Moorman, J. R. (2000). Physiological time-series analysis using approximate entropy and sample entropy. American Journal of Physiology—Heart and Circulatory Physiology, 278(6), H2039–H2049.
Penttilä, J., Helminen, A., Jartti, T., Kuusela, T., Huikuri, H. V., Tulppo, M. P., & Scheinin, H. (2001). Time domain, geometrical and frequency domain analysis of cardiac vagal outflow: Effects of various respiratory patterns. Clinical Physiology, 21(3), 365–376.
Pinto, G., Carvalho, J. M., Barros, F., Soares, S. C., Pinho, A. J., & Brás, S. (2020). Multimodal emotion evaluation: A physiological model for cost-effective emotion classification. Sensors, 20(12), 3510.
Schäfer, A., & Vagedes, J. (2013). How accurate is pulse rate variability as an estimate of heart rate variability? A review on studies comparing photoplethysmographic technology with an electrocardiogram. International Journal of Cardiology, 166(1), 15–29.
Lomb, N. R. (1976). Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science, 39, 447–462.
Peng, C. K., Havlin, S., Stanley, H. E., & Goldberger, A. L. (1995). Quantification of scaling exponents and crossover phenomena in nonstationary heartbeat time series. Chaos, 5(1), 82–87.
Plews, D. J., Scott, B., Altini, M., Wood, M., Kilding, A. E., & Laursen, P. B. (2017). Comparison of heart-rate-variability recording with smartphone photoplethysmography, Polar H7 chest strap, and electrocardiography. International Journal of Sports Physiology and Performance, 12(10), 1324–1328.
Appendix A: Quick-Reference Feature Table
| Feature | Formula (from HR in bpm) | Window | Units |
|---|---|---|---|
| HR mean | mean(HR) | 60–300 s | bpm |
| HR SD | std(HR) | 60–300 s | bpm |
| HR CV | 100 × std(HR)/mean(HR) | 60–300 s | % |
| HR range | max(HR) - min(HR) | 60–300 s | bpm |
| HR slope | linear regression slope | 60–300 s | bpm/s |
| HR residual SD | std(HR - trend) | 60–300 s | bpm |
| Pseudo-SDNN | std(60000/HR) | 120–300 s | ms |
| Pseudo-RMSSD | sqrt(mean(diff(60000/HR)²)) | 120–300 s | ms |
| Pseudo-pNN20 | % of \|diff(60000/HR)\| > 20 ms | 120–300 s | % |
| LF power | Lomb-Scargle 0.04–0.15 Hz | 300 s | ms² |
| LF/HF ratio | LF power / HF power | 300 s | — |
| Sample entropy | SampEn(IBI, m=2, r=0.2×SD) | 120–300 s | — |
| DFA α1 | DFA scaling exponent | 120–300 s | — |
Appendix B: Minimum Data Requirements
| Analysis Type | Min Window | Min Sampling Rate | Min Samples |
|---|---|---|---|
| HR trend features | 60 s | 0.2 Hz | 12 |
| Time-domain pseudo-HRV | 120 s | 1 Hz | 120 |
| Frequency-domain (VLF+LF) | 300 s | 0.5 Hz | 150 |
| Frequency-domain (with HF) | 300 s | 1 Hz | 300 |
| Nonlinear (SampEn, DFA) | 120 s | 1 Hz | 120 |
| Stress classification model | 120 s | 1 Hz | 120 |