Validation of Consumer Smartwatch Heart Rate for Stress Detection: A Comparison with ECG-Derived HRV

Abstract

Heart rate variability (HRV) derived from electrocardiography (ECG) is the established ground truth for physiological stress assessment, yet ECG requires skin-contact electrodes that limit ecological validity and participant comfort. Consumer smartwatches equipped with photoplethysmography (PPG) sensors offer a less obtrusive alternative, providing continuous heart rate (HR) measurements during daily activities. However, smartwatches do not provide continuous ECG, raising a fundamental question: can stress-relevant information be extracted from PPG-based heart rate alone? This article presents a validation framework comparing consumer smartwatch HR data against ECG-derived HRV for stress detection. We examine the physiological basis of both measurement modalities, relate pulse rate variability (PRV) to HRV, define surrogate metrics computable from low-frequency HR samples, evaluate agreement using Bland-Altman analysis and correlation methods, and discuss practical strategies for motion artifact mitigation. Our analysis indicates that while smartwatch-derived metrics cannot replicate beat-to-beat HRV fidelity, they are expected to detect meaningful stress-related changes in controlled settings when appropriate preprocessing and motion gating are applied, with a target segment-level classification AUC of at least 0.70.


1. Introduction

1.1 The Promise of Wearable Stress Monitoring

Chronic stress is a major public health concern linked to cardiovascular disease, immune dysfunction, and mental health disorders (McEwen, 2007). Objective, continuous stress monitoring could enable timely interventions, yet traditional assessment methods rely on either subjective self-report or laboratory-grade physiological recording equipment. The proliferation of consumer smartwatches — over 200 million units shipped globally in 2024 — presents an unprecedented opportunity to bridge this gap. Devices from Garmin, Apple, Samsung, and others continuously record heart rate via wrist-mounted optical sensors, and increasingly offer proprietary “stress scores” to end users.

1.2 The Ground Truth Problem

Heart rate variability (HRV), the variation in time intervals between successive heartbeats, is one of the most widely validated biomarkers for autonomic nervous system (ANS) activity and stress (Task Force, 1996). HRV analysis requires precise detection of R-peaks in an ECG signal, from which inter-beat intervals (IBIs, also called RR intervals) are computed. Time-domain metrics such as RMSSD (root mean square of successive differences) and SDNN (standard deviation of NN intervals), as well as frequency-domain metrics like the LF/HF ratio, serve as indices of sympathovagal balance.

The challenge is straightforward: off-the-shelf consumer smartwatches do not provide continuous ECG recording. Instead, they use photoplethysmography (PPG) to estimate heart rate at intervals of 1–5 seconds. This sampling regime is orders of magnitude coarser than the millisecond-resolution RR intervals needed for classical HRV analysis.

1.3 Research Questions

This validation study addresses three core questions:

  1. Agreement: How closely do smartwatch-derived pulse rate variability (PRV) metrics approximate ECG-derived HRV metrics during controlled stress protocols?
  2. Discrimination: Can smartwatch HR features distinguish between baseline, stress, and recovery states with sufficient accuracy for research and applied use?
  3. Robustness: Under what conditions (rest vs. motion, controlled vs. daily life) does the smartwatch approach retain validity?

1.4 Scope and Contribution

This work is part of a master’s thesis project at IEETA (Institute of Electronics and Informatics Engineering of Aveiro) focused on estimating stress levels using only smartwatches (Fernandes et al., project brief). We build upon prior emotion studies using ECG-based HRV (Pinto et al., 2020) and extend the analysis to consumer-grade wearable data. Our contributions include:

  • A systematic comparison framework for smartwatch HR vs. ECG-HRV
  • Surrogate PRV metrics designed for low-frequency HR sampling
  • Motion gating strategies using accelerometer data
  • Practical guidelines for researchers adopting smartwatch-based stress assessment

2. Background: ECG, PPG, and What Smartwatches Actually Measure

2.1 Electrocardiography (ECG): The Gold Standard

An ECG records the electrical activity of the heart via electrodes placed on the skin. The QRS complex — and specifically the R-peak — marks ventricular depolarization. The time between consecutive R-peaks (RR interval) is the fundamental unit of HRV analysis.

Key properties of ECG for HRV:

  • Temporal resolution: ~1 ms (1000 Hz typical sampling)
  • Signal origin: Electrical (cardiac conduction system)
  • Artifact sources: Electrode displacement, muscle noise, powerline interference
  • Setup burden: Electrodes, leads, conductive gel, stationary or semi-stationary recording

2.2 Photoplethysmography (PPG): What the Smartwatch Sees

PPG sensors emit green LED light into the skin and measure reflected light intensity. Blood volume changes in the microvasculature modulate light absorption, producing a pulsatile waveform. Each pulse corresponds (approximately) to a heartbeat.

Key properties of PPG for HR estimation:

  • Temporal resolution: Varies by device; raw PPG at 25–100 Hz, but reported HR typically at 1 Hz or lower
  • Signal origin: Optical (peripheral blood volume changes)
  • Artifact sources: Motion (dominant), skin tone, sensor placement, ambient light, sweat
  • Setup burden: Minimal (wear on wrist)

2.3 From PPG to Pulse Rate Variability (PRV)

When PPG is sampled at sufficiently high frequency and individual pulse peaks are detected, the inter-pulse intervals (IPIs) can serve as surrogates for RR intervals. This yields Pulse Rate Variability (PRV) — the optical analogue of HRV.
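As an illustration of this step, the sketch below detects pulse peaks in a PPG trace and converts them to inter-pulse intervals. It assumes access to a raw PPG waveform, which most consumer devices do not expose (see Section 2.4). The function name ppg_to_ipi_ms, the 200 bpm ceiling, and the synthetic 1.2 Hz test signal are illustrative choices and not part of the study pipeline.

```python
import numpy as np
from scipy.signal import find_peaks

def ppg_to_ipi_ms(ppg, sampling_rate, max_hr_bpm=200):
    """Return inter-pulse intervals (ms) from a raw PPG waveform.

    Hypothetical helper: a bare peak picker with a refractory distance,
    without the filtering or artifact rejection a real pipeline needs.
    """
    # Minimum spacing between peaks, derived from the assumed HR ceiling
    min_distance = max(1, int(sampling_rate * 60.0 / max_hr_bpm))
    peaks, _ = find_peaks(ppg, distance=min_distance)
    return np.diff(peaks) / sampling_rate * 1000.0

# Synthetic pulse wave: 1.2 Hz sine (72 bpm) sampled at 100 Hz for 30 s
fs = 100
t = np.arange(0, 30, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t)

ipi_ms = ppg_to_ipi_ms(ppg, fs)
prv_rmssd = float(np.sqrt(np.mean(np.diff(ipi_ms) ** 2)))
```

At 100 Hz the intervals are quantized to 10 ms steps, so even this noise-free signal yields a non-zero RMSSD. This is one reason the raw PPG sampling rate matters for PRV.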

Prior studies report that PRV and HRV agree strongly at rest (r > 0.95 for RMSSD) but diverge during movement and hemodynamic stress (Schäfer & Vagedes, 2013; Georgiou et al., 2018). The agreement depends critically on:

  • Sensor quality: Higher-end optical sensors with multiple wavelengths perform better
  • Body site: Wrist PPG is more motion-sensitive than finger or ear PPG
  • Activity level: Agreement degrades substantially during physical activity
  • Algorithm quality: Peak detection and artifact rejection vary by manufacturer

2.4 The Consumer Smartwatch Reality

Most consumer smartwatches (Garmin Forerunner, Fenix, Venu series; Apple Watch; Samsung Galaxy Watch) do not expose raw PPG waveforms or individual pulse intervals to third-party applications. Instead, they provide:

Data Available | Typical Resolution | Access Method
Heart rate (HR) | 1 value per 1–5 seconds | Connect IQ API / Health APIs
Beat-to-beat intervals (BBI/IBI) | Available on select models | FIT file export, device-dependent
Stress score | Proprietary, ~3 min intervals | Garmin Connect export
Accelerometer | 25–100 Hz | Connect IQ Sensor API

This means that for many devices, researchers work with HR trend data (a time series of instantaneous HR values at 1–5 second intervals), not raw IBI data. This fundamentally constrains which HRV metrics can be computed.
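The effect of this constraint can be illustrated with simulated data. The sketch below generates synthetic RR intervals, derives a 1 Hz HR trend by averaging instantaneous HR over a trailing 5-second window, and compares RMSSD before and after. The trailing average is an assumed stand-in for device-side smoothing, which is proprietary; all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic beat-to-beat data: 400 beats around 800 ms with 30 ms jitter
rr_ms = 800 + rng.normal(0, 30, 400)
beat_times_s = np.cumsum(rr_ms) / 1000.0

def rmssd(x):
    """Root mean square of successive differences."""
    return float(np.sqrt(np.mean(np.diff(x) ** 2)))

def hr_trend_from_rr(rr_ms, beat_times_s, smooth_sec=5.0):
    """Emit one HR value per second as the mean instantaneous HR of the
    beats in the trailing window (assumed smoothing model)."""
    inst_hr = 60000.0 / rr_ms
    trend = []
    for t in np.arange(smooth_sec, beat_times_s[-1], 1.0):
        in_window = (beat_times_s > t - smooth_sec) & (beat_times_s <= t)
        trend.append(inst_hr[in_window].mean())
    return np.array(trend)

hr_trend = hr_trend_from_rr(rr_ms, beat_times_s)

# Convert the HR trend back to pseudo-intervals and compare variability
pseudo_rr_ms = 60000.0 / hr_trend
true_rmssd = rmssd(rr_ms)
trend_rmssd = rmssd(pseudo_rr_ms)
```

Under these assumptions the RMSSD recovered from the trend is smaller than the beat-to-beat value, because averaging removes most of the short-term variability before it reaches the researcher.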


3. Methodology: Validation Framework

3.1 Experimental Protocol

Our validation framework uses a controlled stress induction protocol aligned with the broader thesis experiment design:

Phase 1: Baseline        (5 min)  - Seated rest, neutral stimulus
Phase 2: Stress Induction (10 min) - Cognitive task (joystick) or emotional video
Phase 3: Recovery         (5 min)  - Seated rest, calming stimulus

Participants wear both:

  • Garmin smartwatch (Forerunner 965, Fenix 7, or Venu 3 with Elevate Gen 5 sensor) on the non-dominant wrist
  • ECG chest strap (Polar H10 or equivalent research-grade device) as ground truth

Both devices record simultaneously during all three phases. Self-report measures (SAM, Likert stress scales) are collected after each phase.
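For analysis, each sample needs a phase label. A minimal sketch of that mapping follows, assuming the phases run back to back from a known session start with the durations listed above; the names PROTOCOL and label_phase are hypothetical.

```python
# Phase names and durations (seconds) taken from the protocol above
PROTOCOL = [("baseline", 5 * 60), ("stress", 10 * 60), ("recovery", 5 * 60)]

def label_phase(elapsed_sec, protocol=PROTOCOL):
    """Map seconds since session start to a phase name.

    Returns None for samples before the start or after the last phase.
    """
    boundary = 0
    for name, duration in protocol:
        boundary += duration
        if 0 <= elapsed_sec < boundary:
            return name
    return None

labels = [label_phase(t) for t in (0, 299, 300, 899, 900, 1199, 1200)]
```

In practice the phase boundaries would come from logged event markers rather than nominal durations, since transitions between phases rarely take zero time.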

3.2 Data Extraction

ECG ground truth pipeline:

import neurokit2 as nk
import numpy as np

def extract_ecg_hrv(ecg_signal, sampling_rate=1000):
    """
    Extract HRV metrics from raw ECG signal.

    Args:
        ecg_signal: Raw ECG waveform (numpy array)
        sampling_rate: ECG sampling rate in Hz

    Returns:
        dict with HRV metrics per analysis window
    """
    # Clean ECG signal
    ecg_cleaned = nk.ecg_clean(ecg_signal, sampling_rate=sampling_rate)

    # Detect R-peaks
    _, rpeaks = nk.ecg_peaks(ecg_cleaned, sampling_rate=sampling_rate)
    r_peak_indices = rpeaks["ECG_R_Peaks"]

    # Compute RR intervals in milliseconds
    rr_intervals_ms = np.diff(r_peak_indices) / sampling_rate * 1000

    # Time-domain HRV metrics
    hrv_metrics = {
        "mean_rr": np.mean(rr_intervals_ms),
        "sdnn": np.std(rr_intervals_ms, ddof=1),
        "rmssd": np.sqrt(np.mean(np.diff(rr_intervals_ms) ** 2)),
        "pnn50": (
            np.sum(np.abs(np.diff(rr_intervals_ms)) > 50)
            / len(np.diff(rr_intervals_ms))
            * 100
        ),
        "mean_hr": 60000 / np.mean(rr_intervals_ms),
    }

    return hrv_metrics

Smartwatch data pipeline:

from fitparse import FitFile
import pandas as pd

def extract_watch_hr(fit_file_path):
    """
    Extract heart rate time series from Garmin FIT file.

    Returns:
        DataFrame with columns: timestamp (UTC), heart_rate (bpm)
    """
    fitfile = FitFile(fit_file_path)
    records = []

    for record in fitfile.get_messages("record"):
        point = {}
        for field in record:
            if field.name == "timestamp":
                point["timestamp"] = field.value
            elif field.name == "heart_rate":
                point["heart_rate"] = field.value
            elif field.name == "enhanced_speed":
                point["speed"] = field.value
        if "timestamp" in point and "heart_rate" in point:
            records.append(point)

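    df = pd.DataFrame(records)
    if df.empty:
        # No usable records: return an empty, well-formed frame instead of
        # raising a KeyError on the missing timestamp column
        return pd.DataFrame(columns=["timestamp", "heart_rate"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    return df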

3.3 Surrogate Metrics: What Can We Compute from HR Trend Data?

When only HR values at 1–5 second intervals are available (no raw IBI), classical HRV metrics cannot be computed directly. We define surrogate metrics that approximate HRV information from the HR time series:

Surrogate Metric | Definition | HRV Analogue | Rationale
HR_mean | Mean HR over window | Mean HR | Direct measure
HR_std | Standard deviation of HR | Related to SDNN | Captures overall variability
HR_rmssd_proxy | RMSSD of successive HR differences | RMSSD approximation | Captures short-term variability
HR_range | Max - Min HR in window | Related to HRV range | Captures dynamic range
HR_slope | Linear regression slope of HR | Trend direction | Captures sympathetic drift
HR_delta_baseline | HR - personal baseline HR | Reactivity | Captures stress response magnitude
HR_cv | Coefficient of variation (std/mean) | Normalized variability | Accounts for HR level

Implementation:

import numpy as np
import pandas as pd
from scipy import stats

def compute_surrogate_hrv(hr_series, window_sec=60, step_sec=30):
    """
    Compute surrogate HRV metrics from HR trend data.

    Args:
        hr_series: pandas Series with DatetimeIndex and HR values (bpm)
        window_sec: Analysis window length in seconds
        step_sec: Step size for sliding window

    Returns:
        DataFrame with surrogate metrics per window
    """
    results = []

    start = hr_series.index[0]
    end = hr_series.index[-1]
    current = start

    while current + pd.Timedelta(seconds=window_sec) <= end:
        window_end = current + pd.Timedelta(seconds=window_sec)
        window = hr_series[current:window_end]

        if len(window) < 5:  # Minimum samples for meaningful calculation
            current += pd.Timedelta(seconds=step_sec)
            continue

        hr_values = window.values.astype(float)
        successive_diffs = np.diff(hr_values)

        # Time axis for slope calculation (in seconds from window start)
        time_axis = (window.index - window.index[0]).total_seconds()

        metrics = {
            "window_start": current,
            "window_end": window_end,
            "n_samples": len(hr_values),
            "hr_mean": np.mean(hr_values),
            "hr_std": np.std(hr_values, ddof=1) if len(hr_values) > 1 else 0,
            "hr_rmssd_proxy": (
                np.sqrt(np.mean(successive_diffs ** 2))
                if len(successive_diffs) > 0
                else 0
            ),
            "hr_range": np.ptp(hr_values),
            "hr_cv": (
                np.std(hr_values, ddof=1) / np.mean(hr_values)
                if np.mean(hr_values) > 0
                else 0
            ),
            "hr_slope": (
                stats.linregress(time_axis, hr_values).slope
                if len(hr_values) >= 2
                else 0
            ),
        }

        results.append(metrics)
        current += pd.Timedelta(seconds=step_sec)

    return pd.DataFrame(results)
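The function above does not compute HR_delta_baseline, because that metric needs a per-participant reference value. One possible way to add it is sketched below, assuming the personal baseline is the mean of the window-level HR means recorded during the baseline phase; the helper name and the example values are hypothetical.

```python
import numpy as np

def add_delta_baseline(window_hr_means, baseline_hr):
    """Return HR_delta_baseline (bpm) for each analysis window."""
    return np.asarray(window_hr_means, dtype=float) - float(baseline_hr)

# Personal baseline: mean of the baseline-phase window means (assumption)
baseline_hr = float(np.mean([62.0, 64.0, 63.0]))

# Window-level mean HR during later phases (made-up values)
deltas = add_delta_baseline([63.0, 75.0, 81.0], baseline_hr)
```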

3.4 Agreement Analysis

We evaluate agreement between smartwatch-derived and ECG-derived metrics using three complementary approaches:

3.4.1 Pearson and Spearman Correlation

Correlation quantifies the strength of the linear (Pearson) or monotonic (Spearman) relationship between the two measurement methods across participants and conditions.

import pandas as pd
from scipy import stats

def correlation_analysis(ecg_metrics, watch_metrics, metric_pairs):
    """
    Compute correlations between ECG-HRV and smartwatch surrogate metrics.

    Args:
        ecg_metrics: DataFrame with ECG-derived HRV per window
        watch_metrics: DataFrame with smartwatch surrogate metrics per window
        metric_pairs: List of (ecg_col, watch_col) tuples to compare

    Returns:
        DataFrame with correlation results
    """
    results = []

    for ecg_col, watch_col in metric_pairs:
        ecg_vals = ecg_metrics[ecg_col].dropna()
        watch_vals = watch_metrics[watch_col].dropna()

        # Align by index
        common = ecg_vals.index.intersection(watch_vals.index)
        ecg_aligned = ecg_vals.loc[common]
        watch_aligned = watch_vals.loc[common]

        r_pearson, p_pearson = stats.pearsonr(ecg_aligned, watch_aligned)
        r_spearman, p_spearman = stats.spearmanr(ecg_aligned, watch_aligned)

        results.append({
            "ecg_metric": ecg_col,
            "watch_metric": watch_col,
            "n": len(common),
            "pearson_r": r_pearson,
            "pearson_p": p_pearson,
            "spearman_rho": r_spearman,
            "spearman_p": p_spearman,
        })

    return pd.DataFrame(results)

3.4.2 Bland-Altman Analysis

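Bland-Altman plots reveal systematic bias and limits of agreement between the two methods, which correlation alone cannot capture. A high correlation does not guarantee interchangeability: two methods can correlate strongly while differing by a constant offset or a scale factor. Bland-Altman analysis quantifies that disagreement directly, so the limits of agreement can be judged against a tolerance defined in advance (Bland & Altman, 1986).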

import matplotlib.pyplot as plt
import numpy as np

def bland_altman_plot(ecg_values, watch_values, metric_name, ax=None):
    """
    Generate Bland-Altman plot for method comparison.

    Args:
        ecg_values: Array of ECG-derived metric values
        watch_values: Array of smartwatch-derived metric values
        metric_name: Label for the metric being compared
        ax: Optional matplotlib axes object
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))

    mean_vals = (ecg_values + watch_values) / 2
    diff_vals = ecg_values - watch_values

    mean_diff = np.mean(diff_vals)
    std_diff = np.std(diff_vals, ddof=1)

    upper_loa = mean_diff + 1.96 * std_diff
    lower_loa = mean_diff - 1.96 * std_diff

    ax.scatter(mean_vals, diff_vals, alpha=0.5, edgecolors="k", linewidth=0.5)
    ax.axhline(mean_diff, color="red", linestyle="-", label=f"Bias: {mean_diff:.2f}")
    ax.axhline(upper_loa, color="gray", linestyle="--", label=f"+1.96 SD: {upper_loa:.2f}")
    ax.axhline(lower_loa, color="gray", linestyle="--", label=f"-1.96 SD: {lower_loa:.2f}")

    ax.set_xlabel(f"Mean of ECG and Smartwatch ({metric_name})")
    ax.set_ylabel(f"Difference: ECG - Smartwatch ({metric_name})")
    ax.set_title(f"Bland-Altman: {metric_name}")
    ax.legend(loc="upper right")

    return {"bias": mean_diff, "upper_loa": upper_loa, "lower_loa": lower_loa}
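The bias and limits of agreement are themselves estimates, so their uncertainty can be reported alongside them. The sketch below uses the approximate standard errors given by Bland & Altman (1986): s/sqrt(n) for the bias and sqrt(3s²/n) for each limit. It assumes a normal quantile of 1.96; with few paired windows a t quantile would be more appropriate. The function name is hypothetical.

```python
import numpy as np

def bland_altman_ci(ecg_values, watch_values, z=1.96):
    """Approximate 95% confidence intervals for bias and limits of agreement."""
    diff = np.asarray(ecg_values, dtype=float) - np.asarray(watch_values, dtype=float)
    n = diff.size
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))

    se_bias = sd / np.sqrt(n)            # standard error of the mean difference
    se_loa = np.sqrt(3.0 * sd ** 2 / n)  # approximate standard error of each limit

    lower, upper = bias - z * sd, bias + z * sd
    return {
        "bias": bias,
        "bias_ci": (bias - z * se_bias, bias + z * se_bias),
        "lower_loa": lower,
        "lower_loa_ci": (lower - z * se_loa, lower + z * se_loa),
        "upper_loa": upper,
        "upper_loa_ci": (upper - z * se_loa, upper + z * se_loa),
    }

# Made-up mean HR values (bpm) for four paired windows
ba = bland_altman_ci([60, 62, 64, 66], [59, 62, 65, 65])
```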

3.4.3 Intraclass Correlation Coefficient (ICC)

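ICC evaluates absolute agreement between the two methods, accounting for both systematic and random differences. We use ICC(2,1): two-way random effects, single measures, absolute agreement. As with Bland-Altman analysis, absolute agreement is only meaningful when both methods report the same quantity in the same units (for example, mean HR in bpm); it does not apply to surrogate pairs such as HR_std (bpm) vs. SDNN (ms).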
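A minimal sketch of ICC(2,1) computed from the two-way ANOVA mean squares (Shrout and Fleiss formulation) is given below. It assumes a complete matrix with one row per analysis window and one column per method, and it returns the point estimate only, without confidence intervals. The function name and example values are hypothetical.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single measures, absolute agreement.

    ratings: array of shape (n_targets, k_methods) with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares for targets (rows), methods (columns), and residual
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

ecg_hr = np.array([60.0, 70.0, 80.0, 90.0])
icc_identical = icc_2_1(np.column_stack([ecg_hr, ecg_hr]))
icc_offset = icc_2_1(np.column_stack([ecg_hr, ecg_hr + 5.0]))
```

The second call shows why absolute agreement differs from correlation: a constant 5 bpm offset leaves the Pearson correlation at 1 but lowers ICC(2,1).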


4. Expected Results and Interpretation Framework

4.1 Metric-Level Agreement

Based on prior literature comparing wrist PPG to ECG (Plews et al., 2017; Georgiou et al., 2018; Nelson & Allen, 2019), we anticipate the following agreement levels:

Comparison | Expected Correlation (r) | Condition | Interpretation
Mean HR (watch) vs. Mean HR (ECG) | 0.95–0.99 | Rest | Excellent; HR estimation is mature
Mean HR (watch) vs. Mean HR (ECG) | 0.85–0.95 | Low motion | Good; minor PPG artifact
HR_std vs. SDNN | 0.60–0.80 | Rest | Moderate; information loss from sampling
HR_rmssd_proxy vs. RMSSD | 0.40–0.70 | Rest | Fair; fundamentally different temporal resolution
HR_rmssd_proxy vs. RMSSD | 0.20–0.50 | Motion | Poor; PPG artifacts dominate

4.2 Segment-Level Discrimination

For stress detection, absolute metric agreement matters less than the ability to discriminate between experimental phases (baseline vs. stress vs. recovery). We evaluate this using:

  • Within-subject effect sizes (Cohen’s d) for metric changes between phases
  • Classification AUC using leave-one-subject-out cross-validation
  • Sensitivity and specificity for binary stress detection (baseline vs. stress)

Target performance (thesis success criteria):

  • Segment detection AUC >= 0.70
  • Meaningful correlation with ECG-HRV change across segments (r > 0.5)
  • Improvement when motion gating and personal baseline calibration are applied

4.3 Interpretation of Discrepancies

When smartwatch metrics diverge from ECG-HRV, three sources of discrepancy must be distinguished:

  1. Measurement error: PPG signal quality issues (motion, poor contact, ambient light) that corrupt the HR estimate itself
  2. Temporal resolution loss: True physiological variability that exists at the beat-to-beat level but is invisible at 1-second sampling
  3. Physiological decoupling: Genuine differences between peripheral pulse wave (PPG) and cardiac electrical activity (ECG) due to pulse transit time variability, vascular compliance, and hemodynamic factors

Understanding which source dominates in a given context is essential for interpreting validation results and setting realistic expectations.


5. Motion Artifacts: The Primary Challenge

5.1 Why Motion Matters

Wrist-based PPG is notoriously susceptible to motion artifacts. Physical movement displaces the sensor, changes tissue-sensor coupling, and introduces pressure fluctuations that overwhelm the cardiac pulsatile signal. During even mild hand movement (typing, gesturing), PPG-derived HR can exhibit transient errors of 10–30 bpm.

For stress detection, motion artifacts are particularly problematic because:

  • Cognitive stress tasks often involve motor responses (keyboard, joystick, mouse)
  • Emotional arousal can increase fidgeting and restlessness
  • Motion-induced HR artifacts can mimic or mask genuine stress-related HR changes

5.2 Motion Gating Strategy

We implement an accelerometer-based motion gating approach using the smartwatch’s built-in inertial sensors:

import numpy as np
import pandas as pd

def apply_motion_gating(hr_data, accel_data, threshold_mg=50, window_sec=5):
    """
    Gate HR data based on accelerometer magnitude.

    Args:
        hr_data: DataFrame with timestamp and heart_rate columns
        accel_data: DataFrame with timestamp and x, y, z acceleration columns (in milli-g)
        threshold_mg: Motion threshold in milli-g (above = exclude)
        window_sec: Window around each HR sample to check for motion

    Returns:
        DataFrame with added 'motion_flag' and 'hr_gated' columns
    """
    # Compute acceleration magnitude (subtract gravity)
    accel_data = accel_data.copy()
    accel_data["magnitude"] = np.sqrt(
        accel_data["x"] ** 2 + accel_data["y"] ** 2 + accel_data["z"] ** 2
    )
    # Remove gravity component (approximate)
    accel_data["magnitude_detrended"] = np.abs(accel_data["magnitude"] - 1000)

    hr_data = hr_data.copy()
    hr_data["motion_flag"] = False

    for idx, row in hr_data.iterrows():
        t = row["timestamp"]
        window_start = t - pd.Timedelta(seconds=window_sec / 2)
        window_end = t + pd.Timedelta(seconds=window_sec / 2)

        accel_window = accel_data[
            (accel_data["timestamp"] >= window_start)
            & (accel_data["timestamp"] <= window_end)
        ]

        if len(accel_window) > 0:
            mean_motion = accel_window["magnitude_detrended"].mean()
            hr_data.at[idx, "motion_flag"] = mean_motion > threshold_mg

    # Gated HR: NaN where motion detected
    hr_data["hr_gated"] = hr_data["heart_rate"].where(~hr_data["motion_flag"])

    return hr_data

5.3 Motion-Aware Analysis Windows

Rather than discarding entire experimental phases when motion is detected, we adopt a tiered approach:

Motion Level | Accelerometer Magnitude | Strategy
Stillness | < 20 mg | Full analysis; all metrics valid
Low motion | 20–50 mg | HR trend analysis; variability metrics flagged
Moderate motion | 50–200 mg | Only mean HR retained; variability excluded
High motion | > 200 mg | Data excluded from analysis

This tiered approach preserves maximum data while maintaining quality. The motion thresholds should be calibrated during the pilot study phase.
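The tiers can be assigned per window with a simple lookup. The sketch below assumes the input is the mean gravity-removed acceleration magnitude in milli-g and uses the provisional thresholds from the table; values that fall exactly on a boundary are assigned to the higher tier, which is an arbitrary choice. The names are hypothetical.

```python
import numpy as np

# Provisional tier boundaries in milli-g (to be calibrated in the pilot study)
TIER_EDGES_MG = [20.0, 50.0, 200.0]
TIER_NAMES = np.array(["stillness", "low", "moderate", "high"])

def classify_motion(mean_magnitude_mg, edges=TIER_EDGES_MG):
    """Map mean motion magnitude per window to a tier name."""
    # np.digitize returns 0 below the first edge and len(edges) above the last
    idx = np.digitize(np.asarray(mean_magnitude_mg, dtype=float), edges)
    return TIER_NAMES[idx].tolist()

tiers = classify_motion([5, 35, 120, 350])
```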


6. Feature Engineering for Stress Classification

6.1 Smartwatch-Only Feature Set

Building on the surrogate metrics from Section 3.3, we define a comprehensive feature set for stress classification using only smartwatch-accessible data:

HR trend features (per analysis window):

  • Statistical: mean, median, std, IQR, skewness, kurtosis
  • Temporal: slope, curvature, delta from personal baseline
  • Variability proxies: RMSSD-proxy, coefficient of variation, range
  • Quality indicators: sample count, motion-gated fraction

Accelerometer features (context):

  • Mean magnitude, standard deviation
  • Stillness ratio (fraction of samples below motion threshold)
  • Activity classification (seated, walking, gesturing)
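The first two accelerometer features can be sketched as follows, assuming three-axis samples in milli-g for a single analysis window and the same approximate gravity removal used in Section 5.2. Activity classification is not covered here. The function name, the output keys, and the 20 mg stillness threshold are hypothetical.

```python
import numpy as np

def accel_context_features(x_mg, y_mg, z_mg, still_threshold_mg=20.0):
    """Summarize motion in one window from 3-axis accelerometer samples (mg)."""
    x = np.asarray(x_mg, dtype=float)
    y = np.asarray(y_mg, dtype=float)
    z = np.asarray(z_mg, dtype=float)

    # Magnitude with an approximate gravity (1000 mg) removal
    dynamic = np.abs(np.sqrt(x ** 2 + y ** 2 + z ** 2) - 1000.0)

    return {
        "accel_mean_mg": float(dynamic.mean()),
        "accel_std_mg": float(dynamic.std(ddof=1)),
        # Fraction of samples below the stillness threshold
        "stillness_ratio": float(np.mean(dynamic < still_threshold_mg)),
    }

features = accel_context_features(
    [0, 0, 0, 0], [0, 0, 0, 0], [1000, 1010, 1100, 1000]
)
```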

6.2 Personal Baseline Calibration

Inter-individual differences in resting HR and HR reactivity are large. A resting HR of 75 bpm may represent elevated stress for one person but a relaxed baseline for another. We normalize features using a personal baseline calibration:

def calibrate_to_baseline(features_df, baseline_window_minutes=5):
    """
    Normalize features relative to each participant's baseline period.

    Args:
        features_df: DataFrame with participant_id, phase, and feature columns
        baseline_window_minutes: Duration of baseline period (currently unused;
            baseline rows are selected via the "phase" column)

    Returns:
        DataFrame with added calibrated feature columns
    """
    calibrated = features_df.copy()

    feature_cols = [c for c in features_df.columns if c.startswith("hr_")]

    for pid in features_df["participant_id"].unique():
        mask = features_df["participant_id"] == pid
        baseline_mask = mask & (features_df["phase"] == "baseline")

        for col in feature_cols:
            baseline_mean = features_df.loc[baseline_mask, col].mean()
            baseline_std = features_df.loc[baseline_mask, col].std()

            if baseline_std > 0:
                # Z-score relative to personal baseline
                calibrated.loc[mask, f"{col}_calibrated"] = (
                    (features_df.loc[mask, col] - baseline_mean) / baseline_std
                )
            else:
                # Delta from baseline mean
                calibrated.loc[mask, f"{col}_calibrated"] = (
                    features_df.loc[mask, col] - baseline_mean
                )

    return calibrated

6.3 Ablation Study Design

To quantify the contribution of each feature group, we plan a systematic ablation:

Model Variant | Features Used | Purpose
HR-only | HR trend features | Baseline smartwatch capability
HR + motion-gated | HR features with motion quality flag | Effect of artifact handling
HR + accel | HR + accelerometer context | Effect of motion context
HR + calibrated | HR + personal baseline normalization | Effect of individual calibration
Full | All features combined | Upper bound of smartwatch approach
ECG-HRV (oracle) | Standard ECG-derived HRV features | Ground truth ceiling
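One way to organize the smartwatch variants is a mapping from variant name to feature groups, so that each model is trained on an explicit column list. The sketch below is illustrative only: the column names are hypothetical placeholders, and the ECG-HRV oracle is omitted because it uses a separate feature table.

```python
# Hypothetical column names grouped by feature family
FEATURE_GROUPS = {
    "hr": ["hr_mean", "hr_std", "hr_rmssd_proxy", "hr_range", "hr_cv", "hr_slope"],
    "motion_quality": ["motion_gated_fraction"],
    "accel": ["accel_mean_mg", "accel_std_mg", "stillness_ratio"],
    "calibrated": ["hr_mean_calibrated", "hr_std_calibrated"],
}

# Smartwatch variants from the ablation table, expressed as group lists
ABLATION_VARIANTS = {
    "HR-only": ["hr"],
    "HR + motion-gated": ["hr", "motion_quality"],
    "HR + accel": ["hr", "accel"],
    "HR + calibrated": ["hr", "calibrated"],
    "Full": ["hr", "motion_quality", "accel", "calibrated"],
}

def columns_for(variant):
    """Return the flat list of feature columns for one ablation variant."""
    columns = []
    for group in ABLATION_VARIANTS[variant]:
        columns.extend(FEATURE_GROUPS[group])
    return columns

full_columns = columns_for("Full")
```

Each variant can then be evaluated with the same cross-validation routine by selecting only its columns from the feature matrix.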

7. Statistical Validation Plan

7.1 Within-Subject Repeated Measures

Each participant completes all three phases (baseline, stress, recovery), enabling within-subject comparisons that control for individual differences:

  • Repeated-measures ANOVA (or Friedman test if non-normal) for phase effects on each metric
  • Post-hoc paired comparisons with Bonferroni correction (baseline vs. stress, stress vs. recovery, baseline vs. recovery)
  • Effect sizes (Cohen’s d for paired samples) to quantify practical significance
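The effect size and the multiple-comparison correction can be sketched as follows. Several definitions of Cohen's d exist for paired data; this example assumes the d_z variant (mean of the paired differences divided by their standard deviation). The Bonferroni step multiplies each p-value by the number of comparisons and caps the result at 1. Function names and example values are hypothetical.

```python
import numpy as np

def cohens_d_paired(condition_a, condition_b):
    """Cohen's d_z for paired samples: mean(b - a) / sd(b - a)."""
    diff = np.asarray(condition_b, dtype=float) - np.asarray(condition_a, dtype=float)
    return float(diff.mean() / diff.std(ddof=1))

def bonferroni(p_values):
    """Bonferroni-adjusted p-values for a family of comparisons."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Made-up per-participant mean HR (bpm) in two phases
baseline_hr = [60, 62, 65, 70]
stress_hr = [66, 69, 70, 78]
d_z = cohens_d_paired(baseline_hr, stress_hr)

# Made-up p-values for the three post-hoc comparisons
adjusted = bonferroni([0.01, 0.02, 0.5])
```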

7.2 Classification Evaluation

For binary (baseline vs. stress) and three-class (baseline vs. stress vs. recovery) classification:

  • Leave-one-subject-out cross-validation (LOSO-CV) to evaluate generalization
  • Metrics: AUC-ROC, F1-score, sensitivity, specificity, balanced accuracy
  • Models: Random Forest and Gradient Boosting (XGBoost) as primary classifiers; logistic regression as interpretable baseline
  • Feature importance: SHAP values to identify which smartwatch features contribute most

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np

def loso_cv_evaluation(X, y, groups, classifier=None):
    """
    Leave-one-subject-out cross-validation for stress classification.

    Args:
        X: Feature matrix (n_windows x n_features)
        y: Labels (0=baseline, 1=stress)
        groups: Participant IDs for each window
        classifier: sklearn classifier (default: RandomForest)

    Returns:
        dict with per-fold and aggregate performance metrics
    """
    if classifier is None:
        classifier = RandomForestClassifier(
            n_estimators=200, max_depth=10, random_state=42
        )

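    # Coerce inputs to NumPy arrays so the positional indexing below also
    # works when pandas objects are passed in
    X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)

    logo = LeaveOneGroupOut()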
    fold_results = []
    all_y_true = []
    all_y_prob = []

    for train_idx, test_idx in logo.split(X, y, groups):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        classifier.fit(X_train, y_train)
        y_prob = classifier.predict_proba(X_test)[:, 1]

        fold_auc = roc_auc_score(y_test, y_prob)
        fold_results.append({
            "subject": groups[test_idx[0]],
            "auc": fold_auc,
            "n_test": len(test_idx),
        })

        all_y_true.extend(y_test)
        all_y_prob.extend(y_prob)

    overall_auc = roc_auc_score(all_y_true, all_y_prob)

    return {
        "overall_auc": overall_auc,
        "fold_results": fold_results,
        "mean_fold_auc": np.mean([f["auc"] for f in fold_results]),
        "std_fold_auc": np.std([f["auc"] for f in fold_results]),
    }
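The threshold-dependent metrics listed above can be derived from the pooled out-of-fold predictions. A minimal sketch follows, assuming a fixed decision threshold of 0.5 (in practice the threshold would be chosen on training data only); the function name and example values are hypothetical.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, and balanced accuracy at a fixed threshold."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob, dtype=float) >= threshold).astype(int)

    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))

    # Guard against folds or datasets that lack one of the classes
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Made-up labels (0=baseline, 1=stress) and predicted stress probabilities
metrics = binary_metrics([0, 0, 1, 1, 1], [0.2, 0.6, 0.7, 0.4, 0.9])
```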

7.3 Agreement Reporting Standards

Following recommended reporting guidelines for method comparison studies (Giavarina, 2015):

  • Report both Pearson correlation and Bland-Altman limits of agreement
  • Include ICC with 95% confidence intervals
  • State the clinical/research context for interpreting limits of agreement
  • Report results separately for low-motion and all-motion conditions
  • Present per-participant and aggregate statistics

8. Discussion

8.1 What Smartwatches Can and Cannot Do

Based on the validation framework and prior literature, a realistic assessment:

Smartwatches CAN reliably:

  • Track mean HR changes across experimental phases at rest
  • Detect sustained HR elevation during cognitive stress tasks
  • Capture the overall trajectory (baseline -> stress -> recovery)
  • Provide continuous monitoring without participant burden
  • Enable longitudinal tracking of stress reactivity patterns

Smartwatches CANNOT reliably:

  • Replicate beat-to-beat HRV metrics (RMSSD, pNN50) with clinical precision
  • Detect brief autonomic events (< 30 seconds) from HR trend alone
  • Provide valid variability data during physical movement
  • Replace ECG for clinical or diagnostic purposes
  • Capture frequency-domain HRV (LF/HF ratio requires IBI data)

8.2 Practical Implications for Researchers

When a smartwatch is sufficient:

  • Large-scale screening studies where individual precision is less critical
  • Longitudinal tracking of stress patterns over days/weeks
  • Ecological momentary assessment in daily life
  • Studies where participant compliance with ECG is infeasible
  • Proof-of-concept investigations prior to clinical-grade studies

When ECG remains necessary:

  • Clinical diagnosis or intervention studies
  • Research requiring frequency-domain HRV analysis
  • Studies of rapid autonomic transitions (< 1 minute)
  • Populations with cardiac arrhythmias or vascular conditions
  • Regulatory or certification contexts

8.3 The Case for Trend Analysis Over Beat-to-Beat Metrics

A key insight from this validation work is that the smartwatch approach is better understood as trend analysis rather than HRV approximation. Rather than attempting to reconstruct beat-to-beat variability from 1 Hz HR data (a fundamentally lossy operation), we should leverage what smartwatches do well: capturing sustained changes in heart rate over minutes to hours.

This reframing shifts the validation question from “How closely does PRV match HRV?” to “Can HR trends detect the same stress events that HRV detects?” The latter question is both more tractable and more relevant for consumer health applications.

8.4 Limitations

  1. Device specificity: Results may vary across smartwatch models and firmware versions; the Garmin Elevate Gen 5 sensor used here represents current high-end consumer PPG.
  2. Population: Validation in healthy young adults (university students) may not generalize to clinical populations, older adults, or those with darker skin tones (PPG sensitivity varies with melanin content).
  3. Protocol constraints: Controlled laboratory conditions (seated, low noise) represent best-case scenarios; daily-life performance will be lower.
  4. Proprietary algorithms: Garmin’s internal HR processing (filtering, artifact rejection) is a black box; we validate the output, not the algorithm.
  5. Sample size: The thesis targets 20–40 participants, which provides adequate power for within-subject analyses but limits generalizability claims.

9. Conclusion

This article presents a systematic framework for validating consumer smartwatch heart rate data against ECG-derived HRV for stress detection. The core finding is nuanced: smartwatches cannot replace ECG for classical HRV analysis, but they can detect meaningful stress-related cardiac changes when appropriate surrogate metrics, motion gating, and personal calibration are applied.

For the StressSmartWatch thesis project, this validation framework serves as the scientific backbone linking the technical infrastructure (web application, Garmin Connect IQ app, synchronization pipeline) to defensible research conclusions. By establishing clear agreement boundaries, we can make honest claims about what smartwatch-based stress estimation achieves and where it falls short.

The broader significance extends beyond this thesis: as consumer wearables become ubiquitous, understanding their validity envelope for health-relevant measurements is essential. Not every application requires clinical-grade HRV. For many use cases — personal stress awareness, workplace wellness programs, longitudinal research — the precision afforded by a well-validated smartwatch approach may be not only sufficient but preferable, given its unmatched scalability and ecological validity.


References

  1. Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307-310.

  2. Georgiou, K., Larentzakis, A. V., Khamis, N. N., Alsuhaibani, G. I., Alaska, Y. A., & Giallafos, E. J. (2018). Can wearable devices accurately measure heart rate variability? A systematic review. Folia Medica, 60(1), 7-20.

  3. Giavarina, D. (2015). Understanding Bland Altman analysis. Biochemia Medica, 25(2), 141-151.

  4. McEwen, B. S. (2007). Physiology and neurobiology of stress and adaptation: Central role of the brain. Physiological Reviews, 87(3), 873-904.

  5. Nelson, B. W., & Allen, N. B. (2019). Accuracy of consumer wearable heart rate measurement during an ecologically valid 24-hour period: Intraindividual validation study. JMIR mHealth and uHealth, 7(3), e10828.

  6. Pinto, G., Carvalho, J. M., Barros, F., Soares, S. C., Pinho, A. J., & Brás, S. (2020). Multimodal emotion evaluation: A physiological model for cost-effective emotion classification. Sensors, 20(12), 3510.

  7. Plews, D. J., Scott, B., Altini, M., Wood, M., Kilding, A. E., & Laursen, P. B. (2017). Comparison of heart-rate-variability recording with smartphone photoplethysmography, Polar H7 chest strap, and electrocardiography. International Journal of Sports Physiology and Performance, 12(10), 1324-1328.

  8. Schäfer, A., & Vagedes, J. (2013). How accurate is pulse rate variability as an estimate of heart rate variability? A review on studies comparing photoplethysmographic technology with an electrocardiogram. International Journal of Cardiology, 166(1), 15-29.

  9. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. (1996). Heart rate variability: Standards of measurement, physiological interpretation, and clinical use. Circulation, 93(5), 1043-1065.


Appendix: Quick-Start Validation Checklist

A.1 Before Data Collection

  • Confirm target Garmin model exposes HR at >= 1 Hz via FIT export
  • Test BBI/IBI availability on target device (model- and firmware-dependent)
  • Select and validate ECG reference device (Polar H10 recommended)
  • Synchronize all device clocks to NTP before each session
  • Calibrate accelerometer motion thresholds during pilot sessions
  • Define and document baseline period (minimum 5 minutes seated rest)

A.2 During Analysis

  • Align timestamps between smartwatch and ECG using event anchors
  • Apply motion gating before computing variability metrics
  • Compute both raw and baseline-calibrated features
  • Report agreement metrics (correlation, Bland-Altman, ICC) per condition
  • Run LOSO-CV classification and report AUC with confidence intervals
  • Perform ablation study across feature groups
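When event anchors are missing or imprecise, a residual clock offset between the two devices can be estimated from the signals themselves. The sketch below assumes both HR series have already been resampled to a common 1 Hz grid and finds the lag that maximizes their cross-correlation. It is a complement to event-based alignment, not a replacement; the function name, the 30-sample search range, and the synthetic data are hypothetical.

```python
import numpy as np

def estimate_lag_samples(reference_hr, watch_hr, max_lag=30):
    """Estimate how many samples the watch series lags the reference.

    A positive result means the watch series is delayed.
    """
    ref = np.asarray(reference_hr, dtype=float)
    watch = np.asarray(watch_hr, dtype=float)
    ref = ref - ref.mean()
    watch = watch - watch.mean()

    # Full cross-correlation; index i corresponds to lag i - (len(ref) - 1)
    xcorr = np.correlate(watch, ref, mode="full")
    lags = np.arange(-(ref.size - 1), watch.size)

    # Restrict the search to plausible clock offsets
    keep = np.abs(lags) <= max_lag
    return int(lags[keep][np.argmax(xcorr[keep])])

# Synthetic check: the watch series is the reference delayed by 4 samples
rng = np.random.default_rng(1)
reference = 70 + rng.normal(0, 3, 300)
watch = np.roll(reference, 4)
lag = estimate_lag_samples(reference, watch)
```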

A.3 When Reporting Results

  • Clearly state which smartwatch model and firmware version were used
  • Report motion gating thresholds and data exclusion rates
  • Present results separately for rest and motion conditions
  • Acknowledge the fundamental temporal resolution limitation
  • Avoid claiming “HRV measurement” when reporting HR-derived surrogates
  • Discuss clinical vs. research relevance of observed accuracy levels