Validation of Consumer Smartwatch Heart Rate for Stress Detection: A Comparison with ECG-Derived HRV
Abstract
Heart rate variability (HRV) derived from electrocardiography (ECG) is the established ground truth for physiological stress assessment, yet ECG requires electrodes, leads, and a largely stationary setup that limit ecological validity and participant comfort. Consumer smartwatches equipped with photoplethysmography (PPG) sensors offer a less obtrusive alternative, providing continuous heart rate (HR) measurements during daily activities. However, smartwatches do not provide continuous ECG, raising a fundamental question: can stress-relevant information be extracted from PPG-based heart rate alone? This article presents a validation framework comparing consumer smartwatch HR data against ECG-derived HRV for stress detection. We examine the physiological basis of both measurement modalities, relate pulse rate variability (PRV) to HRV, define surrogate metrics computable from low-frequency HR samples, evaluate agreement using correlation, Bland-Altman analysis, and the intraclass correlation coefficient, and discuss practical strategies for motion artifact mitigation. Based on prior literature, we expect that smartwatch-derived metrics cannot replicate beat-to-beat HRV fidelity but can detect meaningful stress-related changes in controlled settings when appropriate preprocessing and motion gating are applied; the target is a segment-level classification AUC of at least 0.70.
1. Introduction
1.1 The Promise of Wearable Stress Monitoring
Chronic stress is a major public health concern linked to cardiovascular disease, immune dysfunction, and mental health disorders (McEwen, 2007). Objective, continuous stress monitoring could enable timely interventions, yet traditional assessment methods rely on either subjective self-report or laboratory-grade physiological recording equipment. The proliferation of consumer smartwatches — over 200 million units shipped globally in 2024 — presents an unprecedented opportunity to bridge this gap. Devices from Garmin, Apple, Samsung, and others continuously record heart rate via wrist-mounted optical sensors, and increasingly offer proprietary “stress scores” to end users.
1.2 The Ground Truth Problem
Heart rate variability (HRV), the variation in time intervals between successive heartbeats, is one of the most widely validated biomarkers for autonomic nervous system (ANS) activity and stress (Task Force, 1996). HRV analysis requires precise detection of R-peaks in an ECG signal, from which inter-beat intervals (IBIs, also called RR intervals) are computed. Time-domain metrics such as RMSSD (root mean square of successive differences) and SDNN (standard deviation of NN intervals), as well as frequency-domain metrics like the LF/HF ratio, serve as indices of autonomic regulation: RMSSD mainly reflects vagally mediated short-term variability, SDNN reflects overall variability, and the LF/HF ratio is commonly, though not uncontroversially, interpreted as sympathovagal balance.
The challenge is straightforward: off-the-shelf consumer smartwatches do not provide continuous ECG recording. Instead, they use photoplethysmography (PPG) to estimate heart rate at intervals of 1–5 seconds. This sampling regime is orders of magnitude coarser than the millisecond-resolution RR intervals needed for classical HRV analysis.
1.3 Research Questions
This validation study addresses three core questions:
- Agreement: How closely do smartwatch-derived pulse rate variability (PRV) metrics approximate ECG-derived HRV metrics during controlled stress protocols?
- Discrimination: Can smartwatch HR features distinguish between baseline, stress, and recovery states with sufficient accuracy for research and applied use?
- Robustness: Under what conditions (rest vs. motion, controlled vs. daily life) does the smartwatch approach retain validity?
1.4 Scope and Contribution
This work is part of a master’s thesis project at IEETA (Institute of Electronics and Informatics Engineering of Aveiro) focused on estimating stress levels using only smartwatches (Fernandes et al., project brief). We build upon prior emotion studies using ECG-based HRV (Pinto et al., 2020) and extend the analysis to consumer-grade wearable data. Our contributions include:
- A systematic comparison framework for smartwatch HR vs. ECG-HRV
- Surrogate PRV metrics designed for low-frequency HR sampling
- Motion gating strategies using accelerometer data
- Practical guidelines for researchers adopting smartwatch-based stress assessment
2. Background: ECG, PPG, and What Smartwatches Actually Measure
2.1 Electrocardiography (ECG): The Gold Standard
An ECG records the electrical activity of the heart via electrodes placed on the skin. The QRS complex — and specifically the R-peak — marks ventricular depolarization. The time between consecutive R-peaks (RR interval) is the fundamental unit of HRV analysis.
Key properties of ECG for HRV:
- Temporal resolution: ~1 ms (1000 Hz typical sampling)
- Signal origin: Electrical (cardiac conduction system)
- Artifact sources: Electrode displacement, muscle noise, powerline interference
- Setup burden: Electrodes, leads, conductive gel, stationary or semi-stationary recording
2.2 Photoplethysmography (PPG): What the Smartwatch Sees
PPG sensors emit green LED light into the skin and measure reflected light intensity. Blood volume changes in the microvasculature modulate light absorption, producing a pulsatile waveform. Each pulse corresponds (approximately) to a heartbeat.
Key properties of PPG for HR estimation:
- Temporal resolution: Varies by device; raw PPG at 25–100 Hz, but reported HR typically at 1 Hz or lower
- Signal origin: Optical (peripheral blood volume changes)
- Artifact sources: Motion (dominant), skin tone, sensor placement, ambient light, sweat
- Setup burden: Minimal (wear on wrist)
2.3 From PPG to Pulse Rate Variability (PRV)
When PPG is sampled at sufficiently high frequency and individual pulse peaks are detected, the inter-pulse intervals (IPIs) can serve as surrogates for RR intervals. This yields Pulse Rate Variability (PRV) — the optical analogue of HRV.
Prior research indicates that PRV and HRV agree strongly at rest (r > 0.95 for RMSSD) but diverge during movement and hemodynamic stress (Schäfer & Vagedes, 2013; Georgiou et al., 2018). The agreement depends critically on:
- Sensor quality: Higher-end optical sensors with multiple wavelengths perform better
- Body site: Wrist PPG is more motion-sensitive than finger or ear PPG
- Activity level: Agreement degrades substantially during physical activity
- Algorithm quality: Peak detection and artifact rejection vary by manufacturer
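As a minimal sketch of the PPG-to-PRV step described above, the following derives inter-pulse intervals from a pulse-like waveform. The synthetic sine signal, the function name `ppg_to_ipi_ms`, and the 0.4 s minimum beat spacing are illustrative assumptions, not part of the study pipeline; real wrist PPG would need band-pass filtering and artifact rejection before peak detection.

```python
import numpy as np
from scipy.signal import find_peaks


def ppg_to_ipi_ms(ppg, sampling_rate, min_beat_sec=0.4):
    """Return inter-pulse intervals (ms) from a PPG-like waveform."""
    # Enforce a minimum spacing so that each pulse yields a single peak
    min_distance = max(1, int(min_beat_sec * sampling_rate))
    peaks, _ = find_peaks(ppg, distance=min_distance)
    return np.diff(peaks) / sampling_rate * 1000.0


# Synthetic stand-in for PPG: a 1.2 Hz (72 bpm) sine sampled at 100 Hz for 10 s
fs = 100
t = np.arange(10 * fs) / fs
synthetic_ppg = np.sin(2 * np.pi * 1.2 * t)

ipi_ms = ppg_to_ipi_ms(synthetic_ppg, fs)
prv_rmssd = np.sqrt(np.mean(np.diff(ipi_ms) ** 2))
print(f"mean IPI = {ipi_ms.mean():.1f} ms, PRV RMSSD = {prv_rmssd:.1f} ms")
```

Even for this perfectly regular signal the RMSSD is not zero, because peak times are quantized to the 10 ms sample grid. This is one reason why sensor sampling rate matters for PRV.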
2.4 The Consumer Smartwatch Reality
Most consumer smartwatches (Garmin Forerunner, Fenix, Venu series; Apple Watch; Samsung Galaxy Watch) do not expose raw PPG waveforms or individual pulse intervals to third-party applications. Instead, they provide:
| Data Available | Typical Resolution | Access Method |
|---|---|---|
| Heart rate (HR) | 1 value per 1–5 seconds | Connect IQ API / Health APIs |
| Beat-to-beat intervals (BBI/IBI) | Available on select models | FIT file export, device-dependent |
| Stress score | Proprietary, ~3 min intervals | Garmin Connect export |
| Accelerometer | 25–100 Hz | Connect IQ Sensor API |
This means that for many devices, researchers work with HR trend data (a time series of instantaneous HR values at 1–5 second intervals), not raw IBI data. This fundamentally constrains which HRV metrics can be computed.
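The constraint can be illustrated with a small simulation. The sketch below assumes a device that reports, once per second, the HR averaged over the preceding 5 seconds; actual firmware smoothing is proprietary, so the window length, the simulated RR statistics, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated beat-to-beat RR intervals (ms): 800 ms mean, 50 ms SD of jitter.
# These values are illustrative, not measured data.
rr_ms = 800 + rng.normal(0, 50, size=400)
beat_times_s = np.cumsum(rr_ms) / 1000.0

true_rmssd = np.sqrt(np.mean(np.diff(rr_ms) ** 2))

# Assumed device behaviour: every second, report the HR implied by the
# mean RR interval of the beats that occurred in the preceding 5 s.
report_times = np.arange(5, int(beat_times_s[-1]))
hr_trend = np.array([
    60000.0 / rr_ms[(beat_times_s > t - 5) & (beat_times_s <= t)].mean()
    for t in report_times
])

# Convert the HR trend back to pseudo-intervals and apply the RMSSD formula
pseudo_rr_ms = 60000.0 / hr_trend
proxy_rmssd = np.sqrt(np.mean(np.diff(pseudo_rr_ms) ** 2))

print(f"beat-to-beat RMSSD:       {true_rmssd:.1f} ms")
print(f"RMSSD from 1 Hz HR trend: {proxy_rmssd:.1f} ms")
```

Under these assumptions the trend-based value is far smaller than the beat-to-beat value, because averaging removes most of the short-term variability that RMSSD is designed to capture.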
3. Methodology: Validation Framework
3.1 Experimental Protocol
Our validation framework uses a controlled stress induction protocol aligned with the broader thesis experiment design:
Phase 1: Baseline (5 min) - Seated rest, neutral stimulus
Phase 2: Stress Induction (10 min) - Cognitive task (joystick) or emotional video
Phase 3: Recovery (5 min) - Seated rest, calming stimulus
Participants wear both:
- Garmin smartwatch (Forerunner 965, Fenix 7, or Venu 3 with Elevate Gen 5 sensor) on the non-dominant wrist
- ECG chest strap (Polar H10 or equivalent research-grade device) as ground truth
Both devices record simultaneously during all three phases. Self-report measures (SAM, Likert stress scales) are collected after each phase.
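For downstream analysis, each sample needs a phase label. A minimal sketch, assuming the 5/10/5 minute protocol above and a known session start time; the function name `label_phases`, the example date, and the choice to leave out-of-session samples unlabeled are illustrative assumptions.

```python
import pandas as pd

# Phase boundaries in seconds from session start (5, 10, and 5 minutes)
PHASE_EDGES_SEC = [0, 300, 900, 1200]
PHASE_LABELS = ["baseline", "stress", "recovery"]


def label_phases(timestamps, session_start):
    """Map timestamps to protocol phases; samples outside the session get NaN."""
    elapsed = (
        pd.Series(pd.to_datetime(timestamps)) - session_start
    ).dt.total_seconds()
    # right=False makes each phase include its start and exclude its end
    return pd.cut(elapsed, bins=PHASE_EDGES_SEC, labels=PHASE_LABELS, right=False)


# Hypothetical session start; samples at 0, 5, 10, 15, and 20 minutes
start = pd.Timestamp("2025-01-01 10:00:00")
samples = pd.date_range(start, periods=5, freq="5min")
phases = label_phases(samples, start)
print(phases.tolist())
```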
3.2 Data Extraction
ECG ground truth pipeline:
import neurokit2 as nk
import numpy as np


def extract_ecg_hrv(ecg_signal, sampling_rate=1000):
    """
    Extract HRV metrics from raw ECG signal.

    Args:
        ecg_signal: Raw ECG waveform (numpy array)
        sampling_rate: ECG sampling rate in Hz

    Returns:
        dict with HRV metrics per analysis window
    """
    # Clean ECG signal
    ecg_cleaned = nk.ecg_clean(ecg_signal, sampling_rate=sampling_rate)

    # Detect R-peaks
    _, rpeaks = nk.ecg_peaks(ecg_cleaned, sampling_rate=sampling_rate)
    r_peak_indices = rpeaks["ECG_R_Peaks"]

    # Compute RR intervals in milliseconds
    rr_intervals_ms = np.diff(r_peak_indices) / sampling_rate * 1000

    # Time-domain HRV metrics
    hrv_metrics = {
        "mean_rr": np.mean(rr_intervals_ms),
        "sdnn": np.std(rr_intervals_ms, ddof=1),
        "rmssd": np.sqrt(np.mean(np.diff(rr_intervals_ms) ** 2)),
        "pnn50": (
            np.sum(np.abs(np.diff(rr_intervals_ms)) > 50)
            / len(np.diff(rr_intervals_ms))
            * 100
        ),
        "mean_hr": 60000 / np.mean(rr_intervals_ms),
    }
    return hrv_metrics
Smartwatch data pipeline:
from fitparse import FitFile
import pandas as pd


def extract_watch_hr(fit_file_path):
    """
    Extract heart rate time series from Garmin FIT file.

    Returns:
        DataFrame with columns: timestamp (UTC), heart_rate (bpm)
    """
    fitfile = FitFile(fit_file_path)
    records = []
    for record in fitfile.get_messages("record"):
        point = {}
        for field in record:
            if field.name == "timestamp":
                point["timestamp"] = field.value
            elif field.name == "heart_rate":
                point["heart_rate"] = field.value
            elif field.name == "enhanced_speed":
                point["speed"] = field.value
        # Keep only records that carry both a timestamp and an HR value
        if "timestamp" in point and "heart_rate" in point:
            records.append(point)
    df = pd.DataFrame(records)
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    return df
3.3 Surrogate Metrics: What Can We Compute from HR Trend Data?
When only HR values at 1–5 second intervals are available (no raw IBI), classical HRV metrics cannot be computed directly. We define surrogate metrics that approximate HRV information from the HR time series:
| Surrogate Metric | Definition | HRV Analogue | Rationale |
|---|---|---|---|
| HR_mean | Mean HR over window | Mean HR | Direct measure |
| HR_std | Standard deviation of HR | Related to SDNN | Captures overall variability |
| HR_rmssd_proxy | RMSSD of successive HR differences | RMSSD approximation | Captures short-term variability |
| HR_range | Max - Min HR in window | Related to HRV range | Captures dynamic range |
| HR_slope | Linear regression slope of HR | Trend direction | Captures sympathetic drift |
| HR_delta_baseline | HR - personal baseline HR | Reactivity | Captures stress response magnitude |
| HR_cv | Coefficient of variation (std/mean) | Normalized variability | Accounts for HR level |
Implementation:
import numpy as np
import pandas as pd
from scipy import stats


def compute_surrogate_hrv(hr_series, window_sec=60, step_sec=30):
    """
    Compute surrogate HRV metrics from HR trend data.

    Args:
        hr_series: pandas Series with DatetimeIndex and HR values (bpm)
        window_sec: Analysis window length in seconds
        step_sec: Step size for sliding window

    Returns:
        DataFrame with surrogate metrics per window
    """
    results = []
    start = hr_series.index[0]
    end = hr_series.index[-1]
    current = start
    while current + pd.Timedelta(seconds=window_sec) <= end:
        window_end = current + pd.Timedelta(seconds=window_sec)
        window = hr_series[current:window_end]
        if len(window) < 5:  # Minimum samples for meaningful calculation
            current += pd.Timedelta(seconds=step_sec)
            continue
        hr_values = window.values.astype(float)
        successive_diffs = np.diff(hr_values)
        # Time axis for slope calculation (in seconds from window start)
        time_axis = np.asarray((window.index - window.index[0]).total_seconds())
        metrics = {
            "window_start": current,
            "window_end": window_end,
            "n_samples": len(hr_values),
            "hr_mean": np.mean(hr_values),
            "hr_std": np.std(hr_values, ddof=1) if len(hr_values) > 1 else 0,
            # RMSSD formula applied to HR samples (bpm), not to RR intervals (ms)
            "hr_rmssd_proxy": (
                np.sqrt(np.mean(successive_diffs ** 2))
                if len(successive_diffs) > 0
                else 0
            ),
            "hr_range": np.ptp(hr_values),
            "hr_cv": (
                np.std(hr_values, ddof=1) / np.mean(hr_values)
                if np.mean(hr_values) > 0
                else 0
            ),
            # Slope in bpm per second
            "hr_slope": (
                stats.linregress(time_axis, hr_values).slope
                if len(hr_values) >= 2
                else 0
            ),
        }
        results.append(metrics)
        current += pd.Timedelta(seconds=step_sec)
    return pd.DataFrame(results)
3.4 Agreement Analysis
We evaluate agreement between smartwatch-derived and ECG-derived metrics using three complementary approaches:
3.4.1 Pearson and Spearman Correlation
Correlation quantifies the strength of the linear (Pearson) or monotonic (Spearman) relationship between the two measurement methods across participants and conditions.
import pandas as pd
from scipy import stats


def correlation_analysis(ecg_metrics, watch_metrics, metric_pairs):
    """
    Compute correlations between ECG-HRV and smartwatch surrogate metrics.

    Args:
        ecg_metrics: DataFrame with ECG-derived HRV per window
        watch_metrics: DataFrame with smartwatch surrogate metrics per window
        metric_pairs: List of (ecg_col, watch_col) tuples to compare

    Returns:
        DataFrame with correlation results
    """
    results = []
    for ecg_col, watch_col in metric_pairs:
        ecg_vals = ecg_metrics[ecg_col].dropna()
        watch_vals = watch_metrics[watch_col].dropna()
        # Align by index (both frames must share the same window index)
        common = ecg_vals.index.intersection(watch_vals.index)
        ecg_aligned = ecg_vals.loc[common]
        watch_aligned = watch_vals.loc[common]
        r_pearson, p_pearson = stats.pearsonr(ecg_aligned, watch_aligned)
        r_spearman, p_spearman = stats.spearmanr(ecg_aligned, watch_aligned)
        results.append({
            "ecg_metric": ecg_col,
            "watch_metric": watch_col,
            "n": len(common),
            "pearson_r": r_pearson,
            "pearson_p": p_pearson,
            "spearman_rho": r_spearman,
            "spearman_p": p_spearman,
        })
    return pd.DataFrame(results)
3.4.2 Bland-Altman Analysis
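Bland-Altman plots reveal systematic bias and limits of agreement between the two methods, which correlation alone cannot capture. A high correlation does not guarantee interchangeability: two methods can correlate strongly while differing by a constant or proportional bias. Bland-Altman analysis quantifies that bias and the limits of agreement, which must then be judged against a predefined acceptable difference (Bland & Altman, 1986).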
import matplotlib.pyplot as plt
import numpy as np


def bland_altman_plot(ecg_values, watch_values, metric_name, ax=None):
    """
    Generate Bland-Altman plot for method comparison.

    Args:
        ecg_values: Array of ECG-derived metric values
        watch_values: Array of smartwatch-derived metric values
        metric_name: Label for the metric being compared
        ax: Optional matplotlib axes object
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))

    # Convert to plain arrays so paired values are matched by position,
    # not by a pandas index
    ecg_values = np.asarray(ecg_values, dtype=float)
    watch_values = np.asarray(watch_values, dtype=float)

    mean_vals = (ecg_values + watch_values) / 2
    diff_vals = ecg_values - watch_values
    mean_diff = np.mean(diff_vals)
    std_diff = np.std(diff_vals, ddof=1)
    upper_loa = mean_diff + 1.96 * std_diff
    lower_loa = mean_diff - 1.96 * std_diff

    ax.scatter(mean_vals, diff_vals, alpha=0.5, edgecolors="k", linewidth=0.5)
    ax.axhline(mean_diff, color="red", linestyle="-", label=f"Bias: {mean_diff:.2f}")
    ax.axhline(upper_loa, color="gray", linestyle="--", label=f"+1.96 SD: {upper_loa:.2f}")
    ax.axhline(lower_loa, color="gray", linestyle="--", label=f"-1.96 SD: {lower_loa:.2f}")
    ax.set_xlabel(f"Mean of ECG and Smartwatch ({metric_name})")
    ax.set_ylabel(f"Difference: ECG - Smartwatch ({metric_name})")
    ax.set_title(f"Bland-Altman: {metric_name}")
    ax.legend(loc="upper right")
    return {"bias": mean_diff, "upper_loa": upper_loa, "lower_loa": lower_loa}
3.4.3 Intraclass Correlation Coefficient (ICC)
ICC evaluates absolute agreement between the two methods, accounting for both systematic and random differences. We use ICC(2,1) — two-way random effects, single measures, absolute agreement.
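A minimal sketch of ICC(2,1) computed from the two-way ANOVA mean squares (Shrout and Fleiss form) is given below. The function name `icc_2_1` and the example values are illustrative assumptions; the sketch returns only the point estimate and does not compute confidence intervals, for which an established statistics package would normally be used.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    ratings: array of shape (n_targets, k_methods), e.g. one row per
    analysis window and one column per device.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    # Sums of squares for targets (rows), methods (columns), and residual
    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((ratings - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative values: the "watch" reads exactly 5 bpm higher than the ECG
ecg_hr = np.array([60.0, 70.0, 80.0, 90.0])
watch_hr = ecg_hr + 5.0

icc_identical = icc_2_1(np.column_stack([ecg_hr, ecg_hr]))
icc_biased = icc_2_1(np.column_stack([ecg_hr, watch_hr]))
print(f"identical: {icc_identical:.3f}, constant +5 bpm bias: {icc_biased:.3f}")
```

In the biased example the Pearson correlation would still be 1.0, whereas the absolute-agreement ICC drops below 1 because the constant offset counts as disagreement.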
4. Expected Results and Interpretation Framework
4.1 Metric-Level Agreement
Based on prior literature comparing wrist PPG to ECG (Plews et al., 2017; Georgiou et al., 2018; Nelson & Allen, 2019), we anticipate the following agreement levels:
| Comparison | Expected Correlation (r) | Condition | Interpretation |
|---|---|---|---|
| Mean HR (watch) vs. Mean HR (ECG) | 0.95–0.99 | Rest | Excellent; HR estimation is mature |
| Mean HR (watch) vs. Mean HR (ECG) | 0.85–0.95 | Low motion | Good; minor PPG artifact |
| HR_std vs. SDNN | 0.60–0.80 | Rest | Moderate; information loss from sampling |
| HR_rmssd_proxy vs. RMSSD | 0.40–0.70 | Rest | Fair; fundamentally different temporal resolution |
| HR_rmssd_proxy vs. RMSSD | 0.20–0.50 | Motion | Poor; PPG artifacts dominate |
4.2 Segment-Level Discrimination
For stress detection, absolute metric agreement matters less than the ability to discriminate between experimental phases (baseline vs. stress vs. recovery). We evaluate this using:
- Within-subject effect sizes (Cohen’s d) for metric changes between phases
- Classification AUC using leave-one-subject-out cross-validation
- Sensitivity and specificity for binary stress detection (baseline vs. stress)
Target performance (thesis success criteria):
- Segment detection AUC >= 0.70
- Meaningful correlation with ECG-HRV change across segments (r > 0.5)
- Improvement when motion gating and personal baseline calibration are applied
4.3 Interpretation of Discrepancies
When smartwatch metrics diverge from ECG-HRV, three sources of discrepancy must be distinguished:
- Measurement error: PPG signal quality issues (motion, poor contact, ambient light) that corrupt the HR estimate itself
- Temporal resolution loss: True physiological variability that exists at the beat-to-beat level but is invisible at 1-second sampling
- Physiological decoupling: Genuine differences between peripheral pulse wave (PPG) and cardiac electrical activity (ECG) due to pulse transit time variability, vascular compliance, and hemodynamic factors
Understanding which source dominates in a given context is essential for interpreting validation results and setting realistic expectations.
5. Motion Artifacts: The Primary Challenge
5.1 Why Motion Matters
Wrist-based PPG is notoriously susceptible to motion artifacts. Physical movement displaces the sensor, changes tissue-sensor coupling, and introduces pressure fluctuations that overwhelm the cardiac pulsatile signal. During even mild hand movement (typing, gesturing), PPG-derived HR can exhibit transient errors of 10–30 bpm.
For stress detection, motion artifacts are particularly problematic because:
- Cognitive stress tasks often involve motor responses (keyboard, joystick, mouse)
- Emotional arousal can increase fidgeting and restlessness
- Motion-induced HR artifacts can mimic or mask genuine stress-related HR changes
5.2 Motion Gating Strategy
We implement an accelerometer-based motion gating approach using the smartwatch’s built-in inertial sensors:
import numpy as np
import pandas as pd


def apply_motion_gating(hr_data, accel_data, threshold_mg=50, window_sec=5):
    """
    Gate HR data based on accelerometer magnitude.

    Args:
        hr_data: DataFrame with timestamp and heart_rate columns
        accel_data: DataFrame with timestamp and x, y, z acceleration columns
            (assumed to be in milli-g, so gravity is about 1000)
        threshold_mg: Motion threshold in milli-g (above = exclude)
        window_sec: Window around each HR sample to check for motion

    Returns:
        DataFrame with added 'motion_flag' and 'hr_gated' columns
    """
    # Compute acceleration magnitude (still includes gravity at this point)
    accel_data = accel_data.copy()
    accel_data["magnitude"] = np.sqrt(
        accel_data["x"] ** 2 + accel_data["y"] ** 2 + accel_data["z"] ** 2
    )
    # Remove gravity component (approximate: subtract 1 g = 1000 mg)
    accel_data["magnitude_detrended"] = np.abs(accel_data["magnitude"] - 1000)

    hr_data = hr_data.copy()
    hr_data["motion_flag"] = False
    for idx, row in hr_data.iterrows():
        t = row["timestamp"]
        window_start = t - pd.Timedelta(seconds=window_sec / 2)
        window_end = t + pd.Timedelta(seconds=window_sec / 2)
        accel_window = accel_data[
            (accel_data["timestamp"] >= window_start)
            & (accel_data["timestamp"] <= window_end)
        ]
        # HR samples without accelerometer coverage remain unflagged
        if len(accel_window) > 0:
            mean_motion = accel_window["magnitude_detrended"].mean()
            hr_data.at[idx, "motion_flag"] = mean_motion > threshold_mg

    # Gated HR: NaN where motion detected
    hr_data["hr_gated"] = hr_data["heart_rate"].where(~hr_data["motion_flag"])
    return hr_data
5.3 Motion-Aware Analysis Windows
Rather than discarding entire experimental phases when motion is detected, we adopt a tiered approach:
| Motion Level | Accelerometer Magnitude | Strategy |
|---|---|---|
| Stillness | < 20 mg | Full analysis; all metrics valid |
| Low motion | 20–50 mg | HR trend analysis; variability metrics flagged |
| Moderate motion | 50–200 mg | Only mean HR retained; variability excluded |
| High motion | > 200 mg | Data excluded from analysis |
This tiered approach preserves maximum data while maintaining quality. The motion thresholds should be calibrated during the pilot study phase.
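The tier assignment could be sketched as follows, using the provisional thresholds from the table. The names `classify_motion` and `TIER_POLICY` are illustrative, and the handling of values that fall exactly on a boundary (assigned to the higher tier here) is an assumption that the table leaves open.

```python
import numpy as np
import pandas as pd

# Tier edges in milli-g, taken from the table above (to be recalibrated in the pilot)
MOTION_EDGES_MG = [0, 20, 50, 200, np.inf]
MOTION_TIERS = ["stillness", "low", "moderate", "high"]

# Analysis policy per tier, summarizing the "Strategy" column
TIER_POLICY = {
    "stillness": "all metrics valid",
    "low": "HR trend analysis; variability metrics flagged",
    "moderate": "mean HR only",
    "high": "exclude window",
}


def classify_motion(mean_magnitude_mg):
    """Assign each window's mean gravity-removed magnitude (mg) to a motion tier."""
    values = pd.Series(mean_magnitude_mg, dtype=float)
    # right=False: a value exactly on an edge goes to the higher tier (assumption)
    return pd.cut(values, bins=MOTION_EDGES_MG, labels=MOTION_TIERS, right=False)


tiers = classify_motion([5, 30, 120, 350])
for tier in tiers:
    print(f"{tier}: {TIER_POLICY[tier]}")
```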
6. Feature Engineering for Stress Classification
6.1 Smartwatch-Only Feature Set
Building on the surrogate metrics from Section 3.3, we define a comprehensive feature set for stress classification using only smartwatch-accessible data:
HR trend features (per analysis window):
- Statistical: mean, median, std, IQR, skewness, kurtosis
- Temporal: slope, curvature, delta from personal baseline
- Variability proxies: RMSSD-proxy, coefficient of variation, range
- Quality indicators: sample count, motion-gated fraction
Accelerometer features (context):
- Mean magnitude, standard deviation
- Stillness ratio (fraction of samples below motion threshold)
- Activity classification (seated, walking, gesturing)
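A minimal sketch of how some of the listed features might be computed for a single analysis window is shown below. The function name `window_features`, the output keys, the 20 mg stillness threshold, and the example values are assumptions for illustration; activity classification is not covered, and the accelerometer magnitudes are assumed to be gravity-removed and in milli-g.

```python
import numpy as np
from scipy import stats


def window_features(hr_values, accel_magnitude_mg, still_threshold_mg=20.0):
    """Compute distribution features of HR and context features of motion."""
    hr = np.asarray(hr_values, dtype=float)
    acc = np.asarray(accel_magnitude_mg, dtype=float)
    q75, q25 = np.percentile(hr, [75, 25])
    return {
        "hr_median": float(np.median(hr)),
        "hr_iqr": float(q75 - q25),
        "hr_skew": float(stats.skew(hr)),
        "hr_kurtosis": float(stats.kurtosis(hr)),  # excess kurtosis (normal = 0)
        "acc_mean": float(acc.mean()),
        "acc_std": float(acc.std(ddof=1)),
        # Fraction of accelerometer samples below the stillness threshold
        "stillness_ratio": float(np.mean(acc < still_threshold_mg)),
    }


# Illustrative window: five HR samples (bpm) and five motion magnitudes (mg)
features = window_features([70, 72, 74, 76, 78], [5, 10, 15, 30, 60])
print(features)
```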
6.2 Personal Baseline Calibration
Inter-individual differences in resting HR and HR reactivity are large. A resting HR of 75 bpm may represent elevated stress for one person but relaxed baseline for another. We normalize features using a personal baseline calibration:
def calibrate_to_baseline(features_df, baseline_window_minutes=5):
    """
    Normalize features relative to each participant's baseline period.

    Args:
        features_df: DataFrame with participant_id, phase, and feature columns
        baseline_window_minutes: Duration of baseline period (documentation
            only; baseline windows are selected via the 'phase' column)

    Returns:
        DataFrame with added calibrated feature columns
    """
    calibrated = features_df.copy()
    feature_cols = [c for c in features_df.columns if c.startswith("hr_")]
    for pid in features_df["participant_id"].unique():
        mask = features_df["participant_id"] == pid
        baseline_mask = mask & (features_df["phase"] == "baseline")
        for col in feature_cols:
            baseline_mean = features_df.loc[baseline_mask, col].mean()
            baseline_std = features_df.loc[baseline_mask, col].std()
            if baseline_std > 0:
                # Z-score relative to personal baseline
                calibrated.loc[mask, f"{col}_calibrated"] = (
                    (features_df.loc[mask, col] - baseline_mean) / baseline_std
                )
            else:
                # Delta from baseline mean; also reached when the baseline has
                # a single window, because its std is NaN and NaN > 0 is False
                calibrated.loc[mask, f"{col}_calibrated"] = (
                    features_df.loc[mask, col] - baseline_mean
                )
    return calibrated
6.3 Ablation Study Design
To quantify the contribution of each feature group, we plan a systematic ablation:
| Model Variant | Features Used | Purpose |
|---|---|---|
| HR-only | HR trend features | Baseline smartwatch capability |
| HR + motion-gated | HR features with motion quality flag | Effect of artifact handling |
| HR + accel | HR + accelerometer context | Effect of motion context |
| HR + calibrated | HR + personal baseline normalization | Effect of individual calibration |
| Full | All features combined | Upper bound of smartwatch approach |
| ECG-HRV (oracle) | Standard ECG-derived HRV features | Ground truth ceiling |
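One way to keep the smartwatch variants reproducible is to define them as named combinations of feature groups. The sketch below is an assumption about how this could be organized: the column names are hypothetical placeholders, and the ECG-HRV oracle is omitted because it uses a separate feature table.

```python
# Hypothetical column names grouped by feature family
FEATURE_GROUPS = {
    "hr": ["hr_mean", "hr_std", "hr_rmssd_proxy", "hr_slope"],
    "motion_quality": ["motion_gated_fraction"],
    "accel": ["acc_mean", "acc_std", "stillness_ratio"],
    "calibrated": ["hr_mean_calibrated", "hr_std_calibrated"],
}

# Each ablation variant is a list of feature groups, mirroring the table above
ABLATION_VARIANTS = {
    "HR-only": ["hr"],
    "HR + motion-gated": ["hr", "motion_quality"],
    "HR + accel": ["hr", "accel"],
    "HR + calibrated": ["hr", "calibrated"],
    "Full": ["hr", "motion_quality", "accel", "calibrated"],
}


def columns_for(variant):
    """Return the flat list of feature columns for one ablation variant."""
    return [
        column
        for group in ABLATION_VARIANTS[variant]
        for column in FEATURE_GROUPS[group]
    ]


for name in ABLATION_VARIANTS:
    print(f"{name}: {len(columns_for(name))} features")
```

Each variant's column list can then be used to subset the feature matrix before running the same cross-validation procedure, so that differences in AUC are attributable to the feature groups alone.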
7. Statistical Validation Plan
7.1 Within-Subject Repeated Measures
Each participant completes all three phases (baseline, stress, recovery), enabling within-subject comparisons that control for individual differences:
- Repeated-measures ANOVA (or Friedman test if non-normal) for phase effects on each metric
- Post-hoc paired comparisons with Bonferroni correction (baseline vs. stress, stress vs. recovery, baseline vs. recovery)
- Effect sizes (Cohen’s d for paired samples) to quantify practical significance
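The post-hoc step could be sketched as below, using paired t-tests with a Bonferroni adjustment and the d_z variant of Cohen's d (mean of the paired differences divided by their standard deviation). The function name, the choice of d_z among the several paired effect-size conventions, and the example HR values are illustrative assumptions, not study data.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def paired_phase_comparisons(phase_values):
    """Pairwise paired t-tests with Bonferroni correction and Cohen's d_z.

    phase_values: dict mapping phase name to per-participant values,
    with participants in the same order for every phase.
    """
    pairs = list(combinations(phase_values, 2))
    results = []
    for first, second in pairs:
        x = np.asarray(phase_values[first], dtype=float)
        y = np.asarray(phase_values[second], dtype=float)
        diff = y - x
        t_stat, p_raw = stats.ttest_rel(y, x)
        results.append({
            "pair": (first, second),
            "t": float(t_stat),
            # Bonferroni: multiply by the number of comparisons, cap at 1
            "p_bonferroni": min(1.0, float(p_raw) * len(pairs)),
            "cohens_dz": float(diff.mean() / diff.std(ddof=1)),
        })
    return results


# Made-up mean HR (bpm) per phase for six participants
example = {
    "baseline": [62, 70, 66, 75, 58, 68],
    "stress": [70, 79, 71, 86, 65, 74],
    "recovery": [64, 71, 68, 77, 59, 70],
}
comparisons = paired_phase_comparisons(example)
for row in comparisons:
    print(row)
```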
7.2 Classification Evaluation
For binary (baseline vs. stress) and three-class (baseline vs. stress vs. recovery) classification:
- Leave-one-subject-out cross-validation (LOSO-CV) to evaluate generalization
- Metrics: AUC-ROC, F1-score, sensitivity, specificity, balanced accuracy
- Models: Random Forest and Gradient Boosting (e.g., XGBoost) as primary classifiers; logistic regression as interpretable baseline
- Feature importance: SHAP values to identify which smartwatch features contribute most
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np


def loso_cv_evaluation(X, y, groups, classifier=None):
    """
    Leave-one-subject-out cross-validation for stress classification.

    Args:
        X: Feature matrix (n_windows x n_features)
        y: Labels (0=baseline, 1=stress)
        groups: Participant IDs for each window
        classifier: sklearn classifier (default: RandomForest)

    Returns:
        dict with per-fold and aggregate performance metrics
    """
    if classifier is None:
        classifier = RandomForestClassifier(
            n_estimators=200, max_depth=10, random_state=42
        )
    # Positional indexing below requires NumPy arrays, not DataFrames/Series
    X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)

    logo = LeaveOneGroupOut()
    fold_results = []
    all_y_true = []
    all_y_prob = []
    for train_idx, test_idx in logo.split(X, y, groups):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        model = clone(classifier)  # fresh, unfitted copy for every fold
        model.fit(X_train, y_train)
        y_prob = model.predict_proba(X_test)[:, 1]
        # AUC is undefined when the held-out subject has only one class
        if len(np.unique(y_test)) < 2:
            fold_auc = np.nan
        else:
            fold_auc = roc_auc_score(y_test, y_prob)
        fold_results.append({
            "subject": groups[test_idx[0]],
            "auc": fold_auc,
            "n_test": len(test_idx),
        })
        all_y_true.extend(y_test)
        all_y_prob.extend(y_prob)

    overall_auc = roc_auc_score(all_y_true, all_y_prob)
    return {
        "overall_auc": overall_auc,
        "fold_results": fold_results,
        "mean_fold_auc": np.nanmean([f["auc"] for f in fold_results]),
        "std_fold_auc": np.nanstd([f["auc"] for f in fold_results]),
    }
7.3 Agreement Reporting Standards
Following recommended reporting guidelines for method comparison studies (Giavarina, 2015):
- Report both Pearson correlation and Bland-Altman limits of agreement
- Include ICC with 95% confidence intervals
- State the clinical/research context for interpreting limits of agreement
- Report results separately for low-motion and all-motion conditions
- Present per-participant and aggregate statistics
8. Discussion
8.1 What Smartwatches Can and Cannot Do
Based on the validation framework and prior literature, a realistic assessment:
Smartwatches CAN reliably:
- Track mean HR changes across experimental phases at rest
- Detect sustained HR elevation during cognitive stress tasks
- Capture the overall trajectory (baseline -> stress -> recovery)
- Provide continuous monitoring without participant burden
- Enable longitudinal tracking of stress reactivity patterns
Smartwatches CANNOT reliably:
- Replicate beat-to-beat HRV metrics (RMSSD, pNN50) with clinical precision
- Detect brief autonomic events (< 30 seconds) from HR trend alone
- Provide valid variability data during physical movement
- Replace ECG for clinical or diagnostic purposes
- Capture frequency-domain HRV (LF/HF ratio requires IBI data)
8.2 Practical Implications for Researchers
When a smartwatch is sufficient:
- Large-scale screening studies where individual precision is less critical
- Longitudinal tracking of stress patterns over days/weeks
- Ecological momentary assessment in daily life
- Studies where participant compliance with ECG is infeasible
- Proof-of-concept investigations prior to clinical-grade studies
When ECG remains necessary:
- Clinical diagnosis or intervention studies
- Research requiring frequency-domain HRV analysis
- Studies of rapid autonomic transitions (< 1 minute)
- Populations with cardiac arrhythmias or vascular conditions
- Regulatory or certification contexts
8.3 The Case for Trend Analysis Over Beat-to-Beat Metrics
A key insight from this validation work is that the smartwatch approach is better understood as trend analysis rather than HRV approximation. Rather than attempting to reconstruct beat-to-beat variability from 1 Hz HR data (a fundamentally lossy operation), we should leverage what smartwatches do well: capturing sustained changes in heart rate over minutes to hours.
This reframing shifts the validation question from “How closely does PRV match HRV?” to “Can HR trends detect the same stress events that HRV detects?” The latter question is both more tractable and more relevant for consumer health applications.
8.4 Limitations
- Device specificity: Results may vary across smartwatch models and firmware versions; the Garmin Elevate Gen 5 sensor used here represents current high-end consumer PPG.
- Population: Validation in healthy young adults (university students) may not generalize to clinical populations, older adults, or those with darker skin tones (PPG sensitivity varies with melanin content).
- Protocol constraints: Controlled laboratory conditions (seated, low noise) represent best-case scenarios; daily-life performance will be lower.
- Proprietary algorithms: Garmin’s internal HR processing (filtering, artifact rejection) is a black box; we validate the output, not the algorithm.
- Sample size: The thesis targets 20–40 participants, which provides adequate power for within-subject analyses but limits generalizability claims.
9. Conclusion
This article presents a systematic framework for validating consumer smartwatch heart rate data against ECG-derived HRV for stress detection. The core finding is nuanced: smartwatches cannot replace ECG for classical HRV analysis, but they can detect meaningful stress-related cardiac changes when appropriate surrogate metrics, motion gating, and personal calibration are applied.
For the StressSmartWatch thesis project, this validation framework serves as the scientific backbone linking the technical infrastructure (web application, Garmin Connect IQ app, synchronization pipeline) to defensible research conclusions. By establishing clear agreement boundaries, we can make honest claims about what smartwatch-based stress estimation achieves and where it falls short.
The broader significance extends beyond this thesis: as consumer wearables become ubiquitous, understanding their validity envelope for health-relevant measurements is essential. Not every application requires clinical-grade HRV. For many use cases — personal stress awareness, workplace wellness programs, longitudinal research — the precision afforded by a well-validated smartwatch approach may be not only sufficient but preferable, given its unmatched scalability and ecological validity.
References
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307-310.
Georgiou, K., Larentzakis, A. V., Khamis, N. N., Alsuhaibani, G. I., Alaska, Y. A., & Giallafos, E. J. (2018). Can wearable devices accurately measure heart rate variability? A systematic review. Folia Medica, 60(1), 7-20.
Giavarina, D. (2015). Understanding Bland Altman analysis. Biochemia Medica, 25(2), 141-151.
McEwen, B. S. (2007). Physiology and neurobiology of stress and adaptation: Central role of the brain. Physiological Reviews, 87(3), 873-904.
Nelson, B. W., & Allen, N. B. (2019). Accuracy of consumer wearable heart rate measurement during an ecologically valid 24-hour period: Intraindividual validation study. JMIR mHealth and uHealth, 7(3), e10828.
Pinto, G., Carvalho, J. M., Barros, F., Soares, S. C., Pinho, A. J., & Bras, S. (2020). Multimodal emotion evaluation: A physiological model for cost-effective emotion classification. Sensors, 20(12), 3510.
Plews, D. J., Scott, B., Altini, M., Wood, M., Kilding, A. E., & Laursen, P. B. (2017). Comparison of heart-rate-variability recording with smartphone photoplethysmography, Polar H7 chest strap, and electrocardiography. International Journal of Sports Physiology and Performance, 12(10), 1324-1328.
Schäfer, A., & Vagedes, J. (2013). How accurate is pulse rate variability as an estimate of heart rate variability? A review on studies comparing photoplethysmographic technology with an electrocardiogram. International Journal of Cardiology, 166(1), 15-29.
Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. (1996). Heart rate variability: Standards of measurement, physiological interpretation, and clinical use. Circulation, 93(5), 1043-1065.
Appendix: Quick-Start Validation Checklist
A.1 Before Data Collection
- Confirm target Garmin model exposes HR at >= 1 Hz via FIT export
- Test BBI/IBI availability on target device (model- and firmware-dependent)
- Select and validate ECG reference device (Polar H10 recommended)
- Synchronize all device clocks to NTP before each session
- Calibrate accelerometer motion thresholds during pilot sessions
- Define and document baseline period (minimum 5 minutes seated rest)
A.2 During Analysis
- Align timestamps between smartwatch and ECG using event anchors
- Apply motion gating before computing variability metrics
- Compute both raw and baseline-calibrated features
- Report agreement metrics (correlation, Bland-Altman, ICC) per condition
- Run LOSO-CV classification and report AUC with confidence intervals
- Perform ablation study across feature groups
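The timestamp alignment step can be sketched with a nearest-match join on time elapsed since a shared event anchor (for example, a marker recorded on both devices at session start). The function name `align_by_anchor`, the column names, the 2-second tolerance, and the example values are assumptions for illustration.

```python
import pandas as pd


def align_by_anchor(ecg_windows, watch_hr, ecg_anchor, watch_anchor,
                    tolerance_sec=2.0):
    """Match each ECG window to the nearest watch HR sample in anchor time."""
    ecg = ecg_windows.copy()
    watch = watch_hr.copy()
    # Express both streams as seconds since their own recording of the anchor,
    # which removes any constant offset between the two device clocks
    ecg["elapsed_s"] = (ecg["timestamp"] - ecg_anchor).dt.total_seconds()
    watch["elapsed_s"] = (watch["timestamp"] - watch_anchor).dt.total_seconds()
    return pd.merge_asof(
        ecg.sort_values("elapsed_s"),
        watch.sort_values("elapsed_s")[["elapsed_s", "heart_rate"]],
        on="elapsed_s",
        direction="nearest",
        tolerance=tolerance_sec,  # unmatched windows get NaN
    )


# Illustrative data: the watch clock runs 1.5 s behind the ECG clock
ecg_windows = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2025-01-01 10:00:00", "2025-01-01 10:00:10", "2025-01-01 10:00:20"]
    ),
    "rmssd": [42.0, 35.0, 38.0],
})
watch_hr = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2025-01-01 09:59:58.5", "2025-01-01 10:00:08.5", "2025-01-01 10:00:38.5"]
    ),
    "heart_rate": [60, 75, 80],
})
merged = align_by_anchor(
    ecg_windows,
    watch_hr,
    ecg_anchor=pd.Timestamp("2025-01-01 10:00:00"),
    watch_anchor=pd.Timestamp("2025-01-01 09:59:58.5"),
)
print(merged[["elapsed_s", "rmssd", "heart_rate"]])
```

The third ECG window has no watch sample within the tolerance, so its heart rate is left missing instead of being matched to a distant sample.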
A.3 When Reporting Results
- Clearly state which smartwatch model and firmware version were used
- Report motion gating thresholds and data exclusion rates
- Present results separately for rest and motion conditions
- Acknowledge the fundamental temporal resolution limitation
- Avoid claiming “HRV measurement” when reporting HR-derived surrogates
- Discuss clinical vs. research relevance of observed accuracy levels