From Signal to Insight: A Complete Guide to EGM Processing for Machine Learning in Cardiac Electrophysiology Research

Leo Kelly Jan 12, 2026 605

This article provides a comprehensive guide for researchers and drug development professionals on processing intracardiac Electrogram (EGM) signals for machine learning feature extraction.

From Signal to Insight: A Complete Guide to EGM Processing for Machine Learning in Cardiac Electrophysiology Research

Abstract

This article provides a comprehensive guide for researchers and drug development professionals on processing intracardiac Electrogram (EGM) signals for machine learning feature extraction. It covers foundational concepts of EGM biophysics and noise, details preprocessing pipelines (filtering, segmentation, artifact removal) and feature engineering methods (time-domain, frequency-domain, non-linear). The guide addresses common challenges in signal quality and dataset imbalance, and establishes robust validation frameworks for comparing traditional biomarkers against ML-derived features. The goal is to equip scientists with the practical knowledge to build reliable, clinically translatable ML models for arrhythmia study and drug efficacy assessment.

Understanding the Raw Material: The Biophysics, Noise, and Components of Intracardiac EGMs

What is an EGM? Defining Intracardiac vs. Surface ECG Signals and Their Unique Information Content

An Electrogram (EGM) is a recording of the heart's electrical activity measured directly from the heart's surface or from within its chambers. This contrasts with a surface Electrocardiogram (ECG), which measures the same bioelectrical phenomena from electrodes placed on the skin. The proximity of EGM electrodes to the cardiac tissue provides a high-fidelity, localized signal with distinct information content compared to the spatially and temporally integrated view of the ECG.

Comparative Signal Characteristics

The fundamental differences between intracardiac EGM and surface ECG signals are summarized in the table below.

Table 1: Key Characteristics of Surface ECG vs. Intracardiac EGM

Parameter	Surface ECG	Intracardiac EGM
Electrode Location	Skin surface (limbs, chest)	Endocardial/Epicardial surface, within chambers
Signal Amplitude	0.5 - 5 mV	5 - 20 mV (often higher)
Frequency Bandwidth	0.05 - 150 Hz (diagnostic)	1 - 500+ Hz (up to 1kHz for research)
Spatial Resolution	Low (whole-heart summation)	High (localized, < 1 cm² area)
Primary Information	Global cardiac rhythm, conduction pathways, gross morphology	Local activation timing, fractionated potentials, depolarization/repolarization details
Key Components	P wave, QRS complex, T wave	Local activation potential, far-field components, stimulus artifacts
Dominant Noise Sources	Motion artifact, muscle EMG, powerline interference	Electrode-tissue interface noise, instrumentation noise

Unique Information Content and Physiological Basis

The information derived from each modality serves complementary purposes:

Surface ECG: Represents the summed vector of all cardiac depolarization and reparization waves as they propagate through the volume conductor of the body. It is the gold standard for diagnosing arrhythmias (e.g., atrial fibrillation, ventricular tachycardia), conduction disorders (e.g., AV block), and ischemia.
Intracardiac EGM: Provides a direct measurement of local myocardial activation. Key features include:
- Activation Timing: Precise local activation time (LAT) for mapping.
- Fractionated Potentials: Low-amplitude, high-frequency signals indicative of scarred or diseased tissue, critical for substrate-based ablation.
- Voltage: Amplitude correlates with local tissue health (e.g., scar voltage < 0.5 mV).
- Stimulus-Response: Direct capture and pacing threshold measurements.

Experimental Protocols for EGM/ECG Data Acquisition in Research

Protocol 1: Simultaneous Acquisition of Surface ECG and Intracardiac EGM in Preclinical Models

Objective: To correlate global cardiac electrical activity (ECG) with local myocardial electrophysiology (EGM) for feature validation.
Materials: See "The Scientist's Toolkit" below.
Methodology:
- Anesthetize and instrument the animal model (e.g., porcine, canine) according to IACUC-approved protocols.
- Place standard limb lead ECG electrodes on shaved skin.
- Under fluoroscopic or electroanatomical mapping guidance, advance a diagnostic electrophysiology catheter (e.g., duodecapolar, or mapping catheter) to the target chamber (e.g., right atrium, left ventricle).
- Connect both ECG surface electrodes and intracardiac catheter to a multi-channel bio-amplifier/recording system with a sampling rate ≥ 2 kHz per channel.
- Record a minimum of 5 minutes of baseline rhythm. Induce arrhythmia if required by the protocol (e.g., via programmed electrical stimulation).
- Synchronize all data streams using a common analog or digital trigger.
- Apply band-pass filtering post-acquisition (ECG: 0.5-150 Hz; EGM: 1-500 Hz).
- Annotate key fiducial points (ECG: P onset, R peak; EGM: local activation peak/dV/dt max) for temporal analysis.

Protocol 2: Processing EGM Signals for Machine Learning Feature Extraction

Objective: To generate a curated dataset of EGM features for arrhythmia classification or outcome prediction models.
Workflow: The following diagram outlines the core signal processing and feature engineering pipeline.

Diagram Title: EGM Feature Extraction Pipeline for ML

Detailed Steps:
- Pre-Processing: For each EGM channel, apply a 2nd-order 50/60 Hz notch filter, followed by a band-pass filter (e.g., 1-300 Hz Butterworth). Normalize amplitude (zero-mean, unit variance).
- Activation Detection: Use a validated algorithm (e.g., steepest negative dV/dt, wavelet transform) to mark the local activation time (LAT) for each beat.
- Beat Segmentation: Extract a window of data (e.g., 200 ms) centered on each detected LAT to create individual beat epochs. Reject epochs with excessive noise.
- Feature Extraction:
  - Time-Domain: Peak-to-peak amplitude, slew rate (max dV/dt), duration at 50% amplitude, root mean square (RMS).
  - Frequency-Domain: Dominant frequency, peak power spectral density, spectral entropy.
  - Non-Linear: Wavelet entropy, fractal dimension, Lyapunov exponent (for sequential beats).
- Dataset Curation: Tabulate features with labels (e.g., sinus rhythm, scar zone, arrhythmia type) into a structured array (e.g., .csv, .h5) for ML model input.

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Materials for EGM/ECG Research

Item	Function & Application
High-Density Mapping Catheter (e.g., PentaRay, HD Grid)	Provides simultaneous, spatially precise EGM recordings from multiple electrodes (e.g., 20-64 poles) for creating detailed activation maps.
Programmed Electrical Stimulator	Delivers precise pacing protocols (S1-S2, burst pacing) to induce and study arrhythmias in controlled experimental settings.
Multi-Channel Bioamplifier/Data Acquisition System (e.g., from ADInstruments, BIOPAC)	Amplifies, filters, and digitizes low-amplitude biological signals from both surface and intracardiac electrodes simultaneously.
3D Electroanatomical Mapping System (e.g., CARTO, EnSite)	Integrates EGM location, timing, and voltage with 3D geometry to create maps of cardiac electrical activity. Essential for translating local EGM data to structural context.
Signal Processing Software (e.g., LabChart, MATLAB with Signal Processing Toolbox, custom Python scripts)	Performs critical offline analysis: filtering, annotation, feature extraction, and statistical analysis of acquired EGM/ECG data.
Langendorff Perfused Heart Setup	Ex vivo model allowing for controlled, motion-stable acquisition of high-fidelity epicardial and endocardial EGMs without systemic confounding factors.

This application note details experimental protocols for investigating the biophysical basis of intracardiac electrogram (EGM) components. The work is framed within a broader thesis on developing interpretable machine learning features for cardiac electrophysiology. The core objective is to establish a causal, quantitative mapping between measurable tissue properties (e.g., conduction velocity, fibrosis density, ion channel function) and the morphological characteristics of EGM signals (far-field vs. near-field, unipolar vs. bipolar). This foundational mapping is essential for creating biologically grounded feature sets for ML models in arrhythmia research and drug development.

EGM Component Definitions and Determinants

EGM Component	Definition	Primary Biophysical Determinants	Typical Frequency Range	Spatial Sensitivity
Near-Field	Signal from myocytes within ~1-2 mm of electrode.	Local transmembrane action potential (TAP) morphology, local coupling resistance, direct tissue-electrode contact.	40-250 Hz	Highly localized (~1-2 mm radius).
Far-Field	Signal from myocardium remote (>1 cm) from electrode.	Global cardiac electrical propagation, tissue mass, tissue anisotropy, chamber geometry.	1-40 Hz	Broad, whole-chamber or cross-chamber.
Unipolar	Potential difference between intracardiac electrode and distant reference.	Summation of all electrical activity (near-field + far-field) along the path to the reference. TIP: Broad spatial view.	0.5-250 Hz	Very broad, omnidirectional.
Bipolar	Potential difference between two closely spaced intracardiac electrodes.	Spatial gradient of electrical potential. Emphasizes high-frequency components near the electrode pair. TIP: Localizes signal source.	30-500 Hz	Directional, localized to inter-electrode axis.

Quantitative Relationships: Tissue Properties to EGM Features

Table summarizing key quantitative mappings derived from experimental and simulation studies.

Tissue Property	Measured Metric	Primary EGM Impact	Quantifiable Effect on EGM	Approximate Scaling Law (from models)
Conduction Velocity (CV)	cm/ms	Bipolar EGM width, slew rate (dV/dt).	CV ↓ → Bipolar width ↑, amplitude ↓, fractionation ↑.	Bipolar Width ∝ 1 / CV (local).
Fibrosis Density	% area or collagen volume fraction (CVF).	Near-field amplitude, bipolar fractionation, late potentials.	CVF > 10-15% → consistent fractionation, amplitude reduction > 50%.	Signal Amplitude ∝ exp(-k * CVF).
Tissue Mass / Wall Thickness	mm or g	Far-field amplitude in unipolar signals.	Mass ↑ → Far-field amplitude ↑ linearly in unipolar EGMs.	Unipolar FF Amplitude ∝ Mass (remote).
Ion Channel Dysfunction (e.g., I_Na)	Maximal dV/dt of TAP	Bipolar EGM slew rate, near-field amplitude.	dV/dt_max ↓ 50% → Bipolar slew rate ↓ ~40%, amplitude ↓ ~30%.	Slew Rate ∝ dV/dt_max.
Electrode-Tissue Distance	mm	Near-field amplitude, high-frequency content.	Distance ↑ 1mm → Bipolar amplitude ↓ ~50%, high-freq. power ↓ sharply.	Amplitude ∝ 1 / Distance² (near-field).

Experimental Protocols

Protocol: Ex Vivo Mapping of Focal Fibrosis to Bipolar EGM Fractionation

Objective: To empirically correlate spatially registered histology (fibrosis quantification) with high-density bipolar EGM recordings.

Materials: Langendorff-perfused explanted heart (small animal or human), optical mapping system (optional), micro-electrode array (MEA) or multipolar catheter, perfusion system, rapid tissue freezer, histology setup (fixation, embedding, picrosirius red stain), confocal/standard microscope, co-registration software.

Methodology:

Heart Preparation & Perfusion: Establish Langendorff perfusion with oxygenated Tyrode's solution. Maintain temperature (37°C), pH (7.4), and perfusion pressure.
High-Density Electrophysiological Mapping:
- Position a high-density MEA (e.g., 128 electrodes, 0.5-1.0 mm spacing) on the epicardial region of interest (ROI).
- Record bipolar EGMs from all adjacent electrode pairs during steady-state pacing (cycle length 400-600ms).
- For each bipolar EGM, extract features: Number of Peaks (fractionation index), Peak-to-Peak Amplitude, Duration (total activation time), and Slew Rate.
- Create spatial maps of each EGM feature.
Tissue Registration & Freezing:
- Mark the MEA boundaries on the epicardium with sterile dye pins.
- Rapidly excise the mapped ROI and freeze in optimal cutting temperature (OCT) compound using isopentane cooled by liquid nitrogen.
Histological Processing & Co-Registration:
- Serially section tissue (5-10 µm thickness) perpendicular to epicardium.
- Stain with picrosirius red for collagen quantification.
- Image sections under polarized light (collagen appears birefringent) to compute Collagen Volume Fraction (CVF) per microscopic field (e.g., 200x200 µm).
- Using the dye marks and blood vessel patterns, digitally co-register each histological field with its corresponding EGM recording site from the MEA map.
Statistical Correlation: Perform linear/multivariate regression analysis between local CVF and bipolar EGM features (e.g., Number of Peaks, Amplitude).

Protocol: In Silico Study of Ion Channel Block on Unipolar vs. Bipolar EGMs

Objective: To isolate the effect of specific ionic current reduction (simulating drug effect) on EGM component morphology using a computational model.

Materials: Multi-scale computational modeling software (e.g., OpenCARP, COMSOL, custom Matlab/Python with CellML). Models: Human ventricular myocyte model (e.g., O'Hara-Rudy, Tomek-Rodriguez), 2D or 3D monodomain/bidomain tissue slab model with realistic fibrosis patterns, virtual electrode arrays.

Methodology:

Baseline Model Construction:
- Implement a 2D tissue sheet (e.g., 5x5 cm) with assigned fiber orientation.
- Incorporate a zone of diffuse fibrosis (15-30% CVF) using a fibroblast coupling model or by altering conductivity.
- Define virtual electrode locations: one unipolar (with distant reference) and one bipolar pair (2mm spacing) placed centrally.
Simulation of Propagation & EGMs:
- Stimulate at one edge to generate planar wave propagation across the sheet.
- Solve the monodomain/bidomain equations to compute extracellular potentials at each electrode.
- Extract Baseline Unipolar EGM (showing near-field and far-field components) and Baseline Bipolar EGM (subtraction of two nearby unipolars).
Intervention - Ion Channel Block:
- In the cell model, reduce the maximum conductance (g_max) of a target current (e.g., I_Na by 50%, I_Ca by 30%, I_Kr by 90%).
- Re-run the simulation with identical pacing.
- Extract Post-Block Unipolar and Bipolar EGMs.
Feature Extraction & Comparison:
- For Unipolar: Measure Far-field amplitude (early/low-freq component), Near-field amplitude (sharp, high-freq peak), Total duration.
- For Bipolar: Measure Peak-to-peak amplitude, Slew rate (max dV/dt), Duration.
- Compute percentage change from baseline for each feature under each channel block condition.
Output: A table linking specific channel block to directional changes in specific EGM components, informing ML feature selection for drug effect classification.

The Scientist's Toolkit: Key Research Reagent Solutions

Item / Reagent	Function in EGM-Biophysics Research	Example Product / Model
High-Density Multipolar Catheter/MEA	Provides spatially precise recording of EGMs for near-field localization and fractionation analysis.	PentaRay NAV Catheter (Biosense Webster), Advisor HD Grid Mapping Catheter (Abbott).
Optical Mapping Dye (Voltage-Sensitive)	Validates electrical propagation maps and provides gold-standard conduction velocity independent of electrodes.	RH237, Di-4-ANEPPS.
Perfusion System (Langendorff)	Maintains ex vivo heart viability and electrophysiological stability for controlled experiments.	Radnoti Langendorff System.
Histology Collagen Stain	Quantifies interstitial fibrosis (key tissue property) for direct correlation with EGM.	Picrosirius Red Stain Kit (Polysciences).
Computational Cardiac Electrophysiology Platform	Allows in silico perturbation of tissue properties (CV, fibrosis, ion channels) in isolation to study EGM effects.	OpenCARP (open-source), COMSOL Multiphysics with ACID add-on.
Fractionation Analysis Software	Automates detection and quantification of complex, fractionated EGMs (number of peaks, duration, voltage).	LabSystem PRO EP Recording System (Boston Scientific), custom Matlab/Python toolkits.

Visualization Diagrams

Title: Mapping Tissue Properties to EGM Features

Title: Ex Vivo EGM-Fibrosis Correlation Workflow

Title: In Silico EGM Sensitivity Analysis Protocol

Within the thesis "Advanced EGM Signal Processing for Robust Machine Learning Feature Extraction in Cardiac Safety Pharmacology," accurate identification and mitigation of noise is paramount. Intracardiac electrogram (EGM) signals, crucial for assessing cardiac electrophysiology in preclinical and clinical drug development, are susceptible to corruption by pervasive noise sources. These artifacts can obscure true biological signals, leading to inaccurate feature extraction and compromising machine learning model performance. This document details the characterization and experimental protocols for three predominant noise enemies: Baseline Wander (BW), Powerline Interference (PLI), and Motion Artifact (MA).

The table below summarizes the key attributes of each noise source, essential for designing digital filters and ML denoising algorithms.

Table 1: Quantitative Characterization of Common EGM Noise Sources

Noise Source	Typical Frequency Range	Amplitude Range	Primary Origin	Key Morphological Feature
Baseline Wander (BW)	< 1 Hz	Up to 15% of EGM amplitude	Respiration, electrode-skin impedance changes	Slow, sinusoidal drift of signal isoelectric line.
Powerline Interference (PLI)	50 Hz or 60 Hz (± harmonics)	10 µV – 5 mV	Capacitive/inductive coupling from AC mains	Persistent sinusoidal oscillation superimposed on signal.
Motion Artifact (MA)	0.1 Hz – 10 Hz	Can exceed EGM amplitude	Physical movement, electrode displacement	Abrupt, non-stationary, high-amplitude transients.

Experimental Protocols for Noise Induction & Study

Protocol:In-VitroPLI and BW Characterization Setup

Objective: To systematically record and quantify PLI and BW in a controlled benchtop environment simulating clinical recording setups.

Materials: See Scientist's Toolkit (Section 6.0).

Methodology:

Setup: Place a saline-filled tank (simulating torso conductivity) on a non-conductive surface. Submerge a commercial catheter electrode and a reference Ag/AgCl electrode.
Signal Generation: Use a programmable signal generator to inject a synthetic cardiac EGM waveform (e.g., mimicking ventricular depolarization) through a pair of dedicated stimulating electrodes.
PLI Induction: Position a standard AC power cable (120V/60Hz or 230V/50Hz) at varying distances (5-50 cm) from the recording electrodes and data acquisition (DAQ) system cables. Loop the cable to enhance electromagnetic coupling.
BW Induction: Mechanically oscillate the recording electrode vertically (0.1-0.5 Hz) using a calibrated linear actuator to simulate respiratory-induced electrode motion relative to the medium.
Data Acquisition: Acquire signals via a biopotential amplifier (gain: 1000, bandwidth: 0.1-500 Hz) and DAQ system (sampling rate: 2 kHz). Record three separate 5-minute epochs: (i) Clean EGM, (ii) EGM + PLI, (iii) EGM + BW.
Analysis: Compute Power Spectral Density (PSD) to identify peak interference frequencies. Measure signal-to-noise ratio (SNR) as: SNR (dB) = 20 log₁₀(Psignal / Pnoise).

Protocol:In-VivoMotion Artifact Provocation

Objective: To elicit and characterize motion artifacts in an anesthetized preclinical model.

Methodology:

Animal Preparation: Anesthetize and instrument a canine or swine subject per IACUC-approved protocols. Position a deflectable diagnostic catheter in the right ventricle under fluoroscopic guidance.
Baseline Recording: Record stable bipolar EGM from the catheter tip for 5 minutes (reference period).
Artifact Provocation: Implement a series of controlled maneuvers: a. Catheter Tap: Gently tap the catheter shaft proximal to the insertion site. b. Body Roll: Slowly tilt the surgical table approximately 15 degrees left and right. c. Respiration Increase: Adjust ventilator parameters to increase tidal volume by 30% for 60 seconds.
Synchronized Recording: Synchronize EGM recording (high sampling rate: 4 kHz) with accelerometer data (placed on the animal's torso) and ventilator phase output.
Analysis: Use accelerometer data to time-lock EGM transients. Characterize MA amplitude, duration, and spectral profile via short-time Fourier transform (STFT).

Visualizing the Noise Identification & Processing Workflow

Diagram Title: EGM Noise Source Identification and Mitigation Pathway for ML

Diagram Title: In-Vitro PLI & BW Characterization Protocol Flow

The Scientist's Toolkit: Key Research Reagents & Materials

Table 2: Essential Materials for EGM Noise Research

Item	Function/Application
Programmable Signal Generator	Synthesizes pristine, known-parameter cardiac EGM templates for controlled noise addition studies.
Biopotential Amplifier (Isolated)	Amplifies microvolt-level EGM signals with high common-mode rejection ratio (CMRR >100 dB) to reject inherent interference.
High-Resolution DAQ System	Acquires signals at >= 2 kHz sampling rate to accurately resolve high-frequency noise components and EGM morphology.
Saline-Filled Tank/Phantom	Provides a volume conductor model for in-vitro experimentation, allowing reproducible electrode positioning and noise coupling.
Diagnostic Electrophysiology Catheter	Standardized tool for intracardiac signal recording; subject to motion and interference in clinical settings.
3-Axis Accelerometer	Synchronously records mechanical motion to establish causality for motion artifact identification.
Digital Filtering Software (e.g., LabVIEW, Python SciPy)	Implements and tests noise removal algorithms (e.g., high-pass, notch, adaptive filters) prior to ML pipeline integration.

Application Notes

Intracardiac electrograms (EGMs) provide critical, high-fidelity electrophysiological data essential for diagnosing arrhythmias, guiding ablation therapy, and assessing drug efficacy. The fundamental characteristics of these signals—including amplitude, frequency, morphology, and complexity—vary systematically based on both the type of arrhythmia (e.g., Atrial Fibrillation/AFib vs. Ventricular Tachycardia/VT) and the anatomical recording site (atrial vs. ventricular myocardium). For research aimed at developing machine learning (ML) features for automated diagnosis and mapping, understanding these variations is paramount. Atrial signals during AFib are characterized by low-voltage, high-frequency, and irregular activations, reflecting chaotic, multi-wavelet reentry. In contrast, ventricular EGMs during VT often show higher amplitude, more organized, and slower periodic signals, consistent with a macro-reentrant or focal mechanism. Site-specific differences are equally critical; atrial myocardium inherently generates faster, lower amplitude signals than ventricular tissue due to electrophysiological and structural properties. These distinctions form the basis for feature engineering in ML pipelines, where time-domain (e.g., voltage, slew rate), frequency-domain (e.g., dominant frequency, organization index), and complexity-based (e.g., entropy, fractal dimension) features must be tailored and validated for the specific clinical context.

Table 1: Characteristic EGM Parameters by Arrhythmia Type and Recording Site

Parameter	Sinus Rhythm (Atrium)	AFib (Atrium)	Sinus Rhythm (Ventricle)	VT (Ventricle)
Voltage Amplitude (mV)	1.5 - 4.0	0.1 - 0.5	5.0 - 10.0	1.0 - 5.0
Dominant Frequency (Hz)	5 - 7	6 - 12	3 - 5	3 - 7
Cycle Length (ms)	600 - 1000	100 - 200	600 - 1000	200 - 400
Slew Rate (V/s)	0.5 - 1.5	0.05 - 0.2	1.0 - 3.0	0.2 - 1.0
Organization Index	High (0.8-1.0)	Low (0.1-0.3)	High (0.8-1.0)	Medium-High (0.5-0.8)
Sample Entropy	Low (<0.5)	High (>1.5)	Low (<0.5)	Medium (0.8-1.2)

Note: Values are generalized from contemporary literature and may vary based on specific patient pathology, recording electrode type (bipolar/unipolar), and inter-electrode spacing.

Experimental Protocols

Protocol 1: Acquisition of Clinical EGMs for Feature Database Construction

Objective: To collect a standardized dataset of intracardiac EGMs during different arrhythmias from specified sites for ML feature research. Materials: See "Scientist's Toolkit" below. Methodology:

Patient Preparation & Consent: Obtain IRB approval and informed consent. Perform standard pre-procedure preparations.
Electrode Catheter Placement: Under fluoroscopic/3D mapping guidance, position diagnostic catheters:
- A decapolar catheter in the coronary sinus (CS) for left atrial/CS recordings.
- A duodecapolar catheter along the right atrial free wall and crista terminalis.
- A quadripolar catheter at the right ventricular apex.
Signal Acquisition & Arrhythmia Induction:
- Record 60 seconds of baseline sinus rhythm from all catheters.
- For AFib: If the patient is in sinus rhythm, induce AFib via rapid atrial pacing or isoproterenol infusion.
- For VT: Perform programmed electrical stimulation (PES) from the RV apex with up to 3 extra stimuli to induce VT.
Data Recording: Using the electrophysiology lab system, record unipolar and bipolar EGMs from all catheter electrodes simultaneously with surface ECG leads. Settings: Sampling rate ≥ 1000 Hz, bandpass filter 0.05-500 Hz for unipolar, 30-500 Hz for bipolar.
Annotation: An expert electrophysiologist will annotate the onset/offset of each arrhythmia episode and label recording sites.
Export: Export data segments in a standard format (e.g., .mat, .txt) with full metadata.

Protocol 2: In-Silico Simulation of Arrhythmia EGMs

Objective: To generate synthetic EGM data with known ground truth for validating feature robustness. Methodology:

Model Selection: Use a detailed cardiac tissue model (e.g., Courtemanche-Ramirez-Nattel for atrium, ten Tusscher-Panfilov for ventricle) integrated into a monodomain or bidomain framework.
Arrhythmia Simulation:
- AFib: Initiate in a 2D or 3D atrial tissue sheet by applying S1-S2 cross-field stimulation or seeding multiple random reentrant wavelets.
- VT: Initiate in a ventricular tissue slab using a rapid pacing protocol or by creating a zone of slowed conduction to establish a reentrant circuit.
Virtual Electrogram Calculation: Simulate bipolar EGMs by calculating the extracellular potential difference between two points in the model, incorporating electrode size and spacing.
Parameter Variation: Systematically vary parameters (e.g., fibrosis density, ion channel conductances) to simulate different pathological substrates.
Noise Addition: Add realistic noise (50/60 Hz interference, baseline wander, myopotential) to the clean simulated signals.

Protocol 3: Feature Extraction and Comparative Analysis Workflow

Objective: To extract, compare, and validate ML-relevant features from EGMs grouped by arrhythmia type and site. Methodology:

Preprocessing: Apply a notch filter (50/60 Hz). For bipolar signals, apply a high-pass filter (30 Hz). Normalize amplitudes.
Segmentation: Segment continuous recordings into 5-second non-overlapping epochs labeled by rhythm and site.
Feature Extraction: For each epoch, calculate a comprehensive feature set:
- Time-Domain: Peak-to-peak voltage, maximal slew rate, local activation time (LAT) variability.
- Frequency-Domain: Dominant frequency (DF), DF organization index (ratio of DF power to total power).
- Complexity: Sample entropy, multiscale entropy, wavelet entropy, fractal dimension.
Statistical Comparison: Use non-parametric tests (Kruskal-Wallis with post-hoc Dunn's) to compare each feature across the four groups: Atrial-AFib, Atrial-Sinus, Ventricular-VT, Ventricular-Sinus.
Feature Selection: Apply dimensionality reduction (e.g., PCA) or feature importance ranking (e.g., random forest) to identify the most discriminative features for classifying arrhythmia and site.

Visualizations

Title: EGM Feature Extraction & Analysis Workflow

Title: Factors Determining EGM Characteristics

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for EGM Research

Item	Function in Research
Clinical-Grade Electrophysiology Catheter (e.g., Duodecapolar, PentaRay)	High-density, multi-electrode mapping catheters for acquiring spatially detailed bipolar/unipolar EGMs from specific cardiac chambers.
3D Electroanatomic Mapping System (e.g., CARTO, EnSite)	Provides precise 3D spatial localization of each EGM recording site, enabling correlation of signal features with anatomy.
Biophysical Simulation Software (e.g., OpenCARP, COMSOL)	Platforms for running in-silico cardiac tissue models to generate synthetic EGM data with controllable parameters.
Signal Processing Toolkit (e.g., MATLAB Wavelet Toolbox, Biosig for Python)	Software libraries containing validated algorithms for filtering, segmenting, and extracting time/frequency/complexity features from EGM signals.
Isolated Animal Heart Perfusion System (Langendorff)	Ex-vivo model for recording high-fidelity EGMs from atrial and ventricular tissue during pharmacologically induced arrhythmias.
Programmable Electrical Stimulator	Essential for arrhythmia induction protocols in both clinical studies and experimental models.
Data Annotation Software (e.g., LabChart, Custom GUI)	Allows expert manual review and labeling of EGM recordings, creating the ground-truth dataset for supervised ML.

Within electrophysiology research for drug development, intracardiac electrograms (EGMs) are the primary data source for investigating arrhythmia mechanisms and compound effects. Extracting ML-ready features from these signals is a central thesis of modern computational cardiology. This application note establishes that rigorous, high-fidelity preprocessing is the foundational, non-negotiable step determining the validity of all downstream feature engineering and model outcomes. Without it, extracted features represent artifact, not biology.

The High-Fidelity EGM Processing Pipeline: A Protocol

The following protocol details the mandatory steps to transform raw EGM recordings into a curated dataset for feature extraction.

Protocol 1.1: From Raw Acquisition to Cleaned Time-Series Objective: To remove non-cardiac noise and preserve morphologically significant components of the EGM. Materials: Multichannel electrophysiology recording system, isolated animal or human heart preparation, bipolar or unipolar electrodes, data acquisition unit (≥ 1 kHz sampling rate), computational environment (e.g., Python with SciPy/NumPy, MATLAB). Procedure:

Signal Acquisition: Record EGMs at a minimum sampling frequency of 1 kHz. For ventricular signals or complex fractionated electrograms, 2 kHz or higher is recommended. Ensure proper grounding to minimize 50/60 Hz line interference.
Digital Filtering: a. High-Pass Filter: Apply a zero-phase Butterworth high-pass filter (order 2-4) with a cutoff at 0.5 Hz to remove baseline wander and very low-frequency drift. b. Low-Pass Filter: Apply a zero-phase Butterworth low-pass filter (order 4-6) with a cutoff at 250 Hz to suppress high-frequency thermal noise and prevent aliasing for subsequent downsampling. c. Notch Filter: If significant line interference persists, apply a narrow band-stop (notch) filter at 50/60 Hz and its first harmonic (100/120 Hz).
Powerline & Artifact Rejection: Employ adaptive subtraction techniques (e.g., template matching) for large pacing artifacts or mechanical motion artifacts that filters cannot remove without signal distortion.
Quality Control & Segmentation: Visually inspect cleaned signals. Segment data into individual beats or episodes based on stimulus markers or detected activation times.

Quantitative Impact of Processing on Feature Stability

The table below summarizes experimental data demonstrating how preprocessing fidelity directly affects the coefficient of variation (CV) for common EGM features, a critical metric for ML dataset robustness.

Table 1: Feature Stability as a Function of Preprocessing Rigor

EGM Feature	Raw Signal CV (%)	With Basic Filtering CV (%)	With High-Fidelity Processing CV (%)	Notes
Peak-to-Peak Amplitude (mV)	35.2	18.7	8.1	Highly susceptible to baseline wander.
Local Activation Time (ms)	22.5	10.3	3.8	Jitter reduced by precise high-pass filtering.
Complex Fractionated Interval (ms)	45.8	30.1	15.4	Uncontrolled noise falsely extends intervals.
Spectral Dominant Frequency (Hz)	40.1	25.6	12.9	Line noise creates spurious spectral peaks.
Organizational Index (Unitless)	50.3	32.5	18.2	Noise degrades correlation-based metrics severely.

Experimental Protocol for Validation

Protocol 2.1: Validating Preprocessing Efficacy for ML Objective: To empirically test the hypothesis that classifier performance is dependent on preprocessing quality. Experimental Design:

Dataset Creation: From a repository of porcine infarct-model EGMs (n=500 recordings), create three datasets:
- Dataset A (Raw): Unprocessed signals.
- Dataset B (Basic): Signals with only 30-250 Hz bandpass filtering.
- Dataset C (High-Fidelity): Signals processed per Protocol 1.1, including adaptive artifact removal.
Feature Extraction: From each dataset, extract a standardized panel of 20 temporal and spectral features (e.g., from Table 1).
Model Training & Evaluation: Train a random forest classifier to identify "infarct zone" vs. "healthy zone" EGMs using a 70/30 train-test split. Perform 5-fold cross-validation.
Metrics: Compare mean accuracy, F1-score, and feature importance rankings across Datasets A, B, and C.

Expected Outcome: Dataset C will yield significantly higher accuracy and F1-score, with feature importance weights that align with known electrophysiological biomarkers, unlike Datasets A and B where importance is skewed by noise-corrupted features.

Visualizing the Critical Workflow & Signal Degradation Pathway

Title: The Critical Data Pathway: High-Fidelity Processing Determines ML Success

Title: Sources of Noise Corrupting the True EGM Signal

The Scientist's Toolkit: Essential Research Reagent Solutions

Item/Category	Function in EGM Processing & ML Feature Research
High-Impedance, Bipolar Electrodes	Minimizes far-field signal pickup, providing a localized EGM critical for detecting discrete pathological signals.
Optical Mapping-Compatible Dye (e.g., Di-4-ANEPPS)	Provides gold-standard validation for activation/recovery times derived from electrical EGMs, grounding ML features in biology.
Selective Ion Channel Blockers (e.g., E-4031, Dofetilide)	Used to create controlled pharmacological models of Long QT or specific arrhythmias, generating well-labeled EGM data for supervised ML.
Programmable Electrical Stimulator	Enforces consistent pacing protocols (S1-S2, burst pacing) to provoke and record repetitive or arrhythmic events for feature analysis.
Langendorff Perfusion System (ex-vivo)	Maintains stable, isolated heart preparations for long-duration, low-noise EGM recordings required for training deep learning models.
Digital Real-Time Recording Software (e.g., LabChart, EP-Workmate)	Acquires synchronous, high-sample-rate data from multiple electrodes, ensuring temporal alignment of all channels for spatial feature extraction.
Signal Processing Suite (e.g., MATLAB Signal Toolbox, Python BioSPPy)	Implements standardized, reproducible digital filters and feature extraction algorithms essential for creating consistent ML inputs.

Building the Pipeline: Step-by-Step EGM Preprocessing and Feature Engineering for ML Models

Within the broader thesis on Electrogram (EGM) signal processing for machine learning feature research, raw intracardiac signals contain both physiological information and pervasive noise. Effective preprocessing is critical for extracting robust, noise-resistant features for downstream ML models in drug development and electrophysiology research. This protocol details three core digital filtering strategies.

Quantitative Filter Comparison

Table 1: Standard Filter Specifications for Intracardiac EGMs

Filter Type	Typical Passband/Cutoff Frequencies	Attenuation (Stopband)	Common Filter Order	Primary Application in EGM Processing
Band-pass (Butterworth)	1-300 Hz or 30-300 Hz	≥ 20 dB at 0.5 Hz & 350 Hz	4th - 6th	Remove baseline wander & high-frequency EMI. Preserve ventricular/atrial components.
Notch (IIR)	50 Hz or 60 Hz ± 2 Hz	≥ 40 dB at exact line frequency	2nd (Q=30-60)	Eliminate powerline interference (50/60 Hz).
Adaptive (LMS/NLMS)	Variable, based on reference noise	Dependent on convergence factor μ	N/A (Filter length: 32-64 taps)	Remove in-band noise (e.g., muscle artifact, breathing) where static filters fail.
Band-pass (Chebyshev I)	1-300 Hz	≥ 50 dB at 0.1 Hz & 500 Hz	5th - 8th	Steeper roll-off for high-noise environments. Accepts passband ripple.
Savitzky-Golay (Smoothing)	N/A (Polynomial fitting)	N/A	Window: 5-21 pts, Poly: 3-5	Preserve peak morphology while smoothing high-frequency noise.

Table 2: Performance Metrics on Simulated EGM Data (Signal-to-Noise Ratio Improvement)

Filter Type	Input SNR (dB)	Output SNR (dB)	Artifact Introduced	Computational Load (Relative)
Butterworth Band-pass	10	18	Low (phase distortion minimal with forward-backward)	Low
IIR Notch (60 Hz)	10 (with line noise)	22	Moderate (risk of signal ringing)	Very Low
Adaptive LMS	5 (non-stationary noise)	15	Low (if reference appropriate)	High
No Filtering	10	10	None	None

Experimental Protocols

Protocol 3.1: Band-pass Filtering for Baseline EGM Cleanup

Objective: Remove out-of-band noise to isolate the cardiac signal of interest (typically 1-300 Hz).

Materials: Raw unipolar or bipolar EGM time-series data (sampled at ≥ 1 kHz). Software: MATLAB (Signal Processing Toolbox), Python (SciPy), or LabVIEW.

Method:

Specification: Define passband f_low = 1 Hz, f_high = 300 Hz. For atrial signals, consider f_low = 30 Hz.
Design: Use a 5th-order zero-phase Butterworth filter (to prevent phase distortion).
- In MATLAB: [b,a] = butter(5, [f_low f_high]/(fs/2), 'bandpass');
- In Python: from scipy.signal import butter, filtfilt; b, a = butter(5, [f_low, f_high], btype='band', fs=fs)
Application: Apply using forward-backward filtering (filtfilt).
Validation: Plot Power Spectral Density (PSD) pre- and post-filtering. Confirm attenuation outside passband.

Protocol 3.2: Notch Filtering for Powerline Interference

Objective: Attenuate 50/60 Hz line noise and its harmonics without distorting EGM morphology.

Method:

Detection: Perform FFT on a representative signal segment to confirm exact noise frequency (often 60.0 Hz ± 0.1 Hz).
Design: Use a 2nd-order IIR notch filter with a quality factor (Q) of 35.
- In MATLAB: wo = 60/(fs/2); bw = wo/35; [b,a] = iirnotch(wo, bw);
Application: Apply using filtfilt.
Validation: Inspect time-domain signal for removal of 60 Hz oscillation and check PSD for a clear notch.

Protocol 3.3: Adaptive Noise Cancellation for In-Band Artifacts

Objective: Remove noise (e.g., electromyographic) with frequency overlap with the cardiac signal.

Method:

Reference Signal: Obtain a noise reference, either from a separate accelerometer/EMG channel or derived from the primary signal (e.g., using a separate high-pass filtered version >100 Hz).
Algorithm Setup: Implement Normalized Least Mean Squares (NLMS) adaptive filter.
- Filter length (L): 32 taps.
- Step size (μ): 0.01 (normalized).
Iteration: Allow the filter weights to converge over a training segment (≥ 500 ms).
Output: The filter output is the "clean" EGM. The error signal is the noise estimate.
Validation: Compare the autocorrelation of the output signal with the raw; should show cleaner periodic peaks.

Visualization of Workflows

Title: Sequential EGM Preprocessing Filtering Workflow

Title: Adaptive Noise Cancellation System Block Diagram

The Scientist's Toolkit

Table 3: Essential Research Reagents & Solutions for EGM Filtering Experiments

Item Name	Function/Application in Protocol	Example Product/Specification
Programmable Electrophysiology Amplifier/DAQ	Acquire raw, high-fidelity intracardiac signals with adjustable gain. Essential for all protocols.	Intan RHD Series, ADInstruments PowerLab, Blackrock Microsystems CerePlex.
Ag/AgCl Electrodes (Epicardial or Intracardiac)	Provide stable, low-noise electrical interface for EGM recording.	Plastics One EEG/ECG electrodes, bipolar/multipolar EP catheters.
Physiological Saline (0.9% NaCl) or Krebs-Henseleit Solution	Maintain tissue viability during ex-vivo or animal model EGM recordings.	Sigma-Aldrich, prepared with 5.6 mM Glucose, gassed with 95% O2/5% CO2.
Signal Processing Software License	Implement and validate filtering algorithms.	MATLAB + Signal Processing Toolbox, Python (SciPy, NumPy, MNE-Python).
Synthetic EGM & Noise Dataset	Benchmark filter performance with known ground truth.	MIT-BIH Arrhythmia Database, simulated noisy EGMs (e.g., with added 50/60 Hz sinusoid, EMG noise).
Line Noise Simulator/Injector	Calibrate notch filters by introducing known interference.	Function generator (e.g., Rigol DG1022Z) coupled via a non-invasive transformer.
Computational Environment	Run adaptive filters in real-time or offline. Requires predictable timing.	Desktop with multicore CPU (Intel i7/equivalent), ≥16 GB RAM, Real-time OS extension (e.g., Ubuntu with PREEMPT_RT).

Within the broader thesis on electrogram (EGM) signal processing for machine learning (ML) feature extraction, the reproducibility and biological relevance of derived features depend critically on a standardized preprocessing workflow. Following initial denoising and filtering, Workflow 2 addresses the challenges of signal heterogeneity by implementing structured segmentation, temporal alignment, and amplitude normalization. This protocol details the application notes for these techniques to ensure consistent analysis across multi-electrode arrays, subjects, and experimental conditions for downstream ML model training in cardiac electrophysiology and drug development research.

Core Techniques: Application Notes

Segmentation

Segmentation isolates discrete physiological events from continuous EGM recordings. For ML, consistent event windows are essential for feature comparison.

Protocol: R-Peak and Activation Window Segmentation

Input: Filtered bipolar or unipolar EGM signals.
R-Peak Detection: Apply the Pan-Tompkins algorithm or a similar QRS detector to a surface ECG channel or a representative EGM channel.
- Algorithm parameters (e.g., refractory period, threshold) must be fixed for an entire dataset.
Activation Time (AT) Detection: For intracardiac EGMs, identify local activation within a search window (e.g., -30 ms to +50 ms) around the R-peak.
- Method: Use maximum -dV/dt for unipolar signals or maximum absolute amplitude for bipolar signals.
Segment Extraction: Extract a window of fixed duration around each fiducial point (R-peak or AT).
- Example Window: -50 ms to +150 ms relative to fiducial point.
- Segments containing noise or ectopic beats (detected via aberrant RR intervals) should be tagged and optionally excluded.

Table 1: Segmentation Algorithm Performance Metrics

Algorithm	Target	Sensitivity (%)	Positive Predictivity (%)	Computational Cost (ms/beat)
Pan-Tompkins	R-Peak	99.3	99.7	~1.2
Wavelet-Based	R-Peak	99.5	99.6	~4.8
Maximum -dV/dt	Unipolar AT	N/A	N/A	~0.5
Peak Bipolar	Bipolar AT	N/A	N/A	~0.3

Alignment

Temporal alignment corrects for small temporal jitter between recorded activations of the same event, ensuring features are compared at equivalent physiological phases.

Protocol: Dynamic Time Warping (DTW) for EGM Alignment

Input: Segmented EGM beats for a single channel across multiple cycles.
Template Selection: Select the median beat or a visually representative, noise-free beat as the template.
Warping Path Calculation:
- Compute a cost matrix between the template and a target beat.
- Find the optimal warping path that minimizes the cumulative distance, subject to step pattern constraints (e.g., Sakoe-Chiba band).
Application: Apply the derived warping path to the target beat to align its time axis to the template.
Iteration: Repeat for all beats and all channels. Alignment should be performed channel-wise.

Normalization

Normalization scales signal amplitudes to a common range, reducing inter-subject and inter-recording variability not attributable to the experimental condition.

Protocol: Baseline-Corrected Peak-to-Peak Normalization

Input: Aligned EGM segments.
Baseline Correction: For each segment, calculate the mean amplitude of a pre-activation baseline period (e.g., -50 ms to -10 ms prior to AT). Subtract this value from the entire segment.
Scale Calculation: Identify the absolute peak-to-peak amplitude of the baseline-corrected segment.
Normalization: Divide the entire baseline-corrected segment by the peak-to-peak amplitude. Resulting values typically range from -1 to 1.
- Alternative: Z-score normalization using the mean and standard deviation of the segment's baseline period may be used for certain spectral features.

Table 2: Impact of Normalization on Feature Variance

Feature	Raw Signal (Mean ± SD)	Post-Normalization (Mean ± SD)	% Reduction in SD
Peak Amplitude (mV)	2.5 ± 1.8	1.0 ± 0.1	94.4%
Integral (mV·ms)	45.3 ± 32.1	18.2 ± 2.3	92.8%
Duration at 50% (ms)	12.4 ± 3.1	12.4 ± 3.1	0%

Integrated Preprocessing Workflow Diagram

Title: EGM Preprocessing Workflow 2: Segmentation, Alignment, Normalization

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for EGM Preprocessing & Analysis

Item	Function in Workflow
High-Density Mapping System (e.g., Prucka Cardiolab, EP-Workmate)	Acquires raw, multichannel EGM and surface ECG signals with precise temporal synchronization.
Signal Processing Suite (MATLAB with Signal Processing Toolbox, Python SciPy/NumPy)	Provides algorithmic foundation for implementing custom segmentation, DTW, and normalization code.
Open-Source ECG Toolbox (e.g., WFDB Toolbox, BioSPPy)	Offers tested implementations of standard detectors (Pan-Tompkins) for validation and benchmarking.
Annotation Software (e.g., LabChart, Custom GUI)	Enables manual verification and correction of automated fiducial point (AT) detection.
Computational Environment (Jupyter Notebook, MATLAB Live Script)	Allows for interactive, step-by-step development and documentation of the preprocessing pipeline.

Experimental Validation Protocol

Title: Protocol for Validating Preprocessing Workflow Efficacy on Simulated and Clinical EGM Data

Objective: To quantify the reduction in signal variance and improvement in ML feature discriminability achieved by Workflow 2.

Materials:

Dataset A: Simulated EGM signals with known temporal jitter and amplitude variation.
Dataset B: Clinical high-density EGMs from 10 patients (atrial fibrillation ablation procedure).
Software: Custom Python/Matlab scripts implementing Workflow 2.

Methods:

Apply Workflow: Process both datasets through the sequential steps: Segmentation -> Alignment -> Normalization.
Quantify Variance: For Dataset A, measure the standard deviation of activation timing and peak amplitude before and after alignment/normalization.
Feature Extraction: From Dataset B, extract 5 common ML features (e.g., RMS voltage, dominant frequency, complexity index) from both raw and preprocessed signals.
Assess Discriminability: Using labeled regions (sinus rhythm vs. arrhythmia), calculate the Fisher Score or t-statistic for each feature pre- and post-processing to measure between-class separation.
Statistical Analysis: Perform paired t-tests on the variance metrics and discriminability indices.

Expected Outcome: A significant reduction in within-class variance and a significant increase in feature discriminability scores post-preprocessing, confirming the workflow's utility for robust ML feature preparation.

Within a broader thesis on electrogram (EGM) signal processing for deriving machine learning-ready features, this protocol addresses two critical preprocessing challenges: the removal of non-physiological artifacts (e.g., motion, pacing) and the suppression of far-field ventricular (FFV) signals from atrial EGMs. Clean atrial substrate characterization is paramount for applications in atrial fibrillation research, drug efficacy studies, and ablation target identification.

Core Signal Processing Algorithms & Quantitative Comparisons

Artifact Removal Methods

Artifacts are typically transient, high-amplitude, broad-spectrum disturbances.

Table 1: Comparative Performance of Artifact Removal Techniques

Method	Core Principle	Optimal Use Case	Atrial Signal Preservation (Reported SNR Improvement)	Computational Load
Template Subtraction	Average artifact waveform is subtracted from detected events.	Regular pacing artifacts, catheter knock.	High (8-12 dB)	Low
Wavelet Denoising	Thresholding of wavelet coefficients in artifact-dominated scales.	Non-stationary, sharp artifacts.	Moderate (6-10 dB)	Medium
Adaptive Filtering (RLS/NLMS)	Uses a reference channel (e.g., pacing signal) to predict & cancel artifact.	Reference-correlated artifacts.	High (10-15 dB)	High
Blank-and-Interpolate	Simple replacement of artifact-contaminated segments.	Simple, large-amplitude spikes.	Low (Potential signal loss)	Very Low

Far-Field Ventricular (FFV) Signal Cancellation

FFV signals represent ventricular depolarization (QRS) obscuring atrial electrograms.

Table 2: FFV Removal Algorithm Comparison

Algorithm	Key Inputs	Advantages	Limitations (Reported Residual FFV)
Independent Component Analysis (ICA)	Multi-channel EGMs (≥3).	Blind separation, no timing reference needed.	Channel count requirement, ordering ambiguity (≈15% residual).
Spatial Cancellation (e.g., V-subtraction)	A unipolar EGM and a coincident ventricular reference.	Intuitive, computationally simple.	Requires precise temporal alignment (<5% residual).
Adaptive Template Subtraction	Atrial EGM and QRS template from ventricular channel.	Effective for consistent FFV morphology.	Fails with variable conduction (≈10% residual).
Common Average Referencing	All electrodes on an array.	Reduces common-mode signals (FFV).	Also attenuates common-mode atrial signals.

Experimental Protocols

Protocol for Validation of Artifact Removal

Title: In-silico & In-vitro Validation of Artifact Filters

Materials:

Source Data: High-resolution atrial EGMs (e.g., from CARTO or Ensite systems) during sinus rhythm and pacing.
Artifact Simulation: Clean EGMs are synthetically contaminated with modeled pacing artifacts (monophasic/biphasic pulses) or motion artifact templates.
Ground Truth: The original, clean EGM segment.

Method:

Data Segmentation: Isolate episodes with and without artifacts. Annotate artifact onset/offset.
Algorithm Application: Apply each method from Table 1 to the contaminated signal.
Performance Quantification:
- Calculate Signal-to-Noise Ratio (SNR) before and after processing: SNR = 20*log10(RMS(signal) / RMS(noise)).
- Compute Root Mean Square Error (RMSE) between processed signal and the ground truth clean EGM.
- Visually inspect for atrial signal distortion (e.g., alteration of fractionated electrogram morphology).

Protocol for FFV Removal Efficacy Assessment

Title: Quantifying Atrial Substrate Revelation Post-FFV Cancellation

Materials:

Recordings: Simultaneous unipolar/bipolar atrial EGMs and a clear ventricular reference (e.g., surface ECG lead II or intracardiac RV electrogram).
Annotation: Precise fiducial markers for atrial (P-wave) and ventricular (R-wave) activations.

Method:

Alignment: Temporally align ventricular reference to atrial channels using cross-correlation.
FFV Cancellation: Apply chosen FFV removal algorithm (e.g., Spatial Cancellation): a. For each ventricular event, segment the corresponding FFV in the atrial EGM. b. Scale and subtract the ventricular reference template from the atrial channel. c. Interpolate the subtracted segment to maintain continuity.
Analysis:
- Amplitude Analysis: Measure peak-to-peak atrial EGM amplitude in the P-wave region before and after FFV removal.
- Spectral Analysis: Compute power spectral density (0-100 Hz) to observe reduction in ventricular-dominated frequencies (~5-20 Hz).
- Feature Stability: Calculate stability of machine learning features (e.g., Shannon entropy, dominant frequency) across consecutive cycles post-processing.

Visualization of Workflows

Title: Atrial EGM Preprocessing: Artifact & FFV Removal Pipeline

Title: Decision Workflow for Far-Field Ventricular Cancellation

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions for EGM Preprocessing Studies

Item / Solution	Function in Protocol	Example/Notes
High-Resolution Electrophysiology System	Acquisition of raw, multi-channel intracardiac EGMs.	Biosemi, EP-Workmate, CARTO 3. Provides digital data export (e.g., .txt, .mat).
Signal Processing Software Library	Implementation of algorithms (filtering, ICA, wavelet).	MATLAB with Signal Processing Toolbox, Python (SciPy, PyWavelets, MNE).
Synthetic EGM Generator	Creates ground truth data with controlled artifacts/FFV.	In-house or commercial simulators (e.g., MIT-BIH Arrhythmia Generator).
Pre-annotated Public EGM Database	For benchmarking and validation.	PhysioNet Computing in Cardiology Challenges data (e.g., 2020/2021 AF events).
Precision Timing Alignment Tool	Micro-adjustment of ventricular reference latency.	Cross-correlation peak detection algorithms with sub-sample interpolation.
Feature Extraction Suite	Quantifies outcome of preprocessing for ML.	Custom scripts for calculating complex fractionated atrial electrogram (CFAE) indices, organizational metrics.

This document details application notes and protocols for extracting time-domain and amplitude features from Electrogram (EGM) signals. This work is a foundational component of a broader thesis on EGM signal processing for machine learning-based cardiac electrophysiology research. The primary goal is to generate robust, quantifiable features that can discriminate between healthy and pathological tissue substrates, thereby enabling applications in drug efficacy testing, ablation target identification, and arrhythmia mechanism characterization.

Core Feature Definitions & Quantitative Summaries

Voltage-Based Features

Voltage features quantify the amplitude characteristics of the EGM, reflecting tissue viability and depolarization strength.

Table 1: Core Voltage-Domain Features

Feature Name	Mathematical Definition	Physiological Correlation	Typical Normal Range (Bipolar, Peak-to-Peak)	Pathological Threshold
Peak-to-Peak Voltage (Vpp)	( V_{pp} = \max(S(t)) - \min(S(t)) )	Tissue viability, mass of activating myocytes.	1.5 - 5.0 mV	< 0.5 mV (scar)
Root Mean Square Voltage (VRMS)	( V{RMS} = \sqrt{\frac{1}{N} \sum{i=1}^{N} S_i^2} )	Overall signal energy.	0.2 - 1.2 mV	< 0.1 - 0.15 mV
Peak Negative Voltage (Vmin)	( V_{min} = \min(S(t)) )	Local activation amplitude.	-0.5 to -2.5 mV	> -0.5 mV
Average Absolute Voltage (Vabs)	( V{abs} = \frac{1}{N} \sum{i=1}^{N}	S_i	)	Mean rectified amplitude.	0.1 - 0.8 mV	Context-dependent

Complexity & Fractionation Indices

These features describe the morphology and temporal fragmentation of the EGM, indicative of discontinuous, anisotropic conduction.

Table 2: Complexity & Fractionation Features

Feature Name	Calculation Protocol	Interpretation	Normal Value	High Fractionation Value
Number of Peaks (NP)	Count of local extrema exceeding noise threshold (±0.05 mV).	Direct measure of temporal fragmentation.	1-3	≥ 4
Short-Term Fractionation (STF)	( \frac{\text{NP}}{\text{EGM Duration (ms)}} )	Peaks per unit time.	< 0.1 peaks/ms	> 0.15 peaks/ms
Complex Fractionated Electrogram (CFE) Mean	Average interval between consecutive detected peaks.	Inverse of peak frequency.	> 120 ms	< 70 ms
CFE Standard Deviation	Std. dev. of inter-peak intervals.	Regularity of fractionation.	Low	High (irregular)
Shannon Entropy (SE)	( SE = -\sum pi \log2(p_i) ) for binned signal amplitudes.	Signal unpredictability & disorder.	Low (< 2.5)	High (≥ 3.0)

Experimental Protocols for Feature Extraction

Protocol: Acquisition & Preprocessing for Feature Engineering

Objective: Obtain clean, physiological EGM signals suitable for time-amplitude analysis. Materials: See "The Scientist's Toolkit" below. Procedure:

Signal Acquisition: Acquire bipolar EGMs from mapping system (e.g., CARTO, EnSite). Ensure contact force is stable (>5g). Sampling rate ≥ 1 kHz (recommended 2 kHz).
Bandpass Filtering: Apply a 4th-order Butterworth bandpass filter (30-500 Hz) to remove far-field activity and high-frequency noise.
Notch Filtering (Optional): Apply a 50/60 Hz notch filter if line noise is present.
Baseline Wander Removal: Apply a high-pass filter at 1 Hz or use polynomial/spline fitting and subtraction.
Signal Trimming: Isolate a 2-second window or specific number of beats. For beat-specific features, window around a fiducial point (e.g., V-peak in unipolar).
No Floor Estimation: Calculate noise floor from isoelectric segment. Define amplitude threshold as 3× RMS noise.
Output: Preprocessed EGM snippet ready for feature computation.

Protocol: Automated Computation of Fractionation Indices

Objective: Calculate NP, CFE Mean, and CFE Standard Deviation reproducibly. Input: Preprocessed EGM signal (S). Algorithm:

Peak Detection: a. Identify all local maxima and minima in S. b. Apply amplitude threshold: Discard extrema where |amplitude| < (0.05 mV OR 3× noise floor). c. Apply temporal threshold: Merge extrema occurring within a refractory period (e.g., 15 ms).
Peak Validation: Count the final set of validated peaks (NP).
Inter-Peak Interval (IPI) Calculation: Compute the time difference between consecutive peaks (maxima or minima).
CFE Metrics: a. CFE Mean: ( \text{CFE}{\text{mean}} = \frac{1}{M} \sum{j=1}^{M} IPIj ), where M is the number of intervals. b. CFE Standard Deviation: ( \text{CFE}{\text{SD}} = \sqrt{\frac{1}{M} \sum{j=1}^{M} (IPIj - \text{CFE}_{\text{mean}})^2 } ).
Output: NP, CFE Mean (ms), CFE SD (ms).

Visualizations

Title: EGM Feature Extraction Workflow

Title: Feature Engineering in Broader Thesis Context

The Scientist's Toolkit

Table 3: Essential Research Reagents & Materials

Item	Function in EGM Feature Research	Example/Specification
Clinical Electrophysiology System	Acquires raw, high-fidelity intracardiac EGMs.	CARTO 3 (Biosense Webster), EnSite Precision (Abbott).
High-Resolution Mapping Catheter	Provides the bipolar electrode pairs for EGM recording.	PentaRay (Biosense Webster), Advisor HD Grid (Abbott).
Signal Processing Software (Library)	Implements filtering, peak detection, and feature algorithms.	MATLAB Signal Processing Toolbox, Python (SciPy, NumPy).
Digital Filter Set	Removes noise and artifacts to isolate local EGM components.	Butterworth Bandpass (30-500 Hz), Notch (50/60 Hz).
Peak Detection Algorithm	Identifies local deflections for complexity analysis.	Custom script with amplitude/refractory thresholds.
Validation Phantom/Simulator	Bench-testing of feature accuracy using known signals.	ECG/EGM signal simulator with programmable complexity.
Database Management System	Stores raw signals, computed features, and patient metadata.	SQL database, MATLAB .mat structures, HDF5 files.

Within the broader thesis on Electrogram (EGM) signal processing for machine learning feature research, the extraction of robust, physiologically relevant features is paramount. While time-domain features capture amplitude and timing, they are insufficient for characterizing the complex, non-stationary nature of cardiac arrhythmias. Spectral and time-frequency features, derived from transformations like the Discrete Fourier Transform (DFT) and Wavelet Transforms, provide a critical lens into the frequency content and its temporal evolution. These features are hypothesized to be potent discriminators for substrate characterization, therapy efficacy assessment in drug development, and arrhythmia risk stratification in preclinical and clinical research.

Core Spectral & Time-Frequency Feature Definitions

Discrete Fourier Transform (DFT) & Derived Features

The DFT decomposes a finite-length EGM signal segment into its constituent sinusoidal frequency components. For a discrete signal x[n] of length N, the DFT X[k] is: X[k] = Σ_{n=0}^{N-1} x[n] * e^{-j(2π/N)kn}, for k = 0, 1, ..., N-1. From the power spectral density (PSD, S[k] = |X[k]|²), key features are extracted.

Table 1: Key Spectral Features from DFT/PSD

Feature	Mathematical Definition	Physiological Interpretation in EGM
Dominant Frequency (DF)	argmax_k (S[k])	The peak frequency of depolarization; high DF often indicates rapid, organized sources (e.g., rotor cores) or rapid focal activity.
Organizational Index (OI)	Σ_{k∈BW} S[k]² / (Σ_{k∈BW} S[k])²	Quantifies concentration of power; higher OI suggests more periodic, organized activity.
Spectral Concentration (SC)	Σ_{k=f1}^{f2} S[k] / Σ_{k=0}^{fNyq} S[k]	Fraction of power within a band (e.g., 4-9 Hz for AF); indicates prevalence of pathologic frequencies.
Spectral Entropy	- Σ_{k∈BW} p_k log₂(p_k) where p_k=S[k]/ΣS	Measure of spectral randomness; high entropy suggests disorganized, complex activation.
Normalized Power in Bands	P_{band} / P_{total}	Power in predefined bands (e.g., 0-2 Hz: slow, 2-8 Hz: medium, 8-20 Hz: fast).

Wavelet Transform & Time-Frequency Features

The Continuous Wavelet Transform (CWT) provides a time-frequency representation, crucial for non-stationary EGM analysis. CWT(a,b) = (1/√|a|) ∫ x(t) ψ((t-b)/a) dt, where *ψ is the mother wavelet, a is scale (inverse of frequency), and b is translation (time). Discrete Wavelet Transform (DWT) uses dyadic scaling for efficient decomposition into approximation (low-frequency) and detail (high-frequency) coefficients.

Table 2: Key Time-Frequency Features from Wavelet Analysis

Feature	Description	Application in EGM Analysis
Wavelet Energy per Band	Energy of DWT detail coefficients at each decomposition level.	Tracks shifts in spectral content over time (e.g., transient high-frequency bursts).
Wavelet Entropy	Entropy calculated from the relative energy distribution across wavelet scales.	Quantifies temporal stability of signal organization.
Ridge Extraction	Tracking the scale (frequency) of maximum CWT magnitude over time.	Identifies the instantaneous dominant frequency trajectory.
Time-Dependent Spectral Peak	The peak frequency in the CWT magnitude spectrum at each time point.	Maps focal accelerations or wavebreak occurrences.

Experimental Protocols for Feature Extraction

Protocol: DFT-Based Feature Extraction from Intracardiac EGMs

Objective: Compute standardized spectral features from unipolar or bipolar EGM recordings for substrate classification. Materials: See Scientist's Toolkit. Preprocessing Steps:

Signal Selection: Isolate a 4-second stable recording segment (avoiding pacing artifacts or far-field intervals).
Detrending: Apply a high-pass filter (cutoff: 0.5 Hz) or subtract a least-squares linear fit to remove baseline wander.
Windowing: Apply a Hanning window to the segment to mitigate spectral leakage.
Zero-Padding: Zero-pad the signal to the next power of two to increase frequency resolution. DFT Computation & Feature Extraction:
Compute the FFT (fast implementation of DFT) on the preprocessed segment.
Calculate the single-sided PSD. For sampling frequency Fs, the frequency vector resolves up to Fs/2.
Identify the Dominant Frequency (DF) as the frequency bin with the maximum PSD magnitude in the 3-20 Hz range (valid for atrial/ventricular arrhythmias).
Calculate Organizational Index (OI) and Spectral Entropy using the PSD values within the 3-20 Hz band.
Compute Normalized Power in the Slow (3-5 Hz), Medium (5-8 Hz), and Fast (8-20 Hz) bands. Output: A feature vector [DF, OI, Spectral Entropy, Pslow, Pmedium, P_fast] for each EGM segment.

Protocol: Time-Frequency Analysis Using the Continuous Wavelet Transform

Objective: Characterize the temporal evolution of spectral content in complex fractionated EGMs. Preprocessing: Follow steps 1-2 from Protocol 3.1. CWT Computation:

Mother Wavelet Selection: Choose the complex Morlet wavelet (cmor in MATLAB/Python's pywt) for an optimal balance between time and frequency localization.
Scale Setup: Define scales linearly corresponding to frequencies from 1 Hz to Fs/2. Use at least 128 scales.
CWT Execution: Compute the CWT, resulting in a complex matrix W(a,b). Feature Extraction:
Compute the scalogram (magnitude of W(a,b) squared).
Ridge Extraction: For each time point b, find the scale a that maximizes the scalogram magnitude. Convert scale to instantaneous frequency.
Statistical Summaries: Calculate the mean, standard deviation, and skewness of the instantaneous dominant frequency over the 4-second window.
Wavelet Entropy: Compute the total energy at each scale, normalize to a probability distribution, and calculate Shannon entropy. Output: A feature vector [Mean iDF, Std iDF, Skew iDF, Wavelet Entropy].

Title: Workflow for Spectral & Time-Frequency Feature Extraction from EGMs

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials & Tools for EGM Spectral Feature Research

Item/Category	Example Product/Solution	Function in Research
High-Fidelity Data Acquisition	ADInstruments PowerLab, Intan RHD Recording System	Provides low-noise, high-resolution (≥1 kHz sampling) analog-to-digital conversion of raw analog EGMs.
Signal Processing Software Library	MATLAB Wavelet Toolbox, Python (SciPy, PyWavelets, NumPy)	Platforms for implementing DFT, CWT/DWT, and custom feature extraction algorithms.
Mother Wavelet for CWT	Complex Morlet Wavelet (cmor)	Provides a good trade-off between time and frequency resolution for biological signals.
Spectral Analysis Plugin	LabChart Pro ECG Analysis Module, EMKA iox2	Commercial software offering built-in FFT and time-frequency analysis for rapid prototyping.
Validated Preprocessing Filters	Butterworth or Chebyshev IIR Digital Filters	Removes line noise (e.g., 50/60 Hz notch) and baseline wander without distorting signal content.
Reference Datasets	PhysioNet Computing in Cardiology Challenge Datasets, Custom Preclinical Porcine AF Models	Benchmarked, annotated EGM data for validating and comparing feature performance.

Application Notes for Drug Development & Research

Quantifying Anti-Arrhythmic Drug (AAD) Effects

Use Case: Assess acute electrophysiological effect of a novel AAD on atrial fibrillation substrate. Protocol Adaptation:

Baseline Recording: Acquire high-density epicardial or endocardial EGMs during induced AF in preclinical model.
Post-Dose Recording: Acquire EGMs at peak plasma concentration of the compound.
Feature Extraction: Apply Protocol 3.1 to multiple (e.g., 100) consecutive 4-second segments from both baseline and post-dose states.
Statistical Analysis: Perform paired statistical testing (e.g., Wilcoxon signed-rank) on extracted features (e.g., Dominant Frequency, Spectral Entropy). Expected Outcome: An effective AAD targeting atrial remodeling may significantly reduce Dominant Frequency and increase Organizational Index, indicating slowed and more organized activity.

Identifying Ablation Targets via Time-Frequency Signatures

Use Case: Use wavelet-based features to identify sites of persistent high-frequency drivers. Protocol Adaptation:

High-Density Mapping: Acquire EGMs from a grid/multi-electrode array during sustained arrhythmia.
Feature Mapping: For each electrode site, compute the Mean Instantaneous Dominant Frequency (from Protocol 3.2) and Wavelet Entropy.
Spatial Visualization: Create contour maps (feature maps) overlaid on anatomical geometry. Interpretation: Sites exhibiting persistently high Mean iDF with low Wavelet Entropy are candidate locations for stable rotational or focal sources.

Title: Integration of Spectral Features into EGM ML Research Pipeline

Application Notes and Protocols

Within a broader thesis on EGM signal processing for machine learning features research, quantifying signal complexity and organization is paramount for distinguishing pathological from physiological cardiac rhythms. Traditional linear features (e.g., amplitude, frequency) often fail to capture the intricate, non-linear dynamics of atrial and ventricular arrhythmias. This document details the application of non-linear and entropy-based features to intracardiac electrograms (EGMs) and surface ECGs.

1. Theoretical Foundation and Feature Definitions

Non-linear dynamics and information theory provide metrics to quantify the unpredictability, randomness, and complexity of a time series signal like an EGM.

Table 1: Key Non-Linear and Entropy-Based Features for EGM Analysis

Feature	Mathematical Basis	Physiological Interpretation (in EGM context)	Typical Value Range (Normal Sinus Rhythm vs. Fibrillation)
Sample Entropy (SampEn)	Negative natural logarithm of the conditional probability that two sequences similar for m points remain similar at the next point (m+1).	Measures signal irregularity. Lower values indicate more self-similarity/regularity.	NSR: Lower (e.g., 0.5-1.2). AF/VF: Higher (e.g., 1.5-2.5).
Multiscale Entropy (MSE)	SampEn calculated over multiple temporal scales via coarse-graining.	Assesses complexity across different time scales. Healthy systems show high complexity across scales.	NSR: Entropy remains relatively high across scales. AF/VF: Entropy decays rapidly with scale.
Detrended Fluctuation Analysis (DFA) α-exponent	Quantifies long-range power-law correlations in a non-stationary signal.	α ~0.5: white noise (e.g., VF). α ~1.0: 1/f noise (healthy). α ~1.5: Brownian noise.	NSR: α ~0.8-1.2. AF: α ~0.5-0.8. VF: α ~0.5.
Lyapunov Exponent (λ)	Average rate of separation of infinitesimally close trajectories in state space.	Quantifies sensitivity to initial conditions (chaos). Positive λ suggests chaotic dynamics.	NSR: Near zero or slightly negative. Sustained AF/VF: Positive (e.g., 0.05-0.3 bits/s).
Lempel-Ziv Complexity (LZC)	Estimates the number of distinct substrings and their rate of occurrence.	Measures complexity in terms of compressibility. More complex = less compressible.	NSR: Lower complexity (~0.1-0.3). AF/VF: Higher complexity (~0.4-0.7).

2. Experimental Protocol: Feature Extraction from High-Resolution EGMs

Objective: To compute a standardized panel of non-linear features from unipolar/bipolar intracardiac EGMs to classify arrhythmia substrates.

Materials & Reagents:

Electrophysiology Recording System: (e.g., Labsystem Pro, EP-Workmate) with bandwidth 0.05-500 Hz.
Catheter: Diagnostic electrophysiology catheter (e.g., duodecapolar, PentaRay).
Signal Acquisition: Analog-to-digital converter (ADC) with ≥ 1 kHz sampling rate (≥ 2 kHz recommended).
Reference Electrode: Surface ECG electrodes.
Software: MATLAB (with Signal Processing Toolbox) or Python (SciPy, NumPy, nolds, antropy packages).
Data: 60-second epochs of stable rhythm (e.g., Sinus Rhythm, Atrial Flutter, Atrial Fibrillation).

Protocol:

Signal Acquisition & Preprocessing:
- Acquire EGM signals from targeted cardiac chambers.
- Apply a 0.5-250 Hz bandpass filter to remove baseline wander and high-frequency noise.
- For bipolar EGMs, ensure consistent inter-electrode spacing and orientation.
- Downsample to a standardized sampling frequency (Fs, e.g., 1000 Hz) if necessary.
- Normalize the signal to zero mean and unit variance.

Epoch Selection:
- Visually inspect and select a 10-30 second artifact-free, stable rhythm segment.
- Avoid segments with catheter movement or far-field interference.
State-Space Reconstruction (for DFA, Lyapunov):
- Use time-delay embedding: For signal x(i), construct state vectors: Y(i) = [x(i), x(i+τ), ..., x(i+(m-1)τ)].
- Estimate delay (τ) using the first minimum of the mutual information function.
- Estimate embedding dimension (m) using the false nearest neighbors method.
Feature Computation:
- Sample Entropy: Use the entropy.sample_entropy function from the antropy Python package. Parameters: m=2, r=0.2 * (signal std. dev.).
- Multiscale Entropy: Coarse-grain the time series to scales 1-20. Compute SampEn at each scale.
- DFA: Integrate and detrend signal in windows of varying sizes. Calculate scaling exponent α from the log-log plot of fluctuation vs. window size.
- Lempel-Ziv Complexity: Binarize the signal (values above median = 1, below = 0). Compute normalized LZC using standard algorithm.
Validation & Statistical Analysis:
- Compute features for a cohort (e.g., n=20 patients per rhythm type).
- Perform Kruskal-Wallis test with post-hoc Dunn's test to identify significant (p<0.05) inter-group differences.
- Use principal component analysis (PCA) to visualize feature separability.

3. The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials and Computational Tools

Item	Function in EGM Complexity Research
High-Density Mapping Catheter (e.g., Advisor HD Grid)	Provides dense, spatially coherent EGM data essential for analyzing organizational gradients.
Open-Source Python Library: `antropy`	Provides optimized, clinically validated implementations of SampEn, Permutation Entropy, LZC, and DFA.
Custom MATLAB `lyapunovExponent` Script	Implements Rosenstein's algorithm for estimating the largest Lyapunov exponent from short, noisy EGM data.
Clinical EP Database (e.g., CU Ventricular Tachyarrhythmia Database)	Provides validated, annotated EGM/ECG signals for benchmarking new features.
Phase Mapping Software Module	Converts voltage-time signals into phase-time signals, enabling analysis of rotor and wavefront dynamics via entropy.

4. Workflow and Pathway Visualizations

Diagram Title: Non-Linear Feature Extraction & ML Classification Workflow

Diagram Title: Position within Broader EGM Feature Engineering Thesis

This document serves as an Application Note within a broader thesis research program focused on developing novel electrophysiological biomarkers from intracardiac electrogram (EGM) signals. The core challenge is the transformation of processed, feature-rich EGM data into structured vector representations suitable for downstream machine learning (ML) analysis. This protocol details the standardization of this critical step for both supervised (e.g., classification of arrhythmia substrates) and unsupervised (e.g., patient phenotyping) learning tasks in cardiac drug development and basic electrophysiology research.

Key Data Features & Vector Representation Schema

Processed EGM data yields a multi-dimensional set of features. The following table categorizes common feature classes and their typical scalar outputs for vector construction.

Table 1: Feature Classes from Processed EGM Signals for ML Vectorization

Feature Class	Example Features	Description	Typical Dimension (per EGM)	Vector Component Prefix
Temporal	Activation Time, Segment Duration (e.g., fractionated interval)	Timings of key signal events or intervals.	3-10 scalars	`T_`
Amplitude	Peak-to-Peak Voltage, Local Mean Amplitude	Voltage magnitude measurements.	2-5 scalars	`A_`
Spectral	Dominant Frequency, Shannon Spectral Entropy	Frequency-domain and complexity metrics.	3-7 scalars	`F_`
Morphological	Correlation Coefficients, Wavelet Coefficients, Principal Components	Shape descriptors comparing to a template or using decomposition.	5-20+ scalars	`M_`
Non-linear Dynamics	Lyapunov Exponent, Sample Entropy	Measures of signal predictability and chaos.	2-4 scalars	`N_`
Signal Quality	Signal-to-Noise Ratio, Baseline Wander Index	Metrics assessing recording fidelity.	2-3 scalars	`Q_`

Protocol: Constructing the Consolidated Feature Vector (CFV)

Objective: To aggregate all extracted features from one or more EGMs into a single, consistently ordered numerical vector.

Procedure:

Feature Selection: For a given experiment, define the exact set of n features to be used (e.g., T_act_time, A_peak_peak, F_dom_freq, M_corr_coef_1).
Normalization: Apply a standard scaling method to each feature across the entire dataset to mitigate bias from differing units and scales.
- Z-score Normalization: x_norm = (x - μ) / σ (Recommended for Gaussian-like distributions).
- Min-Max Scaling: x_norm = (x - min(x)) / (max(x) - min(x)) (Recommended for bounded features).
Concatenation: Define a fixed order for the n normalized features (e.g., all Temporal, then Amplitude, then Spectral). The CFV for a single EGM recording is then: CFV_egm = [f1_norm, f2_norm, ..., fn_norm].
Multi-EGM & Multi-Channel Aggregation: For analyses involving multiple EGMs (e.g., from a catheter with 10 electrodes) or time-series of beats:
- Option A (Pooled): Concatenate CFVs from all sources into one large vector (length = nfeatures * nsources).
- Option B (Summarized): Calculate statistics (mean, standard deviation, max) across the CFVs from each source, then concatenate these statistics.

Output: A 2D matrix X of dimensions [n_samples, n_features] for input into ML algorithms.

Experimental Workflow: From Raw EGM to ML Input

Protocol Title: Integrated Workflow for EGM Feature Vectorization

Materials & Setup:

Raw EGM data from clinical EP study or preclinical animal model.
Signal processing software (e.g., custom Python/Matlab scripts, LabChart, EMKA).
Computing environment with Python (NumPy, SciPy, Scikit-learn) or equivalent.

Methodology:

Signal Preprocessing: Apply bandpass filtering (30-300 Hz for bipolar EGMs), notch filtering (line noise), and baseline correction to raw signals.
Segmentation & Annotation: Isolate individual beats or time windows of interest. Annotate key fiducial points (e.g., activation time) manually or via automated detector.
Feature Extraction: For each segment, compute all features listed in the defined schema (Table 1).
Data Structuring: Compile features into a structured table (e.g., Pandas DataFrame) where rows are samples and columns are features. Include metadata columns (e.g., PatientID, ArrhythmiaType for supervised learning).
Vectorization Pipeline: Apply the CFV construction protocol (Section 2.1) to the feature table, producing matrix X.
Target Vector Definition (For Supervised Learning): Create vector y containing class labels or continuous values corresponding to each sample in X.
Train-Test Split: Partition [X, y] into training and hold-out test sets (e.g., 80/20 split) before any model development to avoid data leakage.

Diagram 1: EGM data processing and vectorization workflow (67 chars)

Application: Protocol for Unsupervised Phenotype Discovery

Objective: To identify novel patient/ substrate clusters based solely on EGM feature patterns.

Protocol:

Data Preparation: Construct matrix X from a cohort of patients using the workflow in Section 3. Omit any disease label metadata from X.
Dimensionality Reduction (Optional but Recommended): Apply Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) to X to reduce to 2-10 principal components for visualization and noise reduction, yielding X_reduced.
Clustering Algorithm Application: Apply an unsupervised algorithm to X_reduced (or X).
- K-Means Clustering: Specify expected number of clusters k. Use elbow method on Within-Cluster-Sum-of-Squares to infer k.
- Hierarchical Clustering: Creates a dendrogram. Cut tree to form clusters.
- DBSCAN: Density-based; good for identifying outliers.
Cluster Validation & Interpretation: Evaluate cluster stability (silhouette score). Characterize each cluster by the mean feature values of its members to define the electrophysiological "phenotype."

Diagram 2: Unsupervised phenotyping workflow using EGM features (66 chars)

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for EGM Feature Engineering & ML Vectorization

Item / Solution	Function in Workflow	Example / Specification
High-Fidelity Data Acquisition System	Records raw, low-noise intracardiac signals with precise timing.	Prucka CardioLab, EP-Workmate, ADInstruments PowerLab.
Biophysical Signal Processing Suite	Performs essential filtering, segmentation, and foundational feature extraction.	MATLAB Signal Processing Toolbox, Python (SciPy, Biosppy), EMKA Analytics.
Domain-Specific Feature Library	Custom codebase for calculating advanced EGM features (e.g., non-linear dynamics, complex fractionation indices).	Custom Python/Matlab modules implementing published algorithms.
Normalization & Scalering Library	Standardizes feature scales for stable ML performance.	`sklearn.preprocessing.StandardScaler`, `MinMaxScaler`.
Structured Data Container	Holds features, metadata, and labels in a unified, programmatically accessible format for vectorization.	Pandas DataFrame (Python), R Data Frame, MATLAB Table.
Dimensionality Reduction Toolkit	Reduces feature space for visualization, clustering, and combating the "curse of dimensionality."	`sklearn.decomposition.PCA`, `sklearn.manifold.TSNE`.
ML Algorithm Frameworks	Implements supervised classifiers and unsupervised clustering algorithms.	Scikit-learn, TensorFlow/PyTorch (for deep learning).
Validation & Metrics Package	Quantifies ML model performance or cluster quality.	`sklearn.metrics` (accuracy, silhouette score).

Navigating Pitfalls: Solving Common EGM Processing Challenges for Robust ML Features

Within research focused on extracting machine learning (ML) features from electrogram (EGM) signals, signal quality is the foundational determinant of model robustness. Poor quality data segments can introduce noise-confounded features, leading to biased or non-generalizable ML models. This document provides application notes and protocols for diagnosing signal quality issues and establishes a decision framework for choosing between segment re-processing and discard, critical for constructing reliable training datasets in therapeutic development.

Quantitative Metrics for Signal Quality Assessment

The following metrics, calculated on a per-segment basis, provide objective criteria for quality assessment. Thresholds are derived from current literature and empirical studies in electrophysiology research.

Table 1: Key Quantitative Metrics for EGM Signal Quality Assessment

Metric	Formula / Description	Typical Optimal Range	Threshold for Poor Quality	Primary Diagnostic Indication
Signal-to-Noise Ratio (SNR)	10 log₁₀(Psignal / Pnoise)	> 20 dB	< 15 dB	Low signal amplitude or high broadband noise.
Baseline Wander Index (BWI)	Std. dev. of low-pass filtered (< 1 Hz) signal	< 0.05 mV	> 0.1 mV	Drift, respiration artifact, poor electrode contact.
Peak Spectral Density (PSD) Ratio	PSD in EGM band (40-250 Hz) / PSD in line-noise band (58-62 Hz or local equivalent)	> 10	< 3	Significant 50/60 Hz mains interference.
Fraction of Saturated Samples	(Count(sample = ±ADC range) / Total samples) * 100	< 0.1%	> 5%	Over-amplification, clipping, motion artifact.
Normalized Amplitude Range	(Max – Min) / Median Absolute Deviation	5 – 50	> 100 or < 2	Outliers, electrode pop, or extremely low amplitude.

Diagnostic Workflow & Decision Protocol

The following logical workflow guides the researcher from raw segment assessment to the final decision.

Diagram Title: EGM Segment Quality Decision Workflow

Detailed Re-processing Protocols

Protocol for 50/60 Hz Line Noise Removal

Objective: Attenuate narrowband mains interference without distorting EGM components.

Notch Filter Application: Apply a zero-phase IIR notch filter (e.g., Butterworth) at the mains frequency (50 or 60 Hz). Use a narrow bandwidth (Q > 30).
Validation: Compute the PSD Ratio (Table 1) on the filtered segment. Verify attenuation in the target band with minimal impact on adjacent EGM frequencies (40-250 Hz).
Alternative - Adaptive Subtraction: For variable frequency noise, use a reference channel or an adaptive LMS filter to model and subtract the interference.

Protocol for Baseline Wander Correction

Objective: Remove low-frequency drift (< 1 Hz) to restore isoelectric baseline.

Estimation: Fit a low-order polynomial (order 3-5) or a spline function to the local minima (or median-filtered signal) of the raw segment.
Subtraction: Subtract the estimated baseline trend from the original signal.
Validation: Calculate the BWI on the corrected segment. Ensure high-frequency EGM components are not altered.

Protocol for High-Frequency Noise Suppression

Objective: Reduce broadband myoelectric or environmental noise.

Analysis: Inspect PSD to identify noise bandwidth. If noise is outside the primary EGM band of interest (e.g., >300 Hz for bipolar EGMs), apply a zero-phase low-pass filter with a conservative cutoff (e.g., 250-300 Hz).
Wavelet Denoising: For in-band noise, use wavelet transform (e.g., Daubechies 4). Apply soft thresholding to detail coefficients, then reconstruct.
Validation: Assess SNR improvement. Visually confirm preservation of key depolarization morphology.

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions for EGM Signal Processing

Item / Solution	Function in EGM Research	Example / Specification
High-Fidelity Data Acquisition System	Converts analog cardiac potentials to digital signals with minimal distortion.	Multi-channel systems (e.g., Prucka CardioLab, EP-Workmate) with ≥ 16-bit ADC and sampling rate ≥ 1 kHz.
Clinical-Grade Electrodes & Catheters	Ensures stable, low-impedance contact with cardiac tissue for signal pickup.	Sterile, irrigated or non-irrigated diagnostic catheters (e.g., DECANAV, Advisor HD Grid).
Digital Signal Processing (DSP) Library	Provides validated algorithms for filtering, transformation, and analysis.	Python: SciPy, NumPy, PyWavelets. MATLAB: Signal Processing Toolbox, Wavelet Toolbox.
Reference Signal Database	Curated set of labeled EGM segments for validating processing pipelines and ML features.	Publicly available datasets (e.g., PhysioNet's AFDB, MIT-BIH Arrhythmia) or proprietary institutional libraries.
Annotation & Analysis Software	Enables manual review, labeling, and feature measurement from processed signals.	Custom MATLAB/Python GUIs, or commercial software (e.g., LabChart, EMKA).

Experimental Protocol: Validating Feature Stability Post-Re-processing

This protocol is critical for ML research to ensure re-processing does not artificially alter clinically relevant features.

Aim: To compare the stability of key ML-derived features (e.g., fractionated interval, dominant frequency, organization index) before and after application of re-processing steps.

Segment Selection: Randomly select 100 segments of varying initial quality from an EGM database.
Baseline Feature Extraction: Compute target features from raw segments.
Targeted Re-processing: Apply appropriate correction protocols from Section 4 based on each segment's diagnostic profile.
Post-processing Feature Extraction: Compute the same features from the corrected segments.
Statistical Analysis: For each feature, perform Bland-Altman analysis and calculate the Intraclass Correlation Coefficient (ICC) between pre- and post-processing values.
Acceptance Criterion: A feature is deemed "stable" if the 95% limits of agreement are within ±10% of the feature's dynamic range and ICC > 0.9. Segments where correction leads to unstable features should be discarded.

Diagram Title: Feature Stability Validation Protocol

A systematic, metric-driven approach to diagnosing EGM signal quality is non-negotiable for robust ML feature research. Re-processing is justified for correctable, non-physiological artifacts (line noise, wander), while segments with fundamental corruption (saturation, loss of contact) must be discarded to preserve dataset integrity. The provided protocols and validation framework ensure that the resulting features accurately reflect underlying cardiac electrophysiology, thereby supporting the development of reliable ML models for drug and device development.

Electrogram (EGM) signal processing is a cornerstone of modern electrophysiology research and drug development. The increasing reliance on machine learning (ML) to extract diagnostic and prognostic features from EGM data is challenged by significant data heterogeneity. This heterogeneity stems from variations across multiple clinical centers, recording device manufacturers and models, and inconsistent gain settings during acquisition. This Application Note, framed within a broader thesis on EGM signal processing for ML feature research, provides detailed protocols and strategies to manage this heterogeneity, ensuring robust, generalizable ML model development.

Quantifying the Heterogeneity Challenge

The table below summarizes key sources of heterogeneity and their measurable impact on EGM signal characteristics, based on current literature and device specifications.

Table 1: Primary Sources and Impact of EGM Data Heterogeneity

Heterogeneity Source	Specific Variables	Typical Impact on Raw Signal	Quantifiable Metric Range (Example)
Multi-Center	Skin preparation, electrode type/placement, ambient noise, SOP variations.	Baseline wander (0.1-5 Hz), power-line interference (50/60 Hz), amplitude scaling.	SNR variation: 15 dB to 30 dB.
Multi-Device	Analog front-end bandwidth, sampling frequency, ADC resolution, filter roll-off.	Spectral content alteration, amplitude saturation, aliasing.	Bandwidth: 100-1000 Hz; Sampling: 256 Hz - 2 kHz; ADC: 12-24 bits.
Variable Gain	Manual or automatic gain control (AGC) settings during recording.	Global amplitude scaling, clipping, altered noise floor.	Amplitude scaling factor: 0.5x to 100x.

Core Strategies and Application Protocols

Strategy: Universal Signal Preconditioning

This foundational protocol aims to bring all raw signals to a common baseline before feature extraction.

Protocol 1.1: Standardized Preprocessing Workflow

Input: Raw EGM time-series from any source.
Resampling: Use polyphase anti-aliasing filtering to resample all signals to a unified frequency (e.g., 1 kHz). Tool: SciPy resample_poly.
Gain Normalization: Apply amplitude-based normalization. Compute the robust signal amplitude (e.g., median absolute deviation, MAD) for each channel. Scale the entire signal by 1 / (MAD + ε).
Bandpass Filtering: Apply a zero-phase Butterworth filter (order 4) with cutoff frequencies of [3 Hz, 150 Hz] to remove extreme low/high-frequency artifacts while preserving EGM components.
Powerline Noise Removal: Apply a 50/60 Hz notch filter (Q=30) or use spectral interpolation.
Output: Preconditioned EGM signal ready for downstream processing or feature extraction.

Diagram Title: Standardized EGM Signal Preconditioning Workflow

Strategy: Device-Specific Transfer Functions & Digital Twins

To counteract device-specific filtering, create digital inverse filters or device twins.

Protocol 2.1: Characterizing and Inverting Device Transfer Function

Stimulus: Record a known calibrated input (e.g., square wave, white noise) on each device model.
Estimation: Compute the empirical transfer function (ETF) using Welch's method between the known input and the recorded output.
Modeling: Fit a stable digital filter (e.g., FIR using least-squares) to approximate the inverse of the ETF.
Application: Apply the derived inverse filter to clinical EGM recordings from that specific device to "standardize" its spectral profile towards a reference.

Diagram Title: Device Transfer Function Harmonization Process

Strategy: Data-Centric Feature Engineering

Develop features that are intrinsically robust to residual heterogeneity.

Protocol 3.1: Extraction of Invariant Morphological Features

Input: Preconditioned EGM signal from Protocol 1.1.
Fiducial Point Detection: Use a prominent peak detector (e.g., based on amplitude threshold) to identify a reference point (R_peak) for each complex.
Cycle Alignment: Segment signal into windows around each R_peak and align using dynamic time warping (DTW) or cross-correlation.
Feature Calculation:
- Normalized Amplitude: (Peak - Baseline) / (Global MAD from Protocol 1.1).
- Time-Derivative Features: Compute first derivative; its extremum is velocity. Normalize by the signal's energy.
- Area Under Curve (AUC): Calculate AUC for the segmented complex, then normalize by the segment duration and the robust amplitude.
- Non-Linear Energy: Compute Teager-Kaiser Energy Operator output, then normalize by the mean energy of the segment.

Table 2: Heterogeneity-Robust Feature Set

Feature Category	Specific Feature	Calculation Method	Robustness Rationale
Temporal	Normalized Complex Duration	Duration / Median Cycle Length	Mitigates heart rate variability.
Morphological	Normalized Amplitude	(Peak - Baseline) / MAD	Invariant to linear gain scaling.
Spectral	Spectral Entropy	Shannon entropy of PSD	Describes shape, not absolute power.
Fractional	Dominant Frequency Ratio	LF Power (3-15Hz) / Total Power	Relative measure, device-agnostic.

Validation Protocol: Leave-One-Center-Out (LOCO) ML Testing

Protocol 4.1: Rigorous Generalizability Assessment

Dataset Partition: Split data by recording center or device manufacturer.
Model Training: Train an ML model (e.g., Random Forest, CNN) on all but one partition.
Testing: Evaluate the trained model exclusively on the held-out partition.
Iteration & Metric: Repeat for all partitions. Report mean and standard deviation of performance metrics (AUC, F1-score) across all folds. A low standard deviation indicates successful heterogeneity management.

Diagram Title: Leave-One-Center-Out (LOCO) Validation Schema

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials and Digital Tools for EGM Heterogeneity Research

Item / Solution	Function / Purpose	Example Product / Library
Biophysical Signal Simulator	Generates ground-truth EGM signals with programmable parameters for controlled validation.	MathWorks Simscape Electrical, Python: NeuroKit2 `ecg_simulate`.
Programmable Data Acquisition System	Records calibrated inputs to characterize real device transfer functions.	Intan Technologies RHD USB Interface Board, Texas Instruments ADS129x Series EVM.
Digital Signal Processing Library	Provides standardized, optimized implementations of filters, resamplers, and feature extractors.	Python: SciPy, PyWavelets, BioSPPy. MATLAB: Signal Processing Toolbox.
Dynamic Time Warping (DTW) Algorithm	Aligns EGM complexes of non-uniform duration before feature extraction.	Python: `dtw-python`, `tslearn.metrics.dtw`. R: `dtw` package.
Synthetic Data Augmentation Tool	Artificially introduces controlled heterogeneity (noise, gain drift, filter effects) to expand training data.	Python: `Augmenty`, custom scripts using NumPy.
ML Framework with Explainability	Trains models and provides feature importance to identify which features generalize best.	Python: scikit-learn, PyTorch, TensorFlow, with SHAP or LIME.

Within the thesis on EGM signal processing for ML feature research, the class imbalance problem is a critical bottleneck. When developing models to detect rare events—such as specific ablation targets in atrial electrograms (EGMs) or sporadic arrhythmia episodes like ventricular tachycardia (VT) in Holter data—the scarcity of positive samples severely biases models toward the majority class (normal sinus rhythm). This application note details current techniques and protocols to address this imbalance, ensuring robust, generalizable models for clinical and drug development applications.

The following table summarizes the performance and characteristics of primary techniques used to handle class imbalance in cardiac electrophysiology ML, based on recent literature (2023-2024).

Table 1: Comparative Analysis of Imbalance Handling Techniques for EGM-based Arrhythmia Detection

Technique Category	Specific Method	Reported Best-Case F1-Score (Minority Class)	Key Advantage	Primary Risk	Computational Cost
Data-Level	Synthetic Minority Over-sampling (SMOTE)	0.78	Generates plausible synthetic EGM beats	May create noisy samples in high-dimensions	Medium
Data-Level	Adaptive Synthetic Sampling (ADASYN)	0.81	Focuses on difficult-to-learn samples	Can over-amplify borderline outliers	Medium-High
Algorithm-Level	Cost-Sensitive Learning	0.83	Directly embeds clinical cost of misclassification	Requires careful cost matrix tuning	Low
Algorithm-Level	Focal Loss (Adaptation)	0.85	Down-weights easy negatives automatically	Hyperparameter (γ) sensitivity	Low
Hybrid	SMOTE + Ensemble (SMOTEBoost)	0.87	Combines data generation and algorithmic focus	Risk of overfitting with small datasets	High
Novel Architecture	Deep Metric Learning (Triplet Loss)	0.82	Learns robust embeddings for rare classes	Requires careful triplet mining	High
Signal Augmentation	Physiologically-Informed Augmentation (e.g., time-warping)	0.79	Preserves underlying electrophysiology	May not cover full pathological spectrum	Medium

Experimental Protocols

Protocol 1: Implementing Cost-Sensitive Random Forest for Scarce Ablation Target Detection

Objective: To train a classifier for identifying localized micro-reentrant circuits in high-density atrial EGM maps where targets comprise <2% of data segments.

Materials:

High-density (256-electrode) atrial EGM recordings (5 patients, persistent AF).
Labeled dataset: [Normal: 98,500 segments, Ablation Target: 1,500 segments].
Computing environment: Python 3.9+, scikit-learn 1.3, imbalanced-learn 0.11.

Procedure:

Feature Extraction: For each 2-second EGM segment, extract 45 features (time-domain: voltage, slew rate; frequency-domain: dominant frequency, organizational index; phase-domain: entropy).
Train/Test Split: Perform a patient-stratified split: 4 patients for training, 1 patient for testing.
Cost Matrix Definition: Define misclassification cost matrix in consultation with electrophysiologists:
- False Negative (miss target): Cost = 10
- False Positive (ablate normal tissue): Cost = 3
- Correct classifications: Cost = 0
Model Training: Train a Random Forest classifier (n_estimators=500) with class_weight='balanced_subsample' and implement custom cost-sensitive pruning during tree construction to minimize total expected cost.
Validation: Use 5-fold cross-validation on the training set, prioritizing Minimum Cost as the primary metric instead of accuracy.
Evaluation: Report on test set: Cost-Sensitive Error, Precision-Recall AUC, and Specificity at 95% Sensitivity.

Protocol 2: Hybrid SMOTE-Ensemble for Rare Ventricular Arrhythmia Detection

Objective: To detect rare non-sustained VT episodes (<0.5% prevalence) in 24-hour ambulatory ECG/EGM recordings.

Materials:

48-hour ambulatory EGM datasets from implantable loop recorders (ILR), 200 subjects.
Labeled hourly segments: [Normal/SVT: 47,000, Rare VT: 230].
Tool: imbalanced-learn for SMOTE, XGBoost for ensemble.

Procedure:

Preprocessing: Bandpass filter (0.5-40 Hz), R-peak detection, segment into 10-beat windows centered on R-peak.
Dimensionality Reduction: Apply Principal Component Analysis (PCA) to morphological features, retain 95% variance before synthesis to mitigate SMOTE's "curse of dimensionality."
Stratified Synthetic Sampling: Apply SMOTE only to the training fold within each cross-validation split, preventing data leakage. Oversample minority class to 15% prevalence (not 50%).
Ensemble Training: Train an XGBoost model with modified objective 'binary:logistic' and scaleposweight parameter set to inverse class ratio. Use early stopping based on validation log loss.
Threshold Tuning: Post-training, adjust decision threshold on validation set to maximize Geometric Mean (G-Mean) of sensitivity and specificity.
Final Assessment: Evaluate on the held-out test set using metrics robust to imbalance: Area Under the Precision-Recall Curve (AUPRC) and F2-Score (emphasizing recall).

Visualization of Methodologies

Title: Hybrid SMOTE & Cost-Sensitive Training Workflow

Title: Metric Learning with Triplet Loss for Rare Event Embedding

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials and Tools for Imbalanced EGM ML Research

Item Name / Solution	Supplier / Library	Primary Function in Protocol	Key Consideration
imbalanced-learn 0.11.0	Scikit-learn Consortium	Provides implemented resampling (SMOTE, ADASYN) and ensemble methods.	Ensure version compatibility with base sklearn.
XGBoost 1.7+	DMLC	Gradient boosting ensemble with native `scale_pos_weight` for imbalance.	GPU acceleration recommended for large EGM datasets.
WFDB Toolbox 5.0	PhysioNet	Reading, writing, and processing EGM/ECG signals from standard databases.	Critical for reproducible data ingestion.
PyTorch Lightning	Lightning AI	Structuring deep learning code (e.g., for metric learning) for clarity and reproducibility.	Abstracts boilerplate, aids in multi-GPU training.
Custom Cost Matrix	Researcher-Defined	Quantifies clinical risk of different error types (FN vs FP).	Must be developed in direct consultation with clinical partners.
Synthetic Patient Generator (e.g., FECGSYN)	Open-Source Simulators	Generates physiologically-plausible synthetic EGM for extreme augmentation.	Validate synthetic feature distribution matches real data.
MLflow / Weights & Biases	Open Source / Commercial	Tracks hyperparameters, metrics, and models across hundreds of imbalance-mitigation experiments.	Essential for managing the large hyperparameter search space.

Within the broader thesis on Electrogram (EGM) signal processing for machine learning feature research, the optimization of preprocessing hyperparameters is a critical, task-specific step. Raw EGM signals are contaminated by noise and artifacts; the selection of filter cut-off frequencies and segmentation window parameters directly controls the quality of derived features for downstream arrhythmia classification or drug effect quantification. This document provides application notes and protocols for systematically tuning these hyperparameters to maximize signal fidelity and feature robustness for specific experimental or clinical tasks in cardiac research and drug development.

The Scientist's Toolkit: Key Research Reagent Solutions

The following table details essential materials and computational tools for EGM hyperparameter tuning experiments.

Item Name	Function/Brief Explanation
High-Density Mapping System (e.g., Prucka CardioLab, Rhythmia)	Acquires raw, unprocessed intracardiac electrogram (EGM) signals. Provides the fundamental data substrate.
Programmable Bio-Amplifier (e.g., from ADInstruments, Neuralynx)	Allows real-time application of hardware filters for initial noise reduction before digital processing.
Digital Signal Processing Suite (e.g., MATLAB with Signal Processing Toolbox, Python SciPy/NumPy)	Core software environment for implementing and testing digital filters, segmentation algorithms, and feature extraction.
Reference Annotated EGM Database (e.g., from PhysioNet, proprietary lab datasets)	Gold-standard labeled data (e.g., activation times, arrhythmia type) required for supervised tuning and validation.
Computational Environment (e.g., Jupyter Notebook, MATLAB Live Script)	Enables reproducible scripting of the hyperparameter search workflow and data visualization.
Feature Extraction Library (Custom or Toolbox e.g., BioSPPy)	Codebase to calculate ML features (e.g., complexity, frequency domain, amplitude) from segmented waveforms.

Core Hyperparameter Definitions & Quantitative Benchmarks

Filter Cut-off Frequency Ranges

Appropriate bandpass filtering is essential to isolate the physiological EGM component (typically 30-300 Hz) from low-frequency motion artifact and high-frequency noise.

Table 1: Standard and Task-Specific Filter Cut-off Recommendations

Signal Type / Research Task	Recommended Bandpass Cut-offs (Hz)	Primary Noise Target	Rationale
Standard Bipolar EGM (Activation Mapping)	High-pass: 16-30; Low-pass: 250-500	Low: Drift; High: Electrosurgical/EMI	Balances signal stability with component preservation.
Unipolar EGM (Fractionation Analysis)	High-pass: 0.5-1; Low-pass: 250-300	Low: ST-Segment; High: EMI	Preserves very low-frequency components critical for far-field assessment.
Atrial Fibrillation EGMs	High-pass: 30-40; Low-pass: 240-300	Low: Ventricular Far-Field	Aggressively removes ventricular far-field signals.
EGMs for Drug Effect on Repolarization	High-pass: 0.5-2; Low-pass: 100-150	Low: Baseline Wander; High: Myocyte Depolarization	Isolates lower-frequency repolarization phase.

Segmentation Window Parameters

Windowing defines the epoch for feature calculation and must align with the physiological event of interest.

Table 2: Segmentation Window Strategies

Segmentation Basis	Window Length & Alignment	Key Application
Fixed Duration around Annotation	e.g., [-50ms, +100ms] around activation	Stable, periodic rhythms; activation feature analysis.
Adaptive to Cycle Length	e.g., 70-80% of local CL	Atrial fibrillation or tachyarrhythmias with variable CL.
Sliding Window for Continuous Analysis	e.g., 500ms window, 50ms step	Detection of transient events or continuous trend analysis.
R-Peak / Activation Triggered	From detection point to next detection point	Beat-to-beat variability and morphology comparison.

Experimental Protocol for Systematic Hyperparameter Tuning

Protocol: Grid Search for Filter-Window Optimization

Objective: To determine the optimal pair of bandpass cut-offs and segmentation window length for maximizing the classification accuracy of atrial tachycardia (AT) vs. sinus rhythm (SR) using EGM morphology features.

Materials:

Dataset of 1000 annotated EGM recordings (500 AT, 500 SR) from high-density mapping.
Computing platform with Python 3.9+ and libraries: SciPy, scikit-learn, NumPy, Pandas.

Procedure:

Define Hyperparameter Grid:
- High-pass cut-off (Hz): [1, 10, 20, 30, 40]
- Low-pass cut-off (Hz): [100, 200, 300, 400]
- Segmentation window (ms): [150, 200, 250, 300] centered on activation annotation.

Preprocessing & Feature Extraction Loop:
- For each (high_cut, low_cut, window_len) combination: a. Apply 4th-order Butterworth bandpass filter with (high_cut, low_cut) to raw EGM. b. Segment signal using the defined window_len. c. Extract a standardized feature vector per segment: [Root Mean Square, Shannon Entropy, Dominant Frequency, Wavelet Energy]. d. Store feature matrix and labels.
Model Training & Validation:
- Use a fixed, simple classifier (e.g., Linear SVM with C=1).
- Perform a stratified 5-fold cross-validation on the feature set from each hyperparameter set.
- Record the mean cross-validation F1-score for the AT class.
Optimal Set Selection:
- Identify the hyperparameter triple that yields the highest mean F1-score.
- Validate stability by inspecting performance variance across folds.

Deliverable: A 3D performance matrix (or 2D slices) identifying the optimal region for the specific task.

Protocol: Validating Segmentation for Fractionated EGMs

Objective: To establish the optimal adaptive window length for quantifying fractionation in persistent atrial fibrillation (persAF) EGMs before and after drug administration.

Materials:

Continuous 60-second persAF EGM recordings pre- and post- drug (e.g., Sodium Channel Blocker).
Annotation of local activation times (LATs) via certified algorithm.

Procedure:

Define Adaptive Window Strategies:
- Strategy A: Window = [LAT - 25ms, LAT + 75ms]
- Strategy B: Window = 90% of local cycle length, centered on LAT.
- Strategy C: Fixed 120ms window starting at LAT.

Apply Strategies and Calculate Fractionation Index (FI):
- For each activation in the recording, apply the three windowing strategies.
- Within each window, calculate FI = (number of deflections crossing 0.05mV threshold) / (window duration in ms).
- Compute the average FI for the entire recording under each strategy.
Assess Drug Effect Sensitivity:
- Calculate the relative change in average FI post-drug for each strategy: ΔFI% = (FIpost - FIpre) / FI_pre * 100.
- The optimal strategy is the one that yields a ΔFI% with the highest statistical significance (lowest p-value from paired t-test) and greatest effect size (Cohen's d), indicating highest sensitivity to the drug's electrophysiological effect.

Deliverable: A table comparing ΔFI% and its statistical robustness across windowing strategies, identifying the most sensitive one for the drug study.

Visualizing the Hyperparameter Tuning Workflow

Title: EGM Processing Hyperparameter Tuning Workflow

Title: Signal Transformation via Key Hyperparameters

Application Notes & Protocols

Thesis Context: These notes are formulated within a research thesis focused on extracting novel, prognostically significant features from unipolar and bipolar Electrogram (EGM) signals for machine learning (ML) applications in cardiac electrophysiology and anti-arrhythmic drug development.

1. Quantitative Data Summary: Processing Complexity vs. Scale Requirements

Table 1: Comparative Analysis of EGM Signal Processing Algorithms

Algorithm / Task	Time Complexity	Typical Execution Time (Single 10s EGM)	Primary Use Case	Scalability Challenge
Bandpass Filtering (Butterworth)	O(n)	~2-5 ms	Noise removal, baseline wander correction.	Highly scalable for real-time streams and large databases.
Wavelet Denoising	O(n log n)	~50-150 ms	Non-stationary noise removal, feature preservation.	Moderate scaling; batch processing for large databases.
Activation Time (dV/dt max)	O(n)	~1-3 ms	Real-time annotation for mapping systems.	Highly scalable; core for high-density array processing.
Phase Mapping (Hilbert Transform)	O(n log n)	~20-50 ms	Rotor and driver identification.	Challenging for real-time 3D mapping; used in post-analysis.
Conduction Velocity Estimation	O(n²) per region	~500-2000 ms	Tissue property quantification.	High computational load for dense arrays; often offloaded.
Deep Feature Extract. (1D CNN)	O(n * k) [Inference]	~100-300 ms (GPU)	Automated complex pattern recognition.	Training is resource-heavy; inference can be optimized for scale.

Table 2: Computational Infrastructure for Different Analysis Scales

Analysis Scale	EGM Volume	Recommended Infrastructure	Key Efficiency Strategy	Latency Tolerance
Real-Time Clinical Mapping	~100-500 channels @ 1kHz	Multi-core CPU + FPGA/GPU acceleration	Stream processing, optimized fixed-point math.	Very Low (<50ms)
Medium-Scale Retrospective Study	10,000-100,000 EGMs	High-performance CPU cluster, parallel file system.	Embarrassingly parallel per-signal jobs.	Moderate (Hours/Days)
Large Database Mining (e.g., ALL-ML)	>1 Million EGMs	Cloud-based distributed computing (Spark, Dask).	Dimensionality reduction before ML, columnar storage.	High (Days/Weeks)

2. Experimental Protocols

Protocol A: Efficient Real-Time EGM Feature Extraction for High-Density Mapping

Objective: To implement a pipeline for calculating activation time, amplitude, and basic frequency features from a 64-electrode basket catheter with <20ms latency.

Signal Acquisition: Acquire unipolar EGMs at 2000 Hz. Apply hardware-based analog bandpass filtering (30-500 Hz).
Preprocessing (CPU): Implement a digital 2nd-order Butterworth bandpass filter (30-400 Hz) using a cascaded forward-backward (filtfilt) method for zero-phase distortion. Utilize vectorized operations on multi-channel array.
Feature Extraction (Optimized):
- Activation Time: Compute numerical derivative via central difference. Identify maximal negative dV/dt using a sliding window peak detector. Implement in C++ as a Python extension.
- Amplitude: Calculate bipolar electrograms from adjacent unipolar pairs. Find peak-to-peak amplitude in a window around activation.
Real-Time Constraints: Allocate a fixed buffer for 2-second data chunks. Profile code to eliminate memory allocation delays within the main loop. Use a ring buffer structure for continuous data flow.

Protocol B: Large-Scale EGM Feature Database Construction for ML Training

Objective: To uniformly process >100,000 archived EGMs to generate a standardized feature set for classifier development.

Data Curation: Compile EGM records and metadata into a manifest (CSV). Use a distributed filesystem (e.g., Lustre) or cloud bucket (e.g., AWS S3).
Containerized Processing: Package the processing environment (Python, libraries) into a Docker/Singularity container for reproducibility.
Parallelized Workflow:
- Use a workload manager (e.g., Snakemake, Nextflow) to define the pipeline: Import → Filter (1-250 Hz) → Denoise (Wavelet, level 5, 'sym4') → Extract Features (Table 1) → Store.
- Distribute individual EGM processing jobs across a HPC cluster or Kubernetes cluster. Each job writes output to a common database (e.g., PostgreSQL with TimescaleDB, or Parquet files).
Feature Storage: Store scalar features (amplitude, duration, etc.) in a structured SQL table. Store full signal snippets or complex vectors (e.g., wavelet coefficients) in a linked columnar storage format (Parquet) optimized for bulk ML reading.

3. Mandatory Visualizations

Diagram 1: EGM Processing Workflows: Real-Time vs. Large-Scale

Diagram 2: Trade-offs in Computational EGM Analysis

4. The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational & Data Resources for EGM/ML Research

Resource / Tool	Category	Function in EGM Feature Research
Biosignal Toolkit (e.g., BioSPPy, WFDB)	Software Library	Provides standardized, validated implementations of filters, feature extractors, and I/O for physiological signals.
NumPy/SciPy (with MKL/OpenBLAS)	Computational Backend	Enables vectorized, high-performance mathematical operations on large EGM arrays. Optimized linear algebra is critical.
GPU-Accelerated Libraries (CuPy, RAPIDS)	Hardware Acceleration	Dramatically speeds up wavelet transforms, CNN inference, and large matrix operations for database-scale analysis.
TimescaleDB / PostgreSQL + pgvector	Database	Stores time-series EGM metadata and extracted features efficiently. Supports time-based queries and embedding similarity search.
Apache Parquet + Pandas/Dask	File Format & Processing	Columnar storage for massive feature sets, enabling efficient disk I/O and out-of-core computation for ML.
Lab Streaming Layer (LSL)	Data Acquisition Framework	Standardized protocol for synchronizing real-time EGM streams with other data (e.g., ECG, hemodynamics) for unified processing.

Proving Value: Validating ML-Ready EGM Features and Benchmarking Against Traditional Biomarkers

Within the broader thesis on Electrogram (EGM) signal processing for machine learning (ML) feature research, robust validation frameworks are paramount. EGM signals, recorded from the heart via catheters, contain complex spatiotemporal information used to characterize cardiac arrhythmia substrates. Extracted features—such as fractionation indices, voltage amplitudes, frequency domain components, and entropy measures—form the basis for ML models aimed at predicting ablation targets, arrhythmia recurrence, or disease progression. Without rigorous validation, these models risk overfitting, data leakage, and poor generalizability, ultimately failing in clinical translation. This document details application notes and protocols for three critical validation paradigms applied specifically to EGM-derived features.

Core Validation Frameworks: Definitions and Comparative Analysis

Table 1: Comparison of Validation Frameworks for EGM Feature Models

Framework	Core Principle	Typical Data Split	Primary Use Case in EGM Research	Key Advantages	Key Limitations
k-Fold Cross-Validation (CV)	Iterative partitioning of the available dataset into k complementary subsets (folds).	All data used for both training and validation, but not simultaneously. k=5 or k=10 common.	Model development & hyperparameter tuning with limited patient cohort data.	Maximizes data usage; provides robust performance estimate variance.	High computational cost; risk of over-optimism if dataset is small or heterogeneous.
Hold-Out Testing	Single, definitive split into distinct training, validation (optional), and test sets.	Common splits: 70/15/15 or 80/20 (train/test). Test set is locked.	Initial proof-of-concept studies with larger datasets; assessing final model performance.	Simple, fast, mimics a true independent test if split correctly.	Performance estimate is highly sensitive to a single, arbitrary split; less stable.
Independent Cohort Validation	Validation using data collected from a distinct population, often at a different center or time.	Training: Cohort A. Validation: Entirely separate Cohort B.	Confirmatory validation for clinical readiness; assessing geographical/temporal generalizability.	Gold standard for assessing real-world generalizability and mitigating center-specific bias.	Requires significant logistical effort to acquire independent data; may fail due to legitimate population shifts.

Application Notes & Experimental Protocols

Protocol: k-Fold Cross-Validation for EGM Feature Selection

Objective: To reliably estimate the performance of a classifier predicting AF recurrence using intracardiac EGM features, while selecting the most informative feature subset.

Pre-processing & Feature Extraction:

Signal Acquisition: Import bipolar and unipolar EGM recordings from a 3D mapping system (e.g., CARTO, Ensite).
Segmentation: Isolate stable 2.5-second segments during sustained arrhythmia or stable rhythm.
Feature Computation: Calculate a broad feature library per segment:
- Time Domain: Peak-to-peak voltage, Root Mean Square (RMS), Local Activation Time (LAT) variance.
- Complexity: Number of deflections, Fractionation Interval (FI), Shortest Complex Interval (SCI).
- Frequency Domain: Dominant Frequency (DF), Organization Index (OI).
- Non-linear: Approximate Entropy, Wavelet Transform coefficients.

Cross-Validation Workflow:

Patient-Level Splitting: Assign all EGM segments from a single patient to the same fold to prevent data leakage. Use StratifiedKFold (scikit-learn) based on patient outcome (e.g., recurrence yes/no).
Iteration (k=5): For each of the 5 folds:
- Training Set (4 folds): Perform Z-score normalization (fit on training fold only). Apply recursive feature elimination (RFE) with a support vector machine (SVM) to identify the top 10 features.
- Validation Fold (1 fold): Apply the same scaling transform from the training set. Evaluate the SVM model (trained on the 4 folds using the selected features) on the held-out validation fold. Record performance metrics (AUC, accuracy, F1-score).
Aggregation: Calculate the mean and standard deviation of the performance metrics across all 5 folds. The final feature set is determined by the union or consensus of features selected in each iteration.

Diagram Title: 5-Fold Cross-Validation Workflow for EGM Features

Protocol: Hold-Out Testing for Ablation Target Classifier

Objective: To obtain a final, unbiased performance estimate of a pre-specified deep learning model that identifies critical ablation sites from high-density grid EGM data.

Protocol:

Initial Data Curation: Pool EGM recordings from N patients who underwent ablation. Label each EGM site as "effective" or "ineffective" ablation target based on acute procedural outcome (termination/transformation of arrhythmia).
Stratified Hold-Out Split: Before any analysis, perform a single, patient-level random split (80%/20%) into a Development Set and a locked Hold-Out Test Set. Preserve the ratio of outcome labels in both sets.
Development Phase (Using Development Set only):
- Further split the Development Set into training/validation (e.g., 75%/25%) for model tuning.
- Train a Convolutional Neural Network (CNN) on time-frequency spectrograms of EGM signals.
- Tune hyperparameters (learning rate, dropout) based on validation set performance.
- Once satisfied, freeze the model architecture and parameters.
Final Evaluation (Single Use of Test Set):
- Apply the finalized model to the locked Hold-Out Test Set.
- Compute the final performance metrics (e.g., Sensitivity, Specificity, AUC). This report represents the unbiased estimate of future performance.

Protocol: Independent Cohort Validation

Objective: To validate an EGM-based fibrosis detection algorithm developed at a primary center against data from a separate, international center.

Protocol:

Model Development (Center A):
- Develop and fully finalize the model (including feature scaling parameters) using Center A's internal data and internal validation (CV or Hold-Out).
- Document all pre-processing steps, feature definitions, and model coefficients/weights.
Independent Cohort Acquisition (Center B):
- Acquire raw EGM data from Center B, from a different mapping system if possible, with a similar but not identical patient phenotype (e.g., persistent AF patients).
- Critical: Apply only the pre-processing and feature extraction pipeline as defined and fixed in Step 1. No re-training or re-calibration is allowed on Center B's data.
Blinded Validation:
- Center A provides the model executable or code to Center B.
- Center B runs the model on their local data and generates predictions.
- A pre-specified statistical analysis plan (comparing model predictions to gold-standard MRI fibrosis maps) is executed by a third-party statistician.

Diagram Title: Independent Cohort Validation Protocol Flow

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for EGM Feature Validation Studies

Item / Solution	Function in EGM Research	Example / Specification
High-Density Mapping Catheter	Acquires spatially dense intracardiac EGM signals. Essential for extracting regional features.	Abbott Advisor HD Grid, Biosense Webster PentaRay.
3D Electroanatomic Mapping (EAM) System	Records, visualizes, and exports spatially tagged EGM data with anatomical context.	CARTO 3 (Biosense Webster), EnSite Precision (Abbott).
Digital Signal Processing (DSP) Software Library	Provides standardized algorithms for filtering, segmenting, and extracting features from raw EGM.	MATLAB Signal Processing Toolbox, Python SciPy & NumPy, LabVIEW.
Arrhythmia Induction & Stimulation Protocol	Standardizes the physiological state during EGM recording (e.g., pacing cycle length).	Programmed electrical stimulation (PES) protocols.
Reference Standard Labels	Provides ground truth for supervised ML model training and validation.	Acute ablation success (termination), Long-term recurrence (1-year follow-up), MRI-based scar/fibrosis.
Statistical Computing Environment	Implements CV splits, trains ML models, and computes performance metrics.	Python with scikit-learn, PyTorch; R with `caret` or `mlr3`.
Secure Data Anonymization Tool	Prepares patient data for multi-center sharing, required for independent validation.	HIPAA-compliant de-identification software (e.g., DICOM Anonymizer).

Application Notes

Within the context of a thesis on EGM signal processing for machine learning (ML) research, this document provides a framework for comparing novel ML-derived electrophysiological features against established Electrogram (EGM) metrics. The core hypothesis is that ML features—extracted via time-frequency analysis, nonlinear dynamics, or topological data analysis—can offer superior predictive value for arrhythmic risk stratification and drug efficacy assessment compared to traditional metrics like voltage amplitude, cycle length (CL), and fractionation indices.

The challenge lies in rigorous, standardized benchmarking. These Application Notes outline the experimental protocols, validation pipelines, and analytical tools required to perform such comparisons, ensuring findings are robust, reproducible, and translatable to pre-clinical and clinical drug development.

Key Concepts & Established EGM Metrics

Electrogram (EGM): A recording of cardiac electrical activity from electrodes in contact with the myocardium.

Voltage (Peak-to-Peak Amplitude): Indicator of tissue viability and substrate health. Low voltage often correlates with fibrosis.
Cycle Length (CL): The interval between successive depolarizations. A fundamental measure of arrhythmia rate and tissue refractoriness.
Fractionation: Complex, multi-component EGMs. Quantified by metrics like Number of Peaks, Shortest Complex Interval (SCI), or Duration. Associated with slow, discontinuous conduction in pathological substrate.

ML-Derived Features: Higher-dimensional descriptors capturing nonlinear patterns not apparent in traditional metrics.

Examples: Entropy measures (Shannon, Sample, Approximate), Wavelet Transform coefficients, Recurrence Quantification Analysis (RQA) variables, Persistent Homology features from topological analysis.

Experimental Protocols

Protocol 1: In-Silico Benchmarking Using Computational Heart Models

Objective: To compare feature performance in a controlled environment with a known ground truth.

Model Selection: Utilize a validated computational model (e.g., TP06 human ventricular myocyte model integrated into 2D/3D tissue monodomain simulations).
Substrate Generation: Introduce gradients of fibrosis (10%-50%) via increased connective tissue resistance to create stable reentrant circuits (rotors).
EGM Simulation: Simulate unipolar or bipolar EGMs at multiple virtual electrode sites covering zones of healthy tissue, border zone, and dense core.
Feature Extraction:
- Traditional: Compute Voltage, CL, Number of Peaks, EGM Duration.
- ML Features: Calculate a suite of features (e.g., Spectral Entropy, Dominant Frequency, Lyapunov Exponent).
Performance Benchmarking: For each simulated site, define the "ground truth" substrate classification (e.g., Healthy, Border Zone, Core). Train a simple classifier (e.g., Random Forest) using (a) only traditional metrics and (b) only ML features. Compare AUC-ROC for classification accuracy.

Protocol 2: Ex-Vivo/In-Vitro Validation in Langendorff-Perfused Hearts

Objective: To validate feature performance in real biological tissue under controlled pharmacological intervention.

Heart Preparation: Isolate and Langendorff-perfuse a rabbit or guinea pig heart. Maintain temperature, pH, and perfusion pressure.
Arrhythmia Induction & Recording: Use rapid pacing or a combination of burst pacing and pharmacological challenge (e.g., Acetylcholine + Isoproterenol) to induce atrial or ventricular arrhythmia. Record high-density epicardial EGMs (e.g., using a 128-electrode array).
Pharmacological Intervention: Administer a known antiarrhythmic drug (e.g., Dofetilide, a Class III agent) at a therapeutic concentration. Continuously record EGMs during wash-in and wash-out.
Signal Processing & Analysis: For each electrode and time segment:
- Pre-process: Bandpass filter (1-500 Hz), remove baseline wander.
- Extract Features: Compute both traditional and ML feature sets.
Correlation with Outcome: Primary outcome: termination or destabilization of arrhythmia. Determine which feature set (traditional vs. ML) shows a stronger and earlier correlative change with successful drug response.

Protocol 3: Retrospective Analysis of Clinical Electrophysiology Study Data

Objective: To benchmark features against clinical endpoints.

Data Curation: Obtain de-identified high-resolution EGM recordings from patients undergoing ablation for ventricular tachycardia (VT). Data must include signals from mapped VT circuits and non-critical sites.
Annotation: Sites must be annotated per clinical metrics: Voltage (<0.5mV = scar, 0.5-1.5mV = border zone), Presence of Fractionated Potentials, and clinical outcome annotation (e.g., Site of Successful Ablation, Critical Isthmus).
Blinded Feature Analysis: Extract ML features from all sites without knowledge of clinical annotation.
Statistical Benchmarking: Perform univariate and multivariate logistic regression to predict the clinical outcome (e.g., critical site). Compare the explanatory power (e.g., Likelihood Ratio Chi-Square) of a model containing only traditional metrics versus one containing ML features.

Data Presentation

Feature Category	Specific Metric	AUC-ROC (Healthy vs. Diseased)	p-value (vs. Voltage)	Computational Cost (ms/signal)
Traditional	Voltage (Peak-to-Peak)	0.82	(Ref)	0.5
Traditional	Fractionation Duration	0.76	0.12	1.2
Traditional	Cycle Length Variability	0.71	0.03	2.1
ML-Derived	Wavelet Entropy	0.91	0.01	15.7
ML-Derived	RQA Determinism	0.88	0.02	85.3
ML-Derived	1st Persistence Homology Score	0.93	<0.01	120.5

Table 2: Key Research Reagent Solutions & Materials

Item Name	Function/Application in EGM-ML Research
Langendorff Perfusion System	Ex-vivo heart maintenance for controlled electrophysiological study and drug testing.
Multi-Electrode Array (MEA) (e.g., 128 channels)	High-spatial-resolution EGM acquisition from epicardial or endocardial surfaces.
Optical Mapping Setup (Di-4-ANEPPS dye, LED excitation)	Provides gold-standard measurement of action potential duration and conduction velocity for validation.
Class III Antiarrhythmic Agent (e.g., Dofetilide, E-4031)	Positive control reagent to prolong action potential duration and alter EGM features.
Pro-Fibrotic Agent (e.g., TGF-β1)	Used in cell or tissue culture models to create a fibrotic substrate that alters EGM fractionation.
Human iPSC-Derived Cardiomyocytes	Provides a reproducible, human-based cellular model for high-throughput drug screening.
Signal Processing Suite (e.g., custom Python with SciPy, PyWavelets)	Essential for filtering, segmenting, and extracting both traditional and ML features from raw EGM data.

Visualization

Title: EGM ML Feature Benchmarking Workflow

Title: Drug Effect on EGM & Feature Sensitivity Pathway

1. Introduction & Thesis Context Within the broader thesis of developing machine learning (ML) models for cardiac electrophysiology (EP), a critical validation gap exists between engineered electrogram (EGM) features and ground-truth biological states. This document outlines the application notes and protocols for establishing a "Gold Standard" correlative framework, bridging processed intracardiac signal data with anatomical (imaging), histological (tissue), and clinical (patient outcome) endpoints. This correlation is essential for developing interpretable, biologically-relevant ML features for use in drug efficacy studies and ablation therapy development.

2. Core Data Tables

Table 1: Key Processed EGM Features for Correlation

Feature Category	Specific Metric	Processing Method (Typical)	Proposed Biological Correlate
Time-Domain	Voltage Amplitude (Peak-to-Peak)	Bandpass (30-300Hz) filtering, peak detection	Local tissue viability, fibrosis burden
	Fractionation Index (e.g., Number of Peaks)	Complex fractionated EGM (CFAE) analysis	Myocardial disorganization, slow conduction zones
	Duration (ms)	Signal envelope calculation	Area of slow conduction, scar border zone
Frequency-Domain	Dominant Frequency (DF)	Fast Fourier Transform (FFT) or Welch's method	Rotor core activity, driver stability
	Organization Index (OI)	Spectral coherence analysis	Myocardial organization vs. disorganization
Non-Linear	Approximate Entropy (ApEn)	Time-series complexity calculation	Electrophysiological stability/chaos
	Wavelet-Derived Features	Discrete Wavelet Transform (DWT)	Multi-scale conduction properties

Table 2: Target Endpoint Datasets for Correlation

Endpoint Type	Modality/Source	Key Extractable Metrics	Temporal Context
Anatomical	Electroanatomic Mapping (EAM)	Voltage (scar, healthy), Local Activation Time (LAT), Geometry	Peri-procedural
	Cardiac MRI (Late Gadolinium Enhancement)	Fibrosis volume, location, transmurality	Pre/Post-procedural
Histological	Endomyocardial Biopsy (from mapped site)	Fibrosis %, Myocyte disarray, Inflammatory infiltrate, Connexin expression	Peri-procedural (acute)
	Explant Heart Analysis	Regional tissue architecture, ion channel density (immunohistochemistry)	Post-transplant
Clinical	Patient Follow-up	Arrhythmia recurrence (via monitor), Symptom score, Cardiovascular hospitalization	Long-term (e.g., 12-month)

3. Experimental Protocols

Protocol 1: Peri-Procedural Multi-Modal Data Acquisition & Co-Registration Objective: To spatially align processed EGM features with anatomical (EAM, MRI) and acute histological data from precisely located biopsy sites.

Pre-Procedure Imaging: Acquire high-resolution cardiac MRI with LGE sequences. Segment the left/right atrium/ventricle and delineate fibrotic regions.
Intra-Procedure EAM & EGM Recording: Perform standard EP study. Using a 3D EAM system (e.g., CARTO, Ensite), create a detailed geometry shell. At each mapped point (N>200 per chamber), acquire stable, 5-second unipolar and bipolar EGM recordings from the mapping/ablation catheter. Annotate each point with: 3D spatial coordinates, LAT, and bipolar voltage.
Targeted Biopsy Acquisition: Based on pre-defined EAM voltage zones (e.g., healthy (>1.5mV), dense scar (<0.1mV), border zone (0.1-1.5mV)), select 5-8 target sites for biopsy using a bioptome under fluoroscopic/EAM guidance. Record the exact 3D coordinates of each biopsy.
Data Co-Registration: Export EAM geometry, point data (voltage, LAT), and biopsy coordinates. Use software (e.g., MATLAB with custom scripts, ADAS-3D) to co-register the EAM shell with the pre-operative LGE-MRI surface using landmark- or surface-based registration. Validate registration accuracy (<2mm mean error).
EGM Signal Processing: For each recorded EGM at all mapped points (including biopsy sites), apply standardized processing pipelines to extract features listed in Table 1.

Protocol 2: Histological Processing & Quantitative Analysis Objective: To generate quantitative histological metrics from biopsy samples for direct correlation with EGM features from the same site.

Sample Fixation & Sectioning: Fix biopsy samples in 10% neutral buffered formalin for 24-48 hours. Process, paraffin-embed, and section into 4-5μm slices.
Staining Protocol:
- Masson's Trichrome (for fibrosis): Stain according to standard protocol. Scan slides using a high-resolution digital pathology scanner.
- Immunohistochemistry (e.g., for Connexin 43): Perform antigen retrieval, block, incubate with primary anti-Cx43 antibody, apply labeled polymer, develop with DAB, counterstain with hematoxylin.
Digital Image Analysis: Use quantitative analysis software (e.g., QuPath, ImageJ with custom macros).
- For Trichrome: Apply color deconvolution. Calculate percentage of fibrotic tissue (blue area) vs. total tissue area per high-power field (HPF). Analyze 3-5 HPFs per sample.
- For Cx43: Quantify signal intensity, lateralization index, or percentage of gap junction-positive area.

Protocol 3: Longitudinal Clinical Outcome Correlation Objective: To correlate baseline EGM feature maps with long-term patient outcomes.

Outcome Data Collection: Establish a prospective registry. Primary endpoint: arrhythmia recurrence (AF/AFL/VT) lasting >30 seconds on 24-month intermittent or implantable cardiac monitor. Secondary endpoints: symptom severity (e.g., EHRA score), heart failure hospitalization, need for repeat ablation.
Feature Map Summarization: For each patient, create summary statistics (mean, standard deviation, skewness, percentage of area) of each EGM feature (Table 1) within pre-defined anatomical zones (e.g., entire chamber, specific veins, scar regions).
Statistical Correlation: Perform time-to-event analysis (Cox proportional hazards) using EGM feature summaries as continuous or dichotomized variables. Use machine learning (e.g., random survival forests) to identify the most predictive multi-feature signature for clinical recurrence.

4. Visualization Diagrams

Title: Multi-Modal Data Integration & Correlation Workflow

Title: Logical Pathway from Tissue to ML Prediction

5. The Scientist's Toolkit: Key Research Reagent Solutions

Item	Function in Protocol	Example/Specification
3D Electroanatomic Mapping System	Provides spatial coordinates, voltage maps, and LAT maps; platform for EGM acquisition.	CARTO 3 (Biosense Webster), EnSite Precision (Abbott).
High-Definition Mapping Catheter	Acquires high-fidelity, stable bipolar/unipolar EGMs with precise electrode spacing.	PentaRay (Biosense Webster), Advisor HD Grid (Abbott).
Bioptome	For obtaining targeted endomyocardial biopsy samples from specific mapped sites.	Cordis 7Fr or comparable, with fluoroscopic visibility.
Digital Pathology Scanner	Creates high-resolution whole-slide images for quantitative histology analysis.	Leica Aperio, Hamamatsu NanoZoomer.
Quantitative Image Analysis Software	Enables unbiased, high-throughput measurement of fibrosis %, connexin distribution, etc.	QuPath, HALO, ImageJ/Fiji with custom scripts.
Signal Processing Software Library	For standardized extraction of EGM features (time, frequency, non-linear domains).	Custom MATLAB/Python toolboxes (e.g., BioSPPy, EEGLab-inspired).
Data Co-Registration Software	Aligns EAM geometry, MRI surfaces, and biopsy coordinates into a common coordinate system.	ADAS-3D, EP-NAV, or custom ICP algorithm implementations.
Primary Antibody for Connexin 43	Labels gap junctions for immunohistochemical analysis of electrical coupling.	Anti-GJA1/Cx43 antibody (e.g., Abcam ab11370).

This Application Note provides a detailed framework for applying Explainable AI (XAI) techniques to machine learning models that use processed Electrogram (EGM) signals as input features. Within the broader thesis on EGM signal processing for ML features, the transition from high-performing "black-box" models to interpretable, clinically and scientifically actionable insights is critical. For researchers, scientists, and drug development professionals, understanding why a model makes a particular prediction (e.g., classifying arrhythmia type, predicting drug-induced proarrhythmic risk) is as important as the prediction's accuracy. This document outlines protocols and methodologies for dissecting model decisions, ensuring that predictions are based on physiologically relevant EGM-derived features rather than spurious artifacts.

Core XAI Methodologies for EGM-Based Models

The following table summarizes principal XAI techniques, their applicability to different model types common in EGM analysis, and key quantitative outputs.

Table 1: XAI Techniques for EGM-Based Predictive Models

XAI Technique	Model Type Applicability	Core Principle	Key Interpretable Output for EGM	Quantitative Metric (Example)
SHAP (SHapley Additive exPlanations)	Tree-based (RF, XGBoost), Deep Learning, Linear	Game theory-based; measures each feature's contribution to a specific prediction.	Per-prediction importance of each EGM feature (e.g., APD90, conduction velocity).	SHAP value (mean \|SHAP\| = 0.15 for feature 'Repolarization Dispersion')
LIME (Local Interpretable Model-agnostic Explanations)	Model-agnostic	Approximates complex model locally with an interpretable surrogate model (e.g., linear).	Identifies which regions of the input EGM signal (time segments) drove a classification.	Feature weights in local surrogate model (Weight = +2.3 for amplitude in window 50-100ms)
Gradient-weighted Class Activation Mapping (Grad-CAM)	Convolutional Neural Networks (CNNs)	Uses gradients flowing into the final CNN layer to highlight important regions in input.	Heatmap overlay on the 2D input (e.g., time-frequency representation of EGM).	Intensity of heatmap activation at a specific time-frequency coordinate.
Permutation Feature Importance	Model-agnostic	Measures increase in prediction error after permuting a feature's values.	Global ranking of overall importance of processed EGM features to model performance.	Increase in RMSE after permutation (ΔRMSE = 0.08 for 'Fractionated Activity Index')
Partial Dependence Plots (PDPs)	Model-agnostic	Illustrates marginal effect of one or two features on the predicted outcome.	Shows how predicted arrhythmia risk changes as a specific EGM feature (e.g., beat-to-beat variability) varies.	Predicted probability range across feature values (e.g., 0.1 to 0.9).

Experimental Protocol: Applying SHAP to an Arrhythmia Classification Model

This protocol details steps to explain a trained XGBoost model that classifies EGMs into Ventricular Tachycardia (VT) vs. Normal Sinus Rhythm (NSR) based on 20 engineered features.

Aim: To identify which processed EGM features are most influential for the model's classifications and to validate their physiological plausibility.

Materials & Pre-trained Model:

Input Data: Dataset of 5000 processed EGM recordings, each represented by a 20-dimensional feature vector (e.g., cycle length, organization index, dominant frequency, peak-to-peak voltage).
Model: A pre-trained XGBoost classifier with known performance metrics (e.g., AUC > 0.95).
Ground Truth: Annotated clinical/experimental labels (VT/NSR).

Procedure:

Model Prediction: Run the test dataset (n=1000) through the pre-trained XGBoost model to generate predictions and probabilities.
SHAP Explainer Initialization:
- Import the shap Python library.
- Initialize the TreeExplainer with the trained XGBoost model.
- Compute SHAP values for the entire test set: shap_values = explainer.shap_values(X_test).
Global Interpretability Analysis:
- Generate a summary plot of mean absolute SHAP values across all test samples to rank global feature importance.
- Plot SHAP summary beeswarm plots to visualize the distribution of SHAP values per feature and their correlation with feature values (e.g., high dominant frequency pushes prediction towards VT).
Local Interpretability Analysis:
- For specific, challenging, or high-confidence predictions, create a SHAP force plot for a single EGM sample.
- This plot visually deconstructs the model's base value and shows how each feature pushed the prediction from the base value to the final output.
Biological/Clinical Validation:
- Correlate top SHAP-identified features with known electrophysiological markers from the literature (e.g., high repolarization dispersion -> VT).
- Design a follow-up in silico or in vitro experiment to perturb the top-identified feature and observe if the predicted outcome changes as expected.

Diagram Title: SHAP Analysis Workflow for EGM Model Explainability

Table 2: Key Research Reagent Solutions for XAI-EGM Validation Studies

Item Name	Function/Description	Example Product/Source
Human iPSC-Derived Cardiomyocytes	Provides a physiologically relevant in vitro system to validate model predictions by experimentally manipulating features identified by XAI (e.g., altering conduction with a gap junction blocker).	Fujifilm Cellular Dynamics iCell Cardiomyocytes, Axol Biosciences Human iPSC-CMs.
Multi-Electrode Array (MEA) System	Records high-fidelity, spatially resolved EGM signals from cardiomyocyte monolayers or tissue slices, generating the raw input data for feature engineering and model testing.	Multi Channel Systems MEA2100, Axion Biosystems Maestro.
Optogenetic Actuators (e.g., Channelrhodopsin-2)	Enables precise, contactless perturbation of excitation patterns (a key EGM feature) to test causal relationships suggested by XAI outputs.	AAV vectors expressing ChR2 under cardiac-specific promoters.
Pharmacological Agents (Ion Channel Modulators)	Tools to selectively alter specific EGM components (e.g., sodium channel blocker to slow conduction, hERG blocker to prolong repolarization) for hypothesis testing.	Tetrodotoxin (Na+ blocker), E-4031 (IKr blocker), Isoproterenol (β-adrenergic agonist).
In Silico Cardiac Electrophysiology Models	Computational models (e.g., O'Hara-Rudy, ToR-ORd) to simulate EGM changes in response to virtual perturbations of parameters linked to XAI-identified features.	OpenCOR simulation environment, CellML model repositories.

Protocol: Gradient-Based Saliency Mapping for CNN-Based EGM Analysis

Aim: To visualize which time-frequency regions in a spectrogram representation of an EGM are most critical for a CNN's classification of drug-induced proarrhythmia risk.

Materials:

Input Data: EGM signals transformed into time-frequency spectrograms (using Continuous Wavelet Transform or Short-Time Fourier Transform).
Model: A trained CNN (e.g., ResNet-18) for binary classification (High Risk / Low Risk).
Software: Deep learning framework with automatic differentiation (PyTorch/TensorFlow).

Procedure:

Input Preparation: Forward propagate a single EGM spectrogram through the CNN until the final convolutional layer.
Gradient Calculation:
- For the target class (e.g., "High Risk"), compute the gradient of the class score with respect to the feature maps of the last convolutional layer.
- These gradients represent the importance of each feature map for the target class.
Feature Map Weighting:
- Perform global average pooling on the gradients to obtain a weight for each feature map.
- Generate a weighted combination of the feature maps from the last convolutional layer. This is the "class activation map."
Upsampling & Overlay:
- Upsample the class activation map to the original input spectrogram dimensions using bilinear interpolation.
- Normalize the activation map and overlay it as a heatmap on the original EGM spectrogram.
Interpretation:
- Regions with high activation (hot colors) indicate time-frequency components (e.g., specific frequencies at specific times) that the CNN used most strongly for its "High Risk" prediction. Correlate these regions with known proarrhythmic signatures (e.g., late-peaking low-frequency components).

Diagram Title: Grad-CAM Saliency Map Generation for EGM Spectrograms

Integrating XAI into the EGM signal processing and ML pipeline is non-negotiable for credible translation to drug development and clinical research. Best practices include:

Use Multiple XAI Methods: No single method provides a complete picture. Use a combination (e.g., SHAP for global feature importance, Grad-CAM for spatial/temporal localization).
Prioritize Physiological Plausibility: The ultimate goal is not just to explain the model, but to extract a biologically coherent hypothesis. Features highlighted by XAI must be reconciled with known cardiac electrophysiology.
Design Experiments to Test XAI Outputs: Treat XAI outputs as hypotheses. The most powerful use of XAI is to guide targeted in vitro, in silico, or in vivo experiments to validate the causal role of identified features.

1. Introduction This application note details the integration of intracardiac electrogram (EGM) signal processing and machine learning (ML) within preclinical antiarrhythmic drug development. It provides a framework for quantifying drug-induced changes in EGM features, serving as a chapter in a broader thesis on ML-feature research from bio-signals. The protocols enable objective, high-throughput assessment of drug efficacy on cardiac electrophysiology.

2. Key EGM Features for Quantification The following quantitative features, derived from processed EGM signals, serve as primary biomarkers for drug assessment.

Table 1: Core EGM Features for Antiarrhythmic Drug Assessment

Feature Category	Specific Feature	Physiological/Drug Effect Correlation	Typical Change with Effective AAD
Temporal	Activation Time (AT)	Local conduction velocity.	Prolongation (slowed conduction).
	Complex Fractionated EGM Duration (CFE-d)	Presence of arrhythmogenic substrate.	Reduction (stabilization of substrate).
Amplitude & Power	Peak-to-Peak Amplitude	Tissue viability, coupling.	Variable (context-dependent).
	Dominant Frequency (DF)	Rate of local repetitive activation.	Reduction (slowed rotor activity).
Spectral & Entropy	Shannon Entropy	Signal irregularity/organization.	Reduction (increased organization).
	Wavelet Decomposition Energy	Multi-scale electrical activity.	Shift in energy bands.
Morphological	Slope	Maximum dv/dt, depolarization speed.	Reduction (slowed upstroke).
	Phase Analysis	Wavefront discontinuity, rotors.	Increased singularity point residency time.

3. Experimental Protocol: Ex Vivo Langendorff-Perfused Heart Model This protocol quantifies drug effects on EGM features in a controlled, intact-organ system.

3.1 Materials & Reagents Research Reagent Solutions:

Item	Function & Specification
Tyrode's Solution	Physiological perfusion buffer (pH 7.4, 37°C, bubbled with 95% O2/5% CO2).
Test Antiarrhythmic Compound	Dissolved in DMSO or Tyrode's to final working concentration; vehicle control prepared in parallel.
Arrhythmogenic Challenge Agent	e.g., Acetylcholine + Caffeine for triggered activity, or rapid pacing protocols.
High-Density Multielectrode Array (HD-MEA)	128-256 electrodes for simultaneous EGM acquisition from epicardial/endocardial surface.
Data Acquisition System	Amplifier (0.05-500 Hz bandpass), 1 kHz+ sampling rate per channel, optical isolation.

3.2 Stepwise Procedure

Heart Preparation: Isolate heart from anesthetized animal (e.g., guinea pig, rabbit). Cannulate aorta and initiate Langendorff perfusion with warm, oxygenated Tyrode's solution.
Baseline Stabilization: Perfuse for 20 minutes to stabilize electrophysiological parameters.
Baseline EGM Recording: Place HD-MEA on region of interest (e.g., left ventricle). Record 5 minutes of stable sinus rhythm EGMs.
Arrhythmia Induction (Pre-Drug): Apply arrhythmogenic challenge (e.g., burst pacing). Record 2 minutes of arrhythmic activity or confirm sustained arrhythmia.
Drug Administration: Switch perfusion to Tyrode's containing the test antiarrhythmic compound at target concentration. Perfuse for 15-20 minutes to ensure tissue equilibration.
Post-Drug EGM Recording: Record 5 minutes of sinus rhythm EGMs under drug perfusion.
Arrhythmia Challenge (Post-Drug): Re-apply the identical arrhythmogenic challenge. Record outcome (e.g., arrhythmia duration, success/failure of induction).
Signal Processing & Feature Extraction: Apply 50/60 Hz notch filter and bandpass filter (1-250 Hz). For each electrode, extract features listed in Table 1 from pre-drug and post-drug epochs using custom algorithms.
Statistical Analysis: Perform paired t-tests or ANOVA on feature distributions (pre- vs. post-drug). Quantify % change.

4. Experimental Protocol: In Vivo Chronic Myocardial Infarction (MI) Model This protocol assesses drug efficacy in a pathological substrate relevant to ventricular tachycardia (VT).

4.1 Materials & Reagents

Item	Function & Specification
Programmable Electrical Stimulator	For programmed ventricular stimulation (PVS) protocols.
Clinical Electrophysiology (EP) Catheter	4-pole or 20-pole mapping catheter for endocardial EGM recording.
3D Electroanatomic Mapping (EAM) System	e.g., CARTO or Ensite, for spatial registration of EGM features.
Telemetry Implant	For continuous ECG monitoring pre- and post-drug administration.

4.2 Stepwise Procedure

MI Model Creation: Induce myocardial infarction via surgical coronary artery ligation in a large animal (e.g., swine). Allow 4-6 weeks for scar formation.
Baseline Electrophysiology Study (EPS): Anesthetize animal. Insert EP catheter into ventricle. Perform 3D EAM during sinus rhythm to create baseline voltage and feature maps. Perform PVS to induce VT (define baseline inducibility).
Baseline EGM Acquisition: Export dense, localized EGM data from the EAM system (≥1000 points per map) from scar, border zone, and healthy tissue.
Drug Administration: Administer test compound via intravenous infusion to achieve target plasma concentration.
Post-Drug EPS & EAM: Repeat EAM and PVS protocol identically after drug equilibrium is reached (e.g., 30 mins post-infusion).
Feature Mapping & Analysis: Compute EGM features (Table 1) for each mapping point. Generate difference maps (post-drug minus pre-drug) for each feature. Correlate spatial feature changes with zones where VT was rendered non-inducible.

5. Data Analysis & ML Integration Workflow

Diagram 1: EGM processing and ML analysis workflow for drug assessment.

6. Signaling Pathways & Drug Action Context

Diagram 2: From drug target to EGM feature change and efficacy.

Conclusion

Effective EGM signal processing is the critical bridge between raw physiological data and actionable machine learning insights in cardiac electrophysiology. This guide has outlined a complete pathway: from understanding the foundational biophysics and noise, through implementing rigorous preprocessing and diverse feature engineering pipelines, to troubleshooting practical challenges and establishing robust validation frameworks. The key takeaway is that the reliability of any subsequent ML model is fundamentally constrained by the quality and thoughtfulness of this initial signal processing stage. For researchers and drug developers, mastering these techniques enables the derivation of novel, quantitative biomarkers from EGMs that can improve arrhythmia mechanism characterization, ablation target identification, and objective assessment of therapeutic interventions. Future directions will involve greater automation via deep learning-based denoising, standardized processing pipelines for multi-modal data integration (imaging + EGMs), and the development of validated digital endpoints for use in clinical trials, ultimately accelerating the translation of computational analysis into improved patient care.