
The Real-Time Challenge

Real-time EEG processing is a fundamentally different problem from offline analysis. When you are running a neurofeedback session or a closed-loop brain stimulation protocol, the only number that matters is your worst-case execution time. Average latency is meaningless. If your system runs at 5ms for 999 out of 1,000 cycles but spikes to 200ms on the one remaining cycle, your feedback loop is broken.

The reason this is hard: EEG microstates, the quasi-stable topographic patterns that represent discrete moments of brain processing, last roughly 80 to 120 milliseconds. If your system introduces jitter on that timescale, you are feeding back information about a brain state that has already transitioned. The feedback becomes noise.

A general-purpose operating system is the primary enemy here. The Linux kernel, even with PREEMPT_RT patches, can introduce scheduling jitter of tens of milliseconds. Context switches, page faults, interrupt coalescing - any of these can blow your latency budget on a single cycle. Windows is worse. macOS is not even in the conversation.

Most commercial neurofeedback systems today operate with total closed-loop delays between 300ms and 1,000ms. That is enough for slow cortical potential training, but it is far too slow for protocols targeting specific oscillatory states. The current state of the art in research systems achieves total loop latency under 50ms. We are targeting under 10ms, which requires rethinking every layer of the stack from electrode to feedback output.

300-1000ms - Commercial systems
< 50ms - Research state of the art
< 10ms - BitBlend target

Acquisition Hardware

The analog front-end determines your signal quality ceiling. Everything downstream - filtering, feature extraction, classification - can only degrade what the ADC captures. The industry standard for research-grade EEG acquisition is the Texas Instruments ADS1299, and for good reason.

The ADS1299 is a 24-bit delta-sigma analog-to-digital converter designed specifically for biopotential measurement. Delta-sigma conversion works by massively oversampling the input signal and then decimating, which pushes quantization noise out of the band of interest. The result is an effective resolution that far exceeds what you would get from a successive-approximation ADC at the same sample rate.
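The oversample-then-decimate idea can be sketched with a crude boxcar decimator. This is a deliberate simplification: a real delta-sigma ADC uses sinc decimation filters, and `decimate_boxcar` is an illustrative name of ours, not part of any driver API.

```python
def decimate_boxcar(samples, osr):
    """Average each block of `osr` samples into one output sample.

    A crude stand-in for the sinc decimation filter in a delta-sigma
    ADC: averaging N samples of uncorrelated quantization noise
    reduces its standard deviation by sqrt(N), which is where the
    extra effective resolution comes from.
    """
    n_out = len(samples) // osr
    return [sum(samples[i * osr:(i + 1) * osr]) / osr for i in range(n_out)]
```

For example, acquiring at 16 kSPS and decimating to a 500 SPS processing rate corresponds to an oversampling ratio of 32.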

Parameter             Value
Resolution            24-bit
Input-referred noise  1 uVpp @ 70Hz BW
CMRR                  -120 dB
Programmable gain     1x, 2x, 4x, 6x, 8x, 12x, 24x
Sample rate           250 SPS to 16 kSPS
Channels per device   8 (daisy-chainable)

The -120 dB CMRR is critical. Common-mode rejection ratio determines how well the system suppresses signals that appear identically on both the positive and negative inputs - primarily 50/60Hz mains interference. At -120dB, you are attenuating common-mode interference by a factor of one million. This is what makes it possible to measure microvolt-level neural signals in an electrically noisy clinical environment.
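The factor-of-one-million figure follows directly from the definition of decibels; a quick check (the function name is ours, for illustration):

```python
def db_to_amplitude_ratio(db):
    # 20 * log10(ratio) = dB  ->  ratio = 10 ** (dB / 20)
    # The sign convention only indicates attenuation, so take abs().
    return 10 ** (abs(db) / 20)
```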

The ADS1299 is the ADC inside the OpenBCI Cyton board and the majority of research-grade EEG systems that have appeared in the last decade. When you see a paper citing a "custom" EEG system, it is almost certainly built around this chip.

A common misconception in EEG: absolute electrode impedance matters less than impedance balance between electrodes. The clinical standard of keeping impedance below 5 kohm for wet Ag/AgCl electrodes exists primarily because high impedance increases susceptibility to capacitively-coupled interference. But if your positive and negative electrodes have matched impedance - even at 20 kohm - the differential amplifier will reject common-mode noise effectively. Impedance mismatch is what kills your signal quality.
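The impedance-balance argument can be made quantitative with a simple potential-divider model: each electrode impedance forms a divider with the amplifier's input impedance, and any mismatch converts common-mode voltage into a differential error. The function name, the 100 mV mains-pickup figure, and the 1 Gohm input impedance below are illustrative assumptions, not measured values.

```python
def cm_to_diff_error(v_cm, z_e1, z_e2, z_in):
    """Differential error produced when a common-mode voltage v_cm
    divides unequally across two electrode impedances (z_e1, z_e2)
    and the amplifier input impedance z_in."""
    return v_cm * (z_e1 / (z_e1 + z_in) - z_e2 / (z_e2 + z_in))
```

With a 1 Gohm input impedance, matched 20 kohm electrodes convert none of a 100 mV common-mode pickup, while a 5 kohm / 25 kohm mismatch converts roughly 2 uV of it into differential error - on the order of the neural signals you are trying to measure.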

DMA-Based Data Acquisition

Once the ADC has digitized the signal, the next problem is getting that data into memory without burning CPU cycles. This is where Direct Memory Access comes in. DMA is a hardware mechanism that allows a peripheral device to transfer data directly into system memory without involving the CPU at all during the transfer.

The architecture is straightforward: the ADS1299 asserts a data-ready signal, which triggers the DMA controller. The DMA controller reads the sample from the SPI bus and writes it into a pre-allocated ring buffer in shared memory. The CPU is completely uninvolved until the DMA controller signals that a buffer segment is full. At that point, an interrupt fires and the processing pipeline begins work on the completed segment while the DMA controller continues filling the next segment.

This is a zero-copy architecture. The data moves exactly once: from the ADC into the ring buffer. There is no intermediate staging area, no copy from a driver buffer to a userspace buffer, no serialization-deserialization overhead. The sample lands in the same memory region that the DSP code will read from.

The practical consequence is that acquisition becomes an O(1) operation from the CPU's perspective. Whether you are acquiring 8 channels or 128 channels, the CPU cost per acquisition cycle is constant - it is just configuring the DMA descriptor and handling the completion interrupt. All the actual data movement happens on a dedicated bus that does not compete with the CPU for instruction bandwidth.

For our architecture, the ring buffer is segmented into blocks sized to match the processing pipeline's input granularity. The DMA controller fills one block while the DSP core processes the previous block. On systems with dedicated DSP cores or FPGA fabric, the processing can be pinned to a core that never handles interrupts and never runs operating system code.
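The fill-one-segment-while-processing-the-other scheme can be modeled in a few lines of host-side Python. The class and method names are ours; the real implementation is a DMA descriptor chain and a completion ISR in bare-metal C.

```python
class RingBuffer:
    """Host-side model of the segmented DMA ring buffer: the 'DMA'
    writer fills one segment while the 'DSP' reader processes a
    previously completed one."""

    def __init__(self, n_segments, segment_len):
        self.segments = [[0] * segment_len for _ in range(n_segments)]
        self.n_segments = n_segments
        self.segment_len = segment_len
        self.write_seg = 0   # segment the DMA engine is currently filling
        self.write_idx = 0
        self.ready = []      # completed segment indices awaiting the DSP

    def dma_write(self, sample):
        """One DMA transfer: store a sample; on segment boundary,
        mark the segment ready (the 'completion interrupt') and
        advance to the next segment."""
        self.segments[self.write_seg][self.write_idx] = sample
        self.write_idx += 1
        if self.write_idx == self.segment_len:
            self.ready.append(self.write_seg)
            self.write_seg = (self.write_seg + 1) % self.n_segments
            self.write_idx = 0

    def next_ready(self):
        """Index of the oldest completed segment, or None."""
        return self.ready.pop(0) if self.ready else None
```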

FPGA Signal Processing Pipelines

An FPGA's fundamental advantage for signal processing is not raw speed - a modern CPU or GPU will beat it on throughput for most workloads. The advantage is deterministic, pipeline-parallel execution. In an FPGA, you can build a processing pipeline where filtering, feature extraction, and classification all run simultaneously on different stages of the pipeline. Every clock cycle, a new sample enters stage one while previous samples progress through later stages. There is no scheduling, no cache misses, no branch misprediction. The pipeline produces one result per clock cycle at steady state.

The published results on FPGA-accelerated EEG processing are substantial. A Xilinx Zynq-7030 SoC (which pairs ARM Cortex-A9 cores with FPGA fabric on a single die) has been demonstrated handling 128-channel EEG acquisition and processing in a single device, with the ARM cores managing configuration and communication while the FPGA fabric handles the signal processing pipeline.

More striking are the energy-efficiency numbers. A Virtex-7 690T FPGA implementation of seizure detection achieved a 1.32x throughput improvement over an NVIDIA K40c GPU and an 11x improvement over a Xeon E5-2860 CPU. But the real story is energy efficiency: 6.1x better than the GPU and 26.6x better than the CPU, measured in classifications per joule. For battery-powered or thermally constrained medical devices, this is the decisive factor.

Specific FPGA use cases in our pipeline: automatic seizure detection using convolutional classifiers implemented in fabric, LSTM network acceleration for causal filtering (replacing traditional acausal filters that inherently introduce delay), and wavelet coherence computation for real-time connectivity analysis between electrode pairs.

1.32x - throughput vs. K40c GPU
11x - throughput vs. Xeon E5 CPU
6.1x - energy efficiency vs. GPU
26.6x - energy efficiency vs. CPU

Closed-Loop Feedback Systems

A closed-loop neurofeedback system has three latency components, and they are not equally expensive. Data collection - acquiring enough samples to compute a meaningful spectral estimate - takes tens of milliseconds. Feedback generation - rendering a visual stimulus, triggering a TMS pulse, or modulating an audio stream - takes tens of milliseconds. The bottleneck sits in the middle: filtering and spectral power estimation, which in conventional systems takes hundreds of milliseconds.

The reason filtering is the bottleneck is that traditional EEG frequency analysis uses acausal filters. An acausal filter uses both past and future samples to compute its output, which means it must buffer data before it can produce a result. A standard zero-phase Butterworth bandpass filter operating on the alpha band (8-13 Hz) needs to see roughly 500ms of data before it can produce a reliable power estimate. That delay alone exceeds our total latency budget.
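The cost of look-ahead is easy to quantify in the linear-phase FIR case, which exhibits the same kind of buffering delay as forward-backward IIR filtering (a sketch with illustrative tap counts; the function name is ours):

```python
def fir_group_delay_ms(n_taps, fs_hz):
    """A linear-phase FIR delays every frequency component by
    (N - 1) / 2 samples; convert that to milliseconds."""
    return (n_taps - 1) / 2 / fs_hz * 1000.0
```

At 500 SPS, a 501-tap filter - on the order of what sharp 8-13 Hz selectivity demands - delays its output by a full 500ms, while even a modest 65-tap filter costs 64ms, already past a 10ms budget.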

The alternative is causal filtering, where the filter only uses past and present samples. The problem with causal FIR and IIR filters is that they introduce phase distortion and have poor frequency selectivity at short filter orders. This is where LSTM-based causal filtering changes the equation. A trained LSTM network can learn to approximate the output of an acausal filter using only causal (past) inputs. Implemented on FPGA fabric, this replaces hundreds of milliseconds of buffering delay with a single forward pass through the network, achievable in under a millisecond.
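The forward pass itself is small. Below is a single-unit, scalar-state LSTM cell step, purely to show the shape of the computation that replaces the buffering delay - the real network is multi-unit, trained offline against the acausal filter's output, and implemented in fixed-point FPGA fabric; all names here are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM cell.

    `w` maps each gate name to an (input-weight, recurrent-weight,
    bias) triple for the input (i), forget (f), candidate (g) and
    output (o) gates.
    """
    def gate(name, act):
        wi, wh, b = w[name]
        return act(wi * x + wh * h_prev + b)

    i = gate('i', sigmoid)
    f = gate('f', sigmoid)
    g = gate('g', math.tanh)
    o = gate('o', sigmoid)
    c = f * c_prev + i * g      # new cell state
    h = o * math.tanh(c)        # new hidden state = filter output
    return h, c
```

Each output sample requires only this fixed amount of arithmetic on past state - no future samples - which is what makes a sub-millisecond forward pass in fabric plausible.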

At the firmware level, the inner loops run as bare-metal C/C++ with architecture-specific intrinsics. On ARM Cortex-M and Cortex-A processors, the SMLAD instruction (signed multiply-accumulate dual) computes two multiply-accumulates in a single cycle, which maps directly to FIR filter computation. A 64-tap FIR filter on a Cortex-M7 at 400MHz completes in under 200 nanoseconds per sample using this instruction.
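A host-side behavioral model of the SMLAD dataflow shows why the instruction halves the MAC count - each "instruction" consumes two sample/coefficient pairs at once. The function below is a model of ours, not the intrinsic itself.

```python
def fir_smlad_model(samples, coeffs):
    """Behavioral model of an SMLAD-based FIR inner loop: each
    iteration accumulates two products, so a 64-tap filter needs
    only 32 multiply-accumulate instructions."""
    assert len(samples) == len(coeffs) and len(coeffs) % 2 == 0
    acc = 0
    for k in range(0, len(coeffs), 2):
        # one SMLAD: acc += s[k]*c[k] + s[k+1]*c[k+1]
        acc += samples[k] * coeffs[k] + samples[k + 1] * coeffs[k + 1]
    return acc
```

On the real target this loop body becomes a single dual-MAC instruction per pair (e.g. via the CMSIS `__SMLAD` intrinsic), with 16-bit samples and coefficients packed two per 32-bit word.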

The latency budget for a 10ms closed-loop: approximately 2ms for acquisition (one DMA block at 500 SPS), under 1ms for LSTM-based causal filtering on FPGA, 2ms for feature extraction and classification, and 5ms margin for feedback generation and jitter tolerance. Every component must be deterministic - a single non-deterministic element breaks the entire budget.
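Restated as data, the budget sums exactly to the target (the numbers are copied from the text above; the dict is just a sanity check):

```python
# Closed-loop latency budget, in milliseconds.
budget_ms = {
    'acquisition (one DMA block @ 500 SPS)': 2.0,
    'LSTM causal filter (FPGA)': 1.0,
    'feature extraction + classification': 2.0,
    'feedback generation + jitter margin': 5.0,
}
total = sum(budget_ms.values())
```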

Deterministic Execution

For medical devices operating in closed-loop, deterministic means something specific: the worst-case execution time for any processing cycle must be bounded and known in advance. An RTOS like FreeRTOS or Zephyr can provide priority-based preemptive scheduling with bounded interrupt latency, typically in the single-digit microsecond range. This is good enough for many applications.

But an RTOS still has overhead: context save/restore, scheduler decision logic, mutex and semaphore management for shared resources. For our most latency-critical paths, we run bare-metal. The processing pipeline is implemented as a finite state machine driven directly by hardware interrupts. When the DMA completion interrupt fires, the ISR transitions the state machine to the processing state, which runs the filter and classification pipeline to completion before returning. There is no task switch, no scheduler involvement, no kernel overhead.
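The interrupt-driven state machine can be sketched as follows - a Python model of the control flow only, with names of ours; the real ISR is bare-metal C with statically allocated state.

```python
IDLE, PROCESSING = 'IDLE', 'PROCESSING'


class PipelineFSM:
    """Model of the ISR-driven pipeline: the DMA-completion 'interrupt'
    drives processing to completion before returning, with no scheduler
    or task switch in between."""

    def __init__(self, process_fn):
        self.state = IDLE
        self.process_fn = process_fn
        self.results = []

    def on_dma_complete(self, block):
        # A new block must not arrive before the previous cycle finishes;
        # in hardware this invariant is what the latency budget guarantees.
        assert self.state == IDLE
        self.state = PROCESSING
        self.results.append(self.process_fn(block))  # run to completion
        self.state = IDLE
```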

The trade-off is obvious: bare-metal code is harder to write, harder to debug, and harder to maintain. You lose all OS services - no dynamic memory allocation, no file systems, no networking stack. Everything must be statically allocated and deterministically bounded at compile time. For the sensor acquisition and signal processing path, this trade-off is worth it. For non-latency-critical tasks like configuration, logging, and communication with the host system, we run them on a separate core or processor under a conventional RTOS.

The hardware abstraction layer sits between the bare-metal processing code and the physical peripherals. It provides a consistent interface for DMA configuration, ADC control, and interrupt management, while being thin enough that it does not add measurable latency. The key constraint: it bypasses the general-purpose kernel scheduler entirely. Interrupt priorities are configured in hardware, and the highest-priority interrupt (DMA completion from the ADC) can preempt anything, including other interrupt handlers.

Bounded interrupt latency on a Cortex-M7 with no other interrupts pending is 12 cycles - roughly 30 nanoseconds at 400MHz. Even with interrupt nesting, worst-case latency stays under a microsecond. That is three orders of magnitude below our latency budget, which is exactly the margin you want for a medical device.

Pipeline Architecture

Pipeline summary: Electrodes (Ag/AgCl, ~5 kohm) -> ADC (ADS1299, 24-bit, up to 16 kSPS) -> DMA (zero-copy, O(1)) -> Ring buffer (shared memory, double-buffered) -> DSP (FPGA pipeline, < 1ms) -> Feedback output (< 10ms total loop)
