WebAssembly SIMD + Multi-threading

Native Performance. In Your Browser.

Equana achieves ~430 GFLOPS using WebAssembly—matching native SSE performance entirely in your browser.

*measured on a Ryzen 9950X

430
GFLOPS in Browser
16
Parallel Threads
100%
of Native SSE

How It Works

Three core optimizations enable native-level performance in the browser.

Learn more at webassembly.org

128-bit SIMD Vectorization

WebAssembly SIMD processes 2 double-precision values per instruction, equivalent to native SSE. Each vector operation handles 128 bits of data simultaneously.
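
As a minimal sketch of the idea, here is what an f64x2 loop can look like in C using clang/Emscripten's <wasm_simd128.h> intrinsics (compile with -msimd128). The function axpy_f64x2 is an illustrative example, not an Equana kernel:

```c
#include <wasm_simd128.h>

/* y[i] += a * x[i], two doubles per SIMD instruction.
   Assumes n is even and pointers address linear WASM memory. */
void axpy_f64x2(double a, const double *x, double *y, int n) {
    v128_t va = wasm_f64x2_splat(a);        /* broadcast a into both lanes */
    for (int i = 0; i < n; i += 2) {
        v128_t vx = wasm_v128_load(&x[i]);  /* load 128 bits = 2 doubles  */
        v128_t vy = wasm_v128_load(&y[i]);
        vy = wasm_f64x2_add(vy, wasm_f64x2_mul(va, vx));
        wasm_v128_store(&y[i], vy);         /* store 2 results at once    */
    }
}
```

Each f64x2 operation maps to one 128-bit WASM instruction, which is exactly the two-doubles-per-op ratio described above.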

Multi-threaded Execution

Scales computation across 16+ CPU cores using Web Workers with SharedArrayBuffer for shared memory access.
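
A hypothetical sketch of that partitioning in C: when built with Emscripten's -pthread flag, each POSIX thread below runs in a Web Worker, and all of them share one SharedArrayBuffer-backed heap. The 16-way row split, sizes, and names are illustrative:

```c
#include <pthread.h>

#define N 1024          /* assumed square matrix size          */
#define NTHREADS 16     /* one thread per core on a 16T budget */

typedef struct { const double *a, *b; double *c; int row0, row1; } job_t;

/* Each worker computes a horizontal slice of C = A * B. */
static void *worker(void *arg) {
    job_t *t = (job_t *)arg;
    for (int i = t->row0; i < t->row1; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                t->c[i * N + j] += t->a[i * N + k] * t->b[k * N + j];
    return NULL;
}

void matmul_mt(const double *a, const double *b, double *c) {
    pthread_t tid[NTHREADS];
    job_t jobs[NTHREADS];
    int rows = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        jobs[t] = (job_t){ a, b, c, t * rows, (t + 1) * rows };
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

Note that SharedArrayBuffer requires the page to be cross-origin isolated (COOP/COEP headers), so a threaded build must be served accordingly.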

Cache-Optimized Kernels

64×64 blocking for L1/L2 cache efficiency, register tiling with 4×4 micro-kernels for maximum throughput.
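
As a minimal sketch, assuming square N×N row-major matrices, the blocking idea looks like this in C. SIMD and the 4×4 register tiling that a production kernel adds are omitted here; a micro-kernel sketch appears later in the optimization breakdown:

```c
#define N   1024   /* assumed square size, multiple of BLK            */
#define BLK 64     /* tile edge sized so working tiles stay in L1/L2  */

void matmul_blocked(const double *a, const double *b, double *c) {
    for (int ii = 0; ii < N; ii += BLK)
        for (int kk = 0; kk < N; kk += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                /* one BLK x BLK tile stays cache-resident here */
                for (int i = ii; i < ii + BLK; i++)
                    for (int k = kk; k < kk + BLK; k++) {
                        double aik = a[i * N + k];
                        for (int j = jj; j < jj + BLK; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}
```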

The Vector Width Gap

SIMD (Single Instruction, Multiple Data) processes multiple values per CPU cycle. Wider vectors = more throughput.

128-bit
WASM SIMD (WebAssembly)
2 doubles per instruction

256-bit
AVX / AVX2
4 doubles per instruction

512-bit
AVX-512
8 doubles per instruction

The Reality

Wider vectors can process more data per instruction, but gains don't scale linearly. Cache sizes, memory bandwidth, and other constraints limit real-world speedups. Still, AVX2 typically achieves ~2x and AVX-512 up to ~3x over 128-bit SIMD in matrix operations.

How We Stack Up

16-thread matrix multiplication benchmark on Ryzen 9950X. WASM compared to highly optimized native OpenBLAS kernels.

Run benchmarks on your hardware

Browser WebAssembly

Equana WASM SIMD (16T): 430.00 GFLOPS

Native SSE (128-bit)

OpenBLAS SSE (16T): 400.00 GFLOPS
NumPy/OpenBLAS SSE (16T): 430.00 GFLOPS

Native AVX/AVX-512

OpenBLAS AVX2 (16T): 960.00 GFLOPS
OpenBLAS AVX-512 (16T): 1300.00 GFLOPS

Matching Native SSE*

Equana achieves 100% of native SSE performance—matching optimized OpenBLAS kernels in the browser.

Reference Benchmarks

WebAssembly currently supports 128-bit SIMD. AVX2 (256-bit) and AVX-512 results are shown for reference.

*Comparison to SSE performance

The OpenBLAS reference implementation targets older CPUs with SSE and smaller caches, while our WASM implementation is tuned for modern hardware. A native reference optimized for current architectures would likely reach ~500 GFLOPS, revealing the typical performance gap between native and WASM execution due to JIT overhead.

Pushing WASM to the Limit

Every optimization matters when working within browser constraints. Four layers of tuning, from cache blocking to multithreading, take a naive baseline to native-class throughput entirely in the browser.

Naive Implementation

1.5 GFLOPS

Basic nested loops, no optimization


Cache-Optimized

5.6x → ~8.4 GFLOPS

64x64 blocking for L1/L2 cache


SIMD Vectorization

2.4x → ~20 GFLOPS

128-bit f64x2 vectors (2 doubles per op)


Block Packing (4x4)

1.8x → ~36 GFLOPS

Register tiling with micro-kernels (see the code sketch after this breakdown)


Multithreading

11.8x → 430 GFLOPS

SharedArrayBuffer + Web Workers across 16 threads

Total Speedup
287x faster
From 1.5 GFLOPS → 430 GFLOPS
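
To make the Block Packing stage concrete, here is a hedged sketch of a 4x4 micro-kernel built on f64x2 intrinsics. The packed panel layout and function name are assumptions for illustration, not Equana's actual kernel. It keeps a 4x4 tile of C in eight vector registers while streaming packed slivers of A and B:

```c
#include <wasm_simd128.h>

/* Hypothetical 4x4 micro-kernel. Packed A panel: a[4*k + r] = A[i+r][k].
   Packed B panel: b[4*k + c] = B[k][j+c]. kc is the packed depth, ldc the
   row stride of C. Accumulates into the 4x4 tile of C at c. */
void microkernel_4x4(int kc, const double *a, const double *b,
                     double *c, int ldc) {
    v128_t c00 = wasm_f64x2_splat(0.0), c01 = c00,
           c10 = c00, c11 = c00,
           c20 = c00, c21 = c00,
           c30 = c00, c31 = c00;
    for (int k = 0; k < kc; k++) {
        v128_t b0 = wasm_v128_load(&b[4 * k]);       /* B[k][0..1] */
        v128_t b1 = wasm_v128_load(&b[4 * k + 2]);   /* B[k][2..3] */
        v128_t a0 = wasm_f64x2_splat(a[4 * k + 0]);  /* A row 0    */
        v128_t a1 = wasm_f64x2_splat(a[4 * k + 1]);
        v128_t a2 = wasm_f64x2_splat(a[4 * k + 2]);
        v128_t a3 = wasm_f64x2_splat(a[4 * k + 3]);
        c00 = wasm_f64x2_add(c00, wasm_f64x2_mul(a0, b0));
        c01 = wasm_f64x2_add(c01, wasm_f64x2_mul(a0, b1));
        c10 = wasm_f64x2_add(c10, wasm_f64x2_mul(a1, b0));
        c11 = wasm_f64x2_add(c11, wasm_f64x2_mul(a1, b1));
        c20 = wasm_f64x2_add(c20, wasm_f64x2_mul(a2, b0));
        c21 = wasm_f64x2_add(c21, wasm_f64x2_mul(a2, b1));
        c30 = wasm_f64x2_add(c30, wasm_f64x2_mul(a3, b0));
        c31 = wasm_f64x2_add(c31, wasm_f64x2_mul(a3, b1));
    }
    /* Add the accumulated 4x4 tile back into C. */
    double *r0 = &c[0 * ldc], *r1 = &c[1 * ldc],
           *r2 = &c[2 * ldc], *r3 = &c[3 * ldc];
    wasm_v128_store(r0,     wasm_f64x2_add(wasm_v128_load(r0),     c00));
    wasm_v128_store(r0 + 2, wasm_f64x2_add(wasm_v128_load(r0 + 2), c01));
    wasm_v128_store(r1,     wasm_f64x2_add(wasm_v128_load(r1),     c10));
    wasm_v128_store(r1 + 2, wasm_f64x2_add(wasm_v128_load(r1 + 2), c11));
    wasm_v128_store(r2,     wasm_f64x2_add(wasm_v128_load(r2),     c20));
    wasm_v128_store(r2 + 2, wasm_f64x2_add(wasm_v128_load(r2 + 2), c21));
    wasm_v128_store(r3,     wasm_f64x2_add(wasm_v128_load(r3),     c30));
    wasm_v128_store(r3 + 2, wasm_f64x2_add(wasm_v128_load(r3 + 2), c31));
}
```

Because the eight accumulators never leave registers during the k loop, memory traffic per FLOP drops sharply, which is what register tiling buys on top of cache blocking.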

Note: These optimizations are a work in progress and not yet fully integrated into Equana. Once complete, they will replace WebAssembly compilations of BLAS/NumPy with hand-tuned kernels written specifically for WebAssembly.

Experience It Yourself

Run matrix operations at near-native speed—right in your browser. No installation required.