Equana achieves ~430 GFLOPS using WebAssembly—matching native SSE performance entirely in your browser.
*Measured on an AMD Ryzen 9 9950X
Three core optimizations enable native-level performance in the browser.
WebAssembly SIMD processes 2 double-precision values per instruction, equivalent to native SSE. Each vector operation handles 128 bits of data simultaneously.
Scales computation across 16+ CPU cores using Web Workers with SharedArrayBuffer for shared memory access.
Applies 64×64 blocking for L1/L2 cache efficiency and register tiling with 4×4 micro-kernels for maximum throughput.
SIMD (Single Instruction, Multiple Data) processes multiple values per CPU cycle. Wider vectors = more throughput.
128-bit (SSE / WASM SIMD): 2 doubles per instruction
256-bit (AVX2): 4 doubles per instruction
512-bit (AVX-512): 8 doubles per instruction
Wider vectors can process more data per instruction, but gains don't scale linearly. Cache sizes, memory bandwidth, and other constraints limit real-world speedups. Still, AVX2 typically achieves ~2x and AVX-512 up to ~3x over 128-bit SIMD in matrix operations.
16-thread matrix multiplication benchmark on Ryzen 9950X. WASM compared to highly optimized native OpenBLAS kernels.
Equana achieves 100% of native SSE performance—matching optimized OpenBLAS kernels in the browser.
WebAssembly currently supports 128-bit SIMD. AVX2 (256-bit) and AVX-512 results are shown for reference.
The OpenBLAS reference implementation targets older CPUs with SSE and smaller caches, while our WASM implementation is tuned for modern hardware. Optimizing the native reference for current architectures would likely reach ~500 GFLOPS, revealing the typical performance gap between native and WASM execution due to JIT overhead.
Every optimization matters when working within browser constraints. Five layers of tuning—from cache blocking to multithreading—achieve native-class throughput entirely in the browser.
Basic nested loops, no optimization
64×64 blocking for L1/L2 cache
128-bit f64x2 vectors (2 doubles per op)
Register tiling with micro-kernels
SharedArrayBuffer + Web Workers across 16 threads
Note: These optimizations are a work in progress and not yet fully integrated into Equana. Once complete, they will replace WebAssembly builds of BLAS/NumPy with hand-tuned kernels written specifically for WebAssembly.
Dive deeper into WebAssembly and benchmarks
Run matrix operations at near-native speed—right in your browser. No installation required.