WebAssembly SIMD + Multi-threading

Native Performance. In Your Browser.

Equana achieves ~430 GFLOPS using WebAssembly—matching native SSE performance entirely in your browser.

*measured on a Ryzen 9950X

430
GFLOPS in Browser
16
Parallel Threads
100%
of Native SSE

How It Works

Three core optimizations enable native-level performance in the browser.

Learn more at webassembly.org

128-bit SIMD Vectorization

WebAssembly SIMD processes 2 double-precision values per instruction, equivalent to native SSE. Each vector operation handles 128 bits of data simultaneously.
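
As a minimal sketch of the idea, here is what an f64x2 loop can look like in C using clang/Emscripten's <wasm_simd128.h> intrinsics (compile with -msimd128). The function axpy_f64x2 is an illustrative example, not an Equana kernel:

```c
#include <wasm_simd128.h>

/* y[i] += a * x[i], two doubles per SIMD instruction.
   Assumes n is even and pointers address linear WASM memory. */
void axpy_f64x2(double a, const double *x, double *y, int n) {
    v128_t va = wasm_f64x2_splat(a);        /* broadcast a into both lanes */
    for (int i = 0; i < n; i += 2) {
        v128_t vx = wasm_v128_load(&x[i]);  /* load 128 bits = 2 doubles  */
        v128_t vy = wasm_v128_load(&y[i]);
        vy = wasm_f64x2_add(vy, wasm_f64x2_mul(va, vx));
        wasm_v128_store(&y[i], vy);         /* store 2 results at once    */
    }
}
```

Each f64x2 operation maps to one 128-bit WASM instruction, which is exactly the two-doubles-per-op ratio described above.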

Multi-threaded Execution

Scales computation across 16+ CPU cores using Web Workers with SharedArrayBuffer for shared memory access.
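
A hypothetical sketch of that partitioning in C: when built with Emscripten's -pthread flag, each POSIX thread below runs in a Web Worker, and all of them share one SharedArrayBuffer-backed heap. The 16-way row split, sizes, and names are illustrative:

```c
#include <pthread.h>

#define N 1024          /* assumed square matrix size          */
#define NTHREADS 16     /* one thread per core on a 16T budget */

typedef struct { const double *a, *b; double *c; int row0, row1; } job_t;

/* Each worker computes a horizontal slice of C = A * B. */
static void *worker(void *arg) {
    job_t *t = (job_t *)arg;
    for (int i = t->row0; i < t->row1; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                t->c[i * N + j] += t->a[i * N + k] * t->b[k * N + j];
    return NULL;
}

void matmul_mt(const double *a, const double *b, double *c) {
    pthread_t tid[NTHREADS];
    job_t jobs[NTHREADS];
    int rows = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        jobs[t] = (job_t){ a, b, c, t * rows, (t + 1) * rows };
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

Note that SharedArrayBuffer requires the page to be cross-origin isolated (COOP/COEP headers), so a threaded build must be served accordingly.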

Cache-Optimized Kernels

64×64 blocking for L1/L2 cache efficiency, register tiling with 4×4 micro-kernels for maximum throughput.
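
As a minimal sketch, assuming square N×N row-major matrices, the blocking idea looks like this in C. SIMD and the 4×4 register tiling that a production kernel adds are omitted here; a micro-kernel sketch appears later in the optimization breakdown:

```c
#define N   1024   /* assumed square size, multiple of BLK            */
#define BLK 64     /* tile edge sized so working tiles stay in L1/L2  */

void matmul_blocked(const double *a, const double *b, double *c) {
    for (int ii = 0; ii < N; ii += BLK)
        for (int kk = 0; kk < N; kk += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                /* one BLK x BLK tile stays cache-resident here */
                for (int i = ii; i < ii + BLK; i++)
                    for (int k = kk; k < kk + BLK; k++) {
                        double aik = a[i * N + k];
                        for (int j = jj; j < jj + BLK; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}
```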

The Vector Width Gap

SIMD (Single Instruction, Multiple Data) processes multiple values per CPU cycle. Wider vectors = more throughput.

128-bit
WASM SIMD (WebAssembly)
2 doubles per instruction

256-bit
AVX / AVX2
4 doubles per instruction

512-bit
AVX-512
8 doubles per instruction

The Reality

Wider vectors can process more data per instruction, but gains don't scale linearly. Cache sizes, memory bandwidth, and other constraints limit real-world speedups. Still, AVX2 typically achieves ~2x and AVX-512 up to ~3x over 128-bit SIMD in matrix operations.

How We Stack Up

16-thread matrix multiplication benchmark on Ryzen 9950X. WASM compared to highly optimized native OpenBLAS kernels.

Run benchmarks on your hardware

Browser WebAssembly

Equana WASM SIMD (16T): 430.00 GFLOPS

Native SSE (128-bit)

OpenBLAS SSE (16T): 400.00 GFLOPS
NumPy/OpenBLAS SSE (16T): 430.00 GFLOPS

Native AVX/AVX-512

OpenBLAS AVX2 (16T): 960.00 GFLOPS
OpenBLAS AVX-512 (16T): 1300.00 GFLOPS

Matching Native SSE*

Equana achieves 100% of native SSE performance—matching optimized OpenBLAS kernels in the browser.

Reference Benchmarks

WebAssembly currently supports 128-bit SIMD. AVX2 (256-bit) and AVX-512 results are shown for reference.

*Comparison to SSE performance

The OpenBLAS reference implementation targets older CPUs with SSE and smaller caches, while our WASM implementation is tuned for modern hardware. A native reference optimized for current architectures would likely reach ~500 GFLOPS, revealing the typical performance gap between native and WASM execution due to JIT overhead.

Pushing WASM to the Limit

Every optimization matters when working within browser constraints. Four layers of tuning, from cache blocking to multithreading, take a naive baseline to native-class throughput entirely in the browser.

Naive Implementation

1.5 GFLOPS

Basic nested loops, no optimization


Cache-Optimized

5.6x → ~8.4 GFLOPS

64x64 blocking for L1/L2 cache


SIMD Vectorization

2.4x → ~20 GFLOPS

128-bit f64x2 vectors (2 doubles per op)


Block Packing (4x4)

1.8x → ~36 GFLOPS

Register tiling with micro-kernels (see the code sketch after this breakdown)


Multithreading

11.8x → 430 GFLOPS

SharedArrayBuffer + Web Workers across 16 threads

Total Speedup
287x faster
From 1.5 GFLOPS → 430 GFLOPS
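
To make the Block Packing stage concrete, here is a hedged sketch of a 4x4 micro-kernel built on f64x2 intrinsics. The packed panel layout and function name are assumptions for illustration, not Equana's actual kernel. It keeps a 4x4 tile of C in eight vector registers while streaming packed slivers of A and B:

```c
#include <wasm_simd128.h>

/* Hypothetical 4x4 micro-kernel. Packed A panel: a[4*k + r] = A[i+r][k].
   Packed B panel: b[4*k + c] = B[k][j+c]. kc is the packed depth, ldc the
   row stride of C. Accumulates into the 4x4 tile of C at c. */
void microkernel_4x4(int kc, const double *a, const double *b,
                     double *c, int ldc) {
    v128_t c00 = wasm_f64x2_splat(0.0), c01 = c00,
           c10 = c00, c11 = c00,
           c20 = c00, c21 = c00,
           c30 = c00, c31 = c00;
    for (int k = 0; k < kc; k++) {
        v128_t b0 = wasm_v128_load(&b[4 * k]);       /* B[k][0..1] */
        v128_t b1 = wasm_v128_load(&b[4 * k + 2]);   /* B[k][2..3] */
        v128_t a0 = wasm_f64x2_splat(a[4 * k + 0]);  /* A row 0    */
        v128_t a1 = wasm_f64x2_splat(a[4 * k + 1]);
        v128_t a2 = wasm_f64x2_splat(a[4 * k + 2]);
        v128_t a3 = wasm_f64x2_splat(a[4 * k + 3]);
        c00 = wasm_f64x2_add(c00, wasm_f64x2_mul(a0, b0));
        c01 = wasm_f64x2_add(c01, wasm_f64x2_mul(a0, b1));
        c10 = wasm_f64x2_add(c10, wasm_f64x2_mul(a1, b0));
        c11 = wasm_f64x2_add(c11, wasm_f64x2_mul(a1, b1));
        c20 = wasm_f64x2_add(c20, wasm_f64x2_mul(a2, b0));
        c21 = wasm_f64x2_add(c21, wasm_f64x2_mul(a2, b1));
        c30 = wasm_f64x2_add(c30, wasm_f64x2_mul(a3, b0));
        c31 = wasm_f64x2_add(c31, wasm_f64x2_mul(a3, b1));
    }
    /* Add the accumulated 4x4 tile back into C. */
    double *r0 = &c[0 * ldc], *r1 = &c[1 * ldc],
           *r2 = &c[2 * ldc], *r3 = &c[3 * ldc];
    wasm_v128_store(r0,     wasm_f64x2_add(wasm_v128_load(r0),     c00));
    wasm_v128_store(r0 + 2, wasm_f64x2_add(wasm_v128_load(r0 + 2), c01));
    wasm_v128_store(r1,     wasm_f64x2_add(wasm_v128_load(r1),     c10));
    wasm_v128_store(r1 + 2, wasm_f64x2_add(wasm_v128_load(r1 + 2), c11));
    wasm_v128_store(r2,     wasm_f64x2_add(wasm_v128_load(r2),     c20));
    wasm_v128_store(r2 + 2, wasm_f64x2_add(wasm_v128_load(r2 + 2), c21));
    wasm_v128_store(r3,     wasm_f64x2_add(wasm_v128_load(r3),     c30));
    wasm_v128_store(r3 + 2, wasm_f64x2_add(wasm_v128_load(r3 + 2), c31));
}
```

Because the eight accumulators never leave registers during the k loop, memory traffic per FLOP drops sharply, which is what register tiling buys on top of cache blocking.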

Note: These optimizations are a work in progress and not yet fully integrated into Equana. Once complete, they will replace WebAssembly compilations of BLAS/NumPy with hand-tuned kernels written specifically for WebAssembly.

Experience It Yourself

Run matrix operations at near-native speed—right in your browser. No installation required.