Articles by mezark
63

What happens when you run a CUDA kernel? (fergusfinn.com)

3

A running list of reasons to move to open source (whyopensource.ai)

1

MoE inference optimizations: 15% lower expert load by request reordering (doubleword.ai)

1

Tensor Network Attention (mainlymatmul.com)

5

Redundant Information in LLM Weights (fergusfinn.com)

1

Tans: Precomputing RANS (fergusfinn.com)

2

Also-RANS: Asymmetric Numeral Systems for Entropy Coding (fergusfinn.com)

4

70x faster cold(ish) starts for SGLang (fergusfinn.com)

1

QueueSpec – drafting speculation tokens while a request queues (doubleword.ai)

1

ZeroDP: Just-in-Time Weight Offloading over NVLink for Data Parallelism (mainlymatmul.com)

1

Parallel Primitives for Multi-Agent Workflows (fergusfinn.com)

2

New fastest AI Model Gateway – 450x less overhead than LiteLLM (github.com/doublewordai)

4

Should GPUs Make Free Trade Agreements? (doubleword.ai)

2

Controlled generation of OS LLMs – without impacting latency (youtube.com)

3

Takeoff Inference Server Is Now Open Source (github.com/titanml)

4

Falcon 7B running real time on CPU (youtube.com)