-
CuTe DSL - Notes
My notes on CuTe DSL and FA3 Dissection
-
CUTLASS WGMMA on Hopper - Notes
My notes on WGMMA internals from the Colfax Research CUTLASS Hopper GEMM blog
-
Investigating Flaky `test_eagle_dp` — Batch Invariance Failure on L4 GPUs
Investigative notes and fixes for the `test_eagle_dp` CI tests in vLLM
-
GEMM Kernel Optimization Notes
My notes from Simon Boehm's CUDA GEMM optimization blog
-
SiLU+Mul+FP8 Block Quant Pattern Matching Pipeline - vLLM Notes
Detailed walkthrough of vLLM's torch.compile pattern-matching pipeline, which fuses SiLU+Mul and FP8 block quantization into a single kernel launch; covers FX graphs, matchers, and the dispatch machinery