Codex fixes
Some checks failed
Deploy Website / build-and-deploy (push) Has been cancelled

This commit is contained in:
2026-05-25 09:49:40 -04:00
parent 78ec3d58e3
commit 014b1836c0
101 changed files with 1048 additions and 7327 deletions


@@ -1,7 +1,7 @@
---
title: 'Blackwell: Datacenter vs GeForce GPUs'
date: '2026-02-27'
description: 'Jensen scammed us.'
description: 'GeForce Blackwell and datacenter Blackwell expose meaningfully different tensor-core capabilities.'
tags: ['Nvidia', 'GPU', 'GPU Kernel']
---
@@ -28,24 +28,43 @@ any kind of work load. Why did I have to dig so hard to find this information? T
I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low-precision format. </SideNote> I downloaded the CUTLASS repo and ran the NVFP4 matrix multiply example. Here's what I got:
![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png)
<Image
src="/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png"
alt="A screenshot of a CUTLASS NVFP4 matrix multiplication benchmark on an RTX 5090"
width={236}
height={77}
/>
Over a PETAFLOP of NVFP4 compute! GGs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and `TMEM`, but I did get a petaflop of NVFP4 compute.
Nsight Compute tells us exactly what we would expect:
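For reference, a GEMM benchmark usually gets its headline throughput by dividing the operation count, 2·M·N·K (one multiply and one add per inner-product term), by the measured runtime. Here is a minimal Python sketch of that arithmetic. The 65536 problem size is only a guess based on the screenshot's filename, and the 0.5 s runtime is a made-up placeholder, not a measurement from my card.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in an (m x k) @ (k x n) matrix multiply.

    Each of the m*n outputs needs k multiplies and k adds, hence 2*m*n*k.
    """
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s for a GEMM that took `seconds` to run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical inputs: square 65536 problem, placeholder runtime of 0.5 s.
    size = 65536
    runtime_s = 0.5
    print(f"{achieved_tflops(size, size, size, runtime_s):.1f} TFLOP/s")
```

Anything above 1000 TFLOP/s from this calculation is the "over a petaflop" territory the screenshot shows.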
![Nisght Compute shows registers spilling, due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)
<Image
src="/images/1_blackwell_dc_vs_gf/geforce_ncu.png"
alt="Nsight Compute showing register pressure and memory bottlenecks on a GeForce GPU"
width={1974}
height={807}
/>
Tensor cores are so fast that the memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that.
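One way to reason about "the memory is bottlenecking them" is the classic roofline model: a kernel can go no faster than the smaller of peak compute and (arithmetic intensity × memory bandwidth). The sketch below is a generic illustration in Python, not something Nsight Compute produced, and the peak and bandwidth numbers in it are placeholders I picked for the example rather than specs for any real GPU.

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where compute and memory limits meet."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def attainable_flops_per_s(
    intensity_flops_per_byte: float,
    peak_flops_per_s: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Roofline bound: limited by memory below the ridge point, by compute above it."""
    memory_bound = intensity_flops_per_byte * bandwidth_bytes_per_s
    return min(peak_flops_per_s, memory_bound)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real hardware specs.
    peak = 1.0e15        # 1 PFLOP/s of tensor-core compute
    bandwidth = 1.0e12   # 1 TB/s for whichever memory level feeds the math units

    print("ridge point:", ridge_point(peak, bandwidth), "FLOPs/byte")
    for intensity in (100.0, 1000.0, 10000.0):
        bound = attainable_flops_per_s(intensity, peak, bandwidth)
        print(f"intensity {intensity:>8.0f} FLOPs/byte -> {bound / 1e12:.0f} TFLOP/s")
```

The faster the tensor cores get relative to the memory feeding them, the further the ridge point moves to the right, and the more data reuse a kernel needs before it stops being memory-bound.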
![Look at all that memory. Nvtop from my dreams.](/images/1_blackwell_dc_vs_gf/nvtop_b200.png)
<Image
src="/images/1_blackwell_dc_vs_gf/nvtop_b200.png"
alt="nvtop showing B200 GPU memory capacity"
width={647}
height={106}
/>
To see how the GPU folk with datacenters live, I booted up a Vast.ai instance and ran the same matmul, but with CUTLASS kernels for `sm_100a`.
![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png)
<Image
src="/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png"
alt="A CUTLASS benchmark result from a B200 GPU"
width={300}
height={108}
/>
We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
This is amazing. I wish I could get a taste of this locally.
Why, Jensen, why?