Codex fixes

2026-05-25 09:49:40 -04:00
parent 78ec3d58e3
commit 014b1836c0
101 changed files with 1048 additions and 7327 deletions
--- a/content/posts/blackwell_datacenter_vs_geforce.mdx
+++ b/content/posts/blackwell_datacenter_vs_geforce.mdx
@@ -1,7 +1,7 @@
 ---
 title: 'Blackwell: Datacenter vs GeForce GPUs'
 date: '2026-02-27'
-description: 'Jensen scammed us.'
+description: 'GeForce Blackwell and datacenter Blackwell expose meaningfully different tensor-core capabilities.'
 tags: ['Nvidia', 'GPU', 'GPU Kernel']
 ---

@@ -28,24 +28,43 @@ any kind of work load. Why did I have to dig so hard to find this information? T

 I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low precision format. </SideNote> I downloaded the cutlass repo and ran the nvfp4 matrix multiply example. Here's what I got

-![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png)
+<Image
+  src="/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png"
+  alt="A screenshot of a CUTLASS NVFP4 matrix multiplication benchmark on an RTX 5090"
+  width={236}
+  height={77}
+/>

 Over a PETA FLOP of nvfp4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of nvfp4 compute.
 Nsight Compute tells us exactly what we would expect

-![Nisght Compute shows registers spilling, due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)
+<Image
+  src="/images/1_blackwell_dc_vs_gf/geforce_ncu.png"
+  alt="Nsight Compute showing register pressure and memory bottlenecks on a GeForce GPU"
+  width={1974}
+  height={807}
+/>

 Tensor cores are so fast that the memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess nvidia realised this and created `tcgen05` but we don't get to see any of that.

-![Look at all that memory. Nvtop from my dreams.](/images/1_blackwell_dc_vs_gf/nvtop_b200.png)
+<Image
+  src="/images/1_blackwell_dc_vs_gf/nvtop_b200.png"
+  alt="nvtop showing B200 GPU memory capacity"
+  width={647}
+  height={106}
+/>


 To see how the GPU folk with datacenters live, I booted up a vast ai instance and ran the same matmul, but with cutlass kernels for `sm_100a`.

-![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png)
+<Image
+  src="/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png"
+  alt="A CUTLASS benchmark result from a B200 GPU"
+  width={300}
+  height={108}
+/>

 We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the geforce cards. 
 This is amazing, I wish I'd be able to get a taste of this locally.

 Why Jensen, why. 
-