### What's in the new tensor cores?
The Blackwell tensor cores add support for lower precisions, namely FP6 and FP4, which the previous Hopper generation lacked. This enables extremely fast low precision matrix multiplications.
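For context on what FP4 actually looks like, here's a toy sketch of block quantization in the spirit of NVFP4. I'm assuming the E2M1 format (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with a shared scale per small block; Nvidia's real scale encoding is more involved, so treat this purely as illustration.

```python
# Toy sketch of FP4 (E2M1) block quantization, NVFP4-style.
# Assumption: E2M1 represents +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and each
# small block of values shares one scale factor.

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Snap each float to the nearest E2M1 value under a shared block scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    out = []
    for x in block:
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out, scale

vals = [0.1, -0.8, 2.2, 3.0]
deq, scale = quantize_block(vals)
print(deq)  # [0.0, -0.75, 2.0, 3.0]
```

Four bits per value plus one scale per block is why the memory footprint (and hence the achievable throughput) is so favourable compared to FP16.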
The PTX ISA also introduces `tcgen05` instructions, which make use of `TMEM`, or tensor memory, something only the datacenter cards support. This additional memory sits next to the tensor cores and can be used independently of the registers used by the CUDA cores. The GeForce cards get 128KB of shared memory per SM, while the datacenter cards and the Jetson Thor get 228KB SMEM + 256KB TMEM. That is absolutely insane for any kind of workload. Why did I have to dig so hard to find this information? The 5090 is an enthusiast tier card, which I feel deserves a clear description of what you're buying.
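Putting those per-SM numbers side by side makes the gap obvious:

```python
# Per-SM on-chip memory, using the figures above (sizes in KB).
geforce_smem = 128
dc_smem, dc_tmem = 228, 256

dc_total = dc_smem + dc_tmem
print(dc_total)                           # 484 KB total on-chip per SM
print(round(dc_total / geforce_smem, 2))  # ~3.78x what a GeForce SM gets
```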
I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low precision format.</SideNote> I downloaded the cutlass repo and ran the nvfp4 matrix multiply example. Here's what I got:
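If you want to reproduce this, the build looks roughly like the following. `CUTLASS_NVCC_ARCHS` is cutlass's CMake option for target architectures, and `sm_120a` is the Blackwell GeForce arch; the example and target names change between cutlass releases, so the ones in angle brackets are placeholders, not real targets.

```shell
# Rough shape of the build; example/target names are placeholders.
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass && mkdir build && cd build

# 120a targets the Blackwell GeForce cards (e.g. the RTX 5090).
cmake .. -DCUTLASS_NVCC_ARCHS=120a

# Build and run the nvfp4 GEMM example (illustrative name).
make <nvfp4_gemm_example_target> -j
./examples/<nvfp4_gemm_example_target>/<nvfp4_gemm_example_target>
```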

Over a PETAFLOP of nvfp4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and `TMEM`, but I did get a petaflop of nvfp4 compute.
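For anyone wondering where a number like that comes from: a dense matmul does 2*M*N*K floating point operations, so FLOP/s falls straight out of the problem size and the kernel time. The sizes and timing below are illustrative, not my actual measurements.

```python
# How a GEMM benchmark turns into a FLOP/s figure.
M = N = K = 8192        # illustrative problem size
runtime_s = 1.0e-3      # hypothetical kernel time: 1 ms

flops = 2 * M * N * K   # multiply-adds count as 2 ops each
pflops = flops / runtime_s / 1e15
print(f"{pflops:.2f} PFLOP/s")  # 1.10 PFLOP/s at these illustrative numbers
```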
Nsight Compute tells us exactly what we would expect:

The tensor cores are so fast that memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that.
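A quick back-of-envelope on why feeding the tensor cores is the hard part. Both inputs here are assumptions: roughly the petaflop of nvfp4 compute measured above, and the commonly quoted ~1.79 TB/s of GDDR7 bandwidth for the 5090 (check your own card's spec).

```python
# How many FLOPs each byte from memory must feed to keep the cores busy.
compute_flops = 1.0e15      # ~1 PFLOP/s of nvfp4 compute (assumption)
bandwidth_bytes = 1.79e12   # ~1.79 TB/s GDDR7 bandwidth (assumption)

ridge = compute_flops / bandwidth_bytes  # FLOPs needed per byte loaded
print(round(ridge))  # ~559 FLOPs per byte, i.e. ~280 per 4-bit element
```

Hitting hundreds of FLOPs per loaded byte takes aggressive tiling and reuse out of on-chip memory, which is exactly the pressure `TMEM` is there to relieve.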

To see how the GPU folk in datacenters live, I booted up a Vast.ai instance and ran the same matmul, but with cutlass kernels for `sm_100a`.
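The only build-level change from the GeForce run is the target architecture passed to cutlass's CMake (a sketch, with the usual caveat that example names vary between releases):

```shell
# Same cutlass build, but targeting datacenter Blackwell (sm_100a).
cmake .. -DCUTLASS_NVCC_ARCHS=100a
```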

We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
Why, Jensen, why.