diff --git a/content/posts/blackwell_datacenter_vs_geforce.mdx b/content/posts/blackwell_datacenter_vs_geforce.mdx
index e034447..8557a89 100644
--- a/content/posts/blackwell_datacenter_vs_geforce.mdx
+++ b/content/posts/blackwell_datacenter_vs_geforce.mdx
@@ -1,7 +1,7 @@
 ---
 title: 'Blackwell: Datacenter vs GeForce GPUs'
 date: '2026-02-27'
-description: 'Jensen scammed me.'
+description: 'Jensen scammed us.'
 tags: ['Nvidia', 'GPU', 'GPU Kernel']
 ---
 
@@ -33,7 +33,7 @@ I needed to confirm this myself. NVFP4 is Nvidia's new low precision f
 
 Over a PETA FLOP of nvfp4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of nvfp4 compute. Nsight Compute tells us exactly what we would expect
 
-![Nisght compute shows registers spilling, due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)
+![Nisght Compute shows registers spilling, due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)
 
 Tensor cores are so fast that the memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess nvidia realised this and created `tcgen05` but we don't get to see any of that.
 
@@ -46,5 +46,6 @@ To see how the GPU folk in datacenters live, I booted up a vast ai instance and
 
 We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the geforce cards.
 
 This is amazing, I wish I'd be able to get a taste of this locally.
+
 Why Jensen, why.