blog: blackwell --final edit

Akshay Kolli
2026-02-27 22:55:24 -05:00
parent c4fa3976f9
commit f647a4ad90

@@ -1,7 +1,7 @@
 ---
 title: 'Blackwell: Datacenter vs GeForce GPUs'
 date: '2026-02-27'
-description: 'Jensen scammed me.'
+description: 'Jensen scammed us.'
 tags: ['Nvidia', 'GPU', 'GPU Kernel']
 ---
@@ -33,7 +33,7 @@ I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low precision f
 Over a PETA FLOP of nvfp4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of nvfp4 compute.
 Nsight Compute tells us exactly what we would expect
-![Nisght compute shows registers spilling, due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)
+![Nisght Compute shows registers spilling, due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)
 Tensor cores are so fast that the memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess nvidia realised this and created `tcgen05` but we don't get to see any of that.
@@ -46,5 +46,6 @@ To see how the GPU folk in datacenters live, I booted up a vast ai instance and
 We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the geforce cards.
 This is amazing, I wish I'd be able to get a taste of this locally.
 Why Jensen, why.