blog draft blackwell -- 2
All checks were successful
Deploy Website / build-and-deploy (push) Successful in 27s

Akshay Kolli
2026-02-27 22:50:06 -05:00
parent 19d0ad59a0
commit c4fa3976f9
2 changed files with 23 additions and 10 deletions

@@ -1,6 +1,13 @@
@import "tailwindcss";
@plugin "@tailwindcss/typography";
/* Getting rid of backticks in code blocks in blogs */
.prose code::before,
.prose code::after {
  content: "" !important;
}
@theme {
  --font-sans: var(--font-inter);
  --color-zinc-50: #fafafa;

@@ -9,26 +9,31 @@ tags: ['Nvidia', 'GPU', 'GPU Kernel']
I'm a proud owner of an RTX 5090 FE. I occasionally play games on it, but it's mostly used for ML workloads.
I jumped on the 50-series specifically for the FP4 support in its 5th-generation Blackwell Tensor Cores, because I'm actively working on some pretty exciting low-precision computing.
Imagine my surprise when I was perusing the GPU MODE Discord and found people calling the GeForce Blackwell cards "Fake Blackwell"?!!
Looking online, I found next to no resources on the difference. I foolishly assumed that my GeForce card
<SideNote>The GeForce cards are `sm_120` with compute capability 12, and the datacenter cards are `sm_100` with compute capability 10.
You'd expect a higher compute capability to mean something.</SideNote> would contain all the features of the datacenter cards, as
it seemed to be a later arch. No, Nvidia just made it confusing, and managed to obscure the technical details extremely well. Going through the [CUDA documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/),
you'll see that the new tensor core gen 5 instructions are only compatible with `sm_100[a-f]` (Datacenter Blackwell) and `sm_101` (Jetson Thor). What does this mean in practice? Finding out involved a lot more digging.
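To make that compatibility rule concrete, here is a minimal Python sketch. It encodes only what this post says about the PTX docs: `tcgen05` targets `sm_100` and `sm_101`, while GeForce Blackwell is `sm_120`. The helper name `supports_tcgen05` and the lookup table are my own illustration, not an Nvidia API.

```python
# Hypothetical helper: encodes the rule described in this post, not an official API.
# Compute capabilities that the post lists as tcgen05 targets.
TCGEN05_CAPABILITIES = {
    (10, 0): "Datacenter Blackwell (sm_100)",
    (10, 1): "Jetson Thor (sm_101)",
}


def supports_tcgen05(major: int, minor: int) -> bool:
    """Return True if (major, minor) is one of the tcgen05 targets named above."""
    return (major, minor) in TCGEN05_CAPABILITIES


# A higher compute capability does not imply a superset of features.
print(supports_tcgen05(10, 0))  # datacenter Blackwell
print(supports_tcgen05(12, 0))  # GeForce Blackwell (sm_120)
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed straight into the helper.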
### What's in the new tensor cores?
The Blackwell Tensor Cores now support lower precision, namely FP6 and FP4, which the previous Hopper generation didn't. This enables extremely fast low-precision matrix multiplications.
The PTX ISA also introduces `tcgen05` instructions, which make use of `TMEM` (tensor memory), which only the datacenter cards support. This additional memory sits next to the Tensor Cores and can
be used independently of the registers used by the CUDA cores. The GeForce cards get 128KB of shared memory per SM, while the datacenter cards and the Jetson Thor get 228KB SMEM + 256KB TMEM. That difference is absolutely insane for
any kind of workload. Why did I have to dig so hard to find this information? The 5090 is an enthusiast-tier card, and I feel its buyers deserve a clear description of what they're buying.
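To put the memory gap in numbers, here is a quick back-of-the-envelope sketch in Python. It uses only the per-SM figures quoted above; the constant and function names are mine, and registers are deliberately left out of the comparison.

```python
# Per-SM on-chip memory figures quoted in this post, in KB.
GEFORCE_SMEM_KB = 128
DATACENTER_SMEM_KB = 228
DATACENTER_TMEM_KB = 256


def on_chip_kb(smem_kb: int, tmem_kb: int = 0) -> int:
    """Total shared memory plus tensor memory available to one SM, in KB."""
    return smem_kb + tmem_kb


geforce_total = on_chip_kb(GEFORCE_SMEM_KB)
datacenter_total = on_chip_kb(DATACENTER_SMEM_KB, DATACENTER_TMEM_KB)

print(f"GeForce:    {geforce_total} KB per SM")
print(f"Datacenter: {datacenter_total} KB per SM")
print(f"Ratio:      {datacenter_total / geforce_total:.2f}x")
```

By this rough count, each datacenter SM has a bit under four times the on-chip memory to stage tiles in.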
### Benchmarking NVFP4 performance
I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low-precision format.</SideNote> I downloaded the CUTLASS repo and ran the NVFP4 matrix multiply example. Here's what I got:
![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536.png)
Over a PETAFLOP of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and `TMEM`, but I did get a petaflop of NVFP4 compute.
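For anyone who wants to sanity-check a number like this, here is a small Python sketch of the usual throughput arithmetic: a matmul of shape `(m, k) x (k, n)` costs `2 * m * n * k` floating point operations, divided by the runtime. The function names are my own, and the problem size and runtime below are made-up illustrative values, not measurements from my card.

```python
def matmul_flop(m: int, n: int, k: int) -> int:
    """FLOP count for C[m, n] = A[m, k] @ B[k, n]: one multiply and one add per term."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s for a matmul that took `seconds` to run."""
    return matmul_flop(m, n, k) / seconds / 1e12


# Illustrative values only, NOT a measurement: a hypothetical 8192^3 matmul in 1 ms.
m = n = k = 8192
runtime_s = 1e-3
print(f"{achieved_tflops(m, n, k, runtime_s):.1f} TFLOP/s")
```

Anything above 1000 TFLOP/s on this scale is petaflop territory.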
Nsight Compute tells us exactly what we would expect:
![Nsight Compute shows registers spilling due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)
The Tensor Cores are so fast that memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that.
@@ -40,5 +45,6 @@ To see how the GPU folk in datacenters live, I booted up a vast ai instance and
![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536.png)
We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
This is amazing, and I wish I could get a taste of this locally.
Why Jensen, why.