From c4fa3976f9f8967d61e17d0016e4cbf67fe9264b Mon Sep 17 00:00:00 2001
From: Akshay Kolli
Date: Fri, 27 Feb 2026 22:50:06 -0500
Subject: [PATCH] blog draft blackwell -- 2

---
 app/globals.css                                    |  7 +++++
 .../posts/blackwell_datacenter_vs_geforce.mdx      | 26 ++++++++++++-------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/app/globals.css b/app/globals.css
index 488c6fe..ad6d5a9 100644
--- a/app/globals.css
+++ b/app/globals.css
@@ -1,6 +1,13 @@
 @import "tailwindcss";
 @plugin "@tailwindcss/typography";
+
+/* Getting rid of backticks in code blocks in blogs */
+.prose code::before,
+.prose code::after {
+  content: "" !important;
+}
+
 @theme {
   --font-sans: var(--font-inter);
   --color-zinc-50: #fafafa;
diff --git a/content/posts/blackwell_datacenter_vs_geforce.mdx b/content/posts/blackwell_datacenter_vs_geforce.mdx
index 2e103f9..e034447 100644
--- a/content/posts/blackwell_datacenter_vs_geforce.mdx
+++ b/content/posts/blackwell_datacenter_vs_geforce.mdx
@@ -9,28 +9,33 @@ tags: ['Nvidia', 'GPU', 'GPU Kernel']

 I'm a proud owner of an RTX 5090 FE. I occasionally play games on it, but it's mostly used for ML workloads. I jumped on the 50-series especially for the FP4 support on its 5th-generation Blackwell Tensor Cores, because I'm actively working on some pretty exciting low precision computing. Imagine my surprise when I was perusing the GPU MODE Discord and found people calling the GeForce Blackwell cards "Fake Blackwell"?!!

-Looking online, I found next to no resources on the difference. I foolishly assumed that my GeForce card (arch=sm_120) would contain all the features from the datacenter cards (arch=sm_100), as
-it seemed to be a later arch. No, Nvidia just made it more confusing, and obscured the technical details extremely well. Going through the [cuda documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/),
+Looking online, I found next to no resources on the difference.
+I foolishly assumed that my GeForce card would contain all the features from the datacenter cards, as it seemed to be a later arch:
+the GeForce cards are `sm_120` with compute capability 12, while the datacenter cards are `sm_100` with compute capability 10,
+and you'd expect a higher compute capability to mean something. No, Nvidia just made it confusing, and managed to obscure the technical details extremely well. Going through the [CUDA documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/),
 you'll see that the new tensor core gen 5 instructions are only compatible with `sm_100[a-f]` (Datacenter Blackwell) and `sm_101` (Jetson Thor). What does this mean? That involved a lot more digging.

 ### What's in the new tensor cores?

-The blackwell tensor cores now support lower precision, namely FP6 and FP4, which the previous Hopper generation didn't. This enables extremely fast low precision matrix multiplications.
-The ptx isa also introduces `tcgen05` instructions, which make use of `TMEM` or tensor memory, which only the datacenter cards support. This additional memory sits next to the Tensor cores, and can
-be used independent of the registers used in cuda cores. The GeForce cards get 128KB of shared memory per SM, while the datacenter card and the Jetson thor get 228KB SMEM + 256KB TMEM. This is absolutely insane for
+The Blackwell Tensor Cores now support lower precision, namely FP6 and FP4, which the previous Hopper generation didn't. This enables extremely fast low-precision matrix multiplications.
+The PTX ISA also introduces `tcgen05` instructions, which make use of `TMEM`, or tensor memory, which only the datacenter cards support. This additional memory sits next to the Tensor Cores and can
+be used independently of the registers used by the CUDA cores. The GeForce cards get 128KB of shared memory per SM, while the datacenter cards and the Jetson Thor get 228KB SMEM + 256KB TMEM. This is absolutely insane for
 any kind of workload.
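That arch-to-feature split can be sketched in plain Python. This is a hypothetical helper, not an Nvidia API; it just restates the PTX ISA compatibility notes above, and a real program would read `(major, minor)` from `cudaGetDeviceProperties` or `torch.cuda.get_device_capability()`:

```python
# Hypothetical helper summarising which Blackwell variant a compute
# capability maps to, per the PTX ISA compatibility notes above.
# (In a real program, get (major, minor) from cudaGetDeviceProperties
# or torch.cuda.get_device_capability() instead of hardcoding it.)

def blackwell_features(major: int, minor: int) -> dict:
    arch = f"sm_{major * 10 + minor}"
    # tcgen05/TMEM is documented only for sm_100 (datacenter Blackwell)
    # and sm_101 (Jetson Thor), not for sm_120 (GeForce Blackwell).
    has_tcgen05 = arch in ("sm_100", "sm_101")
    return {
        "arch": arch,
        "tcgen05_and_tmem": has_tcgen05,
        # Both Blackwell lines do FP4 tensor core math.
        "fp4_mma": major in (10, 12),
    }

print(blackwell_features(12, 0))  # RTX 5090: FP4 yes, tcgen05/TMEM no
print(blackwell_features(10, 0))  # B200: FP4 yes, tcgen05 + TMEM yes
```

So the compute capability number alone tells you almost nothing about the tensor memory story; you have to cross-reference the PTX ISA tables.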
Why did I have to dig so hard to find this information? The 5090 is an enthusiast-tier card, which I feel deserves a clear description of what you're buying.
+
+### Benchmarking NVFP4 performance
+
 I needed to confirm this myself. NVFP4 is Nvidia's new low precision format. I downloaded the cutlass repo and ran the nvfp4 matrix multiply example. Here's what I got:

 ![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536.png)

 Over a PETAFLOP of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of NVFP4 compute.

-Nsight compute tells us exactly what we would expect
+Nsight Compute tells us exactly what we would expect:

-![Nisght compute shows registers spilling, due to extremely high register pressure.](/images/geforce_ncu.png)
+![Nsight Compute shows registers spilling due to extremely high register pressure.](/images/1_blackwell_dc_vs_gf/geforce_ncu.png)

-Tensor cores are so fast that the memory is bottle necking them. All of the shared memory is filling up. Huh, I guess nvidia realised this and created `tcgen05` but we don't get to see any of that.
+Tensor Cores are so fast that memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that.

 ![Look at all that memory. Nvtop from my dreams.](/images/1_blackwell_dc_vs_gf/nvtop_b200.png)

@@ -39,6 +44,7 @@
 To see how the GPU folk in datacenters live, I booted up a vast ai instance and

 ![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536.png)

-We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the geforce cards.
-Why jensen why.
+We're getting over 2 petaflops, and I'm sure these things can go even faster with better code.
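As a sanity check on those throughput numbers, here's a back-of-the-envelope sketch. It assumes the cutlass example runs a square 65536 x 65536 x 65536 GEMM, which the screenshot filenames suggest but the post doesn't state outright:

```python
# Back-of-the-envelope check on the benchmark numbers.
# Assumption: the cutlass example runs a square m = n = k = 65536 GEMM
# (inferred from the screenshot filenames, not confirmed).

def gemm_flops(m: int, n: int, k: int) -> int:
    """A dense matmul does one multiply + one add per (m, n, k) triple."""
    return 2 * m * n * k

def achieved_pflops(m: int, n: int, k: int, seconds: float) -> float:
    """Convert a measured GEMM runtime into PFLOP/s."""
    return gemm_flops(m, n, k) / seconds / 1e15

n = 65536
work = gemm_flops(n, n, n)  # 2 * 65536^3 = 2^49 ~ 5.6e14 FLOPs per GEMM
print(f"{work:.3e} FLOPs")
# At the 5090's ~1 PFLOP/s, one such GEMM takes ~0.56 s;
# at the B200's ~2 PFLOP/s, roughly half that.
print(f"~{work / 1e15:.2f} s at 1 PFLOP/s")
```

Even a single GEMM at this size is over half an exaFLOP of work, which is why the gap between the two cards shows up so clearly.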
Not having `tcgen05` really holds back the GeForce cards.
+This is amazing; I wish I could get a taste of it locally.
+Why Jensen, why.