@@ -1,7 +1,7 @@
---
title: 'Blackwell: Datacenter vs GeForce GPUs'
date: '2026-02-27'
-description: 'Jensen scammed us.'
+description: 'GeForce Blackwell and datacenter Blackwell expose meaningfully different tensor-core capabilities.'
tags: ['Nvidia', 'GPU', 'GPU Kernel']
---

@@ -28,24 +28,43 @@ any kind of work load. Why did I have to dig so hard to find this information? T

I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low-precision 4-bit floating-point format.</SideNote> I downloaded the CUTLASS repo and ran the NVFP4 matrix multiply example. Here's what I got:

<Image
  src="/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png"
  alt="A screenshot of a CUTLASS NVFP4 matrix multiplication benchmark on an RTX 5090"
  width={236}
  height={77}
/>

Over a PETAFLOP per second of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and `TMEM`, but I did get a petaflop of NVFP4 compute.

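
For anyone wondering where a number like that comes from: GEMM benchmarks conventionally count 2·M·N·K floating-point operations (one multiply and one add per inner-product term) and divide by the measured runtime. Here's a minimal sketch of that arithmetic. The function name is mine, the 65536 problem size is only a guess based on the screenshot's file name, and the 0.5 s runtime is a made-up placeholder, not a measurement.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) @ (k x n) matmul in TFLOP/s.

    Uses the usual 2*m*n*k convention: one multiply plus one add
    for every term of every output element's dot product.
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    size = 65536          # ASSUMPTION: square problem, size guessed from the image file name
    runtime_s = 0.5       # HYPOTHETICAL runtime, not taken from the benchmark output
    print(f"{gemm_tflops(size, size, size, runtime_s):.1f} TFLOP/s")
```

Plug in the runtime the benchmark actually prints to recover the matching TFLOP/s figure; anything above 1000 TFLOP/s is past the petaflop mark.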

Nsight Compute tells us exactly what we would expect:

<Image
  src="/images/1_blackwell_dc_vs_gf/geforce_ncu.png"
  alt="Nsight Compute showing register pressure and memory bottlenecks on a GeForce GPU"
  width={1974}
  height={807}
/>

The tensor cores are so fast that memory is bottlenecking them. All of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that on GeForce.
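
To keep the split straight in my head, here's a tiny lookup sketch keyed by CUDA compute capability. It only encodes what this post talks about: `wgmma` on Hopper, `tcgen05` plus `TMEM` on datacenter Blackwell, and neither on GeForce Blackwell. The table and helper names are hypothetical, and the mapping of H100 to 9.0, B200 to 10.0 and RTX 5090 to 12.0 is stated as my understanding, not pulled from any API.

```python
# Illustrative table only; feature names are the ones mentioned in this post.
TENSOR_CORE_FEATURES = {
    (9, 0): {"wgmma"},             # Hopper (e.g. H100)
    (10, 0): {"tcgen05", "TMEM"},  # datacenter Blackwell (e.g. B200, sm_100a)
    (12, 0): set(),                # GeForce Blackwell (e.g. RTX 5090): neither of the above
}


def has_feature(capability: tuple, feature: str) -> bool:
    """Return True if the (major, minor) capability is listed with the feature.

    Unknown capabilities are treated as having none of the listed features.
    """
    return feature in TENSOR_CORE_FEATURES.get(capability, set())


if __name__ == "__main__":
    for cap in sorted(TENSOR_CORE_FEATURES):
        print(cap, "tcgen05:", has_feature(cap, "tcgen05"))
```

In a real script the capability tuple would come from whatever device query your framework provides; here it is hard-coded so the sketch runs without a GPU.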

<Image
  src="/images/1_blackwell_dc_vs_gf/nvtop_b200.png"
  alt="nvtop showing B200 GPU memory capacity"
  width={647}
  height={106}
/>

To see how the datacenter GPU folk live, I spun up a Vast.ai instance with a B200 and ran the same matmul, but with CUTLASS kernels built for `sm_100a`.

<Image
  src="/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png"
  alt="A CUTLASS benchmark result from a B200 GPU"
  width={300}
  height={108}
/>

We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
This is amazing, and I wish I could get a taste of it locally.

Why Jensen, why.