blog: blackwell updated images

Akshay Kolli
2026-02-27 23:02:22 -05:00
parent f647a4ad90
commit 6da3a35832
5 changed files with 2 additions and 2 deletions

@@ -28,7 +28,7 @@ any kind of work load. Why did I have to dig so hard to find this information? T
 I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low precision format. </SideNote> I downloaded the cutlass repo and ran the nvfp4 matrix multiply example. Here's what I got
-![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536.png)
+![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png)
 Over a PETA FLOP of nvfp4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of nvfp4 compute.
 Nsight Compute tells us exactly what we would expect
@@ -42,7 +42,7 @@ Tensor cores are so fast that the memory is bottlenecking them. All of the share
 To see how the GPU folk in datacenters live, I booted up a vast ai instance and ran the same matmul, but with cutlass kernels for `sm_100a`.
-![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536.png)
+![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png)
 We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the geforce cards.
 This is amazing, I wish I'd be able to get a taste of this locally.

Binary file not shown. After: 8.2 KiB

Binary file not shown. Before: 23 KiB. After: 12 KiB

Binary file not shown. After: 12 KiB

Binary file not shown. Before: 114 KiB. After: 20 KiB