blog: blackwell updated images
@@ -28,7 +28,7 @@ any kind of work load. Why did I have to dig so hard to find this information? T
I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low-precision format.</SideNote> I downloaded the CUTLASS repo and ran the NVFP4 matrix-multiply example. Here's what I got:

*(screenshots of the benchmark output)*

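For anyone retracing this, the steps were roughly the following. This is a hedged sketch: the `CUTLASS_NVCC_ARCHS` CMake option is real, but the example directory name below is a placeholder, since the exact example number and binary name change between CUTLASS releases.

```shell
# Sketch only: the example path is a placeholder; check the examples/
# directory of your CUTLASS checkout for the current NVFP4 GEMM sample.
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass && mkdir -p build && cd build

# 120a targets consumer (GeForce) Blackwell; 100a targets datacenter parts.
cmake .. -DCUTLASS_NVCC_ARCHS=120a

make -j"$(nproc)"
./examples/<nvfp4_gemm_example>/<nvfp4_gemm_binary>
```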
Over a petaFLOP of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get Hopper's `wgmma`, nor the `tcgen05` instructions and `TMEM`, but I did get a petaFLOP of NVFP4 compute.
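For context, "a petaFLOP" here is the usual effective-throughput number: a GEMM performs 2·M·N·K floating-point operations, divided by wall time. A minimal sketch (the shape and timing below are illustrative values, not my measurements):

```python
def effective_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Effective throughput of an M x N x K GEMM: 2*M*N*K FLOPs / runtime, in TFLOP/s."""
    return 2 * m * n * k / seconds / 1e12

# Illustrative only: a 16384^3 GEMM finishing in ~8.8 ms is ~1 PFLOP/s.
print(effective_tflops(16384, 16384, 16384, 8.8e-3))
```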
Nsight Compute tells us exactly what we would expect:
@@ -42,7 +42,7 @@ Tensor cores are so fast that the memory is bottlenecking them. All of the share
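The back-of-the-envelope math agrees: NVFP4 operands are 4 bits, so a large GEMM's arithmetic intensity (FLOPs per byte moved) is enormous, and the memory system has to keep up. A rough sketch, ignoring NVFP4's block scale factors and assuming an fp16 output (both simplifications):

```python
def arithmetic_intensity(m, n, k, in_bytes=0.5, out_bytes=2.0):
    """FLOPs per byte for an M x N x K GEMM with 4-bit (0.5 B) inputs.

    Ignores NVFP4 scale-factor bytes; assumes each operand is read once
    and the output is written once.
    """
    flops = 2 * m * n * k
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved

# Square 4096^3 GEMM: thousands of FLOPs per byte of memory traffic.
print(arithmetic_intensity(4096, 4096, 4096))
```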
To see how the GPU folk in datacenters live, I booted up a Vast.ai instance and ran the same matmul, but with CUTLASS kernels for `sm_100a`.

*(screenshots of the benchmark output)*

We're getting over 2 petaFLOPs, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
This is amazing; I wish I could get a taste of this locally.