diff --git a/content/posts/blackwell_datacenter_vs_geforce.mdx b/content/posts/blackwell_datacenter_vs_geforce.mdx
index 8557a89..d318fab 100644
--- a/content/posts/blackwell_datacenter_vs_geforce.mdx
+++ b/content/posts/blackwell_datacenter_vs_geforce.mdx
@@ -28,7 +28,7 @@ any kind of work load. Why did I have to dig so hard to find this information? T
 I needed to confirm this myself. NVFP4 is Nvidia's new low precision format. I downloaded the cutlass repo and ran the nvfp4 matrix multiply example. Here's what I got
-![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536.png)
+![A screenshot of a cutlass nvfp4 matmul benchmark](/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png)
 Over a PETA FLOP of nvfp4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of nvfp4 compute. Nsight Compute tells us exactly what we would expect
@@ -42,7 +42,7 @@ Tensor cores are so fast that the memory is bottlenecking them. All of the share
 To see how the GPU folk in datacenters live, I booted up a vast ai instance and ran the same matmul, but with cutlass kernels for `sm_100a`.
-![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536.png)
+![jeez louise these things are fast](/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png)
 We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the geforce cards. This is amazing, I wish I'd be able to get a taste of this locally.
diff --git a/public/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png b/public/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png
new file mode 100644
index 0000000..1698ecd
Binary files /dev/null and b/public/images/1_blackwell_dc_vs_gf/5090_65536_cropped.png differ
diff --git a/public/images/1_blackwell_dc_vs_gf/b200_32768.png b/public/images/1_blackwell_dc_vs_gf/b200_32768.png
index bf32190..5a867a2 100644
Binary files a/public/images/1_blackwell_dc_vs_gf/b200_32768.png and b/public/images/1_blackwell_dc_vs_gf/b200_32768.png differ
diff --git a/public/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png b/public/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png
new file mode 100644
index 0000000..ddee643
Binary files /dev/null and b/public/images/1_blackwell_dc_vs_gf/b200_65536_cropped.png differ
diff --git a/public/images/1_blackwell_dc_vs_gf/nvtop_b200.png b/public/images/1_blackwell_dc_vs_gf/nvtop_b200.png
index 5a6fae8..fe63930 100644
Binary files a/public/images/1_blackwell_dc_vs_gf/nvtop_b200.png and b/public/images/1_blackwell_dc_vs_gf/nvtop_b200.png differ