blog: blackwell updated images
@@ -28,7 +28,7 @@ any kind of work load. Why did I have to dig so hard to find this information? T
I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low-precision format.</SideNote> I downloaded the CUTLASS repo and ran the NVFP4 matrix-multiply example. Here's what I got:

*(screenshots of the benchmark output)*

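For anyone retracing this, the steps were roughly the following. This is a hedged sketch: the `CUTLASS_NVCC_ARCHS` CMake option is real, but the example directory name below is a placeholder, since the exact example number and binary name change between CUTLASS releases.

```shell
# Sketch only: the example path is a placeholder; check the examples/
# directory of your CUTLASS checkout for the current NVFP4 GEMM sample.
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass && mkdir -p build && cd build

# 120a targets consumer (GeForce) Blackwell; 100a targets datacenter parts.
cmake .. -DCUTLASS_NVCC_ARCHS=120a

make -j"$(nproc)"
./examples/<nvfp4_gemm_example>/<nvfp4_gemm_binary>
```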
Over a petaFLOP of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get Hopper's `wgmma`, nor the `tcgen05` instructions and `TMEM`, but I did get a petaFLOP of NVFP4 compute.
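For context, "a petaFLOP" here is the usual effective-throughput number: a GEMM performs 2·M·N·K floating-point operations, divided by wall time. A minimal sketch (the shape and timing below are illustrative values, not my measurements):

```python
def effective_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Effective throughput of an M x N x K GEMM: 2*M*N*K FLOPs / runtime, in TFLOP/s."""
    return 2 * m * n * k / seconds / 1e12

# Illustrative only: a 16384^3 GEMM finishing in ~8.8 ms is ~1 PFLOP/s.
print(effective_tflops(16384, 16384, 16384, 8.8e-3))
```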
Nsight Compute tells us exactly what we would expect:
@@ -42,7 +42,7 @@ Tensor cores are so fast that the memory is bottlenecking them. All of the share
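The back-of-the-envelope math agrees: NVFP4 operands are 4 bits, so a large GEMM's arithmetic intensity (FLOPs per byte moved) is enormous, and the memory system has to keep up. A rough sketch, ignoring NVFP4's block scale factors and assuming an fp16 output (both simplifications):

```python
def arithmetic_intensity(m, n, k, in_bytes=0.5, out_bytes=2.0):
    """FLOPs per byte for an M x N x K GEMM with 4-bit (0.5 B) inputs.

    Ignores NVFP4 scale-factor bytes; assumes each operand is read once
    and the output is written once.
    """
    flops = 2 * m * n * k
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved

# Square 4096^3 GEMM: thousands of FLOPs per byte of memory traffic.
print(arithmetic_intensity(4096, 4096, 4096))
```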
To see how the GPU folk in datacenters live, I booted up a Vast.ai instance and ran the same matmul, but with CUTLASS kernels for `sm_100a`.

*(screenshots of the benchmark output)*

We're getting over 2 petaFLOPs, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
This is amazing; I wish I could get a taste of this locally.