update 2 blackwell post

This commit is contained in:
Akshay Kolli
2026-02-27 22:12:17 -05:00
parent 2a8df25c16
commit 9f8ff8befe


@@ -11,4 +11,13 @@ I jumped on the 50-series especially for the fp4 support on their 5th generation
Imagine my surprise when I was perusing the GPU MODE Discord and found people calling the GeForce Blackwell cards "fake Blackwell"?!!
Looking online, I found next to no resources on the difference. I foolishly assumed that my GeForce card (arch=sm_120) would have all the features of the datacenter cards (arch=sm_100), since
it appeared to be a later arch. No: Nvidia just made the naming more confusing, and obscured the technical details extremely well. Going through the [CUDA documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/),
you'll see that the new 5th-generation tensor core instructions are only compatible with `sm_100[a-f]` (datacenter Blackwell) and `sm_101` (Jetson Thor). What does this mean in practice? That took a lot more digging.
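To make that compatibility split concrete, here's a toy checker distilled from my reading of those PTX docs. The `has_gen5_tensor_cores` helper and its target list are my own illustration of the rule above, not an official API, and the exact set of valid arch suffixes may differ:

```python
import re

# Targets the PTX docs list for the gen-5 tensor core instructions,
# as summarized above: sm_100 with an a-f suffix, plus sm_101.
# This pattern is an illustrative assumption, not an official list.
TCGEN05_TARGETS = re.compile(r"sm_100[a-f]$|sm_101$")

def has_gen5_tensor_cores(arch: str) -> bool:
    """Toy check: does this compute arch get the new tensor core instructions?"""
    return bool(TCGEN05_TARGETS.match(arch))

for arch in ["sm_100a", "sm_101", "sm_120"]:
    print(arch, has_gen5_tensor_cores(arch))
# sm_100a True, sm_101 True, sm_120 False
```

Note that `sm_120` (GeForce Blackwell) is simply absent from the list, which is the whole "fake Blackwell" complaint in one line.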
### What's in the new tensor cores?
The Blackwell tensor cores now support lower precisions, namely FP6 and FP4, which the previous Hopper generation didn't. This enables extremely fast low-precision matrix multiplications.
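For intuition on just how coarse FP4 is, here's a quick Python sketch of the E2M1 element format (1 sign, 2 exponent, 1 mantissa bit, bias 1), which underlies NVFP4: it can represent only eight distinct magnitudes. NVFP4 itself additionally applies per-block scale factors, which this sketch ignores:

```python
def e2m1_values():
    """Enumerate all magnitudes representable in FP4 E2M1 (2 exp bits, 1 mantissa bit, bias 1)."""
    vals = set()
    for exp in range(4):
        for man in range(2):
            if exp == 0:
                v = man * 0.5  # subnormal: no implicit leading 1
            else:
                v = (1 + man * 0.5) * 2 ** (exp - 1)
            vals.add(v)
    return sorted(vals)

def quantize_e2m1(x):
    """Round x to the nearest representable E2M1 value (naive, no block scaling)."""
    mag = min(e2m1_values(), key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag

print(e2m1_values())       # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(quantize_e2m1(2.4))  # 2.0
```

Every element of your matrices gets snapped onto that eight-value grid (plus sign), which is why the block-wise scale factors are doing so much of the work in practice.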
To test out the NVFP4 <SideNote title="NVFP4">Nvidia's low-precision FP4 format.</SideNote> support, I downloaded the CUTLASS repo and ran the nvfp4 matrix multiply example. Here's what I got:
![A screenshot of a cutlass nvfp4 matmul benchmark](public/images/1_blackwell_dc_vs_gf/5090_65536.png)