update 2 blackwell post
@@ -11,4 +11,13 @@ I jumped on the 50-series especially for the fp4 support on their 5th generation
Imagine my surprise when I was perusing the GPU Mode Discord and found people calling the GeForce Blackwell cards "Fake Blackwell"?!
Looking online, I found next to no resources on the difference. I foolishly assumed that my GeForce card (arch=sm_120) would contain all the features from the datacenter cards (arch=sm_100), as
it seemed to be a later arch. No, NVIDIA just made the naming more confusing and obscured the technical details extremely well. Going through the [PTX documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/),
you'll see that the new 5th-generation tensor core (`tcgen05`) instructions are only supported on `sm_100[a-f]` (datacenter Blackwell) and `sm_101` (Jetson Thor). What does this mean in practice? Finding out involved a lot more digging.
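To make the split concrete, here is a minimal Python sketch that encodes only the claim above: the gen 5 tensor core instructions are listed for `sm_100` and `sm_101`, and not for `sm_120`. The names `TCGEN05_ARCHS`, `has_tcgen05`, and `arch_name` are hypothetical helpers made up for illustration. This is not an official NVIDIA support table, so check the PTX documentation for your toolkit version.

```python
# Hypothetical lookup that mirrors the claim in this post, not an official table.
# Keys are (major, minor) compute capabilities.
TCGEN05_ARCHS = {
    (10, 0),  # sm_100: datacenter Blackwell
    (10, 1),  # sm_101: Jetson Thor
}


def arch_name(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def has_tcgen05(capability):
    """Return True if the capability is in the assumed tcgen05 list above."""
    return tuple(capability) in TCGEN05_ARCHS


if __name__ == "__main__":
    # (9, 0) is Hopper, (12, 0) is GeForce Blackwell.
    for cap in [(9, 0), (10, 0), (10, 1), (12, 0)]:
        print(arch_name(cap), "tcgen05:", has_tcgen05(cap))
```

If you have PyTorch installed, the tuple returned by `torch.cuda.get_device_capability()` can be passed straight to `has_tcgen05`.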
### What's in the new tensor cores?
Blackwell tensor cores add support for lower-precision formats, namely FP6 and FP4, which the previous Hopper generation lacked. This enables much faster low-precision matrix multiplications.
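To give a feel for how little precision FP4 has, here is a rough Python sketch. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) and a single shared scale per block of values. The function names `decode_e2m1` and `quantize_block` are made up for illustration, and the scale is kept as a plain Python float, which simplifies how real block-scaled formats store it.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into a Python float."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal: no implicit leading one, so the values are 0 and 0.5.
        return sign * man * 0.5
    # Normal: implicit leading one, exponent bias of 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Every distinct value FP4 can represent (+0 and -0 collapse into one entry).
FP4_VALUES = sorted({decode_e2m1(c) for c in range(16)})


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then round
    each element to the nearest representable FP4 value."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = [min(FP4_VALUES, key=lambda v: abs(v - x / scale)) for x in block]
    return scale, quantized


if __name__ == "__main__":
    print(FP4_VALUES)
    scale, q = quantize_block([0.0, 3.0, -12.0, 1.0])
    # Multiply back by the scale to see the values after a round trip.
    print(scale, q, [scale * v for v in q])
```

With only 15 distinct values available, the shared scale does most of the work of preserving dynamic range.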
To test out the NVFP4 <SideNote title="NVFP4">NVIDIA's 4-bit low-precision floating-point format.</SideNote> support, I downloaded the CUTLASS repo and ran the NVFP4 matrix multiply example. Here's what I got:
![A screenshot of a CUTLASS NVFP4 matmul benchmark](public/images/1_blackwell_dc_vs_gf/5090_65536.png)
Binary file added: public/images/1_blackwell_dc_vs_gf/5090_65536.png (15 KiB)