blog draft blackwell -- 2
All checks were successful
Deploy Website / build-and-deploy (push) Successful in 27s
@@ -1,6 +1,13 @@
|
|||||||
@import "tailwindcss";
|
@import "tailwindcss";
|
||||||
@plugin "@tailwindcss/typography";
|
@plugin "@tailwindcss/typography";
|
||||||
|
|
||||||
|
|
||||||
|
/* Getting rid of backticks in code blocks in blogs */
|
||||||
|
.prose code::before,
|
||||||
|
.prose code::after {
|
||||||
|
content: "" !important;
|
||||||
|
}
|
||||||
|
|
||||||
@theme {
|
@theme {
|
||||||
--font-sans: var(--font-inter);
|
--font-sans: var(--font-inter);
|
||||||
--color-zinc-50: #fafafa;
|
--color-zinc-50: #fafafa;
|
||||||
|
|||||||
@@ -9,26 +9,31 @@ tags: ['Nvidia', 'GPU', 'GPU Kernel']
|
|||||||
I'm a proud owner of an RTX 5090 FE. I occasionally play games on it, but it's mostly used for ML workloads.
|
I'm a proud owner of an RTX 5090 FE. I occasionally play games on it, but it's mostly used for ML workloads.
|
||||||
I jumped on the 50-series specifically for the FP4 support on the 5th-generation Blackwell tensor cores, because I'm actively working on some pretty exciting low-precision computing.
|
I jumped on the 50-series specifically for the FP4 support on the 5th-generation Blackwell tensor cores, because I'm actively working on some pretty exciting low-precision computing.
|
||||||
Imagine my surprise when I was perusing the GPU Mode Discord and found people calling the GeForce Blackwell cards "Fake Blackwell"?!!
|
Imagine my surprise when I was perusing the GPU Mode Discord and found people calling the GeForce Blackwell cards "Fake Blackwell"?!!
|
||||||
Looking online, I found next to no resources on the difference. I foolishly assumed that my GeForce card (arch=sm_120) would contain all the features from the datacenter cards (arch=sm_100), as
|
Looking online, I found next to no resources on the difference. I foolishly assumed that my GeForce card
|
||||||
it seemed to be a later arch. No, Nvidia just made it more confusing, and obscured the technical details extremely well. Going through the [cuda documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/),
|
<SideNote>The GeForce cards are `sm_120` with compute capability 12.0, and the datacenter cards are `sm_100` with compute capability 10.0.
|
||||||
|
You'd expect a higher compute capability to mean something.</SideNote> would contain all the features from the datacenter cards, as
|
||||||
|
it seemed to be a later arch. No, Nvidia just made it confusing, and managed to obscure the technical details extremely well. Going through the [CUDA documentation](https://docs.nvidia.com/CUDA/parallel-thread-execution/),
|
||||||
you'll see that the new tensor core gen 5 instructions are only compatible with `sm_100[a-f]` (Datacenter Blackwell) and `sm_101` (Jetson Thor). What does this mean? That involved a lot more digging.
|
you'll see that the new tensor core gen 5 instructions are only compatible with `sm_100[a-f]` (Datacenter Blackwell) and `sm_101` (Jetson Thor). What does this mean? That involved a lot more digging.
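
To make the arch naming concrete, here's a minimal Python sketch that encodes the sentence above as a lookup. It's a literal reading of this post, not an official Nvidia table: the helper name `supports_tcgen05` and the regex are my own assumptions, and the pattern only matches the exact targets written above (`sm_100[a-f]` and `sm_101`).

```python
import re

# Assumption: this pattern is a literal transcription of the targets named in
# the post (`sm_100[a-f]` and `sm_101`), not an authoritative PTX support table.
_TCGEN05_TARGETS = re.compile(r"sm_100[a-f]|sm_101")


def supports_tcgen05(arch: str) -> bool:
    """Return True if `arch` is one of the targets the post lists for tcgen05.

    `supports_tcgen05` is a hypothetical helper for illustration only.
    """
    return _TCGEN05_TARGETS.fullmatch(arch.strip().lower()) is not None


if __name__ == "__main__":
    for arch in ("sm_100a", "sm_101", "sm_120"):
        print(f"{arch}: tcgen05 listed = {supports_tcgen05(arch)}")
```

The point of the sketch is just that a bigger number after `sm_` doesn't imply a superset of features: `sm_120` falls outside the listed targets even though 120 > 100.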
|
||||||
|
|
||||||
|
|
||||||
### What's in the new tensor cores?
|
### What's in the new tensor cores?
|
||||||
|
|
||||||
The blackwell tensor cores now support lower precision, namely FP6 and FP4, which the previous Hopper generation didn't. This enables extremely fast low precision matrix multiplications.
|
The Blackwell Tensor Cores now support lower precision, namely FP6 and FP4, which the previous Hopper generation didn't. This enables extremely fast low precision matrix multiplications.
|
||||||
The ptx isa also introduces `tcgen05` instructions, which make use of `TMEM` or tensor memory, which only the datacenter cards support. This additional memory sits next to the Tensor cores, and can
|
The PTX ISA also introduces `tcgen05` instructions, which make use of `TMEM` or tensor memory, which only the datacenter cards support. This additional memory sits next to the Tensor cores, and can
|
||||||
be used independently of the registers used in cuda cores. The GeForce cards get 128KB of shared memory per SM, while the datacenter cards and the Jetson Thor get 228KB of SMEM plus 256KB of TMEM. This is absolutely insane for
|
be used independently of the registers used in CUDA cores. The GeForce cards get 128KB of shared memory per SM, while the datacenter cards and the Jetson Thor get 228KB of SMEM plus 256KB of TMEM. This is absolutely insane for
|
||||||
any kind of workload. Why did I have to dig so hard to find this information? The 5090 is an enthusiast-tier card, which I feel deserves a clear description of what you're buying.
|
any kind of workload. Why did I have to dig so hard to find this information? The 5090 is an enthusiast-tier card, which I feel deserves a clear description of what you're buying.
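
To put those per-SM numbers side by side, here's a small Python sketch that adds them up. The figures are the ones quoted above (I haven't measured them independently), and the dictionary keys and helper names are made up for illustration.

```python
# Per-SM on-chip memory in KB, as quoted in the post (not measured here).
# The key names are hypothetical labels, not official product identifiers.
ON_CHIP_KB = {
    "geforce_sm_120": {"smem": 128, "tmem": 0},
    "datacenter_sm_100": {"smem": 228, "tmem": 256},
}


def total_on_chip_kb(card: str) -> int:
    """Sum shared memory and tensor memory for one SM of the given card."""
    return sum(ON_CHIP_KB[card].values())


def on_chip_ratio() -> float:
    """How many times more SMEM + TMEM a datacenter SM has than a GeForce SM."""
    return total_on_chip_kb("datacenter_sm_100") / total_on_chip_kb("geforce_sm_120")


if __name__ == "__main__":
    print(total_on_chip_kb("geforce_sm_120"))     # 128
    print(total_on_chip_kb("datacenter_sm_100"))  # 484
    print(on_chip_ratio())                        # 3.78125
```

Going by the quoted figures, that's 484KB against 128KB per SM, a bit under 4x.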
|
||||||
|
|
||||||
|
|
||||||
|
### Benchmarking NVFP4 performance
|
||||||
|
|
||||||
I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low precision format. </SideNote> I downloaded the CUTLASS repo and ran the NVFP4 matrix multiply example. Here's what I got:
|
I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low precision format. </SideNote> I downloaded the CUTLASS repo and ran the NVFP4 matrix multiply example. Here's what I got:
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
Over a PETAFLOP of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of NVFP4 compute.
|
Over a PETAFLOP of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and the `TMEM`, but I did get a petaflop of NVFP4 compute.
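
For anyone wondering how a throughput figure like that is usually derived, here's a rough Python sketch using the standard `2 * M * N * K` operation count for a matrix multiply. The function name `gemm_tflops` and the problem size and timing in the example are hypothetical; they are not the numbers from my run.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) @ (k x n) matmul in TFLOP/s.

    Uses the conventional count of 2 * m * n * k floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical example: an 8192^3 GEMM finishing in 1 ms.
    # These values are made up for illustration, not measured.
    print(f"{gemm_tflops(8192, 8192, 8192, 1e-3):.1f} TFLOP/s")
```

Anything above 1000 TFLOP/s here is what I'm calling a petaflop.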
|
||||||
Nsight compute tells us exactly what we would expect
|
Nsight Compute tells us exactly what we would expect
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
Tensor cores are so fast that memory is bottlenecking them: all of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that.
|
Tensor cores are so fast that memory is bottlenecking them: all of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that.
|
||||||
|
|
||||||
@@ -40,5 +45,6 @@ To see how the GPU folk in datacenters live, I booted up a vast ai instance and
|
|||||||

|

|
||||||
|
|
||||||
We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
|
We're getting over 2 petaflops, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
|
||||||
Why jensen why.
|
This is amazing, I wish I could get a taste of this locally.
|
||||||
|
Why Jensen, why.
|
||||||
|
|
||||||
|
|||||||