blog: blackwell --final edit
All checks were successful
Deploy Website / build-and-deploy (push) Successful in 27s
@@ -1,7 +1,7 @@
---
title: 'Blackwell: Datacenter vs GeForce GPUs'
date: '2026-02-27'
-description: 'Jensen scammed me.'
+description: 'Jensen scammed us.'
tags: ['Nvidia', 'GPU', 'GPU Kernel']
---

@@ -33,7 +33,7 @@ I needed to confirm this myself. <SideNote>NVFP4 is Nvidia's new low precision f
Over a PETAFLOP of NVFP4 compute! ggs. This is already insane, and I'm very happy with it. I didn't get `wgmma` from Hopper, nor the `tcgen05` instructions and TMEM, but I did get a petaFLOP of NVFP4 compute.
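For intuition, here is how a "petaFLOP" number falls out of timing a single matmul. This is a minimal sketch: the matrix sizes and the timing below are made-up placeholders for illustration, not measurements from this post.

```python
# Sketch: converting one timed matmul into an achieved-FLOPS figure.
# The sizes and the 1.1 ms timing are hypothetical placeholders,
# not numbers measured on a Blackwell card.

def matmul_flops(m: int, n: int, k: int) -> int:
    """An MxK @ KxN matmul does m*n*k multiply-adds = 2*m*n*k FLOPs."""
    return 2 * m * n * k

def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOPS for one matmul taking `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e12

# Hypothetical: an 8192^3 NVFP4 matmul finishing in 1.1 ms lands
# right around 1000 TFLOPS, i.e. a petaFLOP.
print(achieved_tflops(8192, 8192, 8192, 1.1e-3))
```

Timing enough back-to-back matmuls and dividing total FLOPs by total seconds is essentially what every "X TFLOPS achieved" claim boils down to.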
Nsight Compute tells us exactly what we would expect:
*(Nsight Compute screenshots)*
Tensor cores are so fast that memory is bottlenecking them: all of the shared memory is filling up. Huh, I guess Nvidia realised this and created `tcgen05`, but we don't get to see any of that.
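The "memory is bottlenecking the tensor cores" claim is exactly what a roofline model predicts. A quick sketch, with round placeholder peaks (assumed ~1 PFLOP of tensor-core compute and ~1.8 TB/s of memory bandwidth, not real card specs):

```python
# Back-of-the-envelope roofline: a kernel is memory-bound when its
# arithmetic intensity (FLOPs per byte moved) sits below the ridge
# point peak_flops / peak_bandwidth. Peaks below are placeholders.

def attainable_tflops(intensity: float, peak_tflops: float, peak_bw_tbps: float) -> float:
    """Roofline: you get min(compute roof, bandwidth * intensity)."""
    return min(peak_tflops, peak_bw_tbps * intensity)

PEAK_TFLOPS = 1000.0  # assumed ~1 PFLOP tensor-core peak
PEAK_BW = 1.8         # assumed ~1.8 TB/s memory bandwidth, in TB/s

# Ridge point: the intensity needed to keep the tensor cores fed.
ridge = PEAK_TFLOPS / PEAK_BW  # roughly 556 FLOPs per byte

# A kernel doing only 100 FLOPs per byte is capped by bandwidth:
print(attainable_tflops(100.0, PEAK_TFLOPS, PEAK_BW))
```

With a ridge point in the hundreds of FLOPs per byte, anything short of a very large, well-tiled matmul sits on the bandwidth slope of the roofline, which matches what the profiler shows here.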
@@ -46,5 +46,6 @@ To see how the GPU folk in datacenters live, I booted up a vast ai instance and
We're getting over 2 petaFLOPS, and I'm sure these things can go even faster with better code. Not having `tcgen05` really holds back the GeForce cards.
This is amazing. I wish I could get a taste of this locally.
Why, Jensen, why.