Every billion parameters needs about 2 GB of VRAM when using bfloat16: 16 bits per parameter, 8 bits per byte -> 2 bytes per parameter.
1 billion parameters ~ 2 billion bytes ~ 2 GB.
From the name, this model has 72 billion parameters, so ~144 GB of VRAM just for the weights.
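For reference, a quick back-of-the-envelope sketch in Python (weights only; KV cache, activations, and framework overhead add more on top):

```python
def vram_gb(params_billion: float, bits_per_param: int = 16) -> float:
    # Rough estimate for model weights alone.
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param  # billions of bytes ~ GB

print(vram_gb(72))      # 144.0 -> bfloat16
print(vram_gb(72, 8))   # 72.0  -> int8 quantization
print(vram_gb(72, 4))   # 36.0  -> 4-bit quantization
```

Same math also shows why quantization matters so much: halving the bits halves the VRAM.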
Linux and Nvidia really need to sort out their shit so I can fully dump Windows.
Luckily the AI hype is good for something in this regard, since running GPUs on Linux servers has suddenly become much more important.