Running Bloom LM

Running Bloom LM is not so easy: naively, it requires 8x A100 80GB GPUs! Of course, there are many tricks to run it without that many GPUs. A bunch of these tricks are part of the "accelerate" library, which picks the best way to run Bloom depending on your hardware. For instance, you should even be able to run Bloom with a single GPU as long as you have enough CPU RAM and fast, free disk space, thanks to "offloading", which swaps data between VRAM, RAM and disk seamlessly. However, PyTorch support for CPU-only inference is still poor.
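To make the "offloading" option concrete, here is a minimal sketch using the Hugging Face transformers + accelerate stack; the model id, offload folder and prompt are just illustrative examples, not a setup I benchmarked:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom"  # the full 176B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate split the weights between GPU VRAM,
# CPU RAM and an offload folder on disk, depending on what is available.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    offload_folder="bloom-offload",
)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))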

So I wanted to investigate other options, more precisely two of them:

Collaborative computing consists in deploying one or several layers on different nodes and letting the forward activations transit from one node to the next. The bottleneck is obviously the network bandwidth, but batches of data may be processed in parallel simultaneously on all nodes. I managed to run GPT-NeoX-11B that way across 3 laptops + 1 desktop, without any GPU. The real bottleneck was CPU computation, which required 1 to 2 seconds to process data through one layer. So it's a viable option, but there are many standard distributed-computing issues to solve before scaling: resilience of nodes, dynamic switching of nodes, malevolent nodes... A more pressing issue is that, for Bloom and its 70 layers, this would require about 60 nodes running in parallel, so a large and active community.
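Just to illustrate the mechanics (this is not the actual code I ran), here is a toy sketch of such a node: it owns a slice of layers, receives an activation tensor over TCP, runs it through its slice on CPU, and forwards the result to the next node. The hostnames, ports, tensor framing and the nn.Linear stand-ins for real transformer blocks are all assumptions:

import io
import socket
import torch
import torch.nn as nn

HIDDEN = 1024                          # hypothetical hidden size
LISTEN_PORT = 5000
NEXT_NODE = ("next-node.local", 5000)  # set to None on the last node

# Stand-in for the slice of transformer layers this node is responsible for.
my_layers = nn.Sequential(*[nn.Linear(HIDDEN, HIDDEN) for _ in range(4)])

def recv_tensor(conn):
    # Read an 8-byte length prefix, then the serialized tensor.
    size = int.from_bytes(conn.recv(8), "big")
    buf = b""
    while len(buf) < size:
        buf += conn.recv(size - len(buf))
    return torch.load(io.BytesIO(buf))

def send_tensor(addr, tensor):
    data = io.BytesIO()
    torch.save(tensor, data)
    payload = data.getvalue()
    with socket.create_connection(addr) as s:
        s.sendall(len(payload).to_bytes(8, "big") + payload)

server = socket.socket()
server.bind(("", LISTEN_PORT))
server.listen()
while True:
    conn, _ = server.accept()
    with conn, torch.no_grad():
        hidden = recv_tensor(conn)      # activations from the previous node
        hidden = my_layers(hidden)      # forward through this node's slice
    if NEXT_NODE is not None:
        send_tensor(NEXT_NODE, hidden)  # relay to the next node in the chain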

What I call the "disk pipeline" consists in loading one layer in RAM, processing the data through that layer, freeing the memory, loading the next layer, and so on. This is probably very similar to what accelerate does, but I wanted more control: to load directly from the original files and thus avoid creating any temporary files, and to fully support CPU. This approach may sound too slow, but let's look at the figures:

This solution becomes reasonable because I implemented 2 parallel threads: one thread computes with layer N, while the other is loading layer N+1. Since the CPU processes one layer in about 0.7s, both threads take roughly the same time, so there is no more disk bottleneck thanks to NVMe! But it's still quite slow: for the data to go from the bottom layer to the top, it takes about 1 minute. And because generation is auto-regressive, you need to repeat that for every generated token, so one sentence still requires about half an hour. Another issue is that PyTorch has poor CPU support for float16, so I had to convert the parameters to float32, which doubles the RAM usage. Even so, without any optimization the code runs in less than 40GB of RAM, and it should be possible to make it run in less than 16GB.
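Here is a rough sketch of that two-thread disk pipeline; the file layout (one .pt state dict per layer) and the make_block() placeholder for a Bloom transformer block are assumptions, not my actual implementation:

import queue
import threading
import torch

NUM_LAYERS = 70                      # Bloom has 70 transformer blocks
prefetched = queue.Queue(maxsize=1)  # at most one preloaded layer at a time

def prefetch_worker():
    for i in range(NUM_LAYERS):
        # Load the next layer's weights from NVMe, converted to float32
        # because CPU support for float16 is poor in PyTorch.
        state = torch.load(f"layers/layer_{i:02d}.pt", map_location="cpu")
        state = {k: v.float() for k, v in state.items()}
        prefetched.put(state)        # blocks until the compute thread catches up

def run_pipeline(hidden, make_block):
    # make_block() should return an empty transformer block to load weights into.
    threading.Thread(target=prefetch_worker, daemon=True).start()
    with torch.no_grad():
        for _ in range(NUM_LAYERS):
            block = make_block()
            block.load_state_dict(prefetched.get())  # wait for the prefetch thread
            hidden = block(hidden)                   # ~0.7s per layer on CPU
            del block                                # free the weights immediately
    return hidden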

So it's very slow (30 minutes to generate a text), but it's really hard to do faster without any GPU. Another interesting option would be to quantize the weights to int8 or int4, but I haven't tried that yet.
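For reference, the idea behind int8 quantization is simple; here is a minimal per-row absmax sketch (not the scheme of any particular library), which shrinks the weights 4x compared to float32 and dequantizes them on the fly:

import torch

def quantize_int8(weight):
    # One scale per output row, so large rows don't crush small ones.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(weight / scale).to(torch.int8)
    return q, scale

def linear_int8(x, q, scale, bias=None):
    # Dequantize just in time for the matmul; RAM only holds the int8 weights.
    w = q.to(torch.float32) * scale
    out = x @ w.t()
    return out if bias is None else out + bias

w = torch.randn(1024, 1024)
q, s = quantize_int8(w)
x = torch.randn(1, 1024)
print((linear_int8(x, q, s) - x @ w.t()).abs().max())  # small quantization error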

GIT tricks

git log --oneline -- '*.svg'              # commits that touched any .svg file
git show --pretty="" --name-only af7b6df  # files changed by commit af7b6df
git log --all -- file                     # history of "file" across all branches