Large language models require huge amounts of GPU memory. Is it possible to run inference on a single GPU? If so, what is the minimum GPU memory required? A 70B large language model has a parameter size of 130GB. Just loading the model into the GPU requires 2 A100 GPUs with 100GB of memory each. During inference, the entire input sequence also needs to be loaded into memory for the complex “attention” calculations.
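As a rough back-of-the-envelope check on that 130GB figure, here is a minimal Python sketch. It assumes the weights are stored in FP16 (2 bytes per parameter) and ignores everything else a real inference run needs (KV cache, activations, framework overhead); the function name is just illustrative.

```python
# Estimate the GPU memory needed just to hold a model's weights.
# Assumption: FP16 weights, i.e. 2 bytes per parameter.

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the memory required to hold the weights, in GB."""
    return num_params * bytes_per_param / 1024**3

params_70b = 70e9  # 70 billion parameters

print(f"FP16 weights: {weight_memory_gb(params_70b):.0f} GB")     # ~130 GB
print(f"FP32 weights: {weight_memory_gb(params_70b, 4):.0f} GB")  # ~261 GB
```

At FP16 this works out to about 130GB for the weights alone, which is why even two 100GB A100s are only just enough to load the model before any inference begins.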