You might have heard about LLaMa or maybe you haven’t. Either way, what’s the big deal? It’s just some AI thing. In a nutshell, LLaMa is important because it allows you to run large language models (LLM) like GPT-3 on commodity hardware. In many ways, this is a bit like Stable Diffusion, which similarly allowed normal folks to run image generation models on their own hardware with access to the underlying source code. We’ve discussed why Stable Diffusion matters and even talked about how it works.
LLaMa is a transformer language model from Facebook/Meta research, which is a collection of large models from 7 billion to 65 billion parameters trained on publicly available datasets. Their research paper showed that the 13B version outperformed GPT-3 in most benchmarks and LLama-65B is right up there with the best of them. LLaMa was unique as inference could be run on a single GPU due to some optimizations made to the transformer itself and the model being about 10x smaller. While Meta recommended that users have at least 10 GB of VRAM to run inference on the larger models, that’s a huge step from the 80 GB A100 cards that often run these models.
While this was an important step forward for the research community, it became a huge one for the hacker community when [Georgi Gerganov] rolled in. He released llama.cpp on GitHub, which runs the inference of a LLaMa model with 4-bit quantization. His code was focused on running LLaMa-7B on your Macbook, but we’ve seen versions running on smartphones and Raspberry Pis. There’s even a version written in Rust! A rough rule of thumb is anything with more than 4 GB of RAM can run LLaMa. Model weights are available through Meta with some rather strict terms, but they’ve been leaked online and can be found even in a pull request on the GitHub repo itself.
Aside from occasionally funny and quirky projects, how does having a local GPT-3 like chatbot impact us? The simple fact is that it is accessible to hackers. Not only can you run it, but the code is available, the models are trained on publicly available data, so you could train your own though it took 21 days on 2048 A100 GPUs, and it’s useful enough to provide reasonable output. Stanford even released a version called Alpaca that is LLaMa-7B fine-tuned for instruction following which elevates it from a simple chatbot to a bot able to follow instructions. There is even a guide on how to replicate Alpaca yourself for less than $100 of cloud computing.
Of course, like most current LLMs, LLaMa suffers from the same problems of hallucination, bias, and stereotypes. When asked to generate code, it can try to request endpoints that don’t exist. When asked what the capital of Tanzania is, it will reply Dar es Salaam instead of Dodoma. Researchers haven’t solved the problem of trying to secure a black box, as it is still far too easy to get the model to do something its creators tried hard to prevent.
While it is incredible to think that just a few weeks ago it would have been ridiculous to think you could run a GPT-3 level model on your personal laptop, this ultimately asks the question: what will we do with this? The easy answer is sophisticated spam. Long term there are concerns that large language models could replace programmers and writers. For writing or tweaking small programs, it is already quite good as [Simon Wilson] demonstrated by asking it to generate some AppleScript. However, that is still up for debate. Being able to spit out an accurate answer to a question does not a human make. What do you do with the raw sort of bias-confused amorphous intelligence that is ChatGPT and other LLMs now running locally?
Rather than connecting to an API, the Raspberry Pi inside of this old typewriter can run it entirely locally with no internet connection required. Because the model is smaller, it becomes much easier to fine-tune for your use case. By taking a bunch of dialog from a TV show (let’s say the Simpsons) you could fine-tune the model to respond like a character from the show. Looking further into the future, there is an excellent paper called ReAct that tries to put something like an internal dialog into chat GPT by asking it to output questions, thoughts, and actions. A good example might be this:
Question: How much bigger is the land in Seattle, Washington versus the water?
Thought: I need to use Wikipedia to look up the square footage of the city area and the water
Action: search_wikipedia: Seattle, WA
Observation:
• City 142.07 sq mi (367.97 km2)
• Land 83.99 sq mi (217.54 km2)
• Water 58.08 sq mi (150.43 km2)
• Metro 8,186 sq mi (21,202 km2)
Thought: The city is 142.07 square miles and the water is 58.08 square miles, I should calculate a ratio.
Action: calculate: 142.07 / 58.08
Observation: 2.4461
Answer: The land is 2.4x the size of the water or 83.99 square miles bigger
You can see how this forms a loop where complex actions can be broken down to be performed by simplified helpers, like searching Wikipedia, calling APIs, controlling smart appliances, or actuating motors. Google has been experimenting with the concept in their PaLM-SayCan system, which uses an LLM (PaLM) and breaks it down into smaller tasks.
We can see LLaMa powering NPCs in video games, optimizing blog titles, and controlling robots. So understandably, we’re quite curious to see what you all do with it. One thing is for sure, though. Putting this in the hands of creative hackers is going to be fun.