Simon Willison reviews what we learned about Large Language Model AI development in 2024.

He goes into a lot of detail on each point so the article itself is well worth a read (as is his blog in general if you’re interested in this topic). But in summary:

  • Several models that outperformed GPT-4 were released, including ones from Google and Anthropic (Claude) as well as various lesser-known labs. Context lengths also increased.
  • Supercomputers are not required to use them! Many of these models are efficient enough to run locally on a reasonably decent home computer, and the smaller ones can even run on a mobile phone.
  • The cost of running a prompt through a hosted LLM fell dramatically, driven by competition and efficiency gains.
  • Multimodal models - those able to accept images, audio and video as input - became common (a sketch of calling one via a hosted API appears after this list).
  • Live voice and camera modes were added - you can talk to some of them in a way very reminiscent of the film “Her”.
  • You can now build entire apps just by prompting an LLM.
  • The best models stopped being free to use, with OpenAI launching a $200 per month subscription for its fanciest one.
  • There was a lot of buzz about AI agents but they’ve not really taken off yet. It’s not even clear what it means to be an agent.
  • Evaluating models became a very important skill.
  • Apple released a great library (mlx-lm) for running models on Apple silicon - but its consumer Apple Intelligence features were not very exciting. (A sketch of running a model locally with mlx-lm also appears after this list.)
  • New “reasoning” models were released, such as OpenAI’s o1 series. The quality of their output can be improved by increasing inference compute, not just training compute.
  • A leading openly licensed model, DeepSeek v3, was trained for under $6 million.
  • The energy needed to serve an individual prompt, and hence its environmental impact, dramatically decreased.
  • But the environment was adversely impacted in other ways, with all the big tech companies building out a ton of infrastructure - data centres.
  • The word “slop” became a popular way to describe undesirable AI content.
  • Synthetic training data turned out to work well, contrary to what some had originally assumed.
  • Making the best use of LLMs became harder: each has its own limitations, all are inherently unreliable in some way, and learning to work with them well is a non-intuitive skill users need to develop.
  • The knowledge gap between the people who actively follow these models (and hence know what's going on) and the vast majority of the population (who don't) is huge.
  • Much as it's important to critique LLMs for the harms they can cause, the way some people criticise these models is unhelpful and doesn't help anyone get the best value from them.
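
To make the multimodal point above concrete, here is a minimal sketch of sending an image alongside a text prompt to a hosted model using the OpenAI Python client; the model name and image URL are illustrative assumptions, not details from the article.

```python
# Minimal sketch: asking a hosted multimodal model about an image.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
# The model name and image URL below are placeholder examples.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example of a multimodal model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this photo."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```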
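
And for the points about local models and mlx-lm, a minimal sketch of generating text with a small quantised model on Apple silicon; the specific model repository is an assumed example rather than one named in the article.

```python
# Minimal sketch: generating text locally on Apple silicon with mlx-lm.
# Assumes `pip install mlx-lm`; the model repo below is an illustrative example
# of a small quantised model hosted on the Hugging Face Hub.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Format the request with the model's chat template before generating
messages = [{"role": "user", "content": "Summarise 2024 in LLMs in one sentence."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# verbose=True streams tokens to stdout and reports generation speed
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
```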