One takeaway from this case study that remains valuable is that even a model as limited (by current standards) as GPT-4o can eventually build exactly specified software, given enough iteration steps. GPT-4o was able to correct its own mistakes when enough debug errors were fed back to it, and to reason just well enough to make progress.
Now think of all the small models that can be run on expensive consumer GPUs with roughly that same level of coding capability. Even if they can't match the first-shot quality that GPT 5.4 (or the most recent Claude, Gemini, etc.) produces immediately, the same iterative process demonstrated in this case study, correcting error-prone code in a loop, can eventually converge on correct output.
The big difference is that now anyone can hook a small LLM up to an agentic harness and give it permission to perform all those iterations on its own, fully unattended, so the loop runs far faster than a hands-on session with a human relaying errors. The net effect is that you can get good-quality end results from LLMs that perform far below the frontier.
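To make the idea concrete, here is a minimal sketch of such a fix-and-retry loop. It assumes a hypothetical `generate(prompt)` callable wrapping whatever local model you run (that name and interface are illustrative, not from any particular library): the harness runs the generated script, and if it fails, feeds the traceback back into the prompt for the next attempt.

```python
# Minimal sketch of an unattended fix-and-retry loop around a local LLM.
# `generate(prompt) -> str` is a hypothetical stand-in for your model call.
import subprocess
import sys
import tempfile


def iterate_until_passing(generate, task, max_rounds=10):
    """Ask the model for code, run it, and feed errors back until it exits cleanly."""
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        # Write the candidate program to a temp file and execute it.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".py", delete=False
        ) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True
        )
        if result.returncode == 0:
            return code, result.stdout
        # Failure: append the error output so the next attempt can correct it.
        prompt = (
            f"{task}\n\nYour last attempt failed with:\n"
            f"{result.stderr}\nPlease fix the code."
        )
    raise RuntimeError("model did not converge within max_rounds")
```

Real harnesses add test suites, sandboxing, and token budgets on top of this skeleton, but the core mechanism is exactly this loop: the error stream is the feedback signal, and iteration substitutes for first-shot capability.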