A few weeks ago, DeepSeek's announcement of its highly capable R1 model, which combines strong performance with low resource costs, thrilled the tech community and rattled the US stock market. R1 is part of a growing trend: AI models trained using a technique called distillation. Essentially, distillation is an approach to training a smaller, faster model by letting it learn from a bigger, smarter one, so the smaller model keeps most of the larger one's capabilities while running far more efficiently. However, distillation itself is not our focus here.
OpenAI and similar companies are trying to protect their intellectual property by limiting how their models can be used to train competitors. They may take countermeasures such as banning accounts or IP addresses, lowering request limits, and legally prohibiting the use of their models to build competing ones.
Can a powerful model be built on a budget?
A recent experiment conducted by researchers from Stanford and the University of Washington demonstrated that it is indeed possible.
TL;DR: The researchers took Alibaba's Qwen2.5 as a base, distilled reasoning data from Gemini 2.0 Flash Thinking (free within rate limits), fine-tuned the model on 16 NVIDIA H100 GPUs in 26 minutes for under $50 in compute, and got a competitor to the o1-preview model that, the paper says, answers math questions 27% better.
The s1 model demonstrates how AI systems can be trained efficiently through strategic data curation, supervised fine-tuning (SFT), and budget forcing. Rather than depending on costly, large-scale datasets, the researchers developed s1K, a compact, high-quality dataset of 1,000 carefully curated reasoning questions, each paired with a reasoning trace and answer distilled from the Gemini Thinking Experimental model. This captures complex problem-solving patterns without requiring manual annotation. By fine-tuning Alibaba's Qwen2.5-32B-Instruct on this dataset, they built a highly capable model at a fraction of the usual cost.
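To make the data-curation step concrete, here is a minimal sketch of how one might assemble an s1K-style file. The `query_teacher` helper is hypothetical (a stand-in for whatever teacher-model API you use, Gemini in the paper's case), and the length-based difficulty filter is an illustrative assumption rather than the paper's actual selection criteria, which combine quality, difficulty, and diversity checks.

```python
# Illustrative sketch of s1K-style data collection, not the paper's pipeline.
import json

def query_teacher(question: str) -> dict:
    """Hypothetical helper: call the teacher model's API and return its
    reasoning trace and final answer, e.g. {"thinking": "...", "answer": "..."}."""
    raise NotImplementedError("plug in your teacher-model client here")

def build_dataset(questions, out_path="s1k_style.jsonl", max_examples=1000):
    kept = []
    for q in questions:
        resp = query_teacher(q)
        trace, answer = resp["thinking"], resp["answer"]
        # Crude difficulty proxy (assumption): keep only questions that need
        # a reasonably long reasoning trace.
        if len(trace.split()) < 200:
            continue
        kept.append({"question": q, "thinking": trace, "answer": answer})
        if len(kept) >= max_examples:
            break
    # One JSON object per line: question, distilled reasoning trace, answer.
    with open(out_path, "w") as f:
        for row in kept:
            f.write(json.dumps(row) + "\n")
    return kept
```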
The core of their training method was supervised fine-tuning: the model learned directly from Gemini's reasoning traces rather than going through a conventional teacher-student distillation pipeline. The 26-minute fine-tuning run on 16 NVIDIA H100 GPUs cost less than $50, proving that fine-tuning a strong open-weight model on well-curated data can lead to significant performance gains.
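At its core, this step is plain supervised next-token training on the concatenation of question, reasoning trace, and answer. The sketch below is a single-GPU toy version under stated assumptions: a small Qwen2.5 stand-in instead of the 32B model, an invented prompt template, illustrative hyperparameters, and no loss masking or distributed setup.

```python
# Minimal single-GPU SFT sketch; model size, template, and hyperparameters are
# illustrative, not the paper's exact configuration.
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"   # small stand-in for Qwen2.5-32B-Instruct
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).cuda()

def format_example(row):
    # Concatenate question, reasoning trace, and answer into one training string.
    return (f"Question: {row['question']}\n"
            f"<think>{row['thinking']}</think>\n"
            f"Answer: {row['answer']}{tok.eos_token}")

texts = [format_example(json.loads(line)) for line in open("s1k_style.jsonl")]

def collate(batch):
    enc = tok(batch, return_tensors="pt", padding=True, truncation=True, max_length=4096)
    # For simplicity the loss covers every token; a fuller version would mask
    # the question and padding tokens.
    enc["labels"] = enc["input_ids"].clone()
    return {k: v.cuda() for k, v in enc.items()}

loader = DataLoader(texts, batch_size=1, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):                 # a few passes over the 1,000 examples
    for batch in loader:
        loss = model(**batch).loss     # standard next-token cross-entropy
        loss.backward()
        optim.step()
        optim.zero_grad()
```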
To optimize inference efficiency, the researchers implemented budget forcing, a technique that regulates how long the model spends on reasoning. If a response exceeded a certain length, an end-of-thinking token signaled the model to stop and deliver an answer. Conversely, appending the word "Wait" prompted the model to extend its reasoning, leading to more accurate answers. This simple yet powerful adjustment boosted the model's accuracy on the American Invitational Mathematics Examination (AIME) 2024 from 50% to 57%.
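A rough, text-level sketch of what budget forcing can look like around a standard generate loop is shown below. The `<think>`/`</think>` delimiters, the "Wait" continuation string, and the token budgets are assumptions for illustration; the actual s1 model uses its own special tokens and prompt format.

```python
# Illustrative budget-forcing wrapper around Hugging Face generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# In practice this would be the fine-tuned checkpoint from the previous step.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
).cuda()

@torch.no_grad()
def answer_with_budget(question, thinking_budget=512, extra_rounds=0):
    # Phase 1: bounded reasoning. Cutting generation off at `thinking_budget`
    # tokens plays the role of forcing an early end of thinking.
    text = f"Question: {question}\n<think>"
    ids = tok(text, return_tensors="pt").input_ids.cuda()
    out = model.generate(ids, max_new_tokens=thinking_budget, do_sample=False)
    text = tok.decode(out[0], skip_special_tokens=True)

    # Phase 2 (optional): append "Wait" to nudge the model into thinking
    # longer instead of committing to an answer.
    for _ in range(extra_rounds):
        ids = tok(text + "\nWait,", return_tensors="pt").input_ids.cuda()
        out = model.generate(ids, max_new_tokens=thinking_budget, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)

    # Phase 3: close the thinking block and force a final answer.
    ids = tok(text + "\n</think>\nAnswer:", return_tensors="pt").input_ids.cuda()
    out = model.generate(ids, max_new_tokens=64, do_sample=False)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

print(answer_with_budget("What is 17 * 24?", thinking_budget=128, extra_rounds=1))
```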
The s1-32B model surpassed OpenAI's o1-preview by 27% on competitive math benchmarks, demonstrating that small, well-trained models can compete with those built using vast computational resources. This research challenges the notion that state-of-the-art AI requires billion-dollar training pipelines. Instead, it points to a future where strategic dataset design, fine-tuning, and inference optimization can democratize AI model training.
If someone wants to run this process independently, a single H100 GPU costs around $30,000, putting the total GPU bill at $480,000. That's roughly a $500,000 investment versus the billions spent by major AI players, for nearly the same results.
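For context, a quick back-of-the-envelope check reconciles the two price tags above: the run only consumes a handful of GPU-hours, so renting compute is cheap even though buying the hardware is not. The per-hour rental rate below is an assumption for illustration, not a figure from the paper.

```python
# GPU count, runtime, and purchase price come from the text above; the hourly
# rental rate is an assumed value for illustration.
gpus, minutes = 16, 26
gpu_hours = gpus * minutes / 60            # ~6.9 H100-hours for the whole run
purchase_cost = gpus * 30_000              # $480,000 to buy the GPUs outright
rental_cost = gpu_hours * 7                # ~$49 at an assumed $7 per H100-hour
print(f"{gpu_hours:.1f} GPU-hours | buy: ${purchase_cost:,} | rent: ~${rental_cost:.0f}")
```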
New LLM-based products are just around the corner
If AI can be trained this efficiently, what's stopping individuals or small teams from building their own models? With the right expertise and a few hundred dollars, crafting a custom AI could soon be as accessible as, say, getting a dog. 🐶
Open-weight models like Mistral, Qwen, and Llama are getting closer to proprietary ones like GPT, chipping away at Big Tech's dominance. Distillation allows teams to train high-quality models using API access instead of building from scratch, at a fraction of the cost. As a bonus, it reduces dependency on a single provider.
If this trend continues, AI might follow the cloud computing model: big companies still dominate the infrastructure, but smaller players gain power by optimising and customising models for specific needs, cost efficiency, and control.
The barriers to AI development are falling. What happens when anyone can train a high-performing AI assistant for the price of a laptop?