
DeepSeek R1, Diffusion Models, and Multi-Agent Systems
From System 1 to System 2: How Test-Time Compute is Disrupting AI
DeepSeek has revolutionized the way we think about AI reasoning and model training. Their recently released R1 and R1-Zero reasoning models are powered by a concept called test-time compute, which shifts the focus from generating a quick output for the user to allowing the model to spend more time reasoning about and refining its responses. This approach allows AI systems to tackle complex queries with greater depth and precision, making them more capable of emulating human-like reasoning processes.
Besides the technological innovations, a significant portion of the current media frenzy surrounding DeepSeek was fuelled by the fact that they’ve open-sourced their models’ weights and training details, a rare move in contrast to companies like OpenAI and Anthropic. Additionally, they reported a headline training cost of under $6 million, a fraction of what other large models cost (strictly speaking, this figure comes from the DeepSeek-V3 technical report for R1’s base model, and it only accounts for the final training run, excluding earlier development and experimentation).
What are DeepSeek’s R1-Zero and R1 Models?
DeepSeek R1-Zero and R1 are large reasoning models (LRMs) designed to enhance reasoning across a wide range of tasks, from natural language understanding to more complex problem-solving activities like mathematics and coding.
- DeepSeek R1-Zero: Trained solely using reinforcement learning, which leverages iterative feedback through tailored reward functions. These rewards promote accuracy, ensuring correct, verifiable outputs on tasks like math problems and coding challenges, and encourage structured reasoning by having the model clearly mark out its thought process with designated reasoning tags (see the sketch after this list). This strategy enables the model to optimize its reasoning capabilities without the need for extensive annotated datasets, saving both cost and training time.
- DeepSeek R1: Building upon the insights from R1-Zero, DeepSeek R1 underwent a multi-stage training process designed to preserve its strong reasoning capabilities while addressing the readability, coherence, and language-consistency issues observed in R1-Zero. Starting with a cold-start phase using curated examples, R1 combines reinforcement learning for reasoning tasks with supervised fine-tuning (SFT) on both reasoning and general-purpose datasets. This hybrid approach makes R1 a powerful and reliable tool for real-world applications, where clear and accurate output is critical.
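To make the reward design concrete, here is a minimal sketch of what such a rule-based reward could look like, assuming a task with a verifiable answer and reasoning marked with `<think>`/`<answer>` tags similar to those in DeepSeek’s published training template. The rules and weights below are illustrative, not DeepSeek’s actual implementation:

```python
import re

def reasoning_reward(completion: str, ground_truth: str) -> float:
    """Illustrative rule-based reward: format check plus verifiable accuracy.

    A simplified sketch, not DeepSeek's actual reward code; the real rules
    and weights are more involved and not fully public.
    """
    reward = 0.0

    # Format reward: the chain of thought should sit inside <think> tags,
    # with the final result inside <answer> tags.
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                      completion, re.DOTALL)
    if match:
        reward += 0.5  # illustrative weight for well-structured output

        # Accuracy reward: compare the extracted answer against a verifiable
        # ground truth (e.g. the known solution to a math problem).
        if match.group(1).strip() == ground_truth.strip():
            reward += 1.0  # illustrative weight for a correct answer

    return reward

# A well-formatted, correct completion earns the full reward.
print(reasoning_reward("<think>2 + 2 = 4.</think> <answer>4</answer>", "4"))  # 1.5
```

Because correctness is checked programmatically rather than labelled by humans, rewards like this scale to millions of RL rollouts with no annotation cost.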
The key innovation behind these models is that they are trained to prioritize reasoning during inference, rather than producing a single, quick response. This allows the model to spend more time “thinking” through a query, akin to how humans take time to reason through complex problems. This idea is known as test-time compute.
Test-Time Compute
OpenAI initially popularized the term “test-time compute” when it released its proprietary o1 reasoning model, as discussed in the article Learning to Reason with LLMs. The following figure from the article illustrates how the o1 model is able to improve its accuracy with both train-time and test-time compute, giving rise to new forms of scaling laws that can push performance capability beyond what current train-time compute alone can achieve.

The relationship between train-time compute and test-time compute has also been explored outside of the world of LLMs, in deep reinforcement learning. The graph below, taken from Scaling Scaling Laws with Board Games, shows the relationship between train-time compute and test-time compute when training a model to play the Hex board game. Each dotted line here represents the minimum compute required to achieve a particular level of performance (Elo score).

This graph shows that, to reach a given level of performance on this task, one can simply trade off train-time compute for test-time compute, and vice versa.
Our intuition was that test-time compute is much ‘cheaper’ than train-time compute, and so we were surprised that one could so readily be substituted for the other. On reflection, however, we believe the key distinction is that optimization at test-time need only optimize over a single sample, while train-time compute must optimize over the entire distribution of samples.
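To make the trade-off concrete: one of the simplest ways to spend extra test-time compute is to draw several candidate answers from the same model and aggregate them, for example by majority vote (often called self-consistency). A toy sketch, where `generate_answer` is a stand-in for a stochastic model call:

```python
import random
from collections import Counter

def generate_answer(question: str) -> str:
    """Stand-in for sampling a model at temperature > 0.

    Here we simulate a model that answers correctly only 60% of the time.
    """
    return "42" if random.random() < 0.6 else str(random.randint(0, 100))

def majority_vote(question: str, n_samples: int) -> str:
    """Spend n model calls of test-time compute, then vote."""
    answers = [generate_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# A single sample is right ~60% of the time; voting over 25 samples is
# right almost always - accuracy bought purely with inference compute.
print(majority_vote("What is the answer?", n_samples=25))
```

No retraining happens here at all: performance improves purely because we let the model spend more compute on the one sample in front of it.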
A Sense of Deja Vu: We’ve Been Here Before in Generative Image Synthesis
The idea of exploiting test-time compute over train-time compute echoes a trend we saw a few years ago in generative image modelling.
For years, Generative Adversarial Networks (GANs) and sophisticated forms of Variational Autoencoders (VAEs) dominated image generation tasks (check out This Person Does Not Exist to see GANs in action). These models were trained to generate images from the data distribution in one single, fast, forward pass through a network. Whilst this is very fast and efficient at inference time, it turns out to be a really challenging thing for artificial neural networks to learn. Perhaps this should be somewhat unsurprising, given that it is also a really hard thing for humans to do - artists don’t create a painting in a single brush stroke, and authors don’t typically write a book in a single draft.
Enter diffusion models like Stable Diffusion, which shifted the paradigm in generative image synthesis. Instead of generating an image in a single, quick pass, diffusion models gradually refine (de-noise) their outputs over many iterative sampling steps, reversing a Markov chain of noise additions. In other words, diffusion models allocate significantly more compute at test-time than their predecessors, enabling them to achieve state-of-the-art performance on image generation benchmarks. This process more closely reflects the way a human artist slowly works toward completing a painting, or a writer revises drafts until the final piece takes shape.
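To see where that extra test-time compute goes, here is a heavily simplified sketch of a DDPM-style reverse (de-noising) loop. The trained noise-prediction network is stubbed out, and real samplers use carefully tuned noise schedules, so treat this as the control flow only:

```python
import numpy as np

def denoise_step(x_t, t, predict_noise, alphas, alpha_bars):
    """One reverse-diffusion step: estimate the noise in x_t and remove a little of it."""
    eps = predict_noise(x_t, t)  # a trained network's noise estimate
    coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
    mean = (x_t - coef * eps) / np.sqrt(alphas[t])
    noise = np.random.randn(*x_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(1 - alphas[t]) * noise

def sample(predict_noise, shape, T=1000):
    """Start from pure noise and iteratively refine over T steps.

    Every extra step is extra test-time compute spent improving the image.
    """
    betas = np.linspace(1e-4, 0.02, T)  # a simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = np.random.randn(*shape)         # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        x = denoise_step(x, t, predict_noise, alphas, alpha_bars)
    return x

# With a trained `predict_noise` network this loop produces an image;
# a zero predictor here just demonstrates the iterative control flow.
img = sample(lambda x, t: np.zeros_like(x), shape=(8, 8), T=50)
```

Contrast this with a GAN, whose generator spends exactly one forward pass per image, however hard the image is.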


This has a conceptual analogy to the two modes of thinking that Daniel Kahneman introduced in his famous book “Thinking, Fast and Slow” (2011):
- System 1: Fast, automatic thinking
- System 2: More reflective, deliberate reasoning
where we could liken the fast and efficient image synthesis of GANs and VAEs to a System 1-like process, and the more gradual, iterative process of diffusion models to System 2.
Diffusion models revolutionized generative image synthesis, and this was driven by the utilization of test-time compute in their iterative sampling process! We’re now witnessing a similar shift in language modelling.
Aside: some very new and exciting research suggests that large language diffusion models are also a very promising approach, where instead of auto-regressively generating text token by token, the model generates the entire text in one go and then iteratively refines it.
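Conceptually (a toy sketch under my own assumptions, not the method from that research), such a decoder might start from a fully masked sequence and, at each refinement step, commit the tokens the model is most confident about:

```python
import random

def diffusion_decode(predict_tokens, length: int, steps: int) -> list:
    """Toy masked-diffusion-style decoding: unmask a few positions per step.

    `predict_tokens` is a hypothetical stand-in for a model that proposes
    (token, confidence) for every masked position, given the partial sequence.
    """
    seq = ["<mask>"] * length
    per_step = max(1, length // steps)
    for _ in range(steps):
        proposals = predict_tokens(seq)  # {position: (token, confidence)}
        # Commit only the most confident proposals; leave the rest masked.
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:per_step]
        for pos, (token, _) in best:
            seq[pos] = token
        if "<mask>" not in seq:
            break
    return seq

# Toy "model": proposes the same word everywhere with random confidence.
toy = lambda seq: {i: ("hi", random.random())
                   for i, tok in enumerate(seq) if tok == "<mask>"}
print(diffusion_decode(toy, length=6, steps=3))
```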
From System 1 to System 2: The Emergence of Large Reasoning Models
Just as we saw a shift in generative image synthesis, large reasoning models represent a similar evolution currently unfolding in language modelling. Rather than simply generating a response in a single, quick step (System 1), large reasoning models such as DeepSeek R1 take a System 2 approach, refining their reasoning through test-time compute. Where diffusion models iteratively refine images by gradually denoising them, DeepSeek R1 refines its reasoning step by step. In image synthesis, this iterative process ensures visual fidelity and coherence; in reasoning, the extra inference steps allow the model to deliberate on complex queries, yielding more thoughtful, accurate outputs and a remarkable improvement in performance.

Something I find particularly impressive about DeepSeek R1-Zero is that the model naturally learnt to spend more time thinking in order to solve reasoning tasks. Unlike in diffusion models, where the number of de-noising steps is a hyper-parameter chosen by the developer, DeepSeek R1-Zero learned this behavior emergently via the reinforcement learning process.
So if language models now have powerful reasoning capabilities, what’s coming next?
The first developments will be brute-force scaling: larger, more powerful models trained for longer on more data (we are already seeing this with Grok 3). However, whilst large reasoning models demonstrate impressive, previously unseen capabilities in solving complex tasks, reasoning about something for longer is not always the right course of action. For example, a knowledge worker might simply need to quickly retrieve relevant context from a client’s new document. No matter how sophisticated the reasoning engine is, if the system isn’t equipped with an effective retrieval tool, it will be unable to solve this problem.
Tools, Multi-Agent Systems, and Beyond
Rather than squeezing all the desired capabilities and knowledge into a single model, I have much more faith in the idea of compound AI systems that comprise multiple models equipped with tools. In fact, if you think about it, enabling models to use tools is just another form of test-time compute.
Tool use has been integral to human intelligence, shaping our unique capacity for complex problem-solving and technological innovation. Imagine AI systems that can decide when to connect to and retrieve information from relevant data sources, interact with numerous applications on your behalf, formulate and execute web search queries to gather the latest updates on a given topic, and even generate and execute custom code. This is a reality today.
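The control flow behind such systems can be surprisingly simple. Below is a minimal, illustrative dispatch loop; `call_llm` is a hypothetical stand-in for a real model call (e.g. via a function-calling API), not any specific library:

```python
# Toy tool registry; a real system would register retrieval, web search,
# code execution, and so on.
TOOLS = {
    "search": lambda query: f"Top result for {query!r}",
}

def call_llm(messages):
    """Hypothetical model call. This toy policy asks for one search and
    then answers; a real model would decide dynamically at every step."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "input": messages[0]["content"]}
    return {"answer": f"Based on what I found: {messages[-1]['content']}"}

def agent_loop(user_query: str, max_steps: int = 5) -> str:
    """Let the model decide, step by step, whether to answer or call a tool.

    Every tool call is extra test-time compute spent on the query.
    """
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        decision = call_llm(messages)
        if "answer" in decision:
            return decision["answer"]
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[decision["tool"]](decision["input"])
        messages.append({"role": "tool", "content": result})
    return "Step budget exhausted."

print(agent_loop("Latest updates on test-time compute?"))
```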
How Instill AI Leverages Reasoning Models as Tools
At Instill AI, we’re developing a multi-agent AI system with advanced tools to process unstructured data—documents, audio, and video—enabling complex analysis and insight generation across industries such as finance, law, and healthcare. In particular, our decision-making unit will be able to invoke large reasoning models like DeepSeek R1 as one of many tools at its disposal. This will allow Instill AI to dynamically handle intricate queries that require deeper cognitive processing and deliberation. The result is a multi-agent system capable of answering complex questions accurately and reliably, whilst retaining efficient low-latency responses in situations where deeper reasoning is not required.
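As a purely illustrative picture of that routing decision (not Instill AI’s actual implementation), imagine the decision-making unit escalating only the queries that genuinely need deliberation:

```python
def fast_model(query: str) -> str:
    return f"[fast, low-latency answer] {query}"

def reasoning_model(query: str) -> str:
    return f"[slow, deliberate answer] {query}"  # e.g. DeepSeek R1 invoked as a tool

def route_query(query: str) -> str:
    """Illustrative router: pay for test-time compute only when it helps.

    The keyword trigger here is a crude placeholder; in practice the router
    could itself be a classifier or an LLM choosing among its tools.
    """
    hard_signals = ("why", "prove", "compare", "multi-step")
    if any(signal in query.lower() for signal in hard_signals):
        return reasoning_model(query)  # deep reasoning path
    return fast_model(query)           # fast path for simple queries

print(route_query("Retrieve the client's new document"))
print(route_query("Why did revenue fall across these three filings?"))
```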
P.S. We are building Instill AI using the open-source full-stack AI platform Instill Core, which you can use to serve reasoning models or diffusion models like the ones discussed in this article.