
At a seminar at URA Lab – Ho Chi Minh City University of Technology (HCMUT), the atmosphere that day felt “different” in a way that was easy to notice: people came not just to listen to a talk, but to step into a longer conversation about where AI is heading. Dr. Phạm Hy Hiếu (from OpenAI) presented the topic “Two strategies to speed up reasoning models”, and although it sounds deeply technical, the way he delivered it made the whole session surprisingly easy to follow.
The first thing that stood out was his discipline with time. Dr. Hiếu set a clear rule for himself: he would speak for only 45 minutes. Yet those 45 minutes never felt rushed. Instead, the talk was structured, crisp, and simplified in exactly the right places—so clear that even first-year students in the lab could understand it. It wasn’t the kind of simplification that removes important ideas; it was the kind that builds the foundation first, then climbs to the harder parts.

Starting with the Basics: What Is a Reasoning Model?
Rather than jumping straight into GPU-level optimization or “speed-up tricks,” Dr. Hiếu began with the most basic question: What is a reasoning model, and how is it different from a regular model? That choice mattered. By starting from shared ground, he made sure everyone in the room was moving forward together.
He explained it in a very intuitive way. A model without reasoning often tries to go directly to an answer. A reasoning model, on the other hand, tends to produce a brief “thinking” phase before it commits to the final response. In the slides, this distinction was made concrete through two simple labels: <|think|>, marking the model’s thinking phase, and <|answer|>, marking the final answer. This framing helped many people relax: reasoning models are not magic. They are still language models, but trained to follow a pattern of thinking first, answering second, much closer to how humans approach difficult problems.
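To make the pattern concrete, here is a minimal sketch of how an output carrying those two markers could be split into its thinking and answer parts. The marker names come from the slides; the small helper itself is only an illustration, not any particular model’s API.

```python
# Minimal sketch: splitting a reasoning model's output into its "thinking"
# and "answer" phases, assuming the <|think|> / <|answer|> markers from the
# slides delimit the two parts. Illustrative helper, not a real model API.

def split_reasoning_output(text: str) -> tuple[str, str]:
    """Return (thinking, answer) from a raw generation string."""
    THINK, ANSWER = "<|think|>", "<|answer|>"
    if ANSWER not in text:
        return "", text.strip()            # model answered directly, no thinking phase
    before, answer = text.split(ANSWER, 1)
    thinking = before.split(THINK, 1)[-1]  # drop anything before the think marker
    return thinking.strip(), answer.strip()


raw = "<|think|> 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408 <|answer|> 408"
thinking, answer = split_reasoning_output(raw)
print("thinking:", thinking)
print("answer:  ", answer)
```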
Two Worlds Meet: Research Questions vs Industry Constraints
Once that foundation was in place, Dr. Hiếu brought the room to the real point: it’s not enough that something works. In research, we often ask “What has been solved?” “What is the algorithm?” “What is SOTA?” But in industry, the questions become more direct and practical: “How is it computed?” “Is it fast?” “Is it affordable?” “Can it scale?” And this is where the talk truly “woke the room up”: reasoning can make models much stronger—especially on hard tasks—but the tradeoff is latency and cost. The more the model thinks, the more it computes, and the more expensive it becomes.
From that moment, the discussion naturally shifted. People didn’t only ask whether an approach is correct. They started asking: How much does it cost? Where does the cost come from? Can we reduce the cost without losing quality? It was the kind of shift that turns a seminar into a shared problem-solving session.
Why Reasoning Gets Expensive
Dr. Hiếu then grounded everything in a simple reality: large language models generate text step by step. If you ask a model to think longer, you’re asking it to take more steps—and each step repeats heavy internal computation. Longer reasoning means more steps, which means both time and cost grow quickly. In plain terms, reasoning is expensive because it forces the model to “do the hard work” many more times.
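To see why the cost grows so fast, here is a back-of-the-envelope sketch with made-up numbers (not figures from the talk): every generated token is one more forward pass, and the attention inside that pass has to look back over everything produced so far, so the total lookback work grows much faster than the length of the trace itself.

```python
# Back-of-the-envelope sketch (illustrative numbers, not a real model):
# each generated token requires one forward pass, and the attention part of
# that pass must look back at every token produced so far, so its cost grows
# with the running length of the reasoning trace.

def decoding_cost(num_generated: int, prompt_len: int = 200) -> int:
    """Total attention 'lookback' work, in token-comparisons, for one generation."""
    return sum(prompt_len + t for t in range(num_generated))

for n in (100, 1_000, 10_000):  # short answer vs. long reasoning trace
    print(f"{n:>6} generated tokens -> {decoding_cost(n):>12,} token-comparisons")

# Thinking 10x longer costs far more than 10x in lookback work,
# because each extra step also has a longer history to attend to.
```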
The Title Clicks: Two Directions for Speeding Things Up
This is where the seminar title clicked into focus. Dr. Hiếu didn’t present speed-up methods as random hacks. He framed the problem like an engineer building a system: if reasoning is expensive, there are two big directions to make it cheaper and faster.
Strategy 1: Split the Heavy Work and Run It in Parallel (Split-K)
The first direction is to make the model’s heavy computations run faster by splitting the workload and running parts in parallel, then combining the results. Dr. Hiếu described the intuition in a very human way: don’t make one person carry an entire cabinet; split it into parts and let several people carry it together. In the slides, this idea was associated with Split-K, which splits the shared inner dimension of the model’s large matrix multiplications so that partial results can be computed simultaneously rather than sequentially, and then summed at the end.
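For readers who want to see the shape of the idea, here is a conceptual NumPy sketch, assuming the usual meaning of split-K for a matrix multiply: the shared inner dimension is cut into chunks, each chunk’s partial product is computed independently (on real hardware, by separate blocks running in parallel), and a cheap final step sums the partial results. This illustrates the principle; it is not the kernel code from the talk.

```python
import numpy as np

# Conceptual split-K sketch for C = A @ B: split the shared inner ("K")
# dimension into chunks, compute each chunk's partial product independently
# (in parallel on real hardware), then sum the partials. Illustration only.

def splitk_matmul(A: np.ndarray, B: np.ndarray, num_splits: int = 4) -> np.ndarray:
    K = A.shape[1]
    bounds = np.linspace(0, K, num_splits + 1, dtype=int)
    # Each partial product only sees a slice of the inner dimension.
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return sum(partials)  # the cheap final reduction step

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 1024)), rng.standard_normal((1024, 16))
assert np.allclose(splitk_matmul(A, B), A @ B)  # same result as the plain product
```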
Strategy 2: Reduce the Cost of “Looking Back” (Split-KV)
The second direction targets the other major reason why reasoning gets slow: as the model generates more text, it has to “look back” at everything that came before to stay coherent, and it keeps that history in a cache of keys and values. The longer the reasoning trace becomes, the heavier that lookback becomes. So the strategy, again, is to restructure the work so it can be handled in parallel, then merged properly. In the slides, this was discussed as Split-KV: splitting the cached keys and values into chunks that can be processed at the same time. It is more complex than the first strategy, but very much in the spirit of big-tech engineering: hard, but worth it, because it directly reduces real-world latency and cost at scale.
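Again as a conceptual sketch rather than the talk’s actual implementation: the cached keys and values can be cut into chunks, attention can be computed per chunk (in parallel, in a real system), and the chunk results can be merged exactly using their softmax statistics. The code below shows that merge for a single query with illustrative sizes.

```python
import numpy as np

# Conceptual sketch of splitting the "lookback", assuming a split-KV /
# chunked-attention scheme: the long list of cached keys and values is cut
# into chunks, attention is computed per chunk (in parallel in a real system),
# and the chunk results are merged exactly via running softmax statistics.
# Single query, single head, illustrative only.

def chunked_attention(q, K, V, chunk: int = 256):
    m, s, out = -np.inf, 0.0, np.zeros(V.shape[1])  # running max, sum, weighted values
    scale = 1.0 / np.sqrt(q.shape[0])
    for start in range(0, K.shape[0], chunk):
        logits = K[start:start + chunk] @ q * scale
        new_m = max(m, logits.max())
        w = np.exp(logits - new_m)         # this chunk's unnormalized weights
        correction = np.exp(m - new_m)     # rescale everything merged so far
        s = s * correction + w.sum()
        out = out * correction + w @ V[start:start + chunk]
        m = new_m
    return out / s

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((4096, 64)), rng.standard_normal((4096, 64))
scores = K @ q / np.sqrt(64)
ref = np.exp(scores - scores.max())
ref = (ref / ref.sum()) @ V
assert np.allclose(chunked_attention(q, K, V), ref)  # matches full attention
```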
What Stayed With the Room
What stayed with the room was not just the techniques, but the mindset behind them. Dr. Hiếu didn’t only bring knowledge; he brought a way of thinking shaped by production realities, where everything eventually comes down to three words: fast, affordable, scalable. That is likely why he structured the talk the way he did: first explain what a reasoning model is, then explain why thinking is expensive, then show the two strategies that attack the expense head-on.
By the end, the feeling wasn’t “this is too far away.” It was “now we see the path.” People left understanding not only what reasoning models are, but why they are costly—and how large-scale systems confront that cost with clear, structured ideas.
If the day could be summarized in one line, it would be this: the seminar didn’t just explain what reasoning models are—it explained how to make reasoning fast enough, cheap enough, and scalable enough to live in the real world.