DiffusionLLM - Inception Mercury 2 hits 11,000 tokens per second on NVIDIA H100 GPUs

Inception Labs just dropped Mercury 2, a new diffusion-based language model that generates text at a reported 11,000 tokens per second on H100 GPUs. This is a big deal because most LLMs generate one token at a time and are still optimized for accuracy over raw throughput. Mercury 2 instead applies diffusion models (the same tech behind image generators like DALL-E) to text generation, which could fundamentally change how we think about inference speed. If this scales, it means faster responses, cheaper API costs, and real-time applications that weren't feasible before. The podcast links are worth checking out for the technical details and for why this matters in production deployments.

Tech Blogger Take

Someone just made every other LLM look like it's running on dial-up internet

Inception Labs dropped Mercury 2 yesterday, and the numbers are absolutely bonkers: 11,000 tokens per second on H100 GPUs. To put that in perspective, most production LLMs are crawling along at a few hundred tokens per second, optimized for accuracy while your users tap their fingers waiting. But here's the kicker: Mercury 2 uses diffusion models, the same tech that powers DALL-E's image generation, except applied to text. It's like someone took a wrong turn in the AI research lab and accidentally solved the speed problem everyone else ignored. If this actually works at production quality (and that's still a big if), we're looking at real-time AI conversations, dirt-cheap API costs, and suddenly every 'too slow for production' AI feature becomes viable. If it holds up, the entire inference cost equation gets flipped on its head.

Verdict

Go download Mercury 2 right now and see if it breaks your current understanding of what's possible; your next product demo might just blow everyone's mind.
8/10

Action

Try It Now
1. Clone the Mercury 2 repository from GitHub
2. Install dependencies and configure for your GPU setup
3. Run the benchmark script to test token throughput on your hardware
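
The benchmark script itself isn't shown here, so treat the following as a minimal sketch of what measuring token throughput could look like. `measure_throughput` and `fake_generate` are hypothetical names; `fake_generate` is only a stand-in for whatever call actually produces tokens on your setup, and nothing here is Mercury 2's real API.

```python
import time
from typing import Callable, List


def measure_throughput(
    generate: Callable[[str], List[str]], prompt: str, runs: int = 3
) -> float:
    """Return tokens per second, aggregated over `runs` calls to `generate`."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_seconds


def fake_generate(prompt: str) -> List[str]:
    # Hypothetical stand-in for a real model call: sleeps briefly,
    # then returns 10 "tokens" for every word in the prompt.
    time.sleep(0.01)
    return prompt.split() * 10


if __name__ == "__main__":
    tps = measure_throughput(fake_generate, "benchmark this short prompt please")
    print(f"{tps:,.0f} tokens/sec")
```

Swap `fake_generate` for your real generation call and use prompts that look like your production traffic. A single short prompt says little about sustained throughput.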
Before

Waiting 3-5 seconds for your LLM to generate a paragraph while users get impatient and API costs pile up

After

Getting instant AI responses that feel like talking to a human, with API costs that actually make sense for high-volume applications

AI Analysis

Cloud Computing & APIs

high
Action Required

Start benchmarking your current token costs against this 11K/sec baseline — if Mercury 2 delivers on production quality, your API bills are about to get slashed
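
Here's one rough way to put that comparison into numbers, assuming you serve the model yourself and pay per GPU-hour. The function name and the $3.00/hour price are illustrative assumptions, not quoted figures. The 220 tokens/sec baseline is simply 11,000 divided by the roughly 50x speedup claimed in the glossary.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD to generate one million tokens on one GPU at a sustained rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


GPU_HOURLY_USD = 3.00   # assumed rental price, for illustration only
CLAIMED_TPS = 11_000    # the headline figure from this article
BASELINE_TPS = 220      # assumed: 11,000 / 50

claimed_cost = cost_per_million_tokens(GPU_HOURLY_USD, CLAIMED_TPS)
baseline_cost = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TPS)

print(f"claimed:  ${claimed_cost:.4f} per 1M tokens")
print(f"baseline: ${baseline_cost:.4f} per 1M tokens")
print(f"ratio:    {baseline_cost / claimed_cost:.1f}x")
```

This ignores batching, idle GPU time, and any vendor pricing markup, so read the output as a sanity check and not as a bill forecast.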

Key Insight

While everyone's been chasing smarter models, Inception Labs just made the entire inference cost equation obsolete by borrowing image generation tech

Why It Matters

Your customers expect instant responses, and you're paying premium prices for models that think too slowly — this could flip both problems overnight

Job Impact Analysis

DevOps Engineer

Role Shift
Why It Impacts

11,000 tokens per second means your current scaling assumptions for LLM workloads just became ancient history

How to Adapt

Download Mercury 2 and stress-test it against your production traffic patterns — if it holds up, your infrastructure costs are about to plummet

Product Manager

Opportunity
Why It Impacts

Real-time AI features that were too expensive or slow before suddenly become viable product opportunities

How to Adapt

Dust off those 'AI-powered live chat' and 'instant content generation' features you shelved — it's time to revisit the roadmap

Keywords

diffusion LLM, throughput, 11000 tokens per second, GPU acceleration, production-grade

Glossary

Diffusion Models
The AI technique behind image generators like DALL-E, which creates content by gradually removing noise. Mercury 2 applies it to text generation at scale.
Tokens per Second
How fast an AI model can generate text, measured in word-pieces. Mercury 2's 11,000 is roughly 50x faster than typical production models.
Inference
The process of an AI model generating responses after training, which is where Mercury 2's speed breakthrough actually matters for real users.
H100 GPU
NVIDIA's Hopper-generation data-center chip for AI workloads, and the hardware Mercury 2 used to hit those 11K token speeds.
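
To make the glossary's numbers concrete, here is a small back-of-the-envelope sketch. The function names are hypothetical, and the 150-token paragraph length is an assumption picked for illustration. It works out the baseline implied by "roughly 50x" and how long a paragraph would take to generate at each rate.

```python
def implied_baseline_tps(claimed_tps: float, speedup: float) -> float:
    """Baseline tokens/sec implied by a claimed rate and a claimed speedup."""
    return claimed_tps / speedup


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Pure generation time for `num_tokens` at a sustained rate."""
    return num_tokens / tokens_per_second


PARAGRAPH_TOKENS = 150  # assumed paragraph length, for illustration only

baseline_tps = implied_baseline_tps(11_000, 50)
for label, tps in [("claimed", 11_000), ("implied baseline", baseline_tps)]:
    ms = generation_seconds(PARAGRAPH_TOKENS, tps) * 1000
    print(f"{label}: {tps:,.0f} tokens/sec -> {ms:.1f} ms per paragraph")
```

These figures cover generation time only. Network round trips, queueing, and time to first token all sit on top, so what users actually feel will be slower than this math suggests.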