Hacker News — vinext + Cloudflare Workers

▲Inference cost at scale with napkin math (injuly.in)

87 points by gmays 5 days ago | 18 comments

breput 1 days ago [-]

> We'll assume a 32B dense model, as they've gotten quite good for production use and a B200 can comfortably serve them. This could be a Gemma, Qwen, DeepSeek, whatever.

That seems like a very consequential point to include halfway through the post. They aren't wrong that Qwen 3.6 26B or Gemma 4 31B are quite good, depending on the use case, but if we're doing napkin math, I'd want some more headroom in the assumptions.

They really ought to have Qwen parameterize their post's calculations and add sliders so a reader could play around with the values.

Edit: And since they specifically mentioned DeepSeek (or whatever), as far as I know, none of their current generation of models is dense, and even the smallest of the mixture of experts (MoE) models is 284B parameters (13B activated). That will completely incinerate their napkin.

martinald 1 days ago [-]

Yes 32B dense is a weird one to choose.

But in reality, 32B dense is very similar* to 32B activated on MoE in terms of inference costs. And I highly suspect eg Opus is around that level of active params.

A 284B-A13B model (284B total, 13B active) at scale is almost certainly cheaper to serve than a 32B dense model.

*as you can shard the model across multiple GPUs at scale. but in reality you have some loss of efficiency from GPU coordination and expert routing

breput 1 days ago [-]

That's good information. I couldn't even begin to run DeepSeek Flash on my system, but if you're assuming multiple GPUs, that is going to affect the napkin math.

martinald 20 hours ago [-]

The point is that tok/s/GPU stays roughly stable. So you need, say, 4 GB200s minimum to fit the model, but this provides 4x the tok/s of 1 GPU.

smalltorch 1 days ago [-]

> This largely depends on whether you own or rent your hardware. At $40,000 per B200, your lifetime cost per user is 40_000/num_users. In the 100% duty cycle case (worst for cost), that's $6k per user. Realistically, serving 300 users per GPU you'll spend a lifetime cost of about $133 per user, plus the datacenter/upkeep bill. If you rent the GPU, the cost is more straightforward. At an hourly rate of $4, your hourly cost per user is 4/num_users. For num_users=300 you get an hourly rate of about $0.013 per user, or $9.36 per month.

This leads me to believe you can buy a GPU but leave it at a data center?

Do people do this? I don't understand. Or are you equating upkeep bill to electricity on premises?
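For reference, the own-versus-rent arithmetic in the quoted passage can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, the rental rate is taken as $4 per hour (which is what the quoted 4/num_users expression implies), and a month is assumed to be 720 hours.

```python
# Napkin math from the quoted passage. All names here are illustrative.
HOURS_PER_MONTH = 720  # assumption: 30 days * 24 hours


def owned_cost_per_user(gpu_price_usd: float, num_users: int) -> float:
    """Lifetime hardware cost per user when the GPU is bought outright."""
    return gpu_price_usd / num_users


def rented_cost_per_user_hour(hourly_rate_usd: float, num_users: int) -> float:
    """Hourly rental cost per user when the GPU is rented."""
    return hourly_rate_usd / num_users


if __name__ == "__main__":
    users = 300
    print(round(owned_cost_per_user(40_000, users), 2))  # 133.33
    hourly = rented_cost_per_user_hour(4, users)
    print(round(hourly, 4))                              # 0.0133
    print(round(hourly * HOURS_PER_MONTH, 2))            # 9.6
```

The unrounded monthly figure comes out at $9.60. The quoted $9.36 appears to come from rounding the hourly rate down to $0.013 before multiplying by 720. Neither figure includes colocation, power, or cooling.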

injuly 21 hours ago [-]

Yes. You can either rent an entire blade, or purchase a disassembled box (the majority share of the price will be the GPU) and place it at a datacenter.

This cannot be done on most premises because of power, noise, and cooling.

__s 1 days ago [-]

You can, people do. https://www.linkedin.com/posts/activity-7409593739138060288-...

smalltorch 1 days ago [-]

So what's the cost separating them from placing this box on their own premises?

Network throughput?

namibj 1 days ago [-]

Plus power and cooling.

smalltorch 7 hours ago [-]

The napkin math should factor in the minimum number of GPUs I need to buy in order to be able to place them in a data center, otherwise this math is not relevant. Because surely any place can cool, power, and protect a single GPU easily.

Not having physical access to my assets doesn't sound secure at all, and even a residential internet connection could handle this throughput.

BadBadJellyBean 1 days ago [-]

Plus space, manpower and security.

JBAnderson5 1 days ago [-]

> Realistically, serving 300 users per GPU you'll spend a lifetime cost of about $133 per user, plus the datacenter/upkeep bill.

What is the operational cost and when does it become more expensive than the upfront capex?

The B200 tops out at 1000W and idles around 140W. It averages around 600W. https://www.lightly.ai/blog/nvidia-b200-vs-h100 U.S. average electricity cost is $.14 per kWh in March. https://www.eia.gov/electricity/monthly/epm_table_grapher.ph...

600/1000 *.14 =$0.084 per hour. $2.01 per day. $60.30 per month. With 300 users, $.20 per user per month. Seems fairly cheap for the electricity.

Does anyone know how to estimate colo/data center rent costs? Where did I screw up my estimates?
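The electricity estimate above, and the question of when power spending catches up with the purchase price, can be checked with a short sketch. The inputs are the thread's own numbers (600 W average draw, $0.14 per kWh, $40,000 per B200, 300 users); the function names and the 30-day month are assumptions for illustration, and colo rent, cooling, and maintenance are deliberately left out.

```python
# Electricity-only running cost for one GPU. Names are illustrative.

def electricity_cost_per_hour(avg_watts: float, usd_per_kwh: float) -> float:
    """Cost of one hour at the given average draw (watts -> kW -> USD)."""
    return avg_watts / 1000 * usd_per_kwh


def electricity_cost_per_month(avg_watts: float, usd_per_kwh: float,
                               days: int = 30) -> float:
    """Monthly cost, assuming the GPU runs 24 hours a day for `days` days."""
    return electricity_cost_per_hour(avg_watts, usd_per_kwh) * 24 * days


def months_until_power_equals_capex(gpu_price_usd: float,
                                    monthly_power_usd: float) -> float:
    """Months of electricity spending needed to match the purchase price."""
    return gpu_price_usd / monthly_power_usd


if __name__ == "__main__":
    monthly = electricity_cost_per_month(600, 0.14)
    print(round(monthly, 2))        # 60.48
    print(round(monthly / 300, 4))  # 0.2016 per user per month
    print(round(months_until_power_equals_capex(40_000, monthly)))
```

Without rounding the daily figure down to $2.01, the month comes to $60.48 rather than $60.30, which does not change the conclusion: at these rates, electricity alone would take several hundred months to match the $40,000 purchase price, so the other operating costs are what matter.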

3eb7988a1663 1 days ago [-]

The EIA says $.14 for "Commercial" but $.086 for "Industrial". I assume that the big data centers have such high electricity costs that they would be able to cut better deals that would put them in line with the lower "Industrial" rates. So, even better potential margin.

BadBadJellyBean 1 days ago [-]

I wonder what the power costs are when you put jet turbines in front of your DC to power it.

martinald 1 days ago [-]

In general, less for fuel cost alone. But you obviously need to buy the turbines.

BadBadJellyBean 1 days ago [-]

I'd like to see a bit of the running costs inside the napkin math. Power, cooling, maintenance, rent, etc. are probably significant factors as well.

what 1 days ago [-]

They can’t do that. The only way they can show that it’s profitable is by only counting electricity.

stevenaenns 1 days ago [-]

> 2B = 562 => B = 331

what kind of math is this? why isn't it B = 562 / 2 = 281?
