Most AI startups I talk to are spending 3-10x more on inference than they need to. Not because the technology is hard, but because the same six mistakes show up over and over again, in nearly the same order, regardless of what the team is building.
I run engineering for a production AI system that serves 30B+ predictions a year end-to-end in under 100ms. I started InferWorks because every time I look under the hood of another funded startup’s stack, I find the same recoverable waste. None of it is exotic. All of it is fixable in weeks, not quarters.
This post is the list. Six mistakes, what they look like from the inside, and roughly what each one costs you. No conclusion section, no newsletter pitch. If two or three of these feel familiar, the offer at the bottom is the only thing I’m selling.
Mistake 1. You’re using a managed vector DB at a scale where it stops making sense
Pinecone and pgvector are the right call at the prototype stage. They get expensive fast at scale, but not for the reason most people assume. It’s not the vectors. It’s the operations.
The symptoms are easy to spot. Your vector DB line item is growing faster than your active user count. Latency is creeping up as the index grows. And the real tell: your engineers are debating whether to shrink the dataset (drop old documents, tighten filters) instead of scaling the infrastructure, because scaling the managed tier got expensive enough to feel like a wall.
Here’s the part that surprises people: on a managed vector DB, storage is cheap and operations are not. Pull up Pinecone’s own calculator and the shape is obvious. Ten million vectors at 1,536 dimensions with light query traffic is about $350/month, and storage is a rounding error at $24 of it. Keep the same 10M vectors but run real traffic through dedicated read nodes and you’re at ~$1,000/month. Push to 100M vectors with the query and write volume a live product actually generates and it’s ~$7,600/month, of which roughly $6,800 is query cost and only $236 is storage.
That reframes the whole decision. The crossover point isn’t a vector count; it’s a workload shape. A near-static index (embed once, query occasionally) stays cheap on managed services for a long time. A dynamic catalog that’s constantly re-embedded and queried (a marketplace, a live document store, anything with churn) hits the wall early, because you’re paying per operation and your operations never stop. Self-hosted Qdrant, Weaviate, or LanceDB price the box, not the operation, so they scale roughly linearly with the compute you provision instead of punishing you for throughput.
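To make the storage-versus-operations split concrete, here is a small sketch that computes storage's share of the bill from the calculator figures quoted above. Only the dollar amounts come from this post; the function and scenario names are made up for illustration.

```python
# Monthly figures quoted above from Pinecone's calculator (USD).
scenarios = {
    "10M vectors, light queries": {"total": 350, "storage": 24},
    "100M vectors, live traffic": {"total": 7600, "storage": 236},
}


def storage_share(total: float, storage: float) -> float:
    """Fraction of the monthly bill that is storage."""
    return storage / total


for name, s in scenarios.items():
    share = storage_share(s["total"], s["storage"])
    print(f"{name}: storage {share:.1%}, everything else {1 - share:.1%}")
```

In both scenarios storage stays in the single digits as a percentage of the bill, which is why trimming the dataset rarely moves the number much.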
The objection is often operational: “self-hosting is hard, we don’t have the headcount.” But the math answers itself. If self-hosting saves you $10k/month, you can hire a part-time DevOps contractor for a fraction of that and still come out far ahead, and the migration is usually a 1-2 week project, not a quarter of platform work: stand up the cluster, dual-write, backfill, cut over reads, decommission.
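The dual-write, backfill, and cut-over steps can be sketched roughly as below. This is a toy, in-memory illustration: `InMemoryIndex`, `DualWriteIndex`, and `backfill` are hypothetical names standing in for your real managed and self-hosted clients, not any vendor's API.

```python
class InMemoryIndex:
    """Stand-in for a vector index client (hypothetical interface)."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, doc_id, vector):
        self.vectors[doc_id] = vector

    def get(self, doc_id):
        return self.vectors.get(doc_id)


class DualWriteIndex:
    """Writes go to both indexes; reads come from whichever is primary."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.read_from_new = False  # flip once the backfill is verified

    def upsert(self, doc_id, vector):
        self.old.upsert(doc_id, vector)
        self.new.upsert(doc_id, vector)

    def get(self, doc_id):
        primary = self.new if self.read_from_new else self.old
        return primary.get(doc_id)


def backfill(old, new):
    """Copy records that predate dual-writing without overwriting newer writes."""
    copied = 0
    for doc_id, vector in old.vectors.items():
        if new.get(doc_id) is None:
            new.upsert(doc_id, vector)
            copied += 1
    return copied
```

The order matters: start dual-writing first so nothing written during the backfill is lost, then backfill, then flip `read_from_new`, and only then decommission the old index.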
Rough cost impact: depends entirely on workload shape. Modest for a static index; very large for a high-throughput or write-heavy one, where managed operation pricing is the dominant line item.
Mistake 2. You’re re-embedding documents that haven’t changed
Most embedding pipelines I look at re-embed an entire document every time any part of it updates, even when 95% of the content is byte-for-byte identical to what’s already in the index. You’re paying the embedding API to recompute vectors you already have.
You’ll recognize this if your embedding bill grows with how often documents change rather than with how much net new content you’re adding. Re-ingestion that should take minutes takes hours. And the OpenAI or Cohere line item feels mysteriously large relative to how much your corpus is actually growing.
The fix is content-hash-based incremental embedding at the chunk level. Chunk the document, hash each chunk, and only call the embedding API for chunks whose hash you haven’t seen:
    import hashlib

    for chunk in chunk_document(doc):
        # Identical chunk text always produces the same hash key.
        h = hashlib.sha256(chunk.text.encode("utf-8")).hexdigest()
        if h in embedding_store:     # unchanged, skip the API call
            continue
        vector = embed(chunk.text)   # only new/changed chunks hit the API
        embedding_store[h] = vector
        index.upsert(chunk.id, vector)
When 95% of chunks short-circuit, you stop paying for 95% of the work, and re-ingestion gets dramatically faster, because the expensive network round-trips only happen for the chunks that actually changed. The user-facing win (faster updates) and the cost win come from the same line of code.
Where this pays off most is systems with frequent small edits (a knowledge base, a docs site, a wiki), where a typical update touches a paragraph and leaves the rest of the document untouched. Diff the document, identify which chunks actually changed, and re-embed only those. For a system with that edit pattern, this alone cuts embedding spend by 10-50%, and the savings track exactly how granular your changes are: lots of small edits to large documents is the best case.
And there’s a second lever hiding in the same place. Everything above assumes you’re paying a third-party API (OpenAI, Cohere, Google) per embedding call. But embedding models are small. Most of the ones teams actually use run comfortably in-house, and if you bring the pipeline in-house you can cut embedding compute costs by 10-50x versus the per-call API or a naively provisioned GPU. Incremental embedding cuts how often you call; running the model yourself cuts what each call costs. Do both and the embedding line item nearly disappears. (This is really a special case of Mistake 3 below: the embedding model is the most common workload sitting on an API or a GPU when it doesn’t need either.)
Rough cost impact: 10-50% from incremental re-embedding alone for an edit-heavy corpus, plus a further 10-50x if you move the embedding pipeline in-house.
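The "diff the document" step can be sketched as a set comparison on chunk hashes. This is a minimal illustration, assuming you keep the set of chunk hashes from the previous version of each document; `plan_reembedding` is a hypothetical helper name. It also returns the hashes of chunks that no longer exist, so stale vectors can be removed from the index.

```python
import hashlib


def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(
    old_hashes: set[str], new_chunks: list[str]
) -> tuple[list[str], set[str]]:
    """Return (chunks to embed, stale hashes to delete) for an updated document."""
    new_by_hash = {chunk_hash(c): c for c in new_chunks}
    to_embed = [c for h, c in new_by_hash.items() if h not in old_hashes]
    to_delete = old_hashes - set(new_by_hash)
    return to_embed, to_delete


# A three-chunk document where only the middle chunk was edited.
old = {chunk_hash(c) for c in ["intro", "pricing v1", "faq"]}
to_embed, to_delete = plan_reembedding(old, ["intro", "pricing v2", "faq"])
print(len(to_embed), "chunk to embed,", len(to_delete), "stale vector to delete")
```

Only the edited chunk goes to the embedding API; the two untouched chunks cost nothing.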
Mistake 3. You’re paying GPU prices for workloads that don’t need GPUs
This is the big one. The default assumption is AI = GPU = SageMaker or Bedrock, and for a large class of production workloads it’s simply wrong.
Embeddings, reranking, classifiers, small transformers, and even some 1-3B-parameter generation models run on CPU with ONNX or a similar runtime at a fraction of the cost, frequently with faster cold starts, and they scale horizontally with little effort. The reason almost nobody does this isn’t that it’s hard. It’s that the default AWS path funnels you toward GPU, and the SageMaker tutorial you followed on day one never mentioned ONNX. The decision got made by inertia, and nobody on the team has had a free week to revisit it.
Here’s the tell: your SageMaker bill dominates your cloud spend, latency is actually fine, but cost keeps climbing with traffic. That’s a workload that’s GPU-bound for no reason other than how it was deployed.
The reframe I’d push: stop assuming “we’re an AI company, so we need GPUs.” Ask “what’s the smallest unit of compute that meets my latency budget for this specific workload?” Run that question per model, not per company. Embeddings and rerankers almost always come back CPU-viable. Large generative calls genuinely need the GPU, so leave those there and move everything else.
I’m certain about this one because I’ve watched it play out more than once: move the right class of workload off GPU and the cost multiplier lands somewhere between 10x and 50x. The surprise is never that it’s cheaper; it’s how much.
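The per-model question can be written down as a tiny selection rule: benchmark each deployment option on your own traffic, then take the cheapest one that fits the latency budget. The sketch below assumes you have those measurements; `Candidate` and `cheapest_within_budget` are hypothetical names, and the numbers are placeholders, not benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    p95_latency_ms: float  # measured on your own traffic
    cost_per_1k: float     # USD per 1,000 requests at your utilization


def cheapest_within_budget(candidates, budget_ms):
    """Return the lowest-cost candidate whose p95 latency fits the budget, or None."""
    viable = [c for c in candidates if c.p95_latency_ms <= budget_ms]
    return min(viable, key=lambda c: c.cost_per_1k, default=None)


# Placeholder measurements for one embedding model (illustrative only).
options = [
    Candidate("gpu-endpoint", p95_latency_ms=12, cost_per_1k=0.40),
    Candidate("cpu-onnx", p95_latency_ms=35, cost_per_1k=0.02),
]
choice = cheapest_within_budget(options, budget_ms=50)
print(choice.name if choice else "nothing fits the budget")
```

The point of the rule is that the GPU only wins when the budget is tight enough to exclude everything cheaper.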
Rough cost impact: 10-50x on every workload you can correctly move off GPU, plus the operational win of not having to provision, queue for, and babysit GPU capacity at all.
Mistake 4. No caching layer for similar queries
This one matters most if you’re building agentic products: agent loops generate a lot of internally repetitive traffic, and that’s exactly what caching reclaims. A surprising share of production LLM traffic (anywhere from 5% to 50% depending on the workload, and toward the high end in agentic systems) is queries similar enough to previous ones that the same answer would have been correct. Without semantic caching, you pay full model price on every one of those duplicates.
Watch for identical or near-identical queries hitting your flagship model multiple times a day, support or agent automation regenerating answers to questions you’ve responded to thousands of times, and an LLM bill that scales perfectly linearly with traffic, no efficiency gain as volume grows, which is the signature of zero caching.
The architecture is genuinely simple: embed the incoming query, look it up against your cache of previous query embeddings, and if something is close enough, return the cached response; otherwise fall through to the model and cache the result.
    def answer(query):
        qv = embed(query)
        hit = cache.nearest(qv)        # closest cached query, or None
        if hit and hit.distance < THRESHOLD:
            return hit.response        # cached, no LLM call
        resp = llm(query)
        cache.add(qv, resp)            # store for future near-duplicates
        return resp
The threshold is the part to get right, and I’d resist hard-coding a number you read in a blog post, including this one. As a starting point for cosine distance on a small embedding model, somewhere around 0.1-0.15 is defensible, but it’s model-dependent and you should calibrate it against your own traffic before trusting it. Set it too loose and you get false hits: a confidently wrong cached answer returned for a query that only looked similar. Start conservative, measure your false-hit rate, loosen carefully.
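One way to calibrate is to hand-label a sample of query pairs from your own logs as "same answer" or "different answer", then check how many would-be hits are wrong at each candidate threshold. The sketch below assumes that labeled sample exists; the function names and the sample values are hypothetical.

```python
def false_hit_rate(pairs, threshold):
    """pairs: (cosine_distance, same_answer) tuples labeled by hand.

    Returns the fraction of cache hits at this threshold that are wrong,
    or None if nothing would hit.
    """
    hits = [same for dist, same in pairs if dist < threshold]
    if not hits:
        return None
    return sum(1 for same in hits if not same) / len(hits)


def loosest_safe_threshold(pairs, candidates, max_false_hit_rate):
    """Largest candidate threshold whose false-hit rate stays within the target."""
    safe = []
    for t in candidates:
        rate = false_hit_rate(pairs, t)
        if rate is not None and rate <= max_false_hit_rate:
            safe.append(t)
    return max(safe, default=None)


# Illustrative labels only; use pairs sampled from your own traffic.
labeled = [(0.02, True), (0.05, True), (0.09, True),
           (0.12, False), (0.14, True), (0.20, False)]
print(loosest_safe_threshold(labeled, [0.05, 0.10, 0.15], max_false_hit_rate=0.05))
```

A few hundred labeled pairs is usually a more trustworthy basis than any published default, since the right value depends on the embedding model and on how costly a wrong answer is in your product.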
For where to actually store the cache: any vector store you already run works, but the common production choices are Redis (with its vector search module), Qdrant, or pgvector if you want to keep it in Postgres, and for a simple in-process cache, even a local FAISS index does the job. You don’t need new infrastructure for this; you almost certainly already have something that can hold query embeddings.
Hit rates run anywhere from 5% to 50% depending on workload: lower for open-ended RAG, higher for customer support and agent loops where the same intents recur. The savings land immediately, because every hit is a model call you didn’t make.
Rough cost impact: directly proportional to your hit rate. A 30% hit rate is roughly 30% off the cacheable portion of your LLM bill.
Mistake 5. Using flagship models for every query
Most production systems route every request to the most expensive model available, regardless of how hard the request actually is. “What’s my account balance?” and “summarize this 50-page contract” hit the same frontier model, because that’s how the team wired it at launch and nobody went back.
The symptoms are organizational as much as technical: one model used for everything, an inference bill that climbs every time you add users without a matching revenue bump, and recurring conversations about switching providers because “the flagship model is too expensive”, when the real problem isn’t the model, it’s using the flagship for the 70% of queries that never needed it.
The fix is a routing layer. A small classifier (sometimes a genuine tiny model, often just a heuristic to start) decides which tier each query needs. The simplest version that works: route on token length plus the presence of a few keywords, send the short/simple bucket to GPT-4o-mini or Claude Haiku, and reserve the frontier model for what’s left. Most teams find a majority of queries downroute with no measurable quality loss.
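A first-pass heuristic router along those lines might look like the sketch below. The keyword list and the length cutoff are hypothetical starting values, and whitespace splitting is only a rough stand-in for a real tokenizer; tune all three against your own traffic.

```python
# Hypothetical starting values; tune against your own query logs.
COMPLEX_KEYWORDS = {"summarize", "analyze", "compare", "contract", "draft"}
MAX_SIMPLE_TOKENS = 50


def route(query: str) -> str:
    """Return 'small' or 'frontier' based on query length and keywords."""
    text = query.lower()
    approx_tokens = len(text.split())  # rough proxy for a token count
    if approx_tokens > MAX_SIMPLE_TOKENS:
        return "frontier"
    if any(keyword in text for keyword in COMPLEX_KEYWORDS):
        return "frontier"
    return "small"


print(route("What's my account balance?"))
print(route("Summarize this 50-page contract"))
```

Log which tier each query took and spot-check the small-tier answers before widening the simple bucket; a trained classifier can replace the heuristic later without changing the call sites.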
This matters more than which flagship model you picked in the first place. Teams spend weeks A/B testing one frontier model against another for the top tier while routing everything there, optimizing the price of the expensive option instead of how often they reach for it. And it compounds with Mistake 4: cache the duplicates, downroute the simple ones, and the flagship model only sees the queries that actually deserve it.
Rough cost impact: 3-10x on the downroutable share of traffic, depending on the mix of input and output sizes.
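To see what a per-query saving means for the whole bill, here is a small worked calculation. It assumes a fraction of current spend moves to a tier that is some multiple cheaper; the function name is hypothetical, and the inputs reuse the 70% and 3-10x figures from this section.

```python
def blended_cost_ratio(downroute_share: float, price_ratio: float) -> float:
    """New bill as a fraction of the old one.

    downroute_share: fraction of current spend that moves to the cheap tier (0..1).
    price_ratio: how many times cheaper the small tier is for that traffic.
    """
    return (1 - downroute_share) + downroute_share / price_ratio


for ratio in (3, 5, 10):
    print(f"70% downrouted at {ratio}x cheaper: "
          f"{blended_cost_ratio(0.7, ratio):.0%} of the original bill")
```

The share that stays on the frontier model sets the floor: with 70% downrouted, the bill cannot drop below 30% of the original no matter how cheap the small tier gets.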
Mistake 6. Keeping a big batch-access dataset in a managed transactional database
This one isn’t on the serving path at all, which is exactly why it hides. You’ve got a large dataset, say 300-500GB, sitting in Postgres or a managed equivalent, and it exists for one reason: training. It gets queried in full, a few times a week or a few times a month, when a training or eval run kicks off. The rest of the time it’s just sitting there, accruing managed-database pricing 24/7 for a workload that touches it for a few hours.
The tell is the mismatch between access pattern and storage class. A transactional DB charges you for what it’s good at (low-latency point reads, concurrent writes, indexes, transactions), and you’re using none of it. Your access pattern is “scan everything, occasionally.” Look for a large, slowly-changing table that no live request ever reads; a Postgres bill dominated by storage rather than query volume; and training jobs that pull the whole table anyway, so the indexes you’re paying to maintain do nothing.
The fix is to move that data off the transactional tier and onto cheap S3-compatible object storage in a columnar format (Parquet, or Delta Lake if you want versioning and schema evolution), and query it directly from there at training time. DuckDB or Polars will scan Parquet on S3 fast enough for batch work, with no always-on database to pay for. You’re swapping per-GB-month managed-DB pricing for object-storage pricing, which is typically an order of magnitude cheaper, and the columnar format usually makes the full scans faster than the row-store ever was.
And don’t overthink the read speed: S3 throughput is more than enough for 99.9% of batch training cases, because you’re streaming a large scan, not serving low-latency point reads. In the rare case you need a filesystem-like interface (a training pipeline that expects local paths, say), you can mount the bucket directly with something like s3fs-fuse and treat it like a local directory; that buys convenience rather than extra speed.
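    # No always-on DB: scan Parquet straight off object storage at train time.
    import duckdb

    # The httpfs extension provides s3:// access; S3 credentials are configured separately.
    duckdb.sql("INSTALL httpfs")
    duckdb.sql("LOAD httpfs")
    df = duckdb.sql("""
        SELECT features, label
        FROM 's3://training-data/dataset/*.parquet'
        WHERE split = 'train'
    """).df()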
Keep in transactional storage only what genuinely needs transactional access. A training corpus that’s read in bulk and rarely written is the textbook case for cold columnar storage, not a hot database.
Rough cost impact: up to ~99% on the storage tier. A dataset sitting in AWS RDS at $3,000/month becomes roughly $10/month of storage on S3. Request charges are billed separately, but for a dataset scanned a few times a week they are negligible next to keeping a transactional database running around the clock.
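A back-of-envelope check of the object-storage side, as a sketch. The per-unit prices below are assumptions based on S3 Standard list pricing in a US region at the time of writing; check the current price sheet for your region, and note that data transfer out of the region is not modeled here.

```python
# Assumed list prices in USD (S3 Standard, US region); verify before relying on them.
STORAGE_PER_GB_MONTH = 0.023
GET_PER_1K = 0.0004
PUT_PER_1K = 0.005


def s3_monthly_cost(gb_stored, gets_per_month=0, puts_per_month=0):
    """Storage plus request charges for one month, in USD."""
    storage = gb_stored * STORAGE_PER_GB_MONTH
    requests = (gets_per_month / 1000) * GET_PER_1K + (puts_per_month / 1000) * PUT_PER_1K
    return storage + requests


# A 400GB dataset, storage only, then with 100k GET requests a month for batch scans.
print(f"${s3_monthly_cost(400):.2f}")
print(f"${s3_monthly_cost(400, gets_per_month=100_000):.2f}")
```

Storage dominates at batch access rates; request charges only become a visible line item once call volume reaches the millions per day, and writes cost more per request than reads.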
The thing that actually kills the budget
Each of these is recoverable on its own. None of them, fixed alone, gets you all the way.
The reason inference bills explode is that most teams have three or four of these compounding at once. Picture a team on a managed vector DB with a high-churn workload, re-embedding everything on every update, running embeddings on GPU, with no caching, sending every query to a flagship model, and parking a half-terabyte training set in Postgres. That bill is 10-30x what it should be, and no single fix unwinds it. The waste multiplies, so the fix has to be unbundled one layer at a time.
If two or three of these sounded uncomfortably familiar, that’s normal: almost every AI startup I talk to has at least two. I run a cost audit through InferWorks: one week, fixed fee, fully refundable if I can’t identify at least 5x the fee in annual savings. Book a call at inferworks.io.