Most engineering teams building RAG pipelines start with Postgres and pgvector. Then they go looking for production-scale performance data and hit a wall. The benchmarks don’t match what they’re running. The confusion is nearly always the same: pgvector and pgvectorscale are two different things, and most teams treat them as one. That distinction matters enormously — Timescale/Tiger Data’s benchmarks show pgvectorscale delivering 28x lower p95 latency and 75% cost reduction versus Pinecone.
The broader Postgres consolidation thesis — doing more inside the database you already have rather than bolting on specialised tools — extends all the way down to vector search. But the case for Postgres as a production vector store rests on the full three-extension stack: pgvector, pgvectorscale, and pgai. Not just pgvector alone.
So let’s untangle what each extension actually does, look at the benchmark evidence honestly, and work out when this stack is the right call versus a dedicated vector database.
What is pgvector and what does it actually do inside Postgres?
pgvector is a PostgreSQL extension that adds a native vector data type and two approximate nearest-neighbour (ANN) index types: HNSW and IVFFlat. Install it, create a column of type vector(1536), and Postgres can store and query high-dimensional embeddings without a separate vector store.
The embeddings come from AI models like OpenAI’s text-embedding-3-small, which turns text into 1,536-dimensional numerical representations. Store those vectors in a pgvector column, build an HNSW index, and you can run similarity queries — cosine distance, L2 distance, inner product — entirely in SQL. Documents and embeddings live in the same table, queried in the same transaction.
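As a minimal sketch of that workflow (table and column names are illustrative; the `$1` parameter stands in for a query embedding produced by your model):

```sql
-- Enable pgvector and store documents alongside their embeddings
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)  -- matches text-embedding-3-small's dimensionality
);

-- Build an HNSW index for cosine-distance queries
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Top-5 nearest neighbours; <=> is pgvector's cosine-distance operator
SELECT id, content
FROM documents
ORDER BY embedding <=> $1  -- bind the query embedding here
LIMIT 5;
```

The `ORDER BY ... LIMIT` shape is what triggers the ANN index scan; the same query without the index falls back to an exact sequential scan.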
HNSW is the one to start with. Sub-20ms query times at 1 million vectors with 95%+ recall are achievable. IVFFlat builds faster, but you need to specify the cluster count upfront, and if you get that number wrong, recall suffers. Start with HNSW and don’t overthink it.
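The trade-off shows up directly in the index options. A sketch, assuming the `documents` table above; the values shown are pgvector's documented defaults and a common rule of thumb, not tuned recommendations:

```sql
-- HNSW: m controls graph connectivity, ef_construction the build-time search
-- width. These are pgvector's defaults; raise them for higher recall at the
-- cost of longer index builds.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- IVFFlat: lists must be chosen upfront. A common rule of thumb is
-- rows / 1000 (so ~1000 lists at 1M rows); misjudge it and recall suffers.
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000);

-- Query-time recall knobs, settable per session:
SET hnsw.ef_search = 100;  -- candidates examined per HNSW query
SET ivfflat.probes = 10;   -- clusters probed per IVFFlat query
```

Note that IVFFlat's `lists` is baked in at build time, while HNSW's recall can be tuned after the fact via `hnsw.ef_search` — another reason HNSW is the safer default.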
pgvector is supported across managed Postgres providers including Neon, Supabase, AWS RDS, and Tiger Data. The catch is that HNSW stores its entire graph in memory. At millions of vectors, that’s fine. Push toward tens of millions and beyond, and the RAM requirements become your constraint. This is where teams incorrectly conclude Postgres can’t do vector search at scale. The correct answer: pgvector alone can’t. But the full stack can.
Why does pgvectorscale exist and what does DiskANN change about vector search at scale?
pgvectorscale is a separate open-source extension from Timescale (Tiger Data) that adds the StreamingDiskANN index type on top of pgvector. The key phrase is on top of — pgvectorscale requires pgvector and extends it. They’re not competing extensions. They layer.
The problem pgvectorscale solves is HNSW’s memory architecture. HNSW keeps its graph in RAM, so as your dataset grows, the memory footprint grows with it. At large scale, you’re buying expensive RAM to hold a graph. DiskANN, originally from Microsoft Research, takes a different approach: a disk-resident graph that reads from NVMe SSDs with intelligent prefetching. Memory footprint stays bounded regardless of dataset size, and the performance hit versus RAM is small on any modern cloud instance.
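Switching to the disk-resident index is a small change at the SQL level. A sketch, reusing the illustrative `documents` table from earlier; check the pgvectorscale docs for the index's tuning parameters:

```sql
-- vectorscale layers on pgvector; CASCADE installs pgvector if it's missing
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- Replace the in-memory HNSW graph with a disk-resident StreamingDiskANN graph
CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops);

-- Queries are unchanged: same <=> operator, same SQL as with HNSW
SELECT id, content FROM documents ORDER BY embedding <=> $1 LIMIT 5;
```

The application code doesn't change at all — only the index type does, which is what makes the migration from HNSW to DiskANN low-risk.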
The Tiger Data team’s “It’s 2026, Just Use Postgres” thesis is built on pgvectorscale plus pgai, not pgvector alone. When you see Timescale claiming Postgres can handle production vector workloads, they’re talking about the full stack. That’s the distinction worth keeping front of mind.
What do the pgvectorscale vs Pinecone benchmarks actually show?
The headline figures from Timescale/Tiger Data’s benchmark: on a dataset of 50 million Cohere embeddings (768 dimensions), pgvectorscale delivered 28x lower p95 latency and 16x higher query throughput than Pinecone’s storage-optimised index, at 75% less cost. At 471 queries per second, pgvectorscale achieved p95 latency of 28ms against Pinecone’s 784ms.
The caveat: these are vendor-produced benchmarks. Timescale ran them, Timescale published them. No independent third party has replicated these results at equivalent scale. The methodology and infrastructure configuration are published alongside the numbers, and it’s worth reading that detail before you present any of this to your own team.

That said, p95 latency is the right metric to care about for RAG, not averages. A 784ms p95 means one in twenty user interactions is waiting nearly a second for context retrieval. That’s the number that shows up in production complaints.
On cost: Pinecone’s standard tier starts at $70/month. Enterprise pricing for Dedicated Read Nodes (DRN) — Pinecone’s purpose-built architecture for sustained high read volumes, with exclusive compute and local SSD storage — uses per-node hourly pricing. That’s what pgvectorscale’s benchmarks are competing against. The cost advantage for Postgres emerges at 50 million vectors and above. At smaller scale the picture reverses. For the full breakdown, see the pgvectorscale vs Pinecone TCO article.
How does pgai eliminate the embedding sync pipeline?
pgai is the third extension in the stack. Call create_vectorizer() and pgai registers a background worker that watches your source table. Any INSERT triggers an embedding API call — OpenAI, Cohere, or other supported providers — and writes the resulting vector directly into a linked embedding table. Updates regenerate the embedding. Deletes remove it.
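A sketch of what that registration looks like, roughly following the shape of Timescale's pgai documentation — the table, destination, and chunking choices here are illustrative, and argument names may differ across pgai versions:

```sql
CREATE EXTENSION IF NOT EXISTS ai CASCADE;

-- Register a vectorizer: pgai watches public.documents and keeps a linked
-- embedding table current as rows are inserted, updated, or deleted.
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    destination => 'documents_embeddings',
    embedding   => ai.embedding_openai('text-embedding-3-small', 1536),
    chunking    => ai.chunking_recursive_character_text_splitter('content')
);
```

From this point on, embedding freshness is the vectorizer's problem, not your application's.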
The traditional alternative looks like this: Postgres source table → Kafka → Debezium connector → custom embedding worker → vector store. That’s four components with their own deployments, failure modes, and monitoring requirements. Stale embeddings are a real RAG quality problem — when the vector store lags behind the source database, your retrieval quality degrades — and that multi-component pipeline is exactly where the lag originates.
pgai replaces all of that with a one-time schema decision. On Tiger Data’s managed platform the vectorizer runs as a managed service. On self-hosted Postgres, a separate Python CLI process handles the embedding queue. Either way, the sync infrastructure disappears as a recurring operational concern.
How do you build a RAG pipeline on Postgres without a separate vector store?
A Postgres-native RAG pipeline has four stages, all inside a single database. Data ingests into source tables. pgai’s vectorizer watches those tables and calls the embedding API on insert or update, writing vectors to a linked table. pgvectorscale indexes those vectors using StreamingDiskANN. Then retrieval runs hybrid search — BM25 keyword ranking plus vector similarity, merged via Reciprocal Rank Fusion (RRF) — and returns a ranked list of context chunks to pass to your LLM.
That retrieval step is important to get right. Pure vector search misses keyword-exact queries: product codes, identifiers, proper names. Pure BM25 misses semantic paraphrases. Hybrid search covers both failure modes, which is why it’s the recommended approach for production RAG. In Postgres, BM25 is available via pg_textsearch, vector search via pgvectorscale, and the RRF merge runs in pure SQL. No external dependencies. The full hybrid search implementation is covered in the BM25 hybrid search in Postgres article.
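An RRF merge can be sketched in plain SQL. This uses Postgres's built-in full-text search (`ts_rank`) as a stand-in for the BM25 ranker — pg_textsearch's actual API may differ — with the common RRF constant k = 60; `$1` is the query embedding and `$2` the query text:

```sql
WITH vector_hits AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM documents
    ORDER BY embedding <=> $1
    LIMIT 20
),
keyword_hits AS (
    SELECT id, ROW_NUMBER() OVER (
        ORDER BY ts_rank(to_tsvector('english', content),
                         plainto_tsquery('english', $2)) DESC) AS rank
    FROM documents
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
    LIMIT 20
)
-- RRF: each result contributes 1 / (k + rank) per list it appears in
SELECT id,
       COALESCE(1.0 / (60 + v.rank), 0)
     + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
FROM vector_hits v
FULL OUTER JOIN keyword_hits k USING (id)
ORDER BY rrf_score DESC
LIMIT 5;
```

The `FULL OUTER JOIN` is the key move: a document ranked by only one of the two retrievers still scores, just lower than one both retrievers agree on.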
The operational advantage is pretty straightforward. All your data — source tables and embedding vectors — lives in one transactional system. Backups cover both in a single operation. One connection pool, one monitoring dashboard. Running a separate vector store means another deployment, another failure point, and a sync dependency that will eventually cause an incident. This same stack also underpins AI agent memory patterns — persistent agent context and branching workflows built entirely on Postgres.
When does Postgres still need a dedicated vector database?
Postgres with pgvectorscale is production-ready for most RAG and semantic search workloads at moderate scale — up to hundreds of millions of vectors with reasonable read concurrency. The boundary is at true billion-vector scale combined with extreme concurrent read requirements.
At that scale, Pinecone’s DRN architecture — purpose-built for read amplification with exclusive compute resources — may retain a performance edge that justifies the cost premium. Pinecone handles traffic spikes from 100 to 10,000 QPS without manual configuration; pgvectorscale’s scale-up model will hit hardware limits before Pinecone’s auto-scaling does.
There are other signals that warrant a dedicated vector database regardless of scale: compliance or data isolation requirements, teams without the DevOps capacity to tune Postgres, or multi-tenancy isolation at very large scale.
The practical rule of thumb: if your dataset is describable in “millions of documents” rather than “billions of vectors,” and your team would otherwise be running two separate systems, the argument for Postgres is solid. Build on Postgres. Break out a dedicated vector database only when measured performance forces your hand.
For a detailed cost analysis, see the pgvectorscale vs Pinecone TCO article. If you’re evaluating Postgres as your broader data layer, the database consolidation trend covers the full picture.
FAQ
What is the difference between pgvector and pgvectorscale?
pgvector is the base PostgreSQL extension: it adds the vector data type and HNSW/IVFFlat indexes — solid and reliable for millions of vectors. pgvectorscale is a separate Timescale extension that builds on top of pgvector (not replacing it) and adds the StreamingDiskANN index type for high-recall vector search at large scale with bounded memory usage. You need both installed together.
What is DiskANN and how is it different from HNSW?
HNSW stores its graph in RAM — memory grows with dataset size, which gets expensive at scale. DiskANN (from Microsoft Research) stores its graph on disk with intelligent prefetching, keeping memory bounded regardless of dataset size. pgvectorscale implements StreamingDiskANN to make this practical inside Postgres.
Can Postgres replace Pinecone for a RAG pipeline?
For most teams: yes. The combination of pgvector, pgvectorscale, and pgai makes Postgres a production vector stack for datasets in the range of millions to low hundreds of millions of vectors. Pinecone retains an edge at true billion-vector scale with extreme read concurrency. The more useful question is whether your workload actually requires that scale — chances are it doesn’t.
How accurate are the pgvectorscale vs Pinecone benchmarks?
The 28x lower p95 latency and 75% cost reduction figures are vendor-produced — Timescale ran them on 50 million Cohere embeddings at 99% recall, not independently verified. They’re directionally credible, but treat them as a starting point for your own evaluation, not a verdict.
What is pgai and do I need it?
pgai automates embedding generation and sync within Postgres via create_vectorizer(). You need it if you want to avoid building a separate ETL pipeline — Kafka, Debezium, custom embedding workers — to keep vectors current. On Tiger Data managed cloud it’s a managed service; on self-hosted Postgres, a Python CLI process runs the queue. Either way, it replaces custom sync logic with a managed background process.
What is hybrid search and why does it matter for RAG?
Hybrid search combines BM25 keyword ranking with vector similarity, merged via Reciprocal Rank Fusion (RRF). Pure vector search misses keyword-exact queries; pure BM25 misses semantic paraphrases. Hybrid covers both — in Postgres via pg_textsearch plus pgvectorscale plus RRF in pure SQL, no external dependencies.
Is pgvector suitable for production at scale?
pgvector alone handles many real workloads — sub-20ms at 1 million vectors with 95%+ recall. At tens of millions and beyond, HNSW memory becomes the constraint. pgvectorscale extends production viability to hundreds of millions by replacing HNSW with StreamingDiskANN.
Does pgai only work with OpenAI?
No. pgai supports multiple embedding API providers. OpenAI’s text-embedding-3-small is the reference example in Timescale documentation — 1,536 dimensions, strong multilingual performance — but the create_vectorizer() pattern is provider-agnostic.
What is the Tiger Data platform?
Tiger Data is Timescale’s managed cloud Postgres platform (rebranded from Timescale Cloud). It ships pgvectorscale, pgai, and pg_textsearch as first-class extensions with a BYOC deployment model. Timescale is the primary contributor to the pgvectorscale and pgai open-source projects.
Where can I find the pgvectorscale vs Pinecone benchmark source?
The benchmark data is on the pgvectorscale GitHub repository and the Timescale/Tiger Data engineering blog, including methodology and infrastructure configuration. For a cost analysis, see the pgvectorscale vs Pinecone TCO article in this cluster.
What embedding model should I use with pgvector?
OpenAI’s text-embedding-3-small is the most common starting point: 1,536 dimensions, strong multilingual performance, well-documented pgai integration. Higher-dimensional models increase storage and slow index builds — for most RAG applications, text-embedding-3-small is the right balance of quality and cost.
When should I use pgvector HNSW vs IVFFlat?
HNSW is the default: higher recall, faster queries after index build, no cluster count to specify. IVFFlat builds faster but requires specifying the number of lists — misconfigure it and recall degrades. Start with HNSW; move to DiskANN (pgvectorscale) when HNSW memory requirements become prohibitive.