Building Full-Text Search in Postgres With AI

Most teams reach for Elasticsearch the moment someone says “we need search,” and most of them did not need to. Postgres has had real full-text search since 8.3, and for the common case of searching a few million rows of documents, articles, or product descriptions, it is fast, transactional, and one less system to operate. The trick is knowing which primitives to wire together and where the sharp edges are. AI is genuinely good at drafting that wiring, but it will happily hand you a LIKE '%term%' query and call it search, so you have to verify everything it produces.

This guide walks through a search implementation I would actually ship: a generated tsvector column, a GIN index, ranked results with ts_rank, and pg_trgm for typo tolerance. Then I’ll be honest about when Postgres stops being the right tool.

The two halves: tsvector and tsquery

Postgres full-text search has exactly two data types you need to understand. A tsvector is a normalized, lexeme-tokenized representation of a document. A tsquery is a normalized representation of a search expression. You match them with the @@ operator.

SELECT to_tsvector('english', 'The fox jumps over the lazy dogs');
-- 'dog':7 'fox':2 'jump':3 'lazi':6

SELECT to_tsvector('english', 'The fox jumps over the lazy dogs')
       @@ to_tsquery('english', 'jumping & dog');
-- t

Notice what the english text search configuration did. It dropped stop words (the, over), lowercased everything, and ran each token through a stemmer so jumps became jump and dogs became dog. That stemming is why the query for jumping & dog matches a document containing jumps and dogs. This is the entire reason you use tsvector instead of LIKE: the matching happens on normalized lexemes, not raw substrings.

The configuration argument matters. english stems and removes English stop words. simple does neither, which you want for things like usernames or SKUs where stemming corrupts the term. Pick deliberately; do not let it default silently.

A generated tsvector column plus a GIN index

You can compute to_tsvector at query time, but that defeats indexing. The pattern that scales is a stored generated column indexed with GIN. GIN (Generalized Inverted Index) is purpose-built for “which rows contain this lexeme” lookups.

CREATE TABLE articles (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title       text NOT NULL,
    body        text NOT NULL,
    search_doc  tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body,  '')), 'B')
    ) STORED
);

CREATE INDEX articles_search_idx ON articles USING GIN (search_doc);

Two things to call out. First, setweight tags title lexemes as weight A and body lexemes as weight B. That weighting feeds ranking later, so a hit in the title counts for more than a hit buried in the body. Second, the coalesce guards against NULL, because concatenating a NULL tsvector yields NULL and silently drops the document from your index.

The generated column means Postgres recomputes search_doc automatically on every insert and update. No triggers, no application code, no drift between the document and its index. This is the single biggest reason to prefer the generated-column approach over the older trigger-based pattern you will still find in a lot of blog posts and, frankly, in a lot of AI output.

I asked an assistant to scaffold this schema and the first draft was instructive about what to watch for.

Prompt: Design a Postgres full-text search schema for an articles table with title and body, weighting title higher, indexed for fast search.

Output (excerpt): “…add a trigger BEFORE INSERT OR UPDATE that calls tsvector_update_trigger(search_doc, 'pg_catalog.english', title, body) to keep the column current…”

That works, but it is the pre-12 approach. tsvector_update_trigger hardcodes the configuration and cannot express per-column weights, so you lose the setweight distinction. The generated-column version is simpler and strictly more capable. This is the verify-the-AI loop in miniature: the suggestion was not wrong, it was just dated. If you want a structured starting point for this kind of schema work, the Postgres full-text search design prompt bakes in the generated-column and weighting decisions so you do not have to relitigate them every time.

Querying and ranking

Matching tells you which rows qualify. Ranking tells you which order to show them in. ts_rank and ts_rank_cd score a tsvector against a tsquery, and they respect the weights you set earlier.

SELECT id, title,
       ts_rank(search_doc, query) AS rank
FROM   articles,
       websearch_to_tsquery('english', 'postgres replication lag') AS query
WHERE  search_doc @@ query
ORDER  BY rank DESC
LIMIT  20;

Use websearch_to_tsquery for anything user-facing. Unlike to_tsquery, it accepts the syntax people actually type: bare words, quoted "exact phrases", and -excluded terms, without throwing a syntax error on a stray character. to_tsquery is strict and will reject postgres replication because of the missing operator; websearch_to_tsquery just treats the space as AND.

For result snippets with the matched terms bolded, ts_headline does the highlighting:

SELECT ts_headline('english', body,
                   websearch_to_tsquery('english', 'wal archiving'),
                   'StartSel=<b>, StopSel=</b>, MaxWords=35')
FROM   articles
WHERE  id = 42;

ts_headline is expensive because it re-parses the original text, so compute it only for the page of results you actually display, never across the whole result set before the LIMIT.

Typo tolerance with pg_trgm

Full-text search matches lexemes, which means it does not handle misspellings. Search for postgers and you get nothing. For fuzzy matching you want pg_trgm, which breaks strings into three-character sequences and measures similarity between them.

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX articles_title_trgm_idx
    ON articles USING GIN (title gin_trgm_ops);

SELECT id, title, similarity(title, 'postgers tuning') AS sim
FROM   articles
WHERE  title % 'postgers tuning'
ORDER  BY sim DESC
LIMIT  10;

The % operator returns rows above the similarity threshold (pg_trgm.similarity_threshold, default 0.3). The GIN trigram index makes this index-backed instead of a full scan. A common production pattern is to run the tsvector search first and fall back to a trigram search only when it returns nothing, giving you precise matching for correct spellings and graceful degradation for typos. AI is useful here for generating the fallback query and the threshold tuning, but confirm the index actually gets used with EXPLAIN (ANALYZE, BUFFERS) rather than trusting that it does.

When to stay in Postgres, and when not to

Stay in Postgres when your corpus fits comfortably in memory-backed indexes, your search is bounded by a tenant or a category filter, and you want search results inside the same transaction as your writes. That covers an enormous number of applications. You avoid a second datastore, eventual-consistency reconciliation, and the operational burden of a search cluster.

Reach for a dedicated engine like Elasticsearch or OpenSearch when you need cross-language analyzers Postgres does not ship, relevance tuning more sophisticated than ts_rank weights, faceted aggregations over hundreds of millions of documents, or per-field analyzers and synonyms managed by a non-DBA team. The honest failure mode of Postgres search is relevance tuning: ts_rank is coarse, and once product wants BM25-style scoring with synonyms and boosting, you are reimplementing a search engine inside SQL.

A useful rule of thumb: if your search needs are “find the documents containing these words, newest first, scoped to this account,” Postgres wins on simplicity. If search relevance is a product feature you will iterate on weekly, budget for a real search engine. The same discipline applies to any Postgres performance decision, which is why the Postgres category collects the indexing and tuning playbooks alongside this one.

Verifying what the AI gives you

The pattern that has held up for me: let the assistant draft the schema, the query, and the index, then run three checks. Confirm the configuration argument is correct for your data (English stemming on a SKU column is a bug). Run EXPLAIN (ANALYZE, BUFFERS) and verify the GIN index is actually scanned, not bypassed for a sequential scan. And test the ranking against real queries, because a relevance order that looks reasonable in a code review can be useless in practice.

AI accelerates the drafting and saves you from memorizing the exact setweight and ts_headline argument syntax. It does not replace measuring. Build the schema, index it, rank it, add trigram fallback, and let the query planner tell you the truth.