2026 Archive

PostgreSQL Weekly posts for 2026.

Weeks

Week 13 (2026)

PostgreSQL mailing list discussions for Week 13, 2026.

Articles

UUID and Base32hex Encoding

Introduction

Compact, copy-paste-friendly representations of UUIDs come up in URLs, logs, JSON payloads, and anything humans have to read aloud. A pgsql-hackers thread starting in October 2025 opened with advocacy for two new builtins, uuid_to_base32hex() and base32hex_to_uuid(), which encode UUIDs as 26-character RFC 4648 Section 7 base32hex strings. What followed was less about the numeric format (everyone agreed it is standardized and useful) and more about where that behavior should live in PostgreSQL’s SQL surface: separate UUID-specific functions, or the existing encode() / decode() pipeline plus explicit type conversions.

Why This Matters

PostgreSQL already stores UUIDs efficiently as a dedicated type and ships uuidv7() and friends. Text interchange formats still matter for APIs and cross-system contracts. Base32hex preserves lexicographic sort order of the underlying bytes (unlike typical base64), stays case-insensitive for decoding, and avoids ambiguous oral spelling of mixed-case hex. The thread connects that format to real-world precedent (DNSSEC encoders, JSON-heavy pipelines) and to RFC 9562 positioning: canonical hyphenated UUID strings for compatibility, binary storage in databases, and a desire—not always fully standardized in the RFC text—for one compact text encoding to reduce ecosystem fragmentation.
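
The sort-order point is easy to demonstrate with the base64 encoder already in core: the base64 alphabet is not in ASCII order, so encoded strings can compare differently than the underlying bytes. Base32hex's alphabet (0–9 then A–V) is in ASCII order, which is why it avoids this problem:

```sql
-- Stock PostgreSQL: base64 can invert byte order under the C collation.
SELECT '\x68'::bytea < '\xd0'::bytea AS byte_order,             -- true
       encode('\x68'::bytea, 'base64')
         < encode('\xd0'::bytea, 'base64') COLLATE "C"
         AS base64_order;                                       -- false
```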

Technical Analysis

Original proposal (conceptual)

The opening pitch (preserved in the thread as hi-hackers.txt) described:

  • uuid_to_base32hex(uuid) → text: 26 uppercase base32hex characters, no hyphens, no padding; two zero bits appended so 128 bits map cleanly to base32’s 5-bit alphabet.
  • base32hex_to_uuid(text) → uuid: case-insensitive decode; invalid input yields NULL.

Sergey Prokhorenko also argued against alternatives such as base36 (performance) and Crockford Base32 (weaker presence in standard libraries), positioning base32hex as the practical compact choice.

Community direction: compose, do not duplicate

Aleksander Alekseev’s first reply mixed process advice (avoid mass Cc: lists; use git format-patch; register on Commitfest) with API feedback: standalone UUID helpers are not composable. Prefer:

  1. Explicit uuid ↔ bytea casts (or conversions).
  2. Extending encode(bytea, ...) and decode(text, ...) with a base32hex format—so usage looks like:
SELECT encode(uuidv7()::bytea, 'base32hex');

(The early mail said 'base32'; the implemented format name in later patches is base32hex, matching RFC 4648’s “base32hex” alphabet.)

Andrey Borodin and Jelte Fennema-Nio agreed that extending encode() matches existing practice; Jelte pointed to base64url support added in PostgreSQL 18 (commit e1d917182) as a precedent for adding encodings to the same entry points.
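
That precedent is visible on a PostgreSQL 18 or later build (a quick illustration; exact padding behavior is whatever the committed encoder chose):

```sql
-- base64url (PostgreSQL 18, commit e1d917182) substitutes '-' and '_'
-- for the URL-unsafe '+' and '/' of plain base64.
SELECT encode('\xfbef'::bytea, 'base64')    AS plain,    -- contains '+'
       encode('\xfbef'::bytea, 'base64url') AS url_safe; -- contains '-'
```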

What landed in the patch series (high level)

The downloadable series (revisions through v12 in the thread attachments) converges on:

  • encode(data bytea, 'base32hex') and decode(text, 'base32hex') implemented in encode.c, with RFC 4648 padding on encode and tolerant decode (padded or unpadded; case-insensitive; whitespace ignored—see patch headers for exact rules).
  • Documentation under func-binarystring.sgml describing base32hex and explicitly recommending:
rtrim(encode(uuid_value::bytea, 'base32hex'), '=')

for a 26-character compact UUID string versus 36 characters in canonical hex-with-hyphens form.

  • Companion work: explicit casting between uuid and bytea, so the encode() path does not require ad-hoc UUID-only builtins.

SQL examples (illustrative)

The snippets below match the API shape discussed on the list (encode / decode with 'base32hex' and uuid::bytea). They assume a PostgreSQL build that includes that support—the thread’s patches were still under review when this post was written.

1. Raw RFC output (with padding): encode() emits '=' padding; for a 16-byte UUID the string is longer than 26 characters until trimmed:

SELECT encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex') AS padded;
-- 32 characters (RFC 4648 padding to a multiple of 8; ends with '=')

2. Compact 26-character form — strip padding for URLs and logs (same example UUID as in the original thread):

SELECT rtrim(
         encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
         '='
       ) AS short_id;
-- 06AJBM9TUTSVND36VA87V8BVJO

3. Round trip: decode() yields bytea; cast back to uuid:

WITH x AS (
  SELECT rtrim(
           encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
           '='
         ) AS short_id
)
SELECT short_id,
       decode(short_id, 'base32hex')::uuid AS back_to_uuid
FROM x;
-- back_to_uuid = 019535d9-3df7-79fb-b466-fa907fa17f9e

4. New IDs with uuidv7() (when you want time-ordered values and a short external form):

SELECT rtrim(encode(uuidv7()::bytea, 'base32hex'), '=') AS short_new_id;

Patch evolution

Early revisions explored UUID-specific uuid_encode / uuid_decode style APIs; later revisions fold behavior into encode/decode, rebase cast support, add regression tests, and iterate on documentation—including notes on collation and sortability of encoded text (additional small doc patches appear in the attachment list).

Community Insights

  • Process and visibility: Aleksander noted Sergey was not subscribed to the list, so the original message did not reach all readers; subscribing and attaching the full proposal helped restore context.

  • Why base32hex: Sergey summarized advantages—sort preservation, compactness, wide library support, easy dictation without case ambiguity, and simple codecs for JSON-centric systems. He also framed standardization pressure: many ad-hoc “short UUID” schemes risk incompatibility; RFC 9562 authors reportedly favored a single compact encoding, even if the full compact-text story did not ship inside that RFC’s timeline.

  • API surface area: Masahiko Sawada argued that naming another uuid_* encoding beside encode() invites a proliferation of one-off functions; +1 for encode/decode + uuid/bytea conversion, with negligible cost for the cast.

  • Design tension—polymorphism vs. casts: Sergey asked whether encode() could take uuid directly and whether decode() could return uuid without going through bytea. Masahiko explained PostgreSQL cannot overload decode(text, text) with two different result types for the same signature; casts plus inline SQL wrappers are the idiomatic compromise. He gave an inlineable example:

CREATE FUNCTION uuid_to_base32(u uuid) RETURNS text
LANGUAGE SQL IMMUTABLE STRICT
BEGIN ATOMIC
  SELECT encode($1::bytea, 'base32hex');
END;

and noted the runtime difference versus calling encode(...) with an explicit cast is essentially a small conversion cost.

  • Sergey’s “one short format” concern: He warned against a world where different systems pick Crockford base32, base36, unsorted base64, etc. Masahiko countered that developers still choose encodings when integrating heterogeneous stacks—e.g., hex when every component supports it. The thread’s engineering answer is still to expose one standard base32hex in core and document the 26-character UUID recipe.

Technical Details

  • Padding: RFC-style encode() output includes = padding to a multiple of 8 characters; trimming yields the compact 26-character UUID form used in the original proposal.
  • Sort order: Base32hex preserves byte order for lexicographic comparisons; documentation and follow-up patches discuss collation effects when comparing encoded text (binary UUID comparisons remain the authoritative ordering).
  • Integration: Placing the codec in encode.c keeps binary encoding formats in one place and mirrors how base64url was added.
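
Assuming a build with the thread's patches (base32hex in encode()/decode() plus the uuid ↔ bytea cast), the sort-preservation property can be checked directly; this is a sketch, not committed behavior:

```sql
-- Hypothetical: requires the thread's base32hex and uuid<->bytea cast patches.
WITH pair(a, b) AS (
  VALUES ('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid,
          '01953600-0000-7000-8000-000000000000'::uuid)
)
SELECT a < b AS uuid_order,
       encode(a::bytea, 'base32hex')
         < encode(b::bytea, 'base32hex') COLLATE "C" AS text_order
FROM pair;  -- the two columns should agree for any pair of UUIDs
```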

Conclusion

Base32hex is a small feature on paper, but the thread is a clear example of PostgreSQL’s preference for orthogonal primitives: typed storage (uuid), explicit binary views (bytea), and shared text encodings (encode/decode). If you need compact, sort-friendly UUID text, the emerging pattern is encode(uuid::bytea, 'base32hex') with optional rtrim(..., '='), not a parallel family of UUID-only functions—unless you wrap that one-liner for your own schema’s ergonomics.

JSONPath String Methods: Cleaning JSON Inside the Path—and a Long Debate About Immutability

Introduction

Working with messy JSON often means trimming, case-folding, and splitting strings before comparisons. Today you can do that in SQL around jsonb_path_query() and friends, but not always inside the JSONPath expression itself. Florents Tselai posted a patch series to pgsql-hackers that adds familiar string helpers—lower(), upper(), initcap(), ltrim() / rtrim() / btrim(), replace(), and split_part()—as JSONPath methods, delegating to PostgreSQL’s built-in string functions.

The thread quickly grew into a broader discussion: how these operations interact with volatility and immutability (locale-dependent behavior), whether PostgreSQL’s JSONPath should track the SQL standard or Internet RFCs, and what to do about the naming wart around the existing *_tz JSONPath entry points. The work is registered on Commitfest 5270; the mailing list thread spans 2024 through 2025 with many patch revisions.

Why This Matters

JSONPath is the embedded language used by jsonb_path_* routines. Adding string primitives in-path reduces nested SQL, keeps intent local to the path, and matches how people already think about Unix-style text pipelines—only applied to JSON scalars. For data cleaning (whitespace, casing, delimiter splitting), the feature is ergonomically strong.

The catch is that case mapping and many string operations depend on locale. PostgreSQL’s planner relies on correct volatility labels: marking something immutable when it can change with locale or OS updates breaks assumptions. That tension is what drives most of the thread after the initial patch post.

Technical Analysis

What the patch adds

The author’s first revision introduces methods that forward to pg_proc-registered implementations, extends JsonPathParseItem with method_args (arg0, arg1) for methods that need more than the usual left/right operand pattern, and adds jspGetArgX() accessors. A README.jsonpath file documents how to add new methods for future contributors. Example shapes from the proposal:

SELECT jsonb_path_query('" hElLo WorlD "', '$.btrim().lower().upper().lower().replace("hello","bye") starts with "bye"');
SELECT jsonb_path_query('"abc~@~def~@~ghi"', '$.split_part("~@~", 2)');

Open items in the first post included: possible name collisions if the SQL/JSON standard later defines methods with the same names (prefixes such as pg_ were mentioned in an earlier related thread), default collation behavior (consistent with existing JSONPath code), missing user-facing documentation, and a half-formed idea of CREATE JSONPATH FUNCTION-style extensibility.

Patch evolution (high level)

The fetchable series runs from v1 through v18, with v6 onward split into two patches per version: one renaming JSONPath method-argument tokens, the other carrying the string-method implementation. Intermediate revisions iterate on tests, grammar, and rebases against jsonpath_scan.l and related files. Later revisions add substantial SGML documentation under doc/src/sgml/func/func-json.sgml and regression tests for jsonpath, jsonb_jsonpath, and related SQL.

Community Insights

  • Tom Lane raised immutability early: string methods tied to locale mean JSONPath operations are not universally immutable, pointing to commit cb599b9dd and contrasting with the _tz split used for time-zone-sensitive datetime behavior in JSONPath—a pattern he argued does not scale cleanly to many sources of mutability.

  • Alexander Korotkov floated “flexible mutability”: a companion could analyze whether the JSONPath argument is constant and whether all methods used are “safe,” so jsonb_path_query() might be labeled immutable in restricted cases and stable otherwise. He also asked whether the new names appear in the SQL standard (or a draft).

  • Florents Tselai noted prior discussion from 2019, sketched a heuristic (if all elements are safe, treat the path as immutable), and cited RFC 9535 “function extensions” as covering vendor additions—while acknowledging mutability as an implementation concern.

  • David E. Wheeler clarified that PostgreSQL’s JSONPath follows the SQL/JSON track in the SQL standard, not RFC 9535; extension facilities in the public RFC are not automatically the same as SQL’s rules. He remained interested in extensibility hooks (lexer, parser, executor) and asked what hooks would look like; Florents outlined steps: new JsonPathItemType, lexer/parser changes in jsonpath_scan.l / jsonpath_gram.y, and a dispatch hook from executeItemOptUnwrapTarget.

  • Robert Haas reframed the issue as partly a general policy gap (functions that depend on OS time or locale are often really stable, not immutable), compared it to the existing jsonb_path_exists vs jsonb_path_exists_tz split, and suggested extending the “tz” family to accept locale-dependent operations while erroring in the non-suffixed functions, admitting that the _tz name becomes misleading.

  • David Wheeler read Tom’s “doesn’t scale” comment as “we will not add json_path_exists_foo for every mutable concern,” and wondered about renaming or generalizing. After discussion at a PostgreSQL community event, Florents summarized a pragmatic path: add the new behavior under the jsonb_path_*_tz family, reject it in non-_tz variants, and document clearly—keeping naming aligned with existing _tz functions despite the semantic stretch.

  • Naming clutter: David suggested introducing _stable-style names; Robert outlined a heavy deprecation/GUC path and predicted complaints regardless. Tom Lane saw little value in elaborate migration machinery, noted wrapper functions preserve old names for stubborn apps, and asked what is wrong with new names alongside old ones without removal. Robert accepted extra clutter if consensus prefers it. Florents noted a third parallel set of five jsonb_path_* functions would add API surface. David questioned indexability of _tz usage in the wild; the thread touches on generated columns and future virtual generated columns as where timestamps from JSON often land.

Technical Details

Immutability and locale

PostgreSQL distinguishes immutable (the same inputs always produce the same output, so results may be frozen into plans and expression indexes) from stable (the result is fixed within a single statement but may change between statements, e.g., with session settings or the environment). Locale definitions can change when the OS or ICU data is updated, so functions that depend on them are typically not immutable. JSONPath integration must respect the same rules as core SQL functions if paths are embedded in expression evaluation and optimization.
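
The planner stakes are concrete: expression indexes accept only IMMUTABLE functions, so today's labeling of the jsonb_path_* family matters for indexing JSON paths. This works on a stock server, no patch required:

```sql
CREATE TABLE docs (body jsonb);

-- Allowed today because jsonb_path_query_first() is labeled IMMUTABLE;
-- a path language with locale-dependent string methods could no longer
-- make that promise for every path expression.
CREATE INDEX docs_name_idx
    ON docs ((jsonb_path_query_first(body, '$.name')));
```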

Standards positioning

The thread distinguishes:

  • RFC 9535 (JSONPath, IETF): public, documents extension points.
  • SQL/JSON and SQL-standard JSONPath (what PostgreSQL targets): not fully public; extension and naming rules may differ.

Vendor-specific method names may still need care to avoid future standard collisions—prefixing or naming conventions remain an open design concern from the first post.

Implementation surface

Beyond volatility, adding methods touches the JSONPath lexer, grammar, executor, and tests. Splitting patches (rename tokens vs feature body) keeps review manageable once the series grows.

Current Status

The patch series remained under active development through multiple rebases (through v18 in the downloaded artifacts). It is listed on Commitfest 5270, with a GitHub branch/PR for readers who prefer that view. The mailing list discussion in May 2025 converged on practical next steps (surface under *_tz, strict errors elsewhere, documentation) and naming trade-offs, without a final commit message in this thread snapshot.

Conclusion

JSONPath string methods address a real ergonomics gap: in-path normalization and splitting of JSON string data. The community response centers less on the feature’s utility than on correct volatility, alignment with SQL/JSON, and API sprawl around jsonb_path_* and the _tz suffix. The thread is a useful snapshot of how PostgreSQL balances standards compliance, planner correctness, and long-term maintainability when extending a DSL that lives inside core.

Week 12 (2026)

PostgreSQL mailing list discussions for Week 12, 2026.

Articles

Reduce Planning Time for Large NOT IN Lists Containing NULL

Introduction

When a query uses x NOT IN (NULL, ...) or x <> ALL (NULL, ...), the result is always empty—no rows can match. In SQL, the presence of a single NULL in a NOT IN list makes the entire predicate evaluate to NULL or false for every row, so the selectivity is 0.0. Yet the PostgreSQL planner was still iterating over every element in the list and invoking the operator's selectivity estimator for each one, wasting planning time on large lists.

Ilia Evdokimov (Tantor Labs) proposed a simple optimization: detect when the array contains NULL and short-circuit the selectivity loop for the <> ALL / NOT IN case, returning 0.0 immediately. The patch went through several rounds of review, uncovered a subtle regression involving arrays that once contained NULL but no longer do, and was committed by David Rowley in March 2026.

Why This Matters

Queries with large NOT IN or <> ALL lists are common in reporting and ETL workloads. When such a list includes NULL—whether intentionally or from a subquery—the planner was doing unnecessary work:

  • For constant arrays: deconstructing the array, iterating over each element, and calling the operator's selectivity function.
  • For non-constant expressions: iterating over list elements to check for NULL.

In Ilia's benchmarks, planning time for WHERE x NOT IN (NULL, ...) dropped from 5–200 ms to ~1–2 ms depending on list size, when the column had detailed statistics. The optimization preserves semantics and avoids regressions in the common case.

Technical Analysis

The Semantics

For x NOT IN (a, b, c, ...) (or x <> ALL (array)):

  • If any element is NULL, the predicate can never be TRUE: it is FALSE when x equals a non-NULL element and NULL otherwise, and in a WHERE clause both outcomes filter the row out.
  • The planner models this as selectivity = 0.0: no rows match.

The current code in scalararraysel() in src/backend/utils/adt/selfuncs.c handles both = ANY (IN) and <> ALL (NOT IN) via a useOr flag. For <> ALL, the selectivity is computed by iterating over array elements and combining per-element estimates. When a NULL is present, the final result is always 0.0, so the loop is redundant.

Patch Evolution

v1 added an early check after deconstruct_array() using memchr() on elem_nulls to detect any NULL. David Geier raised a concern: memchr() adds overhead on every call. He suggested a flag on ArrayType instead.

v2 switched to short-circuiting inside the per-element loop: when a Const element is NULL, return 0.0 immediately. This avoids a separate pass but still requires entering the loop.

v3 moved the check earlier, right after DatumGetArrayTypeP(), using ARR_HASNULL() to detect NULL before deconstructing the array. This avoids both the deconstruction and the per-element loop when NULL is present. Ilia reported planning time dropping from 5–200 ms to ~1–2 ms.

v4 addressed a regression found by Zsolt Parragi. The macro ARR_HASNULL() only checks for the existence of a NULL bitmap—not whether any element is actually NULL. An array that originally had NULL but had all NULLs replaced (e.g., via array_set_element()) can still have a NULL bitmap. Using ARR_HASNULL() alone caused incorrect selectivity 0.0 for such arrays.

The fix: use array_contains_nulls(), which iterates the NULL bitmap and returns true only when an element is actually NULL. v4 also added a regression test that constructs an array from ARRAY[1, NULL, 3] with the NULL replaced by 99, ensuring the planner estimates 997 rows (not 0) for x <> ALL(replace_elem(ARRAY[1,NULL,3], 2, 99)).
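
The subtlety can be sketched with the stock array_replace() function, which removes the NULL elements but may leave the array's internal NULL bitmap in place:

```sql
-- {1,99,3}: no NULL elements remain, but the internal NULL bitmap may
-- persist, which is why the patch checks array_contains_nulls()
-- rather than the cheaper ARR_HASNULL().
SELECT array_replace(ARRAY[1, NULL, 3], NULL, 99) AS no_nulls_left;
```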

v5–v9 incorporated feedback from David Rowley: move the test to a new selectivity_est.sql file (later renamed to planner_est.sql), use tenk1 from test_setup.sql instead of a custom table, add a comment about assuming the operator is strict (like var_eq_const()), and simplify tests to assert the invariant (selectivity 0.0 when NULL is present) rather than exact row estimates.

Community Insights

  • David Geier questioned the cost of memchr() and suggested an ArrayType flag; Ilia found that ARR_HASNULL() / array_contains_nulls() already existed.
  • Zsolt Parragi found the regression with arrays that had NULLs replaced, and proposed the replace_elem test case.
  • David Geier clarified that ARR_HASNULL() checks the bitmap's existence, not actual NULL elements; array_contains_nulls() is the correct check.
  • David Rowley suggested moving tests to a dedicated planner-estimation file, using existing test tables, and documenting the strict-operator assumption. He also pushed the refactoring patch (planner_est.sql) and the main optimization.

Technical Details

Implementation

The optimization adds two short-circuit paths in scalararraysel():

  1. Constant array case: After DatumGetArrayTypeP(), if !useOr (i.e., <> ALL / NOT IN) and array_contains_nulls(arrayval), return 0.0 immediately. This runs before deconstruct_array().

  2. Non-constant list case: In the per-element loop, if !useOr and the element is a Const with constisnull, return 0.0. This handles x NOT IN (1, 2, NULL, ...) when the list comes from a non-constant expression.

The code assumes the operator is strict (like var_eq_const()): when the constant is NULL, the operator returns zero selectivity. This is consistent with existing planner behavior.
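
With the optimization in place, the zero-selectivity estimate shows up directly in EXPLAIN; the plan text below is illustrative (the one-row estimate is the planner's floor), using the regression suite's tenk1 table mentioned above:

```sql
-- After the patch: the estimate drops to the 1-row floor immediately,
-- without per-element selectivity calls.
EXPLAIN SELECT * FROM tenk1 WHERE unique1 NOT IN (1, 2, NULL);
--  Seq Scan on tenk1  (cost=...  rows=1 ...)
--    Filter: (unique1 <> ALL ('{1,2,NULL}'::integer[]))
```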

Edge Cases

  • Arrays with NULL bitmap but no actual NULLs: Handled by array_contains_nulls() instead of ARR_HASNULL().
  • Non-strict operators: The comment documents that the short-circuit follows the same assumption as var_eq_const().

David Geier noted that the speedup will be less pronounced once the hash-based NOT IN code is merged, but the optimization still saves cycles during selectivity estimation.

Current Status

The patch was committed by David Rowley on March 19, 2026. The refactoring that moved planner row-estimation tests to planner_est.sql was committed first; the main optimization followed. It will appear in a future PostgreSQL release.

Conclusion

A small change—detecting NULL in NOT IN / <> ALL lists and returning selectivity 0.0 early—avoids unnecessary per-element work during planning. The fix required careful handling of the NULL bitmap vs. actual NULL elements, and benefited from thorough review and regression tests. The optimization is now part of PostgreSQL and will help workloads that use large NOT IN lists containing NULL.

Week 11 (2026)

PostgreSQL mailing list discussions for Week 11, 2026.

Articles

Converting NOT IN Sublinks to Anti-Joins When Safe

Introduction

In February–March 2026, Richard Guo proposed and iterated on a planner patch to convert

... WHERE expr NOT IN (SELECT subexpr FROM ...)

into a join-based anti-join when it is provably safe. The topic has a long history on pgsql-hackers: NOT IN has tricky semantics around NULL, so naïvely rewriting it to an anti-join risks changing query results.

The new patch takes advantage of infrastructure that did not exist in older attempts: an outer-join-aware Var representation, a global not-null-attnums hash table, and smarter non-nullability reasoning. Over six patch versions, reviewers including wenhui qiu, Zhang Mingli, Japin Li, David Geier and others helped refine both the safety criteria and the implementation details. The final v6 patch was committed by Richard in March 2026.

This post explains why NOT IN is hard, how the planner proves safety, what changed between v1 and v6, and what this means for everyday queries.

Why NOT IN Is Hard

At first glance, NOT IN looks like a natural anti-join:

SELECT *
FROM users
WHERE id NOT IN (SELECT user_id FROM banned_users);

Intuitively this means “users that are not banned,” and a hash anti-join against banned_users is exactly the plan you would like to see. The problem is SQL’s NULL semantics.

Consider a scalar comparison (A = B) used inside NOT IN:

  • If (A = B) is TRUE, the element is found and the NOT IN condition fails.
  • If (A = B) is FALSE, the particular element is not a match; the overall result depends on the other elements.
  • If (A = B) is NULL, NOT (NULL) is also NULL (treated as false in WHERE), so the row is discarded.

In contrast, for an anti-join, if the join condition evaluates to NULL for a row pair, the executor essentially treats that as “no match” and may keep the outer row. As Richard points out in the thread, whenever the comparison operator can yield NULL on any pair of inputs that the executor considers a valid equality comparison, NOT IN and a simple anti-join disagree.

Historically, this semantic mismatch made the planner avoid converting NOT IN sublinks to anti-joins. Users who wanted a join-based plan needed to rewrite their queries using NOT EXISTS or other patterns.
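
A minimal pair shows the disagreement that blocked the naïve rewrite:

```sql
-- NOT IN: a NULL in the subquery output poisons the whole predicate.
SELECT 1 WHERE 1 NOT IN (SELECT NULL::int);                -- zero rows

-- NOT EXISTS (the anti-join shape): 1 = NULL is simply "no match",
-- so the outer row survives.
SELECT 1 WHERE NOT EXISTS (SELECT 1 WHERE 1 = NULL::int);  -- one row
```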

Planner Infrastructure: Proving Non-Nullability

Recent releases have accumulated infrastructure that can prove non-nullability of expressions more reliably and cheaply:

  • Outer-join-aware Vars record which relations can be nulled out by outer joins.
  • A not-null-attnums hash table tracks columns declared NOT NULL (or implied by primary keys and certain constraints).
  • expr_is_nonnullable() reasons about more complex expressions, not just simple Var or Const nodes.
  • find_nonnullable_vars() can derive non-nullability from qual clauses (e.g. col IS NOT NULL or strict operators in join conditions).

In the new patch, Richard leverages this infrastructure to answer two questions:

  1. Can either side of the comparison become NULL?
    • Use relation-level information (NOT NULL, PK) and outer-join-aware Vars to reject Vars that come from the nullable side of an outer join.
    • Incorporate find_nonnullable_vars() and safe qual clauses to detect values forced non-null by the query’s WHERE/ON conditions.
  2. Can the comparison operator itself return NULL on non-null arguments?
    • Use catalog information to restrict to operators that are members of a B-tree or hash opfamily, because those must behave like a normal total order or equality operator. If such an operator returned NULL for non-null inputs, indexes built on it would be broken.

Only when both operands are provably non-nullable and the operator is considered safe does the planner consider rewriting the NOT IN sublink as an anti-join.

From v1 to v6: Tightening the Safety Checks

The patch did not land in its first incarnation. The thread documents a clear evolution:

  • v1:

    • Implemented the basic conversion logic.
    • Focused on proving both sides of the comparison are non-nullable using existing planner helpers.
    • Did not yet check that the operator itself could not return NULL for non-null inputs.
  • Discussion on operator safety:

    • Richard realized that requiring non-nullable operands is not enough; an operator could still return NULL on non-null inputs.
    • He asked whether it is possible to detect operators that never return NULL on non-null inputs and suggested using membership in btree opclasses as a proxy.
    • David Geier noted that a lot of executor code assumes comparison operators do not return NULL (FunctionCall2() and friends will ERROR if they do), so constraining to builtin B-tree / hash operators is a reasonable and safe starting point.
  • v2–v4:

    • Added the check that the operator must be a member of a B-tree or hash operator family.
    • Clarified comments and added more regression tests covering cases where the inner side of the subquery comes from the nullable side of an outer join but is forced non-null by a WHERE clause.
    • Incorporated small review fixes around test comments and edge cases.
  • v5–v6:

    • Refined internal helpers, including the function that checks non-nullability of the test expression (sublink_testexpr_is_not_nullable).
    • Improved handling of row comparison expressions (RowCompareExpr) so that multi-column NOT IN patterns can also benefit.
    • Refactored list iteration using foreach_ptr / foreach_node for better type safety.
    • Fixed a subtle order issue in query_outputs_are_not_nullable() around flattening grouping Vars versus join alias Vars.
    • Added and polished regression test coverage, then performed a final self-review.

By the time v6 was committed, the patch combined conservative semantics, good test coverage, and multiple rounds of review feedback.

What the Patch Actually Does

At a high level, when the planner sees a NOT IN sublink of the canonical ANY/ALL form, it:

  1. Recognizes the pattern in SubLink and the associated testexpr.
  2. Collects outer expressions (the left-hand side(s) of the comparison).
  3. Checks operator safety:
    • Every operator involved in the comparison must be a member of a B-tree or hash opfamily.
  4. Proves operand non-nullability:
    • Use relation-level NOT NULL information, outer-join-aware Var metadata, and non-nullable-vars analysis from qual clauses to show that neither the outer expression nor the subquery output can be NULL.
  5. Converts to an anti-join if and only if all checks pass.

The conversion allows the planner to:

  • Pull the subquery up into the global join tree instead of treating it as an opaque subplan.
  • Reorder the join relative to other joins.
  • Choose the best join algorithm (hash anti-join, merge anti-join, etc.) based on cost.

From the user’s perspective, the benefit is that typical exclusion patterns written with NOT IN now receive plans comparable to well-written NOT EXISTS and explicit anti-join forms, without requiring manual rewrites.

Example: Typical Exclusion Pattern

The patch primarily targets canonical patterns like:

SELECT *
FROM users
WHERE id NOT IN (SELECT user_id FROM banned_users);

and:

SELECT *
FROM users
WHERE id NOT IN (
  SELECT user_id
  FROM banned_users
  WHERE user_id IS NOT NULL
);

In well-modelled schemas, users.id and banned_users.user_id are usually defined as NOT NULL and use standard equality operators. Under these conditions, the planner can prove that:

  • Neither side of the comparison can be NULL.
  • The equality operator behaves like a normal B-tree/hash equality operator.

The NOT IN sublink is then rewritten to an anti-join, and the plan can look like:

  • A Hash Anti Join between users and banned_users, or
  • A Merge Anti Join if there are supporting indexes and the planner prefers a merge strategy.
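
A sketch of what to expect; the table definitions are illustrative, and the plan shape depends on statistics and on the committed patch being present:

```sql
CREATE TABLE users        (id int NOT NULL PRIMARY KEY);
CREATE TABLE banned_users (user_id int NOT NULL);

EXPLAIN SELECT * FROM users
WHERE id NOT IN (SELECT user_id FROM banned_users);
-- With the patch, the plan can be a Hash Anti Join between users and
-- banned_users rather than a Seq Scan with a hashed SubPlan.
```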

The thread also includes a large performance test contributed by wenhui qiu, showing how the new optimization produces efficient anti-join plans on big synthetic data sets once the relevant columns are marked NOT NULL.

Community Feedback and Scope

The discussion also explored how far this optimization should go:

  • David Geier described more aggressive rewrites that add IS NOT NULL predicates and extra NOT EXISTS subqueries to cover cases where either side can be NULL. Richard pointed out correctness problems with some of those transformations (especially when the subquery is empty) and concluded that they were out of scope for this patch.
  • There was brief speculation about a potential null-aware anti-join execution node, similar to what Oracle does, to handle more NOT IN cases without purely syntactic query rewrites. That is left for future work.
  • Reviewers repeatedly emphasized a conservative approach: it is better to miss some theoretical optimization opportunities than to risk changing query answers.

The final patch deliberately focuses on the basic, high-ROI form: both sides provably non-nullable and standard B-tree/hash operators. This matches the majority of real-world NOT IN exclusion queries while keeping code complexity and risk under control.

Current Status

As of mid-March 2026:

  • The v6 patch “Convert NOT IN sublinks to anti-joins when safe” has been committed.
  • The optimization is enabled automatically when the planner can prove its safety.

In practice, this means that if you write typical NOT IN exclusion queries on NOT NULL keys using builtin comparison operators, PostgreSQL’s planner can now generate anti-join plans for you. You get better plans without changing application SQL, as long as your schemas and constraints correctly reflect non-nullability.

Takeaways for Users

  • Use proper NOT NULL constraints and primary keys. The more accurately schemas model non-nullability, the more often the planner can safely apply this optimization.
  • NOT IN on nullable columns remains tricky. PostgreSQL will continue to be conservative in these cases; consider NOT EXISTS patterns if you need predictable behavior with NULLs.
  • You do not need to rewrite canonical NOT IN exclusions just to get an anti-join; the planner can now recognize and optimize them when it is safe to do so.

References

Week 10 (2026)

PostgreSQL mailing list discussions for Week 10, 2026.

Articles

Generic Plans and Initial Pruning: Fewer Locks for Partitioned Tables

Introduction

In December 2021, Amit Langote proposed a patch to speed up execution of generic plans over partitioned tables. Generic plans (e.g. from prepared statements with parameters) cannot prune partitions at plan time, so they contain nodes for every partition. That makes AcquireExecutorLocks() a major bottleneck: it locks every relation in the plan, and partition count grows without bound. This post summarizes the idea, the benchmark gains, and the safety and design discussion that followed on pgsql-hackers.

Why This Matters

When you use a prepared statement like:

PREPARE q AS SELECT * FROM partitioned_table WHERE key = $1;
EXECUTE q(123);

with plan_cache_mode = force_generic_plan (or after the planner has chosen a generic plan), the plan is shared across executions and has no plan-time pruning. So the plan tree includes all partitions. Before each execution, CheckCachedPlan() must ensure the plan is still valid; most of that cost is in AcquireExecutorLocks(), which locks every relation in the plan. As partition count grows (hundreds or thousands), lock acquisition dominates and throughput drops sharply.

An earlier attempt by David Rowley delayed locking until after "initial" (pre-execution) pruning in the executor. That was rejected because leaving some partitions unlocked opened race conditions: a concurrent session could alter a partition after the plan was deemed valid but before execution, invalidating part of the plan.

Amit's approach keeps locking at plan-check time but reduces the set of relations to lock by reusing the same "initial" pruning logic that the executor will later use. Only partitions that survive initial pruning are locked, so the plan stays safe while lock count scales with the number of partitions that are actually used.

Technical Analysis

The Idea: Prune Before Locking

Append and MergeAppend nodes for partitioned tables carry partition pruning information: which subplans are discarded by "initial" (pre-execution) steps versus "execution-time" steps. Initial steps depend only on values available before execution (e.g. bound parameters), not on per-row values. The patch teaches AcquireExecutorLocks() to:

  1. Walk the plan tree as it does today to collect relations to lock.
  2. For Append/MergeAppend nodes that have initial pruning steps (contains_init_steps), run those steps (using the same logic as the executor) to get the set of subplans that survive.
  3. Only add relations from those surviving subplans to the lock set.

So the lock set is exactly the set of relations that will be used when the plan runs. No partition that is pruned away by initial steps is locked, and no partition that is used is left unlocked.

Duplication of Pruning

Initial pruning is therefore performed twice for generic plans: once in AcquireExecutorLocks() to decide what to lock, and again in ExecInit[Merge]Append() to decide which partition subnodes to create. Amit noted he couldn't find a clean way to avoid this duplication without restructuring where locking happens (e.g. moving it into executor startup), which would be a larger change.

Benchmark

Using pgbench with a partitioned database and plan_cache_mode = force_generic_plan:

  • HEAD: throughput falls as partition count increases (e.g. 32 partitions ≈ 20.5k tps, 2048 partitions ≈ 1.3k tps).
  • Patched: throughput stays much higher (e.g. 32 partitions ≈ 27.5k tps, 2048 partitions ≈ 16.3k tps).

So the patch removes most of the scaling cost from lock acquisition when generic plans are used with many partitions.

Community Insights

When Does This Apply?

Ashutosh Bapat asked when "pre-execution" pruning instructions exist so that this approach helps. Amit clarified:

  • The main use case is prepared statements that use a generic plan, e.g. PREPARE q AS SELECT * FROM partitioned_table WHERE key = $1; with EXECUTE q(...).
  • Other bottlenecks (e.g. executor startup/shutdown code that walks the full range table) are unchanged by this patch.

Code Review (Amul Sul)

Amul suggested several cleanups for the v1 patch:

  • Move declarations inside the if (pruneinfo && pruneinfo->contains_init_steps) block.
  • Add a short comment that when the condition is false, plan_tree_walker() continues to child nodes, so locking behavior remains correct.
  • Prefer GetLockableRelations_worker() (or equivalent) over adding a new get_plan_scanrelids().
  • Use plan_walk_members() for CustomScan like other node types.
  • Let the caller create/free the temporary EState used for pruning in the lock path, instead of doing it inside the lock-collection helper.
  • Use foreach_current_index() in the relevant loops for clarity.

Safety (Robert Haas)

Robert raised two important points.

1. Partly valid plans. Today we run only plans that are fully valid: because we lock every relation, we process invalidation messages and detect DDL that might invalidate the plan. If we skip locking some relations, we might never see invalidations for them. For example:

  • A partition has an extra index and the plan uses an Index Scan on it.
  • That partition is pruned by initial steps, so we don't lock it.
  • Another session drops the index.
  • We still consider the plan valid. We don't execute the pruned part, but code that walks the whole plan (e.g. EXPLAIN, auto_explain) might touch that node and break (e.g. looking up the index name).

So we'd be in a situation where the plan is "partly valid," which we don't have today. Robert wasn't sure there is a concrete bug in core from that, but it's a new class of risk.

Amit replied that he'd looked for places that inspect the plan tree before executor init and hadn't found one that would touch pruned-off parts; EXPLAIN runs after ExecutorStart(), which builds the PlanState tree and thus only the non-pruned portion. He agreed it's not something to assert with certainty.

2. Lock set vs. init set must match. Doing initial pruning in two places means two separate computations. If they ever disagree (e.g. due to a function misdeclared as IMMUTABLE but actually VOLATILE), we could lock one set of partitions and then initialize a different set. Robert argued we should ensure that cannot happen rather than rely on the two computations always being identical.

Amit agreed the patch assumes initial pruning is deterministic (no VOLATILE in the pruning expressions). Misdeclared IMMUTABLE could lead to Assert failures or, in non-assert builds, using an unlocked partition, which would be bad.

Technical Details

Applicability

  • Generic plans only: Custom plans can prune at plan time, so they don't have the "all partitions in the plan" problem to the same degree.
  • Initial pruning only: Only partitions eliminated by initial (pre-execution) pruning are skipped for locking. Partitions pruned at execution time (e.g. by runtime parameter values in a different execution) are still in the plan and would still be locked; the patch doesn't change that.

Edge Cases

  • EXPLAIN / auto_explain: The concern is that they might walk plan nodes for relations we didn't lock. Amit's analysis is that they run after executor init, so they only see the initialized (post-pruning) plan.
  • Incorrect VOLATILE/IMMUTABLE: Could make lock set and init set differ; the patch doesn't add extra guards for that.

Current Status

The thread carried a single patch version (v1); no follow-up commit appeared in the thread. The discussion highlighted:

  • Strong benchmark gains for generic plans on many partitions.
  • Need to address code-style and refactor suggestions (EState lifecycle, walker usage, comments).
  • Open design points: ensuring lock set and init set cannot diverge, and whether "partly valid" plans are acceptable for code paths that walk the full plan tree.

Conclusion

Amit's patch reduces the cost of AcquireExecutorLocks() for generic plans over partitioned tables by locking only partitions that survive initial pruning, avoiding the race conditions of the earlier "delay locking" approach. Benchmarks show large throughput improvements when many partitions are present. The discussion clarified the intended use case (prepared statements with generic plans), raised valid safety and consistency concerns (partly valid plans, duplicate pruning), and produced concrete code-review suggestions. Implementing similar logic in a way that guarantees a single pruning result for both locking and executor init would strengthen the approach and may require a somewhat larger refactor of where and how locks are acquired.

References

Week 09 (2026)

PostgreSQL mailing list discussions for Week 09, 2026.

Articles

More Speedups for Tuple Deformation: Precalculating attcacheoff

Introduction

Tuple deformation is the process of extracting individual attribute values from a PostgreSQL heap tuple's raw byte representation into a TupleTableSlot. It happens constantly during query execution—every time a sequential scan, index scan, or join produces a row, the executor must "deform" that tuple to access column values. For workloads that process millions of rows, even small improvements to the deforming hot path can yield significant gains.

David Rowley has been steadily optimizing tuple deformation. In PostgreSQL 18, he landed several patches: CompactAttribute (5983a4cff), faster offset aligning (db448ce5a), and inline deforming loops (58a359e58). Building on that work, he proposed precalculating attcacheoff rather than computing it on every attribute access. The discussion has since evolved through v10 (February 2026), with Andres Freund contributing a NULL-bitmap-to-isnull conversion that pushes Apple M2 speedups to 63% in some scenarios. The patch set remains under active review.

Why This Matters

When the executor needs a column value from a tuple, it must:

  1. Align the current offset according to the attribute's alignment
  2. Fetch the value via fetch_att()
  3. Advance past the attribute to the next one

These steps form a dependency chain: each offset depends on the previous. There is little opportunity for instruction-level parallelism. For fixed-width attributes, PostgreSQL can cache the offset (attcacheoff) to avoid recomputing alignment and length—but that caching was previously done inside the deforming loop. David's idea: do it once when the TupleDesc is finalized, not on every tuple.

Technical Approach

TupleDescFinalize()

The core change introduces TupleDescFinalize(), which must be called after a TupleDesc has been created or changed. This function:

  1. Pre-calculates attcacheoff for all fixed-width attributes
  2. Records firstNonCachedOffAttr—the first attribute (by attnum) that is varlena or cstring and thus cannot have a cached offset
  3. Enables a tight loop that deforms all attributes with cached offsets before falling through to attributes that require manual offset calculation

If a tuple has a NULL before the last attribute with a cached offset, the code can only use attcacheoff up to that NULL—but for tuples without early NULLs, the fast path handles many attributes in a tight loop without any per-attribute offset arithmetic.

Dedicated Deforming Loops

The patch adds a dedicated loop that processes all attributes with a precomputed attcacheoff before entering the loop that handles varlena/cstring attributes. For tuples with HEAP_HASNULL set, the current code calls att_isnull() for every attribute. A further optimization: keep deforming without calling att_isnull() until we reach the first NULL. Test #5 in the benchmark (first col int not null, last col int null) highlights this—it often shows the largest speedup.

Optional OPTIMIZE_BYVAL Loop

An optional variant adds a loop for tuples where all deformed attributes are attbyval == true. In that case, fetch_att() can be inlined without the branch that handles pointer types, reducing branching and yielding a tighter loop. The tradeoff: when the optimization doesn't apply, there is extra overhead to check attnum < firstByRefAttr. Benchmark results vary by hardware and compiler as to whether this helps.

Benchmark Design

To stress tuple deformation, David designed a benchmark that maximizes deforming work relative to other CPU:

SELECT sum(a) FROM t1;

The a column is almost last, so all prior attributes must be deformed before a can be read. Eight test schemas cover combinations of first column (int/text, null/not null) and last column (int null/not null). For each of the 8 tests, he ran with 0, 10, 20, 30, and 40 extra INT NOT NULL columns—40 scenarios per benchmark run. Each scenario used 1 million rows.

Benchmark Results

Results varied by hardware and compiler:

  • AMD Zen 2 (3990x) with GCC: Up to 21% average speedup with OPTIMIZE_BYVAL; some tests exceed 44%; no regressions.
  • AMD Zen 2 with Clang: Some small regressions in the 0-extra-column tests.
  • Apple M2: Tests #1 and #5 improve significantly; others less so; a few slight regressions with certain patches.
  • Intel (Azure): Benchmarks run on shared, low-core instances; results were noisier due to co-located workloads.

Patch Evolution

v1 → v3 (December 2025 – January 2026)

  • v1: Three patches—0001 (precalculate attcacheoff), 0002 (experimental NULL bitmap look-ahead), 0003 (remove dedicated hasnulls loop)
  • v2: Rebase, fix Assert for NULL bitmap in 0003, JIT fix (remove TTS_FLAG_SLOW), more benchmarks
  • v3: Rebase, drop 0002 and 0003 (benchmarks showed little advantage), keep only 0001

v4 (January 2026)

Addressed code review from Chao Li:

  • NULL bitmap mask (tupmacs.h): Clarified comment—when natts & 7 == 0, the mask is zero and the code correctly returns natts
  • Uninitialized TupleDesc: firstNonCachedOffAttr == 0 means no cached attributes; -1 means uninitialized. Added Asserts with hints to call TupleDescFinalize() if they fail
  • Typo: "possibily" → "possibly"
  • LLVM: Fixed compiler warning

v5–v8 (January–February 2026): Andres Freund's NULL Bitmap Optimization

Andres Freund joined the discussion and proposed a key improvement: instead of calling att_isnull() for each column, compute the isnull[] array directly from the NULL bitmap using a SWAR (SIMD Within A Register) technique. The idea: multiply the bitmap bits, a nibble at a time, by a carefully chosen constant (0x204081) so each bit spreads into a separate byte, then mask. This avoids a 2KB lookup table and works well on most hardware.

David implemented this in patch 0004 ("Various experimental changes"). Additional changes in 0004:

  • populate_isnull_array(): Converts the NULL bitmap to tts_isnull in bulk using the multiplication trick
  • tts_isnull sizing: Rounded up to a multiple of 8 so the loop can write 8 bytes at a time (avoids memset inlining issues)
  • t_hoff: For !hasnulls tuples, use MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) instead of t_hoff
  • fetch_att_noerr(): New variant without elog for the common attlen == 8 case

John Naylor noted that __builtin_ctz(~bits[bytenum]) is undefined when the byte is 255; David fixed this with a cast: pg_rightmost_one_pos32(~((uint32) bits[bytenum])).

Results with 0004: Apple M2 averaged 53% faster than master (or ~63% excluding 0-extra-column tests). Andres suggested pg_nounroll and pg_novector pragmas to prevent GCC from over-vectorizing populate_isnull_array(), which was generating poor code.

v9 (February 24, 2026)

  • Resequenced patches: deform_bench moved to 0001 for easier master benchmarking
  • 0004 (new): Sibling-call optimization in slot_getsomeattrs—moved slot_getmissingattrs() into getsomeattrs() so the compiler can apply tail-call optimization. Reduces overhead and improves 0-extra-column tests
  • 0005 (new): Shrink CompactAttribute from 16 to 8 bytes—attcacheoff → int16 (max 2^15), bitflags for booleans. Andres noted the 8-byte size lets the compiler use a single LEA with scale factor 8; 6 bytes would require two LEA instructions

v10 (February 25, 2026) — Latest Patch Set

Based on the actual v10 patch content:

0003 (Optimize tuple deformation):

  • firstNonCachedOffsetAttr: index of the first attribute without a cached offset
  • firstNonGuaranteedAttr: index of the first nullable, missing, or !attbyval attribute. When deforming only up to this point, the code need not access HeapTupleHeaderGetNatts(tup)—a dependency reduction that helps the CPU pipeline
  • TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS: opt-in flag for the guaranteed-attribute optimization (some code deforms tuples before NOT NULL validation)
  • populate_isnull_array(): uses SPREAD_BITS_MULTIPLIER_32 (0x204081) to spread each bit of the inverted NULL bitmap into a separate byte; processes lower 4 and upper 4 bits separately to avoid uint64 overflow
  • fetch_att_noerr(): variant of fetch_att() without elog for invalid attlen; safe when attlen comes from CompactAttribute
  • first_null_attr(): finds the first NULL in the bitmap using pg_rightmost_one_pos32 or __builtin_ctz

0004 (Sibling-call optimization):

  • getsomeattrs() is now responsible for calling slot_getmissingattrs()
  • slot_getmissingattrs(): replaced memset with a for-loop (benchmarks showed the loop is faster)
  • slot_deform_heap_tuple(): calls slot_getmissingattrs() at the end when attnum < reqnatts; parameter renamed from natts to reqnatts

0005 (8-byte CompactAttribute):

  • attcacheoff → int16; offsets > PG_INT16_MAX are not cached
  • Bitflags for attispackable, atthasmissing, attisdropped, attgenerated
  • Stores cattrs = tupleDesc->compact_attrs to help GCC generate better code (avoids repeated TupleDescCompactAttr() calls)

Review fixes:

  • Amit Langote: Fixed rebase noise (duplicate attcacheoff check)
  • Zsolt Parragi: Big-endian fix—pg_bswap64() before memcpy in populate_isnull_array()
  • Typos: "benchmaring" → "benchmarking", "to info" → "into"
  • Andres: Set *offp before slot_getmissingattrs to reduce stack spills; use size_t for attnum to fix GCC -fwrapv codegen

deform_bench and Benchmark Infrastructure

Andres and Álvaro Herrera discussed where to put deform_bench: src/test/modules/benchmark_tools, src/benchmark/tuple_deform, or a single extension for micro-benchmarks. Andres argued for merging useful tools incrementally rather than waiting for a full suite. David prefers to focus on the deformation patches first; deform_bench may be committed separately.

Code Review: Chao Li's Feedback

Chao Li reviewed the patch and raised several points:

  1. NULL bitmap mask: Add a comment to clarify no overflow/OOB risk when natts & 7 == 0
  2. Uninitialized TupleDesc: Initialize firstNonCachedOffAttr to -1 in TupleDesc creation; Assert >= 0 in nocachegetattr()
  3. Semantic consistency: Use 0 for "no cached attributes," >0 for "some cached"
  4. Typo: "possibily" → "possibly"

David addressed all in v4.

Current Status

  • v10 (February 2026) is the latest patch set: 0001 (deform_bench), 0002 (TupleDescFinalize stub), 0003 (main optimization), 0004 (sibling-call + NULL bitmap→isnull), 0005 (8-byte CompactAttribute)
  • Andres Freund supports merging 0004 as a clear win; 0005's benefit is less certain (helps with LEA addressing when deforming few columns)
  • Active review from Zsolt Parragi (Percona), Álvaro Herrera, John Naylor, Amit Langote
  • deform_bench placement (src/test/modules vs. src/benchmark) still under discussion; David prefers to land the optimization patches first

Conclusion

Precalculating attcacheoff in TupleDescFinalize() and using a dedicated tight loop for attributes with cached offsets yields meaningful speedups for tuple deformation on modern CPUs. The optimization is most effective when tuples have many fixed-width columns and few or late NULLs. With Andres Freund's NULL-bitmap-to-isnull conversion (the "0x204081" SWAR trick), Apple M2 sees up to 63% speedup excluding edge cases. The sibling-call optimization in slot_getsomeattrs further reduces overhead. Results depend on hardware and compiler; GCC can over-vectorize some loops, addressed with pragmas or size_t for loop indices. The patch set (v10) has been refined through extensive review from Andres, John Naylor, Zsolt Parragi, Álvaro Herrera, and Amit Langote, and is progressing toward integration.

References

Week 08 (2026)

PostgreSQL mailing list discussions for Week 08, 2026.

Articles

Eliminating SPI from RI Triggers: A Fast Path for Foreign Key Checks

Introduction

Referential Integrity (RI) triggers in PostgreSQL traditionally execute SQL queries via SPI (Server Programming Interface) to verify that inserted or updated rows in a referencing table have matching rows in the referenced (primary key) table. For bulk operations—large INSERT or UPDATE statements—this means starting and tearing down a full executor plan for each row, with significant overhead from ExecutorStart() and ExecutorEnd().

Amit Langote has been working on eliminating this overhead by performing RI checks as direct index probes instead of SQL plans. The latest iteration of this work, "Eliminating SPI / SQL from some RI triggers - take 3," achieves up to 57% speedup for bulk foreign key checks by bypassing the SPI executor and calling the index access method directly when the constraint semantics allow it.

The patch set has evolved through several versions, with Junwang Zhao joining the effort in late 2025. The current direction is a hybrid fast-path + fallback design: use a direct index probe for straightforward cases, and fall back to the existing SPI path when correctness requires executor behavior that would be difficult or risky to replicate.

Why This Matters

Foreign key constraints are ubiquitous. Every INSERT or UPDATE into a referencing table triggers RI checks that must verify each new or modified row against the referenced table's primary key. With the traditional approach:

CREATE TABLE pk (a int PRIMARY KEY);
CREATE TABLE fk (a int REFERENCES pk);

INSERT INTO pk SELECT generate_series(1, 1000000);
INSERT INTO fk SELECT generate_series(1, 1000000);  -- 1M RI checks

Each of the 1 million inserts triggers an RI check that:

  1. Builds a query plan to scan the PK index.
  2. Runs ExecutorStart() and ExecutorEnd().
  3. Executes the plan to find (or not find) the matching row.

This per-row plan setup/teardown dominates the cost. With Amit's v3 patches, the same bulk insert drops from ~1000 ms to ~432 ms (57% faster) on his benchmark machine—by probing the PK index directly without going through the executor.

Technical Background

The Traditional RI Path

RI trigger functions in ri_triggers.c (e.g. RI_FKey_check) call ri_PerformCheck(), which:

  1. Builds an SQL string for a query like SELECT 1 FROM pk WHERE pk.a = $1.
  2. Uses SPI_prepare and SPI_execute_plan to run it.
  3. The executor performs an index scan on the PK, returning a row if the referenced value exists.

This works correctly for all cases—partitioned tables, temporal foreign keys, concurrent updates—but pays the full plan-execution cost per row.

The Fast-Path Idea

For simple foreign keys (non-partitioned referenced table, non-temporal semantics), the check is conceptually: "probe the PK index for this value; if found and lockable, the check passes." That can be done by:

  1. Opening the PK relation and its unique index.
  2. Building a scan key from the FK column values.
  3. Calling index_getnext() (or equivalent) to find the tuple.
  4. Locking it with LockTupleKeyShare under the current snapshot.

No SQL, no plan, no executor. Just a direct index probe and tuple lock.

Patch Evolution

v1: The Original Approach (December 2024)

The first patch set (3 patches) introduced:

  • 0001: Refactoring of the PartitionDesc interface to explicitly pass the snapshot needed for omit_detached visibility (detach-pending partitions). This addressed a bug where PK lookups could return incorrect results under REPEATABLE READ because find_inheritance_children()'s visibility of detach-pending partitions depended on ActiveSnapshot, which RI lookups were manipulating.
  • 0002: Avoid using SPI in RI trigger functions by introducing a direct index probe path.
  • 0003: Avoid using an SQL query for some RI checks—the main performance optimization.

Amit noted that temporal foreign key queries would remain on the SPI path, as their plans involve range overlap and aggregation and are not amenable to a simple index probe. He also added an equivalent of EvalPlanQual() for the new path to handle concurrent updates correctly under READ COMMITTED.

v2: Junwang's Hybrid Fast Path (December 2025)

Junwang Zhao took the work forward with a hybrid design:

  • 0001: Add fast path for foreign key constraint checks. Applies when the referenced table is not partitioned and the constraint does not involve temporal semantics.
  • 0002: Cache fast-path metadata (operator hash entries, operator OIDs, strategy numbers, subtypes). At that stage, the metadata cache did not yet improve performance.

Benchmarks (1M rows, numeric PK / bigint FK):

  • Head: INSERT 13.5s, UPDATE 15s
  • Patched: INSERT 8.2s, UPDATE 10.1s

v3: Amit's Rework with Per-Statement Caching (February 2026)

Amit reworked Junwang's patches into two patches:

  • 0001: Functionally complete fast path. Includes concurrency handling, REPEATABLE READ crosscheck, cross-type operators, security context (RLS/ACL), and metadata caching. Most logic lives in ri_FastPathCheck(); RI_FKey_check just gates the call and falls back to SPI when needed.
  • 0002: Per-statement resource caching. Instead of sharing EState between trigger.c and ri_triggers.c, a new AfterTriggerBatchCallback mechanism fires at the end of each trigger-firing cycle. It allows caching the PK relation, index, scan descriptor, and snapshot across all FK trigger invocations within a single cycle, rather than opening and closing them per row.

Benchmarks on Amit's machine:

  Scenario                 Master     0001                  0001+0002
  1M rows, numeric/bigint  2444 ms    1382 ms (43% faster)  1202 ms (51% faster)
  1M rows, int/int         1000 ms    520 ms (48% faster)   432 ms (57% faster)

The incremental gain from 0002 (~13–17%) comes from eliminating per-row relation open/close, scan begin/end, slot allocation/free, and replacing per-row GetSnapshotData() with a snapshot copy in the cache.

Design: When to Use Fast Path vs. SPI

The fast path applies when:

  • The referenced table is not partitioned.
  • The constraint does not involve temporal semantics (range overlap, range_agg(), etc.).
  • Multi-column keys, cross-type equality (via index opfamily), collation matching, and RLS/ACL are all handled directly in the fast path.

The code falls back to SPI when:

  1. Concurrent updates or deletes: If table_tuple_lock() reports that the target tuple was updated or deleted, the code delegates to SPI so that EvalPlanQual and visibility rules apply as today.
  2. Partitioned referenced tables: Require routing the probe through the correct partition via PartitionDirectory. Can be added later as a separate patch.
  3. Temporal foreign keys: Use range overlap and containment semantics that inherently involve aggregation; they stay on the SPI path.

Security behavior mirrors the existing SPI path: the fast path temporarily switches to the parent table's owner with SECURITY_LOCAL_USERID_CHANGE | SECURITY_NOFORCE_RLS around the probe, matching ri_PerformCheck().

Future Directions

David Rowley suggested off-list that batching multiple FK values into a single index probe could further improve performance, leveraging the ScalarArrayOp btree improvements from PostgreSQL 17. The idea: buffer FK values across trigger invocations in the per-constraint cache, build a SK_SEARCHARRAY scan key, and let the btree AM traverse matching leaf pages in one sorted pass instead of one tree descent per row. Locking and recheck would remain per-tuple. This could be explored as a separate patch on top of the current series.

Current Status

  • The series is in PG19-Drafts. Amit moved it there in October 2025; Junwang Zhao is continuing the work.
  • Amit's v3 patches (February 2026) are in reasonable shape and ready for review. He welcomes feedback, especially on concurrency handling in ri_LockPKTuple() and the snapshot lifecycle in 0002.
  • Pavel Stehule has offered to help with testing and review.

Conclusion

Eliminating SPI from RI triggers for simple foreign key checks yields substantial performance gains for bulk operations. The hybrid fast-path + fallback design addresses reviewer concerns about correctness by deferring to SPI whenever executor behavior is non-trivial to replicate. The per-statement resource caching in v3 adds a second layer of optimization by amortizing relation/index setup across many rows within a single trigger-firing cycle.

For workloads with large bulk inserts or updates on tables with foreign keys—common in ETL, staging loads, and data migrations—this work could significantly reduce runtimes. The current limitations (partitioned PKs, temporal FKs) leave those cases on the existing path, preserving correctness while optimizing the majority of FK workloads.

References

Week 07 (2026)

PostgreSQL mailing list discussions for Week 07, 2026.

Articles

SQL Property Graph Queries (SQL/PGQ): Bringing Graph Queries to PostgreSQL

Introduction

In February 2024, Peter Eisentraut announced a prototype implementation of SQL Property Graph Queries (SQL/PGQ) on the pgsql-hackers mailing list—a new way to run graph-style queries directly in PostgreSQL, following the SQL:2023 standard (ISO/IEC 9075-16). The initiative had been briefly discussed at the FOSDEM developer meeting, and community interest led Peter to share his work-in-progress.

Nearly two years later, the patch has evolved into a substantial implementation: 118 files changed, ~14,800 lines added. Peter and Ashutosh Bapat are the primary authors, with Junwang Zhao reviewing and Ajay Pal and Henson Choi testing. The latest iteration (v20260113) consolidates features from v0 through v14 and beyond—including cyclic path patterns, access permissions, RLS support, graph element functions (LABELS(), PROPERTY_NAMES()), multi-pattern path matching, ECPG support, property collation rules, and pg_overexplain integration.

SQL/PGQ lets you define property graphs over existing relational tables and query them using path patterns (vertices connected by edges), much like Cypher or GQL. Unlike dedicated graph databases, this approach maps graphs onto the relational model: graphs are views over tables, and graph queries rewrite to joins and unions. The discussion raised important architectural questions about when and how that transformation should happen—and whether the rewriter is the right place for it.

Why This Matters

Many applications have naturally graph-shaped data: social networks, supply chains, recommendation systems, fraud detection. Today, developers either:

  • Use a separate graph database (Neo4j, etc.) and maintain two systems, or
  • Encode graph traversals as recursive CTEs and complex joins in PostgreSQL.

SQL/PGQ aims to give PostgreSQL users a standard, declarative way to express graph queries without leaving SQL or duplicating data. The standard has been adopted by Oracle 23c and others; bringing it to PostgreSQL would improve interoperability and make graph capabilities available to a broad user base.

Technical Analysis

The SQL/PGQ Model

A property graph in SQL/PGQ is a virtual structure defined over existing tables:

  • Vertices are rows from one or more tables (with optional labels).
  • Edges are relationships, typically inferred from foreign keys or explicitly specified.
  • Properties are columns from those tables.
  • Labels can be shared across multiple element tables (e.g., person for both customers and employees), with each label exposing its own set of properties.

You create a graph with CREATE PROPERTY GRAPH and query it with GRAPH_TABLE(... MATCH ... COLUMNS ...). Example:

CREATE PROPERTY GRAPH myshop
    VERTEX TABLES (
        products LABEL product,
        customers LABEL customer,
        orders LABEL "order"
    )
    EDGE TABLES (
        order_items SOURCE orders DESTINATION products LABEL contains,
        customer_orders SOURCE customers DESTINATION orders LABEL has_placed
    );

SELECT customer_name FROM GRAPH_TABLE (myshop
  MATCH (c IS customer)-[IS has_placed]->(o IS "order" WHERE o.ordered_when = current_date)
  COLUMNS (c.name AS customer_name));

The MATCH clause describes a path pattern; the implementation rewrites it to joins and filters that PostgreSQL can plan and execute.

Implementation Approach: Rewrite System

The implementation uses the rewrite system for graph-to-relational transformation—the same stage where views expand. Peter explained that this aligns with the SQL/PGQ specification: graphs map to relations, and queries expand like view definitions. By the time the planner sees the query, the graph structure has been flattened into standard relational form, keeping the implementation consistent with view security (privileges, security barriers).

Patch Evolution: From v0 to v20260113

v0: Fragile Prototype (Feb 2024)

The initial patch was ~332 KB. Peter described it as "quite fragile." It introduced CREATE PROPERTY GRAPH, GRAPH_TABLE, basic path patterns, and documentation in ddl.sgml and queries.sgml.

v1 and Early Refinements (June–Aug 2024)

Peter and Ashutosh released v1 with a "fairly complete minimal feature set." Ashutosh contributed:

  • WHERE clause in graph patterns—e.g., MATCH (a)->(b)->(c) WHERE a.vname = c.vname
  • Spurious "column not found" bug fix: Attribute names were referenced from pg_attribute heap tuples; after RELCACHE_FORCE_RELEASE they could point to freed memory. The fix was to copy attribute names.
  • Compilation fixes, pgperltidy compliance, error location reporting for implicit properties/labels.

Imran Zaheer added support for EXCEPT lists in labels and properties.

v14: Cyclic Paths, Access Permissions, RLS (Aug–Oct 2024)

Ashutosh contributed major features:

  • Cyclic path patterns: Path patterns where an element variable repeats (e.g., same vertex at both ends of a path). Elements sharing a variable must have the same type and label expression; repeated edge patterns are not supported.
  • Access permissions on property graphs: The property graph acts like a security invoker view—the current user must have privileges on the underlying tables. Queries succeed only for elements the user can access. Security definer property graphs are not implemented.
  • Row Level Security (RLS): Regression tests in graph_table_rls.sql verify RLS behavior with property graphs.
  • Property collation and edge-vertex links: Same-named properties across elements must have the same collation. Edge key columns and referenced vertex keys must have compatible collations. Edge-vertex link quals use equality operators, with dependencies so they cannot be dropped without dropping the edge.
  • \d and \dG variants: \d on a property graph shows elements, tables, kinds, and end-vertex aliases; \d+ adds labels and properties. \dG lists property graphs; \dG+ adds owner and description.

Henson Choi: LABELS(), PROPERTY_NAMES(), Multi-Pattern (Dec 2025)

Henson Choi contributed three patches:

  • LABELS() graph element function: Returns all labels of a graph element as text[]. Implemented by wrapping element tables in subqueries with a virtual __labels__ column, enabling the planner to prune Append branches when filtering by labels (e.g., WHERE 'Person' = ANY(LABELS(v))).
  • PROPERTY_NAMES() graph element function: Returns all property names as text[], with similar planner pruning for property-based filters.
  • Multi-pattern path matching: Support for comma-separated path patterns in MATCH, e.g., MATCH (a)->(b), (b)->(c). Patterns with shared variables merge into one join; disconnected patterns produce a Cartesian product (aligned with SQL/PGQ and Neo4j Cypher).
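The merge-versus-product semantics can be sketched in a few lines of Python. This is an illustrative simulation over a hypothetical edge table, not the patch's rewriter code:

```python
from itertools import product

# Hypothetical edge table: rows of (source vertex, destination vertex).
edges = [("a1", "b1"), ("b1", "c1"), ("b1", "c2"), ("x1", "y1")]

# MATCH (a)->(b), (b)->(c): the shared variable b merges the two patterns
# into a single join on the middle vertex.
joined = [(a, b, c) for (a, b) in edges for (b2, c) in edges if b == b2]

# MATCH (a)->(b), (x)->(y): no shared variable, so the patterns combine
# as a Cartesian product of their matches.
cartesian = list(product(edges, edges))

assert len(joined) == 2            # (a1, b1, c1) and (a1, b1, c2)
assert len(cartesian) == len(edges) ** 2
```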

v20260113: Consolidated Implementation (Jan 2026)

The latest patch (v20260113) merges all prior work into a single WIP patch:

  • ECPG support: SQL/PGQ in ECPG—basic queries, prepared statements, cursors, dynamic queries. Label disjunction in ECPG required changes to the ecpg lexer.
  • pg_overexplain integration: Property graph RTEs and RELKIND_PROPGRAPH are recognized for EXPLAIN (RANGE_TABLE, ...).
  • Extended test coverage: create_property_graph.sql (365 lines), graph_table.sql (561 lines), graph_table_rls.sql (363 lines), privileges.sql (58 lines).
  • rewriteGraphTable.c: Grown from ~420 to ~1,330 lines; propgraphcmds.c from ~1,000 to ~1,860 lines.

Community Insights

Andres Freund: Concerns About the Rewriter

Andres Freund raised a structural concern: performing the transformation in the rewrite system prevents the planner from exploiting graph semantics and expands the rewrite system's responsibilities. Peter responded that PGQ is designed as relational-at-core (like view expansion), and the standard and other implementations follow this model. Tomas Vondra wondered whether retaining graph structure longer could enable graph-specific indexes or executor nodes; Ashutosh Bapat noted that many optimizations would improve the underlying joins anyway, and aligning with view expansion makes sense for security.

Florents Tselai: Documentation

Florents Tselai suggested reordering the docs to answer "I have some data modeled as a graph G=(V, E). Can Postgres help me?" first, and to use runnable examples from graph_table.sql. He compared with Apache Age’s jsonpath-style approach but agreed the standard’s relational mapping fits core PostgreSQL.

Technical Details

Architecture

  • Parser: Grammar for CREATE PROPERTY GRAPH, ALTER PROPERTY GRAPH, DROP PROPERTY GRAPH, and GRAPH_TABLE(... MATCH ... COLUMNS ...).
  • Catalogs: pg_propgraph_element, pg_propgraph_element_label, pg_propgraph_label, pg_propgraph_label_property, pg_propgraph_property.
  • Rewrite: rewriteGraphTable.c transforms graph patterns into joins and unions.
  • Utilities: pg_dump, psql \d/\dG, tab completion; pg_get_propgraphdef() for introspection.
  • ECPG: Full support in embedded SQL.

Access Control

  1. A user needs SELECT on the property graph.
  2. The property graph is security invoker: the current user must have privileges on the underlying tables. Queries only succeed for elements the user can access.
  3. In a security definer view, property graph access uses the view owner’s privileges; base relation access uses the executing user’s privileges.
  4. Security definer property graphs are not implemented (the standard does not mention them).
Collation Rules

  • Properties with the same name must have the same collation across elements.
  • Edge key and vertex key collations must match when keys are explicitly specified (foreign-key-derived links rely on the constraint).
  • Edge-vertex link quals use equality operators; the edge depends on those operators so they cannot be dropped independently.

Current Status

The v20260113 patch is a consolidated WIP. It includes:

  • Full CREATE PROPERTY GRAPH / ALTER / DROP with labels, properties, KEY clauses, SOURCE/DESTINATION REFERENCES
  • GRAPH_TABLE with path patterns, WHERE in patterns, cyclic paths, multi-pattern
  • LABELS() and PROPERTY_NAMES() graph element functions
  • Access permissions, RLS, privileges tests
  • ECPG support, pg_overexplain integration
  • Documentation and regression tests

The patch is not yet committed. Peter and Ashutosh continue to refine it; the rewriter-based design remains the chosen approach, with ongoing review and testing.

Conclusion

SQL/PGQ would bring standardized graph query syntax to PostgreSQL. The implementation has grown from a fragile prototype to a feature-rich patch with cyclic paths, graph element functions, multi-pattern matching, access control, RLS, ECPG support, and comprehensive tests. The main architectural choice—rewriting graph queries to relations in the rewrite system—aligns with the standard and view semantics. If committed, PostgreSQL users could express path patterns over relational data without leaving SQL or maintaining a separate graph database.

References

Reducing LEFT JOIN to ANTI JOIN: A Planner Optimization for "WHERE col IS NULL"

Introduction

In late December 2025, Nicolas Adenis-Lamarre raised a planner optimization on the pgsql-hackers list: automatically detect anti-join patterns in queries that use LEFT JOIN ... WHERE right_table.column IS NULL when that column is known to be non-nullable (e.g. NOT NULL or primary key). Such queries mean "rows from the left side with no matching row on the right," which is exactly what an anti-join expresses. Recognizing this lets the planner choose a dedicated anti-join plan instead of a generic left join + filter, often with better performance.

The discussion drew in Tom Lane, David Rowley, Tender Wang, Richard Guo, and others. A patch evolved through several versions, was submitted to CommitFest, and received detailed review that uncovered correctness issues with nested outer joins and inheritance. This post summarizes the idea, the implementation approach, and the current status.

Why This Matters

Many developers write "find rows in A with no match in B" as:

SELECT a.*
FROM a
LEFT JOIN b ON a.id = b.a_id
WHERE b.some_not_null_col IS NULL;

Because the join is a LEFT JOIN, unmatched rows from a have NULL in all columns from b. Filtering on b.some_not_null_col IS NULL (when that column is NOT NULL in b) therefore keeps only those unmatched rows. Semantically this is an anti-join: "rows in A that do not have a matching row in B."

If the planner does not recognize this pattern, it may implement it as a normal left join plus a filter. If it does recognize it, it can use an explicit Hash Anti Join (or similar), which can be more efficient and can unlock better join order choices. The optimization is "non-mandatory" in the sense that a skilled user could rewrite the query as NOT EXISTS or NOT IN (with due care for NULLs), but automatic detection helps all users and keeps the original SQL readable.
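The equivalence can be checked with a small Python simulation (hypothetical sample rows, not planner code): a LEFT JOIN followed by an IS NULL filter on a column that is never NULL keeps exactly the rows an anti-join keeps.

```python
# Hypothetical sample data for tables a and b; some_not_null_col is NOT NULL in b.
a = [{"id": 1}, {"id": 2}, {"id": 3}]
b = [{"a_id": 1, "some_not_null_col": "x"}]

# LEFT JOIN b ON a.id = b.a_id, then WHERE b.some_not_null_col IS NULL.
left_join_filtered = []
for ra in a:
    matches = [rb for rb in b if rb["a_id"] == ra["id"]]
    if matches:
        for rb in matches:
            if rb["some_not_null_col"] is None:  # never true: column is NOT NULL
                left_join_filtered.append(ra)
    else:
        left_join_filtered.append(ra)            # unmatched row: b columns are all NULL

# ANTI JOIN: rows of a with no matching row in b.
anti_join = [ra for ra in a if not any(rb["a_id"] == ra["id"] for rb in b)]

assert left_join_filtered == anti_join           # only the unmatched rows survive
```

The IS NULL branch inside the matched case is dead code precisely because the column is non-nullable, which is what licenses the reduction.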

Technical Background

PostgreSQL already has logic to reduce outer joins in certain cases. In particular:

  • Commits 904f6a593 and e2debb643 added infrastructure that the planner can use for this kind of reduction.
  • In reduce_outer_joins_pass2, the planner already tries to reduce JOIN_LEFT to JOIN_ANTI when the join's own quals are strict for some var that was forced null by higher qual levels (e.g. by an upper WHERE).

The existing comment in that area noted that there are other ways to detect an anti-join—for example, checking whether vars from the right-hand side are non-null because of table constraints (NOT NULL, etc.). That was left for later; Nicolas's proposal and Tender's patch implement exactly that: use NOT NULL (and related) information to detect when WHERE rhs_col IS NULL implies "no match," and thus when a LEFT JOIN can be reduced to an ANTI JOIN.

Patch Evolution

Nicolas's initial patch

Nicolas sent a draft patch that:

  • Detected the pattern "left join b where x is null" when x is a non-null var from the right-hand side (RTE).
  • Was intentionally quick-and-dirty to see if the change was feasible.

He also listed other ideas (removing redundant DISTINCT/GROUP BY, folding double ORDER BY, anti-join on NOT IN, and a way to "view the rewritten query"). Those were discussed briefly but are not the focus of this post.

Tom Lane and David Rowley

Tom Lane replied that:

  • The optimization is reasonable and the new infrastructure (904f6a593, e2debb643) should be used.
  • The draft should not leave nearby comments outdated; keeping comments accurate is mandatory.

David Rowley suggested:

  • Using find_relation_notnullatts() and comparing with forced_null_vars, with care for FirstLowInvalidHeapAttributeNumber.
  • Searching the archives for prior work on UniqueKeys (for redundant DISTINCT removal).
  • Being cautious about "remove double order" and "NOT IN" anti-join; both have been discussed before and have subtle edge cases.
  • Noting that "view the rewritten query" is ambiguous—many optimizations cannot be expressed back as a single SQL statement.

Tender Wang's implementation (v2–v4)

Tender Wang provided a patch that:

  • Used the infrastructure from 904f6a593 and e2debb643.
  • Updated the comments in reduce_outer_joins_pass2 to describe the new case (detecting anti-join via NOT NULL constraints on the RHS).
  • Added regression tests.

Nicolas then:

  • Confirmed that Tender's patch was correct (after re-testing).
  • Suggested an early exit: only run the new logic when forced_null_vars != NIL, to avoid calling find_nonnullable_vars and have_var_is_notnull on every left join when most have no forced-null vars.
  • Contributed extra regression tests using new tables (with NOT NULL constraints) instead of modifying existing test tables like tenk1.

Tom Lane clarified that modifying common test objects (e.g. in test_setup.sql) is a bad idea, as it can change planner behavior and break or alter other tests. New tests should use new tables or existing ones that already match the needed properties.

Tender incorporated Nicolas's early-exit and regression tests into a single v4 patch and submitted it to CommitFest.

Richard Guo's review: correctness issues

Richard Guo reviewed the v4 patch and found two correctness problems.

1. Nested outer joins

When the right-hand side of the left join itself contains an outer join, a column that is NOT NULL in its base table can still become NULL in the join result. Reducing the outer join to an anti-join in that case is wrong.

Example (tables t1, t2, t3 with columns e.g. (a NOT NULL, b, c)):

EXPLAIN (COSTS OFF)
SELECT * FROM t1
LEFT JOIN (t2 LEFT JOIN t3 ON t2.c = t3.c) ON t1.b = t2.b
WHERE t3.a IS NULL;

Here t3.a is NOT NULL in t3, but because of the inner t2 LEFT JOIN t3, a row from t1 can be joined to the subquery and still have t3.a NULL (when there is no matching row in t3). So the upper join must remain a left join; converting it to an anti-join would drop rows incorrectly.
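A small Python simulation (hypothetical sample rows, not planner code) makes the hazard concrete: with a nested left join, t3.a can surface as NULL even though it is NOT NULL in its base table, so the anti-join rewrite returns a different answer.

```python
t1 = [{"a": 1, "b": 1}]
t2 = [{"a": 2, "b": 1, "c": 9}]
t3 = [{"a": 3, "c": 3}]                          # t3.a is NOT NULL in the base table

# inner: t2 LEFT JOIN t3 ON t2.c = t3.c
inner = []
for r2 in t2:
    m = [r3 for r3 in t3 if r3["c"] == r2["c"]]
    if m:
        inner.extend({**r2, "t3_a": r3["a"]} for r3 in m)
    else:
        inner.append({**r2, "t3_a": None})       # t3 columns nulled by the inner join

# outer: t1 LEFT JOIN inner ON t1.b = t2.b, then WHERE t3.a IS NULL
correct = []
for r1 in t1:
    m = [ri for ri in inner if ri["b"] == r1["b"]]
    rows = [{**r1, "t3_a": ri["t3_a"]} for ri in m] if m else [{**r1, "t3_a": None}]
    correct.extend(r for r in rows if r["t3_a"] is None)

# Invalid rewrite: ANTI JOIN of t1 against inner on t1.b = t2.b
wrong = [r1 for r1 in t1 if not any(ri["b"] == r1["b"] for ri in inner)]

assert len(correct) == 1 and len(wrong) == 0     # the rewrite loses a row
```

The t1 row does match the t2 side, so the anti-join discards it, yet the correct result keeps it because t3.a was nulled by the inner join rather than by a missing t2 match.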

The patch was treating any var from a NOT NULL column as "safe" for the anti-join reduction without considering whether that var could be nulled by a lower-level outer join. Richard noted that we don't currently record varnullingrels in forced_null_vars, so a simple fix would be to only do this optimization when the RHS has no outer joins (right_state->contains_outer false), but that would be too restrictive.

His proposed direction: in reduce_outer_joins_pass1_state, record the relids of base rels that are nullable within each subtree. Then, when checking NOT NULL constraints, skip vars that come from those rels. He attached a v5 patch illustrating this idea.

2. Inheritance

For inheritance parent tables, some child tables might have a NOT NULL constraint on a column while others do not. The patch did not account for that; the second issue is more straightforward to fix than the nested-outer-join case.

Other discussion points

  • Pavel Stehule asked participants to avoid top-posting on the list; the PostgreSQL wiki has guidelines on mailing list style.
  • Constants from subqueries: Nicolas noted that a case like SELECT * FROM a LEFT JOIN (SELECT 1 AS const1 FROM b) x WHERE x.const1 IS NULL is not handled; he considered it not worth handling.

Current Status

  • The v4 patch (with early exit and regression tests) was submitted to CommitFest (patch 6375).
  • Richard Guo's v5 patch addresses the nested-outer-join and inheritance issues by tracking nullable base rels and tightening when NOT NULL can be used for the reduction.
  • As of the thread, the discussion was ongoing; the final resolution (e.g. commit of a revised patch) would be tracked on the list and in CommitFest.

Conclusion

Automatically reducing LEFT JOIN ... WHERE rhs_not_null_col IS NULL to an anti-join when the column is provably non-nullable is a useful planner optimization that can improve performance without requiring users to rewrite queries. The patch has evolved from a draft to an implementation using the existing planner infrastructure, with regression tests and an early-exit optimization. Reviewer feedback has identified important correctness constraints: the RHS may contain nested outer joins or inheritance, so NOT NULL must be applied only when the var cannot be nulled by lower joins or by inheritance. Follow-up work centers on Richard's approach (recording nullable base rels and restricting the NOT NULL check accordingly) and on handling inheritance safely.

References

Week 06 (2026)

PostgreSQL mailing list discussions for Week 06, 2026.

Articles

COPY TO with JSON Format: Native JSON Export from PostgreSQL

Introduction

In November 2023, Davin Shearer asked on the pgsql-general list how to emit JSON from PostgreSQL to a file using COPY TO. When he used COPY TO with a query that produced a single JSON column (e.g. json_agg(row_to_json(t))), the text format applied its own quoting rules: double quotes inside the JSON were escaped again, producing invalid JSON that tools like jq could not parse. The community agreed that a proper solution would be a native JSON format for COPY TO—so that one column of JSON (or a row rendered as one JSON object) is written as valid JSON without an extra layer of text/CSV escaping.

Building on that idea, jian he posted a series of patches in a long-running thread that also references Joe Conway’s earlier COPY/JSON discussion. The design that emerged is: add a FORMAT json option for COPY TO only, and an optional FORCE_ARRAY option to wrap the output in a JSON array. The thread has seen many revisions (v8 through v23), with feedback from Tom Lane, Joe Conway, Alvaro Herrera, Joel Jacobson, Junwang Zhao, and others. This post summarizes the discussion, the implementation, and the current status.

Why This Matters

  • Correct JSON export: Today, exporting a query result as JSON from the server usually means using COPY TO in text or CSV format. Text format treats the result as a string and escapes quotes and backslashes, which breaks JSON. A dedicated JSON format writes each row as one JSON object (or one JSON value per column) with proper escaping, so the output is valid JSON.
  • Interoperability: Many pipelines expect JSON (e.g. one JSON object per line, or a single JSON array). Native COPY TO ... (FORMAT json) and FORCE_ARRAY allow exporting directly from the database without client-side formatting or workarounds (e.g. psql -t -A or LO API).
  • Consistency with existing formats: COPY already supports text, csv, and binary. Adding json keeps the same mental model: choose a format and get correctly encoded output.
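The original breakage is easy to reproduce outside PostgreSQL. The sketch below uses Python's csv and json modules to show how CSV-style quoting (doubling embedded quotes, as COPY's csv format does) turns a valid JSON field into an unparsable line; it is illustrative, not COPY's actual code path:

```python
import csv
import io
import json

payload = json.dumps({"msg": 'say "hi"'})    # valid JSON containing double quotes
assert json.loads(payload)                   # parses fine on its own

# CSV quoting wraps the field and doubles the inner quotes.
buf = io.StringIO()
csv.writer(buf).writerow([payload])
escaped = buf.getvalue().strip()

try:
    json.loads(escaped)
    parsed = True
except json.JSONDecodeError:
    parsed = False
assert not parsed                            # the escaped line is no longer valid JSON
```

A dedicated FORMAT json sidesteps this entirely: the row is emitted with JSON escaping only, so no second quoting layer is applied.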

Technical Analysis

Design Decisions

The patch set makes these choices:

  1. JSON format is COPY TO only. COPY FROM with JSON is not supported (parsing arbitrary JSON is a larger feature). The grammar and option validation reject FORMAT json for COPY FROM.
  2. No HEADER with JSON. The documentation and code disallow HEADER when using JSON format, to avoid mixing a header line with JSON lines/array.
  3. One logical column in protocol. In JSON mode, the frontend/backend Copy protocol sends a single (non-binary) column; the row is rendered as one JSON value (e.g. one object per row).
  4. FORCE_ARRAY only with JSON. The FORCE_ARRAY option wraps the entire COPY output in [ ... ] and inserts commas between rows, so the result is a single JSON array. It is only valid with FORMAT json.

Patch Structure

  • Patch 1 (from v13) — CopyFormat refactor: Joel Jacobson introduced an enum CopyFormat (e.g. COPY_FORMAT_TEXT, COPY_FORMAT_CSV, COPY_FORMAT_BINARY) and replaced the two booleans csv_mode and binary in CopyFormatOptions with a single format field. This makes adding new formats (like JSON) cleaner. jian he later refactored this patch to address review feedback; Junwang Zhao adapted it to the new CopyToRoutine structure in the executor.

  • Patch 2 — JSON format for COPY TO

    • Grammar (gram.y): Add JSON as a format option and allow FORMAT json in COPY options.
    • Options (copy.c, copy.h): From v13, format is represented by CopyFormat; JSON adds COPY_FORMAT_JSON and the same checks: no HEADER/default/null/delimiter with JSON, no JSON with COPY FROM.
    • Copy protocol (copyto.c): In SendCopyBegin, when in JSON mode, send a single column with format 0 (text) instead of per-column formats.
    • Row output (copyto.c): In CopyOneRowTo, when json_mode is set, the row is converted to JSON via composite_to_json() (from utils/adt/json.c) and the resulting string is sent. For query-based COPY (no relation), the patch ensures the slot’s tuple descriptor matches the query’s so that composite_to_json sees the correct attribute metadata for key names.
    • json.c: composite_to_json() is changed from static to exported and declared in utils/json.h so COPY can call it.
  • Patch 3 — FORCE_ARRAY for COPY TO

    • Options (copy.c, copy.h): Add force_array and parse force_array / force_array true|false. Validation: FORCE_ARRAY is only allowed with JSON mode (v12+ uses ERRCODE_INVALID_PARAMETER_VALUE for the error).
    • Output (copyto.c): Before the first row, if JSON mode and force_array, send [ and a newline; between rows, send , before each JSON object (using a json_row_delim_needed flag); after the last row, send ] and newline. Default output (without FORCE_ARRAY) remains one JSON object per line.

Evolution: v8 Through v23

  • v8 took a larger approach: extracting COPY TO/FROM format implementations and adding a pluggable mechanism (including a contrib module pg_copy_json). Reviewers preferred a smaller, in-core change.
  • v9–v10 simplified to adding only the JSON format for COPY TO (no pluggable API). v10 introduced the json_mode flag and the use of composite_to_json.
  • v11 added the FORCE_ARRAY option and the corresponding tests; it also fixed the error code for “COPY FROM with json” and tightened option validation.
  • v12 (August 2024): only patch 2 (FORCE_ARRAY) was resent; the error when FORCE_ARRAY is used without JSON was changed to ERRCODE_INVALID_PARAMETER_VALUE.
  • v13 (October 2024): Joel Jacobson contributed patch 0001 — introduce CopyFormat enum and replace csv_mode and binary in CopyFormatOptions with a single format field. Patches 0002 (json format) and 0003 (force_array) were rebased on top; the docs explicitly state that JSON cannot be used with header, default, null, or delimiter.
  • v14–v22: Mostly rebases and adaptation to upstream changes. v14 dropped the separate CopyFormat patch in some postings (rebased onto different base commits). Junwang Zhao (v15, March 2025) adapted the JSON implementation to the new CopyToRoutine struct (commit 2e4127b6d2). Subsequent versions (v16–v22) continued to rebase and address review comments without changing the core design.
  • v23 (January 2026): Current series. Three patches: (1) CopyFormat refactor (originally Joel Jacobson; refactored by jian he), (2) json format for COPY TO (Author: Joe Conway; Reviewed-by multiple contributors including Andrey M. Borodin, Dean Rasheed, Daniel Verite, Andrew Dunstan, Davin Shearer, Masahiko Sawada, Alvaro Herrera), (3) FORCE_ARRAY for COPY JSON format. The feature set is unchanged; the patch set is rebased and has gathered substantial review.

Code Highlights

Row as JSON (patch 2). Each row is turned into a single JSON object via the existing composite_to_json():

rowdata = ExecFetchSlotHeapTupleDatum(slot);
result = makeStringInfo();
composite_to_json(rowdata, result, false);
CopySendData(cstate, result->data, result->len);

FORCE_ARRAY framing (patch 3). Before the row loop, send [; for each row after the first, send , then the object; after the loop, send ]:

if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, '[');
    CopySendEndOfRow(cstate);
}
// ... row loop: first row no comma, then CopySendChar(cstate, ','); then object ...
if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, ']');
    CopySendEndOfRow(cstate);
}
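The same framing can be mimicked in Python to confirm the output shape. This is an illustrative sketch with made-up rows, not the patch's C implementation:

```python
import json

rows = [{"col": 1}, {"col": 2}, {"col": 3}]   # hypothetical result rows

lines = ["["]                                  # opening bracket on its own line
for i, row in enumerate(rows):
    prefix = "," if i > 0 else ""              # comma before every row but the first
    lines.append(prefix + json.dumps(row))
lines.append("]")                              # closing bracket

output = "\n".join(lines)
assert json.loads(output) == rows              # one valid JSON array overall
```

Whitespace between array elements is insignificant in JSON, which is why the newline-separated framing still parses as a single array.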

Example usage (from regression tests)

COPY copytest TO STDOUT (FORMAT json);
-- One JSON object per line.

COPY copytest TO STDOUT (FORMAT json, force_array true);
-- Single JSON array: [ {"col":1,...}, {"col":2,...} ]

Community Insights

Original Problem and Workarounds

Davin’s initial issue was that COPY’s text format escaped his JSON again, breaking it. David G. Johnston and Adrian Klaver suggested using psql to write query results to a file instead of COPY. Dominique Devienne and David G. Johnston agreed that a “not formatted” or raw option for COPY would help when dumping a single column (e.g. JSON) as-is. Tom Lane concurred that a copy option for unformatted output would address the use case. That consensus led to the idea of a dedicated JSON format rather than overloading text/CSV.

Reviewer Feedback and Refinements

  • Tom Lane and others emphasized keeping the change minimal: add JSON output for COPY TO without a large refactor. That pushed the design from the v8-style pluggable formats to the current in-core JSON path.
  • Joe Conway had previously discussed COPY and JSON in a separate thread; jian he’s patches reference that discussion and align with the idea of first-class JSON output for COPY TO.
  • Alvaro Herrera and other reviewers have commented on the thread; the evolution from v8 to v9/v10 reflects their preference for a smaller, focused patch set.

Edge Cases in the Patch

  • Query-based COPY: When the source is a query (no relation), the tuple descriptor from the slot can differ from the query’s. The patch copies the query’s attribute metadata into the slot’s tuple descriptor so composite_to_json generates correct key names.
  • Protocol: In JSON mode, the Copy protocol sends one column; the backend still produces one JSON “value” per row (one object, or one element inside the array when FORCE_ARRAY is used).

Technical Details

Implementation Notes

  • Escaping: JSON string escaping is handled by composite_to_json() and the existing escape_json-style helpers in json.c, so quotes, backslashes, and control characters in column values are encoded correctly.
  • HEADER: Explicitly disallowed with JSON to keep the stream either pure JSON lines or a single JSON array.
  • FORCE_ARRAY output: The regression tests show that with force_array, the output is [, then newline, then the first object; then for each next row, , followed by the object; then ]. So the result is a single valid JSON array (with optional whitespace/newlines between elements).

Limitations

  • COPY FROM: No JSON import; only COPY TO is extended.
  • HEADER: Not supported in JSON mode.
  • Binary: JSON format is text-only (no binary JSON in this patch).

Current Status

  • The thread has seen activity from 2023 through early 2026. The latest series is v23 (January 2026): three patches (CopyFormat refactor, json format for COPY TO, FORCE_ARRAY).
  • As of the thread snapshot, the patches have not been committed; they remain under discussion. v23 reflects the current design and review state.
  • Design: COPY TO ... (FORMAT json) and optional (FORMAT json, force_array true) for a single JSON array; JSON is incompatible with HEADER, DEFAULT, NULL, DELIMITER, and COPY FROM.

Conclusion

The “Emitting JSON to file using COPY TO” thread started with a user hitting double-escaping when exporting JSON via COPY. The community agreed that a native JSON format for COPY TO was the right fix. jian he (and later Junwang Zhao) implemented FORMAT json (COPY TO only; incompatible with HEADER, DEFAULT, NULL, DELIMITER) and FORCE_ARRAY, reusing composite_to_json() and the existing JSON escaping. Joel Jacobson’s CopyFormat refactor (v13+) replaced the format booleans with an enum, making the codebase ready for JSON and future formats. The series has evolved through v23 with rebases and adaptations (e.g. to CopyToRoutine) and has received Reviewed-by from several committers.

References

Week 05 (2026)

PostgreSQL mailing list discussions for Week 05, 2026.

Articles

GOO: A Greedy Join Search Algorithm for Large Join Problems

Introduction

PostgreSQL uses different strategies for join ordering depending on query complexity. For queries with fewer than geqo_threshold relations (default 12), the planner uses dynamic programming (DP) to find the optimal join order. For larger join graphs, it falls back to GEQO (Genetic Query Optimizer), which uses a genetic algorithm to search the space of possible join orders in a more scalable but heuristic way. GEQO has well-known drawbacks: it can be slower than DP for moderate join counts due to overhead, and it lacks tuning knobs like a reproducible seed for debugging.

In December 2025, Chengpeng Yan proposed GOO (Greedy Operator Ordering) on the pgsql-hackers mailing list—a deterministic greedy join-order search method intended as an alternative to GEQO for large join problems. The algorithm is based on the 1998 DEXA paper by Leonidas Fegaras, "A New Heuristic for Optimizing Large Queries."

Why This Matters

  • Planning time: On star/snowflake and TPC-DS style workloads, GOO can plan much faster than GEQO (e.g., ~5s vs ~20s for EXPLAIN on 99 TPC-DS queries in one test), while DP remains fastest when the join count is below the threshold.
  • Plan quality: The goal is to be "good enough" where GEQO is used today—reducing tail regressions and offering more predictable behavior than a genetic search.
  • Memory: GOO’s structure (iterative, commit-to-one-join-per-step) suggests lower memory use than full DP and different characteristics than GEQO; the author plans to measure this.

Understanding this thread helps you see how PostgreSQL might evolve for queries with many joins and where GOO fits next to DP and GEQO.

Technical Analysis

The GOO Algorithm

GOO builds the join order incrementally:

  1. Start with one "clump" per base relation.
  2. At each step, consider all legal join pairs (clumps that can be joined under the query’s join constraints).
  3. For each pair, build the join relation and use the planner’s existing cost model to get a total cost.
  4. Choose the pair with the lowest estimated total cost, merge them into one clump, and repeat.
  5. Stop when a single clump remains; that clump’s best path is the chosen plan.

So the "greedy" choice is: among all current clumps, always do the join that looks cheapest right now. The paper uses estimated result size; the patch uses the planner’s existing total_cost for consistency with the rest of PostgreSQL.

Complexity: Time is O(n³) in the number of base relations n: there are (n−1) merge steps, and with k clumps remaining each step examines O(k²) candidate pairs.
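The loop can be sketched in a few lines of Python. This is an illustrative simulation with a made-up cost function and toy cardinalities, not the patch's C code, which uses the planner's real total_cost:

```python
from itertools import combinations

# Hypothetical per-relation cardinalities; the "cost" of joining two clumps
# is the product of their sizes, a crude stand-in for total_cost.
clumps = [("t1", 1000), ("t2", 10), ("t3", 500), ("t4", 20)]

def join_cost(a, b):
    return a[1] * b[1]

order = []
while len(clumps) > 1:
    # Consider every pair (here all pairs are "legal") and pick the cheapest.
    a, b = min(combinations(clumps, 2), key=lambda p: join_cost(*p))
    clumps.remove(a)
    clumps.remove(b)
    merged = (f"({a[0]} JOIN {b[0]})", a[1] * b[1] // 100 + 1)  # toy size estimate
    clumps.append(merged)
    order.append(merged[0])

# After n-1 merges a single clump remains; its shape is the chosen join order.
assert len(clumps) == 1
```

Note the greedy commitment: once a pair is merged it is never reconsidered, which is what keeps the search polynomial and also what creates the tail-risk cases discussed below.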

Integration with the Planner

The patch adds:

  • enable_goo_join_search — GUC to turn GOO on (default off).
  • Threshold: For now, GOO reuses geqo_threshold: when the number of join levels is ≥ geqo_threshold, and GOO is enabled, the planner calls goo_join_search() instead of GEQO. So GOO is positioned as a GEQO replacement, not a replacement for DP.

Relevant code in allpaths.c:

else if (enable_goo_join_search && levels_needed >= geqo_threshold)
    return goo_join_search(root, levels_needed, initial_rels);
else if (enable_geqo && levels_needed >= geqo_threshold)
    return geqo(root, levels_needed, initial_rels);

Greedy Strategies Under Experiment

The author experimented with different signals for "cheapest" in the greedy step:

  • Cost: Use the planner’s total_cost for the join (baseline).
  • Result size: Use estimated output size in bytes (reltarget->width * rows).
  • Rows: Use estimated output row count.
  • Selectivity: Use join selectivity (output rows / (left_rows × right_rows)).
  • Combined: Run GOO twice (cost and result_size), then pick the plan with lower final estimated cost.

Findings:

  • Cost alone can produce very bad tail cases (e.g., JOB: max 431×, many ≥10× regressions).
  • Result size is better on average but still has bad tails (e.g., 67× max on JOB).
  • Combined (cost + result_size, pick cheaper) improves robustness: best geometric mean, no ≥10× regressions in the JOB subset, worst case 8.68×.

So the discussion shifted from "which single metric?" to "how to reduce tail risk?"—e.g., by combining multiple greedy strategies and choosing the better plan.
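The combined strategy amounts to running the same greedy search once per metric and keeping whichever final plan the cost model prefers. A sketch (Python; function names and the toy metrics are illustrative, not the patch's):

```python
# "Combined" greedy strategy: run the greedy join search once per metric,
# then select the plan with the lowest final estimated cost.
def greedy_plan(base, metric):
    clumps = [(r,) for r in base]
    plan = []
    while len(clumps) > 1:
        pairs = [(i, j) for i in range(len(clumps))
                 for j in range(i + 1, len(clumps))]
        i, j = min(pairs, key=lambda p: metric(clumps[p[0]], clumps[p[1]]))
        plan.append((clumps[i], clumps[j]))
        merged = clumps[i] + clumps[j]
        clumps = [c for k, c in enumerate(clumps) if k not in (i, j)] + [merged]
    return plan

def final_cost(plan):
    # Stand-in for the planner's total cost of the completed plan.
    return sum(len(a) * len(b) for a, b in plan)

def combined(base, metrics):
    # Plan diversity: two candidate plans, pick the cheaper one.
    return min((greedy_plan(base, m) for m in metrics), key=final_cost)

by_cost = lambda a, b: len(a) + len(b)   # toy "cost" metric
by_size = lambda a, b: len(a) * len(b)   # toy "result size" metric
plan = combined(["t1", "t2", "t3", "t4"], [by_cost, by_size])
print(len(plan))  # 3 merges for 4 relations
```

The extra planning cost is roughly one more greedy pass per added metric, which is why the thread treats this as cheap insurance against tail regressions.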

Community Insights

Benchmark Clarification

Dilip Kumar initially questioned the use of pgbench, since default pgbench queries do not exercise join search. The author clarified: the numbers came from custom star-join and snowflake workloads from Tomas Vondra’s earlier thread, not from default pgbench. Those workloads use multi-table joins and empty tables, so the reported throughput mainly reflects planning time of DP vs GEQO vs GOO.

Crashes and the Eager-Aggregation Fix (v1 → v2)

Tomas Vondra reported crashes when running EXPLAIN on TPC-DS queries. The backtrace showed sort_inner_and_outer() with inner_path = NULL. Root cause: in some paths, the GOO code built join relations without calling set_cheapest(), so later code saw missing cheapest path. Tomas also narrowed it to queries with aggregates in the select list (e.g., TPC-DS Q7).

The fix in v2 was to correctly handle eager aggregation: the planner can create grouped/base relations that require a proper cheapest path. After the v2 fix, all 99 TPC-DS queries could be planned without crashes.

GEQO Slower Than DP for Moderate Join Counts

Tomas reported EXPLAIN times for 99 TPC-DS queries (3 scales × 0/4 workers):

  • master (DP): 8s
  • master/geqo: 20s
  • master/goo: 5s

So for that workload, GEQO was slower than DP, and GOO was fastest. John Naylor and Pavel Stehule noted that GEQO is designed to win only when the join problem is large enough; for smaller or moderate join counts, its overhead can dominate. So the comparison should focus on the range where GEQO is actually used (e.g., relation count above geqo_threshold).

TPC-DS Execution Results (Tomas Vondra)

Tomas shared full TPC-DS run results (scale 1 and 10, 0 and 4 workers). Summary of total duration (all 99 queries):

  • Scale 1: GOO was slower than both master and GEQO (e.g., ~1124s vs 399s geqo vs 816s master).
  • Scale 10: GOO was faster (e.g., ~1859s vs 2325s geqo vs 2439s master).

So GOO behaved worse at small scale and better at larger scale—suggesting workload- and scale-dependent behavior. Tomas suggested inspecting queries that got worse to refine heuristics and to test with larger data sets and cold cache.

TPC-H: Failure Modes of Single-Metric Greedy (v3)

The author ran TPC-H SF=1 with four strategies: rows, selectivity, result size, and cost. Main lessons:

  • Q20: Join between partsupp and an aggregated lineitem subquery. The row count was misestimated by orders of magnitude (tens vs hundreds of thousands). Output-oriented rules (rows, selectivity, result size) favored this join very early because it appeared to shrink the result dramatically; in reality it produced a huge intermediate and blew up downstream cost. So bad estimates can make output-oriented greedy rules fail badly.
  • Q7: Cost-based greedy chose a locally cheap join that created a large many-to-many intermediate, which made later joins much more expensive. So locally optimal cost can be globally bad.

Tomas pointed out: Q20 is largely an estimation problem (garbage in, garbage out); Q7 is inherent to greediness—locally good choices can be globally poor, and that’s not fixable by picking a different single metric.
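The Q20 failure mode is easy to see in miniature: an output-oriented greedy rule ranks joins by estimated output rows, so a join whose estimate is off by orders of magnitude looks like the best first move. (Python sketch; the relation names echo Q20 but the numbers are invented to match the thread's "tens vs hundreds of thousands" description.)

```python
# A misestimate misleads an output-oriented greedy rule: the rule picks the
# join with the smallest *estimated* output, which here is also the join
# with the largest *actual* output.
est_rows = {("partsupp", "agg_lineitem"): 40,       # misestimated (Q20-like)
            ("partsupp", "supplier"): 8_000}
true_rows = {("partsupp", "agg_lineitem"): 400_000,  # actual size
             ("partsupp", "supplier"): 8_000}

pick = min(est_rows, key=est_rows.get)  # greedy: "smallest" output first
print(pick, true_rows[pick])            # the chosen join is actually huge
```

Q7 is different: there the estimates were fine, and the locally cheapest join was still globally bad, which no single-metric greedy rule can repair.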

JOB and Combined Strategy (v4)

On the full JOB workload, the combined strategy (cost + result_size, pick cheaper) gave:

  • Best geometric mean (0.953 vs DP).
  • No regressions ≥10×; max 8.68×.
  • Fewer bad tail cases than GOO(cost) or GOO(result_size) alone.

So increasing plan diversity by running two greedy criteria and selecting the cheaper plan helps avoid catastrophic plans without much extra planning cost.

Scope: GOO as GEQO Replacement

Tomas asked whether the goal was to replace DP or GEQO. The author confirmed: GOO is intended as a GEQO replacement, not to replace DP. When the join count is below geqo_threshold, DP should remain in use.

Literature and Next Steps

Tomas pointed to the CIDR 2021 paper "Simplicity Done Right for Join Ordering" (Hertzschuch et al.), which focuses on robustness (e.g., upper bound / worst-case join orders) and trusting base relation estimates—potentially relevant to nestloop blowups when cardinality is over-optimistic. The author plans to establish a solid baseline with the current approach, then incorporate ideas from such work incrementally.

Technical Details

Implementation Approach

  • New files: src/backend/optimizer/path/goo.c, src/include/optimizer/goo.h.
  • GOO builds join relations by repeatedly calling the existing planner routines (e.g., make_join_rel, path creation), so it reuses the same cost model and path types as DP/GEQO.
  • Memory is managed with multiple memory contexts to limit usage during candidate evaluation.

Edge Cases and Robustness

  • Eager aggregation: v2 fixed crashes by ensuring join relations created during GOO have proper cheapest paths set (so code like sort_inner_and_outer never sees NULL inner_path).
  • Cardinality misestimation: All approaches suffer when estimates are wrong; GOO’s sensitivity differs by strategy (e.g., output-oriented rules can be worse when row estimates are off). The combined strategy is aimed at reducing tail risk rather than fixing estimation.
  • Structural limits: Some shapes (e.g., star with fan-out) cause both cost and result_size to pick similarly bad plans; that’s a limitation of myopic greedy enumeration.

Performance Considerations

  • Planning time: GOO is O(n³) and in practice was faster than GEQO in the reported benchmarks; the author plans to add explicit planning-time and memory measurements.
  • Execution time: Highly workload-dependent; GOO can be better or worse than GEQO/DP depending on scale and query mix (e.g., TPC-DS scale 1 vs 10, JOB GEQO-relevant subset).

Current Status

  • Patch: v4 is the latest revision at the time of writing. v4-0001 is the core GOO implementation (unchanged from v3-0001); v4-0002 adds test-only GUCs and machinery for trying different greedy strategies (e.g., combined).
  • Goal: Establish GOO as a viable GEQO replacement—same threshold, better or comparable plan quality and planning time, with reduced tail regressions.
  • Next steps (from the author): Broader evaluation (more workloads, larger join graphs, cold cache, larger scale factors); consider adding selectivity to the combined strategy; measure planning time and memory; look at tunability and graceful degradation (e.g., DP up to a resource limit then greedy).

Conclusion

The GOO thread shows a serious effort to replace GEQO with a deterministic, greedy join ordering algorithm that:

  • Reuses the existing cost model and planner infrastructure.
  • Improves planning time in several benchmarks compared to GEQO.
  • Reduces worst-case plan regressions by combining multiple greedy strategies (e.g., cost and result_size) and picking the cheaper plan.

Limitations are acknowledged: greedy methods are inherently local and can produce bad plans when estimates are wrong or when the join graph has unfavorable structure. The focus has shifted to robustness and tail behavior rather than perfect single-metric tuning. For PostgreSQL users, this is a patch to watch: if committed, it would offer an alternative to GEQO for complex queries, with different trade-offs and potentially better predictability and planning performance.

References

Batching in Executor: Batch-Oriented Tuple Processing

Introduction

PostgreSQL’s executor has long been tuple-at-a-time: each plan node typically requests one tuple from its child, processes it, and passes one result tuple upward. That design is simple and works well for OLTP, but for analytical and bulk workloads the per-tuple overhead—especially repeated function-call and expression-evaluation cost—can dominate. At PGConf.dev 2025, the community discussed whether PostgreSQL could support an additional batch executor that moves batches of tuples between nodes instead of one slot at a time.

Following that discussion and off-list input from Andres Freund and David Rowley, Amit Langote posted a patch series on the pgsql-hackers list in September 2025 titled “Batching in executor”. The series introduces a batch table AM API, extends the executor with batch-capable interfaces (ExecProcNodeBatch, TupleBatch), and prototypes batch-aware expression evaluation (including batched quals and aggregate transitions). The goal is to reduce per-tuple overhead, enable future optimizations such as SIMD in aggregate functions, and lay groundwork for columnar or compressed table AMs that benefit from batch-oriented execution.

Why This Matters

  • Executor overhead: In CPU-bound, IO-minimal workloads (e.g., fully cached tables), a large share of time goes into the executor. Batching reduces calls into the table AM and expression interpreter, and can cut function-call overhead by evaluating expressions over many rows at once.
  • Aggregates and analytics: Batched transition evaluation (e.g., count(*), sum(), avg()) can pay fmgr cost per batch instead of per row and opens the door to vectorized or SIMD-friendly code paths.
  • Future table AMs: A batch-oriented executor makes it easier for columnar or compressed table AMs (e.g., Parquet-style) to pass native batch formats without forcing early materialization into heap tuples.
  • OLTP safety: The design keeps the existing row-at-a-time path unchanged; batching is opt-in (e.g., via executor_batching GUC) so OLTP workloads are not affected.

Understanding this thread helps you see how PostgreSQL might gain a second, batch-oriented execution path and what trade-offs (materialization, ExprContext, EEOP design) the community is working through.

Technical Analysis

Patch Structure

The series is split into two parts:

  1. Patches 0001–0003 — Foundation: batch table AM API, heapam batch implementation, and executor batch interface wired to SeqScan.
  2. Patches 0004–0008 — Prototype: batch-aware Agg node, new EEOPs for TupleBatch processing, batched qual evaluation, and batched aggregate transition (row-loop and “direct” per-batch fmgr).

Patches 0001–0003 are intended as the first candidates for review and eventual commit; 0004–0008 are marked WIP/PoC.

Key Abstractions

Table AM batch API (0001) New callbacks let a table AM return multiple tuples per call instead of one. For heap:

  • HeapBatch holds tuples from a single page; size is limited by EXEC_BATCH_ROWS (currently 64) and by not crossing page boundaries.
  • heapgettup_pagemode_batch() fills a HeapTupleData array from the current page, mirroring the logic of heapgettup_pagemode() but for a batch. Visibility and scan direction are handled the same way.

The generic layer introduces a batch type and ops in tableam.h so other AMs can supply their own batch format and implementation.
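The shape of that generic layer—an AM-native payload plus an AM-supplied ops table, wrapped in one container—can be sketched as follows (Python for brevity; the class, field, and callback names are invented, not tableam.h's actual definitions):

```python
# Sketch of the generic batch layer: each table AM supplies its own batch
# representation plus operations on it, wrapped in a generic container so
# the executor never needs to know the AM-native layout.
class TupleBatch:
    def __init__(self, am_batch, ops):
        self.am_batch = am_batch   # AM-native payload (e.g. heap tuple array)
        self.ops = ops             # AM-specific callbacks

# A heap-like AM's ops: it knows how to count and fetch its own tuples.
heap_ops = {
    "nrows": lambda b: len(b),
    "get_row": lambda b, i: b[i],
}

batch = TupleBatch([("t1",), ("t2",)], heap_ops)
print(batch.ops["nrows"](batch.am_batch))  # 2
```

A columnar AM could plug in a completely different payload (column vectors, compressed pages) behind the same container, which is the point of keeping the ops AM-specific.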

Executor batch path (0002–0003)

  • TupleBatch is the container passed between nodes when running in batch mode. It can hold the AM’s native batch (e.g., heap tuples) or materialized slots, depending on the path.
  • ExecProcNodeBatch() is the batch analogue of ExecProcNode(): it returns a TupleBatch* instead of a TupleTableSlot*. PlanState gains an ExecProcNodeBatch function pointer, with the same “first call” and instrumentation wrappers as the row path.
  • SeqScan gets:
    • Batch-driven slot path: still returns one slot per call, but fills it from an internal batch (fewer AM calls).
    • Batch path: when the parent supports batching, SeqScan’s ExecProcNodeBatch returns a TupleBatch directly (e.g., from ExecSeqScanBatch*).

So the first three patches give: (1) table AMs that can produce batches, (2) an executor API to request and pass batches, and (3) SeqScan as the first node that can both consume and produce batches.
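The resulting pull loop can be modeled in a few lines (Python; the constant matches the patch's EXEC_BATCH_ROWS = 64 and the page-bounded batching of the heapam implementation, but the function names are illustrative, not the patch's API):

```python
# Toy model of the batch pull path: the scan yields page-bounded batches of
# at most EXEC_BATCH_ROWS tuples; the consumer makes one call per batch
# instead of one call per tuple.
EXEC_BATCH_ROWS = 64

def seqscan_batches(pages):
    """pages: list of pages, each a list of tuples. Batches never cross a
    page boundary, mirroring the heapam batch implementation."""
    for page in pages:
        for i in range(0, len(page), EXEC_BATCH_ROWS):
            yield page[i:i + EXEC_BATCH_ROWS]

def count_rows(pages):
    # Batch-aware consumer: one pull per batch instead of per tuple.
    total = calls = 0
    for batch in seqscan_batches(pages):
        calls += 1
        total += len(batch)
    return total, calls

pages = [[("row", n) for n in range(43)] for _ in range(3)]  # ~43 tuples/page
total, calls = count_rows(pages)
print(total, calls)  # 129 3
```

Three pages of 43 tuples produce 3 batch calls rather than 129 per-tuple calls, which is where the reduction in AM and node-dispatch overhead comes from.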

Batch-Aware Expression Evaluation (0004–0008)

The later patches experiment with evaluating expressions over a batch of rows:

  • Batch input to Agg: Agg can pull TupleBatch from its child via ExecProcNodeBatch() and feed rows into the aggregate transition in bulk.
  • New EEOPs: Expression interpreter gains steps that operate on TupleBatch data—e.g., fetching attributes into batch vectors, evaluating a qual over a batch, and running aggregate transitions either by looping over rows inside the interpreter (ROWLOOP) or by calling the transition function once per batch with bulk arguments (DIRECT).
  • Batched qual evaluation: A batch of tuples can be filtered with a single pass over the batch (ExecQualBatch and related EEOPs), reducing per-row interpreter and fmgr overhead.

Two prototype paths for batched aggregation are provided: one that iterates over rows in the interpreter (per-row transition), and one that invokes the transition function once per batch (per-batch fmgr). The latter shows larger gains in Amit’s benchmarks when executor cost dominates.
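The difference between the two paths is how often the transition function (and hence fmgr setup) is invoked. A sketch, counting invocations (Python; the ROWLOOP/DIRECT labels follow the thread, everything else is illustrative):

```python
# Contrast of the two prototype aggregation paths: per-row transition calls
# (ROWLOOP) vs one call per batch (DIRECT). Call overhead is modeled as a
# simple counter; the real gain is amortized function-call setup.
calls = {"n": 0}

def sum_trans(state, value):          # ROWLOOP-style: called once per row
    calls["n"] += 1
    return state + value

def sum_trans_batch(state, values):   # DIRECT-style: called once per batch
    calls["n"] += 1
    return state + sum(values)

rows = list(range(256))
batches = [rows[i:i + 64] for i in range(0, len(rows), 64)]

calls["n"] = 0
state = 0
for b in batches:
    for v in b:
        state = sum_trans(state, v)
rowloop_calls = calls["n"]            # one call per row

calls["n"] = 0
state = 0
for b in batches:
    state = sum_trans_batch(state, b)
direct_calls = calls["n"]             # one call per batch

print(rowloop_calls, direct_calls)    # 256 4
```

The per-batch form is also what opens the door to SIMD-friendly transition implementations, since the function sees the whole batch at once.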

Design Choices and Open Points

  • Single-page batches: Heap batches are limited to one page. So batches may be smaller than EXEC_BATCH_ROWS (e.g., with few tuples per page or selective quals). The thread mentions possible future improvements: batches spanning pages or the scan requesting more tuples when the batch is not full.
  • TupleBatch vs ExprContext: The patches extend ExprContext with scan_batch, inner_batch, and outer_batch. Per-batch expression evaluation still uses ecxt_per_tuple_memory, which Amit notes is “arguably an abuse” of the per-tuple contract. A clearer model for batch-scoped memory is still needed.
  • Materialization: Today, batch-aware expression evaluation typically works on tuples materialized into slots (or heap tuple arrays). The long-term goal is to allow expression evaluation on native batch formats (e.g., columnar or compressed) without forcing materialization; that would require more infrastructure (e.g., AM-controlled expression evaluation or batch-aware operators).

Community Insights

Tomas Vondra: Batch Design vs Index Prefetching

Tomas compared the patch to index prefetching work (which he is involved in), which also introduces a “batch” concept for passing data between the index AM and the executor. He noted the designs differ on purpose:

  • Index prefetching: A shared batch struct is filled by the index AM and then managed by indexam.c; the batch is AM-agnostic after that.
  • Executor batching: Each table AM can produce its own batch format (e.g., HeapBatch) wrapped in a generic TupleBatch with AM-specific ops. The executor retains TAM-specific optimizations and relies on the TAM for operations on batch contents.

Amit agreed: for executor batching the aim is to keep TAM-specific behavior and avoid early materialization where possible; for prefetching the aim is a single, indexam-driven batch format. Both designs are consistent with their goals.

Tomas also asked: (1) When must a TupleBatch be materialized into a generic format (e.g., slots)? (2) Can expressions run directly on “custom” batches (e.g., compressed/columnar)? Amit replied that materialization is currently required for expression evaluation but that the design should not block future work to evaluate expressions on native batch data (e.g., columnar or Parquet-style). Giving the table AM more control over how expressions are evaluated on its batch data is a possible future extension.

Tomas Vondra: TPC-H Q22 Segfault and Fix (v3)

Tomas reported a segfault when running TPC-H with batching enabled, only on Q22, with backtraces always pointing to the same place: numeric_avg_accum with a NULL datum (DatumGetNumeric(X=0)), called from ExecAggPlainTransBatch and then agg_retrieve_direct_batch. So the bug was in the batched aggregate path: a NULL was being passed where the transition function expected a valid value.

Amit tracked the crash to the expression interpreter. Two different EEOPs (for the ROWLOOP and DIRECT batched aggregate paths) both called the same helper function. That helper re-derived the opcode at execution time (e.g., via ExecExprEvalOp(op)). In some builds (e.g., clang-17 on macOS), the two EEOP cases compiled to identical code, so their dispatch labels had the same address. The interpreter’s reverse lookup by label address could then return the wrong EEOP; the init path could think it was running the ROWLOOP EEOP while the exec path behaved like the DIRECT EEOP, leading to incorrect state and the NULL/crash.

The fix (in v3, patch 0009) was to split the shared helper into two separate functions, one per EEOP, so the helper no longer re-derives the opcode. With that change, Amit could not reproduce the crash on macOS with clang-17. The same fix addresses the TPC-H Q22 segfault that Tomas saw.

Bruce Momjian: POSETTE Talks and OLTP

Bruce pointed to two POSETTE 2025 talks for context: one on data warehouse needs and one on “Hacking Postgres Executor For Performance”. Amit (who gave the second talk) confirmed that batching is designed to avoid adding meaningful overhead to the OLTP path; the row-at-a-time path remains default and unchanged.

Regression When Batching Is Off

Tomas had observed that with batching disabled (executor_batching=off), the patched tree could be slower than unpatched master—i.e., a regression when the new code path is not used. Amit reproduced this: for example, single-aggregate SELECT count(*) FROM bar and multi-aggregate SELECT avg(a), … FROM bar showed roughly 3–18% slowdown with batching off vs master, depending on row count and parallelism. He acknowledged the regression and said he was looking into it. Ensuring zero or minimal cost when batching is disabled is important for committable patches.

Technical Details

Implementation Highlights

  • Batch size: EXEC_BATCH_ROWS is 64. Heap batches are further limited to one page, so effective batch size can be smaller (e.g., ~43 rows per page in Amit’s 10M-row test table).
  • Instrumentation: ExecProcNodeBatch uses the same instrumentation hooks as the row path; the “tuple” count for a batch call is recorded as the number of valid rows in the returned TupleBatch (b->nvalid), so EXPLAIN ANALYZE-style stats remain meaningful.
  • GUC: In v4/v5 the GUC is executor_batch_rows (0 = batching off; e.g. 64 = batch size).

Edge Cases and Limitations

  • Sparse batches: With selective quals, batches can end up with few valid rows after filtering. The thread suggests future work: cross-page batches or the scan refilling the batch when it is not full.
  • ExprContext and batch lifetime: Reusing ecxt_per_tuple_memory for per-batch work is a known design debt; a dedicated batch-scoped allocator or context would be cleaner.
  • Parallel and nested Agg: The backtrace from Tomas’s crash showed parallel workers (Gather/GatherMerge) and nested aggregation (e.g., Agg over subplan). The NULL-datum bug was in the batched transition path used in that setting; the v3 fix (split EEOP helpers) addresses the root cause rather than a single query.

Benchmark Summary (from Amit’s v1 post)

All runs were on fully VACUUMed tables with large shared_buffers and prewarmed cache; timings in ms, “off” = batching off, “on” = batching on. Negative %diff means “on” is faster.

  • Single aggregate, no WHERE (e.g., SELECT count(*) FROM bar_N): With only batched SeqScan (0001–0003), ~8–22% faster; with batched agg (0001–0007), ~33–49% faster in several cases.
  • Single aggregate, with WHERE: With batched agg and batched qual (0001–0008), ~31–40% faster.
  • Five aggregates, no WHERE: Batched transitions (per-batch fmgr, 0001–0007) ~22–31% faster.
  • Five aggregates, with WHERE: Batched transitions + batched qual (0001–0008) ~18–32% faster.

So once the executor dominates (minimal IO), batching consistently reduces CPU time, with the largest gains from avoiding per-row fmgr calls and evaluating quals over batches.

Evolution: v4 and v5

Later revisions refined the foundation and added observability and batch qual work:

  • v4 (Oct 2025): Adds EXPLAIN (BATCHES) (patch 0003) to show tuple-batching statistics, addressing the earlier “instrumentation” open point. Amit reported that the regression when batching is off (vs unpatched master) was no longer seen in v4—likely due to removing stray fields from HeapScanData and avoiding mixed compiler (gcc vs clang) comparisons. New benchmarks use SELECT * FROM t LIMIT 1 OFFSET n; with batch=64, improvements are ~22–26% for no-WHERE and ~21–48% for WHERE a > 0; deform-heavy cases (e.g. qual on last column) show smaller gains. Daniil Davydov reviewed the heap batch code (e.g. SO_ALLOW_PAGEMODE assertion, heapgettup_pagemode_batch logic, style); Amit addressed these in v4.

  • v5 (Jan 2026): Keeps 0001–0003 as core (batch AM API, SeqScan + TupleBatch, EXPLAIN BATCHES). 0004 adds ExecQualBatch for batched qual evaluation (WIP); 0005 moves batch qual opcodes into a dedicated interpreter so the per-tuple path (ExecInterpExpr) is not modified, aiming to avoid any cost when executor_batch_rows=0. Amit removed the BatchVector intermediate (quals now read batch slots’ tts_values directly). Two open issues: (1) With 0% selectivity (all rows fail the qual), the per-tuple path is still hotter with the batch qual patches applied even when batching is off; (2) Quals on late columns (deform-heavy) get little or no benefit from batching. The GUC in recent patches is executor_batch_rows (0 = off).

Current Status

  • The thread is active; the latest messages are from January 2026. The series is still work in progress.
  • v5 is the current revision. Patches 0001–0003 (table AM batch API, heapam batch, SeqScan + TupleBatch, EXPLAIN BATCHES) are the intended first step for review and possible commit.
  • Patches 0004–0005 in v5 are experimental (ExecQualBatch, dedicated interpreter for batch qual).
  • v3 had the segfault fix (split EEOP helpers) for the TPC-H Q22 / batched-agg crash; the v4/v5 series builds on that.
  • Open items: (1) Per-tuple path regression when batch qual (0004–0005) is in the tree but executor_batch_rows=0 (e.g. 0% selectivity); (2) batch-scoped memory and ExprContext; (3) future work on cross-page batches and expression evaluation on native/compressed batch formats.

Conclusion

Amit Langote’s “Batching in executor” series introduces a batch-oriented path in the PostgreSQL executor: table AMs can return batches of tuples, the executor can request and pass them via TupleBatch, and SeqScan is the first node wired to this path. Revisions v4 and v5 add EXPLAIN (BATCHES) for observability and prototype batched qual evaluation with a dedicated interpreter to keep the row-at-a-time path unchanged. Benchmarks show substantial gains (often 20–50%) when batching is on; the earlier “batching off” regression was addressed in v4, but a remaining issue is per-tuple path cost when the batch qual patches are applied and batching is disabled (e.g. 0% selectivity).

Reviewers have raised important points: alignment with other batch-like work (e.g. index prefetching), materialization and future expression-on-batch design, the TPC-H Q22 segfault (fixed in v3), and Daniil’s heap-batch review (addressed in v4). The foundation (0001–0003) plus EXPLAIN BATCHES is the current focus for review and possible commit.

References

Week 04 (2026)

PostgreSQL mailing list discussions for Week 04, 2026.

Articles

PostgreSQL Planner Optimization: Automatic COUNT(*) Conversion

Introduction

In October 2025, PostgreSQL committer David Rowley proposed a significant query planner optimization that automatically converts COUNT(1) and COUNT(not_null_col) expressions to COUNT(*). This optimization addresses a common performance anti-pattern where developers write COUNT(1) thinking it's equivalent to COUNT(*), when in fact COUNT(*) is more efficient. The patch was committed in November 2025 and introduces new infrastructure for aggregate function simplification.

Why This Matters

The performance difference between COUNT(*) and COUNT(column) can be substantial, especially for large tables. When counting a specific column, PostgreSQL must:

  1. Deform the tuple to extract the column value
  2. Check for NULL values (even for NOT NULL columns, the check still occurs)
  3. Process the column data through the aggregate function

In contrast, COUNT(*) can count rows without accessing individual column values, resulting in significantly better performance. David Rowley's benchmarks showed approximately 37% performance improvement when using COUNT(*) instead of COUNT(not_null_col) on a table with 1 million rows.
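The cost asymmetry can be shown in miniature (Python sketch, illustrative only: the "deform" and NULL check stand in for PostgreSQL's per-row tuple deformation and aggregate NULL handling):

```python
# Why COUNT(*) is cheaper than COUNT(col), in miniature: counting a column
# forces an attribute fetch ("deform") and a NULL check per row; counting
# rows does neither.
def count_col(rows, col):
    n = 0
    for r in rows:
        v = r[col]            # "deform": fetch the attribute value
        if v is not None:     # NULL check, done even for NOT NULL columns
            n += 1
    return n

def count_star(rows):
    return len(rows)          # row count only; no per-row attribute access

rows = [{"a": i, "b": None if i % 2 else i} for i in range(10)]
print(count_col(rows, "b"), count_star(rows))  # 5 10
```

The per-row work in count_col is exactly what the planner rewrite removes when the argument is provably non-nullable.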

Technical Analysis

The Infrastructure: SupportRequestSimplifyAggref

The patch introduces a new infrastructure called SupportRequestSimplifyAggref, which is similar to the existing SupportRequestSimplify used for regular function expressions (FuncExpr). Since aggregates use Aggref nodes, a separate mechanism was needed.

The key components include:

  1. New support node type: SupportRequestSimplifyAggref in supportnodes.h
  2. Simplification function: simplify_aggref() in clauses.c that calls the aggregate's support function during constant folding
  3. Enhanced nullability checking: Extended expr_is_nonnullable() to handle Const nodes, not just Var nodes

Implementation Details

The optimization is performed during the constant folding phase of query planning, specifically in eval_const_expressions_mutator(). When an Aggref node is encountered, the planner:

  1. Checks if the aggregate function has a support function registered via pg_proc.prosupport
  2. Calls the support function with a SupportRequestSimplifyAggref request
  3. If the support function returns a simplified node, replaces the original Aggref

For the COUNT aggregate specifically, the support function (int8_agg_support_simplify()) checks:

  • Whether the argument is non-nullable (using expr_is_nonnullable())
  • Whether there are no ORDER BY or DISTINCT clauses in the aggregate
  • If both conditions are met, converts COUNT(ANY) to COUNT(*)

Code Example

The core simplification logic in int8.c:

static Node *
int8_agg_support_simplify(SupportRequestSimplifyAggref *req)
{
    Aggref    *aggref = req->aggref;

    /* Only handle COUNT */
    if (aggref->aggfnoid != INT8_AGG_COUNT_OID)
        return NULL;

    /* Must have exactly one argument */
    if (list_length(aggref->args) != 1)
        return NULL;

    /* No ORDER BY or DISTINCT */
    if (aggref->aggorder != NIL || aggref->aggdistinct != NIL)
        return NULL;

    /* Check if argument is non-nullable */
    if (!expr_is_nonnullable(req->root,
                             (Expr *) linitial(aggref->args),
                             true))
        return NULL;

    /* Convert to COUNT(*) */
    return make_count_star_aggref(aggref);
}

Patch Evolution

The patch went through four iterations, each refining the implementation:

Version 1 (Initial Proposal)

  • Introduced the basic infrastructure
  • Used SysCache to fetch pg_proc tuples

Version 2 (Code Cleanup)

  • Replaced SysCache lookup with get_func_support() function
  • Cleaner and more efficient approach

Version 3 (Removed Experimental Code)

  • Removed #ifdef NOT_USED block that handled COUNT(NULL) optimization
  • Cleaned up unused includes
  • Improved comments

Version 4 (Final Version)

  • Rebased after commit b140c8d7a
  • Fixed assumption that support function always returns an Aggref
  • Allows support functions to return other node types (e.g., constants) for more aggressive optimizations
  • This flexibility enables future optimizations like converting COUNT(NULL) to '0'::bigint

Community Insights

Reviewer Feedback

Corey Huinker provided positive feedback:

  • +1 for the automatic query improvement
  • Noted that we can't educate everyone that COUNT(1) is an anti-pattern, so making it not an anti-pattern is the right approach
  • Confirmed the patch applies cleanly and tests pass

Matheus Alcantara also reviewed and tested:

  • Confirmed ~30% performance improvement in benchmarks
  • Validated that the code placement is consistent with existing SupportRequestSimplify infrastructure
  • +1 for the idea

Design Decisions

Timing of Optimization: The optimization happens during constant folding, which is early in the planning process. David considered whether it should happen later (after add_base_clause_to_rel()) to catch cases like:

SELECT count(nullable_col) FROM t WHERE nullable_col IS NOT NULL;

However, it must happen before preprocess_aggref(), which groups aggregates with the same transition function. The current placement is consistent with SupportRequestSimplify for regular functions.

Support Function Return Type: The infrastructure allows support functions to return nodes other than Aggref. This design decision enables future optimizations, such as:

  • Converting COUNT(NULL) to '0'::bigint
  • More aggressive constant folding for aggregates

Performance Considerations

The optimization provides significant performance benefits:

  1. Reduced tuple deformation: COUNT(*) doesn't need to extract column values from tuples
  2. Fewer NULL checks: No need to check individual column values
  3. Better cache utilization: Less data movement means better CPU cache usage

For tables with many columns, the performance gain can be even more substantial, as COUNT(column) might require deforming many columns to reach the target column.

Edge Cases and Limitations

The optimization only applies when:

  1. The column is provably non-nullable (NOT NULL constraint or constant)
  2. There are no ORDER BY clauses in the aggregate
  3. There are no DISTINCT clauses in the aggregate

Cases that are not optimized (yet):

  • COUNT(nullable_col) where the column might be NULL (even if filtered by WHERE nullable_col IS NOT NULL in the same query)
  • COUNT(col ORDER BY col) - the ORDER BY prevents optimization
  • COUNT(DISTINCT col) - DISTINCT prevents optimization

The limitation with WHERE clauses is due to the timing of the optimization (during constant folding, before relation information is fully available).

Current Status

The patch was committed by David Rowley on November 26, 2025. It is available on the PostgreSQL master branch and is slated for PostgreSQL 19.

Conclusion

This optimization represents a significant improvement to PostgreSQL's query planner, automatically fixing a common performance anti-pattern without requiring application changes. The new SupportRequestSimplifyAggref infrastructure also opens the door for future aggregate optimizations.

For developers and DBAs:

  • No action required: The optimization happens automatically
  • Performance benefit: Existing queries using COUNT(1) or COUNT(not_null_col) will automatically get faster
  • Best practice: While the planner now optimizes these cases, COUNT(*) remains the clearest and most idiomatic way to count rows

This change demonstrates PostgreSQL's commitment to improving query performance automatically, reducing the burden on developers to know every optimization detail while still allowing experts to write optimal queries when needed.

References

Week 03 (2026)

PostgreSQL mailing list discussions for Week 03, 2026.

Articles

Extended Statistics Import/Export Functions for PostgreSQL 19

Introduction

PostgreSQL's extended statistics feature, introduced in version 10, allows the optimizer to understand correlations between columns that simple per-column statistics cannot capture. This includes n-distinct coefficients (how many distinct combinations exist), functional dependencies (when one column determines another), and Most Common Values (MCV) lists for multi-column combinations.

While PostgreSQL 18 introduced functions for importing and exporting relation and attribute statistics (pg_restore_relation_stats, pg_restore_attribute_stats), extended statistics were left out of this initial implementation. A recent thread on pgsql-hackers, initiated by Corey Huinker, addresses this gap with a comprehensive patch series that adds pg_restore_extended_stats(), pg_clear_extended_stats(), and related infrastructure.

This work is significant for several reasons:

  • Enables complete statistics preservation across pg_dump/pg_restore and pg_upgrade
  • Allows query planner experimentation with hypothetical statistics
  • Supports schema-only dumps with statistics for testing query plans without actual data

Technical Analysis

The Problem with the Original Format

The original output format for pg_ndistinct and pg_dependencies types used a JSON structure where the keys themselves contained structured data:

{"1, 2": 2323, "1, 3": 3232, "2, 3": 1500}

While technically valid JSON, this format posed several problems:

  1. Keys containing comma-separated attribute numbers require additional parsing
  2. Difficult to manipulate programmatically
  3. No working input function existed—these types were effectively output-only

The New JSON Format

The patch series introduces a cleaner, more structured JSON format. For pg_ndistinct:

[
  {"attributes": [2, 3], "ndistinct": 4},
  {"attributes": [2, -1], "ndistinct": 4},
  {"attributes": [2, 3, -1], "ndistinct": 4}
]

For pg_dependencies:

[
  {"attributes": [2], "dependency": 3, "degree": 1.000000},
  {"attributes": [2, 3], "dependency": -1, "degree": 0.850000}
]

Key improvements:

  • Proper JSON arrays with named keys for each element
  • Clear separation of attributes, values, and metadata
  • Machine-readable without custom parsing logic
  • Negative attribute numbers represent expressions in the statistics object (e.g., -1 is the first expression)
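The gain in machine-readability is easy to see: converting the old key-encoded format into the new array format takes only a few lines. The helper below is our own illustration, not part of the patch:

```python
import json

# Hypothetical helper: convert the old key-encoded pg_ndistinct text
# (keys like "1, 2" carrying attribute numbers) into the new structured
# array-of-objects format.
def old_ndistinct_to_new(old_text: str) -> str:
    old = json.loads(old_text)
    new = [{"attributes": [int(a) for a in key.split(",")],
            "ndistinct": value}
           for key, value in old.items()]
    return json.dumps(new)
```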

Input Function Implementation

The new input functions use PostgreSQL's JSON parser infrastructure with a custom semantic action handler. Here's a simplified view of the parsing state machine for pg_ndistinct:

typedef enum
{
    NDIST_EXPECT_START = 0,
    NDIST_EXPECT_ITEM,
    NDIST_EXPECT_KEY,
    NDIST_EXPECT_ATTNUM_LIST,
    NDIST_EXPECT_ATTNUM,
    NDIST_EXPECT_NDISTINCT,
    NDIST_EXPECT_COMPLETE
} ndistinctSemanticState;

The parser validates:

  • Proper JSON structure (array of objects)
  • Required keys (attributes and ndistinct for ndistinct statistics)
  • Attribute numbers within valid ranges (positive for columns, negative for expressions, but not beyond STATS_MAX_DIMENSIONS)
  • No duplicate attributes within a single item
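The attribute-level checks can be sketched in a few lines of Python (illustrative only; the real validation is C code driven by the JSON parser's semantic actions). STATS_MAX_DIMENSIONS is 8 in PostgreSQL:

```python
STATS_MAX_DIMENSIONS = 8  # PostgreSQL's compile-time limit

# Illustrative version of the per-item checks listed above: attnum 0 is
# never valid, negative attnums (expressions) are bounded by
# STATS_MAX_DIMENSIONS, and duplicates within one item are rejected.
def validate_attributes(attrs):
    seen = set()
    for a in attrs:
        if a == 0 or a < -STATS_MAX_DIMENSIONS:
            return False  # out of range
        if a in seen:
            return False  # duplicate attribute in a single item
        seen.add(a)
    return True
```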

Extended Statistics Functions

The patch introduces three main SQL functions:

pg_restore_extended_stats() — Imports extended statistics from a previously exported value:

SELECT pg_restore_extended_stats(
    'public',                    -- relation schema
    'my_table',                  -- relation name
    'public',                    -- statistics schema  
    'my_stats',                  -- statistics name
    false,                       -- inherited
    '{"version": ..., "ndistinct": [...], "dependencies": [...], "mcv": [...], "exprs": [...]}'::text
);

pg_clear_extended_stats() — Removes extended statistics data from pg_statistic_ext_data:

SELECT pg_clear_extended_stats(
    'public',        -- statistics schema
    'my_stats',      -- statistics name
    false            -- inherited
);

The functions follow the same patterns established for relation/attribute statistics:

  • Return boolean indicating success
  • Issue WARNING (not ERROR) on problems to avoid breaking pg_restore scripts
  • Require MAINTAIN privilege on the target relation

Validation and Safety

The implementation includes careful validation:

  1. Attribute bounds checking: Positive attnums must exist in stxkeys, negative attnums must not exceed the number of expressions
  2. Combination completeness: For pg_ndistinct, all N-choose-K combinations must be present based on the longest attribute list
  3. Soft error handling: Uses PostgreSQL's ErrorSaveContext for safe error reporting without crashing

Example validation for attribute numbers:

if (attnum == 0 || attnum < (0 - STATS_MAX_DIMENSIONS))
{
    errsave(parse->escontext,
            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
            errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
            errdetail("Invalid \"%s\" element: %d.",
                      PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    return JSON_SEM_ACTION_FAILED;
}
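The combination-completeness rule from the validation list above (all N-choose-K combinations present, judged against the longest attribute list) can be sketched like this. This is our own illustrative helper, not the patch's code:

```python
from itertools import combinations

# Sketch of the pg_ndistinct completeness check: given the longest
# attribute list among the items, every k-combination (k >= 2) of it
# must appear as an item.
def ndistinct_is_complete(items):
    attr_sets = [frozenset(i["attributes"]) for i in items]
    full = max(attr_sets, key=len)
    required = {frozenset(c)
                for k in range(2, len(full) + 1)
                for c in combinations(sorted(full), k)}
    return set(attr_sets) == required
```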

Community Insights

Key Discussion Points

Format Change Timing: Tomas Vondra initially suggested a more structured JSON format. The community recognized this was the last opportunity to change the format before a working input function locked in backward compatibility requirements.

Validation Scope: There was significant discussion about how much validation to perform:

  • Early patches had extensive checks for statistical consistency (e.g., MCV frequencies summing to 1.0)
  • Reviewers pushed back, preferring minimal validation to avoid breaking legitimate but unusual imports
  • Final consensus: validate structure and attribute references, but not statistical values

pg_dependencies Special Case: Unlike pg_ndistinct which stores all combinations, pg_dependencies may omit statistically insignificant combinations. This means the input function cannot enforce complete combination coverage for dependencies.

Reviewer Feedback Integration

Michael Paquier provided extensive review and contributed significant improvements:

  • Restructured the patch series for cleaner commits
  • Split format changes from input function additions
  • Added comprehensive regression tests achieving >90% code coverage
  • Fixed compiler warnings on older GCC versions

Tom Lane caught style issues:

  • Error detail messages converted to complete sentences
  • Replaced SOFT_ERROR_OCCURRED() macro with direct state checks to avoid warnings

Current Status

As of January 2026, the patch series has progressed significantly:

Committed:

  • Output format changes for pg_ndistinct (new JSON array format)
  • Output format changes for pg_dependencies (new JSON array format)
  • Input functions for both types with comprehensive validation
  • pg_clear_extended_stats() function

In Review (v27):

  • pg_restore_extended_stats() function
  • pg_dump integration for extended statistics export/import

The pg_dump integration supports backward compatibility to PostgreSQL 10, with version-specific SQL generation to handle format differences.

Technical Details

Internal Storage Unchanged

Importantly, the internal binary storage format remains unchanged. The new input/output functions only affect the text representation. This means:

  • No catalog changes required
  • Existing data remains valid
  • Binary COPY operations unaffected

Expression Statistics Support

Extended statistics can include expressions (e.g., CREATE STATISTICS s ON (a + b), c FROM t). The implementation handles these via negative attribute numbers:

  • -1 = first expression
  • -2 = second expression
  • etc.

The exprs element in the restore format contains per-expression statistics similar to pg_statistic entries, enabling complete round-trip preservation.

MCV List Handling

MCV (Most Common Values) lists for extended statistics are particularly complex, containing:

  • Value combinations across multiple columns
  • Frequency and base frequency arrays
  • Per-value null bitmaps

The implementation reuses infrastructure from attribute statistics import, with extensions for multi-column value arrays.

Conclusion

This patch series represents a significant enhancement to PostgreSQL's statistics infrastructure. By enabling import/export of extended statistics, it:

  1. Completes the statistics story started in PostgreSQL 18 for relation and attribute statistics
  2. Enables realistic testing with production-like statistics on sanitized schemas
  3. Improves upgrade reliability by preserving optimizer information across pg_upgrade

For DBAs and developers:

  • Extended statistics created with CREATE STATISTICS will now survive pg_dump/pg_restore
  • Query plan testing becomes more practical with --no-data dumps that include full statistics
  • The new JSON format is human-readable for debugging and hypothetical scenario testing

The target release is PostgreSQL 19, with the remaining restore function and pg_dump integration expected to land soon.

References

pg_plan_advice: A New Approach to PostgreSQL Query Plan Control

Introduction

PostgreSQL's query planner is sophisticated and generally produces excellent execution plans. However, experienced DBAs and developers occasionally encounter situations where they wish they could influence or stabilize the planner's decisions. Robert Haas from EnterpriseDB has been working on a significant new contrib module called pg_plan_advice that aims to address this long-standing need.

This article examines the pg_plan_advice thread on the pgsql-hackers mailing list, which has been actively discussed since October 2025.

What is pg_plan_advice?

pg_plan_advice is a proposed contrib module that introduces a special-purpose "advice mini-language" for controlling key planning decisions. The module can:

  • Generate advice strings from existing query plans using EXPLAIN (PLAN_ADVICE)
  • Apply advice strings via the pg_plan_advice.advice GUC parameter to reproduce or constrain future planning decisions

The advice language allows control over:

  • Join order: Which tables are joined in what sequence
  • Join methods: Nested loop, merge join, hash join
  • Scan types: Sequential scan, index scan (with specific index selection)
  • Parallelism: Where and how parallel execution is used
  • Partitionwise joins: How partitioned table joins are handled

Key Design Philosophy

Robert Haas emphasizes in the README that the principal use case is not about users "out-planning the planner" but rather about reproducing plans that worked well in the past:

"We don't need to accept the proposition that users can out-plan the planner. We only need to accept that they can tell good plans from bad plans better than the planner. That is a low bar to clear. The planner never finds out what happens when the plans that it generates are actually executed, but users do."

This positions pg_plan_advice as a plan stability tool rather than a hint system for micromanaging the optimizer.

Technical Architecture

The Relation Identifier System

One of the most innovative aspects of pg_plan_advice is its relation identifier system. This system provides unambiguous references to parts of a query, handling complex scenarios like:

  • Multiple references to the same table with different aliases
  • Subqueries and CTEs
  • Partitioned tables and their partitions

The identifier syntax uses special notation like t#2 to distinguish between the first and second occurrence of table t in a query.
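The occurrence-numbering scheme behind that notation can be mimicked with a small counter (an illustrative sketch of the notation only, not pg_plan_advice's implementation):

```python
from collections import Counter

# Sketch of the relation-identifier scheme: the first mention of an
# alias keeps its bare name, later mentions get a #N suffix.
def assign_identifiers(aliases):
    seen = Counter()
    out = []
    for alias in aliases:
        seen[alias] += 1
        out.append(alias if seen[alias] == 1 else f"{alias}#{seen[alias]}")
    return out
```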

Example Usage

Here's an example from Jakub Wartak's testing showing the power of the system:

-- Generate advice for a query with aliasing
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- Output includes:
-- Generated Plan Advice:
--   JOIN_ORDER(a#2 b#2 c)
--   MERGE_JOIN_PLAIN(b#2 c)
--   SEQ_SCAN(c)
--   INDEX_SCAN(a#2 public.t1_pkey)
--   NO_GATHER(c a#2 b#2)

You can then selectively apply constraints:

-- Force a specific scan type
SET pg_plan_advice.advice = 'SEQ_SCAN(b#2)';

-- Re-explain to see the new plan
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- The output shows:
-- Supplied Plan Advice:
--   SEQ_SCAN(b#2) /* matched */

Patch Structure (v10)

The implementation is split into five patches:

  Patch   Description                                                Size
  0001    Store information about range table flattening             7.8 KB
  0002    Store information about elided nodes in the final plan     9.8 KB
  0003    Store information about Append node consolidation          40.4 KB
  0004    Allow for plugin control over path generation strategies   56.1 KB
  0005    WIP: Add pg_plan_advice contrib module                     399.1 KB

The first four patches add necessary infrastructure to the planner, while the fifth contains the actual module. This separation allows the infrastructure to potentially benefit other extensions in the future.

Community Review and Testing

The thread has seen active participation from several community members:

Jakub Wartak (EDB)

Conducted extensive TPC-H benchmark testing and found several bugs:

  • Crashes in debug/ASAN builds with NULL pointer dereferences
  • Issues with semijoin uniqueness detection without statistics
  • Join order advice conflicts in complex queries

Jacob Champion (EDB)

Applied fuzzing techniques to discover edge cases:

  • Parser crashes with malformed advice strings
  • Issues with partition-related advice on non-partitioned tables
  • AST utility bugs revealed through corpus-based fuzzing

Other Contributors

  • Alastair Turner: Appreciated the ability to test alternative plans
  • Hannu Krosing (Google): Referenced VLDB research showing 20% of real-world queries have 10+ joins
  • Lukas Fittl: Interested in pg_stat_statements integration possibilities

Issues Discovered and Fixed

The collaborative review process has uncovered and fixed several issues across versions:

  1. Compiler warnings (gcc-13, clang-20) - Fixed in early versions
  2. NULL pointer crashes in pgpa_join_path_setup() when extension state wasn't allocated
  3. Join order conflict detection incorrectly treating join method advice as positive constraints
  4. Semijoin uniqueness tracking not working correctly without PLAN_ADVICE in EXPLAIN
  5. Partial match detection in nested join order specifications

Current Status

As of v10 (posted January 15, 2026):

  • The patch is registered in Commitfest
  • Still marked as WIP (Work In Progress)
  • Active testing continues, particularly with TPC-H queries
  • Robert Haas is seeking substantive code review, especially for patch 0001

Implications for PostgreSQL Users

If committed, pg_plan_advice would provide:

  1. Plan Stability: Capture and reproduce known-good query plans
  2. Debugging Aid: Understand why the planner makes specific choices
  3. Testing Tool: Experiment with alternative plan shapes without modifying queries
  4. Production Safety Net: Guard against unexpected plan regressions after statistics changes

Comparison with pg_hint_plan

Unlike the popular pg_hint_plan extension, pg_plan_advice focuses on round-trip safety:

  • Plans can be captured and reapplied reliably
  • The relation identifier system handles complex aliasing automatically
  • Designed to work with any query structure without manual identifier management

Conclusion

pg_plan_advice represents a significant step forward in PostgreSQL's planner extensibility story. Rather than replacing the optimizer's judgment, it provides a safety mechanism for preserving proven execution strategies. The active community review process has already improved the code substantially, and continued testing is helping ensure robustness.

For DBAs managing complex workloads, particularly those with queries that occasionally suffer from plan regressions, this module offers a promising solution that works with the planner rather than against it.


Thread Link: pg_plan_advice - pgsql-hackers

Commitfest Entry: CF 6184

Author: Robert Haas (EnterpriseDB)

Reviewers: Jakub Wartak, Jacob Champion, Alastair Turner, Hannu Krosing, John Naylor, and others

2026 Archive

PostgreSQL Weekly posts for 2026.

Weeks

Week 13 (2026)

PostgreSQL mailing list discussions for Week 13, 2026.

Articles

UUID and Base32hex Encoding

Introduction

Compact, copy-paste-friendly representations of UUIDs come up in URLs, logs, JSON payloads, and anything humans have to read aloud. A pgsql-hackers thread starting in October 2025 began with advocacy for two new builtins—uuid_to_base32hex() and base32hex_to_uuid()—encoding UUIDs as 26-character RFC 4648 Section 7 base32hex strings. What followed was less about the numeric format (everyone agreed it is standardized and useful) and more about where that behavior should live in PostgreSQL’s SQL surface: separate UUID-specific functions, or the existing encode() / decode() pipeline plus explicit type conversions.

Why This Matters

PostgreSQL already stores UUIDs efficiently as a dedicated type and ships uuidv7() and friends. Text interchange formats still matter for APIs and cross-system contracts. Base32hex preserves lexicographic sort order of the underlying bytes (unlike typical base64), stays case-insensitive for decoding, and avoids ambiguous oral spelling of mixed-case hex. The thread connects that format to real-world precedent (DNSSEC encoders, JSON-heavy pipelines) and to RFC 9562 positioning: canonical hyphenated UUID strings for compatibility, binary storage in databases, and a desire—not always fully standardized in the RFC text—for one compact text encoding to reduce ecosystem fragmentation.

Technical Analysis

Original proposal (conceptual)

The opening pitch (preserved in the thread as hi-hackers.txt) described:

  • uuid_to_base32hex(uuid) → text: 26 uppercase base32hex characters, no hyphens, no padding; two zero bits appended so 128 bits map cleanly to base32’s 5-bit alphabet.
  • base32hex_to_uuid(text) → uuid: case-insensitive decode; invalid input yields NULL.

Sergey Prokhorenko also argued against alternatives such as base36 (performance) and Crockford Base32 (weaker presence in standard libraries), positioning base32hex as the practical compact choice.

The Community's Preferred Direction: Composition, Not Duplication

Aleksander Alekseev's first reply combined process advice (avoid long Cc lists; use git format-patch; register the patch in the Commitfest) with interface design: standalone UUID codec functions compose poorly. The more robust route is to:

  1. Provide explicit uuid ↔ bytea conversions;
  2. Add a base32hex format to encode(bytea, ...) / decode(text, ...), for example:
SELECT encode(uuidv7()::bytea, 'base32hex');

(Early mails wrote 'base32'; later patches settled on the format name base32hex, matching RFC 4648's naming.)

Andrey Borodin and Jelte Fennema-Nio agreed that extending encode() better fits existing conventions; Jelte cited the introduction of base64url in PostgreSQL 18 (commit e1d917182) as evidence that adding encoding formats at the same entry point is a viable path.

Where the Patch Series Actually Went (Summary)

The downloadable patch series (versions up to v12 appear in the thread) gradually converged on:

  • Implementing encode(bytea, 'base32hex') and decode(text, 'base32hex') in encode.c (the encoder uses = padding per the RFC; the decoder accepts input with or without padding; see the patch notes for case and whitespace handling).
  • Documenting the format in func-binarystring.sgml, with the recommended recipe for compact UUIDs:
rtrim(encode(uuid_value::bytea, 'base32hex'), '=')

That is, a 26-character short form instead of the 36-character canonical UUID text.

  • As supporting work: explicit conversions between uuid and bytea, avoiding a new pair of UUID-specific functions in core for every encoding.

SQL Examples (Illustrative)

The statements below match the usage that took shape in the discussion (encode / decode with 'base32hex'; the UUID participates via ::bytea). They require a PostgreSQL build that includes the feature; at the time of writing, the patches were still under review.

1. Raw output with RFC padding: encode() emits '=' padding, so for a 16-byte UUID the string is longer than 26 characters before rtrim:

SELECT encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex') AS padded;
-- 32 characters (padded to a multiple of 8 per RFC 4648, with trailing '=')

2. The 26-character compact form: strip the trailing =; suitable for URLs, logs, and so on (same example UUID as in the thread):

SELECT rtrim(
         encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
         '='
       ) AS short_id;
-- 06AJBM9TUTSVND36VA87V8BVJO

3. Round trip: decode() returns bytea; cast back to uuid:

WITH x AS (
  SELECT rtrim(
           encode('019535d9-3df7-79fb-b466-fa907fa17f9e'::uuid::bytea, 'base32hex'),
           '='
         ) AS short_id
)
SELECT short_id,
       decode(short_id, 'base32hex')::uuid AS back_to_uuid
FROM x;
-- back_to_uuid = 019535d9-3df7-79fb-b466-fa907fa17f9e

4. Combined with uuidv7() (when you want time-ordered IDs with short external strings):

SELECT rtrim(encode(uuidv7()::bytea, 'base32hex'), '=') AS short_new_id;

Patch Evolution

Early versions included interfaces along the lines of uuid_encode / uuid_decode; later versions folded the capability into encode/decode, added regression tests and documentation, and included notes and small fixes around sorting and collation (the attachment list contains a separate doc patch).

Community Perspectives

  • Lists and visibility: Aleksander noted that if Sergey was not subscribed to the list, the opening mail might not be seen by everyone; quoting the original and suggesting a subscription keeps the discussion self-contained in the archives.

  • Why base32hex: Sergey summarized his reasons (sort-order preservation, size, standard library support, ease of reading aloud, simple implementation in JSON-heavy scenarios) and stressed that competing short formats would fragment the ecosystem.

  • API bloat: Masahiko Sawada argued that stacking up another batch of uuid_* codec functions would overlap with encode()'s responsibilities; he favored encode/decode plus UUID ↔ bytea conversions, and considered the conversion overhead negligible.

  • Polymorphic encode and decode signature limits: Sergey asked whether encode() could accept a uuid directly, or decode() produce a uuid directly. Masahiko explained that PostgreSQL cannot define two different return types for the same argument signature decode(text, text); an explicit cast plus an inlinable SQL wrapper function is the pragmatic approach, for example:

CREATE FUNCTION uuid_to_base32(u uuid) RETURNS text
LANGUAGE SQL IMMUTABLE STRICT
BEGIN ATOMIC
  SELECT encode($1::bytea, 'base32hex');
END;

He noted that compared with writing encode($1::bytea, ...) by hand, the extra cost is mainly the type conversion.

  • Can there be only one short format? Sergey worried about multiple short encodings coexisting; Masahiko pointed out that developers integrating heterogeneous systems may still pick hex and the like for compatibility. The engineering compromise remains to ship the standard base32hex in core and document the 26-character UUID usage clearly.

Technical Details

  • Padding: per the RFC, encode() emits = padding; stripping the trailing = yields the 26-character form from the proposal.
  • Sorting: base32hex preserves the lexicographic order of the underlying bytes; when comparing encoded results as text, mind the difference between collation-aware and binary comparison.
  • Implementation location: placing the codec in encode.c alongside the existing base64url and friends follows the convention of managing binary-to-text encodings in one place.

Conclusion

Base32hex itself is not complicated, but this thread illustrates PostgreSQL's preference for interface orthogonality: uuid stores the data, bytea represents the bytes, and encode/decode handle text encodings. If you need compact, sort-friendly UUID text, the emerging idiom is encode(uuid::bytea, 'base32hex'), with rtrim(..., '=') where needed; if your application wants a shorter one-liner, wrap it in an inlinable wrapper function rather than pushing every encoding into core's function namespace.

JSONPath String Methods: Cleaning JSON Inside the Path, and a Long-Running Debate About Immutability

Introduction

Working with "dirty" JSON often means trimming whitespace, converting case, or splitting on delimiters before comparing values. Today you can do those steps in SQL around jsonb_path_query() and friends, but not necessarily inside the JSONPath expression itself. Florents Tselai posted a patch series on pgsql-hackers adding common string methods to JSONPath: lower(), upper(), initcap(), ltrim() / rtrim() / btrim(), replace(), and split_part(), all implemented by delegating to PostgreSQL's built-in string functions.

The discussion quickly expanded from "what expressions can I write" to more fundamental questions: how these methods interact with volatility and immutability (behavior that depends on locale), whether PostgreSQL's JSONPath should track the SQL standard or the internet RFC, and the naming baggage of the existing *_tz-suffixed JSONPath entry points. The work is registered as Commitfest 5270; the thread ran from 2024 into 2025 through many patch revisions.

Why It Matters

JSONPath is the domain language embedded in the jsonb_path_* family of functions. String primitives directly in the path reduce outer SQL nesting, keep the intent in the path, and resemble the familiar "pipeline" style of text processing, only applied to JSON scalars. For data-cleaning scenarios (whitespace, case, delimiter splitting), that is appealing for usability.

The real difficulty: case mapping and many string operations depend on the locale. PostgreSQL's planning and optimization rely on correct volatility labeling; marking as immutable results that can still vary with locale or system environment breaks the optimizer's basic assumptions. Most of the thread wrestles with exactly this tension.

Technical Analysis

What the Patch Does

The first patch forwards these methods to implementations already registered in pg_proc. It adds method_args (arg0, arg1) to JsonPathParseItem for methods that need more flexible inputs than the traditional left/right operands, plus jspGetArgX() accessors. The author also added a README.jsonpath explaining how to add methods later, to help future contributors. Proposed usage looks like:

SELECT jsonb_path_query('" hElLo WorlD "', '$.btrim().lower().upper().lower().replace("hello","bye") starts with "bye"');
SELECT jsonb_path_query('"abc~(at)~def~@~ghi"', '$.split_part("~(at)~", 2)');
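For reference, PostgreSQL's split_part(string, delimiter, n) returns the n'th field (1-based) and an empty string when n is out of range; a minimal Python model of that behavior for positive n (our own sketch, not the patch's code):

```python
def split_part(s: str, delim: str, n: int) -> str:
    # Model of PostgreSQL's split_part for positive n: split on the
    # delimiter and return the 1-based n'th field, or '' if out of range.
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""
```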

Open items listed in the first post include: how to avoid conflicts if a future SQL/JSON standard defines methods with the same names (a pg_ prefix came up in related discussion), a default collation consistent with the existing JSONPath code, the still-missing user documentation, and an extensibility idea along the lines of CREATE JSONPATH FUNCTION.

Patch Evolution (Overview)

The downloadable series runs from v1 to v18. Starting with v6, each version is split into two patches: one renaming the lexer/grammar tokens for JSONPath method arguments, the other carrying the string methods themselves. Many intermediate revisions involved tests plus rebases and conflict resolution in files such as jsonpath_scan.l. Later versions added SGML documentation in doc/src/sgml/func/func-json.sgml and regression tests for jsonpath, jsonb_jsonpath, and related suites.

Community Perspectives

  • Tom Lane raised the immutability problem early: locale-dependent string operations mean JSONPath evaluation cannot be unconditionally immutable. He cited commit cb599b9dd and compared the _tz split introduced in JSONPath for timezone-sensitive datetimes; in his view, duplicating a full set of entry points for every source of volatility does not scale.

  • Alexander Korotkov floated the idea of "flexible volatility": helper logic could analyze whether the JSONPath argument is a constant and whether every method in the path is "safe", marking jsonb_path_query() immutable in the restricted case and stable otherwise. He also asked whether the new names appear in the SQL standard (or its drafts).

  • Florents Tselai pointed to a related 2019 discussion and sketched a heuristic (if every segment along the path is safe, the whole can be treated as immutable); he cited the "function extensions" of RFC 9535 as specification grounds for vendor extensions, with volatility being an implementation detail.

  • David E. Wheeler clarified that PostgreSQL's JSONPath tracks SQL/JSON from the SQL standard, not RFC 9535; the extension mechanism in the public RFC need not match the rules in the SQL standard's text. He was also interested in extensibility hooks (lexer, grammar, executor); Florents summarized the steps: add a JsonPathItemType, modify jsonpath_scan.l / jsonpath_gram.y, and dispatch in executeItemOptUnwrapTarget.

  • Robert Haas framed part of the issue as general policy (functions depending on OS time or locale are usually stable) and drew an analogy to the existing json_path_exists and json_path_exists_tz: the suffixed family could take on more "impure" behavior, with the unsuffixed versions raising errors, while conceding that the _tz name is a poor fit for "locale".

  • David Wheeler read Tom's "does not scale" as: do not build a json_path_exists_* family for every source of volatility; renaming or generalizing was discussed. After conversations at a European PostgreSQL community event, Florents summarized the pragmatic route: put the new behavior under the jsonb_path_*_tz family, reject it in the non-_tz variants, and document it clearly, staying consistent with the existing _tz naming even though the semantics are no longer only about time zones.

  • Naming and API bloat: David proposed a suffix such as _stable; Robert described the heavyweight process (deprecation, a GUC, a multi-year migration) and considered complaints unavoidable. Tom Lane saw no need to invest that much in a migration: users who insist on keeping the old names can wrap them in wrapper functions, and he asked why not add a better-named set and never remove the old names. Robert was fine with that if the extra symbols are acceptable. Florents noted that another parallel set of five jsonb_path_* functions would significantly grow the API surface. David also asked, from an indexing angle, whether the _tz paths are realistic for index use; generated columns, and future virtual generated columns, came up as possible applications.

Technical Details

Immutability and Locale

PostgreSQL distinguishes immutable (under planner assumptions, the same inputs must always yield the same result) from stable (the result may vary, for example with the session or environment). Locale data can change with operating system or ICU updates, so locale-dependent functions generally cannot be marked immutable. If JSONPath participates deeply in expression evaluation and optimization, it must follow the same rules as core SQL functions.

Standards Positioning

The discussion distinguishes two bodies of literature:

  • RFC 9535 (IETF JSONPath): public, and describes extension points.
  • JSONPath as defined in SQL/JSON and the SQL standard (PostgreSQL's alignment target): the text is not public, and its extension rules may differ from the RFC.

Even if the specification permits vendor extension methods, a future standard could still claim the same names; the naming-policy question from the first post does not go away.

Implementation Notes

Beyond volatility, new methods touch the JSONPath lexer, grammar, executor, and tests. Splitting "token renaming" and "feature body" into two patches helps keep review manageable as the version numbers pile up.

Current Status

The patch series kept evolving through repeated rebases (the captured data shows versions up to v18). The work item is Commitfest 5270, and it can also be browsed as a GitHub PR. The May 2025 discussion converged pragmatically on implementation strategy (the *_tz surface, errors in the unsuffixed variants, documentation) and on the naming trade-offs, but this excerpt of the thread shows no final conclusion about a merge into master.

Conclusion

JSONPath string methods answer a real product need: normalization and splitting inside path expressions. The community's focus was less "do we want this feature" than whether the volatility labeling is correct, the relationship to SQL/JSON, and the long-term maintainability of the jsonb_path_* / _tz suffix. The thread is a good showcase of how PostgreSQL, when extending an embedded DSL, weighs standards conformance, optimizer correctness, and API hygiene.

Week 12 (2026)

PostgreSQL mailing list discussions for Week 12, 2026.

Articles

Cutting Planning Time for Large NOT IN Lists Containing NULL

Introduction

When a query uses x NOT IN (NULL, ...) or x <> ALL (NULL, ...), the result is always empty: no row can match. In SQL, a single NULL in a NOT IN list makes the whole predicate evaluate to NULL or false for every row, so the selectivity is 0.0. Yet PostgreSQL's optimizer would still walk every element in the list and call the operator's selectivity estimator for each one, wasting planning time on large lists.

Ilia Evdokimov (Tantor Labs) proposed a simple optimization: in the <> ALL / NOT IN case, detect whether the array contains a NULL and short-circuit the selectivity loop, returning 0.0 directly. The patch went through several rounds of review, during which a regression involving arrays that once contained NULLs but no longer do was found and fixed, and it was finally committed by David Rowley in March 2026.

Why It Matters

Queries with large NOT IN or <> ALL lists are common in reporting and ETL scenarios. When such a list contains a NULL (whether intentionally or via a subquery), the optimizer previously did a lot of useless work:

  • For constant arrays: deconstruct the array, iterate over every element, and call the operator's selectivity function.
  • For non-constant expressions: walk the list elements to check for NULLs.

In Ilia's tests, with detailed statistics on the column, planning time for WHERE x NOT IN (NULL, ...) dropped from 5–200 ms to roughly 1–2 ms, depending on list size. The optimization preserves semantics and does not affect the common cases.

Technical Analysis

Semantics

For x NOT IN (a, b, c, ...) (or x <> ALL (array)):

  • If any element is NULL, the predicate evaluates to NULL for every row (which in a WHERE clause means the row is filtered out).
  • The optimizer models this as selectivity = 0.0: no rows match.

scalararraysel() in src/backend/utils/adt/selfuncs.c handles both = ANY (IN) and <> ALL (NOT IN) via the useOr flag. For <> ALL, selectivity is computed by iterating over the array elements and combining per-element estimates. When a NULL is present, the final result is always 0.0, so the loop is redundant.
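The invariant is easy to check with a small three-valued-logic model in Python (None stands in for SQL NULL; this is our own sketch, not the patch's code):

```python
def sql_not_in(x, values):
    """x NOT IN (values) under SQL three-valued logic; None models NULL."""
    if x is None:
        return None                     # NULL NOT IN (...) is NULL
    if any(v is not None and v == x for v in values):
        return False                    # a definite match: NOT IN fails
    if any(v is None for v in values):
        return None                     # an unknown comparison dominates
    return True

# with a NULL in the list, no row ever passes a WHERE clause
rows = [1, 2, 3, None]
passed = [r for r in rows if sql_not_in(r, [5, None]) is True]
```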

补丁演进

v1deconstruct_array() 之后增加早期检查,使用 memchr()elem_nulls 上检测任意 NULL。David Geier 提出疑问:memchr() 在每次调用时都会带来开销。他建议在 ArrayType 上增加标志位。

v2 改为在逐元素循环内部短路:当元素为 Const 且为 NULL 时,立即返回 0.0。这避免了单独一轮遍历,但仍需进入循环。

v3 将检查提前到 DatumGetArrayTypeP() 之后,使用 ARR_HASNULL() 在解构数组之前检测 NULL。这样在存在 NULL 时,既不需要解构数组,也不需要逐元素循环。Ilia 报告规划时间从 5–200 ms 降至约 1–2 ms。

v4 修复了 Zsolt Parragi 发现的回归。宏 ARR_HASNULL() 仅检查 NULL 位图是否存在,而非是否有元素实际为 NULL。一个最初含有 NULL、但后来通过 array_set_element() 等将所有 NULL 替换掉的数组,仍可能保留 NULL 位图。仅使用 ARR_HASNULL() 会导致对此类数组错误地返回选择性 0.0。

修复方案:使用 array_contains_nulls(),该函数会遍历 NULL 位图,仅在实际存在 NULL 元素时返回 true。v4 还增加了回归测试,通过 replace_elem(ARRAY[1,NULL,3], 2, 99) 构造数组,确保优化器对 x <> ALL(replace_elem(ARRAY[1,NULL,3], 2, 99)) 估计 997 行(而非 0)。

v5–v9 采纳 David Rowley 的反馈:将测试移至新的 selectivity_est.sql 文件(后更名为 planner_est.sql),使用 test_setup.sql 中的 tenk1 而非自定义表,增加关于假定算子为 strict 的注释(与 var_eq_const() 一致),并简化测试以断言不变量(存在 NULL 时选择性为 0.0),而非精确行估计。

Community Feedback

  • David Geier questioned the cost of memchr() and suggested a flag on ArrayType; Ilia found the existing ARR_HASNULL() / array_contains_nulls().
  • Zsolt Parragi found the regression involving arrays whose NULLs had been replaced, and proposed the replace_elem test case.
  • David Geier clarified: ARR_HASNULL() checks whether the bitmap exists, not whether any element actually is NULL; array_contains_nulls() is the correct check.
  • David Rowley suggested moving the tests to a dedicated planner-estimation file, using existing test tables, and documenting the strict-operator assumption. He also committed the refactoring patch (planner_est.sql) and the main optimization patch.

Technical Details

Implementation

The optimization adds two short-circuit paths to scalararraysel():

  1. Constant array case: after DatumGetArrayTypeP(), if !useOr (i.e., <> ALL / NOT IN) and array_contains_nulls(arrayval), return 0.0 immediately. This happens before deconstruct_array().

  2. Non-constant list case: inside the per-element loop, if !useOr and the element is a Const with constisnull set, return 0.0. This handles x NOT IN (1, 2, NULL, ...) when the list comes from non-constant expressions.

The code assumes the operator is strict (matching var_eq_const()): with a NULL constant, the operator yields zero selectivity. This is consistent with existing optimizer behavior.

Edge Cases

  • Arrays with a NULL bitmap but no actual NULLs: handled by using array_contains_nulls() rather than ARR_HASNULL().
  • Non-strict operators: a comment notes that the short-circuit makes the same assumption as var_eq_const().
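The difference between the two checks can be modeled with a toy array that, like PostgreSQL's ArrayType, keeps its null bitmap even after the NULLs are replaced (the names are illustrative, not the actual C macros):

```python
class ToyArray:
    """Toy model: elements plus an optional null bitmap (True = null)."""
    def __init__(self, elements, null_bitmap=None):
        self.elements = elements
        self.null_bitmap = null_bitmap

def arr_hasnull(a):
    # models ARR_HASNULL(): only asks whether a bitmap exists at all
    return a.null_bitmap is not None

def array_contains_nulls(a):
    # models array_contains_nulls(): scans the bitmap for an actual null
    return a.null_bitmap is not None and any(a.null_bitmap)

# array built as ARRAY[1, NULL, 3], then the NULL replaced by 99:
# the bitmap is retained although no element is null any more
arr = ToyArray([1, 99, 3], null_bitmap=[False, False, False])
```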

Related Work

David Geier noted that once the hash-based NOT IN code is merged, the speedup will shrink somewhat, but the optimization still saves substantial work during selectivity estimation.

Current Status

The patch was committed by David Rowley on March 19, 2026. The refactoring that moves optimizer row-estimation tests into planner_est.sql was committed first, with the main optimization following. The optimization will appear in a future PostgreSQL release.

Conclusion

A small change, detecting NULLs in NOT IN / <> ALL lists and returning selectivity 0.0 early, avoids unnecessary per-element work during planning. Getting the fix right required correctly distinguishing a NULL bitmap from actual NULL elements, and it benefited from careful review and regression tests. The optimization is now part of PostgreSQL and will help workloads using large NOT IN lists that contain NULLs.

Week 11 (2026)

PostgreSQL mailing list discussions for Week 11, 2026.

Articles

Safely Converting NOT IN Sublinks to Anti-Joins

Introduction

In February and March 2026, Richard Guo proposed and iterated on an optimizer patch on the pgsql-hackers mailing list that, when semantically safe, automatically converts

... WHERE expr NOT IN (SELECT subexpr FROM ...)

into a join-based ANTI JOIN. The NULL-related semantics of NOT IN have long been treacherous; a careless transformation changes query results, so earlier attempts were mostly shelved.

This time, the patch takes full advantage of optimizer infrastructure added in recent years: Var representation that carries outer-join nullability, a global not-null-attnums hash table, and smarter non-nullability reasoning. The patch evolved from v1 to v6, with review and discussion from wenhui qiu, Zhang Mingli, Japin Li, David Geier, and others. The final version was committed by Richard in March 2026.

This article explains why NOT IN is hard to optimize, how the optimizer proves "safety", the key changes from v1 to v6, and what this means for everyday queries.

Why NOT IN Is Hard

On the surface, NOT IN looks just like an anti-join:

SELECT *
FROM users
WHERE id NOT IN (SELECT user_id FROM banned_users);

Intuitively this means "users who are not banned", and the ideal plan is naturally a hash ANTI JOIN against banned_users. The problem is SQL's definition of NULL.

Within NOT IN, the scalar comparison (A = B) behaves as follows:

  • (A = B) is TRUE: a hit; the NOT IN condition fails.
  • (A = B) is FALSE: this element is not a match; the overall result depends on the other elements.
  • (A = B) is NULL: NOT (NULL) is still NULL, which in WHERE is equivalent to false, so the row is discarded.

In an ANTI JOIN, however, if the join condition returns NULL for a row pair, the executor typically treats it as "no match" and may keep the outer row. As Richard pointed out in the thread, whenever the comparison operator can return NULL on inputs that count as legitimate equality comparisons, the semantics of NOT IN and a plain ANTI JOIN diverge.

Historically, exactly this semantic gap kept the optimizer from converting NOT IN sublinks into ANTI JOINs. Users who wanted join plans usually had to rewrite to NOT EXISTS or similar forms themselves.
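The divergence is easy to reproduce with a toy model: a naive anti-join that treats a NULL comparison as "no match" keeps rows that NOT IN would discard (None models NULL; this is our own sketch, not executor code):

```python
def not_in_filter(rows, sub):
    # rows surviving WHERE x NOT IN (SELECT ...) under SQL three-valued
    # logic; a row is kept only when the predicate is definitely TRUE
    return [x for x in rows
            if x is not None
            and all(v is not None for v in sub)
            and x not in sub]

def naive_anti_join(rows, sub):
    # keep outer rows with no equality match; a NULL comparison is
    # treated as "no match", as a plain anti-join executor would
    return [x for x in rows
            if not any(x is not None and v is not None and x == v
                       for v in sub)]

rows, sub = [1, 2, None], [2, None]
```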

Optimizer Infrastructure: Proving "Cannot Be NULL"

Recent releases introduced infrastructure that can prove "this expression cannot be NULL" more reliably and cheaply:

  • Vars carrying outer-join nullability: record whether a variable can be nulled by an outer join.
  • The not-null-attnums hash table: tracks NOT NULL columns from table definitions (including non-null constraints implied by primary keys and the like).
  • expr_is_nonnullable(): reasons about non-nullability of complex expressions (not just simple Vars/Consts).
  • find_nonnullable_vars(): derives variables forced non-null by quals (for example col IS NOT NULL, or join conditions using strict operators).

In the new patch, Richard uses these capabilities to answer two key questions:

  1. Can the expressions on either side of the comparison be NULL?
    • Use table-level NOT NULL information and outer-join-nullable Var metadata to exclude Vars coming from the nullable side of an outer join.
    • Combine find_nonnullable_vars() with safe qual expressions to recognize values forced non-null by WHERE/ON conditions.
  2. Can the comparison operator itself return NULL on non-null inputs?
    • Consult the system catalogs and allow only operators that belong to a btree or hash operator family, because those operators must honor normal total-order or equality semantics. If such an operator returned NULL on non-null inputs, indexes relying on it would already be broken.

Only when both sides' expressions are proven non-null and the comparison operator is deemed unable to return NULL on non-null inputs will the planner consider rewriting the NOT IN sublink as an ANTI JOIN.

From v1 to v6: Tightening the Safety Boundary

The patch was not a one-shot effort; the mailing thread documents its evolution in detail:

  • v1

    • Implemented the basic conversion logic.
    • Focused on using existing tools to prove both sides of the comparison non-null.
    • Did not yet check whether the operator itself could return NULL on non-null inputs.
  • Discussion around operator safety

    • Richard realized that requiring non-null operands is not enough; the operator itself may also return NULL.
    • He asked whether operators that "never return NULL on non-null inputs" can be identified, suggesting "member of a btree operator class" as an approximation.
    • David Geier pointed out that a great deal of executor code assumes comparison operators do not return NULL; FunctionCall2(), for example, throws an error on a NULL result. Restricting the scope to built-in B-tree / hash operators is therefore reasonable and safe.
  • v2–v4

    • Added the check that operators must be members of a B-tree or hash operator family.
    • Clarified and expanded comments, and added more regression tests covering subquery outputs that come from the nullable side of an outer join but are forced non-null by WHERE conditions.
    • Made small fixes to test-case comments and some edge cases.
  • v5–v6

    • Further polished internal helpers, including sublink_testexpr_is_not_nullable for checking non-nullability of SubLink test expressions.
    • Improved support for row comparison expressions (RowCompareExpr), so multi-column NOT IN patterns benefit as well.
    • Rewrote several foreach loops as foreach_ptr / foreach_node for stronger type checking in development builds.
    • Fixed a subtle but important issue in query_outputs_are_not_nullable(): when expanding grouping expressions and join alias Vars, the order must match how the parser processes FROM/JOIN versus GROUP BY.
    • After filling out and tidying the regression tests and a round of self-review, the patch was declared ready to commit.

By v6, the committed version was semantically conservative enough, well covered by tests, and had absorbed multiple rounds of review feedback.

What Does the Patch Do?

At a high level, when the optimizer sees a NOT IN sublink in the standard ANY/ALL form, it will:

  1. Recognize the pattern: identify expr NOT IN (SELECT ...) in the SubLink and its testexpr.
  2. Collect the outer expressions: the list of expressions on the left (outer query) side of the comparison.
  3. Check operator safety: every operator involved in the comparison must belong to a B-tree or hash operator family.
  4. Prove the operands non-null: using table-level NOT NULL information, outer-join-nullable Var metadata, and the set of "non-null variables" derived from quals, prove that neither the outer expressions nor the subquery outputs can be NULL.
  5. Only if all of the above hold, rewrite the NOT IN sublink as an ANTI JOIN.

After the rewrite, the optimizer can:

  • Pull what used to be an opaque, black-box subplan into the global join tree.
  • Move the subquery freely within the overall join order.
  • Choose the most suitable join algorithm by cost (hash ANTI JOIN, merge ANTI JOIN, and so on).

For users, the payoff is that many exclusion patterns written with NOT IN now automatically get execution plans similar to carefully written NOT EXISTS or explicit anti-joins, without rewriting the SQL by hand.

Example: A Typical Exclusion Query

The patch primarily targets the textbook form:

SELECT *
FROM users
WHERE id NOT IN (SELECT user_id FROM banned_users);

as well as:

SELECT *
FROM users
WHERE id NOT IN (
  SELECT user_id
  FROM banned_users
  WHERE user_id IS NOT NULL
);

In well-designed schemas, users.id and banned_users.user_id are usually both NOT NULL and compared with the standard equality operator. In that scenario the planner can prove:

  • Neither side of the comparison can be NULL.
  • The equality operator used is a standard B-tree/hash equality operator.

The NOT IN sublink is then rewritten as an ANTI JOIN, and the execution plan can be:

  • a Hash Anti Join against banned_users, or
  • a Merge Anti Join when suitable indexes exist and the cost model prefers a merge strategy.

The thread also includes a large-scale stress-test script from wenhui qiu showing how, once the relevant columns are marked NOT NULL, the new optimization automatically produces efficient ANTI JOIN plans on synthetic data.

Community Discussion and Scope

The thread also debated how far the optimization should go:

  • David Geier described more aggressive rewrites: adding IS NOT NULL predicates on the outer side plus an extra NOT EXISTS subquery, to cover more cases where either side can be NULL. Richard gave counterexamples showing that some of those rewrites change results when the subquery is empty, and made clear they are out of scope for this patch.
  • The thread briefly noted that a future Oracle-style "NULL-aware ANTI JOIN executor node" could support more NOT IN cases with a new executor-level operator instead of relying entirely on syntactic rewriting. This was seen as a direction worth exploring later.
  • Reviewers repeatedly stressed the need for a conservative policy: better to miss some theoretical optimization opportunities than to risk changing query results.

The final version deliberately focuses on the high-value base case: both sides of the comparison provably non-null, using standard B-tree/hash comparison operators. That covers the vast majority of real-world NOT IN exclusion queries while balancing code complexity against risk.

Current Status

As of mid-March 2026:

  • The "Convert NOT IN sublinks to anti-joins when safe" v6 patch has been committed.
  • The optimization engages automatically whenever the optimizer can prove it safe.

In practice, this means that if you write a common NOT IN exclusion query over NOT NULL keys with built-in comparison operators, PostgreSQL can now generate an ANTI JOIN plan for you automatically. As long as the schema and constraints accurately reflect non-nullability, no application-side SQL changes are needed.

Implications for Users

  • Declare NOT NULL and primary key constraints accurately. The more precisely the schema expresses which columns cannot be NULL, the more often the optimizer can safely apply this kind of optimization.
  • NOT IN on nullable columns remains risky. PostgreSQL stays conservative in those cases; if you need predictable behavior on data that contains NULLs, NOT EXISTS is usually the better fit.
  • For typical exclusion queries, there is no longer a need to rewrite to NOT EXISTS just to get an ANTI JOIN plan. When the safety conditions hold, the optimizer performs the rewrite automatically, and you can keep using the semantically more intuitive NOT IN.

References

Week 10 (2026)

PostgreSQL mailing list discussions for Week 10, 2026.

Articles

Generic Plans and Initial Pruning: Reducing Lock Contention for Partitioned Tables

Introduction

In December 2021, Amit Langote proposed a patch series to speed up execution of generic plans on partitioned tables. With parameterized prepared statements, a generic plan cannot do partition pruning at plan time, so the plan contains nodes for every partition. As a result, AcquireExecutorLocks() becomes the main bottleneck: it locks every relation referenced by the plan, and the number of partitions can be very large. This article summarizes the idea, the benchmark results, and the pgsql-hackers discussion about safety and design.

Why This Matters

Consider a prepared statement such as:

PREPARE q AS SELECT * FROM partitioned_table WHERE key = $1;
EXECUTE q(123);

plan_cache_mode = force_generic_plan(或优化器已选择通用计划)的情况下,计划在多次执行间共享,且无法在规划时做分区裁剪,因此计划树会包含所有分区。每次执行前,CheckCachedPlan() 都要确认计划仍然有效,其中绝大部分开销来自 AcquireExecutorLocks()——它会对计划中的每个关系加锁。分区数量增加到数百或数千时,加锁成本占主导,吞吐会明显下降。

David Rowley 曾提出将加锁推迟到执行器里完成「初始」裁剪之后再做。该方案因存在竞态而被否决:部分分区不加锁时,并发会话可能在计划被判定有效之后、实际执行之前修改分区,导致计划部分失效。

Amit 的方案则仍在计划检查阶段加锁,但通过复用执行器将使用的「初始」裁剪逻辑,缩小需要加锁的关系集合:只对在初始裁剪后仍保留的分区加锁。这样既保证计划一致性,又让加锁数量随实际参与执行的分区数量增长,而不是随总分区数增长。

技术分析

思路:先裁剪再加锁

分区表的 Append、MergeAppend 节点上带有分区裁剪信息:哪些子计划会被「初始」(执行前)步骤裁掉,哪些会被「执行时」步骤裁掉。初始步骤只依赖执行前就已知的值(例如绑定参数),不依赖行级数据。补丁让 AcquireExecutorLocks() 在收集要加锁的关系时:

  1. 像现在一样遍历计划树;
  2. 对带有初始裁剪步骤contains_init_steps)的 Append/MergeAppend 节点,执行与执行器相同的初始裁剪逻辑,得到保留的子计划集合;
  3. 仅将这些保留子计划对应的关系加入待加锁集合。

因此加锁集合与真正会执行到的关系一致:被初始裁剪掉的分区不会加锁,会用到的分区不会漏锁。
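The prune-then-lock walk can be sketched as a toy model (not PostgreSQL's API; the node dictionaries, the prune callback, and relations_to_lock are all invented for illustration): at an Append node carrying init pruning steps, only the surviving children are descended into when collecting relations to lock:

```python
def relations_to_lock(node, params):
    """Hypothetical model of the patched AcquireExecutorLocks() walk."""
    if node["type"] == "Scan":
        return {node["rel"]}
    if node["type"] == "Append":
        children = node["children"]
        prune = node.get("prune")
        if prune is not None:                  # init pruning steps present
            children = [children[i] for i in prune(params)]
        locks = {node["rel"]} if node.get("rel") else set()
        for child in children:
            locks |= relations_to_lock(child, params)
        return locks
    return set()

plan = {
    "type": "Append",
    "rel": "parent",
    # toy pruning rule: keep only the partition whose range covers the parameter
    "prune": lambda params: [params["key"] // 100],
    "children": [{"type": "Scan", "rel": f"part_{i}"} for i in range(10)],
}
```

With a bound parameter of 250, only the parent and part_2 would be locked, rather than the parent plus all ten partitions.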

Pruning runs twice

This means "initial" pruning for a generic plan now runs twice: once in AcquireExecutorLocks() to decide what to lock, and once in ExecInit[Merge]Append() to decide which partition subnodes to initialize. Amit said he found no clean way to eliminate that duplication without changing when locks are taken (for example, moving locking into executor startup), which would be a much larger change.

Benchmarks

On a partitioned pgbench database with plan_cache_mode = force_generic_plan:

  • HEAD: throughput falls sharply as partitions increase (for example, about 20.5k tps at 32 partitions versus about 1.3k tps at 2048 partitions).
  • Patched: throughput stays high (for example, about 27.5k tps at 32 partitions and about 16.3k tps at 2048 partitions).

This shows the patch largely removes the locking scalability bottleneck for generic plans over many partitions.

Community Discussion

Applicability

Ashutosh Bapat asked in which situations "pre-execution" pruning instructions are available to reduce locking. Amit explained:

  • The main scenario is prepared statements using a generic plan, for example PREPARE q AS SELECT * FROM partitioned_table WHERE key = $1; together with EXECUTE q(...).
  • Other bottlenecks (such as executor startup/shutdown logic that walks the full range table) are untouched by this patch.

Code review (Amul Sul)

Amul made several style and structure suggestions on v1:

  • Move variable declarations inside the if (pruneinfo && pruneinfo->contains_init_steps) branch.
  • Add a short comment for the case where that condition is false: plan_tree_walker() keeps walking the children, so locking behavior remains correct.
  • Prefer the existing GetLockableRelations_worker() and friends over adding a new get_plan_scanrelids().
  • Use plan_walk_members() for CustomScan, consistent with other node types.
  • The temporary EState used for pruning in the lock-collection path should be created and freed by the caller, not inside the helper that collects lockable relations.
  • Use foreach_current_index() in the relevant loops for readability.

Safety (Robert Haas)

Robert raised two important concerns.

1. "Partially valid" plans. Today we only execute fully valid plans: we lock all relations, so we receive invalidation messages and notice DDL that could invalidate the plan. If we skip locking some relations, we may never see invalidations for them. For example:

  • A partition has an extra index, and the plan contains an Index Scan using it;
  • That partition is removed by initial pruning, so we do not lock it;
  • Another session drops the index;
  • We still consider the plan valid. The pruned part will not be executed, but code that walks the whole plan tree (such as EXPLAIN or auto_explain) could visit that node and fail (for example, when looking up the index name).

In other words, this introduces "partially valid" plans, an assumption the current code does not make. Robert did not claim a concrete bug in core is triggered by this, but considered it a new class of risk.

Amit replied that he examined the paths that touch the plan tree before executor initialization and found none that reach the pruned parts; EXPLAIN runs after ExecutorStart(), when the PlanState tree has been built and contains only the unpruned parts. He agreed this cannot be called absolutely safe.

2. The lock set and the initialization set must agree. Doing initial pruning in two places means two independent computations. If their results diverge (for example, because a function labeled IMMUTABLE is actually VOLATILE), we could lock one set of partitions and initialize another. Robert argued the design should make divergence impossible rather than relying on the two computations always agreeing.

Amit agreed the patch assumes initial pruning is deterministic (no VOLATILE functions in pruning expressions). A mislabeled IMMUTABLE function could cause an Assert failure, or on non-Assert builds the use of an unlocked partition, which would be serious.

Technical Details

Scope

  • Generic plans only: custom plans can prune partitions at plan time and do not retain all partitions in the plan the way generic plans do.
  • Initial pruning only: only partitions removed by initial (pre-execution) pruning are excluded from the lock set. Partitions subject to exec-time pruning (for example, different parameter values on another execution) remain in the plan and are still locked; the patch does not change that.

Edge cases

  • EXPLAIN / auto_explain: the concern is that they might visit plan nodes for relations we did not lock. Amit's analysis is that they run after executor initialization and only see the initialized (pruned) plan.
  • Mislabeled VOLATILE/IMMUTABLE: could make the lock set and the initialization set diverge; the patch adds no extra defense against this.

Current Status

Only one patch version (v1) appeared in this thread, with no follow-up postings shown. The discussion produced the following consensus and open items:

  • The benchmark gains for generic plans over many partitions are clear.
  • The code-style and refactoring suggestions (EState lifetime, walker usage, comments) need to be applied.
  • Design questions remain open: how to guarantee the lock set and the initialization set cannot diverge, and whether "partially valid" plans are acceptable for code paths that walk the whole plan tree.

Summary

Amit's patch reduces AcquireExecutorLocks() overhead for generic plans on partitioned tables by locking only the partitions that survive initial pruning, while avoiding the race of the earlier "defer locking" approach. Benchmarks show large throughput gains at high partition counts. The discussion clarified the applicable scenario (prepared statements using generic plans), raised reasonable safety and consistency concerns (partially valid plans, duplicated pruning), and provided concrete review feedback. Guaranteeing that locking and executor initialization use the same pruning result would further improve reliability, and may require some restructuring of lock timing or executor startup.

References

Week 09 (2026)

PostgreSQL mailing list discussions for Week 09, 2026.

Articles

Further Speeding Up Tuple Deformation: Precomputing attcacheoff

Introduction

Tuple deformation is the process of unpacking a PostgreSQL heap tuple's raw byte representation into individual attribute values in a TupleTableSlot. It happens constantly during query execution: every time a sequential scan, index scan, or join produces a row, the executor must deform the tuple to access column values. For workloads that process millions of rows, even small optimizations to the deformation hot path yield noticeable gains.

David Rowley has been optimizing tuple deformation for some time. PostgreSQL 18 already includes several of his patches: CompactAttribute (5983a4cff), faster offset alignment (db448ce5a), and an inlined deforming loop (58a359e58). Building on that, he proposed precomputing attcacheoff instead of computing it on every attribute access. The discussion reached v10 (February 2026), with Andres Freund contributing a NULL-bitmap-to-isnull scheme that achieved speedups of up to 63% on Apple M2 in some scenarios. The patch set is under active review.

Why This Matters

When the executor needs to read a column value from a tuple, it must:

  1. Align: round the current offset up to the attribute's alignment requirement
  2. Fetch: read the value via fetch_att()
  3. Advance: skip past the current attribute to the next

These steps form a dependency chain: every offset depends on the previous one, leaving little room for instruction-level parallelism. For fixed-length attributes, PostgreSQL can cache the offset (attcacheoff) to avoid recomputing alignment and length — but until now the caching happened inside the deforming loop. David's idea: compute the offsets once when the TupleDesc is finalized, rather than once per tuple.

Technical Approach

TupleDescFinalize()

The core change introduces TupleDescFinalize(), which must be called after a TupleDesc is created or modified. The function:

  1. Precomputes attcacheoff for all fixed-length attributes
  2. Records firstNonCachedOffAttr — the attnum of the first attribute whose offset cannot be cached (the first varlena or cstring attribute)
  3. Enables a tight loop that processes all attributes with cached offsets before moving on to attributes whose offsets must be computed manually

If a tuple contains a NULL before the last attribute with a cached offset, attcacheoff can only be used up to that NULL — but for tuples without early NULLs, the fast path handles many attributes in a tight loop without any per-attribute offset arithmetic.
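The precomputation itself can be sketched in a few lines (a toy model, not the patch's C code; attrs, finalize_offsets, and the tuple layout here are simplified — no null bitmap, no padding subtleties). Offsets for the fixed-length prefix are aligned and accumulated once, and the first variable-length attribute ends the cacheable run:

```python
def align_up(off, alignment):
    """Round off up to the next multiple of alignment."""
    return -(-off // alignment) * alignment

def finalize_offsets(attrs):
    """attrs: list of (attlen, attalign); attlen < 0 marks a varlena column.
    Returns the cached offsets for the fixed-length prefix and the index of
    the first attribute whose offset cannot be cached."""
    off, cached = 0, []
    for i, (attlen, attalign) in enumerate(attrs):
        if attlen < 0:
            return cached, i
        off = align_up(off, attalign)
        cached.append(off)
        off += attlen
    return cached, len(attrs)
```

For a row of (int4, int8, int2, varlena) the cached offsets come out as 0, 8, 16 (the int8 forcing 8-byte alignment), with attribute 3 being the first non-cacheable one — exactly the information the fast deforming loop consumes.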

Specialized deforming loop

The patch adds a specialized loop that first processes all attributes with a precomputed attcacheoff, then falls into the loop handling varlena/cstring attributes. For tuples with HEAP_HASNULL set, the current code calls att_isnull() for every attribute. A further optimization keeps deforming without calling att_isnull() until the first NULL is reached. Scenario #5 in the benchmarks (first column int not null, last column int null) demonstrates this best and usually shows the biggest speedup.

Optional OPTIMIZE_BYVAL loop

An optional variant adds a loop for tuples where every deformed attribute has attbyval == true. There, fetch_att() can be inlined without the branches for by-reference types, yielding fewer branches and a tighter loop. The cost: an extra attnum < firstByRefAttr check when the optimization does not apply. In the benchmarks, whether this pays off varies across hardware and compilers.

Benchmark design

To maximize the share of total CPU spent on deformation, David designed the benchmark query:

SELECT sum(a) FROM t1;

where column a is nearly last, so all preceding attributes must be deformed before a can be read. Eight table shapes cover combinations of first column (int/text, null/not null) and last column (int null/not null). Each table shape is run with 0, 10, 20, 30, and 40 extra INT NOT NULL columns — each benchmark run covers 40 scenarios at 1 million rows each.

Benchmark results

Results vary by hardware and compiler:

  • AMD Zen 2 (3990x) + GCC: average speedup up to 21% with OPTIMIZE_BYVAL; some tests above 44%; no regressions.
  • AMD Zen 2 + Clang: small regressions in the 0-extra-column scenarios.
  • Apple M2: clear gains in scenarios #1 and #5; smaller gains elsewhere; slight regressions with some patches.
  • Intel (Azure): run on a shared, few-core instance; results were noisy due to the shared L3.

Patch evolution

v1 → v3 (December 2025 – January 2026)

  • v1: three patches — 0001 (precompute attcacheoff), 0002 (experimental NULL-bitmap look-ahead), 0003 (remove the specialized hasnulls loop)
  • v2: rebase, fix a NULL-bitmap Assert in 0003, JIT fix (remove TTS_FLAG_SLOW), more benchmarks
  • v3: rebase, drop 0002 and 0003 (limited benchmark gains), keep only 0001

v4 (January 2026)

In response to Chao Li's code review:

  • NULL bitmap mask (tupmacs.h): added a comment — when natts & 7 == 0, the mask is 0 and the code correctly returns natts
  • Uninitialized TupleDesc: firstNonCachedOffAttr == 0 means no cached attributes; -1 means uninitialized. Added an Assert whose failure message suggests calling TupleDescFinalize()
  • Typo: "possibily" → "possibly"
  • LLVM: fixed a compiler warning

v5–v8 (January–February 2026): Andres Freund's NULL-bitmap optimization

Andres Freund joined the discussion with a key improvement: instead of calling att_isnull() per column, compute the isnull[] array directly from the NULL bitmap using SWAR (SIMD Within A Register). The idea: multiply a bitmap byte by a carefully chosen constant (such as 0x204081) so each bit spreads into its own byte, then mask. This avoids a 2KB lookup table and performs well on most hardware.
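The multiply-and-mask trick can be demonstrated directly (a sketch of the idea, not the patch's C code; spread_nibble and isnull_from_bitmap are invented names). The constant 0x204081 has bits 0, 7, 14, and 21 set, so multiplying a 4-bit nibble by it lands each nibble bit in its own byte once masked:

```python
SPREAD_BITS_MULTIPLIER_32 = 0x204081   # set bits spaced 7 positions apart

def spread_nibble(nibble):
    """Spread each of a nibble's 4 bits into its own byte of a 32-bit word."""
    return (nibble * SPREAD_BITS_MULTIPLIER_32) & 0x01010101

def isnull_from_bitmap(bitmap_byte, natts):
    """Derive isnull flags for up to 8 attributes from one NULL-bitmap byte.
    In the heap format a set bit means 'not null', so invert first; the low
    and high nibbles are handled separately, mirroring the patch's approach."""
    inverted = (~bitmap_byte) & 0xFF
    lo = spread_nibble(inverted & 0x0F)
    hi = spread_nibble(inverted >> 4)
    word = lo | (hi << 32)               # 8 bytes, one 0/1 flag per attribute
    return [bool((word >> (8 * i)) & 1) for i in range(natts)]
```

Two multiplies and two masks replace eight att_isnull() calls, with no lookup table — which is why the trick vectorizes and pipelines well on most hardware.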

David implemented this in patch 0004 ("Various experimental changes"). Other changes in 0004:

  • populate_isnull_array(): bulk-converts the NULL bitmap into tts_isnull using the multiplication trick
  • tts_isnull sizing: rounded up to a multiple of 8 so the loop can write 8 bytes at a time (avoiding memset inlining issues)
  • t_hoff: for !hasnulls tuples, use MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) instead of t_hoff
  • fetch_att_noerr(): a new elog-free variant for the common attlen == 8 case

John Naylor pointed out that __builtin_ctz(~bits[bytenum]) is undefined when the byte is 255; David fixed it with a cast: pg_rightmost_one_pos32(~((uint32) bits[bytenum])).

With 0004 enabled, Apple M2 averaged 53% faster than master (about 63% excluding the 0-extra-column tests). Andres suggested using pg_nounroll and pg_novector pragmas to stop GCC from over-vectorizing populate_isnull_array(), which had been generating inefficient code.

v9 (February 24, 2026)

  • Patch reordering: deform_bench moved to 0001 to make benchmarking on master easier
  • 0004 (new): sibling-call optimization for slot_getsomeattrs — moving slot_getmissingattrs() into getsomeattrs() lets the compiler apply tail-call optimization, reducing overhead and improving the 0-extra-column tests
  • 0005 (new): shrink CompactAttribute from 16 bytes to 8 — attcacheoff becomes int16 (max 2^15), booleans become bit flags. Andres noted 8 bytes lets the compiler use a single LEA with scale factor 8; 6 bytes would require two LEAs

v10 (February 25, 2026) — latest patch set

Based on the actual v10 patch contents:

0003 (optimize tuple deformation)

  • firstNonCachedOffsetAttr: index of the first attribute without a cached offset
  • firstNonGuaranteedAttr: index of the first attribute that can be NULL, missing, or !attbyval. When deforming only up to this attribute, there is no need to read HeapTupleHeaderGetNatts(tup), reducing CPU pipeline dependencies
  • TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS: an opt-in flag for the guaranteed-attribute optimization (some code deforms tuples before NOT NULL validation)
  • populate_isnull_array(): uses SPREAD_BITS_MULTIPLIER_32 (0x204081) to spread each bit of the inverted NULL bitmap into its own byte; the low and high 4 bits are handled separately to avoid uint64 overflow
  • fetch_att_noerr(): an elog-free fetch_att() variant; safe when attlen comes from CompactAttribute
  • first_null_attr(): finds the first NULL in the bitmap using pg_rightmost_one_pos32 (__builtin_ctz)

0004 (sibling-call optimization)

  • getsomeattrs() is now responsible for calling slot_getmissingattrs()
  • slot_getmissingattrs(): a for loop replaces memset (benchmarks showed the loop is faster)
  • slot_deform_heap_tuple(): calls slot_getmissingattrs() at the end when attnum < reqnatts; the parameter changes from natts to reqnatts

0005 (8-byte CompactAttribute)

  • attcacheoff becomes int16; offsets above PG_INT16_MAX are not cached
  • attispackable, atthasmissing, attisdropped, attgenerated become bit flags
  • cattrs = tupleDesc->compact_attrs is saved to help GCC generate better code (avoiding repeated TupleDescCompactAttr() calls)

Review fixes

  • Amit Langote: fixed rebase noise (a duplicated attcacheoff check)
  • Zsolt Parragi: big-endian fix — add pg_bswap64() before the memcpy in populate_isnull_array()
  • Typos: "benchmaring" → "benchmarking", "to info" → "into"
  • Andres: set *offp before slot_getmissingattrs to reduce stack spills; change attnum to size_t to fix GCC code generation under -fwrapv

deform_bench and benchmark infrastructure

Andres and Álvaro Herrera discussed where deform_bench should live: src/test/modules/benchmark_tools, src/benchmark/tuple_deform, or a single microbenchmark extension. Andres favored merging useful tools incrementally rather than waiting for a complete suite. David preferred to finish the deformation patches first; deform_bench may be committed separately.

Code review: Chao Li's feedback

Chao Li reviewed the patches and raised several points:

  1. NULL bitmap mask: add a comment explaining there is no overflow/out-of-bounds risk when natts & 7 == 0
  2. Uninitialized TupleDesc: initialize firstNonCachedOffAttr to -1 at TupleDesc creation; assert >= 0 in nocachegetattr()
  3. Consistent semantics: 0 means "no cached attributes", > 0 means "cached"
  4. Typo: "possibily" → "possibly"

David addressed all of these in v4.

Current Status

  • v10 (February 2026) is the latest patch set: 0001 (deform_bench), 0002 (TupleDescFinalize stub), 0003 (main optimization), 0004 (sibling call + NULL bitmap → isnull), 0005 (8-byte CompactAttribute)
  • Andres Freund supports committing 0004, considering the gains clear; 0005's benefit is less certain (it helps LEA addressing when deforming few columns)
  • Zsolt Parragi (Percona), Álvaro Herrera, John Naylor, and Amit Langote continue to review
  • Where deform_bench should live (src/test/modules or src/benchmark) is still under discussion; David wants to land the optimization patches first

Conclusion

Precomputing attcacheoff in TupleDescFinalize() and using a specialized tight loop for attributes with cached offsets yields a measurable speedup for tuple deformation on modern CPUs. The optimization works best when tuples have many fixed-length columns and NULLs are rare or appear late. Combined with Andres Freund's NULL-bitmap-to-isnull scheme (the "0x204081" SWAR trick), Apple M2 reaches 63% speedups excluding edge cases. The sibling-call optimization for slot_getsomeattrs reduces overhead further. Results vary by hardware and compiler; GCC may over-vectorize some loops, which can be mitigated with pragmas or by changing loop indexes to size_t. The patch set (v10) has been through multiple review rounds by Andres, John Naylor, Zsolt Parragi, Álvaro Herrera, Amit Langote, and others, and is moving toward integration.

References

Week 08 (2026)

PostgreSQL mailing list discussions for Week 08, 2026.

Articles

Eliminating SPI from RI Triggers: A Fast Path for Foreign Key Checks

Introduction

Referential integrity (RI) triggers in PostgreSQL have traditionally executed SQL queries via SPI (the Server Programming Interface) to verify that newly inserted or updated rows in a referencing table have a matching row in the referenced (primary key) table. For bulk operations (large INSERTs and UPDATEs), that means starting and tearing down a full execution plan per row, and the ExecutorStart() / ExecutorEnd() overhead adds up quickly.

Amit Langote has been working on eliminating that overhead by replacing SQL plans with direct index probes for RI checks. The latest iteration of this work, "Eliminating SPI / SQL from some RI triggers - take 3", speeds up bulk foreign key checks by up to 57% by bypassing the SPI executor and invoking the index access method directly where constraint semantics allow.

The patch set has been through multiple versions, with Junwang Zhao joining the development in late 2025. The current direction is a hybrid fast path + fallback: direct index probes for simple cases, and a fallback to the existing SPI path when correctness depends on the executor's more complex behavior.

Why This Matters

Foreign key constraints are everywhere. Every INSERT or UPDATE on a referencing table fires an RI check validating that each row has a match in the referenced table's primary key. Traditionally:
CREATE TABLE pk (a int PRIMARY KEY);
CREATE TABLE fk (a int REFERENCES pk);

INSERT INTO pk SELECT generate_series(1, 1000000);
INSERT INTO fk SELECT generate_series(1, 1000000);  -- 1 million RI checks

Every insert fires an RI check that:

  1. Builds a query plan to scan the primary key index
  2. Calls ExecutorStart() and ExecutorEnd()
  3. Runs the plan to find (or confirm the absence of) a matching row

Setting up and tearing down an execution plan per row dominates the total time. With Amit's v3 patches, the same bulk insert drops from about 1000 ms to about 432 ms (57% faster) — by probing the primary key index directly, without the executor.

Technical Background

The traditional RI path

The RI trigger functions in ri_triggers.c (such as RI_FKey_check) call ri_PerformCheck(), which:

  1. Builds a SQL string such as SELECT 1 FROM pk WHERE pk.a = $1
  2. Executes it with SPI_prepare and SPI_execute_plan
  3. Lets the executor run an index scan on the primary key, returning a row if the referenced value exists

This is correct in every case — partitioned tables, temporal foreign keys, concurrent updates — but every row pays the full plan-execution cost.

The fast-path idea

For simple foreign keys (referenced table not partitioned, no temporal semantics), the check boils down to: "probe the primary key index with this value; if a row is found and can be locked, the check passes". This can be done by:

  1. Opening the primary key relation and its unique index
  2. Building scan keys from the foreign key column values
  3. Calling index_getnext() (or equivalent) to find the tuple
  4. Locking it with LockTupleKeyShare under the current snapshot

No SQL, no plan, no executor — just a direct index probe and a tuple lock.
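As a toy illustration of why this is cheap (not the patch's code; pk_index standing in for the unique index and ri_check_fast are invented for the example), the fast path reduces the per-row work to a single keyed lookup, with no plan setup or teardown:

```python
def ri_check_fast(pk_index, fk_value):
    """Toy model of the fast path: one index probe per referencing row.
    pk_index maps primary key values to their (notional) heap tuples."""
    if fk_value is None:
        return True                 # NULL FK values never fail the check
    return fk_value in pk_index     # one probe; no planner, no executor

# the referenced table's unique index, modeled as a dict
pk_index = {k: ("heap-tid", k) for k in range(1000)}
```

Everything the SPI path does per row (parse, plan, ExecutorStart, scan, ExecutorEnd) collapses into that one membership probe plus, in the real patch, a tuple lock and the documented fallback cases.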

Patch evolution

v1: the original approach (December 2024)

The initial patch set (3 patches) introduced:

  • 0001: Refactor the PartitionDesc interface to pass the snapshot needed for omit_detached visibility (detach-pending partitions) explicitly. This fixed a bug: under REPEATABLE READ, RI lookups manipulate ActiveSnapshot, and find_inheritance_children() depends on that snapshot for the visibility of detach-pending partitions, so primary key lookups could return wrong results.
  • 0002: Avoid SPI in the RI trigger functions, introducing the direct index probe path.
  • 0003: Avoid SQL queries for some RI checks — the main performance optimization.

Amit noted that temporal foreign key queries stay on the SPI path, since their plans involve range overlaps and aggregation that a simple index probe cannot handle. He also added logic equivalent to EvalPlanQual() to the fast path to handle concurrent updates correctly under READ COMMITTED.

v2: Junwang's hybrid fast path (December 2025)

Junwang Zhao continued the work with a hybrid design:

  • 0001: Add a fast path for foreign key constraint checks, applicable when the referenced table is not partitioned and the constraint has no temporal semantics.
  • 0002: Cache fast-path metadata (operator hash entries, operator OIDs, strategy numbers, subtypes). At that point the metadata cache did not yet show performance gains.

Benchmarks (1 million rows, numeric primary key / bigint foreign key):

  • master: INSERT 13.5s, UPDATE 15s
  • patched: INSERT 8.2s, UPDATE 10.1s

v3: Amit's restructure and per-statement caching (February 2026)

Amit restructured Junwang's work into two patches:

  • 0001: A feature-complete fast path, covering concurrency handling, the REPEATABLE READ cross-check, cross-type operators, security context (RLS/ACL), and metadata caching. The main logic lives in ri_FastPathCheck(); RI_FKey_check only branches and falls back to SPI when needed.

  • 0002: Per-statement resource caching. Instead of sharing the EState between trigger.c and ri_triggers.c, it introduces a new AfterTriggerBatchCallback mechanism invoked at the end of each trigger execution cycle. This allows the primary key relation, index, scan descriptor, and snapshot to be cached within one cycle and reused across FK trigger invocations, instead of being opened and closed per row.

Amit's benchmarks:

  • 1M rows, numeric/bigint: master 2444 ms; 0001 1382 ms (43% faster); 0001+0002 1202 ms (51% faster)
  • 1M rows, int/int: master 1000 ms; 0001 520 ms (48% faster); 0001+0002 432 ms (57% faster)

The extra gain from 0002 (about 13–17%) comes from eliminating per-row relation open/close, scan begin/end, and slot allocation/free, and from replacing per-row GetSnapshotData() with a cached snapshot copy.

Design: when to use the fast path, when to use SPI

The fast path applies when:

  • The referenced table is not partitioned
  • The constraint does not involve temporal semantics (range overlaps, range_agg(), and so on)
  • Multi-column keys, cross-type equality (via the index operator family), collation matching, and RLS/ACL are all handled inside the fast path

The check falls back to SPI when:

  1. Concurrent update or delete: if table_tuple_lock() reports that the target tuple was updated or deleted, delegate to SPI so EvalPlanQual and the visibility rules handle it with the existing logic.
  2. Partitioned referenced tables: probes would need routing to the correct partition via a PartitionDirectory; this can be supported by a follow-up patch.
  3. Temporal foreign keys: they use range-overlap and containment semantics, intrinsically involving aggregation, and stay on the SPI path.

Security behavior matches the existing SPI path: the fast path temporarily switches to the table owner during the probe, using SECURITY_LOCAL_USERID_CHANGE | SECURITY_NOFORCE_RLS, consistent with ri_PerformCheck().

Future directions

David Rowley suggested in an off-list exchange that batching multiple FK values into a single index probe could improve performance further, leveraging PostgreSQL 17's ScalarArrayOp improvements for btree. The idea: buffer FK values across trigger invocations in a per-constraint cache, build an SK_SEARCHARRAY scan key, and let the btree AM scan the matching leaf pages in one ordered traversal instead of one tree descent per row. Locking and rechecking would remain per tuple. This could be explored as a separate patch on top of the current series.

Current Status

  • The patch series lives under PG19-Drafts. Amit moved it there in October 2025; Junwang Zhao is continuing the work.
  • Amit's v3 patches (February 2026) are substantially complete and awaiting review. Feedback is welcome, particularly on the concurrency handling in ri_LockPKTuple() and the snapshot lifetime in 0002.
  • Pavel Stehule offered to help with testing and review.

Conclusion

Eliminating SPI calls for simple foreign key checks delivers a solid performance win for bulk operations. The hybrid fast path + fallback design addresses reviewers' correctness concerns: fall back to SPI whenever correctness depends on the executor's more complex behavior. The per-statement resource caching in v3 optimizes further by amortizing relation and index setup across the many rows of a single trigger execution cycle.

For bulk insert or update workloads with many foreign keys — common in ETL, staging loads, and data migrations — this work promises significant runtime reductions. The current limitations (partitioned primary keys, temporal foreign keys) keep those cases on the existing path, preserving correctness while optimizing the majority of FK workloads.

References

Week 07 (2026)

PostgreSQL mailing list discussions for Week 07, 2026.

Articles

SQL/PGQ: Bringing Graph Queries to PostgreSQL

Introduction

In February 2024, Peter Eisentraut announced on the pgsql-hackers mailing list a prototype implementation of SQL Property Graph Queries (SQL/PGQ) — a new way to run graph-style queries directly in PostgreSQL, following the SQL:2023 standard (ISO 9075-16). The proposal had been discussed briefly at the FOSDEM developer meeting, and community interest prompted Peter to clean up and share the work in progress.

Nearly two years later, the patch has grown into a substantial implementation: 118 files changed and roughly 14,800 lines of new code. Peter and Ashutosh Bapat are the main authors, with Junwang Zhao reviewing and Ajay Pal and Henson Choi testing. The latest version (v20260113) consolidates everything from v0 through v14 and beyond — cyclic path patterns, access permissions, RLS support, graph element functions (LABELS(), PROPERTY_NAMES()), multi-pattern path matching, ECPG support, property collations, and pg_overexplain integration.

SQL/PGQ lets you define property graphs over existing relational tables and query them with path patterns (vertices connected by edges), similar to Cypher or GQL. Unlike dedicated graph databases, this maps graphs onto the relational model: a graph is a view over tables, and graph queries are rewritten into joins and unions. The discussion raised important architectural questions: when and how this transformation should happen inside PostgreSQL — and whether the rewriter is the right place for it.

Background and Significance

Many applications have naturally graph-shaped data: social networks, supply chains, recommendation systems, fraud detection. Today developers typically either:

  • Use a separate graph database (such as Neo4j) and maintain two systems, or
  • Encode graph traversals in PostgreSQL with recursive CTEs and complex joins.

SQL/PGQ aims to give PostgreSQL users a standard, declarative way to express graph queries without leaving SQL or duplicating data. The standard has already been adopted elsewhere (for example, in Oracle 23c); bringing it to PostgreSQL improves interoperability and puts graph capabilities in the hands of a much broader user base.

Technical Analysis

The SQL/PGQ model

A property graph in SQL/PGQ is a virtual structure defined over existing tables:

  • Vertices: rows from one or more tables (with optional labels).
  • Edges: relationships, usually inferred from foreign keys or specified explicitly.
  • Properties: columns from those tables.
  • Labels: can be shared across element tables (for example, person on both customers and employees), with each label exposing its own set of properties.

Graphs are created with CREATE PROPERTY GRAPH and queried with GRAPH_TABLE(... MATCH ... COLUMNS ...). Example:

CREATE PROPERTY GRAPH myshop
    VERTEX TABLES (
        products LABEL product,
        customers LABEL customer,
        orders LABEL "order"
    )
    EDGE TABLES (
        order_items SOURCE orders DESTINATION products LABEL contains,
        customer_orders SOURCE customers DESTINATION orders LABEL has_placed
    );

SELECT customer_name FROM GRAPH_TABLE (myshop
  MATCH (c IS customer)-[IS has_placed]->(o IS "order" WHERE o.ordered_when = current_date)
  COLUMNS (c.name AS customer_name));

The MATCH clause describes a path pattern; the implementation rewrites it into joins and filters that PostgreSQL can plan and execute.

Implementation approach: the rewrite system

The implementation performs the graph-to-relational transformation in the rewrite system — the same phase as view expansion. Peter explained that this matches the SQL/PGQ specification: graphs map to relations, and queries are expanded like view definitions. By the time the optimizer sees the query, the graph structure has been expanded into standard relational form, which keeps it consistent with view security (permissions, security barriers).
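In the spirit of that expansion, here is a toy, self-contained sketch (not the actual rewriter; the row layout and match_vertex_edge_vertex are invented for the example) of how a vertex–edge–vertex MATCH becomes a join of the edge table with its two endpoint vertex tables:

```python
def match_vertex_edge_vertex(src_rows, edge_rows, dst_rows, src_key, dst_key):
    """Expand (src)-[edge]->(dst) into a join: each edge row links one source
    vertex row to one destination vertex row via the edge's key columns."""
    src_by_key = {r[src_key]: r for r in src_rows}
    dst_by_key = {r[dst_key]: r for r in dst_rows}
    return [
        (src_by_key[e["src"]], dst_by_key[e["dst"]])
        for e in edge_rows
        if e["src"] in src_by_key and e["dst"] in dst_by_key
    ]

# tiny analogue of the myshop example: customers -[has_placed]-> orders
customers = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
orders = [{"id": 10}]
customer_orders = [{"src": 1, "dst": 10}]
pairs = match_vertex_edge_vertex(customers, customer_orders, orders, "id", "id")
```

A COLUMNS clause then corresponds to projecting fields out of the joined pairs — exactly the relational shape the optimizer receives after rewriting.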

Patch evolution: from v0 to v20260113

v0: fragile prototype (February 2024)

The initial patch was about 332 KB. Peter called it "quite fragile". It introduced CREATE PROPERTY GRAPH, GRAPH_TABLE, basic path patterns, and documentation in ddl.sgml and queries.sgml.

v1 and early polish (June–August 2024)

Peter and Ashutosh posted v1 with a "reasonably complete minimal feature set". Ashutosh contributed:

  • WHERE clauses in graph patterns — for example MATCH (a)->(b)->(c) WHERE a.vname = c.vname.
  • A fix for false "column not found" errors: property names originally referenced heap tuples from pg_attribute and could point to freed memory after RELCACHE_FORCE_RELEASE. The fix copies the attribute names out.
  • Compilation fixes, pgperltidy compliance, and error position reporting for implicit properties/labels.

Imran Zaheer added support for EXCEPT lists of labels and properties.

v14: cyclic paths, access permissions, RLS (August–October 2024)

Ashutosh contributed major features:

  • Cyclic path patterns: patterns where an element variable repeats within a path (for example, the same vertex at both the start and the end). Elements sharing a variable must have the same type and label expression; repeated edge patterns are not supported.
  • Access permissions for property graphs: a property graph behaves like a security-invoker view — the current user must hold privileges on the underlying tables. Queries succeed only for elements the user can access. Security-definer property graphs are not implemented.
  • Row-level security (RLS): regression tests in graph_table_rls.sql verify that RLS works with property graphs.
  • Property collations and edge-vertex links: properties with the same name must have the same collation across elements. Collations of an edge's key columns must be compatible with the referenced vertex keys. Edge-vertex links are built with equality operators, with dependencies recorded so those operators cannot be dropped without dropping the edge.
  • \d and \dG variants: \d on a property graph shows elements, tables, kinds, and endpoint vertex aliases; \d+ adds labels and properties. \dG lists property graphs; \dG+ adds owner and description.

Henson Choi: LABELS(), PROPERTY_NAMES(), multiple patterns (December 2025)

Henson Choi contributed three patches:

  • The LABELS() graph element function: returns all labels of a graph element as text[]. Implemented by wrapping element tables in subqueries with a dummy __labels__ column, which lets the optimizer prune Append branches when filtering by label (for example WHERE 'Person' = ANY(LABELS(v))).
  • The PROPERTY_NAMES() graph element function: returns all property names as text[], with the same pruning benefit for property-based filters.
  • Multi-pattern path matching: comma-separated path patterns in MATCH, such as MATCH (a)->(b), (b)->(c). Patterns sharing variables merge into one join; disconnected patterns produce a Cartesian product (consistent with SQL/PGQ and Neo4j Cypher).

v20260113: consolidated implementation (January 2026)

The latest patch (v20260113) merges the prior work into a single WIP patch:

  • ECPG support: SQL/PGQ in ECPG — basic queries, prepared statements, cursors, dynamic queries. Label disjunction in ECPG required changes to the ecpg lexer.
  • pg_overexplain integration: property graph RTEs and RELKIND_PROPGRAPH are recognized in EXPLAIN (RANGE_TABLE, ...).
  • Expanded test coverage: create_property_graph.sql (365 lines), graph_table.sql (561 lines), graph_table_rls.sql (363 lines), privileges.sql (58 lines).
  • rewriteGraphTable.c: grown from about 420 to about 1,330 lines; propgraphcmds.c from about 1,000 to about 1,860 lines.

Community Discussion

Andres Freund: concerns about the rewriter

Andres Freund raised a structural concern: transforming via the rewrite system prevents the optimizer from exploiting graph semantics and adds another dependency on the rewrite system. Peter responded that PGQ is relational by design (similar to view expansion), and both the standard and other implementations follow that model. Tomas Vondra asked whether graph structure should be preserved longer to enable graph-specific indexes or executor nodes; Ashutosh Bapat noted that many optimizations improve the underlying joins anyway, and that staying aligned with view expansion also matters for security.

Florents Tselai: documentation

Florents Tselai suggested reordering the documentation: first answer "my data is modeled as a graph G=(V, E) — can Postgres help?" before diving into implementation details, using runnable examples from graph_table.sql. He contrasted the approach with Apache Age's jsonpath-style syntax, but agreed the standard's relational mapping is the right fit for PostgreSQL core.

Technical Details

Architecture

  • Parser: syntax for CREATE PROPERTY GRAPH, ALTER PROPERTY GRAPH, DROP PROPERTY GRAPH, and GRAPH_TABLE(... MATCH ... COLUMNS ...).
  • Catalogs: pg_propgraph_element, pg_propgraph_element_label, pg_propgraph_label, pg_propgraph_label_property, pg_propgraph_property.
  • Rewriter: rewriteGraphTable.c converts graph patterns into joins and unions.
  • Tooling: pg_dump, psql \d/\dG, tab completion; pg_get_propgraphdef() for introspection.
  • ECPG: full support in embedded SQL.

Access control

  1. Users need SELECT privilege on the property graph.
  2. Property graphs are security invoker: the current user must hold privileges on the underlying tables. Queries succeed only for elements the user can access.
  3. Inside a security-definer view, property graph access uses the view owner's privileges; access to the underlying relations uses the executing user's privileges.
  4. Security-definer property graphs are not implemented (the standard does not mention them).

Property collations and edge-vertex links

  • Properties with the same name must have the same collation across elements.
  • With explicitly specified keys, the collations of edge keys and vertex keys must match (links derived from foreign keys rely on the constraint itself).
  • Edge-vertex links use equality operators; the edge depends on those operators, so they cannot be dropped independently.

Current Status

The v20260113 patch is a consolidated WIP. It includes:

  • Full CREATE PROPERTY GRAPH / ALTER / DROP with labels, properties, KEY clauses, SOURCE/DESTINATION REFERENCES
  • GRAPH_TABLE with path patterns, in-pattern WHERE, cyclic paths, multiple patterns
  • The LABELS() and PROPERTY_NAMES() graph element functions
  • Access permissions, RLS, privilege tests
  • ECPG support and pg_overexplain integration
  • Documentation and regression tests

The patch has not been committed. Peter and Ashutosh continue to refine it; the rewriter-based design remains the current choice, and review and testing are ongoing.

Summary

SQL/PGQ will bring standardized graph query syntax to PostgreSQL. The implementation has evolved from a fragile prototype into a fairly complete patch with cyclic paths, graph element functions, multi-pattern matching, access control, RLS, ECPG support, and substantial tests. The key architectural choice — rewriting graph queries into relational form in the rewrite system — aligns with the standard and with view semantics. If committed, PostgreSQL users will be able to express path patterns over relational data without leaving SQL or running a separate graph database.

References

Reducing LEFT JOIN to ANTI JOIN: An Optimizer Improvement for "WHERE col IS NULL"

Introduction

In late December 2025, Nicolas Adenis-Lamarre proposed an optimizer improvement on the pgsql-hackers mailing list: when a query uses LEFT JOIN ... WHERE right_table.column IS NULL and the column is semantically non-null (for example NOT NULL or a primary key), automatically recognize it as an anti-join. Such queries mean "rows on the left with no match on the right", which is exactly what an anti-join expresses. Once the pattern is recognized, the optimizer can choose an explicit anti-join plan (such as a Hash Anti Join) instead of an ordinary left join plus filter, usually executing more efficiently.

The discussion drew in Tom Lane, David Rowley, Tender Wang, Richard Guo, and others. The patch went through several versions, was submitted to the CommitFest, and code review exposed correctness issues involving nested outer joins and inheritance. This article summarizes the idea, the implementation, and the current status.

Why This Matters

Many developers write "find rows in A with no match in B" like this:

SELECT a.*
FROM a
LEFT JOIN b ON a.id = b.a_id
WHERE b.some_not_null_col IS NULL;

The LEFT JOIN makes unmatched rows from a carry NULL in every column of b. When b.some_not_null_col is NOT NULL on table b, filtering with WHERE b.some_not_null_col IS NULL leaves exactly those unmatched rows. Semantically this is an anti-join: "in A, with no matching row in B".

If the optimizer does not recognize the pattern, it may execute an ordinary left join plus filter; if it does, it can use an explicit Hash Anti Join and the like, executing more efficiently and enabling better join order choices. The optimization is non-essential in the sense that expert users can rewrite the query as NOT EXISTS or NOT IN (minding NULL semantics), but automatic recognition benefits all users while keeping the original SQL readable.
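The equivalence being exploited can be checked on a toy dataset (a sketch with invented helper names, not the planner's code): filtering a LEFT JOIN for rows whose NOT NULL right-hand column came out NULL yields exactly the anti-join result, because only unmatched rows can produce that NULL:

```python
def left_join_where_is_null(a_rows, b_rows):
    """LEFT JOIN b ON a.id = b.a_id WHERE b.col IS NULL, with b.col NOT NULL:
    the filter keeps exactly the rows that found no join partner."""
    b_keys = {b["a_id"] for b in b_rows}
    out = []
    for a in a_rows:
        if a["id"] not in b_keys:   # unmatched → all b columns NULL in the result
            out.append(a)
    return out

def anti_join(a_rows, b_rows):
    """ANTI JOIN: keep a rows with no matching b row."""
    b_keys = {b["a_id"] for b in b_rows}
    return [a for a in a_rows if a["id"] not in b_keys]

a = [{"id": 1}, {"id": 2}, {"id": 3}]
b = [{"a_id": 1}]
```

If b.col could itself be NULL on a matched row, the two would diverge — which is precisely why the patch must prove non-nullability before reducing the join.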

Technical Background

PostgreSQL already reduces outer joins in some cases, for example:

  • Commits 904f6a593 and e2debb643 introduced infrastructure the optimizer can use for such reductions.
  • In reduce_outer_joins_pass2, the optimizer already tries to reduce JOIN_LEFT to JOIN_ANTI when the join's own quals are strict for variables forced NULL by higher-level quals.

A comment there notes other ways to recognize anti-joins — for example, checking whether variables from the right-hand side are necessarily non-null due to table constraints (NOT NULL and the like). Nicolas's proposal, as implemented by Tender's patch, does exactly that: use NOT NULL information to reduce LEFT JOIN to ANTI JOIN when WHERE rhs_col IS NULL implies "no match".

Patch evolution

Nicolas's initial patch

Nicolas posted a draft patch that:

  • Recognizes the pattern "left join b where x is null" when x is a non-nullable variable from the right-hand side (RTE).
  • Was intentionally a "quick implementation" to test feasibility.

He also listed other ideas (removing redundant DISTINCT/GROUP BY, merging duplicated ORDER BY, anti-joins for NOT IN, and a way to "view the rewritten query"), which were briefly discussed but are not the focus here.

Tom Lane and David Rowley

Tom Lane noted:

  • The optimization is reasonable and should use the new infrastructure (904f6a593, e2debb643).
  • The draft must not leave surrounding comments outdated; keeping comments accurate is required.

David Rowley suggested:

  • Use find_relation_notnullatts() and compare against forced_null_vars, minding FirstLowInvalidHeapAttributeNumber.
  • Search the mailing list archives for the UniqueKeys history (relevant to redundant DISTINCT elimination).
  • Be cautious about "eliminating double ORDER" and "NOT IN anti-joins"; both have been discussed before and the edge cases are tricky.
  • "Viewing the rewritten query" is ill-defined; many optimizations can no longer be expressed as a single SQL statement.

Tender Wang's implementation (v2–v4)

Tender Wang's patch:

  • Implements the optimization on top of the 904f6a593 and e2debb643 infrastructure.
  • Updates the comment in reduce_outer_joins_pass2 describing the new case of recognizing anti-joins via right-hand-side NOT NULL constraints.
  • Adds regression tests.

Nicolas then:

  • Confirmed Tender's patch is correct (after retesting).
  • Suggested an early exit: run the new logic only when forced_null_vars != NIL, avoiding calls to find_nonnullable_vars and have_var_is_notnull on the majority of left joins that have no forced-NULL variables.
  • Contributed additional regression tests using new tables (with NOT NULL constraints) rather than modifying existing test tables such as tenk1.

Tom Lane made it clear that shared test objects (such as the tables in test_setup.sql) must not be modified, as doing so could change planning behavior and affect other tests. New tests should use new tables or existing ones with suitable properties.

Tender merged Nicolas's early exit and regression tests into a single v4 patch and submitted it to the CommitFest.

Richard Guo's review: correctness issues

Richard Guo's review of v4 found two correctness problems.

1. Nested outer joins

When the right-hand side of the left join itself contains an outer join, a column can still be NULL in the join result even though it is NOT NULL on its base table. Reducing the outer join to an anti-join is then wrong.

For example (tables t1, t2, t3, with columns like (a NOT NULL, b, c)):

EXPLAIN (COSTS OFF)
SELECT * FROM t1
LEFT JOIN (t2 LEFT JOIN t3 ON t2.c = t3.c) ON t1.b = t2.b
WHERE t3.a IS NULL;

Here t3.a is NOT NULL on t3, but because of the inner t2 LEFT JOIN t3, a row from t1 can still see t3.a as NULL after joining with the subquery (when there is no match in t3). The upper join must therefore remain a left join; wrongly converting it to an anti-join would discard rows incorrectly.

The patch did not consider whether the variable can be nulled by a lower outer join when testing for "non-nullable columns". Richard noted that forced_null_vars does not currently record varnullingrels; a simple fix would be to apply the optimization only when the right-hand side contains no outer join (right_state->contains_outer is false), but that would be overly conservative.

His proposed direction: record, in reduce_outer_joins_pass1_state, the base relids that can be nulled under each subtree, and skip variables from those rels when checking NOT NULL constraints. He attached a v5 patch illustrating the idea.

2. Inheritance

For an inheritance parent, some child tables may have NOT NULL on a column while others do not. The patch did not account for this; compared with nested outer joins, it is relatively easy to fix.

Other discussion

  • Pavel Stehule asked posters to avoid top-posting on the list; the PostgreSQL wiki documents mailing list style.
  • Constants in subqueries: Nicolas noted cases like SELECT * FROM a LEFT JOIN (SELECT 1 AS const1 FROM b) x WHERE x.const1 IS NULL are not handled, and he did not consider them worth special-casing.

Current Status

  • The v4 patch (with the early exit and regression tests) has been submitted to the CommitFest (patch 6375).
  • Richard Guo's v5 addresses the nested outer join and inheritance problems by recording nullable base rels and tightening the conditions for using NOT NULL.
  • As of this discussion, the work is ongoing; whether and in what form it is committed will be decided on the mailing list and in the CommitFest.

Summary

Automatically reducing LEFT JOIN ... WHERE rhs_not_null_col IS NULL to an anti-join, when the right-hand column is provably non-null, is a useful optimizer improvement that raises performance without requiring SQL rewrites. The patch progressed from a draft to an implementation on existing infrastructure, gaining regression tests and an early exit. Review feedback identified the key correctness constraints: the right-hand side may contain nested outer joins or inheritance, so the NOT NULL-based reduction is valid only when the variable cannot be nulled by lower joins or inheritance. Follow-up work centers on Richard's approach (recording nullable base rels and restricting the NOT NULL check) and safe handling of inheritance.

References

Week 06 (2026)

PostgreSQL mailing list discussions for Week 06, 2026.

Articles

JSON Format for COPY TO: Native JSON Export from PostgreSQL

Introduction

In November 2023, Davin Shearer asked on pgsql-general how to export JSON from PostgreSQL to a file with COPY TO. When he used COPY TO on a query producing a single JSON column (for example json_agg(row_to_json(t))), the text format escaped the content again: double quotes inside the JSON gained an extra level of escaping, the output was no longer valid JSON, and tools such as jq could not parse it. The community consensus: the right fix is a native JSON format for COPY TO — so a single JSON column (or a whole row rendered as one JSON object) is written as valid JSON, without an extra layer of text/CSV escaping.
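The double-escaping problem is easy to reproduce in miniature (a simplified model, not PostgreSQL's actual COPY code; copy_text_escape imitates only the backslash/tab/newline escapes of the text format):

```python
import json

def copy_text_escape(field):
    """Simplified model of COPY's text-format escaping: backslashes and
    control characters in the field get their own backslash escapes."""
    return (field.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))

value = json.dumps({"k": 'say "hi"'})     # {"k": "say \"hi\""} — valid JSON
escaped = copy_text_escape(value)         # backslashes doubled: no longer JSON
```

The original string round-trips through json.loads; the COPY-escaped one does not, which is exactly what Davin observed with jq on the exported file.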

Building on that, and on Joe Conway's earlier COPY/JSON discussion, a patch series was submitted. The final design adds a FORMAT json option to COPY TO, plus an optional FORCE_ARRAY that wraps the entire output in one JSON array. The thread spans many versions (v8 through v23), with Tom Lane, Joe Conway, Alvaro Herrera, Joel Jacobson, Jian He, Junwang Zhao, and others participating. This article summarizes the discussion, the implementation, and the current status.

Why This Matters

  • Correct JSON export: today, exporting query results as JSON from within the database generally means COPY TO in text or csv format. The text format treats the result as a plain string and re-escapes quotes and backslashes, breaking the JSON. A dedicated JSON format emits one JSON object per row (or one JSON value per column) with correct escaping, so the output is valid JSON as-is.
  • Integration with pipelines: many pipelines expect JSON (one JSON object per line, or a single JSON array). Native COPY TO ... (FORMAT json) with FORCE_ARRAY performs the export inside the database, without client-side reformatting or workarounds like psql -t -A or the large object API.
  • Consistency with existing formats: COPY already has text, csv, and binary; adding json keeps the same model of "pick a format, get correct encoding".

Technical Analysis

Design decisions

The patches settle on these rules:

  1. The JSON format is COPY TO only. JSON for COPY FROM is not supported (parsing arbitrary JSON is a much bigger feature). The grammar and option validation reject FORMAT json on COPY FROM.
  2. No HEADER with JSON. Documentation and code forbid HEADER in JSON format, avoiding a stray header line before the JSON rows/array.
  3. One column at the protocol level. In JSON mode, the frontend/backend Copy protocol advertises a single column (non-binary); each row is rendered as one JSON value (for example, one object per row).
  4. FORCE_ARRAY is valid only with JSON. FORCE_ARRAY wraps the whole COPY output in [ ... ] with commas between rows, producing a single JSON array; it is legal only together with FORMAT json.

Patch structure

  • Patch 1 (since v13) — CopyFormat refactoring. Joel Jacobson introduced a CopyFormat enum (COPY_FORMAT_TEXT, COPY_FORMAT_CSV, COPY_FORMAT_BINARY), replacing the csv_mode and binary booleans in CopyFormatOptions with a single format field, making it easier to add new formats (such as JSON). jian he later reworked it per review; Junwang Zhao adapted it to the executor's new CopyToRoutine.

  • Patch 2 — JSON format for COPY TO

    • Grammar (gram.y): adds JSON as a format option and allows FORMAT json among COPY options.
    • Options (copy.c, copy.h): since v13 the format is a CopyFormat value, with JSON as COPY_FORMAT_JSON; validation is unchanged: no HEADER/default/null/delimiter with JSON, and no JSON with COPY FROM.
    • Copy protocol (copyto.c): in SendCopyBegin, JSON mode advertises a single column with format 0 (text), instead of per-column formats.
    • Row output (copyto.c): in CopyOneRowTo, when json_mode is set, the whole row is converted to a JSON string via composite_to_json() (from utils/adt/json.c) and sent. For query-based COPY (no relation), the patch ensures the slot's tuple descriptor matches the query, so composite_to_json generates key names from the right attribute metadata.
    • json.c: composite_to_json() changes from static to exported, declared in utils/json.h for COPY to call.
  • Patch 3 — FORCE_ARRAY for COPY TO

    • Options (copy.c, copy.h): adds force_array, parsing force_array / force_array true|false. Validation: FORCE_ARRAY is only allowed in JSON mode (with ERRCODE_INVALID_PARAMETER_VALUE since v12).
    • Output (copyto.c): before the first row, in JSON mode with force_array, send [ and a newline; between rows, send , before each JSON object (tracked by json_row_delim_needed); after the last row, send ] and a newline. The default (without FORCE_ARRAY) remains one JSON object per row.

Version history: v8 to v23

  • v8 took a larger approach: extracting the COPY TO/FROM format implementations behind a pluggable mechanism (including a contrib module pg_copy_json). Review favored smaller, direct changes in core.
  • v9–v10 narrowed the scope to just a JSON format for COPY TO (no pluggable API). v10 introduced json_mode and adopted composite_to_json.
  • v11 added the FORCE_ARRAY option with tests, corrected the "COPY FROM with json" error code, and tightened option validation.
  • v12 (August 2024): reposted only patch 2 (FORCE_ARRAY); the error code for FORCE_ARRAY outside JSON mode became ERRCODE_INVALID_PARAMETER_VALUE.
  • v13 (October 2024): Joel Jacobson contributed 0001 — the CopyFormat enum, replacing csv_mode and binary with a single format field in CopyFormatOptions. 0002 (json format) and 0003 (force_array) were rebased on top; the docs state JSON cannot be combined with header, default, null, or delimiter.
  • v14–v22: mostly rebases and adaptation to upstream changes. v14 dropped the separate CopyFormat patch in some postings (different rebase baselines). Junwang Zhao (v15, March 2025) adapted to the new CopyToRoutine structure (commit 2e4127b6d2). v16–v22 continued rebasing and responding to review; the core design was unchanged.
  • v23 (January 2026): the current series. Three patches: (1) the CopyFormat refactoring (originally Joel Jacobson, reworked by jian he), (2) json format for COPY TO (Author: Joe Conway; Reviewed-by includes Andrey M. Borodin, Dean Rasheed, Daniel Verite, Andrew Dunstan, Davin Shearer, Masahiko Sawada, Alvaro Herrera, and others), (3) FORCE_ARRAY. The feature set is unchanged; the patches are rebased and well reviewed.

Code highlights

Row-to-JSON conversion (patch 2). Each row becomes one JSON object via the existing composite_to_json():

rowdata = ExecFetchSlotHeapTupleDatum(slot);
result = makeStringInfo();
composite_to_json(rowdata, result, false);
CopySendData(cstate, result->data, result->len);

FORCE_ARRAY framing (patch 3). Send [ before the row loop; from the second row on, send , before each object; send ] after the loop:

if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, '[');
    CopySendEndOfRow(cstate);
}
// ... row loop: no comma before the first row; afterwards CopySendChar(cstate, ','); then the object ...
if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, ']');
    CopySendEndOfRow(cstate);
}

Usage examples (from the regression tests)

COPY copytest TO STDOUT (FORMAT json);
-- One JSON object per row.

COPY copytest TO STDOUT (FORMAT json, force_array true);
-- A single JSON array: [ {"col":1,...}, {"col":2,...} ]

Community Discussion Highlights

The original problem and workarounds

Davin's initial problem was COPY text-format escaping applied on top of JSON. David G. Johnston and Adrian Klaver suggested writing query results to a file with psql instead of COPY. Dominique Devienne and David G. Johnston both argued that a "no formatting" or raw output option for COPY would let a single-column export (such as JSON) be written as-is. Tom Lane agreed such an option could meet the need. That consensus led to "a dedicated JSON format for COPY TO" rather than patching around text/CSV.

Review feedback and adjustments

  • Tom Lane and others stressed keeping the change small: just JSON output for COPY TO, no big refactoring. That pushed the work from v8's pluggable-format approach down to the current in-core JSON path.
  • Joe Conway had discussed COPY and JSON in another thread; jian he's patches reference that discussion and align with the idea of first-class JSON output for COPY TO.
  • Alvaro Herrera and others also commented; the narrowing from v8 to v9/v10 reflects the preference for a smaller, focused patch set.

Edge cases in the patches

  • Query-based COPY: when the source is a query (no table), the slot's tuple descriptor may not match the query's. The patch syncs the query's attribute metadata into the slot's tuple descriptor so composite_to_json generates correct key names.
  • Protocol: in JSON mode, the Copy protocol advertises one column; the backend still produces one JSON "value" per row (an object, or one array element under FORCE_ARRAY).

Technical Details

Implementation notes

  • Escaping: JSON string escaping is handled by composite_to_json() and the existing escape_json machinery in json.c, so quotes, backslashes, and control characters in column values are encoded correctly.
  • HEADER: explicitly incompatible with JSON, guaranteeing the output is either pure JSON rows or one JSON array.
  • FORCE_ARRAY output: the regression tests show that with force_array, the output is [, newline, the first object; each subsequent row is , then an object; finally ] — a valid JSON array (with whitespace such as newlines between elements).

Current limitations

  • COPY FROM: no JSON import; only COPY TO is extended.
  • HEADER: unsupported in JSON mode.
  • Binary: the JSON format is text only; this patch does not cover binary JSON.

Current Status

  • The thread runs from 2023 into early 2026. The latest series is v23 (January 2026): three patches (CopyFormat refactoring, json format for COPY TO, FORCE_ARRAY).
  • As of this thread snapshot, the patches are not committed and remain under discussion; v23 represents the current design and review state.
  • Design: COPY TO ... (FORMAT json), and optionally (FORMAT json, force_array true) for a single JSON array; JSON is incompatible with HEADER, DEFAULT, NULL, DELIMITER, and COPY FROM.

Summary

The "Emitting JSON to file using COPY TO" thread began with a user hitting double escaping when exporting JSON with COPY. The community agreed COPY TO should get a native JSON format. jian he (and later Junwang Zhao) implemented FORMAT json (COPY TO only; incompatible with HEADER, DEFAULT, NULL, DELIMITER) and FORCE_ARRAY, reusing composite_to_json() and the existing JSON escaping. Joel Jacobson's CopyFormat refactoring (since v13) replaces the format booleans with an enum, laying the groundwork for JSON and future formats.

References

Week 05 (2026)

PostgreSQL mailing list discussions for Week 05, 2026.

Articles

GOO: A Greedy Join Order Search Algorithm for Large Join Problems

Introduction

PostgreSQL chooses its join order strategy based on query complexity: with fewer relations than geqo_threshold (default 12), it uses dynamic programming (DP) to find the optimal join order; for larger join graphs it switches to GEQO (the genetic query optimizer), which runs a genetic-algorithm heuristic search over the join order space. GEQO has known problems: at moderate join counts its planning time can be slower than DP's, and it lacks tuning knobs such as a reproducible seed.

In December 2025, Chengpeng Yan proposed GOO (Greedy Operator Ordering) on the pgsql-hackers mailing list — a deterministic greedy join order search intended as a GEQO replacement for large joins. The algorithm is based on Leonidas Fegaras's 1998 DEXA paper "A New Heuristic for Optimizing Large Queries".

Why It Is Worth Watching

  • Planning time: on star/snowflake and TPC-DS-like workloads, GOO plans noticeably faster than GEQO (in one test, EXPLAIN over the 99 TPC-DS queries took about 5s with GOO versus about 20s with GEQO), while DP remains fastest below the threshold.
  • Plan quality: the goal is "good enough" in the cases where GEQO is used today — fewer catastrophic regressions and more predictable behavior.
  • Memory: GOO's iterative, merge-as-you-go structure should in principle use less memory than full DP, and differs from GEQO's profile; the author plans to add measurements.

Following this discussion helps track where PostgreSQL is heading for many-table joins, and how GOO is positioned relative to DP and GEQO.

Technical Analysis

How GOO works

GOO builds the join order incrementally:

  1. Initially each base relation is its own "clump" (a concept from geqo).
  2. At each step, evaluate all legal join pairs (clump pairs satisfying the query's join constraints).
  3. For each pair, build the join relation and obtain its total cost from the planner's existing cost model.
  4. Merge the pair with the lowest estimated total cost into one clump, and repeat.
  5. Stop when a single clump remains; its cheapest path is the final plan.

The "greedy" part: at each step, across all current clumps, perform only the join that currently looks cheapest. The original paper orders by estimated result size; the patch uses the planner's total_cost for consistency with the rest of the planner.

Complexity: O(n³) time, where n is the number of base relations — (n−1) merges, each examining about O(k²) clump pairs.
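The merge loop above can be sketched compactly (a toy model, not goo.c; goo, pair_cost, and the cardinality-product cost are invented for the example — the real patch asks the planner's cost model):

```python
from itertools import combinations

def goo(base_rels, pair_cost):
    """Greedy Operator Ordering sketch: repeatedly merge the pair of clumps
    whose join looks cheapest, until one clump (the final join tree) remains.
    pair_cost stands in for the planner's total_cost. Returns the merge order."""
    clumps = [frozenset([r]) for r in base_rels]
    merges = []
    while len(clumps) > 1:
        left, right = min(combinations(clumps, 2),
                          key=lambda p: pair_cost(p[0], p[1]))
        clumps = [c for c in clumps if c not in (left, right)]
        clumps.append(left | right)
        merges.append((left, right))
    return merges

cards = {"a": 1, "b": 10, "c": 100}

def cost(c1, c2):
    """Toy cost: product of base-table cardinalities in the joined clump."""
    prod = 1
    for rel in c1 | c2:
        prod *= cards[rel]
    return prod

merges = goo(["a", "b", "c"], cost)
```

With these cardinalities the two small relations are joined first, then the result is joined to the large one — the (n−1)-merge, O(k²)-pairs-per-step shape described above.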

Planner integration

The patch introduces:

  • enable_goo_join_search: a GUC to turn GOO on (off by default).
  • Threshold: currently reuses geqo_threshold; when the number of join levels ≥ geqo_threshold and GOO is enabled, the planner calls goo_join_search() instead of GEQO. GOO is thus positioned to replace GEQO, not DP.

The relevant logic in allpaths.c:

else if (enable_goo_join_search && levels_needed >= geqo_threshold)
    return goo_join_search(root, levels_needed, initial_rels);
else if (enable_geqo && levels_needed >= geqo_threshold)
    return geqo(root, levels_needed, initial_rels);

Greedy strategies in the experiments

The author compared different metrics as the greedy "cheapest" criterion:

  • cost — the planner's total_cost for the join (the baseline).
  • result_size — the estimated output size in bytes: reltarget->width * rows.
  • rows — the estimated output row count.
  • selectivity — the join selectivity (output rows / (outer rows × inner rows)).
  • combined — run GOO once with cost and once with result_size, then keep whichever final plan has the lower estimated cost.

In summary:

  • cost alone has a severe long tail (for example, up to 431x worse on JOB, with many ≥10x regressions).
  • result_size is better on average but still has a bad tail (up to 67x on JOB).
  • combined (one plan from cost, one from result_size, keep the cheaper) is the most robust: better geometric mean, no ≥10x regressions on JOB, worst case about 8.68x.

The discussion therefore shifted from "which single metric to pick" to "how to reduce tail risk" — for example, running several greedy strategies and keeping the best plan.

Community Discussion Highlights

Benchmark clarifications

Dilip Kumar initially questioned whether pgbench was suitable for testing join ordering. The author clarified: the data came from custom star/snowflake workloads from an earlier thread by Tomas Vondra, not default pgbench; those workloads have multi-table joins over empty tables, so throughput mostly reflects planning time (the DP / GEQO / GOO differences).

Crash and the "eager aggregation" fix (v1 → v2)

Tomas Vondra hit a crash running EXPLAIN on TPC-DS queries, with a stack showing inner_path = NULL in sort_inner_and_outer(). The root cause: on some paths GOO built join relations without calling set_cheapest(), so later code read empty cheapest paths. Tomas narrowed it down to queries with aggregates in the SELECT list (such as TPC-DS Q7).

The v2 fix handles eager aggregation correctly: the planner generates grouped/base relations that need properly set cheapest paths. With the fix, all 99 TPC-DS queries plan successfully.

GEQO slower than DP at moderate join counts

Tomas posted EXPLAIN timings for the 99 TPC-DS queries (3 scales × 0/4 workers):

  • master (DP): 8s
  • master/geqo: 20s
  • master/goo: 5s

So on this workload GEQO is slower than DP, and GOO is fastest. John Naylor and Pavel Stehule noted that GEQO is designed to win only at sufficiently large join sizes; at small or moderate join counts its fixed overhead can dominate. Evaluation should therefore focus on the range where GEQO actually kicks in (relation counts above geqo_threshold).

TPC-DS 执行结果(Tomas Vondra)

Tomas 分享了完整 TPC-DS 运行结果(scale 1 和 10,0 和 4 worker)。99 条查询总耗时概括:

  • Scale 1:GOO 比 master 和 GEQO 都慢(例如约 1124s vs GEQO 399s vs master 816s)。
  • Scale 10:GOO 更快(例如约 1859s vs GEQO 2325s vs master 2439s)。

说明 GOO 在小数据量下表现更差、在较大数据量下更好,行为与负载和规模相关。Tomas 建议分析变慢的查询以改进启发式,并增加更大数据量和冷缓存测试。

TPC-H: failure modes of single-metric greed (v3)

On TPC-H SF=1 the author compared four strategies: rows, selectivity, result_size, and cost. Key findings:

  • Q20: partsupp joins an aggregated lineitem subquery whose row estimate is badly low (tens of rows estimated, hundreds of thousands actual). Output-oriented rules (rows, selectivity, result_size) pick this join very early because it "looks" extremely reducing; in reality it produces a huge intermediate result and inflates the cost of every later join. Estimation errors make output-oriented greedy rules perform very poorly.
  • Q7: cost-based greed picks a locally cheap join that creates a large many-to-many intermediate result, making subsequent joins explode. Locally optimal cost can be globally terrible.

Tomas noted that Q20 is fundamentally an estimation problem (garbage in, garbage out), while Q7 is inherent to greedy algorithms — local optima causing global suboptimality, which swapping in another single metric cannot fundamentally fix.

JOB and the combined strategy (v4)

On the full JOB workload, the combined strategy (run GOO once with cost and once with result_size, keep the cheaper plan) performed as follows:

  • Best geometric mean (0.953 relative to DP).
  • No ≥10× regressions; worst case about 8.68×.
  • A much better tail than GOO(cost) or GOO(result_size) alone.

Running two greedy strategies and picking the better plan reduces catastrophic plans at almost no extra planning cost.

Positioning: GOO as a GEQO replacement

Tomas asked whether the goal is to replace DP or GEQO. The author was explicit: GOO is positioned to replace GEQO, not DP; below geqo_threshold joins, DP should still be used.

Literature and future directions

Tomas mentioned the CIDR 2021 paper "Simplicity Done Right for Join Ordering" (Hertzschuch et al.), which emphasizes robustness (e.g. worst-case/upper-bound join orders) and trusting only base-relation estimates — potentially relevant to nestloop blowups caused by overly optimistic cardinality estimates. The author plans to establish a baseline with the current approach first, then gradually absorb such ideas.

Implementation and Details

Implementation approach

  • New files: src/backend/optimizer/path/goo.c and src/include/optimizer/goo.h.
  • GOO builds join relations by repeatedly calling existing planner interfaces (e.g. make_join_rel, path generation), so it shares the same cost model and path types as DP/GEQO.
  • Multiple memory contexts keep memory usage bounded during candidate evaluation.

Boundaries and robustness

  • Eager aggregation: v2 fixed the sort_inner_and_outer crash on a NULL inner_path by ensuring every join relation GOO produces has its cheapest paths properly set.
  • Cardinality estimation errors: all approaches suffer when estimates are badly off; GOO's sensitivity varies by strategy (output-oriented rules can be worse under row-count errors). The combined strategy aims to shrink the tail, not to fix estimation itself.
  • Structural limits: some graph shapes (e.g. star, fan-out) make both cost and result_size pick similarly bad plans — an inherent limitation of one-step-lookahead greedy enumeration.

Performance notes

  • Planning time: GOO is O(n³) and is usually faster than GEQO in the reported benchmarks; the author plans to add explicit planning-time and memory measurements.
  • Execution time: highly workload-dependent; GOO can be better or worse than GEQO/DP (see TPC-DS scale 1 vs 10, and the GEQO-relevant subset of JOB).

Current Status

  • Patches: v4 is the latest as of the end of the discussion. v4-0001 is the core GOO implementation (identical to v3-0001); v4-0002 adds testing GUCs and an experimental framework for multiple strategies (e.g. combined).
  • Goal: establish GOO as a viable GEQO replacement — at the same threshold, plan quality and planning time no worse than (or better than) GEQO, with fewer extreme regressions.
  • Next steps (per the author): evaluate on more workloads, larger join graphs, cold caches, and larger scales; consider adding selectivity to the combined strategy; measure planning time and memory; explore tunables and graceful degradation (e.g. start with DP and fall back to greedy under resource pressure).

Summary

The GOO thread presents a complete attempt to replace GEQO with a deterministic greedy join-ordering algorithm:

  • It reuses the existing cost model and planner infrastructure.
  • It reduces planning time relative to GEQO on several benchmarks.
  • It improves worst-case plan quality by combining multiple greedy strategies (e.g. cost and result_size) and keeping the better plan.

The limitations are acknowledged: greed is inherently local, and bad estimates or unfavorable join-graph shapes can still produce poor plans; the current focus is on robustness and tail behavior rather than tuning any single metric to perfection. For PostgreSQL users this is a patch worth watching: if accepted, it would offer an alternative to GEQO for complex multi-way joins, with likely gains in predictability and planning performance.

References

Executor Batching: Batch-Oriented Tuple Processing

Introduction

PostgreSQL's executor has long been tuple-at-a-time: each plan node typically requests one tuple from its child, processes it, and returns one result tuple upward. This design is simple and works well for OLTP, but in analytical and bulk workloads the per-tuple overhead — especially repeated function calls and expression evaluation — often dominates. At PGConf.dev 2025 the community discussed whether PostgreSQL could support an additional batching executor that passes batches of tuples between nodes instead of one slot at a time.

Following that discussion and private exchanges with Andres Freund and David Rowley, Amit Langote posted a patch series titled "Batching in executor" on pgsql-hackers in September 2025. The series introduces a batched table access method (table AM) API, adds batching-aware executor interfaces (ExecProcNodeBatch, TupleBatch), and prototypes batch-oriented expression evaluation (including batched quals and aggregate transition functions). The goals are to lower per-tuple overhead, pave the way for future optimizations such as SIMD in aggregates, and lay groundwork for columnar or compressed table AMs that benefit from batched execution.

Why This Matters

  • Executor overhead: in CPU-bound, IO-light workloads (e.g. fully cached tables), much of the time is spent inside the executor. Batching reduces the number of calls into the table AM and the expression interpreter, and can amortize function-call overhead by evaluating many rows at once.
  • Aggregates and analytics: batched transition evaluation (e.g. count(*), sum(), avg()) can pay fmgr costs per batch instead of per row, and opens the door to vectorized or SIMD-friendly paths.
  • Future table AMs: a batched executor makes it easier for columnar or compressed table AMs (e.g. Parquet-style) to hand over data in their native batch format without prematurely materializing heap tuples.
  • OLTP safety: the design leaves the existing row-at-a-time path unchanged; batching is opt-in (e.g. via an executor_batching GUC), so OLTP workloads are unaffected.

Understanding this thread helps track how PostgreSQL might grow a batch-oriented execution path, and what the community is weighing around materialization, ExprContext, and EEOP design.

Technical Analysis

Patch structure

The series splits into two parts:

  1. 0001–0003 — foundations: the batched table AM API, the heapam batching implementation, and the executor batching interfaces wired up to SeqScan.
  2. 0004–0008 — prototypes (WIP/PoC): a batching-aware Agg node, new TupleBatch-related EEOPs, batched qual evaluation, and batched aggregate transitions (both a per-row loop and a "direct" per-batch fmgr call).

Core abstractions

The table AM batching API (0001) adds callbacks that let a table AM return multiple tuples at once rather than one. For heap:

  • HeapBatch holds tuples from a single page; its size is bounded by EXEC_BATCH_ROWS (currently 64) and the no-page-crossing rule.
  • heapgettup_pagemode_batch() fills an array of HeapTupleData from the current page, mirroring the logic of heapgettup_pagemode() but for a batch. Visibility and scan direction are handled the same way.

The generic layer introduces batch types and operations in tableam.h so that other AMs can supply their own batch formats and implementations.

Executor batching path (0002–0003)

  • TupleBatch is the container passed between nodes in batching mode; depending on the path, it can hold an AM-native batch (e.g. heap tuples) or materialized slots.
  • ExecProcNodeBatch() is the counterpart of ExecProcNode(): it returns a TupleBatch* instead of a TupleTableSlot*. PlanState gains an ExecProcNodeBatch function pointer, with the same first-call and instrumentation wrappers as the row-at-a-time path.
  • SeqScan gains:
    • A batch-backed slot path: still returns one slot per call, but fills it from an internal batch, reducing calls into the AM.
    • A batch path: when the parent node supports batching, SeqScan's ExecProcNodeBatch returns a TupleBatch directly (e.g. via ExecSeqScanBatch*).

The first three patches thus provide: (1) a table AM that can produce batches; (2) executor APIs to request and pass batches; and (3) SeqScan as the first node that can both consume and produce them.
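The batch-pull contract of these three patches can be illustrated with a toy model — the function names here are illustrative stand-ins, not the actual executor API:

```python
BATCH_ROWS = 64  # stands in for EXEC_BATCH_ROWS

def seq_scan_batches(table):
    """Yield rows in batches of at most BATCH_ROWS — a stand-in for
    SeqScan's batch path returning one TupleBatch per call."""
    for start in range(0, len(table), BATCH_ROWS):
        yield table[start:start + BATCH_ROWS]

def count_star(batches):
    """A parent node consuming whole batches: one call into the child
    per batch rather than one per tuple."""
    total = calls = 0
    for batch in batches:
        calls += 1
        total += len(batch)
    return total, calls

rows = list(range(150))
total, calls = count_star(seq_scan_batches(rows))
# 150 rows arrive in 3 batches (64 + 64 + 22) instead of 150 per-tuple calls
```

In the real patches a heap batch is additionally bounded by the current page, so batches are often smaller than 64; the point is only that the call count between nodes drops from per-row to per-batch.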

Batch-oriented expression evaluation (0004–0008)

The later patches attempt expression evaluation over a batch of rows:

  • Batched input to Agg: Agg can pull a TupleBatch from its child via ExecProcNodeBatch() and feed aggregate transition functions in batches.
  • New EEOPs: the expression interpreter gains steps that operate on a TupleBatch — e.g. fetching attributes into batch vectors, evaluating a qual over a batch, and running aggregate transitions either by looping per row inside the interpreter (ROWLOOP) or by calling the transition function once per batch (DIRECT).
  • Batched qual evaluation: a batch of tuples can be filtered in a single pass (ExecQualBatch and related EEOPs), cutting per-row interpreter and fmgr overhead.

Two prototype paths for batched aggregation are provided: iterating per row inside the interpreter (per-row transition), and calling the transition function once per batch (per-batch fmgr). In Amit's benchmarks, the latter wins by more when executor costs dominate.
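The difference between the two prototype paths is essentially where the per-row loop lives. A toy sketch — the counters stand in for fmgr call counts, and all names are illustrative, not the patch's code:

```python
calls = {"per_row": 0, "per_batch": 0}

def sum_transfn(state, row):
    """Per-row transition function: one (simulated) fmgr call per row."""
    calls["per_row"] += 1
    return state + row

def sum_batch_transfn(state, batch):
    """Per-batch transition function: one (simulated) fmgr call per batch,
    looping over the rows internally."""
    calls["per_batch"] += 1
    return state + sum(batch)

batches = [list(range(64)), list(range(36))]

state_rowloop = 0
for b in batches:                  # ROWLOOP: the interpreter loops per row
    for row in b:
        state_rowloop = sum_transfn(state_rowloop, row)

state_direct = 0
for b in batches:                  # DIRECT: one call per batch
    state_direct = sum_batch_transfn(state_direct, b)
# Same result, but 100 per-row calls vs. 2 per-batch calls
```

Both paths compute the same aggregate state; the DIRECT path simply amortizes the call overhead across the batch, which is why it wins when the executor dominates.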

Design choices and open questions

  • Single-page batches: a heap batch is limited to one page, so batches can be smaller than EXEC_BATCH_ROWS (e.g. with few tuples per page, or highly selective quals). The thread notes possible future improvements: batches spanning pages, or scans continuing to fetch until a batch fills.
  • TupleBatch and ExprContext: the patches extend ExprContext with scan_batch, inner_batch, and outer_batch. Per-batch expression evaluation still uses ecxt_per_tuple_memory, which Amit notes "arguably abuses" the per-tuple contract. A cleaner model of batch-scoped memory remains to be defined.
  • Materialization: batch-oriented expression evaluation currently operates on tuples already materialized into slots (or heap tuple arrays). The long-term goal is to evaluate expressions over native batch formats (e.g. columnar or compressed) without forced materialization; that requires more infrastructure (e.g. AM-controlled expression evaluation or batch-aware operators).

Community Perspectives

Tomas Vondra: batch design vs. index prefetching

Tomas compared this patch with the index prefetching work (which he is involved in), which also introduces a "batch" concept between index AMs and the executor. He noted the two designs differ because their goals differ:

  • Index prefetching: a shared batch structure is filled by the index AM and then managed by indexam.c; from that point on the batch is AM-agnostic.
  • Executor batching: each table AM can produce its own batch format (e.g. HeapBatch), wrapped in a generic TupleBatch with AM-specific operations. The executor preserves TAM-specific optimizations and relies on the TAM to operate on batch contents.

Amit agreed: executor batching aims to preserve TAM-specific behavior and avoid premature materialization where possible, while prefetching wants a batch format unified by indexam. Both designs fit their respective goals.

Tomas also asked: (1) when must a TupleBatch be materialized into a generic format (e.g. slots)? and (2) can expressions run directly on "custom" batches (e.g. compressed/columnar)? Amit replied that expression evaluation currently still requires materialization, but the design should not preclude future evaluation over native batch data (e.g. columnar or Parquet-style). Giving table AMs more control over how expressions are evaluated on their batch data is a possible follow-on extension.

Tomas Vondra: TPC-H Q22 segfault and the v3 fix

Tomas reported segfaults when running TPC-H with batching enabled, only on Q22, with the stack always pointing at the same place: numeric_avg_accum received a NULL datum (DatumGetNumeric(X=0)), called from ExecAggPlainTransBatch and agg_retrieve_direct_batch. The problem was therefore in the batched aggregation path: a transition function received a NULL that should never be NULL.

Amit traced the crash to the expression interpreter. Two different EEOPs (for the ROWLOOP and DIRECT batched-aggregation paths) called the same helper function, which re-derived the opcode at runtime (e.g. via ExecExprEvalOp(op)). With some builds (e.g. clang-17 on macOS), the two EEOP branches compiled to identical code, so their dispatch label addresses were the same. The interpreter's reverse lookup by label address could then return the wrong EEOP: the initialization path might think it was executing the ROWLOOP EEOP while the execution path behaved as DIRECT, corrupting state and producing the NULL and the crash.

The v3 fix (patch 0009) splits the shared helper into two functions, one per EEOP, so the helper no longer re-derives the opcode. With that change Amit could no longer reproduce the crash on macOS clang-17, and the same fix resolved Tomas's TPC-H Q22 segfault.

Bruce Momjian: POSETTE and OLTP

Bruce cited two POSETTE 2025 talks for context: one on data warehouse requirements and one titled "Hacking Postgres Executor For Performance". Amit (the second speaker) confirmed that the batching design adds no significant overhead to the OLTP path; the row-at-a-time path remains the default and is unchanged.

Regression with batching disabled

Tomas observed that with batching off (executor_batching=off), the patched tree could be slower than unpatched master — i.e. a regression even when the new code path is not enabled. Amit reproduced it: for example, a single-aggregate SELECT count(*) FROM bar and a multi-aggregate SELECT avg(a), … FROM bar were about 3–18% slower than master with batching off, depending on row count and parallelism. He acknowledged the regression and said he was investigating. Ensuring zero or negligible cost with batching disabled is an important precondition for committing the foundational patches.

Technical Details

Implementation notes

  • Batch size: EXEC_BATCH_ROWS is 64. Heap batches are additionally limited to a single page, so the effective batch size can be smaller (e.g. about 43 rows per page in Amit's 10-million-row test table).
  • Instrumentation: ExecProcNodeBatch uses the same instrumentation hooks as the row-at-a-time path; a batched call's "tuple" count is recorded as the returned TupleBatch's valid row count (b->nvalid), keeping EXPLAIN ANALYZE statistics meaningful.
  • GUC: in v4/v5 the GUC is executor_batch_rows (0 = batching off; e.g. 64 = batch size).

Boundaries and limitations

  • Sparse batches: under highly selective quals, few valid rows may survive filtering within a batch. The thread suggests future support for batches spanning pages, or scans continuing to fill a batch.
  • ExprContext and batch lifetime: using ecxt_per_tuple_memory for per-batch work is acknowledged design debt; a dedicated batch-scoped allocator or context would be cleaner.
  • Parallelism and nested Aggs: the stack of Tomas's crash involved parallel workers (Gather/GatherMerge) and nested aggregation (e.g. an Agg over a subplan). The NULL-datum problem was in the batched transition path used in that scenario; the v3 EEOP-helper split fixes the root cause rather than one specific query.

Benchmark summary (from Amit's v1 email)

All runs used fully vacuumed tables, large shared_buffers, and warmed caches; times in ms, "off" = batching disabled, "on" = batching enabled; a negative %diff means "on" is faster.

  • Single aggregate, no WHERE (e.g. SELECT count(*) FROM bar_N): batched SeqScan only (0001–0003) about 8–22% faster; with batched agg (0001–0007) about 33–49% faster at some scales.
  • Single aggregate with WHERE: batched agg + batched qual (0001–0008) about 31–40% faster.
  • Five aggregates, no WHERE: batched transitions (per-batch fmgr, 0001–0007) about 22–31% faster.
  • Five aggregates with WHERE: batched transitions + batched qual (0001–0008) about 18–32% faster.

So when the executor dominates (minimal IO), batching consistently cuts CPU time, with the biggest wins from fewer per-row fmgr calls and whole-batch qual evaluation.

Evolution: v4 and v5

Later revisions added observability and batched-qual work on top of the foundations:

  • v4 (October 2025): added EXPLAIN (BATCHES) (patch 0003) to show tuple batching statistics, addressing an earlier instrumentation TODO. Amit reported that the batching-off regression (vs. unpatched master) no longer appeared in v4 — likely related to removing redundant fields from HeapScanData and avoiding comparisons across mixed compilers (gcc vs clang). New benchmarks used SELECT * FROM t LIMIT 1 OFFSET n; with batch=64, roughly 22–26% faster without a WHERE clause and roughly 21–48% faster with WHERE a > 0; cases with expensive deforming (e.g. a qual on the last column) gained less. Daniil Davydov reviewed the heap batching code (e.g. the SO_ALLOW_PAGEMODE assertion, heapgettup_pagemode_batch logic and style), and Amit addressed the feedback in v4.

  • v5 (January 2026): 0001–0003 remain the core (batched table AM API, SeqScan + TupleBatch, EXPLAIN BATCHES). 0004 adds ExecQualBatch for batched qual evaluation (WIP); 0005 moves the batched-qual opcodes into a dedicated interpreter so the row-at-a-time path (ExecInterpExpr) is untouched, avoiding extra cost when executor_batch_rows=0. Amit removed the BatchVector intermediate representation (quals now read tts_values directly from the batch's slots). Two open issues remain: (1) at 0% selectivity (no rows pass the qual), the row-at-a-time path is still measurably hotter with the batched-qual patches applied, even with batching off; and (2) quals on later columns (expensive deforming) see almost no benefit from batching. In recent patches the GUC is executor_batch_rows (0 = off).

Current Status

  • The thread is ongoing; the most recent messages are from January 2026. The series remains work in progress.
  • v5 is current. 0001–0003 (table AM batch API, heapam batching, SeqScan + TupleBatch, EXPLAIN BATCHES) are the parts intended for review and commit first.
  • v5's 0004–0005 are experimental (ExecQualBatch, the dedicated batched-qual interpreter).
  • v3 already contained the segfault fix for the TPC-H Q22 / batched-agg crash (splitting the EEOP helper); v4/v5 build on it.
  • TODOs: (1) the row-at-a-time regression when the batched-qual patches (0004–0005) are in the tree but executor_batch_rows=0 (e.g. 0% selectivity); (2) batch-scoped memory and ExprContext; (3) follow-on work such as cross-page batches and evaluation over native/compressed batch formats.

Summary

Amit Langote's "Batching in executor" series introduces a batch-oriented path into the PostgreSQL executor: table AMs can return tuples in batches, the executor requests and passes batches via TupleBatch, and SeqScan is the first node wired into the path. v4 and v5 added EXPLAIN (BATCHES) for observability and prototyped batched qual evaluation with a dedicated interpreter to keep the row-at-a-time path unchanged. Benchmarks show solid gains with batching on (mostly 20–50%); the earlier batching-off regression was resolved in v4, but one issue remains: the cost of the row-at-a-time path when the batched-qual patches are applied and batching is off (e.g. at 0% selectivity).

Reviewers raised important questions: coordination with other batch-like efforts (e.g. index prefetching), materialization and the future design for evaluating expressions over batches, the TPC-H Q22 segfault (fixed in v3), and Daniil's review of the heap batching code (addressed in v4). Current review and commit focus is on the foundational patches (0001–0003) and EXPLAIN BATCHES.

References

Week 04 (2026)

PostgreSQL mailing list discussions for Week 04, 2026.

Articles

PostgreSQL Query Optimizer Improvement: Automatic COUNT(*) Conversion

Introduction

In October 2025, PostgreSQL committer David Rowley proposed a notable query optimizer improvement that automatically converts COUNT(1) and COUNT(not_null_col) expressions into COUNT(*). The optimization addresses a common performance anti-pattern: developers assume COUNT(1) is equivalent to COUNT(*), when COUNT(*) is in fact more efficient. The patch was committed in November 2025 and introduces new infrastructure for simplifying aggregate functions.

Why This Matters

The performance gap between COUNT(*) and COUNT(column) can be significant, especially on large tables. When counting a specific column, PostgreSQL must:

  1. Deform the tuple to extract the column value
  2. Check for NULLs (the check still happens even for NOT NULL columns)
  3. Push the column data through the aggregate function

By contrast, COUNT(*) can count rows without touching individual column values, which performs markedly better. David Rowley's benchmarks showed roughly a 37% improvement from using COUNT(*) instead of COUNT(not_null_col) on a table with one million rows.

Technical Analysis

Infrastructure: SupportRequestSimplifyAggref

The patch introduces new infrastructure named SupportRequestSimplifyAggref, analogous to the existing SupportRequestSimplify used for regular function expressions (FuncExpr). Because aggregates use Aggref nodes, a separate mechanism is needed.

Key components:

  1. New support node type: SupportRequestSimplifyAggref in supportnodes.h
  2. Simplification function: simplify_aggref() in clauses.c, which invokes an aggregate's support function during constant folding
  3. Enhanced nullability checks: expr_is_nonnullable() extended to handle Const nodes, not just Var nodes

Implementation details

The optimization runs during the constant-folding phase of query planning, specifically in eval_const_expressions_mutator(). When an Aggref node is encountered, the planner:

  1. Checks whether the aggregate has a support function registered via pg_proc.prosupport
  2. Invokes the support function with a SupportRequestSimplifyAggref request
  3. Replaces the original Aggref if the support function returns a simplified node

For the COUNT aggregate, the support function (int8_agg_support_simplify()) checks:

  • Whether the argument is non-nullable (using expr_is_nonnullable())
  • That the aggregate has no ORDER BY or DISTINCT clause
  • If both conditions hold, COUNT(ANY) is converted to COUNT(*)

Code example

The core simplification logic in int8.c:

static Node *
int8_agg_support_simplify(SupportRequestSimplifyAggref *req)
{
    Aggref    *aggref = req->aggref;

    /* Only handle COUNT */
    if (aggref->aggfnoid != INT8_AGG_COUNT_OID)
        return NULL;

    /* Must have exactly one argument */
    if (list_length(aggref->args) != 1)
        return NULL;

    /* No ORDER BY or DISTINCT */
    if (aggref->aggorder != NIL || aggref->aggdistinct != NIL)
        return NULL;

    /* Check that the argument is non-nullable */
    if (!expr_is_nonnullable(req->root,
                             (Expr *) linitial(aggref->args),
                             true))
        return NULL;

    /* Convert to COUNT(*) */
    return make_count_star_aggref(aggref);
}

Patch Evolution

The patch went through four iterations, each refining the implementation:

Version 1 (initial proposal)

  • Introduced the basic infrastructure
  • Used a SysCache lookup to fetch the pg_proc tuple

Version 2 (code cleanup)

  • Replaced the SysCache lookup with the get_func_support() function
  • Cleaner and more efficient

Version 3 (removal of experimental code)

  • Removed the #ifdef NOT_USED code for a COUNT(NULL) optimization
  • Cleaned up unused includes
  • Improved comments

Version 4 (final)

  • Rebased after commit b140c8d7a
  • Fixed the assumption that the support function always returns an Aggref
  • Allowed support functions to return other node types (e.g. constants) for more aggressive optimizations
  • This flexibility enables future optimizations such as converting COUNT(NULL) to '0'::bigint

Community Insights

Reviewer feedback

Corey Huinker was positive:

  • +1 for automatic query improvement
  • Noted that since we cannot educate everyone that COUNT(1) is an anti-pattern, making it no longer an anti-pattern is the right move
  • Confirmed the patch applies cleanly and tests pass

Matheus Alcantara also reviewed and tested:

  • Confirmed roughly 30% improvement in his benchmarks
  • Verified that the code placement is consistent with the existing SupportRequestSimplify infrastructure
  • +1 for the idea

Design decisions

Timing of the optimization: the conversion happens during constant folding, early in planning. David considered doing it later (after add_base_clause_to_rel()) to catch cases like:

SELECT count(nullable_col) FROM t WHERE nullable_col IS NOT NULL;

However, it must happen before preprocess_aggref(), which groups aggregates sharing a transition function. The current placement matches SupportRequestSimplify for regular functions.

Support function return type: the infrastructure allows support functions to return nodes other than Aggref. This design decision enables future optimizations such as:

  • Converting COUNT(NULL) to '0'::bigint
  • More aggressive constant folding of aggregates

Performance Considerations

The optimization delivers meaningful gains:

  1. Less tuple deforming: COUNT(*) does not need to extract column values from tuples
  2. Fewer NULL checks: no per-value NULL checks are needed
  3. Better cache utilization: less data movement means better CPU cache usage

For wide tables the gains can be larger still, since COUNT(column) may have to deform many columns to reach the target column.

Edge Cases and Limitations

The optimization applies only when:

  1. The column is provably non-nullable (a NOT NULL constraint, or a constant)
  2. The aggregate has no ORDER BY clause
  3. The aggregate has no DISTINCT clause

Cases not yet optimized:

  • COUNT(nullable_col) where the column may be NULL (even when filtered by WHERE nullable_col IS NOT NULL in the same query)
  • COUNT(col ORDER BY col) — ORDER BY blocks the optimization
  • COUNT(DISTINCT col) — DISTINCT blocks the optimization

The WHERE-clause limitation stems from the timing of the optimization (during constant folding, before relation information is fully available).

Current Status

The patch was committed by David Rowley on November 26, 2025. It is available on the PostgreSQL master branch and is expected to ship in PostgreSQL 19.

Conclusion

This optimization is a meaningful improvement to the PostgreSQL query optimizer, automatically fixing a common performance anti-pattern without application changes. The new SupportRequestSimplifyAggref infrastructure also opens the door to future aggregate optimizations.

For developers and DBAs:

  • No action needed: the optimization happens automatically
  • Performance benefit: existing queries using COUNT(1) or COUNT(not_null_col) automatically get faster
  • Best practice: although the planner now optimizes these cases, COUNT(*) remains the clearest, most idiomatic way to count rows

The change reflects PostgreSQL's commitment to improving query performance automatically, reducing the burden on developers to know every optimization detail while still letting experts write optimal queries when needed.

References

  • Discussion thread
  • Related: SupportRequestSimplify for regular function expressions

Week 03 (2026)

PostgreSQL mailing list discussions for Week 03, 2026.

Articles

Extended Statistics Import/Export in PostgreSQL 19

Introduction

PostgreSQL's extended statistics, introduced in version 10, let the optimizer understand cross-column correlations that simple per-column statistics cannot capture. They include n-distinct coefficients (how many distinct combinations exist), functional dependencies (when one column determines another), and most-common-value (MCV) lists over multi-column combinations.

While PostgreSQL 18 introduced functions for importing and exporting relation and attribute statistics (pg_restore_relation_stats and pg_restore_attribute_stats), extended statistics were left out of that initial implementation. A recent pgsql-hackers thread started by Corey Huinker closes the gap with a comprehensive patch series adding pg_restore_extended_stats(), pg_clear_extended_stats(), and related infrastructure.

This work matters for several reasons:

  • It enables complete statistics preservation across pg_dump/pg_restore and pg_upgrade
  • It allows query-planner experimentation with hypothetical statistics
  • It supports schema-plus-statistics-only dumps for testing query plans without real data

Technical Analysis

The problem with the original format

The original output format of the pg_ndistinct and pg_dependencies types used a JSON structure whose keys themselves contained structured data:

{"1, 2": 2323, "1, 3": 3232, "2, 3": 1500}

While technically valid JSON, this format has several problems:

  1. Keys containing comma-separated attribute numbers require extra parsing
  2. It is hard to manipulate programmatically
  3. No input function existed — these types were effectively output-only

The new JSON format

The patch series introduces a cleaner, more structured JSON format. For pg_ndistinct:

[
  {"attributes": [2, 3], "ndistinct": 4},
  {"attributes": [2, -1], "ndistinct": 4},
  {"attributes": [2, 3, -1], "ndistinct": 4}
]

For pg_dependencies:

[
  {"attributes": [2], "dependency": 3, "degree": 1.000000},
  {"attributes": [2, 3], "dependency": -1, "degree": 0.850000}
]

Key improvements:

  • A canonical JSON array with named keys in every element
  • Clear separation of attributes, values, and metadata
  • Machine-readable without custom parsing logic
  • Negative attribute numbers denote expressions in the statistics object (e.g. -1 is the first expression)
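For illustration, the relationship between the two formats can be shown with a small converter — a sketch assuming the legacy key format quoted above, not code from the patch:

```python
import json

def convert_ndistinct(old_text):
    """Convert the legacy pg_ndistinct text format, whose keys embed
    attribute numbers (e.g. '{"1, 2": 2323, ...}'), into the new
    array-of-objects format."""
    old = json.loads(old_text)
    return [
        {"attributes": [int(a) for a in key.split(",")], "ndistinct": n}
        for key, n in old.items()
    ]
```

The legacy form forces every consumer to split and parse each key; in the new form the attribute list is already a proper JSON array.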

Input function implementation

The new input functions use PostgreSQL's JSON parser infrastructure with custom semantic-action handlers. A simplified view of the pg_ndistinct parsing state machine:

typedef enum
{
    NDIST_EXPECT_START = 0,
    NDIST_EXPECT_ITEM,
    NDIST_EXPECT_KEY,
    NDIST_EXPECT_ATTNUM_LIST,
    NDIST_EXPECT_ATTNUM,
    NDIST_EXPECT_NDISTINCT,
    NDIST_EXPECT_COMPLETE
} ndistinctSemanticState;

The parser validates:

  • Correct JSON structure (an array of objects)
  • Required keys (attributes and ndistinct for n-distinct statistics)
  • Attribute numbers within valid bounds (positive for columns, negative for expressions, bounded by STATS_MAX_DIMENSIONS)
  • No duplicate attributes within a single item

Extended statistics functions

The patch introduces three main SQL functions:

pg_restore_extended_stats() — imports extended statistics from previously exported values:

SELECT pg_restore_extended_stats(
    'public',                    -- relation schema
    'my_table',                  -- relation name
    'public',                    -- statistics schema
    'my_stats',                  -- statistics name
    false,                       -- inherited
    '{"version": ..., "ndistinct": [...], "dependencies": [...], "mcv": [...], "exprs": [...]}'::text
);

pg_clear_extended_stats() — removes extended statistics data from pg_statistic_ext_data:

SELECT pg_clear_extended_stats(
    'public',        -- statistics schema
    'my_stats',      -- statistics name
    false            -- inherited
);

These functions follow the same pattern established for relation/attribute statistics:

  • Return a boolean indicating success
  • Issue WARNINGs (not ERRORs) on problems, to avoid breaking pg_restore scripts
  • Require MAINTAIN privilege on the target relation

Validation and safety

The implementation validates carefully:

  1. Attribute bounds checks: positive attnums must exist in stxkeys; negative attnums must not exceed the number of expressions
  2. Combination completeness: for pg_ndistinct, all N-choose-K combinations based on the longest attribute list must be present
  3. Soft error handling: PostgreSQL's ErrorSaveContext reports errors safely without crashing
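The combination-completeness rule for pg_ndistinct described above can be sketched as follows — an illustrative model of the check, not the patch's C implementation:

```python
from itertools import combinations

def ndistinct_complete(items):
    """Check that a parsed pg_ndistinct-style list contains every
    k-combination (k >= 2) of its longest attribute list.
    `items` is a list like [{"attributes": [2, 3], "ndistinct": 4}, ...]."""
    present = {frozenset(it["attributes"]) for it in items}
    full = max(present, key=len)       # the longest attribute list
    expected = {
        frozenset(c)
        for k in range(2, len(full) + 1)
        for c in combinations(sorted(full), k)
    }
    return present == expected
```

Note that, as discussed below for pg_dependencies, this kind of completeness cannot be enforced for dependencies, which may legitimately omit insignificant combinations.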

An example of the attribute-number validation:

if (attnum == 0 || attnum < (0 - STATS_MAX_DIMENSIONS))
{
    errsave(parse->escontext,
            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
            errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
            errdetail("Invalid \"%s\" element: %d.",
                      PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    return JSON_SEM_ACTION_FAILED;
}

Community Insights

Key discussion points

Timing of the format change: Tomas Vondra originally suggested the more structured JSON format. The community recognized this as the last chance to change the format before available input functions lock in backward-compatibility requirements.

Validation scope: there was substantial discussion about how much validation to perform:

  • Early patches checked statistical consistency extensively (e.g. MCV frequencies summing to 1.0)
  • Reviewers pushed back, preferring minimal validation to avoid rejecting legitimate but unusual imports
  • Final consensus: validate structure and attribute references, but not statistical values

The pg_dependencies special case: unlike pg_ndistinct, which stores all combinations, pg_dependencies may omit statistically insignificant combinations. The input function therefore cannot enforce full combination coverage for dependencies.

Review feedback incorporated

Michael Paquier reviewed extensively and contributed major improvements:

  • Restructured the patch series into cleaner commits
  • Separated the format change from the input-function additions
  • Added comprehensive regression tests reaching over 90% code coverage
  • Fixed compiler warnings on older GCC versions

Tom Lane caught style issues:

  • Error detail messages converted to complete sentences
  • Replaced the SOFT_ERROR_OCCURRED() macro with direct state checks to avoid warnings

Current Status

As of January 2026 the patch series has made substantial progress:

Committed:

  • Output format change for pg_ndistinct (new JSON array format)
  • Output format change for pg_dependencies (new JSON array format)
  • Input functions for both types, with comprehensive validation
  • The pg_clear_extended_stats() function

Under review (v27):

  • The pg_restore_extended_stats() function
  • pg_dump integration for extended statistics export/import

The pg_dump integration supports backward compatibility down to PostgreSQL 10, with version-specific SQL generation handling the format differences.

Technical Details

Internal storage unchanged

Importantly, the internal binary storage format is unchanged. The new input/output functions affect only the text representation. This means:

  • No catalog changes are needed
  • Existing data remains valid
  • Binary COPY operations are unaffected

Expression statistics support

Extended statistics can include expressions (e.g. CREATE STATISTICS s ON (a + b), c FROM t). The implementation handles these via negative attribute numbers:

  • -1 = the first expression
  • -2 = the second expression
  • and so on

The exprs element of the restore format carries per-expression statistics similar to pg_statistic entries, enabling full round-trip preservation.

MCV list handling

MCV (most-common-value) lists for extended statistics are particularly complex, containing:

  • Value combinations across multiple columns
  • Frequency and base-frequency arrays
  • Per-value null bitmaps

The implementation reuses the attribute-statistics import infrastructure, extended for multi-column value arrays.

Conclusion

This patch series is a significant enhancement to PostgreSQL's statistics infrastructure. By enabling import/export of extended statistics, it:

  1. Completes the statistics story, continuing the work begun for relation and attribute statistics in PostgreSQL 18
  2. Enables realistic testing, with production-like statistics on a sanitized schema
  3. Improves upgrade reliability, preserving optimizer information through pg_upgrade

For DBAs and developers:

  • Extended statistics created with CREATE STATISTICS can now survive pg_dump/pg_restore
  • Query-plan testing becomes more practical using --no-data dumps with full statistics
  • The new JSON format is human-readable, aiding debugging and what-if scenario testing

The target release is PostgreSQL 19, with the remaining restore function and pg_dump integration expected to land soon.

References

pg_plan_advice: A New Approach to Query Plan Control in PostgreSQL

Background

PostgreSQL's query optimizer is powerful and usually produces excellent plans. Still, experienced DBAs and developers occasionally need to influence or stabilize its decisions. Robert Haas of EnterpriseDB is developing a significant contrib module, pg_plan_advice, to address this long-standing need.

This article analyzes the pg_plan_advice thread on the pgsql-hackers mailing list, ongoing since October 2025.

What is pg_plan_advice?

pg_plan_advice is a proposed contrib module that introduces an "advice mini-language" dedicated to controlling key planning decisions. The module supports:

  • Generating an advice string from an existing query plan with EXPLAIN (PLAN_ADVICE)
  • Applying an advice string via the pg_plan_advice.advice GUC to reproduce or constrain subsequent planning decisions

The advice language can control:

  • Join order: the order in which tables are joined
  • Join methods: Nested Loop, Merge Join, Hash Join
  • Scan types: sequential scan, index scan (with a specific index)
  • Parallelism: where and how parallel execution happens
  • Partitionwise joins: how partitioned-table joins are handled

Core design philosophy

Robert Haas stresses in the README that the primary use case is not letting users "beat the optimizer," but reproducing execution plans that performed well in the past:

"We don't need to accept the idea that users can plan better than the optimizer. We only need to accept that users are better than the optimizer at telling good plans from bad ones. That is a very low bar. The optimizer never knows what actually happens when its plans are executed, but the user does."

This positions pg_plan_advice as a plan-stability tool rather than a hint system for micromanaging the optimizer.

Technical Architecture

The relation identifier system

One of the most innovative aspects of pg_plan_advice is its relation identifier system, which provides unambiguous references to parts of a query and handles tricky cases:

  • Multiple references to the same table under different aliases
  • Subqueries and CTEs
  • Partitioned tables and their partitions

The identifier syntax uses notation such as t#2 to distinguish the first and second occurrences of table t in a query.
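A toy model of occurrence-based disambiguation in the spirit of t#2 — this is only an illustration of the idea, not the module's actual algorithm:

```python
from collections import Counter

def assign_identifiers(aliases):
    """Append '#N' to the Nth occurrence of a repeated alias, leaving
    first occurrences bare — mimicking identifiers like t#2."""
    seen = Counter()
    out = []
    for alias in aliases:
        seen[alias] += 1
        out.append(alias if seen[alias] == 1 else f"{alias}#{seen[alias]}")
    return out
```

This matches the pattern visible in the example below, where the outer a and b become a#2 and b#2 while c, appearing once, stays bare.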

Usage example

The following, from Jakub Wartak's testing, shows the system's capabilities:

-- Generate advice for a query that uses aliases
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- Output includes:
-- Generated Plan Advice:
--   JOIN_ORDER(a#2 b#2 c)
--   MERGE_JOIN_PLAIN(b#2 c)
--   SEQ_SCAN(c)
--   INDEX_SCAN(a#2 public.t1_pkey)
--   NO_GATHER(c a#2 b#2)

Constraints can then be applied selectively:

-- Force a specific scan type
SET pg_plan_advice.advice = 'SEQ_SCAN(b#2)';

-- Re-run EXPLAIN to see the new plan
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- Output shows:
-- Supplied Plan Advice:
--   SEQ_SCAN(b#2) /* matched */

Patch structure (v10)

The implementation is split into five patches:

  • 0001 — Store range table flattening information (7.8 KB)
  • 0002 — Store information about nodes elided from the final plan (9.8 KB)
  • 0003 — Store Append-node merging information (40.4 KB)
  • 0004 — Allow plugins to control path generation strategy (56.1 KB)
  • 0005 — WIP: add the pg_plan_advice contrib module (399.1 KB)

The first four patches add the necessary optimizer infrastructure; the fifth contains the module itself. This separation means the infrastructure could benefit other extensions in the future.

Community Review and Testing

The thread has drawn active participation from several community members:

Jakub Wartak (EDB)

Ran extensive TPC-H benchmarks and found several bugs:

  • A NULL-pointer crash in debug/ASAN builds
  • A semijoin uniqueness-detection problem when statistics are absent
  • Conflicting join-order advice on complex queries

Jacob Champion (EDB)

Applied fuzzing techniques to find edge cases:

  • Parser crashes on malformed advice strings
  • Problems applying partition-related advice to non-partitioned tables
  • AST tooling bugs found via corpus fuzzing

Other contributors

  • Alastair Turner: appreciated the ability to test alternative plans
  • Hannu Krosing (Google): cited VLDB research showing 20% of real-world queries have 10+ joins
  • Lukas Fittl: interested in possible integration with pg_stat_statements

Issues found and fixed

The collaborative review process found and fixed a number of issues across versions:

  1. Compiler warnings (gcc-13, clang-20) — fixed in early versions
  2. A NULL-pointer crash in pgpa_join_path_setup() when extension state was not allocated
  3. Join-order conflict detection wrongly treating join-method advice as a positive constraint
  4. Incorrect semijoin uniqueness tracking when EXPLAIN was used without PLAN_ADVICE
  5. Partial-match detection problems in nested join-order specifications

Current Status

As of v10 (posted January 15, 2026):

  • The patch is registered in the Commitfest
  • It is still marked WIP (work in progress)
  • Testing continues, particularly against TPC-H queries
  • Robert Haas is seeking substantive code review, especially of patch 0001

What It Means for PostgreSQL Users

If merged, pg_plan_advice would provide:

  1. Plan stability: capture and reproduce known-good query plans
  2. Debugging aid: understand why the optimizer made particular choices
  3. Testing tool: experiment with alternative plan shapes without modifying queries
  4. Production safety net: guard against unexpected plan regressions after statistics change

Comparison with pg_hint_plan

Unlike the popular pg_hint_plan extension, pg_plan_advice focuses on round-trip safety:

  • Plans can be reliably captured and reapplied
  • The relation identifier system handles complex aliasing automatically
  • It is designed to work with any query structure, without manual identifier management

Summary

pg_plan_advice represents a significant advance in PostgreSQL optimizer extensibility. Rather than overriding the optimizer's judgment, it offers a safe mechanism for preserving proven execution strategies. Active community review has already improved the code substantially, and ongoing testing is helping ensure its robustness.

For DBAs managing complex workloads, especially those where queries occasionally suffer plan regressions, the module offers a promising solution that works with the optimizer rather than against it.


Mailing list link: pg_plan_advice - pgsql-hackers

Commitfest entry: CF 6184

Author: Robert Haas (EnterpriseDB)

Reviewers: Jakub Wartak, Jacob Champion, Alastair Turner, Hannu Krosing, John Naylor, and others