2026 Archive

PostgreSQL Weekly posts for 2026.

Weeks

Week 09 (2026)

PostgreSQL mailing list discussions for Week 09, 2026.

🇨🇳 Chinese version

Articles

More Speedups for Tuple Deformation: Precalculating attcacheoff

Introduction

Tuple deformation is the process of extracting individual attribute values from a PostgreSQL heap tuple's raw byte representation into a TupleTableSlot. It happens constantly during query execution—every time a sequential scan, index scan, or join produces a row, the executor must "deform" that tuple to access column values. For workloads that process millions of rows, even small improvements to the deforming hot path can yield significant gains.

David Rowley has been steadily optimizing tuple deformation. In PostgreSQL 18, he landed several patches: CompactAttribute (5983a4cff), faster offset aligning (db448ce5a), and inline deforming loops (58a359e58). Building on that work, he proposed precalculating attcacheoff rather than computing it on every attribute access. The discussion has since evolved through v10 (February 2026), with Andres Freund contributing a NULL-bitmap-to-isnull conversion that pushes Apple M2 speedups to 63% in some scenarios. The patch set remains under active review.

Why This Matters

When the executor needs a column value from a tuple, it must:

  1. Align the current offset according to the attribute's alignment
  2. Fetch the value via fetch_att()
  3. Advance past the attribute to the next one

These steps form a dependency chain: each offset depends on the previous. There is little opportunity for instruction-level parallelism. For fixed-width attributes, PostgreSQL can cache the offset (attcacheoff) to avoid recomputing alignment and length—but that caching was previously done inside the deforming loop. David's idea: do it once when the TupleDesc is finalized, not on every tuple.
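
The three steps above can be sketched in standalone C. The names and structures here are illustrative stand-ins, not PostgreSQL's actual deforming code, but the serial align → fetch → advance dependency chain is the same: each iteration's offset is an input to the next, so the CPU cannot overlap much work across iterations.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative attribute descriptor: fixed-width attributes only. */
typedef struct { int len; int align; } ToyAttr;

/* Round off up to the next multiple of align (a power of two). */
static long toy_att_align(long off, int align)
{
    return (off + align - 1) & ~((long) (align - 1));
}

/*
 * Deform natts attributes from the raw bytes in data into values.
 * Each iteration's offset depends on the previous one.  (Assumes
 * little-endian for the widening copy into int64_t.)
 */
static void toy_deform(const char *data, const ToyAttr *atts, int natts,
                       int64_t *values)
{
    long off = 0;

    for (int i = 0; i < natts; i++)
    {
        off = toy_att_align(off, atts[i].align);   /* 1. align   */
        int64_t v = 0;
        memcpy(&v, data + off, atts[i].len);       /* 2. fetch   */
        values[i] = v;
        off += atts[i].len;                        /* 3. advance */
    }
}
```

With cached offsets, steps 1 and 3 disappear for the leading fixed-width attributes, which is exactly what the patch exploits.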

Technical Approach

TupleDescFinalize()

The core change introduces TupleDescFinalize(), which must be called after a TupleDesc has been created or changed. This function:

  1. Pre-calculates attcacheoff for all fixed-width attributes
  2. Records firstNonCachedOffAttr—the first attribute (by attnum) that is varlena or cstring and thus cannot have a cached offset
  3. Enables a tight loop that deforms all attributes with cached offsets before falling through to attributes that require manual offset calculation

If a tuple has a NULL before the last attribute with a cached offset, the code can only use attcacheoff up to that NULL—but for tuples without early NULLs, the fast path handles many attributes in a tight loop without any per-attribute offset arithmetic.
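
What TupleDescFinalize()-style precalculation amounts to can be sketched with a toy descriptor (field names illustrative; the real code works on CompactAttribute):

```c
#include <assert.h>

/* Illustrative descriptor: attlen > 0 is fixed width, -1 is varlena. */
typedef struct
{
    int  attlen;
    int  attalign;
    long attcacheoff;   /* filled in by the pass below; -1 = not cached */
} ToyAttr;

static long toy_att_align(long off, int align)
{
    return (off + align - 1) & ~((long) (align - 1));
}

/*
 * One-time pass in the spirit of TupleDescFinalize(): cache the offset
 * of every attribute in the leading fixed-width run and return the
 * index of the first attribute whose offset cannot be known in advance
 * (the firstNonCachedOffAttr analog).
 */
static int toy_finalize(ToyAttr *atts, int natts)
{
    long off = 0;

    for (int i = 0; i < natts; i++)
    {
        if (atts[i].attlen <= 0)            /* varlena: length unknown */
        {
            for (int j = i; j < natts; j++)
                atts[j].attcacheoff = -1;   /* nothing after it caches */
            return i;
        }
        off = toy_att_align(off, atts[i].attalign);
        atts[i].attcacheoff = off;
        off += atts[i].attlen;
    }
    return natts;                           /* every offset cached */
}
```

At deform time, the loop over attributes before that index can read data + attcacheoff directly, with no per-attribute alignment or advancement arithmetic.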

Dedicated Deforming Loops

The patch adds a dedicated loop that processes all attributes with a precomputed attcacheoff before entering the loop that handles varlena/cstring attributes. For tuples with HEAP_HASNULL set, the current code calls att_isnull() for every attribute. A further optimization: keep deforming without calling att_isnull() until we reach the first NULL. Test #5 in the benchmark (first col int not null, last col int null) highlights this—it often shows the largest speedup.

Optional OPTIMIZE_BYVAL Loop

An optional variant adds a loop for tuples where all deformed attributes are attbyval == true. In that case, fetch_att() can be inlined without the branch that handles pointer types, reducing branching and yielding a tighter loop. The tradeoff: when the optimization doesn't apply, there is extra overhead to check attnum < firstByRefAttr. Benchmark results vary by hardware and compiler as to whether this helps.

Benchmark Design

To stress tuple deformation, David designed a benchmark that maximizes deforming work relative to other CPU:

SELECT sum(a) FROM t1;

The a column is almost last, so all prior attributes must be deformed before a can be read. Eight test schemas cover combinations of first column (int/text, null/not null) and last column (int null/not null). For each of the 8 tests, he ran with 0, 10, 20, 30, and 40 extra INT NOT NULL columns—40 scenarios per benchmark run. Each scenario used 1 million rows.

Benchmark Results

Results varied by hardware and compiler:

  • AMD Zen 2 (3990x) with GCC: Up to 21% average speedup with OPTIMIZE_BYVAL; some tests exceed 44%; no regressions.
  • AMD Zen 2 with Clang: Some small regressions in the 0-extra-column tests.
  • Apple M2: Tests #1 and #5 improve significantly; others less so; a few slight regressions with certain patches.
  • Intel (Azure): Benchmarks run on shared, low-core instances; results were noisier due to co-located workloads.

Patch Evolution

v1 → v3 (December 2025 – January 2026)

  • v1: Three patches—0001 (precalculate attcacheoff), 0002 (experimental NULL bitmap look-ahead), 0003 (remove dedicated hasnulls loop)
  • v2: Rebase, fix Assert for NULL bitmap in 0003, JIT fix (remove TTS_FLAG_SLOW), more benchmarks
  • v3: Rebase, drop 0002 and 0003 (benchmarks showed little advantage), keep only 0001

v4 (January 2026)

Addressed code review from Chao Li:

  • NULL bitmap mask (tupmacs.h): Clarified comment—when natts & 7 == 0, the mask is zero and the code correctly returns natts
  • Uninitialized TupleDesc: firstNonCachedOffAttr == 0 means no cached attributes; -1 means uninitialized. Added Asserts with hints to call TupleDescFinalize() if they fail
  • Typo: "possibily" → "possibly"
  • LLVM: Fixed compiler warning

v5–v8 (January–February 2026): Andres Freund's NULL Bitmap Optimization

Andres Freund joined the discussion and proposed a key improvement: instead of calling att_isnull() for each column, compute the isnull[] array directly from the NULL bitmap using a SWAR (SIMD Within A Register) technique. The idea: multiply the bitmap bits, four at a time, by a carefully chosen constant (0x204081) so each bit spreads into a separate byte, then mask. This avoids a 2KB lookup table and works well on most hardware.
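
The bit-spreading step is easy to demonstrate in standalone C. This sketch is not the patch's code (which bulk-writes eight bytes at a time, with a pg_bswap64() on big-endian); it handles one bitmap byte, one nibble at a time, and stores the result bytes with shifts so it is endian-independent:

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 4 bits of nibble into 4 separate bytes, each 0 or 1.
 * 0x204081 has bits at positions 0, 7, 14, and 21, so bit i of the
 * nibble lands (carry-free) on bit 8*i after the mask. */
static uint32_t spread4(uint32_t nibble)
{
    return (nibble * 0x204081u) & 0x01010101u;
}

/*
 * Convert one byte of a heap tuple NULL bitmap (set bit = value
 * present) into 8 isnull bytes.  The lower and upper nibbles are
 * handled separately, as in the patch.
 */
static void bitmap_byte_to_isnull(uint8_t bits, uint8_t isnull[8])
{
    uint8_t  nulls = (uint8_t) ~bits;       /* set bit = NULL */
    uint32_t lo = spread4(nulls & 0x0F);
    uint32_t hi = spread4(nulls >> 4);

    for (int i = 0; i < 4; i++)
    {
        isnull[i]     = (uint8_t) (lo >> (8 * i));
        isnull[i + 4] = (uint8_t) (hi >> (8 * i));
    }
}
```

The multiply is exact because the four shifted copies of the nibble never overlap before the mask, so no carries can corrupt neighboring result bytes.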

David implemented this in patch 0004 ("Various experimental changes"). Additional changes in 0004:

  • populate_isnull_array(): Converts the NULL bitmap to tts_isnull in bulk using the multiplication trick
  • tts_isnull sizing: Rounded up to a multiple of 8 so the loop can write 8 bytes at a time (avoids memset inlining issues)
  • t_hoff: For !hasnulls tuples, use MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) instead of t_hoff
  • fetch_att_noerr(): New variant without elog for the common attlen == 8 case

John Naylor noted that __builtin_ctz(~bits[bytenum]) is undefined when the byte is 255; David fixed this with a cast: pg_rightmost_one_pos32(~((uint32) bits[bytenum])).
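
A standalone illustration of that fix, using the GCC/Clang builtin directly (PostgreSQL wraps it as pg_rightmost_one_pos32):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index of the first NULL attribute recorded in one NULL-bitmap byte
 * (set bit = present, clear bit = NULL); returns 8 if the byte has
 * none.  Widening to uint32 before the bitwise NOT guarantees the
 * argument to __builtin_ctz is nonzero even for a 0xFF byte:
 * ~(uint32) 0xFF is 0xFFFFFF00, whose rightmost one bit is bit 8.
 */
static int first_null_in_byte(uint8_t bits)
{
    return __builtin_ctz(~((uint32_t) bits));
}
```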

Results with 0004: Apple M2 averaged 53% faster than master (or ~63% excluding 0-extra-column tests). Andres suggested pg_nounroll and pg_novector pragmas to prevent GCC from over-vectorizing populate_isnull_array(), which was generating poor code.

v9 (February 24, 2026)

  • Resequenced patches: deform_bench moved to 0001 for easier master benchmarking
  • 0004 (new): Sibling-call optimization in slot_getsomeattrs—moved slot_getmissingattrs() into getsomeattrs() so the compiler can apply tail-call optimization. Reduces overhead and improves 0-extra-column tests
  • 0005 (new): Shrink CompactAttribute from 16 to 8 bytes—attcacheoff becomes int16 (max 2^15 − 1), and the boolean fields become bitflags. Andres noted the 8-byte size lets the compiler use a single LEA with scale factor 8; 6 bytes would require two LEA instructions

v10 (February 25, 2026) — Latest Patch Set

Based on the actual v10 patch content:

0003 (Optimize tuple deformation):

  • firstNonCachedOffsetAttr: index of the first attribute without a cached offset
  • firstNonGuaranteedAttr: index of the first nullable, missing, or !attbyval attribute. When deforming only up to this point, the code need not access HeapTupleHeaderGetNatts(tup)—a dependency reduction that helps the CPU pipeline
  • TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS: opt-in flag for the guaranteed-attribute optimization (some code deforms tuples before NOT NULL validation)
  • populate_isnull_array(): uses SPREAD_BITS_MULTIPLIER_32 (0x204081) to spread each bit of the inverted NULL bitmap into a separate byte; processes lower 4 and upper 4 bits separately to avoid uint64 overflow
  • fetch_att_noerr(): variant of fetch_att() without elog for invalid attlen; safe when attlen comes from CompactAttribute
  • first_null_attr(): finds the first NULL in the bitmap using pg_rightmost_one_pos32 or __builtin_ctz

0004 (Sibling-call optimization):

  • getsomeattrs() is now responsible for calling slot_getmissingattrs()
  • slot_getmissingattrs(): replaced memset with a for-loop (benchmarks showed the loop is faster)
  • slot_deform_heap_tuple(): calls slot_getmissingattrs() at the end when attnum < reqnatts; parameter renamed from natts to reqnatts

0005 (8-byte CompactAttribute):

  • attcacheoff becomes int16; offsets > PG_INT16_MAX are not cached
  • Bitflags for attispackable, atthasmissing, attisdropped, attgenerated
  • Stores cattrs = tupleDesc->compact_attrs to help GCC generate better code (avoids repeated TupleDescCompactAttr() calls)
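
As a rough illustration of how that fits in 8 bytes, here is a hypothetical layout. Only attcacheoff and the four flag names are taken from the thread; the other fields and the exact arrangement are illustrative guesses, and the real structure may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical 8-byte layout in the spirit of the v10 CompactAttribute:
 * a 16-bit cached offset plus one-bit flags replacing separate
 * booleans.  Fields other than attcacheoff and the flag names are
 * illustrative guesses.
 */
typedef struct CompactAttrSketch
{
    int16_t  attcacheoff;    /* cached offset; > INT16_MAX not cached */
    int16_t  attlen;         /* fixed length, or -1 for varlena */
    uint8_t  attalignby;     /* required alignment in bytes */
    uint8_t  flags;          /* bitflags below */
    uint8_t  spare[2];       /* padding up to 8 bytes */
} CompactAttrSketch;

#define CA_ISPACKABLE  0x01  /* attispackable */
#define CA_HASMISSING  0x02  /* atthasmissing */
#define CA_ISDROPPED   0x04  /* attisdropped  */
#define CA_GENERATED   0x08  /* attgenerated  */
```

A power-of-two element size matters because indexing an array of these needs only one scaled address computation (a single LEA on x86-64), per Andres's observation.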

Review fixes:

  • Amit Langote: Fixed rebase noise (duplicate attcacheoff check)
  • Zsolt Parragi: Big-endian fix—pg_bswap64() before memcpy in populate_isnull_array()
  • Typos: "benchmaring" → "benchmarking", "to info" → "into"
  • Andres: Set *offp before slot_getmissingattrs to reduce stack spills; use size_t for attnum to fix GCC -fwrapv codegen

deform_bench and Benchmark Infrastructure

Andres and Álvaro Herrera discussed where to put deform_bench: src/test/modules/benchmark_tools, src/benchmark/tuple_deform, or a single extension for micro-benchmarks. Andres argued for merging useful tools incrementally rather than waiting for a full suite. David prefers to focus on the deformation patches first; deform_bench may be committed separately.

Code Review: Chao Li's Feedback

Chao Li reviewed the patch and raised several points:

  1. NULL bitmap mask: Add a comment to clarify no overflow/OOB risk when natts & 7 == 0
  2. Uninitialized TupleDesc: Initialize firstNonCachedOffAttr to -1 in TupleDesc creation; Assert >= 0 in nocachegetattr()
  3. Semantic consistency: Use 0 for "no cached attributes," >0 for "some cached"
  4. Typo: "possibily" → "possibly"

David addressed all in v4.

Current Status

  • v10 (February 2026) is the latest patch set: 0001 (deform_bench), 0002 (TupleDescFinalize stub), 0003 (main optimization), 0004 (sibling-call + NULL bitmap→isnull), 0005 (8-byte CompactAttribute)
  • Andres Freund supports merging 0004 as a clear win; 0005's benefit is less certain (helps with LEA addressing when deforming few columns)
  • Active review from Zsolt Parragi (Percona), Álvaro Herrera, John Naylor, Amit Langote
  • deform_bench placement (src/test/modules vs. src/benchmark) still under discussion; David prefers to land the optimization patches first

Conclusion

Precalculating attcacheoff in TupleDescFinalize() and using a dedicated tight loop for attributes with cached offsets yields meaningful speedups for tuple deformation on modern CPUs. The optimization is most effective when tuples have many fixed-width columns and few or late NULLs. With Andres Freund's NULL-bitmap-to-isnull conversion (the "0x204081" SWAR trick), Apple M2 sees up to 63% speedup excluding edge cases. The sibling-call optimization in slot_getsomeattrs further reduces overhead. Results depend on hardware and compiler; GCC can over-vectorize some loops, addressed with pragmas or size_t for loop indices. The patch set (v10) has been refined through extensive review from Andres, John Naylor, Zsolt Parragi, Álvaro Herrera, and Amit Langote, and is progressing toward integration.


Week 08 (2026)

PostgreSQL mailing list discussions for Week 08, 2026.

🇨🇳 Chinese version

Articles

Eliminating SPI from RI Triggers: A Fast Path for Foreign Key Checks

Introduction

Referential Integrity (RI) triggers in PostgreSQL traditionally execute SQL queries via SPI (Server Programming Interface) to verify that inserted or updated rows in a referencing table have matching rows in the referenced (primary key) table. For bulk operations—large INSERT or UPDATE statements—this means starting and tearing down a full executor plan for each row, with significant overhead from ExecutorStart() and ExecutorEnd().

Amit Langote has been working on eliminating this overhead by performing RI checks as direct index probes instead of SQL plans. The latest iteration of this work, "Eliminating SPI / SQL from some RI triggers - take 3," achieves up to 57% speedup for bulk foreign key checks by bypassing the SPI executor and calling the index access method directly when the constraint semantics allow it.

The patch set has evolved through several versions, with Junwang Zhao joining the effort in late 2025. The current direction is a hybrid fast-path + fallback design: use a direct index probe for straightforward cases, and fall back to the existing SPI path when correctness requires executor behavior that would be difficult or risky to replicate.

Why This Matters

Foreign key constraints are ubiquitous. Every INSERT or UPDATE into a referencing table triggers RI checks that must verify each new or modified row against the referenced table's primary key. With the traditional approach:

CREATE TABLE pk (a int PRIMARY KEY);
CREATE TABLE fk (a int REFERENCES pk);

INSERT INTO pk SELECT generate_series(1, 1000000);
INSERT INTO fk SELECT generate_series(1, 1000000);  -- 1M RI checks

Each of the 1 million inserts triggers an RI check that:

  1. Builds a query plan to scan the PK index.
  2. Runs ExecutorStart() and ExecutorEnd().
  3. Executes the plan to find (or not find) the matching row.

This per-row plan setup/teardown dominates the cost. With Amit's v3 patches, the same bulk insert drops from ~1000 ms to ~432 ms (57% faster) on his benchmark machine—by probing the PK index directly without going through the executor.

Technical Background

The Traditional RI Path

RI trigger functions in ri_triggers.c (e.g. RI_FKey_check) call ri_PerformCheck(), which:

  1. Builds an SQL string for a query like SELECT 1 FROM pk WHERE pk.a = $1.
  2. Uses SPI_prepare and SPI_execute_plan to run it.
  3. The executor performs an index scan on the PK, returning a row if the referenced value exists.

This works correctly for all cases—partitioned tables, temporal foreign keys, concurrent updates—but pays the full plan-execution cost per row.

The Fast-Path Idea

For simple foreign keys (non-partitioned referenced table, non-temporal semantics), the check is conceptually: "probe the PK index for this value; if found and lockable, the check passes." That can be done by:

  1. Opening the PK relation and its unique index.
  2. Building a scan key from the FK column values.
  3. Calling index_getnext() (or equivalent) to find the tuple.
  4. Locking it with LockTupleKeyShare under the current snapshot.

No SQL, no plan, no executor. Just a direct index probe and tuple lock.

Patch Evolution

v1: The Original Approach (December 2024)

The first patch set (3 patches) introduced:

  • 0001: Refactoring of the PartitionDesc interface to explicitly pass the snapshot needed for omit_detached visibility (detach-pending partitions). This addressed a bug where PK lookups could return incorrect results under REPEATABLE READ because find_inheritance_children()'s visibility of detach-pending partitions depended on ActiveSnapshot, which RI lookups were manipulating.
  • 0002: Avoid using SPI in RI trigger functions by introducing a direct index probe path.
  • 0003: Avoid using an SQL query for some RI checks—the main performance optimization.

Amit noted that temporal foreign key queries would remain on the SPI path, as their plans involve range overlap and aggregation and are not amenable to a simple index probe. He also added an equivalent of EvalPlanQual() for the new path to handle concurrent updates correctly under READ COMMITTED.

v2: Junwang's Hybrid Fast Path (December 2025)

Junwang Zhao took the work forward with a hybrid design:

  • 0001: Add fast path for foreign key constraint checks. Applies when the referenced table is not partitioned and the constraint does not involve temporal semantics.
  • 0002: Cache fast-path metadata (operator hash entries, operator OIDs, strategy numbers, subtypes). At that stage, the metadata cache did not yet improve performance.

Benchmarks (1M rows, numeric PK / bigint FK):

  • Head: INSERT 13.5s, UPDATE 15s
  • Patched: INSERT 8.2s, UPDATE 10.1s

v3: Amit's Rework with Per-Statement Caching (February 2026)

Amit reworked Junwang's patches into two patches:

  • 0001: Functionally complete fast path. Includes concurrency handling, REPEATABLE READ crosscheck, cross-type operators, security context (RLS/ACL), and metadata caching. Most logic lives in ri_FastPathCheck(); RI_FKey_check just gates the call and falls back to SPI when needed.
  • 0002: Per-statement resource caching. Instead of sharing EState between trigger.c and ri_triggers.c, a new AfterTriggerBatchCallback mechanism fires at the end of each trigger-firing cycle. It allows caching the PK relation, index, scan descriptor, and snapshot across all FK trigger invocations within a single cycle, rather than opening and closing them per row.

Benchmarks on Amit's machine:

  Scenario                  Master    0001                  0001+0002
  1M rows, numeric/bigint   2444 ms   1382 ms (43% faster)  1202 ms (51% faster)
  1M rows, int/int          1000 ms   520 ms (48% faster)   432 ms (57% faster)

The incremental gain from 0002 (~13–17%) comes from eliminating per-row relation open/close, scan begin/end, slot allocation/free, and replacing per-row GetSnapshotData() with a snapshot copy in the cache.

Design: When to Use Fast Path vs. SPI

The fast path applies when:

  • The referenced table is not partitioned.
  • The constraint does not involve temporal semantics (range overlap, range_agg(), etc.).
  • Multi-column keys, cross-type equality (via index opfamily), collation matching, and RLS/ACL are all handled directly in the fast path.

The code falls back to SPI when:

  1. Concurrent updates or deletes: If table_tuple_lock() reports that the target tuple was updated or deleted, the code delegates to SPI so that EvalPlanQual and visibility rules apply as today.
  2. Partitioned referenced tables: Require routing the probe through the correct partition via PartitionDirectory. Can be added later as a separate patch.
  3. Temporal foreign keys: Use range overlap and containment semantics that inherently involve aggregation; they stay on the SPI path.

Security behavior mirrors the existing SPI path: the fast path temporarily switches to the parent table's owner with SECURITY_LOCAL_USERID_CHANGE | SECURITY_NOFORCE_RLS around the probe, matching ri_PerformCheck().

Future Directions

David Rowley suggested off-list that batching multiple FK values into a single index probe could further improve performance, leveraging the ScalarArrayOp btree improvements from PostgreSQL 17. The idea: buffer FK values across trigger invocations in the per-constraint cache, build a SK_SEARCHARRAY scan key, and let the btree AM traverse matching leaf pages in one sorted pass instead of one tree descent per row. Locking and recheck would remain per-tuple. This could be explored as a separate patch on top of the current series.

Current Status

  • The series is in PG19-Drafts. Amit moved it there in October 2025; Junwang Zhao is continuing the work.
  • Amit's v3 patches (February 2026) are in reasonable shape and ready for review. He welcomes feedback, especially on concurrency handling in ri_LockPKTuple() and the snapshot lifecycle in 0002.
  • Pavel Stehule has offered to help with testing and review.

Conclusion

Eliminating SPI from RI triggers for simple foreign key checks yields substantial performance gains for bulk operations. The hybrid fast-path + fallback design addresses reviewer concerns about correctness by deferring to SPI whenever executor behavior is non-trivial to replicate. The per-statement resource caching in v3 adds a second layer of optimization by amortizing relation/index setup across many rows within a single trigger-firing cycle.

For workloads with large bulk inserts or updates on tables with foreign keys—common in ETL, staging loads, and data migrations—this work could significantly reduce runtimes. The current limitations (partitioned PKs, temporal FKs) leave those cases on the existing path, preserving correctness while optimizing the majority of FK workloads.


Week 07 (2026)

PostgreSQL mailing list discussions for Week 07, 2026.

🇨🇳 Chinese version

Articles

SQL Property Graph Queries (SQL/PGQ): Bringing Graph Queries to PostgreSQL

Introduction

In February 2024, Peter Eisentraut announced a prototype implementation of SQL Property Graph Queries (SQL/PGQ) on the pgsql-hackers mailing list—a new way to run graph-style queries directly in PostgreSQL, following the SQL:2023 standard (ISO 9075-16). The initiative had been briefly discussed at the FOSDEM developer meeting, and community interest led Peter to share his work-in-progress.

Nearly two years later, the patch has evolved into a substantial implementation: 118 files changed, ~14,800 lines added. Peter and Ashutosh Bapat are the primary authors, with Junwang Zhao reviewing and Ajay Pal and Henson Choi testing. The latest iteration (v20260113) consolidates features from v0 through v14 and beyond—including cyclic path patterns, access permissions, RLS support, graph element functions (LABELS(), PROPERTY_NAMES()), multi-pattern path matching, ECPG support, property collation rules, and pg_overexplain integration.

SQL/PGQ lets you define property graphs over existing relational tables and query them using path patterns (vertices connected by edges), much like Cypher or GQL. Unlike dedicated graph databases, this approach maps graphs onto the relational model: graphs are views over tables, and graph queries rewrite to joins and unions. The discussion raised important architectural questions about when and how that transformation should happen—and whether the rewriter is the right place for it.

Why This Matters

Many applications have naturally graph-shaped data: social networks, supply chains, recommendation systems, fraud detection. Today, developers either:

  • Use a separate graph database (Neo4j, etc.) and maintain two systems, or
  • Encode graph traversals as recursive CTEs and complex joins in PostgreSQL.

SQL/PGQ aims to give PostgreSQL users a standard, declarative way to express graph queries without leaving SQL or duplicating data. The standard has been adopted by Oracle 23c and others; bringing it to PostgreSQL would improve interoperability and make graph capabilities available to a broad user base.

Technical Analysis

The SQL/PGQ Model

A property graph in SQL/PGQ is a virtual structure defined over existing tables:

  • Vertices are rows from one or more tables (with optional labels).
  • Edges are relationships, typically inferred from foreign keys or explicitly specified.
  • Properties are columns from those tables.
  • Labels can be shared across multiple element tables (e.g., person for both customers and employees), with each label exposing its own set of properties.

You create a graph with CREATE PROPERTY GRAPH and query it with GRAPH_TABLE(... MATCH ... COLUMNS ...). Example:

CREATE PROPERTY GRAPH myshop
    VERTEX TABLES (
        products LABEL product,
        customers LABEL customer,
        orders LABEL "order"
    )
    EDGE TABLES (
        order_items SOURCE orders DESTINATION products LABEL contains,
        customer_orders SOURCE customers DESTINATION orders LABEL has_placed
    );

SELECT customer_name FROM GRAPH_TABLE (myshop
  MATCH (c IS customer)-[IS has_placed]->(o IS "order" WHERE o.ordered_when = current_date)
  COLUMNS (c.name AS customer_name));

The MATCH clause describes a path pattern; the implementation rewrites it to joins and filters that PostgreSQL can plan and execute.

Implementation Approach: Rewrite System

The implementation uses the rewrite system for graph-to-relational transformation—the same stage where views expand. Peter explained that this aligns with the SQL/PGQ specification: graphs map to relations, and queries expand like view definitions. By the time the planner sees the query, the graph structure has been flattened into standard relational form, keeping the implementation consistent with view security (privileges, security barriers).

Patch Evolution: From v0 to v20260113

v0: Fragile Prototype (Feb 2024)

The initial patch was ~332 KB. Peter described it as "quite fragile." It introduced CREATE PROPERTY GRAPH, GRAPH_TABLE, basic path patterns, and documentation in ddl.sgml and queries.sgml.

v1 and Early Refinements (June–Aug 2024)

Peter and Ashutosh released v1 with a "fairly complete minimal feature set." Ashutosh contributed:

  • WHERE clause in graph patterns—e.g., MATCH (a)->(b)->(c) WHERE a.vname = c.vname
  • Spurious "column not found" bug fix: Attribute names were referenced from pg_attribute heap tuples; after RELCACHE_FORCE_RELEASE they could point to freed memory. The fix was to copy attribute names.
  • Compilation fixes, pgperltidy compliance, error location reporting for implicit properties/labels.

Imran Zaheer added support for EXCEPT lists in labels and properties.

v14: Cyclic Paths, Access Permissions, RLS (Aug–Oct 2024)

Ashutosh contributed major features:

  • Cyclic path patterns: Path patterns where an element variable repeats (e.g., same vertex at both ends of a path). Elements sharing a variable must have the same type and label expression; repeated edge patterns are not supported.
  • Access permissions on property graphs: The property graph acts like a security invoker view—the current user must have privileges on the underlying tables. Queries succeed only for elements the user can access. Security definer property graphs are not implemented.
  • Row Level Security (RLS): Regression tests in graph_table_rls.sql verify RLS behavior with property graphs.
  • Property collation and edge-vertex links: Same-named properties across elements must have the same collation. Edge key columns and referenced vertex keys must have compatible collations. Edge-vertex link quals use equality operators, with dependencies so they cannot be dropped without dropping the edge.
  • \d and \dG variants: \d on a property graph shows elements, tables, kinds, and end-vertex aliases; \d+ adds labels and properties. \dG lists property graphs; \dG+ adds owner and description.

Henson Choi: LABELS(), PROPERTY_NAMES(), Multi-Pattern (Dec 2025)

Henson Choi contributed three patches:

  • LABELS() graph element function: Returns all labels of a graph element as text[]. Implemented by wrapping element tables in subqueries with a virtual __labels__ column, enabling the planner to prune Append branches when filtering by labels (e.g., WHERE 'Person' = ANY(LABELS(v))).
  • PROPERTY_NAMES() graph element function: Returns all property names as text[], with similar planner pruning for property-based filters.
  • Multi-pattern path matching: Support for comma-separated path patterns in MATCH, e.g., MATCH (a)->(b), (b)->(c). Patterns with shared variables merge into one join; disconnected patterns produce a Cartesian product (aligned with SQL/PGQ and Neo4j Cypher).

v20260113: Consolidated Implementation (Jan 2026)

The latest patch (v20260113) merges all prior work into a single WIP patch:

  • ECPG support: SQL/PGQ in ECPG—basic queries, prepared statements, cursors, dynamic queries. Label disjunction in ECPG required changes to the ecpg lexer.
  • pg_overexplain integration: Property graph RTEs and RELKIND_PROPGRAPH are recognized for EXPLAIN (RANGE_TABLE, ...).
  • Extended test coverage: create_property_graph.sql (365 lines), graph_table.sql (561 lines), graph_table_rls.sql (363 lines), privileges.sql (58 lines).
  • rewriteGraphTable.c: Grown from ~420 to ~1,330 lines; propgraphcmds.c from ~1,000 to ~1,860 lines.

Community Insights

Andres Freund: Concerns About the Rewriter

Andres Freund raised a structural concern: transformation via the rewrite system bars the planner from benefiting from graph semantics and increases rewrite-system usage. Peter responded that PGQ is designed as relational-at-core (like view expansion), and the standard and other implementations follow this model. Tomas Vondra wondered whether retaining graph structure longer could enable graph-specific indexes or executor nodes; Ashutosh Bapat noted that many optimizations would improve the underlying joins anyway, and aligning with view expansion makes sense for security.

Florents Tselai: Documentation

Florents Tselai suggested reordering the docs to answer "I have some data modeled as a graph G=(V, E). Can Postgres help me?" first, and to use runnable examples from graph_table.sql. He compared with Apache AGE’s jsonpath-style approach but agreed the standard’s relational mapping fits core PostgreSQL.

Technical Details

Architecture

  • Parser: Grammar for CREATE PROPERTY GRAPH, ALTER PROPERTY GRAPH, DROP PROPERTY GRAPH, and GRAPH_TABLE(... MATCH ... COLUMNS ...).
  • Catalogs: pg_propgraph_element, pg_propgraph_element_label, pg_propgraph_label, pg_propgraph_label_property, pg_propgraph_property.
  • Rewrite: rewriteGraphTable.c transforms graph patterns into joins and unions.
  • Utilities: pg_dump, psql \d/\dG, tab completion; pg_get_propgraphdef() for introspection.
  • ECPG: Full support in embedded SQL.

Access Control

  1. A user needs SELECT on the property graph.
  2. The property graph is security invoker: the current user must have privileges on the underlying tables. Queries only succeed for elements the user can access.
  3. In a security definer view, property graph access uses the view owner’s privileges; base relation access uses the executing user’s privileges.
  4. Security definer property graphs are not implemented (the standard does not mention them).
Collation Rules

  • Properties with the same name must have the same collation across elements.
  • Edge key and vertex key collations must match when keys are explicitly specified (foreign-key-derived links rely on the constraint).
  • Edge-vertex link quals use equality operators; the edge depends on those operators so they cannot be dropped independently.

Current Status

The v20260113 patch is a consolidated WIP. It includes:

  • Full CREATE PROPERTY GRAPH / ALTER / DROP with labels, properties, KEY clauses, SOURCE/DESTINATION REFERENCES
  • GRAPH_TABLE with path patterns, WHERE in patterns, cyclic paths, multi-pattern
  • LABELS() and PROPERTY_NAMES() graph element functions
  • Access permissions, RLS, privileges tests
  • ECPG support, pg_overexplain integration
  • Documentation and regression tests

The patch is not yet committed. Peter and Ashutosh continue to refine it; the rewriter-based design remains the chosen approach, with ongoing review and testing.

Conclusion

SQL/PGQ would bring standardized graph query syntax to PostgreSQL. The implementation has grown from a fragile prototype to a feature-rich patch with cyclic paths, graph element functions, multi-pattern matching, access control, RLS, ECPG support, and comprehensive tests. The main architectural choice—rewriting graph queries to relations in the rewrite system—aligns with the standard and view semantics. If committed, PostgreSQL users could express path patterns over relational data without leaving SQL or maintaining a separate graph database.


Reducing LEFT JOIN to ANTI JOIN: A Planner Optimization for "WHERE col IS NULL"

Introduction

In late December 2025, Nicolas Adenis-Lamarre raised a planner optimization on the pgsql-hackers list: automatically detect anti-join patterns in queries that use LEFT JOIN ... WHERE right_table.column IS NULL when that column is known to be non-nullable (e.g. NOT NULL or primary key). Such queries mean "rows from the left side with no matching row on the right," which is exactly what an anti-join expresses. Recognizing this lets the planner choose a dedicated anti-join plan instead of a generic left join + filter, often with better performance.

The discussion drew in Tom Lane, David Rowley, Tender Wang, Richard Guo, and others. A patch evolved through several versions, was submitted to CommitFest, and received detailed review that uncovered correctness issues with nested outer joins and inheritance. This post summarizes the idea, the implementation approach, and the current status.

Why This Matters

Many developers write "find rows in A with no match in B" as:

SELECT a.*
FROM a
LEFT JOIN b ON a.id = b.a_id
WHERE b.some_not_null_col IS NULL;

Because the join is a LEFT JOIN, unmatched rows from a have NULL in all columns from b. Filtering on b.some_not_null_col IS NULL (when that column is NOT NULL in b) therefore keeps only those unmatched rows. Semantically this is an anti-join: "rows in A that do not have a matching row in B."

If the planner does not recognize this pattern, it may implement it as a normal left join plus a filter. If it does recognize it, it can use an explicit Hash Anti Join (or similar), which can be more efficient and can unlock better join order choices. The optimization is "non-mandatory" in the sense that a skilled user could rewrite the query as NOT EXISTS or NOT IN (with due care for NULLs), but automatic detection helps all users and keeps the original SQL readable.
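The equivalence can be checked with a small in-memory sketch (Python; the toy row lists stand in for tables a and b, and everything here is illustrative rather than PostgreSQL code):

```python
def left_join(a, b):
    """Simulate LEFT JOIN on a.id = b.a_id, NULL-extending unmatched rows."""
    out = []
    for ra in a:
        matches = [rb for rb in b if rb["a_id"] == ra["id"]]
        if matches:
            out += [{**ra, "some_not_null_col": rb["some_not_null_col"]}
                    for rb in matches]
        else:
            out.append({**ra, "some_not_null_col": None})  # NULL-extended row
    return out

def anti_join(a, b):
    """Rows of a with no matching row in b."""
    return [ra for ra in a if not any(rb["a_id"] == ra["id"] for rb in b)]

a = [{"id": 1}, {"id": 2}, {"id": 3}]
b = [{"a_id": 1, "some_not_null_col": "x"}]  # column is never NULL inside b

# Filtering the left join on "some_not_null_col IS NULL" keeps exactly the
# rows the anti-join produces, because the column can only be NULL when the
# row was NULL-extended (i.e. unmatched).
filtered = [r for r in left_join(a, b) if r["some_not_null_col"] is None]
assert [r["id"] for r in filtered] == [r["id"] for r in anti_join(a, b)]
```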

Technical Background

PostgreSQL already has logic to reduce outer joins in certain cases. In particular:

  • Commits 904f6a593 and e2debb643 added infrastructure that the planner can use for this kind of reduction.
  • In reduce_outer_joins_pass2, the planner already tries to reduce JOIN_LEFT to JOIN_ANTI when the join's own quals are strict for some var that was forced null by higher qual levels (e.g. by an upper WHERE).

The existing comment in that area noted that there are other ways to detect an anti-join—for example, checking whether vars from the right-hand side are non-null because of table constraints (NOT NULL, etc.). That was left for later; Nicolas's proposal and Tender's patch implement exactly that: use NOT NULL (and related) information to detect when WHERE rhs_col IS NULL implies "no match," and thus when a LEFT JOIN can be reduced to an ANTI JOIN.

Patch Evolution

Nicolas's initial patch

Nicolas sent a draft patch that:

  • Detected the pattern "left join b where x is null" when x is a non-null var from the right-hand side (RTE).
  • Was intentionally quick-and-dirty to see if the change was feasible.

He also listed other ideas (removing redundant DISTINCT/GROUP BY, folding double ORDER BY, anti-join on NOT IN, and a way to "view the rewritten query"). Those were discussed briefly but are not the focus of this post.

Tom Lane and David Rowley

Tom Lane replied that:

  • The optimization is reasonable and the new infrastructure (904f6a593, e2debb643) should be used.
  • The draft should not leave nearby comments outdated; keeping comments accurate is mandatory.

David Rowley suggested:

  • Using find_relation_notnullatts() and comparing with forced_null_vars, with care for FirstLowInvalidHeapAttributeNumber.
  • Searching the archives for prior work on UniqueKeys (for redundant DISTINCT removal).
  • Being cautious about "remove double order" and "NOT IN" anti-join; both have been discussed before and have subtle edge cases.
  • Noting that "view the rewritten query" is ambiguous—many optimizations cannot be expressed back as a single SQL statement.

Tender Wang's implementation (v2–v4)

Tender Wang provided a patch that:

  • Used the infrastructure from 904f6a593 and e2debb643.
  • Updated the comments in reduce_outer_joins_pass2 to describe the new case (detecting anti-join via NOT NULL constraints on the RHS).
  • Added regression tests.

Nicolas then:

  • Confirmed that Tender's patch was correct (after re-testing).
  • Suggested an early exit: only run the new logic when forced_null_vars != NIL, to avoid calling find_nonnullable_vars and have_var_is_notnull on every left join when most have no forced-null vars.
  • Contributed extra regression tests using new tables (with NOT NULL constraints) instead of modifying existing test tables like tenk1.

Tom Lane clarified that modifying common test objects (e.g. in test_setup.sql) is a bad idea, as it can change planner behavior and break or alter other tests. New tests should use new tables or existing ones that already match the needed properties.

Tender incorporated Nicolas's early-exit and regression tests into a single v4 patch and submitted it to CommitFest.

Richard Guo's review: correctness issues

Richard Guo reviewed the v4 patch and found two correctness problems.

1. Nested outer joins

When the right-hand side of the left join itself contains an outer join, a column that is NOT NULL in its base table can still become NULL in the join result. Reducing the outer join to an anti-join in that case is wrong.

Example (tables t1, t2, t3 with columns e.g. (a NOT NULL, b, c)):

EXPLAIN (COSTS OFF)
SELECT * FROM t1
LEFT JOIN (t2 LEFT JOIN t3 ON t2.c = t3.c) ON t1.b = t2.b
WHERE t3.a IS NULL;

Here t3.a is NOT NULL in t3, but because of the inner t2 LEFT JOIN t3, a row from t1 can be joined to the subquery and still have t3.a NULL (when there is no matching row in t3). So the upper join must remain a left join; converting it to an anti-join would drop rows incorrectly.
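A toy simulation (Python, not PostgreSQL code) makes the hazard concrete: the t1 row matches t2, yet t3.a comes out NULL because the inner left join NULL-extended it:

```python
# Tables as row lists; t3.a is declared NOT NULL, but t3 happens to be empty.
t1 = [{"b": 1}]
t2 = [{"b": 1, "c": 99}]
t3 = []  # nothing matches t2.c

# inner: t2 LEFT JOIN t3 ON t2.c = t3.c
inner = []
for r2 in t2:
    m = [r3 for r3 in t3 if r3["c"] == r2["c"]]
    inner.append({**r2, "t3_a": m[0]["a"] if m else None})

# outer: t1 LEFT JOIN inner ON t1.b = t2.b
result = []
for r1 in t1:
    m = [ri for ri in inner if ri["b"] == r1["b"]]
    result += [{**r1, "t3_a": ri["t3_a"]} for ri in m] or [{**r1, "t3_a": None}]

# The t1 row matched t2, yet t3.a is NULL: WHERE t3.a IS NULL must keep it,
# so converting the upper join to an anti-join would wrongly drop it.
assert result == [{"b": 1, "t3_a": None}]
```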

The patch was treating any var from a NOT NULL column as "safe" for the anti-join reduction without considering whether that var could be nulled by a lower-level outer join. Richard noted that we don't currently record varnullingrels in forced_null_vars, so a simple fix would be to only do this optimization when the RHS has no outer joins (right_state->contains_outer false), but that would be too restrictive.

His proposed direction: in reduce_outer_joins_pass1_state, record the relids of base rels that are nullable within each subtree. Then, when checking NOT NULL constraints, skip vars that come from those rels. He attached a v5 patch illustrating this idea.
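The shape of that tightened check can be sketched like this (all names here are invented for illustration; the real code works with planner relids and Var nodes):

```python
def can_reduce_left_to_anti(forced_null_vars, rhs_notnull_cols,
                            rhs_nullable_rels):
    """Illustrative only: a LEFT JOIN can become an ANTI JOIN if some var
    forced NULL by an upper WHERE (x IS NULL) comes from an RHS column with a
    NOT NULL constraint, and that var cannot be nulled by a lower outer join.
    Vars are modeled as (rel, column) tuples."""
    for rel, col in forced_null_vars:
        if (rel, col) in rhs_notnull_cols and rel not in rhs_nullable_rels:
            return True
    return False

# WHERE b.x IS NULL, b.x declared NOT NULL, b not under a lower outer join:
assert can_reduce_left_to_anti({("b", "x")}, {("b", "x")}, set()) is True
# Same, but b sits below another outer join (the counterexample above):
assert can_reduce_left_to_anti({("b", "x")}, {("b", "x")}, {"b"}) is False
```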

2. Inheritance

For inheritance parent tables, some child tables might have a NOT NULL constraint on a column while others do not. The patch did not account for that; the second issue is more straightforward to fix than the nested-outer-join case.

Other discussion points

  • Pavel Stehule asked participants to avoid top-posting on the list; the PostgreSQL wiki has guidelines on mailing list style.
  • Constants from subqueries: Nicolas noted that a case like SELECT * FROM a LEFT JOIN (SELECT 1 AS const1 FROM b) x WHERE x.const1 IS NULL is not handled; he considered it not worth handling.

Current Status

  • The v4 patch (with early exit and regression tests) was submitted to CommitFest (patch 6375).
  • Richard Guo's v5 patch addresses the nested-outer-join and inheritance issues by tracking nullable base rels and tightening when NOT NULL can be used for the reduction.
  • As of the thread, the discussion was ongoing; the final resolution (e.g. commit of a revised patch) would be tracked on the list and in CommitFest.

Conclusion

Automatically reducing LEFT JOIN ... WHERE rhs_not_null_col IS NULL to an anti-join when the column is provably non-nullable is a useful planner optimization that can improve performance without requiring users to rewrite queries. The patch has evolved from a draft to an implementation using the existing planner infrastructure, with regression tests and an early-exit optimization. Reviewer feedback has identified important correctness constraints: the RHS may contain nested outer joins or inheritance, so NOT NULL must be applied only when the var cannot be nulled by lower joins or by inheritance. Follow-up work centers on Richard's approach (recording nullable base rels and restricting the NOT NULL check accordingly) and on handling inheritance safely.

References

Week 06 (2026)

PostgreSQL mailing list discussions for Week 06, 2026.

🇨🇳 Chinese version

Articles

COPY TO with JSON Format: Native JSON Export from PostgreSQL

Introduction

In November 2023, Davin Shearer asked on the pgsql-general list how to emit JSON from PostgreSQL to a file using COPY TO. When he used COPY TO with a query that produced a single JSON column (e.g. json_agg(row_to_json(t))), the text format applied its own quoting rules: double quotes inside the JSON were escaped again, producing invalid JSON that tools like jq could not parse. The community agreed that a proper solution would be a native JSON format for COPY TO—so that one column of JSON (or a row rendered as one JSON object) is written as valid JSON without an extra layer of text/CSV escaping.
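The failure mode is easy to reproduce outside PostgreSQL. The sketch below (Python) approximates COPY's text-format escaping of backslashes and control characters, and shows that it turns valid JSON into something a JSON parser rejects:

```python
import json

def copy_text_escape(value: str) -> str:
    # Rough approximation of COPY's text-format escaping: backslashes are
    # doubled and control characters become backslash sequences.
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t")

row = {"note": 'say "hi"\tnow'}
serialized = json.dumps(row)          # valid JSON, contains \" and \t escapes
on_the_wire = copy_text_escape(serialized)

assert json.loads(serialized) == row  # fine before COPY touches it
try:
    json.loads(on_the_wire)
    ok = True
except json.JSONDecodeError:
    ok = False
assert not ok                         # the extra escaping layer breaks the JSON
```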

Building on that idea, jian he posted a series of patches in a long-running thread that also references Joe Conway’s earlier COPY/JSON discussion. The design that emerged is: add a FORMAT json option for COPY TO only, and an optional FORCE_ARRAY option to wrap the output in a JSON array. The thread has seen many revisions (v8 through v23), with feedback from Tom Lane, Joe Conway, Alvaro Herrera, Joel Jacobson, Junwang Zhao, and others. This post summarizes the discussion, the implementation, and the current status.

Why This Matters

  • Correct JSON export: Today, exporting a query result as JSON from the server usually means using COPY TO in text or CSV format. Text format treats the result as a string and escapes quotes and backslashes, which breaks JSON. A dedicated JSON format writes each row as one JSON object (or one JSON value per column) with proper escaping, so the output is valid JSON.
  • Interoperability: Many pipelines expect JSON (e.g. one JSON object per line, or a single JSON array). Native COPY TO ... (FORMAT json) and FORCE_ARRAY allow exporting directly from the database without client-side formatting or workarounds (e.g. psql -t -A or LO API).
  • Consistency with existing formats: COPY already supports text, csv, and binary. Adding json keeps the same mental model: choose a format and get correctly encoded output.

Technical Analysis

Design Decisions

The patch set makes these choices:

  1. JSON format is COPY TO only. COPY FROM with JSON is not supported (parsing arbitrary JSON is a larger feature). The grammar and option validation reject FORMAT json for COPY FROM.
  2. No HEADER with JSON. The documentation and code disallow HEADER when using JSON format, to avoid mixing a header line with JSON lines/array.
  3. One logical column in protocol. In JSON mode, the frontend/backend Copy protocol sends a single (non-binary) column; the row is rendered as one JSON value (e.g. one object per row).
  4. FORCE_ARRAY only with JSON. The FORCE_ARRAY option wraps the entire COPY output in [ ... ] and inserts commas between rows, so the result is a single JSON array. It is only valid with FORMAT json.
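Together these rules form a small validation matrix. A hypothetical Python sketch (names invented; in the patch the checks live in copy.c's option processing):

```python
def validate_copy_options(format: str, is_from: bool, header: bool,
                          force_array: bool) -> None:
    """Illustrative mirror of the option rules described above."""
    if format == "json" and is_from:
        raise ValueError("COPY FROM does not support the JSON format")
    if format == "json" and header:
        raise ValueError("HEADER cannot be used with the JSON format")
    if force_array and format != "json":
        raise ValueError("FORCE_ARRAY requires the JSON format")

# Accepted: COPY TO with JSON and FORCE_ARRAY.
validate_copy_options("json", is_from=False, header=False, force_array=True)
# Rejected combinations: JSON with COPY FROM, JSON with HEADER,
# FORCE_ARRAY without JSON.
for bad in [("json", True, False, False),
            ("json", False, True, False),
            ("csv", False, False, True)]:
    try:
        validate_copy_options(*bad)
        raise AssertionError("expected rejection")
    except ValueError:
        pass
```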

Patch Structure

  • Patch 1 (from v13) — CopyFormat refactor. Joel Jacobson introduced an enum CopyFormat (e.g. COPY_FORMAT_TEXT, COPY_FORMAT_CSV, COPY_FORMAT_BINARY) and replaced the two booleans csv_mode and binary in CopyFormatOptions with a single format field. This makes adding new formats (like JSON) cleaner. jian he later refactored this patch to address review feedback; Junwang Zhao adapted it to the new CopyToRoutine structure in the executor.

  • Patch 2 — JSON format for COPY TO

    • Grammar (gram.y): Add JSON as a format option and allow FORMAT json in COPY options.
    • Options (copy.c, copy.h): From v13, format is represented by CopyFormat; JSON adds COPY_FORMAT_JSON and the same checks: no HEADER/default/null/delimiter with JSON, no JSON with COPY FROM.
    • Copy protocol (copyto.c): In SendCopyBegin, when in JSON mode, send a single column with format 0 (text) instead of per-column formats.
    • Row output (copyto.c): In CopyOneRowTo, when json_mode is set, the row is converted to JSON via composite_to_json() (from utils/adt/json.c) and the resulting string is sent. For query-based COPY (no relation), the patch ensures the slot’s tuple descriptor matches the query’s so that composite_to_json sees the correct attribute metadata for key names.
    • json.c: composite_to_json() is changed from static to exported and declared in utils/json.h so COPY can call it.
  • Patch 3 — FORCE_ARRAY for COPY TO

    • Options (copy.c, copy.h): Add force_array and parse force_array / force_array true|false. Validation: FORCE_ARRAY is only allowed with JSON mode (v12+ uses ERRCODE_INVALID_PARAMETER_VALUE for the error).
    • Output (copyto.c): Before the first row, if JSON mode and force_array, send [ and a newline; between rows, send , before each JSON object (using a json_row_delim_needed flag); after the last row, send ] and newline. Default output (without FORCE_ARRAY) remains one JSON object per line.

Evolution: v8 Through v23

  • v8 took a larger approach: extracting COPY TO/FROM format implementations and adding a pluggable mechanism (including a contrib module pg_copy_json). Reviewers preferred a smaller, in-core change.
  • v9–v10 simplified to adding only the JSON format for COPY TO (no pluggable API). v10 introduced the json_mode flag and the use of composite_to_json.
  • v11 added the FORCE_ARRAY option and the corresponding tests; it also fixed the error code for “COPY FROM with json” and tightened option validation.
  • v12 (August 2024): only patch 2 (FORCE_ARRAY) was resent; the error when FORCE_ARRAY is used without JSON was changed to ERRCODE_INVALID_PARAMETER_VALUE.
  • v13 (October 2024): Joel Jacobson contributed patch 0001 — introduce CopyFormat enum and replace csv_mode and binary in CopyFormatOptions with a single format field. Patches 0002 (json format) and 0003 (force_array) were rebased on top; the docs explicitly state that JSON cannot be used with header, default, null, or delimiter.
  • v14–v22: Mostly rebase and adaptation to upstream. v14 dropped the separate CopyFormat patch in some postings (rebase on different bases). Junwang Zhao (v15, March 2025) adapted the JSON implementation to the new CopyToRoutine struct (commit 2e4127b6d2). Subsequent versions (v16–v22) continued to rebase and address review comments without changing the core design.
  • v23 (January 2026): Current series. Three patches: (1) CopyFormat refactor (originally Joel Jacobson; refactored by jian he), (2) json format for COPY TO (Author: Joe Conway; Reviewed-by multiple contributors including Andrey M. Borodin, Dean Rasheed, Daniel Verite, Andrew Dunstan, Davin Shearer, Masahiko Sawada, Alvaro Herrera), (3) FORCE_ARRAY for COPY JSON format. The feature set is unchanged; the patch set is rebased and has gathered substantial review.

Code Highlights

Row as JSON (patch 2). Each row is turned into a single JSON object via the existing composite_to_json():

rowdata = ExecFetchSlotHeapTupleDatum(slot);
result = makeStringInfo();
composite_to_json(rowdata, result, false);
CopySendData(cstate, result->data, result->len);

FORCE_ARRAY framing (patch 3). Before the row loop, send [; for each row after the first, send , then the object; after the loop, send ]:

if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, '[');
    CopySendEndOfRow(cstate);
}
// ... row loop: first row no comma, then CopySendChar(cstate, ','); then object ...
if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, ']');
    CopySendEndOfRow(cstate);
}
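The same framing can be mimicked in a few lines to confirm that both output modes parse cleanly (Python sketch, with json.dumps standing in for composite_to_json()):

```python
import json

def copy_to_json(rows, force_array=False):
    """Sketch of the output framing: one JSON object per line by default,
    or a single JSON array when force_array is set."""
    lines = []
    if force_array:
        lines.append("[")
    for i, row in enumerate(rows):
        obj = json.dumps(row)                # stands in for composite_to_json()
        lines.append(("," if force_array and i > 0 else "") + obj)
    if force_array:
        lines.append("]")
    return "\n".join(lines)

rows = [{"col": 1}, {"col": 2}]
# Default: one object per line, each line independently parseable.
assert [json.loads(l) for l in copy_to_json(rows).splitlines()] == rows
# FORCE_ARRAY: the whole output is one valid JSON array.
assert json.loads(copy_to_json(rows, force_array=True)) == rows
```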

Example usage (from regression tests)

COPY copytest TO STDOUT (FORMAT json);
-- One JSON object per line.

COPY copytest TO STDOUT (FORMAT json, force_array true);
-- Single JSON array: [ {"col":1,...}, {"col":2,...} ]

Community Insights

Original Problem and Workarounds

Davin’s initial issue was that COPY’s text format escaped his JSON again, breaking it. David G. Johnston and Adrian Klaver suggested using psql to write query results to a file instead of COPY. Dominique Devienne and David G. Johnston agreed that a “not formatted” or raw option for COPY would help when dumping a single column (e.g. JSON) as-is. Tom Lane concurred that a copy option for unformatted output would address the use case. That consensus led to the idea of a dedicated JSON format rather than overloading text/CSV.

Reviewer Feedback and Refinements

  • Tom Lane and others emphasized keeping the change minimal: add JSON output for COPY TO without a large refactor. That pushed the design from the v8-style pluggable formats to the current in-core JSON path.
  • Joe Conway had previously discussed COPY and JSON in a separate thread; jian he’s patches reference that discussion and align with the idea of first-class JSON output for COPY TO.
  • Alvaro Herrera and other reviewers have commented on the thread; the evolution from v8 to v9/v10 reflects their preference for a smaller, focused patch set.

Edge Cases in the Patch

  • Query-based COPY: When the source is a query (no relation), the tuple descriptor from the slot can differ from the query’s. The patch copies the query’s attribute metadata into the slot’s tuple descriptor so composite_to_json generates correct key names.
  • Protocol: In JSON mode, the Copy protocol sends one column; the backend still produces one JSON “value” per row (one object, or one element inside the array when FORCE_ARRAY is used).

Technical Details

Implementation Notes

  • Escaping: JSON string escaping is handled by composite_to_json() and the existing escape_json-style helpers in json.c, so quotes, backslashes, and control characters in column values are encoded correctly.
  • HEADER: Explicitly disallowed with JSON to keep the stream either pure JSON lines or a single JSON array.
  • FORCE_ARRAY output: The regression tests show that with force_array, the output is [, then newline, then the first object; then for each next row, , followed by the object; then ]. So the result is a single valid JSON array (with optional whitespace/newlines between elements).

Limitations

  • COPY FROM: No JSON import; only COPY TO is extended.
  • HEADER: Not supported in JSON mode.
  • Binary: JSON format is text-only (no binary JSON in this patch).

Current Status

  • The thread has seen activity from 2023 through early 2026. The latest series is v23 (January 2026): three patches (CopyFormat refactor, json format for COPY TO, FORCE_ARRAY).
  • As of the thread snapshot, the patches have not been committed; they remain under discussion. v23 reflects the current design and review state.
  • Design: COPY TO ... (FORMAT json) and optional (FORMAT json, force_array true) for a single JSON array; JSON is incompatible with HEADER, DEFAULT, NULL, DELIMITER, and COPY FROM.

Conclusion

The “Emitting JSON to file using COPY TO” thread started with a user hitting double-escaping when exporting JSON via COPY. The community agreed that a native JSON format for COPY TO was the right fix. jian he (and later Junwang Zhao) implemented FORMAT json (COPY TO only; incompatible with HEADER, DEFAULT, NULL, DELIMITER) and FORCE_ARRAY, reusing composite_to_json() and the existing JSON escaping. Joel Jacobson’s CopyFormat refactor (v13+) replaced the format booleans with an enum, making the codebase ready for JSON and future formats. The series has evolved through v23 with rebases and adaptations (e.g. to CopyToRoutine) and has received Reviewed-by from several committers.

References

Week 05 (2026)

PostgreSQL mailing list discussions for Week 05, 2026.

🇨🇳 Chinese version

Articles

GOO: A Greedy Join Search Algorithm for Large Join Problems

Introduction

PostgreSQL uses different strategies for join ordering depending on query complexity. For queries with fewer than geqo_threshold relations (default 12), the planner uses dynamic programming (DP) to find the optimal join order. For larger join graphs, it falls back to GEQO (Genetic Query Optimizer), which uses a genetic algorithm to search the space of possible join orders in a more scalable but heuristic way. GEQO has well-known drawbacks: it can be slower than DP for moderate join counts due to overhead, and it lacks tuning knobs like a reproducible seed for debugging.

In December 2025, Chengpeng Yan proposed GOO (Greedy Operator Ordering) on the pgsql-hackers mailing list—a deterministic greedy join-order search method intended as an alternative to GEQO for large join problems. The algorithm is based on the 1998 DEXA paper by Leonidas Fegaras, "A New Heuristic for Optimizing Large Queries."

Why This Matters

  • Planning time: On star/snowflake and TPC-DS style workloads, GOO can plan much faster than GEQO (e.g., ~5s vs ~20s for EXPLAIN on 99 TPC-DS queries in one test), while DP remains fastest when the join count is below the threshold.
  • Plan quality: The goal is to be "good enough" where GEQO is used today—reducing tail regressions and offering more predictable behavior than a genetic search.
  • Memory: GOO’s structure (iterative, commit-to-one-join-per-step) suggests lower memory use than full DP and different characteristics than GEQO; the author plans to measure this.

Understanding this thread helps you see how PostgreSQL might evolve for queries with many joins and where GOO fits next to DP and GEQO.

Technical Analysis

The GOO Algorithm

GOO builds the join order incrementally:

  1. Start with one "clump" per base relation.
  2. At each step, consider all legal join pairs (clumps that can be joined under the query’s join constraints).
  3. For each pair, build the join relation and use the planner’s existing cost model to get a total cost.
  4. Choose the pair with the lowest estimated total cost, merge them into one clump, and repeat.
  5. Stop when a single clump remains; that clump’s best path is the chosen plan.

So the "greedy" choice is: among all current clumps, always do the join that looks cheapest right now. The paper uses estimated result size; the patch uses the planner’s existing total_cost for consistency with the rest of PostgreSQL.

Complexity: Time is O(n³) in the number of base relations n: (n−1) merges, and at each step O(k²) pairs for k clumps.
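Mapped onto toy data, the greedy loop can be sketched as follows (illustrative Python only; the patch itself works on RelOptInfo clumps and the planner's real cost model):

```python
import itertools

def goo(rels, join_cost):
    """Greedy Operator Ordering sketch. rels: base relation names;
    join_cost(left, right): estimated cost of joining two clumps, where a
    clump is a frozenset of base rel names. Returns the merge sequence."""
    clumps = [frozenset([r]) for r in rels]
    plan = []
    while len(clumps) > 1:
        # Among all current pairs, commit to the join that looks cheapest now.
        left, right = min(itertools.combinations(clumps, 2),
                          key=lambda p: join_cost(*p))
        clumps.remove(left)
        clumps.remove(right)
        clumps.append(left | right)
        plan.append((left, right))
    return plan

# Toy cost: product of clump "sizes" (a crude stand-in for total_cost).
sizes = {"a": 10, "b": 1000, "c": 5}
cost = lambda l, r: sum(sizes[x] for x in l) * sum(sizes[x] for x in r)
plan = goo(["a", "b", "c"], cost)
# The cheapest pair {a} x {c} is merged first, then joined with {b}.
assert plan[0] == (frozenset({"a"}), frozenset({"c"}))
assert len(plan) == 2
```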

Integration with the Planner

The patch adds:

  • enable_goo_join_search — GUC to turn GOO on (default off).
  • Threshold: For now, GOO reuses geqo_threshold: when the number of join levels is ≥ geqo_threshold, and GOO is enabled, the planner calls goo_join_search() instead of GEQO. So GOO is positioned as a GEQO replacement, not a replacement for DP.

Relevant code in allpaths.c:

else if (enable_goo_join_search && levels_needed >= geqo_threshold)
    return goo_join_search(root, levels_needed, initial_rels);
else if (enable_geqo && levels_needed >= geqo_threshold)
    return geqo(root, levels_needed, initial_rels);

Greedy Strategies Under Experiment

The author experimented with different signals for "cheapest" in the greedy step:

  • Cost: Use the planner’s total_cost for the join (baseline).
  • Result size: Use estimated output size in bytes (reltarget->width * rows).
  • Rows: Use estimated output row count.
  • Selectivity: Use join selectivity (output rows / (left_rows × right_rows)).
  • Combined: Run GOO twice (cost and result_size), then pick the plan with lower final estimated cost.

Findings:

  • Cost alone can produce very bad tail cases (e.g., JOB: max 431×, many ≥10× regressions).
  • Result size is better on average but still has bad tails (e.g., 67× max on JOB).
  • Combined (cost + result_size, pick cheaper) improves robustness: best geometric mean, no ≥10× regressions in the JOB subset, worst case 8.68×.

So the discussion shifted from "which single metric?" to "how to reduce tail risk?"—e.g., by combining multiple greedy strategies and choosing the better plan.

Community Insights

Benchmark Clarification

Dilip Kumar initially questioned the use of pgbench, since default pgbench queries do not exercise join search. The author clarified: the numbers came from custom star-join and snowflake workloads from Tomas Vondra’s earlier thread, not from default pgbench. Those workloads use multi-table joins and empty tables, so the reported throughput mainly reflects planning time of DP vs GEQO vs GOO.

Crashes and the Eager-Aggregation Fix (v1 → v2)

Tomas Vondra reported crashes when running EXPLAIN on TPC-DS queries. The backtrace showed sort_inner_and_outer() with inner_path = NULL. Root cause: in some paths, the GOO code built join relations without calling set_cheapest(), so later code found no cheapest path set. Tomas also narrowed it down to queries with aggregates in the select list (e.g., TPC-DS Q7).

The fix in v2 was to correctly handle eager aggregation: the planner can create grouped/base relations that require a proper cheapest path. After the v2 fix, all 99 TPC-DS queries could be planned without crashes.

GEQO Slower Than DP for Moderate Join Counts

Tomas reported EXPLAIN times for 99 TPC-DS queries (3 scales × 0/4 workers):

  • master (DP): 8s
  • master/geqo: 20s
  • master/goo: 5s

So for that workload, GEQO was slower than DP, and GOO was fastest. John Naylor and Pavel Stehule noted that GEQO is designed to win only when the join problem is large enough; for smaller or moderate join counts, its overhead can dominate. So the comparison should focus on the range where GEQO is actually used (e.g., relation count above geqo_threshold).

TPC-DS Execution Results (Tomas Vondra)

Tomas shared full TPC-DS run results (scale 1 and 10, 0 and 4 workers). Summary of total duration (all 99 queries):

  • Scale 1: GOO was slower than both master and GEQO (e.g., ~1124s vs 399s geqo vs 816s master).
  • Scale 10: GOO was faster (e.g., ~1859s vs 2325s geqo vs 2439s master).

So GOO behaved worse at small scale and better at larger scale—suggesting workload- and scale-dependent behavior. Tomas suggested inspecting queries that got worse to refine heuristics and to test with larger data sets and cold cache.

TPC-H: Failure Modes of Single-Metric Greedy (v3)

The author ran TPC-H SF=1 with four strategies: rows, selectivity, result size, and cost. Main lessons:

  • Q20: Join between partsupp and an aggregated lineitem subquery. Row count was misestimated by orders of magnitude (tens vs hundreds of thousands). Output-oriented rules (rows, selectivity, result size) favored this join very early because it "looked" very shrinking; in reality it produced a huge intermediate and blew up downstream cost. So bad estimates can make output-oriented greedy rules fail badly.
  • Q7: Cost-based greedy chose a locally cheap join that created a large many-to-many intermediate, which made later joins much more expensive. So locally optimal cost can be globally bad.

Tomas pointed out: Q20 is largely an estimation problem (garbage in, garbage out); Q7 is inherent to greediness—locally good choices can be globally poor, and that’s not fixable by picking a different single metric.

JOB and Combined Strategy (v4)

On the full JOB workload, the combined strategy (cost + result_size, pick cheaper) gave:

  • Best geometric mean (0.953 vs DP).
  • No regressions ≥10×; max 8.68×.
  • Fewer bad tail cases than GOO(cost) or GOO(result_size) alone.

So increasing plan diversity by running two greedy criteria and selecting the cheaper plan helps avoid catastrophic plans without much extra planning cost.

Scope: GOO as GEQO Replacement

Tomas asked whether the goal was to replace DP or GEQO. The author confirmed: GOO is intended as a GEQO replacement, not to replace DP. When the join count is below geqo_threshold, DP should remain in use.

Literature and Next Steps

Tomas pointed to the CIDR 2021 paper "Simplicity Done Right for Join Ordering" (Hertzschuch et al.), which focuses on robustness (e.g., upper bound / worst-case join orders) and trusting base relation estimates—potentially relevant to nestloop blowups when cardinality is over-optimistic. The author plans to establish a solid baseline with the current approach, then incorporate ideas from such work incrementally.

Technical Details

Implementation Approach

  • New files: src/backend/optimizer/path/goo.c, src/include/optimizer/goo.h.
  • GOO builds join relations by repeatedly calling the existing planner routines (e.g., make_join_rel, path creation), so it reuses the same cost model and path types as DP/GEQO.
  • Memory is managed with multiple memory contexts to limit usage during candidate evaluation.

Edge Cases and Robustness

  • Eager aggregation: v2 fixed crashes by ensuring join relations created during GOO have proper cheapest paths set (so code like sort_inner_and_outer never sees NULL inner_path).
  • Cardinality misestimation: All approaches suffer when estimates are wrong; GOO’s sensitivity differs by strategy (e.g., output-oriented rules can be worse when row estimates are off). The combined strategy is aimed at reducing tail risk rather than fixing estimation.
  • Structural limits: Some shapes (e.g., star with fan-out) cause both cost and result_size to pick similarly bad plans; that’s a limitation of myopic greedy enumeration.

Performance Considerations

  • Planning time: GOO is O(n³) and in practice was faster than GEQO in the reported benchmarks; the author plans to add explicit planning-time and memory measurements.
  • Execution time: Highly workload-dependent; GOO can be better or worse than GEQO/DP depending on scale and query mix (e.g., TPC-DS scale 1 vs 10, JOB GEQO-relevant subset).

Current Status

  • Patch: v4 was the latest at thread closure. v4-0001 is the core GOO implementation (unchanged from v3-0001); v4-0002 adds test-only GUCs and machinery for trying different greedy strategies (e.g., combined).
  • Goal: Establish GOO as a viable GEQO replacement—same threshold, better or comparable plan quality and planning time, with reduced tail regressions.
  • Next steps (from the author): Broader evaluation (more workloads, larger join graphs, cold cache, larger scale factors); consider adding selectivity to the combined strategy; measure planning time and memory; look at tunability and graceful degradation (e.g., DP up to a resource limit then greedy).

Conclusion

The GOO thread shows a serious effort to replace GEQO with a deterministic, greedy join ordering algorithm that:

  • Reuses the existing cost model and planner infrastructure.
  • Improves planning time in several benchmarks compared to GEQO.
  • Reduces worst-case plan quality by combining multiple greedy strategies (e.g., cost and result_size) and picking the cheaper plan.

Limitations are acknowledged: greedy methods are inherently local and can produce bad plans when estimates are wrong or when the join graph has unfavorable structure. The focus has shifted to robustness and tail behavior rather than perfect single-metric tuning. For PostgreSQL users, this is a patch to watch: if committed, it would offer an alternative to GEQO for complex queries, with different trade-offs and potentially better predictability and planning performance.

References

Batching in Executor: Batch-Oriented Tuple Processing

Introduction

PostgreSQL’s executor has long been tuple-at-a-time: each plan node typically requests one tuple from its child, processes it, and passes one result tuple upward. That design is simple and works well for OLTP, but for analytical and bulk workloads the per-tuple overhead—especially repeated function-call and expression-evaluation cost—can dominate. At PGConf.dev 2025, the community discussed whether PostgreSQL could support an additional batch executor that moves batches of tuples between nodes instead of one slot at a time.

Following that discussion and off-list input from Andres Freund and David Rowley, Amit Langote posted a patch series on the pgsql-hackers list in September 2025 titled “Batching in executor”. The series introduces a batch table AM API, extends the executor with batch-capable interfaces (ExecProcNodeBatch, TupleBatch), and prototypes batch-aware expression evaluation (including batched quals and aggregate transitions). The goal is to reduce per-tuple overhead, enable future optimizations such as SIMD in aggregate functions, and lay groundwork for columnar or compressed table AMs that benefit from batch-oriented execution.

Why This Matters

  • Executor overhead: In CPU-bound, IO-minimal workloads (e.g., fully cached tables), a large share of time goes into the executor. Batching reduces calls into the table AM and expression interpreter, and can cut function-call overhead by evaluating expressions over many rows at once.
  • Aggregates and analytics: Batched transition evaluation (e.g., count(*), sum(), avg()) can pay fmgr cost per batch instead of per row and opens the door to vectorized or SIMD-friendly code paths.
  • Future table AMs: A batch-oriented executor makes it easier for columnar or compressed table AMs (e.g., Parquet-style) to pass native batch formats without forcing early materialization into heap tuples.
  • OLTP safety: The design keeps the existing row-at-a-time path unchanged; batching is opt-in (e.g., via executor_batching GUC) so OLTP workloads are not affected.

Understanding this thread helps you see how PostgreSQL might gain a second, batch-oriented execution path and what trade-offs (materialization, ExprContext, EEOP design) the community is working through.
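The per-row vs. per-batch call pattern can be sketched in a few lines of C. This is an illustrative model only (none of these types or function names exist in PostgreSQL, and EXEC_BATCH_ROWS here just borrows the constant's name): the point is that the batch path pays one call per EXEC_BATCH_ROWS rows instead of one call per row.

```c
/* Hypothetical sketch: a row-at-a-time scan pays one call per tuple,
 * while a batch scan pays one call per EXEC_BATCH_ROWS tuples. */
#define EXEC_BATCH_ROWS 64

typedef struct ScanState { const int *rows; int nrows; int pos; } ScanState;

/* Row-at-a-time: one call, one tuple. */
static int scan_next(ScanState *s, int *out)
{
    if (s->pos >= s->nrows) return 0;
    *out = s->rows[s->pos++];
    return 1;
}

/* Batch: one call fills up to EXEC_BATCH_ROWS tuples. */
static int scan_next_batch(ScanState *s, int out[EXEC_BATCH_ROWS])
{
    int n = 0;
    while (n < EXEC_BATCH_ROWS && s->pos < s->nrows)
        out[n++] = s->rows[s->pos++];
    return n;
}

static const int sample_rows[1000]; /* values are irrelevant; only the count matters */

/* Number of scan calls needed to consume 1000 rows, row path. */
static long demo_row_calls(void)
{
    ScanState s = { sample_rows, 1000, 0 };
    long ncalls = 0; int v;
    while (scan_next(&s, &v)) ncalls++;
    return ncalls;
}

/* Same, batch path: 15 full batches of 64 plus one of 40. */
static long demo_batch_calls(void)
{
    ScanState s = { sample_rows, 1000, 0 };
    long ncalls = 0; int buf[EXEC_BATCH_ROWS];
    while (scan_next_batch(&s, buf) > 0) ncalls++;
    return ncalls;
}
```

With 1000 rows, the row path makes 1000 producing calls while the batch path makes 16, which is the amortization the patch is after.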

Technical Analysis

Patch Structure

The series is split into two parts:

  1. Patches 0001–0003 — Foundation: batch table AM API, heapam batch implementation, and executor batch interface wired to SeqScan.
  2. Patches 0004–0008 — Prototype: batch-aware Agg node, new EEOPs for TupleBatch processing, batched qual evaluation, and batched aggregate transition (row-loop and “direct” per-batch fmgr).

Patches 0001–0003 are intended as the first candidates for review and eventual commit; 0004–0008 are marked WIP/PoC.

Key Abstractions

Table AM batch API (0001)

New callbacks let a table AM return multiple tuples per call instead of one. For heap:

  • HeapBatch holds tuples from a single page; size is limited by EXEC_BATCH_ROWS (currently 64) and by not crossing page boundaries.
  • heapgettup_pagemode_batch() fills a HeapTupleData array from the current page, mirroring the logic of heapgettup_pagemode() but for a batch. Visibility and scan direction are handled the same way.

The generic layer introduces a batch type and ops in tableam.h so other AMs can supply their own batch format and implementation.
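The single-page rule above means batch size is capped both by EXEC_BATCH_ROWS and by the tuples remaining on the current page. A minimal sketch of that arithmetic, with entirely made-up types (the real HeapBatch also tracks buffers, visibility, and scan direction):

```c
/* Sketch of the single-page batch rule: a batch holds at most
 * EXEC_BATCH_ROWS tuples and never spans a page boundary. */
#define EXEC_BATCH_ROWS 64

typedef struct PageScan {
    int tuples_per_page;   /* visible tuples per page (uniform, for the sketch) */
    int npages;
    int cur_page;
    int cur_off;           /* next tuple offset within cur_page */
} PageScan;

/* Fill one batch; returns the number of tuples, 0 at end of scan. */
static int fill_batch(PageScan *s)
{
    if (s->cur_page >= s->npages)
        return 0;
    int remaining = s->tuples_per_page - s->cur_off;
    int n = remaining < EXEC_BATCH_ROWS ? remaining : EXEC_BATCH_ROWS;
    s->cur_off += n;
    if (s->cur_off >= s->tuples_per_page)
    {
        s->cur_page++;
        s->cur_off = 0;
    }
    return n;
}

/* Helper: size of the k-th batch for a given page geometry. */
static int nth_batch(int tuples_per_page, int npages, int k)
{
    PageScan s = { tuples_per_page, npages, 0, 0 };
    int n = 0;
    for (int i = 0; i < k; i++)
        n = fill_batch(&s);
    return n;
}
```

With 43 tuples per page (the figure from Amit's test table), every batch is 43 rows even though the limit is 64; with 100 tuples per page, batches alternate 64 and 36.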

Executor batch path (0002–0003)

  • TupleBatch is the container passed between nodes when running in batch mode. It can hold the AM’s native batch (e.g., heap tuples) or materialized slots, depending on the path.
  • ExecProcNodeBatch() is the batch analogue of ExecProcNode(): it returns a TupleBatch* instead of a TupleTableSlot*. PlanState gains an ExecProcNodeBatch function pointer, with the same “first call” and instrumentation wrappers as the row path.
  • SeqScan gets:
    • Batch-driven slot path: still returns one slot per call, but fills it from an internal batch (fewer AM calls).
    • Batch path: when the parent supports batching, SeqScan’s ExecProcNodeBatch returns a TupleBatch directly (e.g., from ExecSeqScanBatch*).

So the first three patches give: (1) table AMs that can produce batches, (2) an executor API to request and pass batches, and (3) SeqScan as the first node that can both consume and produce batches.
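The "batch-driven slot path" deserves a sketch of its own: the node keeps its row interface, but refills from an internal batch so the table AM is called once per batch rather than once per row. All names below are illustrative, not the patch's:

```c
/* Hypothetical sketch of a SeqScan that returns one row per call but
 * fetches from the (fake) AM in batches. */
#define EXEC_BATCH_ROWS 64

typedef struct SeqScanSketch {
    int total;        /* rows remaining in the "table" */
    int batch[EXEC_BATCH_ROWS];
    int nvalid, next;
    long am_calls;    /* how many times we hit the fake AM */
} SeqScanSketch;

/* Stand-in for the batch table AM call: fills up to EXEC_BATCH_ROWS rows. */
static int am_getbatch(SeqScanSketch *s)
{
    s->am_calls++;
    int n = s->total < EXEC_BATCH_ROWS ? s->total : EXEC_BATCH_ROWS;
    for (int i = 0; i < n; i++)
        s->batch[i] = i;
    s->total -= n;
    s->nvalid = n;
    s->next = 0;
    return n;
}

/* Row interface on top of the batch: 1 with *out set, or 0 at EOF. */
static int exec_scan_slot(SeqScanSketch *s, int *out)
{
    if (s->next >= s->nvalid && am_getbatch(s) == 0)
        return 0;
    *out = s->batch[s->next++];
    return 1;
}

/* AM calls needed to drain nrows through the row interface. */
static long demo_am_calls(int nrows)
{
    SeqScanSketch s = { nrows, {0}, 0, 0, 0 };
    int v;
    while (exec_scan_slot(&s, &v))
        ;
    return s.am_calls;
}
```

Draining 1000 rows costs 16 productive AM calls plus one final empty call, versus 1000 calls on a pure row-at-a-time path.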

Batch-Aware Expression Evaluation (0004–0008)

The later patches experiment with evaluating expressions over a batch of rows:

  • Batch input to Agg: Agg can pull TupleBatch from its child via ExecProcNodeBatch() and feed rows into the aggregate transition in bulk.
  • New EEOPs: Expression interpreter gains steps that operate on TupleBatch data—e.g., fetching attributes into batch vectors, evaluating a qual over a batch, and running aggregate transitions either by looping over rows inside the interpreter (ROWLOOP) or by calling the transition function once per batch with bulk arguments (DIRECT).
  • Batched qual evaluation: A batch of tuples can be filtered with a single pass over the batch (ExecQualBatch and related EEOPs), reducing per-row interpreter and fmgr overhead.

Two prototype paths for batched aggregation are provided: one that iterates over rows in the interpreter (per-row transition), and one that invokes the transition function once per batch (per-batch fmgr). The latter shows larger gains in Amit’s benchmarks when executor cost dominates.
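The ROWLOOP/DIRECT distinction can be modeled with a trivial sum aggregate. This is a sketch under our own names (the real paths run inside the expression interpreter and pay fmgr call overhead, which the per-batch variant amortizes):

```c
/* Contrast of the two prototype paths: ROWLOOP calls the transition
 * function once per row; DIRECT calls a batch-aware transition once
 * per batch. */
typedef struct SumState { long sum; long trans_calls; } SumState;

/* Per-row transition, as in the ROWLOOP path. */
static void sum_trans(SumState *st, int value)
{
    st->trans_calls++;
    st->sum += value;
}

/* Per-batch transition, as in the DIRECT path: one call, many rows. */
static void sum_trans_batch(SumState *st, const int *values, int n)
{
    st->trans_calls++;
    for (int i = 0; i < n; i++)
        st->sum += values[i];
}

/* Aggregate nrows ones in batches of 64, via either path. */
static SumState run(int nrows, int use_direct)
{
    SumState st = {0, 0};
    int buf[64];
    for (int done = 0; done < nrows; )
    {
        int n = nrows - done < 64 ? nrows - done : 64;
        for (int i = 0; i < n; i++)
            buf[i] = 1;
        if (use_direct)
            sum_trans_batch(&st, buf, n);
        else
            for (int i = 0; i < n; i++)
                sum_trans(&st, buf[i]);
        done += n;
    }
    return st;
}
```

Both paths compute the same sum, but over 1000 rows the DIRECT-style path makes 16 transition calls instead of 1000.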

Design Choices and Open Points

  • Single-page batches: Heap batches are limited to one page. So batches may be smaller than EXEC_BATCH_ROWS (e.g., with few tuples per page or selective quals). The thread mentions possible future improvements: batches spanning pages or the scan requesting more tuples when the batch is not full.
  • TupleBatch vs ExprContext: The patches extend ExprContext with scan_batch, inner_batch, and outer_batch. Per-batch expression evaluation still uses ecxt_per_tuple_memory, which Amit notes is “arguably an abuse” of the per-tuple contract. A clearer model for batch-scoped memory is still needed.
  • Materialization: Today, batch-aware expression evaluation typically works on tuples materialized into slots (or heap tuple arrays). The long-term goal is to allow expression evaluation on native batch formats (e.g., columnar or compressed) without forcing materialization; that would require more infrastructure (e.g., AM-controlled expression evaluation or batch-aware operators).

Community Insights

Tomas Vondra: Batch Design vs Index Prefetching

Tomas compared the patch to the index prefetching work he is involved in, which also introduces a “batch” concept for passing data between the index AM and the executor. He noted that the two designs differ deliberately:

  • Index prefetching: A shared batch struct is filled by the index AM and then managed by indexam.c; the batch is AM-agnostic after that.
  • Executor batching: Each table AM can produce its own batch format (e.g., HeapBatch) wrapped in a generic TupleBatch with AM-specific ops. The executor retains TAM-specific optimizations and relies on the TAM for operations on batch contents.

Amit agreed: for executor batching the aim is to keep TAM-specific behavior and avoid early materialization where possible; for prefetching the aim is a single, indexam-driven batch format. Both designs are consistent with their goals.

Tomas also asked: (1) When must a TupleBatch be materialized into a generic format (e.g., slots)? (2) Can expressions run directly on “custom” batches (e.g., compressed/columnar)? Amit replied that materialization is currently required for expression evaluation but that the design should not block future work to evaluate expressions on native batch data (e.g., columnar or Parquet-style). Giving the table AM more control over how expressions are evaluated on its batch data is a possible future extension.

Tomas Vondra: TPC-H Q22 Segfault and Fix (v3)

Tomas reported a segfault when running TPC-H with batching enabled, only on Q22, with backtraces always pointing to the same place: numeric_avg_accum with a NULL datum (DatumGetNumeric(X=0)), called from ExecAggPlainTransBatch and then agg_retrieve_direct_batch. So the bug was in the batched aggregate path: a NULL was being passed where the transition function expected a valid value.

Amit tracked the crash to the expression interpreter. Two different EEOPs (for the ROWLOOP and DIRECT batched aggregate paths) both called the same helper function. That helper re-derived the opcode at execution time (e.g., via ExecExprEvalOp(op)). In some builds (e.g., clang-17 on macOS), the two EEOP cases compiled to identical code, so their dispatch labels had the same address. The interpreter’s reverse lookup by label address could then return the wrong EEOP; the init path could think it was running the ROWLOOP EEOP while the exec path behaved like the DIRECT EEOP, leading to incorrect state and the NULL/crash.

The fix (in v3, patch 0009) was to split the shared helper into two separate functions, one per EEOP, so the helper no longer re-derives the opcode. With that change, Amit could not reproduce the crash on macOS with clang-17. The same fix addresses the TPC-H Q22 segfault that Tomas saw.

Bruce Momjian: POSETTE Talks and OLTP

Bruce pointed to two POSETTE 2025 talks for context: one on data warehouse needs and one on “Hacking Postgres Executor For Performance”. Amit (who gave the second talk) confirmed that batching is designed to avoid adding meaningful overhead to the OLTP path; the row-at-a-time path remains default and unchanged.

Regression When Batching Is Off

Tomas had observed that with batching disabled (executor_batching=off), the patched tree could be slower than unpatched master—i.e., a regression when the new code path is not used. Amit reproduced this: for example, single-aggregate SELECT count(*) FROM bar and multi-aggregate SELECT avg(a), … FROM bar showed roughly 3–18% slowdown with batching off vs master, depending on row count and parallelism. He acknowledged the regression and said he was looking into it. Ensuring zero or minimal cost when batching is disabled is important for committable patches.

Technical Details

Implementation Highlights

  • Batch size: EXEC_BATCH_ROWS is 64. Heap batches are further limited to one page, so effective batch size can be smaller (e.g., ~43 rows per page in Amit’s 10M-row test table).
  • Instrumentation: ExecProcNodeBatch uses the same instrumentation hooks as the row path; the “tuple” count for a batch call is recorded as the number of valid rows in the returned TupleBatch (b->nvalid), so EXPLAIN ANALYZE-style stats remain meaningful.
  • GUC: In v4/v5 the GUC is executor_batch_rows (0 = batching off; e.g. 64 = batch size).

Edge Cases and Limitations

  • Sparse batches: With selective quals, batches can end up with few valid rows after filtering. The thread suggests future work: cross-page batches or the scan refilling the batch when it is not full.
  • ExprContext and batch lifetime: Reusing ecxt_per_tuple_memory for per-batch work is a known design debt; a dedicated batch-scoped allocator or context would be cleaner.
  • Parallel and nested Agg: The backtrace from Tomas’s crash showed parallel workers (Gather/GatherMerge) and nested aggregation (e.g., Agg over subplan). The NULL-datum bug was in the batched transition path used in that setting; the v3 fix (split EEOP helpers) addresses the root cause rather than a single query.

Benchmark Summary (from Amit’s v1 post)

All runs were on fully VACUUMed tables with large shared_buffers and prewarmed cache; timings in ms, “off” = batching off, “on” = batching on. Negative %diff means “on” is faster.

  • Single aggregate, no WHERE (e.g., SELECT count(*) FROM bar_N): With only batched SeqScan (0001–0003), ~8–22% faster; with batched agg (0001–0007), ~33–49% faster in several cases.
  • Single aggregate, with WHERE: With batched agg and batched qual (0001–0008), ~31–40% faster.
  • Five aggregates, no WHERE: Batched transitions (per-batch fmgr, 0001–0007) ~22–31% faster.
  • Five aggregates, with WHERE: Batched transitions + batched qual (0001–0008) ~18–32% faster.

So once the executor dominates (minimal IO), batching consistently reduces CPU time, with the largest gains from avoiding per-row fmgr calls and evaluating quals over batches.

Evolution: v4 and v5

Later revisions refined the foundation and added observability and batch qual work:

  • v4 (Oct 2025): Adds EXPLAIN (BATCHES) (patch 0003) to show tuple-batching statistics, addressing the earlier “instrumentation” open point. Amit reported that the regression when batching is off (vs unpatched master) was no longer seen in v4—likely due to removing stray fields from HeapScanData and avoiding mixed compiler (gcc vs clang) comparisons. New benchmarks use SELECT * FROM t LIMIT 1 OFFSET n; with batch=64, improvements are ~22–26% for no-WHERE and ~21–48% for WHERE a > 0; deform-heavy cases (e.g. qual on last column) show smaller gains. Daniil Davydov reviewed the heap batch code (e.g. SO_ALLOW_PAGEMODE assertion, heapgettup_pagemode_batch logic, style); Amit addressed these in v4.

  • v5 (Jan 2026): Keeps 0001–0003 as core (batch AM API, SeqScan + TupleBatch, EXPLAIN BATCHES). 0004 adds ExecQualBatch for batched qual evaluation (WIP); 0005 moves batch qual opcodes into a dedicated interpreter so the per-tuple path (ExecInterpExpr) is not modified, aiming to avoid any cost when executor_batch_rows=0. Amit removed the BatchVector intermediate (quals now read batch slots’ tts_values directly). Two open issues: (1) With 0% selectivity (all rows fail the qual), the per-tuple path is still hotter with the batch qual patches applied even when batching is off; (2) Quals on late columns (deform-heavy) get little or no benefit from batching. The GUC in recent patches is executor_batch_rows (0 = off).

Current Status

  • The thread is active; the latest messages are from January 2026. The series is still work in progress.
  • v5 is the current revision. Patches 0001–0003 (table AM batch API, heapam batch, SeqScan + TupleBatch, EXPLAIN BATCHES) are the intended first step for review and possible commit.
  • Patches 0004–0005 in v5 are experimental (ExecQualBatch, dedicated interpreter for batch qual).
  • v3 had the segfault fix (split EEOP helpers) for the TPC-H Q22 / batched-agg crash; the v4/v5 series builds on that.
  • Open items: (1) Per-tuple path regression when batch qual (0004–0005) is in the tree but executor_batch_rows=0 (e.g. 0% selectivity); (2) batch-scoped memory and ExprContext; (3) future work on cross-page batches and expression evaluation on native/compressed batch formats.

Conclusion

Amit Langote’s “Batching in executor” series introduces a batch-oriented path in the PostgreSQL executor: table AMs can return batches of tuples, the executor can request and pass them via TupleBatch, and SeqScan is the first node wired to this path. Revisions v4 and v5 add EXPLAIN (BATCHES) for observability and prototype batched qual evaluation with a dedicated interpreter to keep the row-at-a-time path unchanged. Benchmarks show substantial gains (often 20–50%) when batching is on; the earlier “batching off” regression was addressed in v4, but a remaining issue is per-tuple path cost when the batch qual patches are applied and batching is disabled (e.g. 0% selectivity).

Reviewers have raised important points: alignment with other batch-like work (e.g. index prefetching), materialization and future expression-on-batch design, the TPC-H Q22 segfault (fixed in v3), and Daniil’s heap-batch review (addressed in v4). The foundation (0001–0003) plus EXPLAIN BATCHES is the current focus for review and possible commit.

References

Week 04 (2026)

PostgreSQL mailing list discussions for Week 04, 2026.

🇨🇳 中文版本

Articles

PostgreSQL Planner Optimization: Automatic COUNT(*) Conversion

Introduction

In October 2025, PostgreSQL committer David Rowley proposed a significant query planner optimization that automatically converts COUNT(1) and COUNT(not_null_col) expressions to COUNT(*). This optimization addresses a common performance anti-pattern: developers often write COUNT(1) assuming it performs the same as COUNT(*), when in fact the two return the same result but COUNT(*) is more efficient. The patch was committed in November 2025 and introduces new infrastructure for aggregate function simplification.

Why This Matters

The performance difference between COUNT(*) and COUNT(column) can be substantial, especially for large tables. When counting a specific column, PostgreSQL must:

  1. Deform the tuple to extract the column value
  2. Check for NULL values (even for NOT NULL columns, the check still occurs)
  3. Process the column data through the aggregate function

In contrast, COUNT(*) can count rows without accessing individual column values, resulting in significantly better performance. David Rowley's benchmarks showed approximately 37% performance improvement when using COUNT(*) instead of COUNT(not_null_col) on a table with 1 million rows.

Technical Analysis

The Infrastructure: SupportRequestSimplifyAggref

The patch introduces a new infrastructure called SupportRequestSimplifyAggref, which is similar to the existing SupportRequestSimplify used for regular function expressions (FuncExpr). Since aggregates use Aggref nodes, a separate mechanism was needed.

The key components include:

  1. New support node type: SupportRequestSimplifyAggref in supportnodes.h
  2. Simplification function: simplify_aggref() in clauses.c that calls the aggregate's support function during constant folding
  3. Enhanced nullability checking: Extended expr_is_nonnullable() to handle Const nodes, not just Var nodes

Implementation Details

The optimization is performed during the constant folding phase of query planning, specifically in eval_const_expressions_mutator(). When an Aggref node is encountered, the planner:

  1. Checks if the aggregate function has a support function registered via pg_proc.prosupport
  2. Calls the support function with a SupportRequestSimplifyAggref request
  3. If the support function returns a simplified node, replaces the original Aggref

For the COUNT aggregate specifically, the support function (int8_agg_support_simplify()) checks:

  • Whether the argument is non-nullable (using expr_is_nonnullable())
  • Whether there are no ORDER BY or DISTINCT clauses in the aggregate
  • If both conditions are met, converts COUNT(ANY) to COUNT(*)

Code Example

The core simplification logic in int8.c (simplified):

static Node *
int8_agg_support_simplify(SupportRequestSimplifyAggref *req)
{
    Aggref    *aggref = req->aggref;

    /* Only handle COUNT */
    if (aggref->aggfnoid != INT8_AGG_COUNT_OID)
        return NULL;

    /* Must have exactly one argument */
    if (list_length(aggref->args) != 1)
        return NULL;

    /* No ORDER BY or DISTINCT */
    if (aggref->aggorder != NIL || aggref->aggdistinct != NIL)
        return NULL;

    /* Check if argument is non-nullable */
    if (!expr_is_nonnullable(req->root,
                             (Expr *) linitial(aggref->args),
                             true))
        return NULL;

    /* Convert to COUNT(*) */
    return make_count_star_aggref(aggref);
}

Patch Evolution

The patch went through four iterations, each refining the implementation:

Version 1 (Initial Proposal)

  • Introduced the basic infrastructure
  • Used SysCache to fetch pg_proc tuples

Version 2 (Code Cleanup)

  • Replaced SysCache lookup with get_func_support() function
  • Cleaner and more efficient approach

Version 3 (Removed Experimental Code)

  • Removed #ifdef NOT_USED block that handled COUNT(NULL) optimization
  • Cleaned up unused includes
  • Improved comments

Version 4 (Final Version)

  • Rebased after commit b140c8d7a
  • Fixed assumption that support function always returns an Aggref
  • Allows support functions to return other node types (e.g., constants) for more aggressive optimizations
  • This flexibility enables future optimizations like converting COUNT(NULL) to '0'::bigint

Community Insights

Reviewer Feedback

Corey Huinker provided positive feedback:

  • +1 for the automatic query improvement
  • Noted that we can't educate everyone that COUNT(1) is an anti-pattern, so making it not an anti-pattern is the right approach
  • Confirmed the patch applies cleanly and tests pass

Matheus Alcantara also reviewed and tested:

  • Confirmed ~30% performance improvement in benchmarks
  • Validated that the code placement is consistent with existing SupportRequestSimplify infrastructure
  • +1 for the idea

Design Decisions

Timing of Optimization: The optimization happens during constant folding, which is early in the planning process. David considered whether it should happen later (after add_base_clause_to_rel()) to catch cases like:

SELECT count(nullable_col) FROM t WHERE nullable_col IS NOT NULL;

However, it must happen before preprocess_aggref(), which groups aggregates with the same transition function. The current placement is consistent with SupportRequestSimplify for regular functions.

Support Function Return Type: The infrastructure allows support functions to return nodes other than Aggref. This design decision enables future optimizations, such as:

  • Converting COUNT(NULL) to '0'::bigint
  • More aggressive constant folding for aggregates

Performance Considerations

The optimization provides significant performance benefits:

  1. Reduced tuple deformation: COUNT(*) doesn't need to extract column values from tuples
  2. Fewer NULL checks: No need to check individual column values
  3. Better cache utilization: Less data movement means better CPU cache usage

For tables with many columns, the performance gain can be even more substantial, as COUNT(column) might require deforming many columns to reach the target column.
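The "deforming many columns" cost comes from variable-width attributes: without cached offsets, reaching column k means walking over columns 0..k-1, while COUNT(*) touches no column at all. A toy model (nothing like PostgreSQL's real tuple layout, which uses alignment, null bitmaps, and varlena headers):

```c
#include <stdint.h>

/* Toy tuple layout: each column is a 1-byte length followed by that
 * many payload bytes, so no column has a fixed offset. */
typedef struct FakeTuple {
    const uint8_t *data;
    int natts;
} FakeTuple;

/* Byte offset of column k; counts how many earlier columns we walked. */
static int column_offset(const FakeTuple *t, int k, int *cols_walked)
{
    int off = 0;
    *cols_walked = 0;
    for (int i = 0; i < k; i++)
    {
        off += 1 + t->data[off];   /* skip length byte + payload */
        (*cols_walked)++;
    }
    return off;
}

/* Sample tuple: three columns of widths 2, 3, and 1. */
static const uint8_t sample[] = { 2, 0, 0, 3, 0, 0, 0, 1, 0 };

static int demo_walked(int k)
{
    FakeTuple t = { sample, 3 };
    int w;
    column_offset(&t, k, &w);
    return w;
}

static int demo_offset(int k)
{
    FakeTuple t = { sample, 3 };
    int w;
    return column_offset(&t, k, &w);
}
```

Fetching column 2 walks both preceding columns to find offset 7; fetching column 0 walks none. The further right the counted column sits, the more work COUNT(column) does that COUNT(*) skips.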

Edge Cases and Limitations

The optimization only applies when:

  1. The column is provably non-nullable (NOT NULL constraint or constant)
  2. There are no ORDER BY clauses in the aggregate
  3. There are no DISTINCT clauses in the aggregate

Cases that are not optimized (yet):

  • COUNT(nullable_col) where the column might be NULL (even if filtered by WHERE nullable_col IS NOT NULL in the same query)
  • COUNT(col ORDER BY col) - the ORDER BY prevents optimization
  • COUNT(DISTINCT col) - DISTINCT prevents optimization

The limitation with WHERE clauses is due to the timing of the optimization (during constant folding, before relation information is fully available).

Current Status

The patch was committed by David Rowley on November 26, 2025. It is available in the PostgreSQL master branch and is expected to ship in PostgreSQL 19.

Conclusion

This optimization represents a significant improvement to PostgreSQL's query planner, automatically fixing a common performance anti-pattern without requiring application changes. The new SupportRequestSimplifyAggref infrastructure also opens the door for future aggregate optimizations.

For developers and DBAs:

  • No action required: The optimization happens automatically
  • Performance benefit: Existing queries using COUNT(1) or COUNT(not_null_col) will automatically get faster
  • Best practice: While the planner now optimizes these cases, COUNT(*) remains the clearest and most idiomatic way to count rows

This change demonstrates PostgreSQL's commitment to improving query performance automatically, reducing the burden on developers to know every optimization detail while still allowing experts to write optimal queries when needed.

References

Week 03 (2026)

PostgreSQL mailing list discussions for Week 03, 2026.

🇨🇳 中文版本

Articles

Extended Statistics Import/Export Functions for PostgreSQL 19

Introduction

PostgreSQL's extended statistics feature, introduced in version 10, allows the optimizer to understand correlations between columns that simple per-column statistics cannot capture. This includes n-distinct coefficients (how many distinct combinations exist), functional dependencies (when one column determines another), and Most Common Values (MCV) lists for multi-column combinations.

While PostgreSQL 18 introduced functions for importing and exporting relation and attribute statistics (pg_restore_relation_stats, pg_restore_attribute_stats), extended statistics were left out of this initial implementation. A recent thread on pgsql-hackers, initiated by Corey Huinker, addresses this gap with a comprehensive patch series that adds pg_restore_extended_stats(), pg_clear_extended_stats(), and related infrastructure.

This work is significant for several reasons:

  • Enables complete statistics preservation across pg_dump/pg_restore and pg_upgrade
  • Allows query planner experimentation with hypothetical statistics
  • Supports schema-only dumps with statistics for testing query plans without actual data

Technical Analysis

The Problem with the Original Format

The original output format for pg_ndistinct and pg_dependencies types used a JSON structure where the keys themselves contained structured data:

{"1, 2": 2323, "1, 3": 3232, "2, 3": 1500}

While technically valid JSON, this format posed several problems:

  1. Keys containing comma-separated attribute numbers require additional parsing
  2. Difficult to manipulate programmatically
  3. No working input function existed—these types were effectively output-only

The New JSON Format

The patch series introduces a cleaner, more structured JSON format. For pg_ndistinct:

[
  {"attributes": [2, 3], "ndistinct": 4},
  {"attributes": [2, -1], "ndistinct": 4},
  {"attributes": [2, 3, -1], "ndistinct": 4}
]

For pg_dependencies:

[
  {"attributes": [2], "dependency": 3, "degree": 1.000000},
  {"attributes": [2, 3], "dependency": -1, "degree": 0.850000}
]

Key improvements:

  • Proper JSON arrays with named keys for each element
  • Clear separation of attributes, values, and metadata
  • Machine-readable without custom parsing logic
  • Negative attribute numbers represent expressions in the statistics object (e.g., -1 is the first expression)

Input Function Implementation

The new input functions use PostgreSQL's JSON parser infrastructure with a custom semantic action handler. Here's a simplified view of the parsing state machine for pg_ndistinct:

typedef enum
{
    NDIST_EXPECT_START = 0,
    NDIST_EXPECT_ITEM,
    NDIST_EXPECT_KEY,
    NDIST_EXPECT_ATTNUM_LIST,
    NDIST_EXPECT_ATTNUM,
    NDIST_EXPECT_NDISTINCT,
    NDIST_EXPECT_COMPLETE
} ndistinctSemanticState;

The parser validates:

  • Proper JSON structure (array of objects)
  • Required keys (attributes and ndistinct for ndistinct statistics)
  • Attribute numbers within valid ranges (positive for columns, negative for expressions, but not beyond STATS_MAX_DIMENSIONS)
  • No duplicate attributes within a single item
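Driving such a state machine is straightforward: each JSON event either advances the state or fails. The sketch below is illustrative only; the event names and transitions are ours, not the patch's, and the real parser also enforces required keys and duplicate checks:

```c
/* Toy semantic state machine for an array of
 * {"attributes": [...], "ndistinct": n} objects. */
typedef enum { EV_ARRAY_START, EV_OBJECT_START, EV_KEY_ATTRIBUTES,
               EV_KEY_NDISTINCT, EV_NUMBER, EV_ARRAY_END,
               EV_OBJECT_END } Event;

typedef enum { EXPECT_START, EXPECT_ITEM, EXPECT_KEY, EXPECT_ATTNUM_LIST,
               EXPECT_ATTNUM, EXPECT_NDISTINCT, COMPLETE, FAILED } State;

static State step(State s, Event e)
{
    switch (s)
    {
        case EXPECT_START:
            return e == EV_ARRAY_START ? EXPECT_ITEM : FAILED;
        case EXPECT_ITEM:
            return e == EV_OBJECT_START ? EXPECT_KEY
                 : e == EV_ARRAY_END    ? COMPLETE : FAILED;
        case EXPECT_KEY:
            return e == EV_KEY_ATTRIBUTES ? EXPECT_ATTNUM_LIST
                 : e == EV_KEY_NDISTINCT  ? EXPECT_NDISTINCT
                 : e == EV_OBJECT_END     ? EXPECT_ITEM : FAILED;
        case EXPECT_ATTNUM_LIST:
            return e == EV_ARRAY_START ? EXPECT_ATTNUM : FAILED;
        case EXPECT_ATTNUM:
            return e == EV_NUMBER    ? EXPECT_ATTNUM
                 : e == EV_ARRAY_END ? EXPECT_KEY : FAILED;
        case EXPECT_NDISTINCT:
            return e == EV_NUMBER ? EXPECT_KEY : FAILED;
        default:
            return FAILED;             /* COMPLETE and FAILED absorb */
    }
}

static int accepts(const Event *evs, int n)
{
    State s = EXPECT_START;
    for (int i = 0; i < n; i++)
        s = step(s, evs[i]);
    return s == COMPLETE;
}

/* [{"attributes": [2, 3], "ndistinct": 4}] as an event stream. */
static int demo_valid(void)
{
    Event evs[] = { EV_ARRAY_START, EV_OBJECT_START, EV_KEY_ATTRIBUTES,
                    EV_ARRAY_START, EV_NUMBER, EV_NUMBER, EV_ARRAY_END,
                    EV_KEY_NDISTINCT, EV_NUMBER, EV_OBJECT_END,
                    EV_ARRAY_END };
    return accepts(evs, 11);
}

static int demo_invalid(void)
{
    Event evs[] = { EV_NUMBER };    /* a bare number is not an array */
    return accepts(evs, 1);
}
```

The real implementation hangs the same kind of transitions off the JSON parser's semantic action callbacks and reports failures through ErrorSaveContext rather than a boolean.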

Extended Statistics Functions

The patch introduces three main SQL functions:

pg_restore_extended_stats() — Imports extended statistics from a previously exported value:

SELECT pg_restore_extended_stats(
    'public',                    -- relation schema
    'my_table',                  -- relation name
    'public',                    -- statistics schema  
    'my_stats',                  -- statistics name
    false,                       -- inherited
    '{"version": ..., "ndistinct": [...], "dependencies": [...], "mcv": [...], "exprs": [...]}'::text
);

pg_clear_extended_stats() — Removes extended statistics data from pg_statistic_ext_data:

SELECT pg_clear_extended_stats(
    'public',        -- statistics schema
    'my_stats',      -- statistics name
    false            -- inherited
);

The functions follow the same patterns established for relation/attribute statistics:

  • Return boolean indicating success
  • Issue WARNING (not ERROR) on problems to avoid breaking pg_restore scripts
  • Require MAINTAIN privilege on the target relation

Validation and Safety

The implementation includes careful validation:

  1. Attribute bounds checking: Positive attnums must exist in stxkeys, negative attnums must not exceed the number of expressions
  2. Combination completeness: For pg_ndistinct, all N-choose-K combinations must be present based on the longest attribute list
  3. Soft error handling: Uses PostgreSQL's ErrorSaveContext for safe error reporting without crashing

Example validation for attribute numbers:

if (attnum == 0 || attnum < (0 - STATS_MAX_DIMENSIONS))
{
    errsave(parse->escontext,
            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
            errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
            errdetail("Invalid \"%s\" element: %d.",
                      PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    return JSON_SEM_ACTION_FAILED;
}

Community Insights

Key Discussion Points

Format Change Timing: Tomas Vondra initially suggested a more structured JSON format. The community recognized this was the last opportunity to change the format before a working input function locked in backward compatibility requirements.

Validation Scope: There was significant discussion about how much validation to perform:

  • Early patches had extensive checks for statistical consistency (e.g., MCV frequencies summing to 1.0)
  • Reviewers pushed back, preferring minimal validation to avoid breaking legitimate but unusual imports
  • Final consensus: validate structure and attribute references, but not statistical values

pg_dependencies Special Case: Unlike pg_ndistinct which stores all combinations, pg_dependencies may omit statistically insignificant combinations. This means the input function cannot enforce complete combination coverage for dependencies.

Reviewer Feedback Integration

Michael Paquier provided extensive review and contributed significant improvements:

  • Restructured the patch series for cleaner commits
  • Split format changes from input function additions
  • Added comprehensive regression tests achieving >90% code coverage
  • Fixed compiler warnings on older GCC versions

Tom Lane caught style issues:

  • Error detail messages converted to complete sentences
  • Replaced SOFT_ERROR_OCCURRED() macro with direct state checks to avoid warnings

Current Status

As of January 2026, the patch series has progressed significantly:

Committed:

  • Output format changes for pg_ndistinct (new JSON array format)
  • Output format changes for pg_dependencies (new JSON array format)
  • Input functions for both types with comprehensive validation
  • pg_clear_extended_stats() function

In Review (v27):

  • pg_restore_extended_stats() function
  • pg_dump integration for extended statistics export/import

The pg_dump integration supports backward compatibility to PostgreSQL 10, with version-specific SQL generation to handle format differences.

Technical Details

Internal Storage Unchanged

Importantly, the internal binary storage format remains unchanged. The new input/output functions only affect the text representation. This means:

  • No catalog changes required
  • Existing data remains valid
  • Binary COPY operations unaffected

Expression Statistics Support

Extended statistics can include expressions (e.g., CREATE STATISTICS s ON (a + b), c FROM t). The implementation handles these via negative attribute numbers:

  • -1 = first expression
  • -2 = second expression
  • etc.

The exprs element in the restore format contains per-expression statistics similar to pg_statistic entries, enabling complete round-trip preservation.
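The negative-attnum convention maps directly to an index into the statistics object's expression list. A minimal sketch (the helper name is ours, not the patch's):

```c
/* Map a statistics attnum to an index into the expression list:
 * -1 => expression 0, -2 => expression 1, and so on. Non-negative
 * attnums refer to table columns and have no expression index. */
static int expr_index_for_attnum(int attnum)
{
    return attnum < 0 ? (-attnum) - 1 : -1;
}
```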

MCV List Handling

MCV (Most Common Values) lists for extended statistics are particularly complex, containing:

  • Value combinations across multiple columns
  • Frequency and base frequency arrays
  • Per-value null bitmaps

The implementation reuses infrastructure from attribute statistics import, with extensions for multi-column value arrays.

Conclusion

This patch series represents a significant enhancement to PostgreSQL's statistics infrastructure. By enabling import/export of extended statistics, it:

  1. Completes the statistics story started in PostgreSQL 18 for relation and attribute statistics
  2. Enables realistic testing with production-like statistics on sanitized schemas
  3. Improves upgrade reliability by preserving optimizer information across pg_upgrade

For DBAs and developers:

  • Extended statistics created with CREATE STATISTICS will now survive pg_dump/pg_restore
  • Query plan testing becomes more practical with --no-data dumps that include full statistics
  • The new JSON format is human-readable for debugging and hypothetical scenario testing

The target release is PostgreSQL 19, with the remaining restore function and pg_dump integration expected to land soon.

References

pg_plan_advice: A New Approach to PostgreSQL Query Plan Control

Introduction

PostgreSQL's query planner is sophisticated and generally produces excellent execution plans. However, experienced DBAs and developers occasionally encounter situations where they wish they could influence or stabilize the planner's decisions. Robert Haas from EnterpriseDB has been working on a significant new contrib module called pg_plan_advice that aims to address this long-standing need.

This article examines the pg_plan_advice thread on the pgsql-hackers mailing list, which has been actively discussed since October 2025.

What is pg_plan_advice?

pg_plan_advice is a proposed contrib module that introduces a special-purpose "advice mini-language" for controlling key planning decisions. The module can:

  • Generate advice strings from existing query plans using EXPLAIN (PLAN_ADVICE)
  • Apply advice strings via the pg_plan_advice.advice GUC parameter to reproduce or constrain future planning decisions

The advice language allows control over:

  • Join order: Which tables are joined in what sequence
  • Join methods: Nested loop, merge join, hash join
  • Scan types: Sequential scan, index scan (with specific index selection)
  • Parallelism: Where and how parallel execution is used
  • Partitionwise joins: How partitioned table joins are handled

Key Design Philosophy

Robert Haas emphasizes in the README that the principal use case is not about users "out-planning the planner" but rather about reproducing plans that worked well in the past:

"We don't need to accept the proposition that users can out-plan the planner. We only need to accept that they can tell good plans from bad plans better than the planner. That is a low bar to clear. The planner never finds out what happens when the plans that it generates are actually executed, but users do."

This positions pg_plan_advice as a plan stability tool rather than a hint system for micromanaging the optimizer.

Technical Architecture

The Relation Identifier System

One of the most innovative aspects of pg_plan_advice is its relation identifier system. This system provides unambiguous references to parts of a query, handling complex scenarios like:

  • Multiple references to the same table with different aliases
  • Subqueries and CTEs
  • Partitioned tables and their partitions

The identifier syntax uses special notation like t#2 to refer to the second occurrence of table t in a query, distinguishing it from the first.

Example Usage

Here's an example from Jakub Wartak's testing showing the power of the system:

-- Generate advice for a query with aliasing
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- Output includes:
-- Generated Plan Advice:
--   JOIN_ORDER(a#2 b#2 c)
--   MERGE_JOIN_PLAIN(b#2 c)
--   SEQ_SCAN(c)
--   INDEX_SCAN(a#2 public.t1_pkey)
--   NO_GATHER(c a#2 b#2)

You can then selectively apply constraints:

-- Force a specific scan type
SET pg_plan_advice.advice = 'SEQ_SCAN(b#2)';

-- Re-explain to see the new plan
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- The output shows:
-- Supplied Plan Advice:
--   SEQ_SCAN(b#2) /* matched */

Patch Structure (v10)

The implementation is split into five patches:

Patch  Description                                               Size
0001   Store information about range table flattening            7.8 KB
0002   Store information about elided nodes in the final plan    9.8 KB
0003   Store information about Append node consolidation         40.4 KB
0004   Allow for plugin control over path generation strategies  56.1 KB
0005   WIP: Add pg_plan_advice contrib module                    399.1 KB

The first four patches add necessary infrastructure to the planner, while the fifth contains the actual module. This separation allows the infrastructure to potentially benefit other extensions in the future.

Community Review and Testing

The thread has seen active participation from several community members:

Jakub Wartak (EDB)

Conducted extensive TPC-H benchmark testing and found several bugs:

  • Crashes in debug/ASAN builds with NULL pointer dereferences
  • Issues with semijoin uniqueness detection without statistics
  • Join order advice conflicts in complex queries

Jacob Champion (EDB)

Applied fuzzing techniques to discover edge cases:

  • Parser crashes with malformed advice strings
  • Issues with partition-related advice on non-partitioned tables
  • AST utility bugs revealed through corpus-based fuzzing

Other Contributors

  • Alastair Turner: Appreciated the ability to test alternative plans
  • Hannu Krosing (Google): Referenced VLDB research showing 20% of real-world queries have 10+ joins
  • Lukas Fittl: Interested in pg_stat_statements integration possibilities

Issues Discovered and Fixed

The collaborative review process has uncovered and fixed several issues across versions:

  1. Compiler warnings (gcc-13, clang-20) - Fixed in early versions
  2. NULL pointer crashes in pgpa_join_path_setup() when extension state wasn't allocated
  3. Join order conflict detection incorrectly treating join method advice as positive constraints
  4. Semijoin uniqueness tracking not working correctly without PLAN_ADVICE in EXPLAIN
  5. Partial match detection in nested join order specifications

Current Status

As of v10 (posted January 15, 2026):

  • The patch is registered in Commitfest
  • Still marked as WIP (Work In Progress)
  • Active testing continues, particularly with TPC-H queries
  • Robert Haas is seeking substantive code review, especially for patch 0001

Implications for PostgreSQL Users

If committed, pg_plan_advice would provide:

  1. Plan Stability: Capture and reproduce known-good query plans
  2. Debugging Aid: Understand why the planner makes specific choices
  3. Testing Tool: Experiment with alternative plan shapes without modifying queries
  4. Production Safety Net: Guard against unexpected plan regressions after statistics changes

Comparison with pg_hint_plan

Unlike the popular pg_hint_plan extension, pg_plan_advice focuses on round-trip safety:

  • Plans can be captured and reapplied reliably
  • The relation identifier system handles complex aliasing automatically
  • Designed to work with any query structure without manual identifier management

Conclusion

pg_plan_advice represents a significant step forward in PostgreSQL's planner extensibility story. Rather than replacing the optimizer's judgment, it provides a safety mechanism for preserving proven execution strategies. The active community review process has already improved the code substantially, and continued testing is helping ensure robustness.

For DBAs managing complex workloads, particularly those with queries that occasionally suffer from plan regressions, this module offers a promising solution that works with the planner rather than against it.


Thread Link: pg_plan_advice - pgsql-hackers

Commitfest Entry: CF 6184

Author: Robert Haas (EnterpriseDB)

Reviewers: Jakub Wartak, Jacob Champion, Alastair Turner, Hannu Krosing, John Naylor, and others

2026 Archive

PostgreSQL Weekly posts for 2026.

Weeks

Week 09 (2026)

PostgreSQL mailing list discussions for Week 09, 2026.

🇬🇧 English Version

Articles

More Speedups for Tuple Deformation: Precalculating attcacheoff

Introduction

Tuple deformation is the process of extracting individual attribute values from a PostgreSQL heap tuple's raw byte representation into a TupleTableSlot. It happens constantly during query execution: every time a sequential scan, index scan, or join produces a row, the executor must "deform" that tuple to access column values. For workloads that process millions of rows, even small improvements to the deforming hot path can yield significant gains.

David Rowley has been steadily optimizing tuple deformation. In PostgreSQL 18, he landed several patches: CompactAttribute (5983a4cff), faster offset aligning (db448ce5a), and inline deforming loops (58a359e58). Building on that work, he proposed precalculating attcacheoff rather than computing it on every attribute access. The discussion has since evolved through v10 (February 2026), with Andres Freund contributing a NULL-bitmap-to-isnull conversion that pushes Apple M2 speedups to 63% in some scenarios. The patch set remains under active review.

Why This Matters

When the executor needs a column value from a tuple, it must:

  1. Align: align the current offset according to the attribute's alignment requirement
  2. Fetch: read the value via fetch_att()
  3. Advance: skip past the current attribute to the next one

These steps form a dependency chain: each offset depends on the previous one, leaving little room for instruction-level parallelism. For fixed-width attributes, PostgreSQL can cache the offset (attcacheoff) to avoid recomputing alignment and length, but that caching was previously done inside the deforming loop. David's idea: compute it once when the TupleDesc is finalized, not once per tuple.

Technical Approach

TupleDescFinalize()

The core change introduces TupleDescFinalize(), which must be called after a TupleDesc has been created or modified. This function:

  1. Precalculates attcacheoff for all fixed-width attributes
  2. Records firstNonCachedOffAttr: the attnum of the first attribute whose offset cannot be cached (i.e., the first varlena or cstring attribute)
  3. Enables a tight loop that first processes all attributes with cached offsets, then falls back to attributes that need manual offset calculation

If a tuple contains a NULL before the last attribute with a cached offset, attcacheoff can only be used up to that NULL. But for tuples without early NULLs, the fast path can process a large number of attributes in one tight loop without any per-attribute offset arithmetic.

Specialized Deforming Loops

The patch adds a dedicated loop that processes all attributes with precalculated attcacheoff before entering the loop that handles varlena/cstring attributes. For tuples with HEAP_HASNULL set, the current code calls att_isnull() for every attribute. A further optimization: keep deforming without calling att_isnull() until the first NULL is reached. Scenario #5 in the benchmarks (first column int not null, last column int null) benefits most from this, and often shows the largest speedup.

An Optional OPTIMIZE_BYVAL Loop

An optional variant adds a loop for tuples where every deformed attribute has attbyval == true. In that case fetch_att() can be inlined without the branches that handle by-reference types, yielding fewer branches and a tighter loop. The cost: an extra attnum < firstByRefAttr check when the optimization does not apply. In the benchmarks, whether this optimization pays off varies by hardware and compiler.

Benchmark Design

To maximize the fraction of total CPU time spent on deforming, David designed benchmark queries of the form:

SELECT sum(a) FROM t1;

where column a is near the end of the table, so all preceding attributes must be deformed before a can be read. Eight table shapes cover combinations of the first column (int/text, null/not null) and the last column (int null/not null). Each table shape is run with 0, 10, 20, 30, and 40 extra INT NOT NULL columns, giving 40 scenarios per benchmark run, each over 1 million rows.

Benchmark Results

Results vary by hardware and compiler:

  • AMD Zen 2 (3990x) + GCC: average speedups up to 21% with OPTIMIZE_BYVAL enabled; some tests above 44%; no regressions.
  • AMD Zen 2 + Clang: small regressions in the 0-extra-column scenarios.
  • Apple M2: clear gains in scenarios #1 and #5; smaller gains elsewhere; slight regressions with some patches.
  • Intel (Azure): run on a shared, low-core-count instance; results were noisy because the L3 cache is shared with other workloads.

Patch Evolution

v1 → v3 (December 2025 – January 2026)

  • v1: three patches: 0001 (precalculate attcacheoff), 0002 (experimental NULL-bitmap lookahead), 0003 (remove the dedicated hasnulls loop)
  • v2: rebase, fix of a NULL-bitmap Assert in 0003, a JIT fix (remove TTS_FLAG_SLOW), more benchmarks
  • v3: rebase, drop 0002 and 0003 (limited benchmark gains), keep only 0001

v4 (January 2026)

In response to Chao Li's code review:

  • NULL bitmap mask (tupmacs.h): added a comment explaining that when natts & 7 == 0 the mask is 0 and the code correctly returns natts
  • Uninitialized TupleDesc: firstNonCachedOffAttr == 0 means no cached attributes; -1 means uninitialized. Added an Assert whose failure message suggests calling TupleDescFinalize()
  • Typo: "possibily" → "possibly"
  • LLVM: fixed a compiler warning
v5–v8 (January–February 2026): Andres Freund's NULL Bitmap Optimization

Andres Freund joined the discussion with a key improvement: instead of calling att_isnull() per column, compute the isnull[] array directly from the NULL bitmap using SWAR (SIMD Within A Register) techniques. The idea: multiply a byte of the bitmap by a carefully chosen constant (such as 0x204081) so that each bit is spread into its own byte, then mask. This avoids a 2KB lookup table and performs well on most hardware.
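The bit-spreading multiplication can be sketched in standalone C. This is a minimal illustration of the technique described above, not the patch's actual populate_isnull_array(); the function names are illustrative, and the byte extraction is done portably with shifts, whereas the real patch writes whole words (with a byte swap on big-endian machines):

```c
#include <assert.h>
#include <stdint.h>

/* Multiplying a 4-bit nibble by 0x204081 places copies of it at bit
 * offsets 0, 7, 14 and 21; masking with 0x01010101 then leaves bit N
 * of the nibble in byte N of the result. */
static uint32_t spread_nibble(uint32_t nibble)
{
    return (nibble * 0x204081u) & 0x01010101u;
}

/* Convert one NULL-bitmap byte (bit set = value present) into eight
 * 0/1 isnull flags, one per attribute. */
static void bitmap_byte_to_isnull(uint8_t bitmap_byte, uint8_t isnull[8])
{
    uint8_t nulls = (uint8_t) ~bitmap_byte;     /* invert: 1 = NULL */
    uint32_t lo = spread_nibble(nulls & 0x0F);  /* attributes 0-3 */
    uint32_t hi = spread_nibble(nulls >> 4);    /* attributes 4-7 */

    for (int i = 0; i < 4; i++)
    {
        isnull[i] = (uint8_t) (lo >> (8 * i));
        isnull[4 + i] = (uint8_t) (hi >> (8 * i));
    }
}
```

Because the four shifted copies of a nibble never overlap, the multiplication produces no carries, so the mask cleanly isolates one source bit per output byte.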

David implemented this in patch 0004 ("Various experimental changes"). Other changes in 0004 include:

  • populate_isnull_array(): bulk-converts the NULL bitmap into tts_isnull using the multiplication trick
  • tts_isnull sizing: rounded up to a multiple of 8 so the loop can write 8 bytes at a time (avoiding memset inlining issues)
  • t_hoff: for !hasnulls tuples, use MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) instead of t_hoff
  • fetch_att_noerr(): a new elog-free variant for the common attlen == 8 case

John Naylor pointed out that __builtin_ctz(~bits[bytenum]) is undefined when the byte is 255; David fixed it by widening first: pg_rightmost_one_pos32(~((uint32) bits[bytenum]))
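The hazard and the cast fix can be demonstrated directly, using __builtin_ctz in place of pg_rightmost_one_pos32 (the helper name below is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Find the first NULL attribute within one NULL-bitmap byte (bit set =
 * value present). Complementing the byte only AFTER widening it to
 * uint32 guarantees the upper 24 bits of the argument are set, so
 * __builtin_ctz never sees zero -- even for a byte of 255, where the
 * naive ~bits would be 0 and __builtin_ctz(0) is undefined behavior.
 * A result of 8 means "no NULL in this byte". */
static int first_null_in_byte(uint8_t bits)
{
    return __builtin_ctz(~((uint32_t) bits));
}
```

With the widened complement, a fully non-NULL byte (255) yields 0xFFFFFF00, whose trailing-zero count of 8 falls naturally out of range of the byte's eight attributes.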

With 0004 enabled, Apple M2 averaged 53% faster than master (about 63% excluding the 0-extra-column tests). Andres suggested using pg_nounroll and pg_novector pragmas to keep GCC from over-vectorizing populate_isnull_array(), which had been generating inefficient code.

v9 (February 24, 2026)

  • Patch reordering: deform_bench moved to 0001 so it can be benchmarked on master
  • 0004 (new): a sibling-call optimization for slot_getsomeattrs. Moving slot_getmissingattrs() into getsomeattrs() lets the compiler apply tail-call optimization, reducing overhead and improving the 0-extra-column tests
  • 0005 (new): shrink CompactAttribute from 16 bytes to 8. attcacheoff becomes int16 (max 2^15) and the booleans become bit flags. Andres noted that 8 bytes lets compilers address entries with a single LEA at scale factor 8, whereas 6 bytes would need two LEAs

v10 (February 25, 2026): Latest Patch Set

Based on the actual v10 patch contents:

0003 (Optimize tuple deformation)

  • firstNonCachedOffsetAttr: index of the first attribute without a cached offset
  • firstNonGuaranteedAttr: index of the first attribute that can be NULL, missing, or !attbyval. When deforming only up to this attribute, HeapTupleHeaderGetNatts(tup) need not be accessed, reducing CPU pipeline dependencies
  • TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS: an opt-in flag for the guaranteed-attribute optimization (some code deforms tuples before NOT NULL constraints have been checked)
  • populate_isnull_array(): uses SPREAD_BITS_MULTIPLIER_32 (0x204081) to spread each bit of the inverted NULL bitmap into its own byte; the low and high 4 bits are handled separately to avoid uint64 overflow
  • fetch_att_noerr(): an elog-free fetch_att() variant; safe when attlen comes from a CompactAttribute
  • first_null_attr(): finds the first NULL in the bitmap using pg_rightmost_one_pos32 / __builtin_ctz

0004 (sibling-call optimization)

  • getsomeattrs() is now responsible for calling slot_getmissingattrs()
  • slot_getmissingattrs(): replaced memset with a for loop (benchmarks showed the loop is faster)
  • slot_deform_heap_tuple(): calls slot_getmissingattrs() at the end when attnum < reqnatts; the parameter changed from natts to reqnatts

0005 (8-byte CompactAttribute)

  • attcacheoff becomes int16; offsets larger than PG_INT16_MAX are not cached
  • attispackable, atthasmissing, attisdropped, attgenerated become bit flags
  • Saving cattrs = tupleDesc->compact_attrs helps GCC generate better code (avoiding repeated TupleDescCompactAttr() calls)

Review Fixes

  • Amit Langote: fixed rebase noise (a duplicated attcacheoff check)
  • Zsolt Parragi: big-endian fix, adding pg_bswap64() before the memcpy in populate_isnull_array()
  • Typos: "benchmaring" → "benchmarking", "to info" → "into"
  • Andres: set *offp before slot_getmissingattrs to reduce stack spills; change attnum to size_t to fix GCC code generation under -fwrapv

deform_bench and Benchmark Infrastructure

Andres and Álvaro Herrera discussed where deform_bench should live: src/test/modules/benchmark_tools, src/benchmark/tuple_deform, or a single microbenchmark extension. Andres argued for incrementally merging useful tools rather than waiting for a complete suite. David prefers to finish the deforming patches first; deform_bench may be submitted separately.

Code Review: Chao Li's Feedback

Chao Li reviewed the patches and raised several points:

  1. NULL bitmap mask: add a comment explaining there is no overflow or out-of-bounds risk when natts & 7 == 0
  2. Uninitialized TupleDesc: initialize firstNonCachedOffAttr to -1 at TupleDesc creation; assert >= 0 in nocachegetattr()
  3. Consistent semantics: use 0 for "no cached attributes" and >0 for "cached"
  4. Typo: "possibily" → "possibly"

David addressed all of these in v4.

Current Status

  • v10 (February 2026) is the latest patch set: 0001 (deform_bench), 0002 (TupleDescFinalize stub), 0003 (main optimization), 0004 (sibling call + NULL bitmap → isnull), 0005 (8-byte CompactAttribute)
  • Andres Freund supports merging 0004, seeing clear gains; 0005's benefit is less certain (it helps LEA addressing when deforming few columns)
  • Zsolt Parragi (Percona), Álvaro Herrera, John Naylor, and Amit Langote continue to review
  • The placement of deform_bench (src/test/modules or src/benchmark) is still under discussion; David wants to land the optimization patches first

Conclusion

Precalculating attcacheoff in TupleDescFinalize() and using a dedicated tight loop for attributes with cached offsets yields substantial tuple deformation speedups on modern CPUs. The optimization works best when tuples have many fixed-width columns and NULLs are rare or appear late. Combined with Andres Freund's NULL-bitmap-to-isnull conversion (the "0x204081" SWAR trick), Apple M2 reaches 63% speedups once edge cases are excluded. The sibling-call optimization for slot_getsomeattrs further reduces overhead. Results vary by hardware and compiler; GCC may over-vectorize some loops, which can be mitigated with pragmas or by making loop indexes size_t. The patch set (v10) has been through multiple review rounds by Andres, John Naylor, Zsolt Parragi, Álvaro Herrera, Amit Langote, and others, and is moving toward integration.

References

Week 08 (2026)

PostgreSQL mailing list discussions for Week 08, 2026.

🇬🇧 English Version

Articles

Eliminating SPI from RI Triggers: A Fast Path for Foreign Key Checks

Introduction

Referential integrity (RI) triggers in PostgreSQL have traditionally executed SQL queries through SPI (the Server Programming Interface) to verify that newly inserted or updated rows in a referencing table have a matching row in the referenced (primary key) table. For bulk operations (large INSERTs or UPDATEs), this means building and tearing down a full execution plan for every row, with considerable overhead from ExecutorStart() and ExecutorEnd().

Amit Langote has been working on eliminating this overhead by replacing the SQL plan with direct index probes for RI checks. The latest iteration of this work, "Eliminating SPI / SQL from some RI triggers - take 3", speeds up bulk foreign key checks by up to 57% by bypassing the SPI executor and calling the index access method directly when the constraint's semantics allow it.

The patch set has evolved through several versions, with Junwang Zhao joining the effort in late 2025. The current direction is a hybrid fast path + fallback: direct index probes in simple cases, falling back to the existing SPI path when correctness depends on the executor's more complex behavior.

Why This Matters

Foreign key constraints are everywhere. Every INSERT or UPDATE on a referencing table fires an RI check verifying that each row has a match in the referenced table's primary key. Traditionally:

CREATE TABLE pk (a int PRIMARY KEY);
CREATE TABLE fk (a int REFERENCES pk);

INSERT INTO pk SELECT generate_series(1, 1000000);
INSERT INTO fk SELECT generate_series(1, 1000000);  -- 1 million RI checks

Each insert fires an RI check that:

  1. Builds a query plan to scan the primary key index
  2. Calls ExecutorStart() and ExecutorEnd()
  3. Executes the plan to find (or confirm the absence of) a matching row

The per-row plan setup and teardown dominates the total runtime. With Amit's v3 patches, the same bulk insert drops from about 1000 ms to about 432 ms (57% faster), by probing the primary key index directly without going through the executor.

Technical Background

The Traditional RI Path

The RI trigger functions in ri_triggers.c (such as RI_FKey_check) call ri_PerformCheck(), which:

  1. Builds a SQL string of the form SELECT 1 FROM pk WHERE pk.a = $1
  2. Executes it via SPI_prepare and SPI_execute_plan
  3. Lets the executor perform an index scan on the primary key, returning a row if the referenced value exists

This is correct in all scenarios (partitioned tables, temporal foreign keys, concurrent updates), but every row pays the full plan-execution cost.

The Fast Path Idea

For simple foreign keys (non-partitioned referenced table, no temporal semantics), the check boils down to: "probe the primary key index with this value; if a row is found and can be locked, the check passes". That can be done by:

  1. Opening the primary key relation and its unique index
  2. Building scan keys from the foreign key column values
  3. Calling index_getnext() (or an equivalent) to find the tuple
  4. Locking it with LockTupleKeyShare under the current snapshot

No SQL, no plan, no executor: just a direct index probe and a tuple lock.
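The probe-instead-of-plan idea can be illustrated with a toy model. The sorted array below stands in for the primary key index and binary search stands in for the index AM descent; all names are illustrative, and the real patch of course works against PostgreSQL's index access method with tuple locking, not arrays:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a fast-path FK probe: look up one referencing value
 * in a sorted "primary key index", with no per-row plan setup. */
static int pk_probe(const int *pk, size_t n, int key)
{
    size_t lo = 0, hi = n;

    while (lo < hi)                 /* classic binary search descent */
    {
        size_t mid = lo + (hi - lo) / 2;
        if (pk[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n && pk[lo] == key; /* 1 = referenced row exists */
}

/* Check a batch of FK values; returns the number of violations.
 * The probe cost is all there is -- no ExecutorStart()/ExecutorEnd()
 * analogue per row. */
static int check_fk_batch(const int *pk, size_t npk,
                          const int *fk, size_t nfk)
{
    int violations = 0;

    for (size_t i = 0; i < nfk; i++)
        if (!pk_probe(pk, npk, fk[i]))
            violations++;
    return violations;
}
```

The point of the sketch is the shape of the loop: each row costs only an index descent plus (in the real code) a tuple lock, which is what makes the per-statement caching in patch 0002 worthwhile.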

Patch Evolution

v1: The Original Approach (December 2024)

The initial patch set (3 patches) introduced:

  • 0001: Refactor the PartitionDesc interface to explicitly pass the snapshot needed for omit_detached visibility (of detach-pending partitions). This fixed a bug: under REPEATABLE READ, primary key lookups could return wrong results because RI lookups manipulate the ActiveSnapshot, which find_inheritance_children() relies on for detach-pending partition visibility.
  • 0002: Avoid SPI in the RI trigger functions, introducing the direct index probe path.
  • 0003: Avoid SQL queries for some RI checks, the main performance optimization.

Amit noted that temporal foreign key queries stay on the SPI path, since their plans involve range overlaps and aggregation that a simple index probe cannot handle. He also added logic equivalent to EvalPlanQual() to the fast path to handle concurrent updates correctly under READ COMMITTED.

v2: Junwang's Hybrid Fast Path (December 2025)

Junwang Zhao continued the work with a hybrid design:

  • 0001: Add a fast path for foreign key constraint checks, applicable when the referenced table is not partitioned and the constraint has no temporal semantics.
  • 0002: Cache fast-path metadata (operator hash entries, operator OIDs, strategy numbers, subtypes). At that point the metadata cache did not yet show a performance gain.

Benchmarks (1 million rows, numeric primary key / bigint foreign key):

  • master: INSERT 13.5s, UPDATE 15s
  • patched: INSERT 8.2s, UPDATE 10.1s

v3: Amit's Restructuring and Per-Statement Caching (February 2026)

Amit restructured Junwang's work into two patches:

  • 0001: A functionally complete fast path, covering concurrency handling, the REPEATABLE READ crosscheck, cross-type operators, security context (RLS/ACL), and metadata caching. The main logic lives in ri_FastPathCheck(); RI_FKey_check only branches and falls back to SPI when needed.

  • 0002: Per-statement resource caching. Rather than sharing an EState between trigger.c and ri_triggers.c, it introduces a new AfterTriggerBatchCallback mechanism invoked at the end of each trigger execution cycle. This allows the primary key relation, index, scan descriptor, and snapshot to be cached within a single cycle and reused across FK trigger invocations, instead of being opened and closed for every row.

Amit's benchmarks:

Scenario                 master     0001                  0001+0002
1M rows, numeric/bigint  2444 ms    1382 ms (43% faster)  1202 ms (51% faster)
1M rows, int/int         1000 ms    520 ms (48% faster)   432 ms (57% faster)

The extra gain from 0002 (roughly 13–17%) comes from eliminating the per-row relation open/close, scan start/end, and slot allocation/deallocation, and from replacing the per-row GetSnapshotData() with a cached snapshot copy.

Design: When to Use the Fast Path vs. SPI

The fast path applies when:

  • The referenced table is not partitioned
  • The constraint does not involve temporal semantics (range overlaps, range_agg(), etc.)
  • Multi-column keys, cross-type equality (via the index operator family), collation matching, and RLS/ACL are all handled within the fast path

It falls back to SPI when:

  1. Concurrent update or delete: if table_tuple_lock() reports that the target tuple was updated or deleted, delegate to SPI so that EvalPlanQual and the visibility rules handle it with the existing logic.
  2. Partitioned referenced table: probes would need to be routed to the correct partition via a PartitionDirectory; this can be supported in a follow-up patch.
  3. Temporal foreign keys: these use range overlap and containment semantics and inherently involve aggregation, so they stay on the SPI path.

Security behavior matches the existing SPI path: the fast path temporarily switches to the parent table's owner during the probe, using SECURITY_LOCAL_USERID_CHANGE | SECURITY_NOFORCE_RLS, consistent with ri_PerformCheck().

Future Directions

David Rowley suggested in off-list discussion that batching multiple FK values into a single index probe could improve performance further, leveraging PostgreSQL 17's ScalarArrayOp improvements for btree. The idea: buffer FK values across trigger invocations in a per-constraint cache, build an SK_SEARCHARRAY scan key, and let the btree AM scan the matching leaf pages in one ordered traversal rather than performing one tree descent per row. Locking and rechecking would remain per-tuple. This could be explored as a separate patch on top of the current series.

Current Status

  • The patch series lives under PG19-Drafts. Amit moved it there in October 2025; Junwang Zhao is continuing the work.
  • Amit's v3 patches (February 2026) are largely complete and awaiting review. Feedback is welcome, particularly on the concurrency handling in ri_LockPKTuple() and the snapshot lifetime in 0002.
  • Pavel Stehule offered to help with testing and review.

Conclusion

Eliminating SPI calls for simple foreign key checks delivers substantial performance gains for bulk operations. The hybrid fast path + fallback design addresses reviewers' correctness concerns: whenever correctness depends on the executor's more complex behavior, the code falls back to SPI. The per-statement resource caching in v3 goes further, amortizing relation and index setup costs across all the rows of a single trigger execution cycle.

For bulk inserts or updates on tables with many foreign keys, common in ETL, staging loads, and data migrations, this work promises significantly shorter runtimes. The current limitations (partitioned primary keys, temporal foreign keys) keep those cases on the existing path, optimizing the majority of FK workloads while preserving correctness.

References

Week 07 (2026)

PostgreSQL mailing list discussions for Week 07, 2026.

🇬🇧 English Version

Articles

SQL/PGQ: Bringing Graph Queries to PostgreSQL

Introduction

In February 2024, Peter Eisentraut announced a prototype implementation of SQL Property Graph Queries (SQL/PGQ) on the pgsql-hackers mailing list: a new way to run graph-style queries directly in PostgreSQL, following the SQL:2023 standard (ISO 9075-16). The proposal had been briefly discussed at the FOSDEM developer meeting, and community interest prompted Peter to clean up and share the work in progress.

Nearly two years later, the patch has grown into a substantial implementation: 118 files changed, roughly 14,800 lines of new code. Peter and Ashutosh Bapat are the main authors, with Junwang Zhao reviewing and Ajay Pal and Henson Choi testing. The latest version (v20260113) consolidates everything from v0 through v14 and beyond, including cyclic path patterns, access permissions, RLS support, graph element functions (LABELS(), PROPERTY_NAMES()), multi-pattern path matching, ECPG support, property collations, and pg_overexplain integration.

SQL/PGQ lets you define property graphs over existing relational tables and query them with path patterns (vertices connected by edges), in the style of Cypher or GQL. Unlike a dedicated graph database, this approach maps graphs onto the relational model: a graph is a view of tables, and graph queries are rewritten into joins and unions. The discussion raised important architectural questions about where and how this transformation should happen in PostgreSQL, and whether the rewriter is the right place for it.

Background and Significance

Many applications have naturally graph-shaped data: social networks, supply chains, recommendation systems, fraud detection. Today, developers typically either:

  • Use a separate graph database (such as Neo4j) and maintain two systems, or
  • Encode graph traversals in PostgreSQL with recursive CTEs and complex joins.

SQL/PGQ aims to give PostgreSQL users a standard, declarative way to express graph queries without leaving SQL or duplicating data. The standard has already been adopted elsewhere (for example in Oracle 23c); bringing it to PostgreSQL improves interoperability and makes graph capabilities available to a much wider audience.

Technical Analysis

The SQL/PGQ Model

A property graph in SQL/PGQ is a virtual structure defined over existing tables:

  • Vertices: rows from one or more tables (with optional labels).
  • Edges: relationships, typically inferred from foreign keys or specified explicitly.
  • Properties: columns from those tables.
  • Labels: can be shared across element tables (e.g., person for both customers and employees), with each label exposing its own set of properties.

Graphs are created with CREATE PROPERTY GRAPH and queried with GRAPH_TABLE(... MATCH ... COLUMNS ...). An example:

CREATE PROPERTY GRAPH myshop
    VERTEX TABLES (
        products LABEL product,
        customers LABEL customer,
        orders LABEL "order"
    )
    EDGE TABLES (
        order_items SOURCE orders DESTINATION products LABEL contains,
        customer_orders SOURCE customers DESTINATION orders LABEL has_placed
    );

SELECT customer_name FROM GRAPH_TABLE (myshop
  MATCH (c IS customer)-[IS has_placed]->(o IS "order" WHERE o.ordered_when = current_date)
  COLUMNS (c.name AS customer_name));

The MATCH clause describes a path pattern; the implementation rewrites it into joins and filters that PostgreSQL can plan and execute.

Implementation Approach: The Rewrite System

The implementation performs the graph-to-relational transformation in the rewrite system, the same stage as view expansion. Peter explained that this matches the SQL/PGQ specification: graphs map onto relations, and queries are expanded like view definitions. By the time the optimizer sees the query, the graph constructs have been expanded into standard relational form, which also keeps the feature consistent with view security (permissions, security barriers).

Patch Evolution: From v0 to v20260113

v0: A Fragile Prototype (February 2024)

The initial patch was about 332 KB. Peter described it as "quite fragile". It introduced CREATE PROPERTY GRAPH, GRAPH_TABLE, basic path patterns, and documentation in ddl.sgml and queries.sgml.

v1 and Early Refinement (June–August 2024)

Peter and Ashutosh posted v1 with a "reasonably complete minimal feature set". Ashutosh contributed:

  • WHERE clauses in graph patterns, e.g. MATCH (a)->(b)->(c) WHERE a.vname = c.vname.
  • A fix for false "column not found" errors: property names originally pointed into pg_attribute's heap tuple and could reference freed memory after RELCACHE_FORCE_RELEASE. The fix copies the attribute names out.
  • Compilation fixes, pgperltidy compliance, and error position reporting for implicit properties/labels.

Imran Zaheer added support for EXCEPT lists for labels and properties.

v14: Cyclic Paths, Access Permissions, RLS (August–October 2024)

Ashutosh contributed major features:

  • Cyclic path patterns: patterns where an element variable appears more than once in a path (e.g., the same vertex at both endpoints). Elements sharing a variable must have the same type and label expression; repeated edge patterns are not supported.
  • Access permissions for property graphs: property graphs behave like security-invoker views, so the current user must have privileges on the underlying tables. Queries succeed only for elements the user can access. Security-definer property graphs are not implemented.
  • Row-level security (RLS): regression tests in graph_table_rls.sql verify that RLS works with property graphs.
  • Property collations and edge-vertex links: properties with the same name must have the same collation across elements, and the collations of an edge's key columns must be compatible with the referenced vertex keys. Edge-vertex links are built with equality operators, with dependencies recorded so those operators cannot be dropped without dropping the edge.
  • \d and \dG variants: \d on a property graph shows elements, tables, kinds, and endpoint vertex aliases; \d+ adds labels and properties. \dG lists property graphs; \dG+ adds owner and description.

Henson Choi: LABELS(), PROPERTY_NAMES(), Multi-Pattern (December 2025)

Henson Choi contributed three patches:

  • The LABELS() graph element function: returns all labels of a graph element as text[]. It is implemented by wrapping element tables in subqueries that add a virtual __labels__ column, letting the optimizer prune Append branches when filtering by label (e.g., WHERE 'Person' = ANY(LABELS(v))).
  • The PROPERTY_NAMES() graph element function: returns all property names as text[], with the same pruning benefit for property-based filters.
  • Multi-pattern path matching: comma-separated path patterns in MATCH, such as MATCH (a)->(b), (b)->(c). Patterns sharing variables are merged into one join; disconnected patterns produce a Cartesian product (consistent with SQL/PGQ and Neo4j Cypher).

v20260113: The Consolidated Implementation (January 2026)

The latest patch (v20260113) merges the prior work into a single WIP patch:

  • ECPG support: SQL/PGQ in ECPG, covering basic queries, prepared statements, cursors, and dynamic queries. Label disjunction in ECPG required changes to the ecpg lexer.
  • pg_overexplain integration: property graph RTEs and RELKIND_PROPGRAPH are recognized in EXPLAIN (RANGE_TABLE, ...).
  • Expanded test coverage: create_property_graph.sql (365 lines), graph_table.sql (561 lines), graph_table_rls.sql (363 lines), privileges.sql (58 lines).
  • rewriteGraphTable.c: grew from roughly 420 to roughly 1,330 lines; propgraphcmds.c from roughly 1,000 to roughly 1,860 lines.

Community Discussion

Andres Freund: Concerns About the Rewriter

Andres Freund raised a structural concern: transforming via the rewrite system prevents the optimizer from exploiting graph semantics and deepens the dependency on the rewriter. Peter responded that PGQ is relational by design (akin to view expansion), and that both the standard and other implementations follow that model. Tomas Vondra asked whether graph structure should be preserved longer to enable graph-specific indexes or executor nodes; Ashutosh Bapat noted that many optimizations improve the underlying joins anyway, and that staying consistent with view expansion also matters for security.

Florents Tselai: Documentation

Florents Tselai suggested reordering the documentation: first answer "my data is modeled as a graph G=(V, E), can Postgres help?", then dive into implementation details, using runnable examples from graph_table.sql. He contrasted the approach with Apache Age's jsonpath-style syntax, but agreed that the standard's relational mapping is the right fit for PostgreSQL core.

Technical Details

Architecture

  • Parser: syntax for CREATE PROPERTY GRAPH, ALTER PROPERTY GRAPH, DROP PROPERTY GRAPH, and GRAPH_TABLE(... MATCH ... COLUMNS ...).
  • Catalogs: pg_propgraph_element, pg_propgraph_element_label, pg_propgraph_label, pg_propgraph_label_property, pg_propgraph_property.
  • Rewriter: rewriteGraphTable.c converts graph patterns into joins and unions.
  • Tooling: pg_dump, psql \d/\dG, tab completion; pg_get_propgraphdef() for introspection.
  • ECPG: full support in embedded SQL.

Access Control

  1. Users need SELECT privilege on the property graph.
  2. Property graphs are security-invoker: the current user must have privileges on the underlying tables. Queries succeed only for elements the user can access.
  3. Inside a security-definer view, access to the property graph uses the view owner's privileges; access to the underlying relations uses the invoker's privileges.
  4. Security-definer property graphs are not implemented (the standard does not cover them).

Property Collations and Edge-Vertex Links

  • Properties with the same name must have the same collation across elements.
  • With explicitly specified keys, the collations of the edge keys must match those of the vertex keys (links derived from foreign keys rely on the constraint itself).
  • Edge-vertex links use equality operators; edges depend on those operators, so the operators cannot be dropped independently.

Current Status

The v20260113 patch is a consolidated WIP. It includes:

  • Full CREATE PROPERTY GRAPH / ALTER / DROP with labels, properties, KEY clauses, SOURCE/DESTINATION REFERENCES
  • GRAPH_TABLE with path patterns, in-pattern WHERE, cyclic paths, multiple patterns
  • LABELS() and PROPERTY_NAMES() graph element functions
  • Access permissions, RLS, privilege tests
  • ECPG support and pg_overexplain integration
  • Documentation and regression tests

The patch has not been committed. Peter and Ashutosh continue to refine it; the rewriter-based design remains the current choice, and review and testing are ongoing.

Summary

SQL/PGQ would bring standardized graph query syntax to PostgreSQL. The implementation has grown from a fragile prototype into a fairly complete patch with cyclic paths, graph element functions, multi-pattern matching, access control, RLS, ECPG support, and substantial tests. The key architectural choice, rewriting graph queries into relational form in the rewrite system, aligns with the standard and with view semantics. If committed, PostgreSQL users will be able to express path patterns over relational data without leaving SQL or maintaining a separate graph database.

References

Reducing LEFT JOIN to ANTI JOIN: An Optimizer Improvement for "WHERE col IS NULL"

Introduction

In late December 2025, Nicolas Adenis-Lamarre proposed an optimizer improvement on the pgsql-hackers mailing list: when a query uses LEFT JOIN ... WHERE right_table.column IS NULL and the column is provably non-null (e.g., NOT NULL or a primary key), automatically recognize it as an anti-join. Such a query means "rows on the left with no match on the right", which is exactly what an anti-join expresses. Once the pattern is recognized, the optimizer can choose an explicit anti-join plan (such as a Hash Anti Join) instead of a plain left join plus filter, which often executes more efficiently.

The discussion drew in Tom Lane, David Rowley, Tender Wang, Richard Guo, and others. The patch went through several versions, was submitted to the CommitFest, and code review exposed correctness issues involving nested outer joins and inheritance. This article outlines the idea, the implementation, and the current status.

Why This Matters

Many developers write "find rows in A with no match in B" like this:

SELECT a.*
FROM a
LEFT JOIN b ON a.id = b.a_id
WHERE b.some_not_null_col IS NULL;

The LEFT JOIN makes unmatched rows from a carry NULL in all of b's columns. When b.some_not_null_col is NOT NULL on table b, filtering with WHERE b.some_not_null_col IS NULL leaves exactly those unmatched rows. Semantically this is an anti-join: "rows in A with no matching row in B".

If the optimizer does not recognize the pattern, it may execute a plain left join plus filter; if it does, it can use an explicit Hash Anti Join or similar and also pick better join orders. The optimization is "non-essential" in that an expert can rewrite the query as NOT EXISTS or NOT IN (minding NULL semantics), but automatic recognition benefits all users while keeping the original SQL readable.

Technical Background

PostgreSQL already reduces outer joins in some cases:

  • Commits 904f6a593 and e2debb643 added infrastructure the optimizer can use for such reductions.
  • In reduce_outer_joins_pass2, the optimizer already tries to reduce JOIN_LEFT to JOIN_ANTI when the join's own condition is strict for variables forced NULL by higher-level conditions.

An existing comment there notes that other ways of recognizing anti-joins exist, for example checking whether variables from the right-hand side are provably non-null due to table constraints (NOT NULL etc.). Nicolas's proposal and Tender's patch implement exactly that: use NOT NULL knowledge to reduce a LEFT JOIN to an ANTI JOIN when WHERE rhs_col IS NULL implies "no match".

Patch Evolution

Nicolas's Initial Patch

Nicolas's draft patch:

  • Recognized the pattern "left join b where x is null" when x is a non-nullable variable from the right-hand side (RTE).
  • Was deliberately a "quick implementation" to validate feasibility.

He also floated other ideas (removing redundant DISTINCT/GROUP BY, merging double ORDER BY, anti-joins for NOT IN, and a way to "view the rewritten query"), which were discussed briefly in the thread but are not the focus here.

Tom Lane and David Rowley

Tom Lane noted:

  • The optimization is reasonable and should use the new infrastructure (904f6a593, e2debb643).
  • A draft must not leave surrounding comments outdated; keeping comments accurate is mandatory.

David Rowley suggested:

  • Use find_relation_notnullatts() and compare against forced_null_vars, minding FirstLowInvalidHeapAttributeNumber.
  • Search the mailing list archives for the UniqueKeys history (relevant to redundant DISTINCT elimination).
  • Be cautious about "double ORDER elimination" and "NOT IN anti-joins"; both have been discussed before and have tricky edge cases.
  • "Viewing the rewritten query" is ill-defined; many optimizations can no longer be expressed as a single SQL statement.

Tender Wang's Implementation (v2–v4)

Tender Wang's patch:

  • Built on the 904f6a593 and e2debb643 infrastructure.
  • Updated the comment in reduce_outer_joins_pass2 to describe the new case: recognizing anti-joins via NOT NULL constraints on the right-hand side.
  • Added regression tests.

Nicolas then:

  • Confirmed (after retesting) that Tender's patch is correct.
  • Suggested an early exit: run the new logic only when forced_null_vars != NIL, avoiding calls to find_nonnullable_vars and have_var_is_notnull for the majority of left joins that have no forced-null variables.
  • Contributed additional regression tests using new tables (with NOT NULL constraints) rather than modifying existing test tables such as tenk1.

Tom Lane was explicit: shared test objects (such as the tables in test_setup.sql) must not be modified, as that could change planning behavior and affect other tests. New tests should use new tables or existing ones with suitable properties.

Tender merged Nicolas's early exit and regression tests into a single v4 patch and submitted it to the CommitFest.

Richard Guo's Review: Correctness Issues

Richard Guo's review of v4 found two correctness problems.

1. Nested Outer Joins

When the right-hand side of a left join itself contains an outer join, a column can still be NULL in the join result even if it is NOT NULL on its base table. Reducing the outer join to an anti-join in that case is wrong.

For example (tables t1, t2, t3, each with columns like (a NOT NULL, b, c)):

EXPLAIN (COSTS OFF)
SELECT * FROM t1
LEFT JOIN (t2 LEFT JOIN t3 ON t2.c = t3.c) ON t1.b = t2.b
WHERE t3.a IS NULL;

这里 t3.at3 上为 NOT NULL,但由于内层 t2 LEFT JOIN t3,来自 t1 的一行在与子查询连接后仍可能使 t3.a 为 NULL(当在 t3 中无匹配时)。因此上层连接必须保持为左连接;若错误地转为反连接,会错误地丢弃行。

补丁在判断“非空列”时没有考虑该变量是否会被下层外连接变为 NULL。Richard 指出当前 forced_null_vars 中并未记录 varnullingrels,一种简单修复是仅当右侧无外连接(right_state->contains_outer 为 false)时做此优化,但这会过于保守。

他提出的方向是:在 reduce_outer_joins_pass1_state 中记录每个子树下可为空的基表 relid;在检查 NOT NULL 约束时,跳过来自这些 rel 的变量。他附上了 v5 补丁以说明该思路。

2. 继承

对继承父表而言,某些子表可能在某列上有 NOT NULL,而其他子表没有。补丁未考虑这种情况;相比嵌套外连接,这一点相对容易修复。

Other Discussion

  • Pavel Stehule reminded participants to avoid top-posting on the list; the PostgreSQL wiki has mailing list style guidance.
  • Constants in subqueries: Nicolas mentioned that cases like SELECT * FROM a LEFT JOIN (SELECT 1 AS const1 FROM b) x WHERE x.const1 IS NULL are not handled, and he did not consider them worth special-casing.

Current Status

  • The v4 patch (with early exit and regression tests) has been submitted to the CommitFest (patch 6375).
  • Richard Guo's v5 addresses the nested outer join and inheritance issues by recording nullable base rels and tightening the conditions for using NOT NULL.
  • As of this discussion the work is ongoing; whether and in what form it lands will be decided on the mailing list and in the CommitFest.

Summary

Automatically reducing LEFT JOIN ... WHERE rhs_not_null_col IS NULL to an anti-join, when the right-hand column is provably non-null, is a useful optimizer improvement that gains performance without requiring users to rewrite SQL. The patch progressed from a draft to an implementation on the existing infrastructure, with regression tests and an early exit. Review feedback identified an important correctness constraint: the right-hand side may contain nested outer joins or inheritance, so the NOT NULL-based reduction is valid only when the variable cannot be nulled by a lower join or by inheritance. Follow-up work centers on Richard's approach (recording nullable base rels and restricting the NOT NULL check) and on safe handling of inheritance.

References

Week 06 (2026)

PostgreSQL mailing list discussions for Week 06, 2026.

🇬🇧 English Version

Articles

A JSON Format for COPY TO: Native JSON Export for PostgreSQL

Introduction

In November 2023, Davin Shearer asked on pgsql-general how to export JSON from PostgreSQL to a file with COPY TO. When he used COPY TO on a query producing a single JSON column (e.g., json_agg(row_to_json(t))), the text format escaped the content again: the double quotes inside the JSON gained an extra layer of escaping, yielding output that was not valid JSON and could not be parsed by tools like jq. The community consensus: the right fix is a native JSON format for COPY TO, so that a single JSON column (or a whole row rendered as one JSON object) is written out as valid JSON, without an extra layer of text/CSV escaping.

Building on that, and on Joe Conway's earlier COPY/JSON discussion, a series of patches was submitted. The final design adds a FORMAT json option to COPY TO, plus an optional FORCE_ARRAY that wraps the entire output in a single JSON array. The thread spans many versions (v8 through v23), with Tom Lane, Joe Conway, Alvaro Herrera, Joel Jacobson, Jian He, Junwang Zhao, and others participating. This article summarizes the discussion, the implementation, and the current status.

Why This Matters

  • Correct JSON export: today, exporting query results as JSON from within the server generally means COPY TO in text or csv format. The text format treats the result as an ordinary string and escapes quotes and backslashes, corrupting the JSON. A dedicated JSON format emits one JSON object per row (or one JSON value per column) with correct escaping, so the output is valid JSON.
  • Pipeline integration: many pipelines expect JSON (one JSON object per line, or a single JSON array). Native COPY TO ... (FORMAT json) with FORCE_ARRAY completes the export inside the database, with no client-side reformatting or workarounds like psql -t -A or the LO API.
  • Consistency with existing formats: COPY already has text, csv, and binary; adding json keeps the same model of "pick a format, get correct encoding".

Technical Analysis

Design Decisions

The patches settle on:

  1. The JSON format is COPY TO only. JSON for COPY FROM is not supported (parsing arbitrary JSON is a much larger feature). The grammar and option validation reject FORMAT json on COPY FROM.
  2. No HEADER with JSON. Documentation and code forbid HEADER in JSON format, avoiding a header line before the JSON rows/array.
  3. One column at the protocol level. In JSON mode, the frontend/backend Copy protocol advertises a single (non-binary) column; each row is rendered as one JSON value (e.g., one object per row).
  4. FORCE_ARRAY is valid only with JSON. FORCE_ARRAY wraps the whole COPY output in [ ... ] and inserts commas between rows, producing a single JSON array; it is legal only together with FORMAT json.

Patch Structure

  • Patch 1 (since v13): CopyFormat refactoring Joel Jacobson introduced a CopyFormat enum (COPY_FORMAT_TEXT, COPY_FORMAT_CSV, COPY_FORMAT_BINARY), replacing the two booleans csv_mode and binary in CopyFormatOptions with a single format field, easing the addition of new formats (such as JSON). jian he later reworked it per review, and Junwang Zhao adapted it to the executor's new CopyToRoutine.

  • Patch 2: JSON format for COPY TO

    • Grammar (gram.y): add JSON as a format option and allow FORMAT json in COPY options.
    • Options (copy.c, copy.h): since v13 the format is a CopyFormat, with JSON as COPY_FORMAT_JSON; validation is unchanged: no HEADER/default/null/delimiter with JSON, and no JSON on COPY FROM.
    • Copy protocol (copyto.c): in SendCopyBegin, JSON mode sends a single column with format 0 (text) instead of per-column formats.
    • Row output (copyto.c): in CopyOneRowTo, if json_mode is set, the whole row is converted to a JSON string via composite_to_json() (from utils/adt/json.c) and sent. For query-based COPY (no relation), the patch ensures the slot's tuple descriptor matches the query so composite_to_json generates key names from the correct attribute metadata.
    • json.c: composite_to_json() changes from static to exported, declared in utils/json.h for COPY to call.
  • Patch 3: FORCE_ARRAY for COPY TO

    • Options (copy.c, copy.h): add force_array, parsing force_array / force_array true|false. Validation: FORCE_ARRAY is allowed only in JSON mode (since v12 using ERRCODE_INVALID_PARAMETER_VALUE).
    • Output (copyto.c): before the first row, if JSON mode and force_array, send [ and a newline; between rows, send , before each JSON object (tracked with json_row_delim_needed); after the last row, send ] and a newline. The default (without FORCE_ARRAY) remains one JSON object per row.

Version History: v8 to v23

  • v8 took a bigger approach: extracting the COPY TO/FROM format implementations behind a pluggable mechanism (including a contrib module, pg_copy_json). Review favored a smaller, direct in-core change.
  • v9–v10 narrowed the scope to just a JSON format for COPY TO (no pluggable API). v10 introduced json_mode and adopted composite_to_json.
  • v11 added the FORCE_ARRAY option with tests, corrected the "COPY FROM with json" error code, and tightened option validation.
  • v12 (August 2024): reposted only patch 2 (FORCE_ARRAY); the error code for FORCE_ARRAY outside JSON mode became ERRCODE_INVALID_PARAMETER_VALUE.
  • v13 (October 2024): Joel Jacobson contributed 0001, introducing the CopyFormat enum and replacing csv_mode and binary with a single format field in CopyFormatOptions. 0002 (json format) and 0003 (force_array) were rebased on top; the documentation clarifies that JSON cannot be combined with header, default, null, or delimiter.
  • v14–v22: mostly rebases and upstream adaptation. v14 dropped the separate CopyFormat piece in some postings (different rebase bases). Junwang Zhao (v15, March 2025) adapted to the new CopyToRoutine structure (commit 2e4127b6d2). v16–v22 continued rebasing and responding to review; the core design was unchanged.
  • v23 (January 2026): the current series. Three pieces: (1) the CopyFormat refactoring (originally by Joel Jacobson, reworked by jian he), (2) json format for COPY TO (Author: Joe Conway; Reviewed-by includes Andrey M. Borodin, Dean Rasheed, Daniel Verite, Andrew Dunstan, Davin Shearer, Masahiko Sawada, Alvaro Herrera, and others), (3) FORCE_ARRAY. The feature set is unchanged; the patches are rebased and well reviewed.

Code Highlights

Row-to-JSON (patch 2): each row is converted to one JSON object via the existing composite_to_json():

rowdata = ExecFetchSlotHeapTupleDatum(slot);
result = makeStringInfo();
composite_to_json(rowdata, result, false);
CopySendData(cstate, result->data, result->len);

FORCE_ARRAY framing (patch 3): send [ before the row loop; from the second row on, send , before each object; send ] after the loop:

if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, '[');
    CopySendEndOfRow(cstate);
}
// ... row loop: no comma before the first row; afterwards CopySendChar(cstate, ','); then the object ...
if (cstate->opts.json_mode && cstate->opts.force_array)
{
    CopySendChar(cstate, ']');
    CopySendEndOfRow(cstate);
}

Usage Examples (from the regression tests)

COPY copytest TO STDOUT (FORMAT json);
-- One JSON object per row.

COPY copytest TO STDOUT (FORMAT json, force_array true);
-- A single JSON array: [ {"col":1,...}, {"col":2,...} ]

Community Discussion Highlights

The Original Problem and Workarounds

Davin's original problem was the COPY text format double-escaping JSON. David G. Johnston and Adrian Klaver suggested writing query results to a file with psql instead of COPY. Dominique Devienne and David G. Johnston both felt that a "no formatting" or raw output option for COPY would let a single exported column (such as JSON) be written verbatim. Tom Lane agreed that such an option would address the need. This consensus led to a dedicated JSON format for COPY TO rather than patching text/CSV behavior.

Review Feedback and Adjustments

  • Tom Lane and others stressed keeping the change small: add JSON output for COPY TO only, with no large refactoring. This drove the contraction from v8's pluggable-format approach to the current in-core JSON path.
  • Joe Conway had discussed COPY and JSON in another thread; jian he's patches referenced that discussion and align with the idea of first-class JSON output for COPY TO.
  • Alvaro Herrera and others also weighed in; the narrowing from v8 to v9/v10 reflects the preference for a smaller, more focused patch set.

Edge Cases in the Patches

  • Query-based COPY: when the source is a query (no table), the slot's tuple descriptor may not match the query's. The patch syncs the query's attribute metadata into the slot's tuple descriptor so composite_to_json produces correct key names.
  • Protocol: in JSON mode the Copy protocol sends a single column; the backend still produces one JSON "value" per row (an object, or an array element under FORCE_ARRAY).

Technical Details

Implementation Notes

  • Escaping: JSON string escaping is done by composite_to_json() and the existing escape_json machinery in json.c; quotes, backslashes, and control characters in column values are encoded correctly.
  • HEADER: explicitly incompatible with JSON, guaranteeing the output is either pure JSON rows or a single JSON array.
  • FORCE_ARRAY output: the regression tests show that with force_array the output is [, a newline, the first object; then , before each subsequent object; finally ]. The result is a valid JSON array (whitespace such as newlines between elements is allowed).

Current Limitations

  • COPY FROM: no JSON import; only COPY TO is extended.
  • HEADER: not supported in JSON mode.
  • Binary: the JSON format is text only; this patch does not cover binary JSON.

Current Status

  • The thread runs from 2023 into early 2026. The current series is v23 (January 2026): three pieces (the CopyFormat refactoring, json format for COPY TO, FORCE_ARRAY).
  • As of this snapshot of the thread, the patches are not committed and remain under discussion; v23 represents the current design and review state.
  • Design: COPY TO ... (FORMAT json), with optional (FORMAT json, force_array true) for a single JSON array; JSON is incompatible with HEADER, DEFAULT, NULL, DELIMITER, and COPY FROM.

Summary

The "Emitting JSON to file using COPY TO" thread began with a user hitting double escaping when exporting JSON via COPY. The community agreed that COPY TO should have a native JSON format. jian he (and later Junwang Zhao) implemented FORMAT json (COPY TO only; incompatible with HEADER, DEFAULT, NULL, DELIMITER) and FORCE_ARRAY, reusing composite_to_json() and the existing JSON escaping. Joel Jacobson's CopyFormat refactoring (since v13) replaced the format booleans with an enum, laying the groundwork for JSON and future formats.

References

Week 05 (2026)

PostgreSQL mailing list discussions for Week 05, 2026.

🇬🇧 English Version

Articles

GOO: A Greedy Join Order Search Algorithm for Large Join Problems

Introduction

PostgreSQL picks its join order strategy based on query complexity: with fewer relations than geqo_threshold (default 12) it uses dynamic programming (DP) to find the optimal join order; for larger join graphs it switches to GEQO (the genetic query optimizer), which runs a genetic-algorithm heuristic search over the join order space. GEQO has known problems: at moderate join counts its planning time can be slower than DP's, and it lacks tuning knobs such as a reproducible seed.

In December 2025, Chengpeng Yan proposed GOO (Greedy Operator Ordering) on the pgsql-hackers mailing list: a deterministic greedy join order search intended to replace GEQO for large joins. The algorithm is based on Leonidas Fegaras's 1998 DEXA paper, "A New Heuristic for Optimizing Large Queries".

Why It Is Interesting

  • Planning time: on star/snowflake and TPC-DS-style workloads, GOO plans noticeably faster than GEQO (in one test, running EXPLAIN on the 99 TPC-DS queries took about 5s with GOO versus about 20s with GEQO), while DP remains fastest below the threshold.
  • Plan quality: the goal is "good enough" in the scenarios where GEQO is used today, with fewer extreme regressions and more predictable behavior.
  • Memory: GOO's iterative, incremental merging should in theory use less memory than full DP, and it behaves differently from GEQO; the author plans to add measurements.

Following this discussion helps track where PostgreSQL is heading for many-table joins, and how GOO is positioned relative to DP and GEQO.

技术分析

GOO 算法原理

GOO 以增量方式构建连接顺序:

  1. 初始时每个基表对应一个「团块」(clump, geqo 中的概念)。
  2. 每一步在所有合法连接对(满足查询连接约束的团块对)中评估。
  3. 对每一对构造连接关系,用规划器现有代价模型得到总代价。
  4. 选择估计总代价最低的一对,合并为一个团块,重复上述过程。
  5. 直到只剩一个团块,其最优路径即为最终计划。

因此「贪心」体现在:在每一步,在所有当前团块中,只做当前看起来代价最小的连接。原论文按估计结果大小排序;补丁中为与现有规划器一致,采用规划器的 total_cost。

复杂度:时间为 O(n³),n 为基表数量:共 (n−1) 次合并,每步约 O(k²) 个团块对。
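这一贪心合并循环可以用下面脱离规划器的简化 C 草图来演示(以行数之积近似连接代价、固定选择性 0.001,均为本文假设;真实补丁使用规划器的 total_cost 与路径基础设施):

```c
/* 每个团块只保留估计输出行数;每步枚举所有团块对,
 * 合并"代价"最小的一对,直到只剩一个团块。 */
static double
greedy_goo(double rows[], int n)
{
    double total_cost = 0.0;

    while (n > 1)
    {
        int     bi = 0, bj = 1;
        double  best = rows[0] * rows[1];

        /* 每一步在所有团块对中选当前代价最小的一对 */
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (rows[i] * rows[j] < best)
                {
                    best = rows[i] * rows[j];
                    bi = i;
                    bj = j;
                }
        total_cost += best;         /* 累计各步连接代价 */
        rows[bi] = best * 0.001;    /* 假设选择性 0.001:合并后团块的行数 */
        rows[bj] = rows[n - 1];     /* 用末尾团块覆盖被合并掉的那个 */
        n--;
    }
    return total_cost;
}
```

(n−1) 次合并、每次 O(k²) 次配对评估,正对应正文中的 O(n³) 复杂度。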

与规划器的集成

补丁引入:

  • enable_goo_join_search:GUC,用于开启 GOO(默认关闭)。
  • 阈值:目前复用 geqo_threshold;当连接层数 ≥ geqo_threshold 且开启 GOO 时,规划器调用 goo_join_search() 而非 GEQO。因此 GOO 的定位是替代 GEQO,不替代 DP。

allpaths.c 中的相关逻辑:

else if (enable_goo_join_search && levels_needed >= geqo_threshold)
    return goo_join_search(root, levels_needed, initial_rels);
else if (enable_geqo && levels_needed >= geqo_threshold)
    return geqo(root, levels_needed, initial_rels);

实验中的贪心策略

作者用不同指标作为「最便宜」的贪心依据做了对比:

  • cost:使用规划器对该连接的 total_cost(基线)。
  • result_size:使用估计输出大小(字节):reltarget->width * rows。
  • rows:使用估计输出行数。
  • selectivity:使用连接选择性(输出行数 / (左表行数 × 右表行数))。
  • combined:分别按 cost 和 result_size 各跑一次 GOO,再选最终估计代价更低的计划。

结论概括:

  • 仅用 cost 会出现严重长尾(如 JOB 上最大 431 倍、大量 ≥10 倍退化)。
  • result_size 平均更好,但仍有差的长尾(如 JOB 上最大 67 倍)。
  • combined(cost + result_size 各生成一计划再选更便宜的)在鲁棒性上最好:几何平均更优、JOB 上无 ≥10 倍退化、最坏约 8.68 倍。

因此讨论重点从「选哪种单一指标」转向「如何降低长尾风险」——例如通过多种贪心策略并行、再选更优计划。

社区讨论要点

基准测试说明

Dilip Kumar 最初质疑用 pgbench 测连接顺序是否合适。作者澄清:数据来自 Tomas Vondra 此前线程中的自定义星型/雪花型负载,而非默认 pgbench;这些负载包含多表连接且表为空,因此吞吐主要反映 规划时间(DP / GEQO / GOO 的差异)。

崩溃与「提前聚合」修复(v1 → v2)

Tomas Vondra 在 TPC-DS 查询上跑 EXPLAIN 时遇到崩溃,堆栈显示 sort_inner_and_outer() 中 inner_path = NULL。根因是:GOO 在某些路径下构造连接关系时未调用 set_cheapest(),导致后续代码读到空的 cheapest 路径。Tomas 进一步将问题缩小到 SELECT 列表中含聚合的查询(如 TPC-DS Q7)。

v2 的修复是正确处理 eager aggregation:规划器会生成需要正确 cheapest 路径的分组/基表关系。修复后,99 条 TPC-DS 查询均可正常完成规划。

GEQO 在中等连接数下比 DP 更慢

Tomas 给出了 99 条 TPC-DS 查询的 EXPLAIN 耗时(3 种 scale × 0/4 worker):

  • master (DP):8s
  • master/geqo:20s
  • master/goo:5s

说明在该负载下 GEQO 比 DP 慢,GOO 最快。John Naylor 和 Pavel Stehule 指出:GEQO 的设计本就是在连接规模足够大时才占优;在连接数较小或中等时,其固定开销可能占主导。因此评估应聚焦在真正会启用 GEQO 的范围(例如关系数超过 geqo_threshold)。

TPC-DS 执行结果(Tomas Vondra)

Tomas 分享了完整 TPC-DS 运行结果(scale 1 和 10,0 和 4 worker)。99 条查询总耗时概括:

  • Scale 1:GOO 比 master 和 GEQO 都慢(例如约 1124s vs GEQO 399s vs master 816s)。
  • Scale 10:GOO 更快(例如约 1859s vs GEQO 2325s vs master 2439s)。

说明 GOO 在小数据量下表现更差、在较大数据量下更好,行为与负载和规模相关。Tomas 建议分析变慢的查询以改进启发式,并增加更大数据量和冷缓存测试。

TPC-H:单一指标贪心的失效模式(v3)

作者在 TPC-H SF=1 上对比了四种策略:rows、selectivity、result_size、cost。主要结论:

  • Q20partsupp 与聚合后的 lineitem 子查询连接;行数估计严重偏低(估计几十行、实际数十万)。面向输出的规则(rows、selectivity、result_size)会因「看起来」极度收缩而非常早地选这个连接;实际却产生巨大中间结果并放大后续代价。估计错误会使面向输出的贪心规则表现很差。
  • Q7:基于 cost 的贪心选了一个局部很便宜的连接,却形成大的多对多中间结果,使后续连接代价激增。说明局部最优代价可能全局很差

Tomas 指出:Q20 本质是估计问题(垃圾进垃圾出);Q7 则是贪心算法固有的——局部最优导致全局次优,换单一指标无法从根本上解决。

JOB 与组合策略(v4)

在完整 JOB 负载上,combined 策略(cost + result_size 各跑一次,选代价更低的计划)表现如下:

  • 几何平均最优(相对 DP 为 0.953)。
  • 无 ≥10 倍退化,最大约 8.68 倍。
  • 比单独 GOO(cost) 或 GOO(result_size) 的尾部好很多。

通过两种贪心策略并行、再选更优计划,可以在几乎不增加规划成本的前提下减少灾难性计划。

定位:GOO 作为 GEQO 替代

Tomas 询问目标是替代 DP 还是 GEQO。作者明确:GOO 的定位是替代 GEQO,不替代 DP;在连接数低于 geqo_threshold 时仍应使用 DP。

文献与后续方向

Tomas 提到 CIDR 2021 的 "Simplicity Done Right for Join Ordering"(Hertzschuch 等),该工作强调鲁棒性(如最坏情况/上界连接顺序)和仅信任基表估计,对基数估计过于乐观导致的 nestloop 爆炸可能有参考价值。作者计划先用当前方案打好基线,再逐步吸收这类思路。

实现与细节

实现方式

  • 新增:src/backend/optimizer/path/goo.c 与 src/include/optimizer/goo.h。
  • GOO 通过反复调用现有规划接口(如 make_join_rel、路径生成)构建连接关系,因此与 DP/GEQO 共用同一代价模型和路径类型。
  • 使用多个内存上下文以在候选评估阶段控制内存占用。

边界与鲁棒性

  • Eager aggregation:v2 通过保证 GOO 生成的连接关系都正确设置 cheapest 路径,修复了 sort_inner_and_outer 遇到 NULL inner_path 的崩溃。
  • 基数估计错误:所有方法在估计严重偏差时都会受影响;GOO 对不同策略的敏感度不同(例如面向输出的规则在行数估计错误时可能更糟)。combined 策略的目标是降低长尾,而非从根本上修正估计。
  • 结构局限:某些图结构(如星型、扇出)会使 cost 和 result_size 都选到类似的差计划,这是「只看一步」的贪心枚举的固有局限。

性能相关

  • 规划时间:GOO 为 O(n³),在已报告的基准中通常快于 GEQO;作者计划补充规划时间和内存的显式测量。
  • 执行时间:高度依赖负载;GOO 相对 GEQO/DP 可能更好或更差(如 TPC-DS 的 scale 1 与 10、JOB 中涉及 GEQO 的子集)。

当前状态

  • 补丁:讨论截止时最新为 v4。v4-0001 为核心 GOO 实现(与 v3-0001 一致);v4-0002 为测试用 GUC 及多策略(如 combined)的实验框架。
  • 目标:将 GOO 确立为可行的 GEQO 替代——相同阈值下,计划质量和规划时间不逊于或优于 GEQO,并减少极端退化。
  • 后续(作者计划):在更多负载、更大连接图、冷缓存、更大 scale 上评估;考虑在组合策略中加入 selectivity;测量规划时间与内存;探索可调参数与渐进降级(如资源受限时先用 DP 再退化为贪心)。

小结

GOO 邮件线程展示了一种用确定性贪心连接顺序算法替代 GEQO 的完整尝试:

  • 复用现有代价模型与规划基础设施。
  • 在多个基准上缩短了规划时间(相对 GEQO)。
  • 通过组合多种贪心策略(如 cost 与 result_size)并选取更优计划,改善了最坏情况下的计划质量。

同时承认局限:贪心本质是局部决策,在估计错误或不利的连接图结构下仍可能产生差计划;当前重点放在鲁棒性与尾部行为而非单一指标的极致调优。对 PostgreSQL 用户而言,这是值得关注的补丁:若被采纳,将在复杂多表连接场景下提供除 GEQO 之外的另一种选择,并在可预测性和规划性能上可能带来改进。

参考

执行器批处理:面向批量的元组处理

引言

PostgreSQL 的执行器长期以来都是 逐元组(tuple-at-a-time) 的:每个计划节点通常向子节点要一个元组、处理后再向上返回一个结果元组。这种设计简单,在 OLTP 场景下表现良好,但在分析型和批量负载中,每元组的开销——尤其是重复的函数调用和表达式求值——往往会成为主要成本。在 PGConf.dev 2025 上,社区讨论了 PostgreSQL 是否能够支持一种 额外的批处理执行器,在节点之间传递 成批元组 而不是一次一个 slot。

在那次讨论以及 Andres Freund 和 David Rowley 的私下交流之后,Amit Langote 于 2025 年 9 月在 pgsql-hackers 上发布了题为「Batching in executor」的补丁系列。该系列引入了批处理表访问方法(Table AM)API,在执行器中增加了支持批处理的接口(ExecProcNodeBatch、TupleBatch),并原型化了面向批量的表达式求值(包括批量 qual 和聚合转换函数)。目标是降低每元组开销、为聚合函数中的 SIMD 等未来优化铺路,并为受益于批量执行的列存或压缩表 AM 打基础。

为何重要

  • 执行器开销:在 CPU 受限、IO 极少的负载下(例如全缓存表),大量时间消耗在执行器内部。批处理减少了进入表 AM 和表达式解释器的调用次数,并可通过一次对多行求值来削减函数调用开销。
  • 聚合与分析:批量转换求值(如 count(*)、sum()、avg())可以按批而非按行支付 fmgr 成本,并为向量化或 SIMD 友好路径打开空间。
  • 未来表 AM:批量执行器便于列存或压缩表 AM(如 Parquet 风格)以原生批量格式传递数据,而不必过早物化为堆元组。
  • OLTP 安全:设计上保留现有逐行路径不变;批处理为可选(例如通过 executor_batching GUC),因此 OLTP 负载不受影响。

理解本线程的内容有助于把握 PostgreSQL 如何可能增加一条面向批量的执行路径,以及社区在物化、ExprContext、EEOP 设计等方面正在权衡的内容。

技术分析

补丁结构

系列分为两部分:

  1. 0001–0003 — 基础:批处理表 AM API、heapam 批处理实现、与 SeqScan 对接的执行器批处理接口。
  2. 0004–0008 — 原型(WIP/PoC):支持批处理的 Agg 节点、TupleBatch 相关新 EEOP、批量 qual 求值、批量聚合转换(按行循环与「直接」批 fmgr)。

核心抽象

表 AM 批处理 API(0001) 新增回调允许表 AM 一次返回 多个元组 而非一个。对 heap 而言:

  • HeapBatch 保存单页内的元组;大小受 EXEC_BATCH_ROWS(当前 64)和「不跨页」限制。
  • heapgettup_pagemode_batch() 从当前页填充 HeapTupleData 数组,逻辑与 heapgettup_pagemode() 对应,但面向一批。可见性和扫描方向处理方式一致。

通用层在 tableam.h 中引入 batch 类型与操作,以便其他 AM 提供自己的批量格式与实现。
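「单页内填批」的循环结构可以用下面独立的 C 草图示意(FakeBatch、整型"元组"均为占位,真实实现填充的是 HeapTupleData 并要做可见性判断;EXEC_BATCH_ROWS 取补丁当前值 64):

```c
#define EXEC_BATCH_ROWS 64

typedef struct
{
    int nvalid;                     /* 批内有效元组数 */
    int items[EXEC_BATCH_ROWS];     /* 占位:真实实现保存 HeapTupleData */
} FakeBatch;

/* 从 start 开始填批:要么取满 EXEC_BATCH_ROWS 行,
 * 要么在页尾停下(批不跨页);返回下一次扫描的起点。 */
static int
fill_batch_from_page(const int *page, int page_ntuples,
                     int start, FakeBatch *batch)
{
    int i = start;

    batch->nvalid = 0;
    while (i < page_ntuples && batch->nvalid < EXEC_BATCH_ROWS)
        batch->items[batch->nvalid++] = page[i++];
    return i;
}
```

由此也能看出正文提到的限制:当页内剩余元组少于 64 时,批会提前结束,实际批大小小于 EXEC_BATCH_ROWS。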

执行器批处理路径(0002–0003)

  • TupleBatch 是批处理模式下在节点间传递的容器,可持有 AM 原生批(如堆元组)或物化后的 slot,视路径而定。
  • ExecProcNodeBatch() 对应 ExecProcNode():返回 TupleBatch* 而非 TupleTableSlot*;PlanState 增加 ExecProcNodeBatch 函数指针,沿用与逐行路径相同的「首次调用」与插桩包装。
  • SeqScan 获得:
    • 批量驱动的 slot 路径:仍每次返回一个 slot,但从内部批中填充,减少对 AM 的调用。
    • 批路径:当父节点支持批处理时,SeqScan 的 ExecProcNodeBatch 直接返回 TupleBatch(如通过 ExecSeqScanBatch*)。

因此前三个补丁提供:(1) 能产生批的表 AM;(2) 请求与传递批的执行器 API;(3) SeqScan 作为首个既能消费又能产生批的节点。

面向批量的表达式求值(0004–0008)

后续补丁尝试对 一批 行做表达式求值:

  • Agg 的批量输入:Agg 可通过 ExecProcNodeBatch() 从子节点拉取 TupleBatch,并成批喂入聚合转换函数。
  • 新 EEOP:表达式解释器增加针对 TupleBatch 的步骤——例如将属性取到批量向量、对一批求 qual、以及按行在解释器内循环(ROWLOOP)或按批调用转换函数(DIRECT)执行聚合转换。
  • 批量 qual 求值:一批元组可用单次遍历完成过滤(ExecQualBatch 及相关 EEOP),降低每行解释器和 fmgr 开销。

提供了两种批量聚合原型路径:一是在解释器内按行迭代(每行转换);二是每批调用一次转换函数(每批 fmgr)。在 Amit 的基准中,当执行器成本占主导时,后者收益更大。
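两条路径的调用开销差异可以用下面独立的 C 草图说明(sum_one/sum_batch 模拟「fmgr 风格」的转换函数调用,名称与结构均为本文示意):

```c
typedef struct
{
    int  nvalid;
    long vals[64];
} Batch;

static long ncalls;                 /* 统计"fmgr 风格"调用次数 */

/* ROWLOOP 风格:每行一次调用 */
static long
sum_one(long acc, long v)
{
    ncalls++;
    return acc + v;
}

/* DIRECT 风格:每批一次调用 */
static long
sum_batch(long acc, const long *v, int n)
{
    ncalls++;
    for (int i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

static long
agg_rowloop(const Batch *b, long acc)
{
    for (int i = 0; i < b->nvalid; i++)
        acc = sum_one(acc, b->vals[i]);
    return acc;
}

static long
agg_direct(const Batch *b, long acc)
{
    return sum_batch(acc, b->vals, b->nvalid);
}
```

两者结果相同,但一批 64 行时 ROWLOOP 付出 64 次调用、DIRECT 只付出 1 次——这正是「每批 fmgr」在执行器成本占主导时收益更大的原因。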

设计选择与未决点

  • 单页批:堆批限于一页,因此批大小可能小于 EXEC_BATCH_ROWS(例如每页元组少或 qual 选择性高)。线程中提到未来可改进:跨页批或扫描在批未满时继续要元组。
  • TupleBatch 与 ExprContext:补丁在 ExprContext 上扩展了 scan_batch、inner_batch、outer_batch。每批表达式求值仍使用 ecxt_per_tuple_memory,Amit 指出这在某种程度上滥用了「每元组」内存的契约;批作用域内存的更清晰模型仍待定义。
  • 物化:目前面向批的表达式求值通常作用在已物化到 slot(或堆元组数组)的元组上。长期目标是在 原生批格式(如列存或压缩)上做表达式求值而不强制物化;这需要更多基础设施(如 AM 控制的表达式求值或面向批的算子)。

社区观点

Tomas Vondra:批设计与索引预取

Tomas 将本补丁与 索引预取 工作(他参与其中)对比,后者也在索引 AM 与执行器之间引入「批」概念。他指出两种设计因目标不同而不同:

  • 索引预取:共享的批结构由索引 AM 填充,之后由 indexam.c 管理;批在此之后与 AM 无关。
  • 执行器批处理:每个表 AM 可产生自己的批格式(如 HeapBatch),包装在带 AM 特定操作的通用 TupleBatch 中。执行器保留 TAM 特定优化,并依赖 TAM 对批内容进行操作。

Amit 同意:执行器批处理旨在保留 TAM 特定行为并尽可能避免过早物化;预取则追求由 indexam 统一的批格式。两种设计都与各自目标一致。

Tomas 还问:(1) 何时必须将 TupleBatch 物化为通用格式(如 slot)?(2) 表达式能否直接在「自定义」批(如压缩/列存)上执行?Amit 回复说目前表达式求值仍需物化,但设计上不应阻碍未来在原生批数据上求值(如列存或 Parquet 风格)。给表 AM 更多控制「如何在其批数据上求值」是可能的后续扩展。

Tomas Vondra:TPC-H Q22 段错误与 v3 修复

Tomas 报告在启用批处理运行 TPC-H 时出现段错误,且仅出现在 Q22,堆栈始终指向同一处:numeric_avg_accum 收到 NULL 的 datum(DatumGetNumeric(X=0)),调用栈经由 ExecAggPlainTransBatch 与 agg_retrieve_direct_batch。因此问题在批量聚合路径:转换函数收到了本不应为 NULL 的 NULL。

Amit 将崩溃追溯到 表达式解释器。两个不同的 EEOP(分别对应 ROWLOOP 和 DIRECT 批量聚合路径)都调用了 同一个辅助函数。该辅助函数在运行时再次推导 opcode(如通过 ExecExprEvalOp(op))。在某些构建(如 macOS 上的 clang-17)中,这两个 EEOP 分支编译成相同代码,导致 分发标签地址相同。解释器按标签地址做反向查找时可能返回错误的 EEOP;初始化路径可能以为在执行 ROWLOOP EEOP,而执行路径却按 DIRECT EEOP 行为,导致状态错误和 NULL/崩溃。

v3 中的修复(补丁 0009)是 将共享辅助拆成两个函数,每个 EEOP 一个,这样辅助不再重新推导 opcode。修改后 Amit 在 macOS clang-17 上无法再复现崩溃。同一修复也解决了 Tomas 遇到的 TPC-H Q22 段错误。

Bruce Momjian:POSETTE 与 OLTP

Bruce 引用了 POSETTE 2025 的两场演讲做背景:一场讲数据仓库需求,一场讲 「Hacking Postgres Executor For Performance」。Amit(第二场演讲者)确认批处理设计上不会给 OLTP 路径增加明显开销;逐行路径仍是默认且未改动。

关闭批处理时的回归

Tomas 观察到在 关闭批处理executor_batching=off)时,打补丁的树可能比未打补丁的 master 更慢——即新代码路径未启用时存在回归。Amit 复现了该现象:例如单聚合 SELECT count(*) FROM bar 和多聚合 SELECT avg(a), … FROM bar 在关闭批处理时相比 master 有约 3–18% 的变慢,具体取决于行数和并行度。他承认回归并表示正在排查。确保在关闭批处理时零或极小成本是合入基础补丁的重要前提。

技术细节

实现要点

  • 批大小:EXEC_BATCH_ROWS 为 64。堆批还受单页限制,实际批大小可能更小(如 Amit 的 1000 万行测试表中每页约 43 行)。
  • 插桩ExecProcNodeBatch 使用与逐行路径相同的插桩钩子;批调用的「元组」数记为返回的 TupleBatch 的有效行数(b->nvalid),便于 EXPLAIN ANALYZE 等统计保持意义。
  • GUC:在 v4/v5 中 GUC 为 executor_batch_rows(0 = 关闭批处理;例如 64 = 批大小)。

边界与限制

  • 稀疏批:高选择性 qual 下,过滤后批内有效行可能很少。线程建议未来支持跨页批或扫描在批未满时继续填充。
  • ExprContext 与批生命周期:用 ecxt_per_tuple_memory 承担每批工作是目前的设计债;独立的批作用域分配器或上下文会更清晰。
  • 并行与嵌套 Agg:Tomas 崩溃的堆栈涉及并行 worker(Gather/GatherMerge)和嵌套聚合(如子计划上的 Agg)。NULL datum 问题出在该场景下使用的批量转换路径;v3 的 EEOP 辅助拆分从根因上修复,而非针对单条查询。

基准摘要(来自 Amit v1 邮件)

均在完全 VACUUM 的表、大 shared_buffers 且预热缓存下运行;时间单位为 ms,「off」= 批处理关,「on」= 批处理开;负 %diff 表示「on」更快。

  • 单聚合、无 WHERE(如 SELECT count(*) FROM bar_N):仅批量 SeqScan(0001–0003)约快 8–22%;加上批量 agg(0001–0007)在部分规模下约快 33–49%。
  • 单聚合、有 WHERE:批量 agg + 批量 qual(0001–0008)约快 31–40%。
  • 五聚合、无 WHERE:批量转换(每批 fmgr,0001–0007)约快 22–31%。
  • 五聚合、有 WHERE:批量转换 + 批量 qual(0001–0008)约快 18–32%。

因此在执行器占主导(IO 极少)时,批处理一致降低 CPU 时间,最大收益来自减少每行 fmgr 调用和对整批求 qual。

演进:v4 与 v5

后续修订在基础之上增加了可观测性与批量 qual 工作:

  • v4(2025 年 10 月):新增 EXPLAIN (BATCHES)(补丁 0003)用于展示元组批处理统计,对应此前「插桩」的待办项。Amit 报告在 v4 中 关闭批处理时的回归(相对未打补丁的 master)已不再出现——可能与移除 HeapScanData 中的多余字段以及避免混用编译器(gcc vs clang)比较有关。新基准使用 SELECT * FROM t LIMIT 1 OFFSET n;在 batch=64 下,无 WHERE 时约快 22–26%,WHERE a > 0 时约快 21–48%;变形开销大的情况(如对最后一列求 qual)收益较小。Daniil Davydov 审阅了堆批处理代码(如 SO_ALLOW_PAGEMODE 断言、heapgettup_pagemode_batch 逻辑与风格),Amit 在 v4 中已回应。

  • v5(2026 年 1 月):0001–0003 仍为核心(批表 AM API、SeqScan + TupleBatch、EXPLAIN BATCHES)。0004 增加 ExecQualBatch 用于批量 qual 求值(WIP);0005 将批量 qual 的 opcode 移到 专用解释器,使逐行路径(ExecInterpExpr)不被修改,从而在 executor_batch_rows=0 时避免额外成本。Amit 移除了 BatchVector 中间表示(qual 直接读取批内 slot 的 tts_values)。仍有两个待解决问题:(1) 在 0% 选择性(所有行不满足 qual)时,即使关闭批处理,打上批量 qual 补丁后逐行路径仍更热;(2) 对靠后列的 qual(变形开销大)批处理几乎无收益。近期补丁中的 GUC 为 executor_batch_rows(0 = 关闭)。
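批量 qual 的核心形态可以用下面独立的 C 草图示意(选择向量 sel 与阈值比较均为本文假设;真实补丁中的 ExecQualBatch 工作在 TupleBatch 与表达式解释器之上,直接读取批内 slot 的 tts_values):

```c
/* 对一批值一次性求 qual(此处为 v > threshold),
 * 把满足条件的行下标写入选择向量 sel,返回有效行数。 */
static int
qual_batch(const long *vals, int n, long threshold, int *sel)
{
    int nsel = 0;

    for (int i = 0; i < n; i++)
        if (vals[i] > threshold)
            sel[nsel++] = i;
    return nsel;
}
```

单次遍历即可完成整批过滤;当选择性为 0%(无行满足)时循环做了同样多的比较却不产出任何行,这也直观解释了正文提到的 0% 选择性场景为何值得单独关注。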

当前状态

  • 线程 仍在进行;最近消息为 2026 年 1 月。系列仍为 进行中
  • v5 为当前版本。0001–0003(表 AM 批 API、heapam 批、SeqScan + TupleBatch、EXPLAIN BATCHES)是拟先审阅并争取合入的部分。
  • v5 的 0004–0005实验性(ExecQualBatch、批量 qual 专用解释器)。
  • v3 已包含针对 TPC-H Q22 / 批量 agg 崩溃的 段错误修复(拆分 EEOP 辅助);v4/v5 在此基础上演进。
  • 待办:(1) 当批量 qual(0004–0005)在树中但 executor_batch_rows=0 时的逐行路径回归(如 0% 选择性);(2) 批作用域内存与 ExprContext;(3) 跨页批与在原生/压缩批格式上求值等后续工作。

小结

Amit Langote 的「Batching in executor」系列在 PostgreSQL 执行器中引入了一条 面向批量的路径:表 AM 可返回成批元组,执行器通过 TupleBatch 请求与传递批,SeqScan 是首个接入该路径的节点。v4、v5 增加了 EXPLAIN (BATCHES) 用于可观测性,并原型化了 批量 qual 求值 与专用解释器,以保持逐行路径不变。基准显示在开启批处理时收益可观(多为 20–50%);此前「关闭批处理」时的回归在 v4 中已解决,但仍有问题:在打上批量 qual 补丁且关闭批处理时(如 0% 选择性)逐行路径的成本。

审阅者提出了重要问题:与其他「批」类工作(如索引预取)的协调、物化与未来「在批上求值」的设计、TPC-H Q22 段错误(v3 修复)以及 Daniil 对堆批处理的审阅(v4 已回应)。当前审阅与合入重点为基础补丁(0001–0003)与 EXPLAIN BATCHES。

参考

第 04 周(2026)

2026 年第 04 周的 PostgreSQL 邮件列表讨论。

🇬🇧 English Version

文章

PostgreSQL 查询规划器优化:自动 COUNT(*) 转换

引言

2025 年 10 月,PostgreSQL 提交者 David Rowley 提出了一个重要的查询规划器优化,能够自动将 COUNT(1) 或 COUNT(not_null_col) 表达式转换为 COUNT(*)。这个优化解决了一个常见的性能反模式:开发者认为 COUNT(1) 等同于 COUNT(*),但实际上 COUNT(*) 更高效。该补丁于 2025 年 11 月提交,并引入了用于聚合函数简化的新基础设施。

为什么这很重要

COUNT(*) 与 COUNT(column) 之间的性能差异可能非常显著,特别是对于大表。当计算特定列时,PostgreSQL 必须:

  1. 解构元组以提取列值
  2. 检查 NULL 值(即使对于 NOT NULL 列,检查仍然会发生)
  3. 通过聚合函数处理列数据

相比之下,COUNT(*) 可以在不访问单个列值的情况下计算行数,从而获得显著更好的性能。David Rowley 的基准测试显示,在包含 100 万行的表上,使用 COUNT(*) 而不是 COUNT(not_null_col) 可以获得约 37% 的性能提升

技术分析

基础设施:SupportRequestSimplifyAggref

该补丁引入了一个名为 SupportRequestSimplifyAggref 的新基础设施,类似于现有的用于常规函数表达式(FuncExpr)的 SupportRequestSimplify。由于聚合使用 Aggref 节点,因此需要一个单独的机制。

关键组件包括:

  1. 新的支持节点类型supportnodes.h 中的 SupportRequestSimplifyAggref
  2. 简化函数clauses.c 中的 simplify_aggref(),在常量折叠期间调用聚合的支持函数
  3. 增强的可空性检查:扩展 expr_is_nonnullable() 以处理 Const 节点,而不仅仅是 Var 节点

实现细节

优化在查询规划的常量折叠阶段执行,具体在 eval_const_expressions_mutator() 中。当遇到 Aggref 节点时,规划器会:

  1. 检查聚合函数是否通过 pg_proc.prosupport 注册了支持函数
  2. 使用 SupportRequestSimplifyAggref 请求调用支持函数
  3. 如果支持函数返回简化的节点,则替换原始的 Aggref

对于 COUNT 聚合,支持函数(int8_agg_support_simplify())会检查:

  • 参数是否不可为空(使用 expr_is_nonnullable()
  • 聚合中是否没有 ORDER BY 或 DISTINCT 子句
  • 如果两个条件都满足,则将 COUNT(ANY) 转换为 COUNT(*)

代码示例

int8.c 中的核心简化逻辑:

static Node *
int8_agg_support_simplify(SupportRequestSimplifyAggref *req)
{
    Aggref    *aggref = req->aggref;

    /* 只处理 COUNT */
    if (aggref->aggfnoid != INT8_AGG_COUNT_OID)
        return NULL;

    /* 必须恰好有一个参数 */
    if (list_length(aggref->args) != 1)
        return NULL;

    /* 没有 ORDER BY 或 DISTINCT */
    if (aggref->aggorder != NIL || aggref->aggdistinct != NIL)
        return NULL;

    /* 检查参数是否不可为空 */
    if (!expr_is_nonnullable(req->root,
                             (Expr *) linitial(aggref->args),
                             true))
        return NULL;

    /* 转换为 COUNT(*) */
    return make_count_star_aggref(aggref);
}

补丁演进

该补丁经历了四次迭代,每次都在改进实现:

版本 1(初始提案)

  • 引入基本基础设施
  • 使用 SysCache 获取 pg_proc 元组

版本 2(代码清理)

  • get_func_support() 函数替换 SysCache 查找
  • 更清晰、更高效的方法

版本 3(移除实验性代码)

  • 移除了处理 COUNT(NULL) 优化的 #ifdef NOT_USED
  • 清理了未使用的包含文件
  • 改进了注释

版本 4(最终版本)

  • 在提交 b140c8d7a 后重新基于
  • 修复了支持函数总是返回 Aggref 的假设
  • 允许支持函数返回其他节点类型(例如常量),以实现更激进的优化
  • 这种灵活性使得未来的优化成为可能,例如将 COUNT(NULL) 转换为 '0'::bigint

社区见解

审查者反馈

Corey Huinker 提供了积极的反馈:

  • +1 支持自动查询改进
  • 指出我们无法教育所有人 COUNT(1) 是反模式,所以让它不再是反模式是正确的做法
  • 确认补丁可以干净地应用且测试通过

Matheus Alcantara 也进行了审查和测试:

  • 确认基准测试中约 30% 的性能提升
  • 验证了代码放置与现有的 SupportRequestSimplify 基础设施一致
  • +1 支持这个想法

设计决策

优化的时机:优化在常量折叠期间发生,这是规划过程的早期阶段。David 考虑过是否应该在稍后(在 add_base_clause_to_rel() 之后)进行,以捕获如下情况:

SELECT count(nullable_col) FROM t WHERE nullable_col IS NOT NULL;

但是,它必须在 preprocess_aggref() 之前发生,该函数将具有相同转换函数的聚合分组。当前的位置与常规函数的 SupportRequestSimplify 一致。

支持函数返回类型:基础设施允许支持函数返回 Aggref 以外的节点。这个设计决策使得未来的优化成为可能,例如:

  • COUNT(NULL) 转换为 '0'::bigint
  • 对聚合进行更激进的常量折叠

性能考虑

该优化提供了显著的性能优势:

  1. 减少元组解构COUNT(*) 不需要从元组中提取列值
  2. 更少的 NULL 检查:不需要检查单个列值
  3. 更好的缓存利用率:更少的数据移动意味着更好的 CPU 缓存使用

对于具有多列的表,性能提升可能更加显著,因为 COUNT(column) 可能需要解构许多列才能到达目标列。

边界情况和限制

优化仅在以下情况下应用:

  1. 列被证明不可为空(NOT NULL 约束或常量)
  2. 聚合中没有 ORDER BY 子句
  3. 聚合中没有 DISTINCT 子句

尚未优化的情况:

  • COUNT(nullable_col),其中列可能为 NULL(即使在同一查询中通过 WHERE nullable_col IS NOT NULL 过滤)
  • COUNT(col ORDER BY col) - ORDER BY 阻止优化
  • COUNT(DISTINCT col) - DISTINCT 阻止优化

WHERE 子句的限制是由于优化的时机(在常量折叠期间,在关系信息完全可用之前)。

当前状态

该补丁由 David Rowley 于 2025 年 11 月 26 日提交。它已进入 PostgreSQL master 分支,并将包含在 PostgreSQL 19 中。

结论

这个优化代表了 PostgreSQL 查询规划器的重大改进,自动修复了常见的性能反模式,而无需更改应用程序。新的 SupportRequestSimplifyAggref 基础设施也为未来的聚合优化打开了大门。

对于开发者和 DBA:

  • 无需操作:优化会自动发生
  • 性能优势:使用 COUNT(1) 或 COUNT(not_null_col) 的现有查询将自动变得更快
  • 最佳实践:虽然规划器现在会优化这些情况,但 COUNT(*) 仍然是计算行数最清晰、最符合习惯的方式

这一变化体现了 PostgreSQL 对自动改进查询性能的承诺,减少了开发者了解每个优化细节的负担,同时仍然允许专家在需要时编写最优查询。

参考资料

  • 讨论线程
  • 相关:用于常规函数表达式的 SupportRequestSimplify

第 03 周(2026)

2026 年第 03 周的 PostgreSQL 邮件列表讨论。

🇬🇧 English Version

文章

PostgreSQL 19 扩展统计信息导入/导出功能

引言

PostgreSQL 的扩展统计信息功能自版本 10 引入以来,使优化器能够理解简单单列统计信息无法捕获的列间相关性。这包括 n-distinct 系数(存在多少个不同的组合)、函数依赖(当一个列决定另一个列时)以及多列组合的最常见值(MCV)列表。

虽然 PostgreSQL 18 引入了用于导入和导出关系和属性统计信息的函数(pg_restore_relation_stats 和 pg_restore_attribute_stats),但扩展统计信息被排除在这一初始实现之外。最近在 pgsql-hackers 邮件列表上,由 Corey Huinker 发起的一个讨论线程解决了这一空白,提供了一个全面的补丁系列,添加了 pg_restore_extended_stats()、pg_clear_extended_stats() 及相关基础设施。

这项工作意义重大,原因如下:

  • 实现跨 pg_dump/pg_restore 和 pg_upgrade 的完整统计信息保留
  • 允许使用假设统计信息进行查询计划器实验
  • 支持仅包含 schema 和统计信息的转储,用于在没有实际数据的情况下测试查询计划

技术分析

原始格式的问题

pg_ndistinct 和 pg_dependencies 类型的原始输出格式使用了一种 JSON 结构,其中键本身包含结构化数据:

{"1, 2": 2323, "1, 3": 3232, "2, 3": 1500}

虽然这在技术上是有效的 JSON,但这种格式存在几个问题:

  1. 包含逗号分隔属性号的键需要额外解析
  2. 难以以编程方式操作
  3. 不存在可用的输入函数——这些类型实际上只能输出

新的 JSON 格式

补丁系列引入了一种更清晰、更结构化的 JSON 格式。对于 pg_ndistinct

[
  {"attributes": [2, 3], "ndistinct": 4},
  {"attributes": [2, -1], "ndistinct": 4},
  {"attributes": [2, 3, -1], "ndistinct": 4}
]

对于 pg_dependencies

[
  {"attributes": [2], "dependency": 3, "degree": 1.000000},
  {"attributes": [2, 3], "dependency": -1, "degree": 0.850000}
]

主要改进:

  • 规范的 JSON 数组,每个元素都有命名的键
  • 清晰分离属性、值和元数据
  • 机器可读,无需自定义解析逻辑
  • 负数属性号表示统计对象中的表达式(例如,-1 是第一个表达式)

输入函数实现

新的输入函数使用 PostgreSQL 的 JSON 解析器基础设施,配合自定义语义动作处理器。以下是 pg_ndistinct 解析状态机的简化视图:

typedef enum
{
    NDIST_EXPECT_START = 0,
    NDIST_EXPECT_ITEM,
    NDIST_EXPECT_KEY,
    NDIST_EXPECT_ATTNUM_LIST,
    NDIST_EXPECT_ATTNUM,
    NDIST_EXPECT_NDISTINCT,
    NDIST_EXPECT_COMPLETE
} ndistinctSemanticState;

解析器验证:

  • 正确的 JSON 结构(对象数组)
  • 必需的键(ndistinct 统计信息需要 attributes 和 ndistinct)
  • 属性号在有效范围内(正数表示列,负数表示表达式,但不超过 STATS_MAX_DIMENSIONS
  • 单个项目内没有重复属性

扩展统计信息函数

补丁引入了三个主要 SQL 函数:

pg_restore_extended_stats() — 从先前导出的值导入扩展统计信息:

SELECT pg_restore_extended_stats(
    'public',                    -- 关系 schema
    'my_table',                  -- 关系名称
    'public',                    -- 统计信息 schema
    'my_stats',                  -- 统计信息名称
    false,                       -- inherited(是否继承)
    '{"version": ..., "ndistinct": [...], "dependencies": [...], "mcv": [...], "exprs": [...]}'::text
);

pg_clear_extended_stats() — 从 pg_statistic_ext_data 中删除扩展统计信息数据:

SELECT pg_clear_extended_stats(
    'public',        -- 统计信息 schema
    'my_stats',      -- 统计信息名称
    false            -- inherited(是否继承)
);

这些函数遵循为关系/属性统计信息建立的相同模式:

  • 返回布尔值表示成功与否
  • 遇到问题时发出 WARNING(而非 ERROR),以避免破坏 pg_restore 脚本
  • 需要目标关系的 MAINTAIN 权限

验证与安全

实现包括仔细的验证:

  1. 属性边界检查:正数 attnum 必须存在于 stxkeys 中,负数 attnum 不得超过表达式数量
  2. 组合完整性:对于 pg_ndistinct,基于最长属性列表,必须存在所有 N 选 K 组合
  3. 软错误处理:使用 PostgreSQL 的 ErrorSaveContext 进行安全的错误报告而不会崩溃
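「所有 N 选 K 组合」的完整性可以归结为一个简单的计数。下面的 C 草图是本文的演示,假设校验要求最长属性列表的所有大小 ≥ 2 的子集都出现(pg_dependencies 因可省略不显著组合而不适用此检查):

```c
/* 对 n 个属性,大小 >= 2 的子集共有 2^n - n - 1 个:
 * 从全部 2^n 个子集中去掉空集和 n 个单元素子集。 */
static long
expected_ndistinct_items(int n)
{
    long subsets = 1;

    for (int i = 0; i < n; i++)
        subsets *= 2;               /* 2^n 个子集 */
    return subsets - n - 1;
}
```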

属性号验证示例:

if (attnum == 0 || attnum < (0 - STATS_MAX_DIMENSIONS))
{
    errsave(parse->escontext,
            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
            errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
            errdetail("Invalid \"%s\" element: %d.",
                      PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    return JSON_SEM_ACTION_FAILED;
}

社区洞察

关键讨论点

格式更改时机:Tomas Vondra 最初建议采用更结构化的 JSON 格式。社区认识到这是在可用输入函数锁定向后兼容性要求之前更改格式的最后机会。

验证范围:关于应执行多少验证存在重大讨论:

  • 早期补丁对统计一致性进行了广泛检查(例如,MCV 频率总和为 1.0)
  • 审查者提出反对意见,倾向于最小化验证以避免破坏合法但不寻常的导入
  • 最终共识:验证结构和属性引用,但不验证统计值

pg_dependencies 特殊情况:与存储所有组合的 pg_ndistinct 不同,pg_dependencies 可能会省略统计上不显著的组合。这意味着输入函数无法对依赖项强制执行完整的组合覆盖。

审查反馈整合

Michael Paquier 提供了广泛的审查并贡献了重大改进:

  • 重构补丁系列以获得更清晰的提交
  • 将格式更改与输入函数添加分开
  • 添加全面的回归测试,实现超过 90% 的代码覆盖率
  • 修复旧版 GCC 上的编译器警告

Tom Lane 发现了风格问题:

  • 错误详细消息转换为完整句子
  • 用直接状态检查替换 SOFT_ERROR_OCCURRED() 宏以避免警告

当前状态

截至 2026 年 1 月,补丁系列已取得重大进展:

已提交:

  • pg_ndistinct 的输出格式更改(新的 JSON 数组格式)
  • pg_dependencies 的输出格式更改(新的 JSON 数组格式)
  • 两种类型的输入函数及全面验证
  • pg_clear_extended_stats() 函数

审查中 (v27):

  • pg_restore_extended_stats() 函数
  • pg_dump 集成用于扩展统计信息导出/导入

pg_dump 集成支持向后兼容到 PostgreSQL 10,通过特定版本的 SQL 生成来处理格式差异。

技术细节

内部存储未更改

重要的是,内部二进制存储格式保持不变。新的输入/输出函数只影响文本表示。这意味着:

  • 不需要目录更改
  • 现有数据保持有效
  • 二进制 COPY 操作不受影响

表达式统计支持

扩展统计信息可以包含表达式(例如,CREATE STATISTICS s ON (a + b), c FROM t)。实现通过负数属性号处理这些:

  • -1 = 第一个表达式
  • -2 = 第二个表达式
  • 以此类推

恢复格式中的 exprs 元素包含类似于 pg_statistic 条目的每个表达式的统计信息,实现完整的往返保留。

MCV 列表处理

扩展统计信息的 MCV(最常见值)列表特别复杂,包含:

  • 跨多列的值组合
  • 频率和基础频率数组
  • 每个值的空值位图

实现重用了属性统计信息导入的基础设施,并针对多列值数组进行了扩展。

结论

这个补丁系列代表了 PostgreSQL 统计信息基础设施的重大增强。通过启用扩展统计信息的导入/导出,它:

  1. 完善了统计信息功能,延续了 PostgreSQL 18 中为关系和属性统计信息开始的工作
  2. 实现了真实的测试,在清理后的 schema 上使用类似生产环境的统计信息
  3. 提高了升级可靠性,通过 pg_upgrade 保留优化器信息

对于 DBA 和开发人员:

  • 使用 CREATE STATISTICS 创建的扩展统计信息现在可以在 pg_dump/pg_restore 后保留
  • 使用包含完整统计信息的 --no-data 转储,查询计划测试变得更加实用
  • 新的 JSON 格式便于人类阅读,用于调试和假设场景测试

目标发布版本是 PostgreSQL 19,剩余的恢复函数和 pg_dump 集成预计很快会合并。

参考资料

pg_plan_advice:PostgreSQL 查询计划控制的新方案

背景介绍

PostgreSQL 的查询优化器功能强大,通常能生成优秀的执行计划。然而,经验丰富的 DBA 和开发者偶尔会遇到需要影响或稳定优化器决策的场景。来自 EnterpriseDB 的 Robert Haas 正在开发一个重要的 contrib 模块 pg_plan_advice,旨在解决这一长期存在的需求。

本文分析了 pgsql-hackers 邮件列表上自 2025 年 10 月以来持续讨论的 pg_plan_advice 线程

pg_plan_advice 是什么?

pg_plan_advice 是一个提议中的 contrib 模块,引入了一种专门用于控制关键规划决策的"建议迷你语言"(advice mini-language)。该模块支持:

  • 使用 EXPLAIN (PLAN_ADVICE) 从现有查询计划生成建议字符串
  • 通过 pg_plan_advice.advice GUC 参数应用建议字符串,以复现或约束后续的规划决策

建议语言允许控制:

  • 连接顺序:表的连接顺序
  • 连接方法:Nested Loop、Merge Join、Hash Join
  • 扫描类型:顺序扫描、索引扫描(可指定具体索引)
  • 并行执行:并行执行的位置和方式
  • 分区连接:分区表连接的处理方式

核心设计理念

Robert Haas 在 README 中强调,主要使用场景并非让用户"战胜优化器",而是复现过去表现良好的执行计划

"我们不需要接受用户能比优化器做出更好规划的观点。我们只需要接受用户比优化器更能分辨好计划和坏计划。这是一个很低的门槛。优化器永远不知道它生成的计划实际执行时会发生什么,但用户知道。"

这将 pg_plan_advice 定位为计划稳定性工具,而非微观管理优化器的提示系统。

技术架构

关系标识符系统

pg_plan_advice 最创新的方面之一是其关系标识符系统(Relation Identifier System)。该系统提供对查询各部分的无歧义引用,能处理复杂场景:

  • 同一表使用不同别名的多次引用
  • 子查询和 CTE
  • 分区表及其分区

标识符语法使用特殊表示法如 t#2 来区分查询中表 t 的第一次和第二次出现。

使用示例

以下是 Jakub Wartak 测试中展示的系统能力:

-- 为带别名的查询生成建议
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- 输出包含:
-- Generated Plan Advice:
--   JOIN_ORDER(a#2 b#2 c)
--   MERGE_JOIN_PLAIN(b#2 c)
--   SEQ_SCAN(c)
--   INDEX_SCAN(a#2 public.t1_pkey)
--   NO_GATHER(c a#2 b#2)

然后可以选择性地应用约束:

-- 强制使用特定扫描类型
SET pg_plan_advice.advice = 'SEQ_SCAN(b#2)';

-- 重新执行 EXPLAIN 查看新计划
EXPLAIN (PLAN_ADVICE, COSTS OFF) 
SELECT * FROM (SELECT * FROM t1 a JOIN t2 b USING (id)) a, t2 b, t3 c 
WHERE a.id = b.id AND b.id = c.id;

-- 输出显示:
-- Supplied Plan Advice:
--   SEQ_SCAN(b#2) /* matched */

补丁结构(v10)

实现分为五个补丁:

  • 0001:存储范围表扁平化信息(7.8 KB)
  • 0002:在最终计划中存储省略节点的信息(9.8 KB)
  • 0003:存储 Append 节点合并信息(40.4 KB)
  • 0004:允许插件控制路径生成策略(56.1 KB)
  • 0005:WIP——添加 pg_plan_advice contrib 模块(399.1 KB)

前四个补丁为优化器添加必要的基础设施,第五个包含实际模块。这种分离设计使基础设施在未来可能惠及其他扩展。

社区审查与测试

该线程得到了多位社区成员的积极参与:

Jakub Wartak(EDB)

进行了大量 TPC-H 基准测试,发现了若干 bug:

  • debug/ASAN 构建中的空指针崩溃
  • 无统计信息时的半连接唯一性检测问题
  • 复杂查询中的连接顺序建议冲突

Jacob Champion(EDB)

应用模糊测试技术发现边缘情况:

  • 畸形建议字符串导致的解析器崩溃
  • 对非分区表使用分区相关建议的问题
  • 通过语料库模糊测试发现的 AST 工具 bug

其他贡献者

  • Alastair Turner:赞赏测试替代计划的能力
  • Hannu Krosing(Google):引用 VLDB 研究,显示 20% 的实际查询有 10+ 个连接
  • Lukas Fittl:对与 pg_stat_statements 集成的可能性感兴趣

发现并修复的问题

协作审查过程发现并修复了多个版本中的若干问题:

  1. 编译器警告(gcc-13、clang-20)- 在早期版本中修复
  2. 扩展状态未分配时 pgpa_join_path_setup() 中的空指针崩溃
  3. 连接顺序冲突检测错误地将连接方法建议视为正向约束
  4. 在 EXPLAIN 中未使用 PLAN_ADVICE 时半连接唯一性追踪工作不正确
  5. 嵌套连接顺序规范中的部分匹配检测问题

当前状态

截至 v10(2026 年 1 月 15 日发布):

  • 补丁已注册在 Commitfest
  • 仍标记为 WIP(进行中)
  • 测试仍在进行,特别是 TPC-H 查询测试
  • Robert Haas 正在寻求实质性的代码审查,特别是针对补丁 0001

对 PostgreSQL 用户的意义

如果被合并,pg_plan_advice 将提供:

  1. 计划稳定性:捕获并复现已知良好的查询计划
  2. 调试辅助:理解优化器为何做出特定选择
  3. 测试工具:在不修改查询的情况下实验替代计划形状
  4. 生产安全网:防止统计信息变化后的意外计划退化

与 pg_hint_plan 的比较

与流行的 pg_hint_plan 扩展不同,pg_plan_advice 专注于往返安全性(round-trip safety)

  • 计划可以被可靠地捕获和重新应用
  • 关系标识符系统自动处理复杂的别名
  • 设计为可与任何查询结构配合使用,无需手动管理标识符

总结

pg_plan_advice 代表了 PostgreSQL 优化器可扩展性方面的重要进步。它不是要取代优化器的判断,而是提供一种安全机制来保留经过验证的执行策略。活跃的社区审查过程已经大幅改进了代码,持续的测试正在帮助确保其健壮性。

对于管理复杂工作负载的 DBA,特别是那些查询偶尔遭受计划退化的场景,该模块提供了一个有前途的解决方案,它与优化器协同工作而非对抗。


邮件列表链接: pg_plan_advice - pgsql-hackers

Commitfest 条目: CF 6184

作者: Robert Haas(EnterpriseDB)

审查者: Jakub Wartak、Jacob Champion、Alastair Turner、Hannu Krosing、John Naylor 等