Extended Statistics Import/Export Functions for PostgreSQL 19

Introduction

PostgreSQL's extended statistics feature, introduced in version 10, allows the optimizer to understand correlations between columns that simple per-column statistics cannot capture. This includes n-distinct coefficients (how many distinct combinations exist), functional dependencies (when one column determines another), and Most Common Values (MCV) lists for multi-column combinations.

While PostgreSQL 17 introduced functions for importing and exporting relation and attribute statistics (pg_restore_relation_stats, pg_restore_attribute_stats), extended statistics were left out of this initial implementation. A recent thread on pgsql-hackers, initiated by Corey Huinker, addresses this gap with a comprehensive patch series that adds pg_restore_extended_stats(), pg_clear_extended_stats(), and related infrastructure.

This work is significant for several reasons:

Enables complete statistics preservation across pg_dump/pg_restore and pg_upgrade
Allows query planner experimentation with hypothetical statistics
Supports schema-only dumps with statistics for testing query plans without actual data

Technical Analysis

The Problem with the Original Format

The original output format for pg_ndistinct and pg_dependencies types used a JSON structure where the keys themselves contained structured data:

{"1, 2": 2323, "1, 3": 3232, "2, 3": 1500}

While technically valid JSON, this format posed several problems:

Keys containing comma-separated attribute numbers require additional parsing
Difficult to manipulate programmatically
No working input function existed—these types were effectively output-only

The New JSON Format

The patch series introduces a cleaner, more structured JSON format. For pg_ndistinct:

[
  {"attributes": [2, 3], "ndistinct": 4},
  {"attributes": [2, -1], "ndistinct": 4},
  {"attributes": [2, 3, -1], "ndistinct": 4}
]

For pg_dependencies:

[
  {"attributes": [2], "dependency": 3, "degree": 1.000000},
  {"attributes": [2, 3], "dependency": -1, "degree": 0.850000}
]

Key improvements:

Proper JSON arrays with named keys for each element
Clear separation of attributes, values, and metadata
Machine-readable without custom parsing logic
Negative attribute numbers represent expressions in the statistics object (e.g., -1 is the first expression)

Input Function Implementation

The new input functions use PostgreSQL's JSON parser infrastructure with a custom semantic action handler. Here's a simplified view of the parsing state machine for pg_ndistinct:

typedef enum
{
    NDIST_EXPECT_START = 0,
    NDIST_EXPECT_ITEM,
    NDIST_EXPECT_KEY,
    NDIST_EXPECT_ATTNUM_LIST,
    NDIST_EXPECT_ATTNUM,
    NDIST_EXPECT_NDISTINCT,
    NDIST_EXPECT_COMPLETE
} ndistinctSemanticState;

The parser validates:

Proper JSON structure (array of objects)
Required keys (attributes and ndistinct for ndistinct statistics)
Attribute numbers within valid ranges (positive for columns, negative for expressions, but not beyond STATS_MAX_DIMENSIONS)
No duplicate attributes within a single item

Extended Statistics Functions

The patch introduces three main SQL functions:

pg_restore_extended_stats() — Imports extended statistics from a previously exported value:

SELECT pg_restore_extended_stats(
    'public',                    -- relation schema
    'my_table',                  -- relation name
    'public',                    -- statistics schema  
    'my_stats',                  -- statistics name
    false,                       -- inherited
    '{"version": ..., "ndistinct": [...], "dependencies": [...], "mcv": [...], "exprs": [...]}'::text
);

pg_clear_extended_stats() — Removes extended statistics data from pg_statistic_ext_data:

SELECT pg_clear_extended_stats(
    'public',        -- statistics schema
    'my_stats',      -- statistics name
    false            -- inherited
);

The functions follow the same patterns established for relation/attribute statistics:

Return boolean indicating success
Issue WARNING (not ERROR) on problems to avoid breaking pg_restore scripts
Require MAINTAIN privilege on the target relation

Validation and Safety

The implementation includes careful validation:

Attribute bounds checking: Positive attnums must exist in stxkeys, negative attnums must not exceed the number of expressions
Combination completeness: For pg_ndistinct, all N-choose-K combinations must be present based on the longest attribute list
Soft error handling: Uses PostgreSQL's ErrorSaveContext for safe error reporting without crashing

Example validation for attribute numbers:

if (attnum == 0 || attnum < (0 - STATS_MAX_DIMENSIONS))
{
    errsave(parse->escontext,
            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
            errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
            errdetail("Invalid \"%s\" element: %d.",
                      PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    return JSON_SEM_ACTION_FAILED;
}

Community Insights

Key Discussion Points

Format Change Timing: Tomas Vondra initially suggested a more structured JSON format. The community recognized this was the last opportunity to change the format before a working input function locked in backward compatibility requirements.

Validation Scope: There was significant discussion about how much validation to perform:

Early patches had extensive checks for statistical consistency (e.g., MCV frequencies summing to 1.0)
Reviewers pushed back, preferring minimal validation to avoid breaking legitimate but unusual imports
Final consensus: validate structure and attribute references, but not statistical values

pg_dependencies Special Case: Unlike pg_ndistinct which stores all combinations, pg_dependencies may omit statistically insignificant combinations. This means the input function cannot enforce complete combination coverage for dependencies.

Reviewer Feedback Integration

Michael Paquier provided extensive review and contributed significant improvements:

Restructured the patch series for cleaner commits
Split format changes from input function additions
Added comprehensive regression tests achieving >90% code coverage
Fixed compiler warnings on older GCC versions

Tom Lane caught style issues:

Error detail messages converted to complete sentences
Replaced SOFT_ERROR_OCCURRED() macro with direct state checks to avoid warnings

Current Status

As of January 2026, the patch series has progressed significantly:

Committed:

Output format changes for pg_ndistinct (new JSON array format)
Output format changes for pg_dependencies (new JSON array format)
Input functions for both types with comprehensive validation
pg_clear_extended_stats() function

In Review (v27):

pg_restore_extended_stats() function
pg_dump integration for extended statistics export/import

The pg_dump integration supports backward compatibility to PostgreSQL 10, with version-specific SQL generation to handle format differences.

Technical Details

Internal Storage Unchanged

Importantly, the internal binary storage format remains unchanged. The new input/output functions only affect the text representation. This means:

No catalog changes required
Existing data remains valid
Binary COPY operations unaffected

Expression Statistics Support

Extended statistics can include expressions (e.g., CREATE STATISTICS s ON (a + b), c FROM t). The implementation handles these via negative attribute numbers:

-1 = first expression
-2 = second expression
etc.

The exprs element in the restore format contains per-expression statistics similar to pg_statistic entries, enabling complete round-trip preservation.

MCV List Handling

MCV (Most Common Values) lists for extended statistics are particularly complex, containing:

Value combinations across multiple columns
Frequency and base frequency arrays
Per-value null bitmaps

The implementation reuses infrastructure from attribute statistics import, with extensions for multi-column value arrays.

Conclusion

This patch series represents a significant enhancement to PostgreSQL's statistics infrastructure. By enabling import/export of extended statistics, it:

Completes the statistics story started in PostgreSQL 17 for relation and attribute statistics
Enables realistic testing with production-like statistics on sanitized schemas
Improves upgrade reliability by preserving optimizer information across pg_upgrade

For DBAs and developers:

Extended statistics created with CREATE STATISTICS will now survive pg_dump/pg_restore
Query plan testing becomes more practical with --no-data dumps that include full statistics
The new JSON format is human-readable for debugging and hypothetical scenario testing

The target release is PostgreSQL 19, with the remaining restore function and pg_dump integration expected to land soon.

Postgres Weekly, A Hacker’s Digest