More Speedups for Tuple Deformation: Precalculating attcacheoff
Introduction
Tuple deformation is the process of extracting individual attribute values from a PostgreSQL heap tuple's raw byte representation into a TupleTableSlot. It happens constantly during query execution—every time a sequential scan, index scan, or join produces a row, the executor must "deform" that tuple to access column values. For workloads that process millions of rows, even small improvements to the deforming hot path can yield significant gains.
David Rowley has been steadily optimizing tuple deformation. In PostgreSQL 18, he landed several patches: CompactAttribute (5983a4cff), faster offset aligning (db448ce5a), and inline deforming loops (58a359e58). Building on that work, he proposed precalculating attcacheoff rather than computing it on every attribute access. The discussion has since evolved through v10 (February 2026), with Andres Freund contributing a NULL-bitmap-to-isnull conversion that pushes Apple M2 speedups to 63% in some scenarios. The patch set remains under active review.
Why This Matters
When the executor needs a column value from a tuple, it must:
- Align the current offset according to the attribute's alignment
- Fetch the value via fetch_att()
- Advance past the attribute to the next one
These steps form a dependency chain: each offset depends on the previous. There is little opportunity for instruction-level parallelism. For fixed-width attributes, PostgreSQL can cache the offset (attcacheoff) to avoid recomputing alignment and length—but that caching was previously done inside the deforming loop. David's idea: do it once when the TupleDesc is finalized, not on every tuple.
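A minimal, self-contained model of that per-attribute loop (the types and names here—ToyAttr, toy_deform—are illustrative, not PostgreSQL's actual code) makes the serial dependency visible: each iteration's offset is an input to the next.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ToyAttr
{
    int16_t attlen;      /* fixed width in bytes */
    uint8_t attalignby;  /* required alignment: 1, 2, 4, or 8 */
} ToyAttr;

/* Round "off" up to a multiple of "alignby" (a power of two). */
static size_t
att_align(size_t off, uint8_t alignby)
{
    return (off + alignby - 1) & ~((size_t) alignby - 1);
}

/* Deform natts fixed-width attributes from "data" into "values".
 * The next offset cannot be computed until the current attribute's
 * aligned offset is known -- no instruction-level parallelism. */
static void
toy_deform(const char *data, const ToyAttr *atts, int natts,
           const char **values)
{
    size_t off = 0;

    for (int i = 0; i < natts; i++)
    {
        off = att_align(off, atts[i].attalignby); /* depends on previous off */
        values[i] = data + off;                   /* fetch */
        off += (size_t) atts[i].attlen;           /* advance */
    }
}
```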
Technical Approach
TupleDescFinalize()
The core change introduces TupleDescFinalize(), which must be called after a TupleDesc has been created or changed. This function:
- Pre-calculates attcacheoff for all fixed-width attributes
- Records firstNonCachedOffAttr—the first attribute (by attnum) that is varlena or cstring and thus cannot have a cached offset
- Enables a tight loop that deforms all attributes with cached offsets before falling through to attributes that require manual offset calculation
If a tuple has a NULL before the last attribute with a cached offset, the code can only use attcacheoff up to that NULL—but for tuples without early NULLs, the fast path handles many attributes in a tight loop without any per-attribute offset arithmetic.
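As a sketch, with hypothetical toy structs standing in for TupleDesc and its attributes (field names follow the thread, layout is illustrative), the finalize step walks the attributes once, caching offsets until the first variable-length attribute stops the walk:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ToyAttr
{
    int16_t attlen;      /* >0 fixed width; -1 varlena; -2 cstring */
    uint8_t attalignby;  /* required alignment, a power of two */
    int16_t attcacheoff; /* -1 until finalized */
} ToyAttr;

typedef struct ToyDesc
{
    int     natts;
    int     firstNonCachedOffAttr;
    ToyAttr atts[8];
} ToyDesc;

/* Precompute cached offsets once, when the descriptor is finalized,
 * instead of on every tuple inside the deforming loop. */
static void
toy_desc_finalize(ToyDesc *desc)
{
    size_t off = 0;

    desc->firstNonCachedOffAttr = desc->natts; /* assume all cacheable */
    for (int i = 0; i < desc->natts; i++)
    {
        ToyAttr *att = &desc->atts[i];

        if (att->attlen <= 0)        /* varlena (-1) or cstring (-2) */
        {
            /* offsets after this point depend on the tuple data */
            desc->firstNonCachedOffAttr = i;
            break;
        }
        off = (off + att->attalignby - 1) & ~((size_t) att->attalignby - 1);
        att->attcacheoff = (int16_t) off;
        off += (size_t) att->attlen;
    }
}
```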
Dedicated Deforming Loops
The patch adds a dedicated loop that processes all attributes with a precomputed attcacheoff before entering the loop that handles varlena/cstring attributes. For tuples with HEAP_HASNULL set, the current code calls att_isnull() for every attribute. A further optimization: keep deforming without calling att_isnull() until we reach the first NULL. Test #5 in the benchmark (first col int not null, last col int null) highlights this—it often shows the largest speedup.
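With offsets precomputed, the fast loop degenerates to plain indexed loads—a sketch with a hypothetical toy attribute type, not the patch's actual code:

```c
#include <stdint.h>

typedef struct ToyAttr
{
    int16_t attcacheoff; /* precomputed when the descriptor was finalized */
    int16_t attlen;
} ToyAttr;

/* Deform the first "ncached" attributes using only cached offsets:
 * no alignment arithmetic and no cross-iteration dependency, so the
 * CPU can overlap the loads of successive iterations. */
static void
deform_cached(const char *data, const ToyAttr *atts, int ncached,
              const char **values)
{
    for (int i = 0; i < ncached; i++)
        values[i] = data + atts[i].attcacheoff;
}
```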
Optional OPTIMIZE_BYVAL Loop
An optional variant adds a loop for tuples where all deformed attributes are attbyval == true. In that case, fetch_att() can be inlined without the branch that handles pointer types, yielding a tighter loop with less branching. The tradeoff: when the optimization doesn't apply, there is extra overhead to check attnum < firstByRefAttr. Whether this pays off varies by hardware and compiler.
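A toy illustration of the difference (the names fetch_generic and fetch_byval4 are hypothetical; the real code uses fetch_att()): the generic path must branch on attbyval and on the width, while an all-byval specialization is a single load.

```c
#include <stdint.h>
#include <string.h>

typedef uintptr_t Datum;

/* Generic fetch: must branch on whether the attribute is passed by
 * value (copy the bytes into the Datum) or by reference (the Datum
 * is just the address of the data). */
static Datum
fetch_generic(const char *p, int byval, int16_t attlen)
{
    if (byval)
    {
        switch (attlen)
        {
            case 1: { uint8_t  v; memcpy(&v, p, 1); return v; }
            case 2: { uint16_t v; memcpy(&v, p, 2); return v; }
            case 4: { uint32_t v; memcpy(&v, p, 4); return v; }
            default:{ uint64_t v; memcpy(&v, p, 8); return (Datum) v; }
        }
    }
    return (Datum) p;   /* by-reference */
}

/* All-byval specialization: the byval test and pointer path vanish. */
static Datum
fetch_byval4(const char *p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return v;
}
```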
Benchmark Design
To stress tuple deformation, David designed a benchmark that maximizes deforming work relative to other CPU:
SELECT sum(a) FROM t1;
The a column is almost last, so all prior attributes must be deformed before a can be read. Eight test schemas cover combinations of first column (int/text, null/not null) and last column (int null/not null). For each of the 8 tests, he ran with 0, 10, 20, 30, and 40 extra INT NOT NULL columns—40 scenarios per benchmark run. Each scenario used 1 million rows.
Benchmark Results
Results varied by hardware and compiler:
- AMD Zen 2 (3990x) with GCC: Up to 21% average speedup with OPTIMIZE_BYVAL; some tests exceed 44%; no regressions.
- AMD Zen 2 with Clang: Some small regressions in the 0-extra-column tests.
- Apple M2: Tests #1 and #5 improve significantly; others less so; a few slight regressions with certain patches.
- Intel (Azure): Benchmarks run on shared, low-core instances; results were noisier due to co-located workloads.
Patch Evolution
v1 → v3 (December 2025 – January 2026)
- v1: Three patches—0001 (precalculate attcacheoff), 0002 (experimental NULL bitmap look-ahead), 0003 (remove dedicated hasnulls loop)
- v2: Rebase, fix Assert for NULL bitmap in 0003, JIT fix (remove TTS_FLAG_SLOW), more benchmarks
- v3: Rebase, drop 0002 and 0003 (benchmarks showed little advantage), keep only 0001
v4 (January 2026)
Addressed code review from Chao Li:
- NULL bitmap mask (tupmacs.h): Clarified comment—when natts & 7 == 0, the mask is zero and the code correctly returns natts
- Uninitialized TupleDesc: firstNonCachedOffAttr == 0 means no cached attributes; -1 means uninitialized. Added Asserts with hints to call TupleDescFinalize() if they fail
- Typo: "possibily" → "possibly"
- LLVM: Fixed compiler warning
v5–v8 (January–February 2026): Andres Freund's NULL Bitmap Optimization
Andres Freund joined the discussion and proposed a key improvement: instead of calling att_isnull() for each column, compute the isnull[] array directly from the NULL bitmap using a SWAR (SIMD Within A Register) technique. The idea: multiply one byte of the bitmap by a carefully chosen value (e.g. 0x204081) so each bit spreads into a separate byte, then mask. This avoids a 2KB lookup table and works well on most hardware.
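A runnable sketch of the trick, using the 0x204081 multiplier from the thread (the function name and byte-at-a-time interface are illustrative; it also assumes a little-endian target for the memcpy stores—the actual patch byte-swaps on big-endian):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SPREAD_BITS_MULTIPLIER_32 UINT32_C(0x204081)

/* Convert one byte of a heap tuple NULL bitmap (1 = value present)
 * into eight bool bytes: out[i] is true when attribute i is NULL.
 * Multiplying a nibble by 0x204081 (2^21 + 2^14 + 2^7 + 1) moves
 * bit i of the nibble to bit 8*i of the product, i.e. into byte i;
 * masking with 0x01010101 keeps exactly one bit per byte.  The low
 * and high nibbles are handled separately so nothing overflows. */
static void
spread_null_byte(uint8_t bits, bool out[8])
{
    uint8_t  nulls = (uint8_t) ~bits;   /* invert: 1 now means NULL */
    uint32_t lo = ((uint32_t) (nulls & 0x0F) * SPREAD_BITS_MULTIPLIER_32)
                  & UINT32_C(0x01010101);
    uint32_t hi = ((uint32_t) (nulls >> 4) * SPREAD_BITS_MULTIPLIER_32)
                  & UINT32_C(0x01010101);

    /* write four bool bytes at a time (little-endian assumption) */
    memcpy(&out[0], &lo, 4);
    memcpy(&out[4], &hi, 4);
}
```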
David implemented this in patch 0004 ("Various experimental changes"). Additional changes in 0004:
- populate_isnull_array(): Converts the NULL bitmap to tts_isnull in bulk using the multiplication trick
- tts_isnull sizing: Rounded up to a multiple of 8 so the loop can write 8 bytes at a time (avoids memset inlining issues)
- t_hoff: For !hasnulls tuples, use MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) instead of t_hoff
- fetch_att_noerr(): New variant without elog for the common attlen == 8 case
John Naylor noted that __builtin_ctz(~bits[bytenum]) is undefined when the byte is 255; David fixed this with a cast: pg_rightmost_one_pos32(~((uint32) bits[bytenum])).
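A sketch of that pattern (the name toy_first_null_attr is illustrative; it uses the GCC/Clang builtin directly where PostgreSQL would use pg_rightmost_one_pos32): the cast widens the inverted byte to 32 bits, so the upper bits are always set and the argument to __builtin_ctz is never zero.

```c
#include <stdint.h>

/* Return the index of the first NULL attribute (first zero bit in the
 * NULL bitmap), or natts if there is none.  For a byte of 0xFF,
 * ~((uint32_t) 0xFF) == 0xFFFFFF00, so ctz yields 8 and the search
 * safely moves to the next byte -- no undefined ctz(0). */
static int
toy_first_null_attr(const uint8_t *bits, int natts)
{
    for (int bytenum = 0; bytenum * 8 < natts; bytenum++)
    {
        uint32_t inverted = ~((uint32_t) bits[bytenum]);
        int      pos = __builtin_ctz(inverted);  /* inverted != 0 always */

        if (pos < 8)
        {
            int attnum = bytenum * 8 + pos;

            return attnum < natts ? attnum : natts;
        }
    }
    return natts;   /* no NULLs */
}
```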
Results with 0004: Apple M2 averaged 53% faster than master (or ~63% excluding 0-extra-column tests). Andres suggested pg_nounroll and pg_novector pragmas to prevent GCC from over-vectorizing populate_isnull_array(), which was generating poor code.
v9 (February 24, 2026)
- Resequenced patches: deform_bench moved to 0001 for easier master benchmarking
- 0004 (new): Sibling-call optimization in slot_getsomeattrs—moved slot_getmissingattrs() into getsomeattrs() so the compiler can apply tail-call optimization. Reduces overhead and improves 0-extra-column tests
- 0005 (new): Shrink CompactAttribute from 16 to 8 bytes—attcacheoff becomes int16 (offsets above PG_INT16_MAX are not cached), booleans become bitflags. Andres noted the 8-byte size lets the compiler use a single LEA with scale factor 8; 6 bytes would require two LEA instructions
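One plausible 8-byte layout (field names follow the thread, but the widths and ordering here are guesses for illustration, not the patch's actual definition):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bitflags replacing four separate bool fields. */
#define CA_ATTISPACKABLE  (1 << 0)
#define CA_ATTHASMISSING  (1 << 1)
#define CA_ATTISDROPPED   (1 << 2)
#define CA_ATTGENERATED   (1 << 3)

typedef struct ToyCompactAttribute
{
    int16_t attcacheoff;  /* cached offset, or -1; > PG_INT16_MAX not cached */
    int16_t attlen;       /* fixed width, or -1 (varlena) / -2 (cstring) */
    uint8_t flags;        /* CA_* bitflags */
    uint8_t attalignby;   /* alignment in bytes */
    bool    attbyval;
    bool    attnotnull;
} ToyCompactAttribute;

/* At 8 bytes, indexing an array of these compiles to a single LEA
 * with scale factor 8; a 6-byte struct would need two instructions. */
_Static_assert(sizeof(ToyCompactAttribute) == 8, "must stay 8 bytes");
```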
v10 (February 25, 2026) — Latest Patch Set
Based on the actual v10 patch content:
0003 (Optimize tuple deformation):
- firstNonCachedOffsetAttr: index of the first attribute without a cached offset
- firstNonGuaranteedAttr: index of the first nullable, missing, or !attbyval attribute. When deforming only up to this point, the code need not access HeapTupleHeaderGetNatts(tup)—a dependency reduction that helps the CPU pipeline
- TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS: opt-in flag for the guaranteed-attribute optimization (some code deforms tuples before NOT NULL validation)
- populate_isnull_array(): uses SPREAD_BITS_MULTIPLIER_32 (0x204081) to spread each bit of the inverted NULL bitmap into a separate byte; processes the lower 4 and upper 4 bits separately to avoid uint64 overflow
- fetch_att_noerr(): variant of fetch_att() without elog for invalid attlen; safe when attlen comes from CompactAttribute
- first_null_attr(): finds the first NULL in the bitmap using pg_rightmost_one_pos32 or __builtin_ctz
0004 (Sibling-call optimization):
- getsomeattrs() is now responsible for calling slot_getmissingattrs()
- slot_getmissingattrs(): replaced memset with a for-loop (benchmarks showed the loop is faster)
- slot_deform_heap_tuple(): calls slot_getmissingattrs() at the end when attnum < reqnatts; parameter renamed from natts to reqnatts
0005 (8-byte CompactAttribute):
- attcacheoff → int16; offsets > PG_INT16_MAX are not cached
- Bitflags for attispackable, atthasmissing, attisdropped, attgenerated
- Stores cattrs = tupleDesc->compact_attrs in a local variable to help GCC generate better code (avoids repeated TupleDescCompactAttr() calls)
Review fixes:
- Amit Langote: Fixed rebase noise (duplicate attcacheoff check)
- Zsolt Parragi: Big-endian fix—pg_bswap64() before memcpy in populate_isnull_array()
- Typos: "benchmaring" → "benchmarking", "to info" → "into"
- Andres: Set *offp before slot_getmissingattrs to reduce stack spills; use size_t for attnum to fix GCC -fwrapv codegen
deform_bench and Benchmark Infrastructure
Andres and Álvaro Herrera discussed where to put deform_bench: src/test/modules/benchmark_tools, src/benchmark/tuple_deform, or a single extension for micro-benchmarks. Andres argued for merging useful tools incrementally rather than waiting for a full suite. David prefers to focus on the deformation patches first; deform_bench may be committed separately.
Code Review: Chao Li's Feedback
Chao Li reviewed the patch and raised several points:
- NULL bitmap mask: Add a comment to clarify there is no overflow/OOB risk when natts & 7 == 0
- Uninitialized TupleDesc: Initialize firstNonCachedOffAttr to -1 in TupleDesc creation; Assert >= 0 in nocachegetattr()
- Semantic consistency: Use 0 for "no cached attributes," > 0 for "some cached"
- Typo: "possibily" → "possibly"
David addressed all in v4.
Current Status
- v10 (February 2026) is the latest patch set: 0001 (deform_bench), 0002 (TupleDescFinalize stub), 0003 (main optimization), 0004 (sibling-call + NULL bitmap→isnull), 0005 (8-byte CompactAttribute)
- Andres Freund supports merging 0004 as a clear win; 0005's benefit is less certain (helps with LEA addressing when deforming few columns)
- Active review from Zsolt Parragi (Percona), Álvaro Herrera, John Naylor, Amit Langote
- deform_bench placement (src/test/modules vs. src/benchmark) still under discussion; David prefers to land the optimization patches first
Conclusion
Precalculating attcacheoff in TupleDescFinalize() and using a dedicated tight loop for attributes with cached offsets yields meaningful speedups for tuple deformation on modern CPUs. The optimization is most effective when tuples have many fixed-width columns and few or late NULLs. With Andres Freund's NULL-bitmap-to-isnull conversion (the "0x204081" SWAR trick), Apple M2 sees up to 63% speedup excluding edge cases. The sibling-call optimization in slot_getsomeattrs further reduces overhead. Results depend on hardware and compiler; GCC can over-vectorize some loops, addressed with pragmas or size_t for loop indices. The patch set (v10) has been refined through extensive review from Andres, John Naylor, Zsolt Parragi, Álvaro Herrera, and Amit Langote, and is progressing toward integration.
References
- Thread: More speedups for tuple deformation
- Related v18 work: 5983a4cff (CompactAttribute), db448ce5a (faster offset aligning), 58a359e58 (inline deforming loops)