Performance¶

rustfits aims to be as fast or faster than fitsio on every benchmark it shares with that library, plus offer capabilities (bounded-memory extend builds, transparent ZTABLE read/write) that fitsio doesn’t have. This page is a snapshot of where things stand.

Ratios in the tables below are fitsio_time / rustfits_time — greater than 1.0 means rustfits is faster. Cells are colorized so the overall pattern is visible at a glance: green for ≥ 1.10× faster, yellow for ~par (0.95–1.10×), red for slower (< 0.95×).

Numbers below are point-in-time measurements from one Linux x86_64 machine. They illustrate the rough shape of the comparison (rustfits typically 1.0×–2.5× cfitsio on common workloads, with a few cases at 30×+ where structural choices give a large win). Your mileage will vary with CPU, disk, file size, and data content — re-run the benchmarks yourself for numbers that match your hardware:

# release build required for representative timings
maturin develop --release

# full sweep (~5 minutes; --skip extend to skip the RSS benches)
python perf/perf-all.py --skip extend

The benchmark scripts live under perf/ in the source tree; each is a standalone script with a docstring explaining its methodology. The runner perf-all.py collects every script’s results into the two summary tables below. Refresh both tables with:

python perf/perf-all.py \
    --rst-out-xtool docs/tutorial/_perf_tables_xtool.rst \
    --rst-out-self  docs/tutorial/_perf_tables_self.rst

How comparisons are timed¶

The conventions every perf-*.py script follows so the numbers mean what they claim:

Release build. Debug builds read ~7× slower than release and produce misleading “rustfits is slower” results. maturin develop --release is required.
Fresh open per timed iteration. fitsio caches decoded compressed tiles forever; timing repeated reads of one open handle would measure fitsio’s cache hits against rustfits’s bounded LRU re-decoding (backward from the real workload, which reads each tile once). Both tools get a fresh open per iter so caches start empty.
Warmup primes the FS cache. The first read off disk is I/O-bound and washes out the comparison; a warmup pass loads the compressed bytes into the OS page cache so timed passes measure decoder + Python boundary speed, not disk.
Fresh FILE per timed iter for write benches. Overwriting a large file in a tight loop triggers a kernel page-cache penalty that masks the realistic single-file-per-program-run pattern. Write benches generate a unique filename per iter via h.fresh_path.
Median of 5. GC is paused around each timed call; samples are sorted and the median is reported.

What’s measured¶

Cross-tool comparisons — rustfits vs fitsio for every benchmark fitsio’s Python API supports. Each row is one (script, operation) pair; the vs fitsio column is fitsio_time / rustfits_time (so > 1.0 means rustfits is faster).
Self-comparisons — rustfits compressed vs rustfits uncompressed for ZTABLE read/write (fitsio’s Python API does not decompress ZTABLE, so cross-tool isn’t possible). The ratio is the compression cost in time, not a tool comparison.
Build wall + peak RSS — extend benchmarks measure bounded-memory incremental builds vs whole-array write-once. Each build runs in its own subprocess so ru_maxrss is a clean per-build high-water mark.

Cross-tool comparisons (rustfits vs fitsio)¶

Every benchmark whose operation fitsio also implements. Each row’s vs fitsio cell is fitsio_time / rustfits_time, colorized green / yellow / red as described above. Tables are grouped by script; the per-script subtitle supplies the data shape so the operation labels make sense in isolation.

Last generated 2026-05-30 16:28 UTC. Environment: Python 3.14.4, numpy 2.4.3, rustfits ?, fitsio 1.3.0, astropy 7.2.0, x86_64 / Linux.

These are point-in-time measurements. The benchmarks under perf/ are the source of truth — re-run python perf/perf-all.py for current numbers on your hardware.

Uncompressed 1-D image read

1-D f8 image, 64 M pixels (~512 MB on disk).

Uncompressed 1-D image read — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
chunk 1000 (partial, 8M)	19.10 ms	34.35 ms	1.80×	3,351 MB/s
chunk 50000 (whole)	54.62 ms	64.36 ms	1.18×	9,374 MB/s
whole .read()	215.50 ms	225.27 ms	1.05×	2,376 MB/s

Uncompressed 2-D image read

2-D f4 image, 4000×4000 pixels (64 MB on disk).

Uncompressed 2-D image read — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
stamps 32x32 random (x1000)	17.30 ms	43.96 ms	2.54×	237 MB/s
whole .read()	29.68 ms	30.03 ms	1.01×	2,156 MB/s

Uncompressed 1-D image write

1-D f8 image, 64 M pixels (~512 MB on disk); fresh file per timed iteration.

Uncompressed 1-D image write — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
write raw (512 MB f8)	166.92 ms	233.70 ms	1.40×	3,067 MB/s

Uncompressed 2-D image write

2-D f4 image, 4000×4000 pixels (64 MB on disk); fresh file per timed iteration.

Uncompressed 2-D image write — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
write raw (64 MB f4)	23.20 ms	30.74 ms	1.32×	2,758 MB/s

Compressed 1-D image read (GZIP_2, healsparse-like)

1-D f8 healsparse map with long quantized runs; ~537 MB raw, ~26 MB compressed (GZIP_2, 1 MiB tiles).

Compressed 1-D image read (GZIP_2, healsparse-like) — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
chunk 1000 (partial, 8 tiles)	64.42 ms	2.989 s	46.39×	1,042 MB/s
chunk 1048576 = tile (whole array)	406.84 ms	1.040 s	2.56×	1,320 MB/s
chunk 50000 (whole array)	392.71 ms	1.472 s	3.75×	1,367 MB/s

Compressed 1-D image read (GZIP_2, random data)

1-D f8 array of pure noise (worst case for any codec); GZIP_2 with 1 MiB tiles, ~512 MB raw / ~485 MB compressed.

Compressed 1-D image read (GZIP_2, random data) — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
chunk 1000 (partial, 8 tiles)	106.45 ms	3.094 s	29.07×	630 MB/s
chunk 1048576 = tile (whole array)	764.15 ms	1.698 s	2.22×	703 MB/s
chunk 50000 (whole array)	740.14 ms	2.130 s	2.88×	725 MB/s

Compressed 1-D image write (GZIP_2, healsparse-like)

Same input as the healsparse read bench, lossless GZIP_2 encode (level matched to cfitsio’s Z_BEST_SPEED=1).

Compressed 1-D image write (GZIP_2, healsparse-like) — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
write GZIP_2 level=1 (537 MB f8)	625.81 ms	1.495 s	2.39×	858 MB/s

Compressed 2-D image read (RICE_1 + dither2, DES-like)

2-D f4 image with ~5% masked zeros; RICE_1 + Quantize(level=16, method='dither2'), 100×100 tiles, 4000×4000 pixels. Stamps = 1000 random 32×32 windows; whole = all tiles in disk order (band walk).

Compressed 2-D image read (RICE_1 + dither2, DES-like) — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
stamps 32x32 random (x1000)	85.27 ms	134.86 ms	1.58×	48 MB/s
whole file, tiles in order (40 bands)	112.32 ms	125.43 ms	1.12×	570 MB/s

Compressed 2-D image read — codec isolation sweep

Same shape as DES, but varying ONE factor per row (codec, quantization, tile size) so the gap from each contributing factor is visible. Helps separate “decode is slow” from “dequant is slow”.

Compressed 2-D image read — codec isolation sweep — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
gzip2 f4 quant (5.0x) [stamps]	97.09 ms	180.95 ms	1.86×	42 MB/s
gzip2 f4 quant [whole]	141.65 ms	166.68 ms	1.18×	452 MB/s
gzip2 f4 unquant (1.1x) [stamps]	155.90 ms	332.88 ms	2.14×	26 MB/s
gzip2 f4 unquant [whole]	227.50 ms	309.79 ms	1.36×	281 MB/s
rice f4 quant (DES) (4.7x) [stamps]	81.50 ms	128.03 ms	1.57×	50 MB/s
rice f4 quant (DES) [whole]	106.60 ms	117.59 ms	1.10×	600 MB/s
rice f4 quant t=1000 (4.7x) [stamps]	117.41 ms	5.575 s	47.49×	35 MB/s
rice f4 quant t=1000 [whole]	121.39 ms	120.90 ms	1.00×	527 MB/s
rice i4 lossless (4.7x) [stamps]	57.97 ms	114.06 ms	1.97×	71 MB/s
rice i4 lossless [whole]	80.54 ms	105.38 ms	1.31×	795 MB/s

Compressed 2-D image write (RICE_1 + dither2, DES-like)

Same input/shape as the DES read bench; RICE_1 + dither2.

Compressed 2-D image write (RICE_1 + dither2, DES-like) — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
write RICE q=16.0 dither2 (64 MB f4)	566.02 ms	565.63 ms	1.00×	113 MB/s

Uncompressed BINTABLE read

Type-exhaustive 34-column catalog (every scalar type, 1-D + 2-D f4/f8 sub-arrays, S and U strings fixed + VLA, an f4 VLA); 500 k rows, ~306 MB.

Uncompressed BINTABLE read — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
column subset (3 cols)	30.55 ms	58.16 ms	1.90×	262 MB/s
row slice [125000:375000]	602.53 ms	780.70 ms	1.30×	254 MB/s
scattered rows (x2000)	8.12 ms	20.51 ms	2.52×	151 MB/s
whole table	1.210 s	1.484 s	1.23×	253 MB/s

Uncompressed BINTABLE write

Same catalog as the BINTABLE read bench; fresh file per timed iteration.

Uncompressed BINTABLE write — vs fitsio¶
operation	rustfits	fitsio	vs fitsio	rustfits rate
write catalog	897.53 ms	2.450 s	2.73×	341 MB/s

Open file with many HDUs

Synthetic files with N empty image HDUs (N=100/1000/10000). Two regimes per N: bare open (rustfits eager parse vs fitsio lazy noop – different constructor contracts, rustfits looks slower) and open + walk all (apples-to-apples; both walk every HDU). The note column carries per-HDU normalized time – flat across N confirms linear scaling (cfitsio had a quadratic-on-open bug here).

Open file with many HDUs — vs fitsio¶
operation	rustfits	fitsio	vs fitsio
bare open (N=1,000)	3.59 ms	25.5 us	0.01×
bare open (N=10,000)	37.10 ms	25.6 us	0.00×
bare open (N=100)	384.2 us	24.4 us	0.06×
open + walk all (N=1,000)	3.57 ms	5.58 ms	1.56×
open + walk all (N=10,000)	36.96 ms	59.28 ms	1.60×
open + walk all (N=100)	373.6 us	579.0 us	1.55×

Cross-tool tally¶

Across all cross-tool comparisons: 31/38 faster (≥ 1.10×), 4/38 ≈ par (0.95–1.10×), 3/38 slower (< 0.95×).

Self-comparisons (rustfits self)¶

Benchmarks that have no cross-tool equivalent — either because fitsio’s Python API can’t do the operation (ZTABLE read/write) or because the comparison is structural (incremental append / extend vs whole-array write). Numbers here are rustfits vs rustfits, so the headline is the trade-off (compression cost, append overhead) rather than a tool ranking.

Incremental table builds (`append` and the `appending()` context)¶

Building a catalog by N/K calls to hdu.append(K rows) is the natural pattern for streaming pipelines (per-frame source extraction, per-file harvest).

TL;DR for ZTABLE streaming pipelines: use the :meth:`CompressedTableHDU.appending` context manager. It makes the append-chunk size irrelevant — every chunk size lands within ~1-7% of the one-shot write_table cost:

with hdu.appending():
    for batch in batches:
        hdu.append(batch)

perf/perf-table-compressed-append-chunks.py (4×f4 schema, N=100,000 rows, ZTILELEN=10,000) shows the win directly:

regime	chunk	unbuffered `append`	`appending()` context
1% of tile	100	48.91× time	1.07× time (46× win)
10% of tile	1,000	5.40× time	1.00× time (5.4× win)
50% of tile	5,000	1.53× time	0.99× time
exact tile	10,000	1.02× time	1.00× time
2 tiles	20,000	1.01× time	0.99× time

Without the context, ZTABLE append pays a partial-trailing- tile decode + merge + re-encode on every call — fine when the chunk is at least one ZTILELEN, expensive when it’s smaller. The context buffers row batches in RAM and drains them in ZTILELEN-aligned bursts (cap at 32 MB, MAX_PENDING_BYTES in src/hdu_table_compressed/extending.rs), collapsing N partial-tile re-encodes into 1. For a 16-byte row schema the cap allows 2 M rows per drain — a typical streaming pipeline does 0-1 mid-context drains plus one residual drain at __exit__.

extending() is a symmetric alias for appending(); extend() is the same alias on the append call itself. Strict context semantics (the same as CompressedImageHDU.extending()): only append() / extend() is permitted inside the with block; any read() / __getitem__ / write() / __setitem__ / repack() / add_checksum() / verify_checksum() raises ValueError (exit the context first). Uncompressed TableHDU.append doesn’t need the context — no merge tax — but a no-op appending() / extending() on the uncompressed types is on the roadmap so generic code that doesn’t know the HDU type works uniformly.

The rest of this section is the underlying behavior: how append itself performs across the four variants (the context wraps it, so understanding the per-call cost is still useful when ZTILELEN is small or the streaming-loop budget is tight).

perf/perf-table-append.py measures wall time + peak RSS of the per-call append loop against the equivalent one-shot write_table(N rows) across four variants: {uncompressed, ZTABLE} × {fixed-only, with VLA}. Each build runs in its own subprocess for a clean per-build ru_maxrss.

Sample numbers from the reference machine (N=100,000 rows, 34-column type-exhaustive catalog at ~600 B/row, ZTILELEN ≈ 16 k rows). The first table is the rustfits self-comparison (append vs write-once across all four variants); the second is the rustfits-vs-fitsio cross- tool comparison on the uncompressed variants, where pairing is possible (fitsio’s Python API cannot write ZTABLE).

In the self-comparison table, vs rf write-once reports each row’s wall time / RSS divided by the rustfits write-once measurement for that variant; values < 1 mean the row was faster / lighter than the one-shot rustfits write.

Table append (N=100,000) — rustfits self¶
variant	regime	build	peak RSS	vs rf write-once
uncompressed, fixed-only	rustfits write-once	49.6 ms	163 MB	(ref)
	rustfits append C=1k (K=100)	63.1 ms	163 MB	1.27× time, ≈ RAM
	rustfits append C=10k (K=10)	50.5 ms	163 MB	1.02× time, ≈ RAM
uncompressed, with VLA	rustfits write-once	174.8 ms	216 MB	(ref)
	rustfits append C=1k (K=100)	199.9 ms	165 MB	1.14× time, 1.3× less RAM
	rustfits append C=10k (K=10)	156.4 ms	165 MB	0.89× time, 1.3× less RAM
ZTABLE, fixed-only	rustfits write-once	2.03 s	174 MB	(ref)
	rustfits append C=1k (K=100)	20.45 s	163 MB	10.1× time, ≈ RAM
	rustfits append C=10k (K=10)	3.73 s	163 MB	1.8× time, ≈ RAM
ZTABLE, with VLA	rustfits write-once	5.63 s	213 MB	(ref)
	rustfits append C=1k (K=100)	29.70 s	170 MB	5.3× time, 1.3× less RAM
	rustfits append C=10k (K=10)	7.94 s	173 MB	1.4× time, 1.2× less RAM

Cross-tool comparison on the uncompressed variants, paired row-by-row (vs fitsio = fitsio_time / rustfits_time; > 1.0 means rustfits is faster):

Table append (N=100,000) — rustfits vs fitsio (uncompressed)¶
variant	operation	rustfits	fitsio	vs fitsio
fixed-only	write-once	49.6 ms	166.7 ms	3.36×
	append C=1k (K=100)	63.1 ms	171.0 ms	2.71×
	append C=10k (K=10)	50.5 ms	153.9 ms	3.05×
with VLA	write-once	174.8 ms	495.3 ms	2.83×
	append C=1k (K=100)	199.9 ms	40.08 s	200× (see note)
	append C=10k (K=10)	156.4 ms	32.79 s	210× (see note)

Five things to take away from the per-call append numbers above (the appending() context wraps these, so it inherits the wins and erases the losses):

Uncompressed write-once: rustfits is ~3× fitsio. The one-shot write_table call beats fitsio’s equivalent by 3.36× on the fixed-only variant and 2.83× on the VLA variant. (fitsio doesn’t write ZTABLE.)
Uncompressed append is essentially free. At chunk=10 k the fixed-only rustfits append loop runs within 3 % of the one-shot rustfits write, and the VLA append loop is faster than the one-shot write (the write-once path plans the whole VLA heap layout in RAM up front; the append loop pays it incrementally).
rustfits append crushes fitsio append on VLA — ~200×. On the fixed-only variant rustfits append is ~2.7–3.0× faster than fitsio append (in line with the write-once ratio). On VLA the gap blows out to 200× faster at chunk=1k and 210× faster at chunk=10k — fitsio takes ~33–40 s where rustfits takes ~160–200 ms. See the note below for the root cause.
VLA append wins on peak RSS. The rustfits whole-table write holds ~216 MB resident; the append loop holds ~165 MB. That’s the bounded-memory story repeating from the image side — incremental write keeps live memory near the chunk size instead of the full output.
ZTABLE ``append`` (without the context) has a chunk-size cliff at ZTILELEN. At chunk=10 k (≈ default ZTILELEN of ~17 k for this 588 B row width), append is only 1.8× the rustfits ZTABLE write-once for fixed-only and 1.4× for VLA. At chunk=1 k (well below ZTILELEN), the partial-last-tile re-encode tax kicks in: every append decompresses + merges into the trailing tile and re-encodes it, costing 10× for fixed-only and 5× for VLA. The fix is the appending() context (lead of this section); the alternative is to pick chunks that are an integer multiple of ZTILELEN.

Note

The 1.8× / 1.4× / 10× / 5× numbers above reflect a 2026-05-31 fix in create_table_hdu: prior to that commit, the nrows=0 streaming-create pattern (create_table_hdu(nrows=0, compress=True[, ztilelen=K]) + repeated append(chunk)) silently collapsed ZTILELEN to 1 (regardless of what the user passed), forcing every appended row into its own single-row-per-tile, independently-gzip-compressed tile. The pre-fix numbers were 14–15× for fixed-only and 6–7× for VLA at any chunk size, since the bug masked the true per-call cost. Resolved in CLAUDE.md TODO #10 (commit 957f94a).

Note

Known fitsio issue behind the 200× VLA-append gap. fitsio’s write_var_column Python wrapper calls fits_flush_file after every per-column write (fitsio_pywrap.c, line 2710), and cfitsio’s fits_flush_file (ffflus in buffers.c) is close-current-HDU + flush-buffers + re-open-current-HDU. Each reopen re-walks the header and re-parses column descriptors. With 3 VLA columns and 100 appends at chunk=1k that’s 300 close-and-reopen cycles — about 40 s of overhead unrelated to the actual data write. The fix is in fitsio (the underlying cfitsio fits_write_col doesn’t need the flush); the gap will close once that wrapper is patched.

2-D image extend — uncompressed mosaic build¶

The mosaic / strip-build pattern: append per-detector frames (or per-night strips) to a growing image via hdu.extend(strip) (rustfits) or hdu.write(strip, start=(row, 0)) (fitsio). Both grow the slowest-varying axis — same primitive, two APIs. Unlike ZTABLE/ZIMAGE, fitsio CAN extend uncompressed images, so this is a true cross-tool comparison. perf/perf-image-extend-2d.py measures wall + peak RSS for write-once vs extend at two chunk sizes (100 rows and 1000 rows) on a (20,000 × 4,000) f4 image (~320 MB).

Uncompressed 2-D image extend (20 k × 4 k f4) — N=320 MB¶
regime	build	peak RSS	vs rf write-once
fitsio write-once	174.7 ms	440 MB	1.48× time, 1.2× more RAM (fitsio)
rustfits write-once	117.9 ms	363 MB	(ref)
rustfits extend C=100 rows (K=200)	119.0 ms	70 MB	1.01× time, 5.2× less RAM
fitsio extend C=100 rows (K=200)	231.4 ms	70 MB	1.96× time (fitsio), 5.2× less RAM
rustfits extend C=1000 rows (K=20)	117.1 ms	74 MB	0.99× time, 4.9× less RAM
fitsio extend C=1000 rows (K=20)	239.7 ms	74 MB	2.03× time (fitsio), 4.9× less RAM

Three takeaways:

Bounded memory works in both tools. Either extend path keeps peak RSS at ~70–75 MB regardless of the final image size (dominated by Python + numpy baseline; the actual chunk’s data is only 1.6–16 MB). write-once needs the whole image resident, plus per-tool overhead. Both tools deliver the same ~5× RAM win.
rustfits extend ≈ rustfits write-once on time. Both chunk sizes match the write-once baseline within 2 %. No per-call overhead worth worrying about — for uncompressed incremental builds the only cost vs write-once is the bookkeeping for the NAXIS2 header update.
fitsio extend is ~2× slower than rustfits extend. At both chunk sizes fitsio’s per-extend cost is roughly double rustfits’ — same ratio as the write-once gap (1.48×), so this is a general fitsio per-call overhead rather than something specific to start= writes. fitsio also holds ~80 MB more RAM during write-once (440 vs 363 MB) — suggesting it keeps an extra byteswapped copy that rustfits avoids.

2-D compressed image extend — mosaic build with GZIP_2¶

Same shape as the uncompressed bench (20,000 × 4,000 f4, ~320 MB raw) but tile-compressed with GZIP_2 and tile shape (100, cols) so the chunk-row axis maps directly to “tile-rows per append”. fitsio cannot extend a compressed image (cfitsio returns status = 107: tried to move past end of file on the second write), so the extend rows are rustfits-only; fitsio appears only as a write-once reference. perf/perf-compressed-image-extend-2d.py.

Compressed 2-D image extend (20 k × 4 k f4, GZIP_2 tile=(100, cols))¶
regime	build	peak RSS	vs rf write-once
fitsio write-once	6.44 s	440 MB	2.98× time (fitsio)
rustfits write-once	2.17 s	626 MB	(ref)
rustfits extend C=50 rows (K=400, sub-tile)	48.30 s	454 MB	22.3× time, 1.4× less RAM
rustfits extending() C=50 rows (K=400, sub-tile)	2.73 s	364 MB	1.26× time, 1.7× less RAM
rustfits extend C=100 rows (K=200, exact tile)	16.68 s	321 MB	7.7× time, 1.9× less RAM
rustfits extending() C=100 rows (K=200, exact tile)	2.74 s	365 MB	1.26× time, 1.7× less RAM
rustfits extend C=1000 rows (K=20, 10 tiles)	3.34 s	338 MB	1.54× time, 1.8× less RAM
rustfits extending() C=1000 rows (K=20, 10 tiles)	2.49 s	394 MB	1.15× time, 1.6× less RAM

Four takeaways:

Multi-tile chunks are nearly free. At chunk=1000 rows (10 tiles per call) extend is 1.54× write-once — the per-extend overhead is one heap-relocate-forward and a PCOUNT bump, amortized across the 10 tiles’ encode work. For mosaic builds that can buffer multi-tile strips, this is the regime to aim for.
Exact-tile chunks are moderate. At chunk=100 rows (1 tile per call) extend costs 7.7× write-once — the per-call overhead dominates because there’s only one tile’s worth of “real” encode work per call but the same bookkeeping.
Sub-tile chunks pay heavily. At chunk=50 rows (½ tile) every append decompresses + merges into the partial last tile then re-encodes it: 22.3× write-once — the same mirror-pattern as the ZTABLE small-chunk re-encode finding in the table-append section above. For compressed 2-D mosaic builds, align chunks to a multiple of tile-rows (or buffer to that size in user code) to skip the re-encode tax — or use the extending() context (takeaway 4) to do the buffering automatically.
The ``extending()`` context manager wins at every chunk size. Wrapping the extend loop in with hdu.extending(): buffers chunks in RAM and drains in tile-aligned bursts (triggered by a 32 MB soft cap), collapsing N partial-tile re-encodes into 1:
```
with hdu.extending():
    for batch in batches:
        hdu.extend(batch)
```
At sub-tile chunks the speedup is dramatic: 22.3× → 1.26× write-once (18× faster than the unbuffered extend) AND peak RSS drops to 364 MB — 1.7× less RAM than write-once because the cap bounds the buffer below what write-once’s whole-array path holds. Even at exact-tile chunks the context helps (7.7× → 1.26×). At multi-tile chunks it’s a modest win (1.54× → 1.15×).

Memory cap. The mid-context drain is triggered at MAX_PENDING_BYTES = 32 MB of accumulated input, chosen as a soft cap that fits many tiles per drain while staying small enough to not balloon RSS during long streaming loops. Sized as a code constant; not currently user-configurable — if a workload needs a different cap please open an issue.

Strict context semantics. Inside with hdu.extending(): only extend() is permitted; any read(), __getitem__, write(), __setitem__, repack(), add_checksum(), or verify_checksum() call raises ValueError. Exit the context first. The natural nested-with pattern with FITS composes correctly (Python guarantees the inner __exit__ runs before the outer close), so this restriction only surfaces in mixed-operation loops that should be restructured anyway.

Scattered reads on compressed images — tune the tile cache¶

The compressed-read benches above all walk the file in chunks of increasing size — sequential access, where the tile cache holds whatever was most recently decoded. Scattered / random-access workloads behave differently: every read may revisit a tile decoded much earlier, so cache hit rate dominates. perf/perf-compressed-image-read-1d-scattered.py measures 1000 random 1000-row windows against a 64-tile 1-D f8 GZIP_2 image (537 MB raw, 15 MB compressed, healsparse-like). With random sampling across 64 tiles, every tile is touched ~16× per iteration, so the cache size you pick matters a lot.

The two regimes:

Scattered compressed-1D read (1000 × 1k-row windows, 64-tile f8 GZIP_2)¶
cache	rustfits	fitsio	rf vs fi	note
default (rf=32 MB, fi=unbounded)	5.34 s	1.73 s	0.32×	rf holds 4/64 tiles → thrashes
large (rf=528 MB, fi=unbounded)	483 ms	1.74 s	3.60×	both backends fully cache

What’s going on:

rustfits’s default cache is 32 MB (4 tiles for an 8 MB tile). On scattered access across 64 unique tiles, every read evicts something we still need — the workload collapses to “decode every read” and the headline ratio looks bad.
fitsio’s cache is unbounded — it caches every decoded tile forever, so on this small file it implicitly covers the whole working set with no user tuning. The unbounded cache is what makes fitsio look ~3× faster than rustfits in the default-cache regime.
At equal coverage, rustfits is 3.6× faster because the per-read lookup-and-slice path is leaner (same shape as the small-chunk wins in the sequential-read benches above).

If your access pattern is scattered, the fix is one line — size the tile cache to the working set:

hdu = fits[1]
hdu.set_tile_cache_size(500 * 1024 * 1024)   # 500 MB
for row in random_rows:
    chunk = hdu[row:row + 1000]
    ...

Important

rustfits’s cache is bounded for a reason. fitsio’s “remember every tile forever” works great when the compressed image fits in RAM (the case in this bench), but on a multi-GB compressed image fitsio’s unbounded cache will OOM — there’s no knob to bound it. rustfits’s bounded LRU degrades gracefully: above the cap, the per- tile decode cost replaces the cache hit, but the process keeps running. The cache size is a memory-vs-speed knob you tune for your workload; there’s no universally-good value.

Practical tuning rules of thumb (until a real workload disagrees):

Sequential / chunked reads: default 32 MB is fine — LRU naturally holds the few most-recent tiles, and the chunked-read benches already show large wins.
Scattered reads, file fits in RAM: bump set_tile_cache_size to roughly the size of the working set’s unique tiles (or the whole compressed-image data section if you can spare the RSS).
Scattered reads, file doesn’t fit in RAM: pick a cache size you can afford; the bench above shows that even at 256 MB / 32 tiles, partial coverage halves the cost vs. default.

Compressed-image `setitem` — per-tile re-encode tax¶

A hdu[selection] = value on a tile-compressed image decodes every tile the selection touches, modifies it in numpy, and re-encodes / appends to the heap. A “patch a few pixels” workflow (interactive masking, bad-pixel fix-up, hot-pixel flagging) therefore pays a full tile re-encode for every tile the selection covers — even a single-pixel write. perf/perf-compressed-image-setitem.py measures that per-call cost across all five algorithms + the unquantized / quantized float paths, with an uncompressed memcpy floor for reference.

Setup: (256 × 256) image with (32, 32) tiles (an 8×8 tile grid). Four selection shapes:

single pixel — one int per axis (touches 1 tile)
1 tile aligned — a 32×32 slice on a tile boundary (1 tile)
4 tiles spanning — an 8×8 slice straddling a tile corner (touches 4 tiles)
16 tiles aligned — a 4×4 block of tiles (16 tiles)

Per-call costs span 3 µs to 200 µs, which at the low end is below scheduler / cache / IRQ jitter on a typical Linux box (early prototypes of this bench wobbled 55% run-to-run on the 3 µs uncompressed-single-pixel row). The bench mitigates by auto-calibrating the inner loop size per (algo, sel): a 20-call probe estimates per-call cost, then the timed loop is sized so each window takes at least 100 ms. Result: 30 000 inner calls for the cheap rows down to 200 for the expensive ones, every row stable to within 1% run-to-run. GC is disabled during each timed window; reported value is the median across 5 windows. PCOUNT grows monotonically across iters but per-call cost is constant (__setitem__ just appends new tile bytes to the heap; it doesn’t repack).

Compressed-image `__setitem__` cost (256×256, tile 32×32)¶
algorithm	selection	tiles	per call	per tile	per pixel
uncompressed i4	single pixel	1	3.0 µs	3.0 µs	3.0 µs
uncompressed i4	1 tile aligned	1	17.5 µs	17.5 µs	0.0 µs
uncompressed i4	4 tiles spanning	4	6.7 µs	1.7 µs	0.1 µs
uncompressed i4	16 tiles aligned	16	64.8 µs	4.0 µs	0.0 µs
GZIP_1 i4	single pixel	1	99.5 µs	99.5 µs	99.5 µs
GZIP_1 i4	1 tile aligned	1	37.6 µs	37.6 µs	0.0 µs
GZIP_1 i4	4 tiles spanning	4	361.6 µs	90.4 µs	5.7 µs
GZIP_1 i4	16 tiles aligned	16	401.4 µs	25.1 µs	0.0 µs
GZIP_2 i4	single pixel	1	81.5 µs	81.5 µs	81.5 µs
GZIP_2 i4	1 tile aligned	1	40.8 µs	40.8 µs	0.0 µs
GZIP_2 i4	4 tiles spanning	4	263.0 µs	65.8 µs	4.1 µs
GZIP_2 i4	16 tiles aligned	16	452.4 µs	28.3 µs	0.0 µs
RICE_1 i4	single pixel	1	39.3 µs	39.3 µs	39.3 µs
RICE_1 i4	1 tile aligned	1	22.8 µs	22.8 µs	0.0 µs
RICE_1 i4	4 tiles spanning	4	103.2 µs	25.8 µs	1.6 µs
RICE_1 i4	16 tiles aligned	16	139.4 µs	8.7 µs	0.0 µs
HCOMPRESS_1 i4	single pixel	1	139.8 µs	139.8 µs	139.8 µs
HCOMPRESS_1 i4	1 tile aligned	1	31.0 µs	31.0 µs	0.0 µs
HCOMPRESS_1 i4	4 tiles spanning	4	461.4 µs	115.4 µs	7.2 µs
HCOMPRESS_1 i4	16 tiles aligned	16	275.3 µs	17.2 µs	0.0 µs
PLIO_1 i4	single pixel	1	18.8 µs	18.8 µs	18.8 µs
PLIO_1 i4	1 tile aligned	1	20.1 µs	20.1 µs	0.0 µs
PLIO_1 i4	4 tiles spanning	4	40.5 µs	10.1 µs	0.6 µs
PLIO_1 i4	16 tiles aligned	16	124.2 µs	7.8 µs	0.0 µs
GZIP_1 f4 unquantized	single pixel	1	93.2 µs	93.2 µs	93.2 µs
GZIP_1 f4 unquantized	1 tile aligned	1	37.4 µs	37.4 µs	0.0 µs
GZIP_1 f4 unquantized	4 tiles spanning	4	281.3 µs	70.4 µs	4.4 µs
GZIP_1 f4 unquantized	16 tiles aligned	16	402.6 µs	25.2 µs	0.0 µs
GZIP_1 f4 quantized	single pixel	1	195.4 µs	195.4 µs	195.4 µs
GZIP_1 f4 quantized	1 tile aligned	1	51.2 µs	51.2 µs	0.0 µs
GZIP_1 f4 quantized	4 tiles spanning	4	680.7 µs	170.2 µs	10.6 µs
GZIP_1 f4 quantized	16 tiles aligned	16	560.9 µs	35.1 µs	0.0 µs

Four takeaways:

Per-call cost is dominated by per-tile re-encode. A single-pixel write costs 19–200 µs depending on algorithm — 30–60× the uncompressed memcpy (3 µs). The user’s budget for “patch one pixel” is the per-tile cost of their chosen algorithm; touching N tiles costs roughly N × the per-tile rate plus a small constant.
PLIO_1 is fastest, quantized float is slowest. At one tile touched, PLIO costs ~20 µs / tile (it’s encoded as (value, run-length) pairs so decode + re-encode are trivial); quantized f4 costs ~200 µs / tile because every touched tile has to re-quantize against its existing per-tile bscale/bzero/seed. RICE_1 and HCOMPRESS_1 fall between (40–145 µs single-pixel).
Full-tile-aligned writes are CHEAPER than single-pixel writes. Counterintuitive but real: a “1 tile aligned” selection that covers the whole tile (32×32) is 20–50 µs across all algorithms, where a single-pixel write on the same tile is 40–200 µs. The reason is that an aligned full-tile write doesn’t need to decode the existing tile first — every pixel is being replaced, so the read-modify-write collapses to just write. Single-pixel writes pay the full decode + modify + encode cycle.
Per-pixel rate amortizes with selection size. For batched writes (the 16-tile column) the per-pixel rate drops to <0.1 µs / pixel — comparable to uncompressed. If you can buffer your patches into rectangular tile-aligned blocks, the per-tile re-encode tax amortizes away. For true scattered single-pixel workflows there’s no shortcut (use PLIO_1 if your data fits its non-negative integer constraint).

Practical guidance: for interactive masking on a compressed image, budget ~50–200 µs per single-pixel touch (algorithm- dependent), so up to a few thousand single-pixel patches per second. For bulk masking, group patches into rectangular selections that cover whole tiles where possible.

`repack()` — peak RSS scaling at large heap¶

Three repack flavors clean up heap orphans accumulated by __setitem__ / extend calls:

TableHDU.repack() for BINTABLE VLA columns,
CompressedImageHDU.repack() for ZIMAGE compressed images,
CompressedTableHDU.repack() for ZTABLE compressed tables.

They share a goal (rebuild the heap with only live bytes, shrink the file) but the implementations differ. This bench measures the peak RSS at 10 / 100 / 1000 MB heap sizes for each. Each fixture is built in the parent process; the repack runs in a fresh subprocess so the reported peak RSS is the repack alone, read via /proc/self/status:VmHWM (the kernel’s per-task high-water). Using resource.getrusage(RUSAGE_SELF).ru_maxrss here would silently report the parent’s peak, because on Linux ru_maxrss is accumulated in signal_struct and inherited across fork+exec; the shared h.vm_hwm_kb helper is the immune path.

`repack()` peak RSS vs PCOUNT¶
flavor	PCOUNT	repack t	peak RSS	notes
BINTABLE VLA (uncompressed)	10 MB	8.2 ms	50 MB	streaming + staging impl
BINTABLE VLA	100 MB	16.4 ms	50 MB	flat — no PCOUNT scaling
BINTABLE VLA	1,000 MB	94.2 ms	50 MB	flat at 1 GB heap
ZIMAGE compressed image	11 MB	2.0 ms	49 MB	streaming + staging impl
ZIMAGE compressed image	119 MB	19.6 ms	49 MB	flat — no PCOUNT scaling
ZIMAGE compressed image	1,215 MB	189.6 ms	50 MB	flat at 1 GB heap
ZTABLE compressed table	7 MB	1.8 ms	50 MB	streaming + staging impl
ZTABLE compressed table	102 MB	18.3 ms	50 MB	flat — no PCOUNT scaling
ZTABLE compressed table	1,049 MB	160.4 ms	50 MB	flat at 1 GB heap

Two takeaways:

All three repack flavors are bounded-memory. RSS stays constant at ~50 MB (Python + rustfits baseline) regardless of heap size, from 10 MB to 1 GB, for all three: BINTABLE VLA, ZIMAGE, and ZTABLE. All three share the same streaming + staging implementation (a ~1 MiB chunk plus the descriptor table + a small move-plan vector; documented in CLAUDE.md under “Heap repack”). The shared primitives stream_copy_in_file and grow_file_to_at_least live in src/common.rs.

The rewrite cycle (2026-05-31) was driven by a real in-place-modify-large-compressed-image workload needing repack() on a memory-constrained worker. Before the rewrite, BINTABLE VLA repack scaled at ~1.05× PCOUNT and ZIMAGE at ~1.5× PCOUNT. The numbers:

Repack rewrite — 1 GB heap, before vs after¶
flavor	before	after	improvement
BINTABLE VLA	1.05 GB peak, 451 ms	50 MB peak, 94 ms	21× less RAM, 4.8× faster
ZIMAGE	1.79 GB peak, 819 ms	50 MB peak, 190 ms	36× less RAM, 4.3× faster
ZTABLE	(already streaming)	50 MB peak, ~160 ms	(unchanged)

The time win on the two new streaming paths comes from eliminating the entire Vec-allocate + memcpy + drop cycle on the heap bytes; the streaming approach reads + writes 1 MiB chunks straight through the file handle.

Other self-comparisons + RSS benches¶

ZTABLE read/write (rustfits compressed vs rustfits uncompressed) and the image-extend RSS benches.

Last generated 2026-05-30 16:28 UTC. Environment: Python 3.14.4, numpy 2.4.3, rustfits ?, fitsio 1.3.0, astropy 7.2.0, x86_64 / Linux.

These are point-in-time measurements. The benchmarks under perf/ are the source of truth — re-run python perf/perf-all.py for current numbers on your hardware.

Compressed BINTABLE read (ZTABLE; rustfits self)

Same catalog as the uncompressed BINTABLE bench. fitsio’s Python API does not decompress ZTABLE, so this is a rustfits self-comparison (compressed vs uncompressed timing).

Compressed BINTABLE read (ZTABLE; rustfits self)¶
operation	primary	primary t	secondary	secondary t	ratio
column subset (3 cols)	ZTABLE	23.60 ms	uncompressed	29.81 ms	0.79×
row slice [125000:375000]	ZTABLE	1.183 s	uncompressed	609.34 ms	1.94×
scattered rows (x2000)	ZTABLE	547.71 ms	uncompressed	8.19 ms	66.84×
whole table	ZTABLE	2.321 s	uncompressed	1.214 s	1.91×

Compressed BINTABLE write (ZTABLE; rustfits self)

Same catalog. Self-comparison (no other Python library writes ZTABLE).

Compressed BINTABLE write (ZTABLE; rustfits self)¶
operation	primary	primary t	secondary	secondary t	ratio
write catalog	ZTABLE	28.236 s	uncompressed	907.24 ms	31.12×

Uncompressed 1-D image build — bounded-memory extend vs write-once

1 GB f8 map; ImageHDU.extend appends chunks instead of holding the whole array in RAM. fitsio cannot append; RSS comparison is rustfits-self plus fitsio’s write-once as a reference.

Uncompressed 1-D image build — bounded-memory extend vs write-once — build wall + peak RSS¶
regime	build	peak RSS
fitsio write-once	487.40 ms	1,205 MB
rustfits extend C=1,048,576 (K=128)	347.50 ms	66 MB
rustfits extend C=8,388,608 (K=16)	366.98 ms	118 MB
rustfits write-once	353.91 ms	1,078 MB

Compressed 1-D image build — bounded-memory extend vs write-once

Same shape as the compressed healsparse benches; CompressedImageHDU.extend appends chunks. fitsio cannot append compressed.

Compressed 1-D image build — bounded-memory extend vs write-once — build wall + peak RSS¶
regime	build	peak RSS
fitsio write-once	2.999 s	1,225 MB
rustfits extend C=1,048,576 (K=128)	3.025 s	129 MB
rustfits extend C=8,388,608 (K=16)	1.460 s	186 MB
rustfits write-once	1.974 s	1,145 MB

EXTNAME lookup (`name in fits` / `fits[name]`) — linear in HDU count¶

name in fits and fits["NAME"] walk the HDU list until a match is found (or fall through and return False / raise on miss), reading and parsing each HDU’s EXTNAME card along the way. Cost scales linearly at roughly 0.14 µs per HDU on the reference machine — about 1.4 ms for a 10,000-HDU file when the name lives at the end (or doesn’t exist). fitsio caches a name→index dict and returns in ~10 µs regardless of HDU count.

For typical files (tens to a few hundred HDUs) the gap is invisible. Workloads that do many name lookups on a multi- thousand-HDU file will notice; integer indexing is O(1) in both libraries, so the workaround until this is fixed is to build the name→index dict yourself once:

name_to_idx = {h.extname: i for i, h in enumerate(fits) if h.extname}

This could be re-engineered to match fitsio (file-level name→index cache with version-stamped invalidation) if a real workload asks for it.