Performance

rustfits aims to be as fast or faster than fitsio on every benchmark it shares with that library, plus offer capabilities (bounded-memory extend builds, transparent ZTABLE read/write) that fitsio doesn’t have. This page is a snapshot of where things stand.

Ratios in the tables below are fitsio_time / rustfits_time — greater than 1.0 means rustfits is faster. Cells are colorized so the overall pattern is visible at a glance: green for ≥ 1.10× faster, yellow for ~par (0.95–1.10×), red for slower (< 0.95×).

Numbers below are point-in-time measurements from one Linux x86_64 machine. They illustrate the rough shape of the comparison (rustfits typically 1.0×–2.5× cfitsio on common workloads, with a few cases at 30×+ where structural choices give a large win). Your mileage will vary with CPU, disk, file size, and data content — re-run the benchmarks yourself for numbers that match your hardware:

# release build required for representative timings
maturin develop --release

# full sweep (~5 minutes; --skip extend to skip the RSS benches)
python perf/perf-all.py --skip extend

The benchmark scripts live under perf/ in the source tree; each is a standalone script with a docstring explaining its methodology. The runner perf-all.py collects every script’s results into the two summary tables below. Refresh both tables with:

python perf/perf-all.py \
    --rst-out-xtool docs/tutorial/_perf_tables_xtool.rst \
    --rst-out-self  docs/tutorial/_perf_tables_self.rst

How comparisons are timed

The conventions every perf-*.py script follows so the numbers mean what they claim:

  • Release build. Debug builds read ~7× slower than release and produce misleading “rustfits is slower” results. maturin develop --release is required.

  • Fresh open per timed iteration. fitsio caches decoded compressed tiles forever; timing repeated reads of one open handle would measure fitsio’s cache hits against rustfits’s bounded LRU re-decoding (backward from the real workload, which reads each tile once). Both tools get a fresh open per iter so caches start empty.

  • Warmup primes the FS cache. The first read off disk is I/O-bound and washes out the comparison; a warmup pass loads the compressed bytes into the OS page cache so timed passes measure decoder + Python boundary speed, not disk.

  • Fresh FILE per timed iter for write benches. Overwriting a large file in a tight loop triggers a kernel page-cache penalty that masks the realistic single-file-per-program-run pattern. Write benches generate a unique filename per iter via h.fresh_path.

  • Median of 5. GC is paused around each timed call; samples are sorted and the median is reported.

What’s measured

  • Cross-tool comparisons — rustfits vs fitsio for every benchmark fitsio’s Python API supports. Each row is one (script, operation) pair; the vs fitsio column is fitsio_time / rustfits_time (so > 1.0 means rustfits is faster).

  • Self-comparisons — rustfits compressed vs rustfits uncompressed for ZTABLE read/write (fitsio’s Python API does not decompress ZTABLE, so cross-tool isn’t possible). The ratio is the compression cost in time, not a tool comparison.

  • Build wall + peak RSS — extend benchmarks measure bounded-memory incremental builds vs whole-array write-once. Each build runs in its own subprocess so ru_maxrss is a clean per-build high-water mark.

Cross-tool comparisons (rustfits vs fitsio)

Every benchmark whose operation fitsio also implements. Each row’s vs fitsio cell is fitsio_time / rustfits_time, colorized green / yellow / red as described above. Tables are grouped by script; the per-script subtitle supplies the data shape so the operation labels make sense in isolation.

Last generated 2026-05-30 16:28 UTC. Environment: Python 3.14.4, numpy 2.4.3, rustfits ?, fitsio 1.3.0, astropy 7.2.0, x86_64 / Linux.

These are point-in-time measurements. The benchmarks under perf/ are the source of truth — re-run python perf/perf-all.py for current numbers on your hardware.

Uncompressed 1-D image read

1-D f8 image, 64 M pixels (~512 MB on disk).

Uncompressed 1-D image read — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

chunk 1000 (partial, 8M)

19.10 ms

34.35 ms

1.80×

3,351 MB/s

chunk 50000 (whole)

54.62 ms

64.36 ms

1.18×

9,374 MB/s

whole .read()

215.50 ms

225.27 ms

1.05×

2,376 MB/s

Uncompressed 2-D image read

2-D f4 image, 4000×4000 pixels (64 MB on disk).

Uncompressed 2-D image read — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

stamps 32x32 random (x1000)

17.30 ms

43.96 ms

2.54×

237 MB/s

whole .read()

29.68 ms

30.03 ms

1.01×

2,156 MB/s

Uncompressed 1-D image write

1-D f8 image, 64 M pixels (~512 MB on disk); fresh file per timed iteration.

Uncompressed 1-D image write — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

write raw (512 MB f8)

166.92 ms

233.70 ms

1.40×

3,067 MB/s

Uncompressed 2-D image write

2-D f4 image, 4000×4000 pixels (64 MB on disk); fresh file per timed iteration.

Uncompressed 2-D image write — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

write raw (64 MB f4)

23.20 ms

30.74 ms

1.32×

2,758 MB/s

Compressed 1-D image read (GZIP_2, healsparse-like)

1-D f8 healsparse map with long quantized runs; ~537 MB raw, ~26 MB compressed (GZIP_2, 1 MiB tiles).

Compressed 1-D image read (GZIP_2, healsparse-like) — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

chunk 1000 (partial, 8 tiles)

64.42 ms

2.989 s

46.39×

1,042 MB/s

chunk 1048576 = tile (whole array)

406.84 ms

1.040 s

2.56×

1,320 MB/s

chunk 50000 (whole array)

392.71 ms

1.472 s

3.75×

1,367 MB/s

Compressed 1-D image read (GZIP_2, random data)

1-D f8 array of pure noise (worst case for any codec); GZIP_2 with 1 MiB tiles, ~512 MB raw / ~485 MB compressed.

Compressed 1-D image read (GZIP_2, random data) — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

chunk 1000 (partial, 8 tiles)

106.45 ms

3.094 s

29.07×

630 MB/s

chunk 1048576 = tile (whole array)

764.15 ms

1.698 s

2.22×

703 MB/s

chunk 50000 (whole array)

740.14 ms

2.130 s

2.88×

725 MB/s

Compressed 1-D image write (GZIP_2, healsparse-like)

Same input as the healsparse read bench, lossless GZIP_2 encode (level matched to cfitsio’s Z_BEST_SPEED=1).

Compressed 1-D image write (GZIP_2, healsparse-like) — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

write GZIP_2 level=1 (537 MB f8)

625.81 ms

1.495 s

2.39×

858 MB/s

Compressed 2-D image read (RICE_1 + dither2, DES-like)

2-D f4 image with ~5% masked zeros; RICE_1 + Quantize(level=16, method='dither2'), 100×100 tiles, 4000×4000 pixels. Stamps = 1000 random 32×32 windows; whole = all tiles in disk order (band walk).

Compressed 2-D image read (RICE_1 + dither2, DES-like) — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

stamps 32x32 random (x1000)

85.27 ms

134.86 ms

1.58×

48 MB/s

whole file, tiles in order (40 bands)

112.32 ms

125.43 ms

1.12×

570 MB/s

Compressed 2-D image read — codec isolation sweep

Same shape as DES, but varying ONE factor per row (codec, quantization, tile size) so the gap from each contributing factor is visible. Helps separate “decode is slow” from “dequant is slow”.

Compressed 2-D image read — codec isolation sweep — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

gzip2 f4 quant (5.0x) [stamps]

97.09 ms

180.95 ms

1.86×

42 MB/s

gzip2 f4 quant [whole]

141.65 ms

166.68 ms

1.18×

452 MB/s

gzip2 f4 unquant (1.1x) [stamps]

155.90 ms

332.88 ms

2.14×

26 MB/s

gzip2 f4 unquant [whole]

227.50 ms

309.79 ms

1.36×

281 MB/s

rice f4 quant (DES) (4.7x) [stamps]

81.50 ms

128.03 ms

1.57×

50 MB/s

rice f4 quant (DES) [whole]

106.60 ms

117.59 ms

1.10×

600 MB/s

rice f4 quant t=1000 (4.7x) [stamps]

117.41 ms

5.575 s

47.49×

35 MB/s

rice f4 quant t=1000 [whole]

121.39 ms

120.90 ms

1.00×

527 MB/s

rice i4 lossless (4.7x) [stamps]

57.97 ms

114.06 ms

1.97×

71 MB/s

rice i4 lossless [whole]

80.54 ms

105.38 ms

1.31×

795 MB/s

Compressed 2-D image write (RICE_1 + dither2, DES-like)

Same input/shape as the DES read bench; RICE_1 + dither2.

Compressed 2-D image write (RICE_1 + dither2, DES-like) — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

write RICE q=16.0 dither2 (64 MB f4)

566.02 ms

565.63 ms

1.00×

113 MB/s

Uncompressed BINTABLE read

Type-exhaustive 34-column catalog (every scalar type, 1-D + 2-D f4/f8 sub-arrays, S and U strings fixed + VLA, an f4 VLA); 500 k rows, ~306 MB.

Uncompressed BINTABLE read — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

column subset (3 cols)

30.55 ms

58.16 ms

1.90×

262 MB/s

row slice [125000:375000]

602.53 ms

780.70 ms

1.30×

254 MB/s

scattered rows (x2000)

8.12 ms

20.51 ms

2.52×

151 MB/s

whole table

1.210 s

1.484 s

1.23×

253 MB/s

Uncompressed BINTABLE write

Same catalog as the BINTABLE read bench; fresh file per timed iteration.

Uncompressed BINTABLE write — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

write catalog

897.53 ms

2.450 s

2.73×

341 MB/s

Open file with many HDUs

Synthetic files with N empty image HDUs (N=100/1000/10000). Two regimes per N: bare open (rustfits eager parse vs fitsio lazy noop – different constructor contracts, rustfits looks slower) and open + walk all (apples-to-apples; both walk every HDU). The note column carries per-HDU normalized time – flat across N confirms linear scaling (cfitsio had a quadratic-on-open bug here).

Open file with many HDUs — vs fitsio

operation

rustfits

fitsio

vs fitsio

rustfits rate

bare open (N=1,000)

3.59 ms

25.5 us

0.01×

bare open (N=10,000)

37.10 ms

25.6 us

0.00×

bare open (N=100)

384.2 us

24.4 us

0.06×

open + walk all (N=1,000)

3.57 ms

5.58 ms

1.56×

open + walk all (N=10,000)

36.96 ms

59.28 ms

1.60×

open + walk all (N=100)

373.6 us

579.0 us

1.55×

Cross-tool tally

Across all cross-tool comparisons: 31/38 faster (≥ 1.10×), 4/38 ≈ par (0.95–1.10×), 3/38 slower (< 0.95×).

Self-comparisons (rustfits self)

Benchmarks that have no cross-tool equivalent — either because fitsio’s Python API can’t do the operation (ZTABLE read/write) or because the comparison is structural (incremental append / extend vs whole-array write). Numbers here are rustfits vs rustfits, so the headline is the trade-off (compression cost, append overhead) rather than a tool ranking.

Incremental table builds (append and the appending() context)

Building a catalog by N/K calls to hdu.append(K rows) is the natural pattern for streaming pipelines (per-frame source extraction, per-file harvest).

TL;DR for ZTABLE streaming pipelines: use the :meth:`CompressedTableHDU.appending` context manager. It makes the append-chunk size irrelevant — every chunk size lands within ~1-7% of the one-shot write_table cost:

with hdu.appending():
    for batch in batches:
        hdu.append(batch)

perf/perf-table-compressed-append-chunks.py (4×f4 schema, N=100,000 rows, ZTILELEN=10,000) shows the win directly:

regime

chunk

unbuffered append

appending() context

1% of tile

100

48.91× time

1.07× time (46× win)

10% of tile

1,000

5.40× time

1.00× time (5.4× win)

50% of tile

5,000

1.53× time

0.99× time

exact tile

10,000

1.02× time

1.00× time

2 tiles

20,000

1.01× time

0.99× time

Without the context, ZTABLE append pays a partial-trailing- tile decode + merge + re-encode on every call — fine when the chunk is at least one ZTILELEN, expensive when it’s smaller. The context buffers row batches in RAM and drains them in ZTILELEN-aligned bursts (cap at 32 MB, MAX_PENDING_BYTES in src/hdu_table_compressed/extending.rs), collapsing N partial-tile re-encodes into 1. For a 16-byte row schema the cap allows 2 M rows per drain — a typical streaming pipeline does 0-1 mid-context drains plus one residual drain at __exit__.

extending() is a symmetric alias for appending(); extend() is the same alias on the append call itself. Strict context semantics (the same as CompressedImageHDU.extending()): only append() / extend() is permitted inside the with block; any read() / __getitem__ / write() / __setitem__ / repack() / add_checksum() / verify_checksum() raises ValueError (exit the context first). Uncompressed TableHDU.append doesn’t need the context — no merge tax — but a no-op appending() / extending() on the uncompressed types is on the roadmap so generic code that doesn’t know the HDU type works uniformly.

The rest of this section is the underlying behavior: how append itself performs across the four variants (the context wraps it, so understanding the per-call cost is still useful when ZTILELEN is small or the streaming-loop budget is tight).

perf/perf-table-append.py measures wall time + peak RSS of the per-call append loop against the equivalent one-shot write_table(N rows) across four variants: {uncompressed, ZTABLE} × {fixed-only, with VLA}. Each build runs in its own subprocess for a clean per-build ru_maxrss.

Sample numbers from the reference machine (N=100,000 rows, 34-column type-exhaustive catalog at ~600 B/row, ZTILELEN ≈ 16 k rows). The first table is the rustfits self-comparison (append vs write-once across all four variants); the second is the rustfits-vs-fitsio cross- tool comparison on the uncompressed variants, where pairing is possible (fitsio’s Python API cannot write ZTABLE).

In the self-comparison table, vs rf write-once reports each row’s wall time / RSS divided by the rustfits write-once measurement for that variant; values < 1 mean the row was faster / lighter than the one-shot rustfits write.

Table append (N=100,000) — rustfits self

variant

regime

build

peak RSS

vs rf write-once

uncompressed, fixed-only

rustfits write-once

49.6 ms

163 MB

(ref)

rustfits append C=1k (K=100)

63.1 ms

163 MB

1.27× time, ≈ RAM

rustfits append C=10k (K=10)

50.5 ms

163 MB

1.02× time, ≈ RAM

uncompressed, with VLA

rustfits write-once

174.8 ms

216 MB

(ref)

rustfits append C=1k (K=100)

199.9 ms

165 MB

1.14× time, 1.3× less RAM

rustfits append C=10k (K=10)

156.4 ms

165 MB

0.89× time, 1.3× less RAM

ZTABLE, fixed-only

rustfits write-once

2.03 s

174 MB

(ref)

rustfits append C=1k (K=100)

20.45 s

163 MB

10.1× time, ≈ RAM

rustfits append C=10k (K=10)

3.73 s

163 MB

1.8× time, ≈ RAM

ZTABLE, with VLA

rustfits write-once

5.63 s

213 MB

(ref)

rustfits append C=1k (K=100)

29.70 s

170 MB

5.3× time, 1.3× less RAM

rustfits append C=10k (K=10)

7.94 s

173 MB

1.4× time, 1.2× less RAM

Cross-tool comparison on the uncompressed variants, paired row-by-row (vs fitsio = fitsio_time / rustfits_time; > 1.0 means rustfits is faster):

Table append (N=100,000) — rustfits vs fitsio (uncompressed)

variant

operation

rustfits

fitsio

vs fitsio

fixed-only

write-once

49.6 ms

166.7 ms

3.36×

append C=1k (K=100)

63.1 ms

171.0 ms

2.71×

append C=10k (K=10)

50.5 ms

153.9 ms

3.05×

with VLA

write-once

174.8 ms

495.3 ms

2.83×

append C=1k (K=100)

199.9 ms

40.08 s

200× (see note)

append C=10k (K=10)

156.4 ms

32.79 s

210× (see note)

Five things to take away from the per-call append numbers above (the appending() context wraps these, so it inherits the wins and erases the losses):

  1. Uncompressed write-once: rustfits is ~3× fitsio. The one-shot write_table call beats fitsio’s equivalent by 3.36× on the fixed-only variant and 2.83× on the VLA variant. (fitsio doesn’t write ZTABLE.)

  2. Uncompressed append is essentially free. At chunk=10 k the fixed-only rustfits append loop runs within 3 % of the one-shot rustfits write, and the VLA append loop is faster than the one-shot write (the write-once path plans the whole VLA heap layout in RAM up front; the append loop pays it incrementally).

  3. rustfits append crushes fitsio append on VLA — ~200×. On the fixed-only variant rustfits append is ~2.7–3.0× faster than fitsio append (in line with the write-once ratio). On VLA the gap blows out to 200× faster at chunk=1k and 210× faster at chunk=10k — fitsio takes ~33–40 s where rustfits takes ~160–200 ms. See the note below for the root cause.

  4. VLA append wins on peak RSS. The rustfits whole-table write holds ~216 MB resident; the append loop holds ~165 MB. That’s the bounded-memory story repeating from the image side — incremental write keeps live memory near the chunk size instead of the full output.

  5. ZTABLE ``append`` (without the context) has a chunk-size cliff at ZTILELEN. At chunk=10 k (≈ default ZTILELEN of ~17 k for this 588 B row width), append is only 1.8× the rustfits ZTABLE write-once for fixed-only and 1.4× for VLA. At chunk=1 k (well below ZTILELEN), the partial-last-tile re-encode tax kicks in: every append decompresses + merges into the trailing tile and re-encodes it, costing 10× for fixed-only and 5× for VLA. The fix is the appending() context (lead of this section); the alternative is to pick chunks that are an integer multiple of ZTILELEN.

    Note

    The 1.8× / 1.4× / 10× / 5× numbers above reflect a 2026-05-31 fix in create_table_hdu: prior to that commit, the nrows=0 streaming-create pattern (create_table_hdu(nrows=0, compress=True[, ztilelen=K]) + repeated append(chunk)) silently collapsed ZTILELEN to 1 (regardless of what the user passed), forcing every appended row into its own single-row-per-tile, independently-gzip-compressed tile. The pre-fix numbers were 14–15× for fixed-only and 6–7× for VLA at any chunk size, since the bug masked the true per-call cost. Resolved in CLAUDE.md TODO #10 (commit 957f94a).

Note

Known fitsio issue behind the 200× VLA-append gap. fitsio’s write_var_column Python wrapper calls fits_flush_file after every per-column write (fitsio_pywrap.c, line 2710), and cfitsio’s fits_flush_file (ffflus in buffers.c) is close-current-HDU + flush-buffers + re-open-current-HDU. Each reopen re-walks the header and re-parses column descriptors. With 3 VLA columns and 100 appends at chunk=1k that’s 300 close-and-reopen cycles — about 40 s of overhead unrelated to the actual data write. The fix is in fitsio (the underlying cfitsio fits_write_col doesn’t need the flush); the gap will close once that wrapper is patched.

2-D image extend — uncompressed mosaic build

The mosaic / strip-build pattern: append per-detector frames (or per-night strips) to a growing image via hdu.extend(strip) (rustfits) or hdu.write(strip, start=(row, 0)) (fitsio). Both grow the slowest-varying axis — same primitive, two APIs. Unlike ZTABLE/ZIMAGE, fitsio CAN extend uncompressed images, so this is a true cross-tool comparison. perf/perf-image-extend-2d.py measures wall + peak RSS for write-once vs extend at two chunk sizes (100 rows and 1000 rows) on a (20,000 × 4,000) f4 image (~320 MB).

Uncompressed 2-D image extend (20 k × 4 k f4) — N=320 MB

regime

build

peak RSS

vs rf write-once

fitsio write-once

174.7 ms

440 MB

1.48× time, 1.2× more RAM (fitsio)

rustfits write-once

117.9 ms

363 MB

(ref)

rustfits extend C=100 rows (K=200)

119.0 ms

70 MB

1.01× time, 5.2× less RAM

fitsio extend C=100 rows (K=200)

231.4 ms

70 MB

1.96× time (fitsio), 5.2× less RAM

rustfits extend C=1000 rows (K=20)

117.1 ms

74 MB

0.99× time, 4.9× less RAM

fitsio extend C=1000 rows (K=20)

239.7 ms

74 MB

2.03× time (fitsio), 4.9× less RAM

Three takeaways:

  1. Bounded memory works in both tools. Either extend path keeps peak RSS at ~70–75 MB regardless of the final image size (dominated by Python + numpy baseline; the actual chunk’s data is only 1.6–16 MB). write-once needs the whole image resident, plus per-tool overhead. Both tools deliver the same ~5× RAM win.

  2. rustfits extend ≈ rustfits write-once on time. Both chunk sizes match the write-once baseline within 2 %. No per-call overhead worth worrying about — for uncompressed incremental builds the only cost vs write-once is the bookkeeping for the NAXIS2 header update.

  3. fitsio extend is ~2× slower than rustfits extend. At both chunk sizes fitsio’s per-extend cost is roughly double rustfits’ — same ratio as the write-once gap (1.48×), so this is a general fitsio per-call overhead rather than something specific to start= writes. fitsio also holds ~80 MB more RAM during write-once (440 vs 363 MB) — suggesting it keeps an extra byteswapped copy that rustfits avoids.

2-D compressed image extend — mosaic build with GZIP_2

Same shape as the uncompressed bench (20,000 × 4,000 f4, ~320 MB raw) but tile-compressed with GZIP_2 and tile shape (100, cols) so the chunk-row axis maps directly to “tile-rows per append”. fitsio cannot extend a compressed image (cfitsio returns status = 107: tried to move past end of file on the second write), so the extend rows are rustfits-only; fitsio appears only as a write-once reference. perf/perf-compressed-image-extend-2d.py.

Compressed 2-D image extend (20 k × 4 k f4, GZIP_2 tile=(100, cols))

regime

build

peak RSS

vs rf write-once

fitsio write-once

6.44 s

440 MB

2.98× time (fitsio)

rustfits write-once

2.17 s

626 MB

(ref)

rustfits extend C=50 rows (K=400, sub-tile)

48.30 s

454 MB

22.3× time, 1.4× less RAM

rustfits extending() C=50 rows (K=400, sub-tile)

2.73 s

364 MB

1.26× time, 1.7× less RAM

rustfits extend C=100 rows (K=200, exact tile)

16.68 s

321 MB

7.7× time, 1.9× less RAM

rustfits extending() C=100 rows (K=200, exact tile)

2.74 s

365 MB

1.26× time, 1.7× less RAM

rustfits extend C=1000 rows (K=20, 10 tiles)

3.34 s

338 MB

1.54× time, 1.8× less RAM

rustfits extending() C=1000 rows (K=20, 10 tiles)

2.49 s

394 MB

1.15× time, 1.6× less RAM

Four takeaways:

  1. Multi-tile chunks are nearly free. At chunk=1000 rows (10 tiles per call) extend is 1.54× write-once — the per-extend overhead is one heap-relocate-forward and a PCOUNT bump, amortized across the 10 tiles’ encode work. For mosaic builds that can buffer multi-tile strips, this is the regime to aim for.

  2. Exact-tile chunks are moderate. At chunk=100 rows (1 tile per call) extend costs 7.7× write-once — the per-call overhead dominates because there’s only one tile’s worth of “real” encode work per call but the same bookkeeping.

  3. Sub-tile chunks pay heavily. At chunk=50 rows (½ tile) every append decompresses + merges into the partial last tile then re-encodes it: 22.3× write-once — the same mirror-pattern as the ZTABLE small-chunk re-encode finding in the table-append section above. For compressed 2-D mosaic builds, align chunks to a multiple of tile-rows (or buffer to that size in user code) to skip the re-encode tax — or use the extending() context (takeaway 4) to do the buffering automatically.

  4. The ``extending()`` context manager wins at every chunk size. Wrapping the extend loop in with hdu.extending(): buffers chunks in RAM and drains in tile-aligned bursts (triggered by a 32 MB soft cap), collapsing N partial-tile re-encodes into 1:

    with hdu.extending():
        for batch in batches:
            hdu.extend(batch)
    

    At sub-tile chunks the speedup is dramatic: 22.3× → 1.26× write-once (18× faster than the unbuffered extend) AND peak RSS drops to 364 MB — 1.7× less RAM than write-once because the cap bounds the buffer below what write-once’s whole-array path holds. Even at exact-tile chunks the context helps (7.7× → 1.26×). At multi-tile chunks it’s a modest win (1.54× → 1.15×).

    Memory cap. The mid-context drain is triggered at MAX_PENDING_BYTES = 32 MB of accumulated input, chosen as a soft cap that fits many tiles per drain while staying small enough to not balloon RSS during long streaming loops. Sized as a code constant; not currently user-configurable — if a workload needs a different cap please open an issue.

    Strict context semantics. Inside with hdu.extending(): only extend() is permitted; any read(), __getitem__, write(), __setitem__, repack(), add_checksum(), or verify_checksum() call raises ValueError. Exit the context first. The natural nested-with pattern with FITS composes correctly (Python guarantees the inner __exit__ runs before the outer close), so this restriction only surfaces in mixed-operation loops that should be restructured anyway.

Scattered reads on compressed images — tune the tile cache

The compressed-read benches above all walk the file in chunks of increasing size — sequential access, where the tile cache holds whatever was most recently decoded. Scattered / random-access workloads behave differently: every read may revisit a tile decoded much earlier, so cache hit rate dominates. perf/perf-compressed-image-read-1d-scattered.py measures 1000 random 1000-row windows against a 64-tile 1-D f8 GZIP_2 image (537 MB raw, 15 MB compressed, healsparse-like). With random sampling across 64 tiles, every tile is touched ~16× per iteration, so the cache size you pick matters a lot.

The two regimes:

Scattered compressed-1D read (1000 × 1k-row windows, 64-tile f8 GZIP_2)

cache

rustfits

fitsio

rf vs fi

note

default (rf=32 MB, fi=unbounded)

5.34 s

1.73 s

0.32×

rf holds 4/64 tiles → thrashes

large (rf=528 MB, fi=unbounded)

483 ms

1.74 s

3.60×

both backends fully cache

What’s going on:

  • rustfits’s default cache is 32 MB (4 tiles for an 8 MB tile). On scattered access across 64 unique tiles, every read evicts something we still need — the workload collapses to “decode every read” and the headline ratio looks bad.

  • fitsio’s cache is unbounded — it caches every decoded tile forever, so on this small file it implicitly covers the whole working set with no user tuning. The unbounded cache is what makes fitsio look ~3× faster than rustfits in the default-cache regime.

  • At equal coverage, rustfits is 3.6× faster because the per-read lookup-and-slice path is leaner (same shape as the small-chunk wins in the sequential-read benches above).

If your access pattern is scattered, the fix is one line — size the tile cache to the working set:

hdu = fits[1]
hdu.set_tile_cache_size(500 * 1024 * 1024)   # 500 MB
for row in random_rows:
    chunk = hdu[row:row + 1000]
    ...

Important

rustfits’s cache is bounded for a reason. fitsio’s “remember every tile forever” works great when the compressed image fits in RAM (the case in this bench), but on a multi-GB compressed image fitsio’s unbounded cache will OOM — there’s no knob to bound it. rustfits’s bounded LRU degrades gracefully: above the cap, the per- tile decode cost replaces the cache hit, but the process keeps running. The cache size is a memory-vs-speed knob you tune for your workload; there’s no universally-good value.

Practical tuning rules of thumb (until a real workload disagrees):

  • Sequential / chunked reads: default 32 MB is fine — LRU naturally holds the few most-recent tiles, and the chunked-read benches already show large wins.

  • Scattered reads, file fits in RAM: bump set_tile_cache_size to roughly the size of the working set’s unique tiles (or the whole compressed-image data section if you can spare the RSS).

  • Scattered reads, file doesn’t fit in RAM: pick a cache size you can afford; the bench above shows that even at 256 MB / 32 tiles, partial coverage halves the cost vs. default.

Compressed-image __setitem__ — per-tile re-encode tax

A hdu[selection] = value on a tile-compressed image decodes every tile the selection touches, modifies it in numpy, and re-encodes / appends to the heap. A “patch a few pixels” workflow (interactive masking, bad-pixel fix-up, hot-pixel flagging) therefore pays a full tile re-encode for every tile the selection covers — even a single-pixel write. perf/perf-compressed-image-setitem.py measures that per-call cost across all five algorithms + the unquantized / quantized float paths, with an uncompressed memcpy floor for reference.

Setup: (256 × 256) image with (32, 32) tiles (an 8×8 tile grid). Four selection shapes:

  • single pixel — one int per axis (touches 1 tile)

  • 1 tile aligned — a 32×32 slice on a tile boundary (1 tile)

  • 4 tiles spanning — an 8×8 slice straddling a tile corner (touches 4 tiles)

  • 16 tiles aligned — a 4×4 block of tiles (16 tiles)

Per-call costs span 3 µs to 200 µs, which at the low end is below scheduler / cache / IRQ jitter on a typical Linux box (early prototypes of this bench wobbled 55% run-to-run on the 3 µs uncompressed-single-pixel row). The bench mitigates by auto-calibrating the inner loop size per (algo, sel): a 20-call probe estimates per-call cost, then the timed loop is sized so each window takes at least 100 ms. Result: 30 000 inner calls for the cheap rows down to 200 for the expensive ones, every row stable to within 1% run-to-run. GC is disabled during each timed window; reported value is the median across 5 windows. PCOUNT grows monotonically across iters but per-call cost is constant (__setitem__ just appends new tile bytes to the heap; it doesn’t repack).

Compressed-image __setitem__ cost (256×256, tile 32×32)

algorithm

selection

tiles

per call

per tile

per pixel

uncompressed i4

single pixel

1

3.0 µs

3.0 µs

3.0 µs

uncompressed i4

1 tile aligned

1

17.5 µs

17.5 µs

0.0 µs

uncompressed i4

4 tiles spanning

4

6.7 µs

1.7 µs

0.1 µs

uncompressed i4

16 tiles aligned

16

64.8 µs

4.0 µs

0.0 µs

GZIP_1 i4

single pixel

1

99.5 µs

99.5 µs

99.5 µs

GZIP_1 i4

1 tile aligned

1

37.6 µs

37.6 µs

0.0 µs

GZIP_1 i4

4 tiles spanning

4

361.6 µs

90.4 µs

5.7 µs

GZIP_1 i4

16 tiles aligned

16

401.4 µs

25.1 µs

0.0 µs

GZIP_2 i4

single pixel

1

81.5 µs

81.5 µs

81.5 µs

GZIP_2 i4

1 tile aligned

1

40.8 µs

40.8 µs

0.0 µs

GZIP_2 i4

4 tiles spanning

4

263.0 µs

65.8 µs

4.1 µs

GZIP_2 i4

16 tiles aligned

16

452.4 µs

28.3 µs

0.0 µs

RICE_1 i4

single pixel

1

39.3 µs

39.3 µs

39.3 µs

RICE_1 i4

1 tile aligned

1

22.8 µs

22.8 µs

0.0 µs

RICE_1 i4

4 tiles spanning

4

103.2 µs

25.8 µs

1.6 µs

RICE_1 i4

16 tiles aligned

16

139.4 µs

8.7 µs

0.0 µs

HCOMPRESS_1 i4

single pixel

1

139.8 µs

139.8 µs

139.8 µs

HCOMPRESS_1 i4

1 tile aligned

1

31.0 µs

31.0 µs

0.0 µs

HCOMPRESS_1 i4

4 tiles spanning

4

461.4 µs

115.4 µs

7.2 µs

HCOMPRESS_1 i4

16 tiles aligned

16

275.3 µs

17.2 µs

0.0 µs

PLIO_1 i4

single pixel

1

18.8 µs

18.8 µs

18.8 µs

PLIO_1 i4

1 tile aligned

1

20.1 µs

20.1 µs

0.0 µs

PLIO_1 i4

4 tiles spanning

4

40.5 µs

10.1 µs

0.6 µs

PLIO_1 i4

16 tiles aligned

16

124.2 µs

7.8 µs

0.0 µs

GZIP_1 f4 unquantized

single pixel

1

93.2 µs

93.2 µs

93.2 µs

GZIP_1 f4 unquantized

1 tile aligned

1

37.4 µs

37.4 µs

0.0 µs

GZIP_1 f4 unquantized

4 tiles spanning

4

281.3 µs

70.4 µs

4.4 µs

GZIP_1 f4 unquantized

16 tiles aligned

16

402.6 µs

25.2 µs

0.0 µs

GZIP_1 f4 quantized

single pixel

1

195.4 µs

195.4 µs

195.4 µs

GZIP_1 f4 quantized

1 tile aligned

1

51.2 µs

51.2 µs

0.0 µs

GZIP_1 f4 quantized

4 tiles spanning

4

680.7 µs

170.2 µs

10.6 µs

GZIP_1 f4 quantized

16 tiles aligned

16

560.9 µs

35.1 µs

0.0 µs

Four takeaways:

  1. Per-call cost is dominated by per-tile re-encode. A single-pixel write costs 19–200 µs depending on algorithm — 30–60× the uncompressed memcpy (3 µs). The user’s budget for “patch one pixel” is the per-tile cost of their chosen algorithm; touching N tiles costs roughly N × the per-tile rate plus a small constant.

  2. PLIO_1 is fastest, quantized float is slowest. At one tile touched, PLIO costs ~20 µs / tile (it’s encoded as (value, run-length) pairs so decode + re-encode are trivial); quantized f4 costs ~200 µs / tile because every touched tile has to re-quantize against its existing per-tile bscale/bzero/seed. RICE_1 and HCOMPRESS_1 fall between (40–145 µs single-pixel).

  3. Full-tile-aligned writes are CHEAPER than single-pixel writes. Counterintuitive but real: a “1 tile aligned” selection that covers the whole tile (32×32) is 20–50 µs across all algorithms, where a single-pixel write on the same tile is 40–200 µs. The reason is that an aligned full-tile write doesn’t need to decode the existing tile first — every pixel is being replaced, so the read-modify-write collapses to just write. Single-pixel writes pay the full decode + modify + encode cycle.

  4. Per-pixel rate amortizes with selection size. For batched writes (the 16-tile column) the per-pixel rate drops to <0.1 µs / pixel — comparable to uncompressed. If you can buffer your patches into rectangular tile-aligned blocks, the per-tile re-encode tax amortizes away. For true scattered single-pixel workflows there’s no shortcut (use PLIO_1 if your data fits its non-negative integer constraint).

Practical guidance: for interactive masking on a compressed image, budget ~50–200 µs per single-pixel touch (algorithm- dependent), so up to a few thousand single-pixel patches per second. For bulk masking, group patches into rectangular selections that cover whole tiles where possible.

repack() — peak RSS scaling at large heap

Three repack flavors clean up heap orphans accumulated by __setitem__ / extend calls:

  • TableHDU.repack() for BINTABLE VLA columns,

  • CompressedImageHDU.repack() for ZIMAGE compressed images,

  • CompressedTableHDU.repack() for ZTABLE compressed tables.

They share a goal (rebuild the heap with only live bytes, shrink the file) but the implementations differ. This bench measures the peak RSS at 10 / 100 / 1000 MB heap sizes for each. Each fixture is built in the parent process; the repack runs in a fresh subprocess so the reported peak RSS is the repack alone, read via /proc/self/status:VmHWM (the kernel’s per-task high-water). Using resource.getrusage(RUSAGE_SELF).ru_maxrss here would silently report the parent’s peak, because on Linux ru_maxrss is accumulated in signal_struct and inherited across fork+exec; the shared h.vm_hwm_kb helper is the immune path.

repack() peak RSS vs PCOUNT

flavor

PCOUNT

repack t

peak RSS

notes

BINTABLE VLA (uncompressed)

10 MB

8.2 ms

50 MB

streaming + staging impl

BINTABLE VLA

100 MB

16.4 ms

50 MB

flat — no PCOUNT scaling

BINTABLE VLA

1,000 MB

94.2 ms

50 MB

flat at 1 GB heap

ZIMAGE compressed image

11 MB

2.0 ms

49 MB

streaming + staging impl

ZIMAGE compressed image

119 MB

19.6 ms

49 MB

flat — no PCOUNT scaling

ZIMAGE compressed image

1,215 MB

189.6 ms

50 MB

flat at 1 GB heap

ZTABLE compressed table

7 MB

1.8 ms

50 MB

streaming + staging impl

ZTABLE compressed table

102 MB

18.3 ms

50 MB

flat — no PCOUNT scaling

ZTABLE compressed table

1,049 MB

160.4 ms

50 MB

flat at 1 GB heap

Two takeaways:

  1. All three repack flavors are bounded-memory. RSS stays constant at ~50 MB (Python + rustfits baseline) regardless of heap size, from 10 MB to 1 GB, for all three: BINTABLE VLA, ZIMAGE, and ZTABLE. All three share the same streaming + staging implementation (a ~1 MiB chunk plus the descriptor table + a small move-plan vector; documented in CLAUDE.md under “Heap repack”). The shared primitives stream_copy_in_file and grow_file_to_at_least live in src/common.rs.

  2. The rewrite cycle (2026-05-31) was driven by a real in-place-modify-large-compressed-image workload needing repack() on a memory-constrained worker. Before the rewrite, BINTABLE VLA repack scaled at ~1.05× PCOUNT and ZIMAGE at ~1.5× PCOUNT. The numbers:

    Repack rewrite — 1 GB heap, before vs after

    flavor

    before

    after

    improvement

    BINTABLE VLA

    1.05 GB peak, 451 ms

    50 MB peak, 94 ms

    21× less RAM, 4.8× faster

    ZIMAGE

    1.79 GB peak, 819 ms

    50 MB peak, 190 ms

    36× less RAM, 4.3× faster

    ZTABLE

    (already streaming)

    50 MB peak, ~160 ms

    (unchanged)

    The time win on the two new streaming paths comes from eliminating the entire Vec-allocate + memcpy + drop cycle on the heap bytes; the streaming approach reads + writes 1 MiB chunks straight through the file handle.

Other self-comparisons + RSS benches

ZTABLE read/write (rustfits compressed vs rustfits uncompressed) and the image-extend RSS benches.

Last generated 2026-05-30 16:28 UTC. Environment: Python 3.14.4, numpy 2.4.3, rustfits ?, fitsio 1.3.0, astropy 7.2.0, x86_64 / Linux.

These are point-in-time measurements. The benchmarks under perf/ are the source of truth — re-run python perf/perf-all.py for current numbers on your hardware.

Compressed BINTABLE read (ZTABLE; rustfits self)

Same catalog as the uncompressed BINTABLE bench. fitsio’s Python API does not decompress ZTABLE, so this is a rustfits self-comparison (compressed vs uncompressed timing).

Compressed BINTABLE read (ZTABLE; rustfits self)

operation

primary

primary t

secondary

secondary t

ratio

column subset (3 cols)

ZTABLE

23.60 ms

uncompressed

29.81 ms

0.79×

row slice [125000:375000]

ZTABLE

1.183 s

uncompressed

609.34 ms

1.94×

scattered rows (x2000)

ZTABLE

547.71 ms

uncompressed

8.19 ms

66.84×

whole table

ZTABLE

2.321 s

uncompressed

1.214 s

1.91×

Compressed BINTABLE write (ZTABLE; rustfits self)

Same catalog. Self-comparison (no other Python library writes ZTABLE).

Compressed BINTABLE write (ZTABLE; rustfits self)

operation

primary

primary t

secondary

secondary t

ratio

write catalog

ZTABLE

28.236 s

uncompressed

907.24 ms

31.12×

Uncompressed 1-D image build — bounded-memory extend vs write-once

1 GB f8 map; ImageHDU.extend appends chunks instead of holding the whole array in RAM. fitsio cannot append; RSS comparison is rustfits-self plus fitsio’s write-once as a reference.

Uncompressed 1-D image build — bounded-memory extend vs write-once — build wall + peak RSS

regime

build

peak RSS

fitsio write-once

487.40 ms

1,205 MB

rustfits extend C=1,048,576 (K=128)

347.50 ms

66 MB

rustfits extend C=8,388,608 (K=16)

366.98 ms

118 MB

rustfits write-once

353.91 ms

1,078 MB

Compressed 1-D image build — bounded-memory extend vs write-once

Same shape as the compressed healsparse benches; CompressedImageHDU.extend appends chunks. fitsio cannot append compressed.

Compressed 1-D image build — bounded-memory extend vs write-once — build wall + peak RSS

regime

build

peak RSS

fitsio write-once

2.999 s

1,225 MB

rustfits extend C=1,048,576 (K=128)

3.025 s

129 MB

rustfits extend C=8,388,608 (K=16)

1.460 s

186 MB

rustfits write-once

1.974 s

1,145 MB

EXTNAME lookup (name in fits / fits[name]) — linear in HDU count

name in fits and fits["NAME"] walk the HDU list until a match is found (or fall through and return False / raise on miss), reading and parsing each HDU’s EXTNAME card along the way. Cost scales linearly at roughly 0.14 µs per HDU on the reference machine — about 1.4 ms for a 10,000-HDU file when the name lives at the end (or doesn’t exist). fitsio caches a name→index dict and returns in ~10 µs regardless of HDU count.

For typical files (tens to a few hundred HDUs) the gap is invisible. Workloads that do many name lookups on a multi- thousand-HDU file will notice; integer indexing is O(1) in both libraries, so the workaround until this is fixed is to build the name→index dict yourself once:

name_to_idx = {h.extname: i for i, h in enumerate(fits) if h.extname}

This could be re-engineered to match fitsio (file-level name→index cache with version-stamped invalidation) if a real workload asks for it.