Performance
===========

.. role:: perf-fast
.. role:: perf-par
.. role:: perf-slow

rustfits aims to be **as fast as or faster than fitsio** on every benchmark
it shares with that library, and to offer capabilities (bounded-memory
extend builds, transparent ZTABLE read/write) that fitsio doesn't have.
This page is a snapshot of where things stand.

Ratios in the tables below are ``fitsio_time / rustfits_time``; a value
greater than 1.0 means rustfits is faster. Cells are colorized so the
overall pattern is visible at a glance: :perf-fast:`green` for ≥ 1.10×
faster, :perf-par:`yellow` for ~par (0.95–1.10×), :perf-slow:`red` for
slower (< 0.95×).

Numbers below are point-in-time measurements from one Linux x86_64
machine. They illustrate the rough shape of the comparison (rustfits
typically runs at 1.0×–2.5× the speed of fitsio on common workloads, with
a few cases at 30×+ where structural choices give a large win). Your
mileage will vary with CPU, disk, file size, and data content, so re-run
the benchmarks yourself for numbers that match your hardware:

.. code-block:: shell

   # release build required for representative timings
   maturin develop --release

   # full sweep (~5 minutes; --skip extend to skip the RSS benches)
   python perf/perf-all.py --skip extend

The benchmark scripts live under ``perf/`` in the source tree; each is a
standalone script with a docstring explaining its methodology. The runner
``perf-all.py`` collects every script's results into the two summary
tables below. Refresh both tables with::

   python perf/perf-all.py \
       --rst-out-xtool docs/tutorial/_perf_tables_xtool.rst \
       --rst-out-self  docs/tutorial/_perf_tables_self.rst

How comparisons are timed
-------------------------

Every ``perf-*.py`` script follows these conventions so that the numbers
mean what they claim:

* **Release build.** Debug builds read ~7× slower than release and
  produce misleading "rustfits is slower" results.
  ``maturin develop --release`` is required.
* **Fresh open per timed iteration.** fitsio caches decoded compressed tiles forever; timing repeated reads of one open handle would measure fitsio's cache hits against rustfits's bounded LRU re-decoding (backward from the real workload, which reads each tile once). Both tools get a fresh open per iter so caches start empty. * **Warmup primes the FS cache.** The first read off disk is I/O-bound and washes out the comparison; a warmup pass loads the compressed bytes into the OS page cache so timed passes measure decoder + Python boundary speed, not disk. * **Fresh FILE per timed iter for write benches.** Overwriting a large file in a tight loop triggers a kernel page-cache penalty that masks the realistic single-file-per-program-run pattern. Write benches generate a unique filename per iter via ``h.fresh_path``. * **Median of 5.** GC is paused around each timed call; samples are sorted and the median is reported. What's measured --------------- * **Cross-tool comparisons** — rustfits vs fitsio for every benchmark fitsio's Python API supports. Each row is one (script, operation) pair; the ``vs fitsio`` column is ``fitsio_time / rustfits_time`` (so > 1.0 means rustfits is faster). * **Self-comparisons** — rustfits compressed vs rustfits uncompressed for ZTABLE read/write (fitsio's Python API does not decompress ZTABLE, so cross-tool isn't possible). The ratio is the compression cost in time, *not* a tool comparison. * **Build wall + peak RSS** — extend benchmarks measure bounded-memory incremental builds vs whole-array write-once. Each build runs in its own subprocess so ``ru_maxrss`` is a clean per-build high-water mark. Cross-tool comparisons (rustfits vs fitsio) ------------------------------------------- Every benchmark whose operation fitsio also implements. Each row's ``vs fitsio`` cell is ``fitsio_time / rustfits_time``, colorized green / yellow / red as described above. 
Tables are grouped by script; the per-script subtitle supplies the data shape so the operation labels make sense in isolation. .. include:: _perf_tables_xtool.rst Self-comparisons (rustfits self) -------------------------------- Benchmarks that have no cross-tool equivalent — either because fitsio's Python API can't do the operation (ZTABLE read/write) or because the comparison is structural (incremental ``append`` / ``extend`` vs whole-array ``write``). Numbers here are rustfits vs rustfits, so the headline is the trade-off (compression cost, append overhead) rather than a tool ranking. Incremental table builds (``append`` and the ``appending()`` context) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Building a catalog by ``N/K`` calls to ``hdu.append(K rows)`` is the natural pattern for streaming pipelines (per-frame source extraction, per-file harvest). **TL;DR for ZTABLE streaming pipelines: use the :meth:`CompressedTableHDU.appending` context manager.** It makes the append-chunk size irrelevant — every chunk size lands within ~1-7% of the one-shot ``write_table`` cost:: with hdu.appending(): for batch in batches: hdu.append(batch) ``perf/perf-table-compressed-append-chunks.py`` (4×f4 schema, N=100,000 rows, ZTILELEN=10,000) shows the win directly: ==================== ======== ========================= ================================= regime chunk unbuffered ``append`` ``appending()`` context ==================== ======== ========================= ================================= 1% of tile 100 :perf-slow:`48.91× time` :perf-fast:`1.07× time` (46× win) 10% of tile 1,000 :perf-slow:`5.40× time` :perf-fast:`1.00× time` (5.4× win) 50% of tile 5,000 :perf-slow:`1.53× time` :perf-fast:`0.99× time` exact tile 10,000 1.02× time :perf-fast:`1.00× time` 2 tiles 20,000 1.01× time :perf-fast:`0.99× time` ==================== ======== ========================= ================================= Without the context, ZTABLE ``append`` pays a 
partial-trailing-tile decode + merge + re-encode on every call: fine
when the chunk is at least one ZTILELEN, expensive when it's smaller.
The context buffers row batches in RAM and drains them in
ZTILELEN-aligned bursts (capped at 32 MB, ``MAX_PENDING_BYTES`` in
``src/hdu_table_compressed/extending.rs``), collapsing N partial-tile
re-encodes into 1. For a 16-byte row schema the cap allows 2 M rows per
drain, so a typical streaming pipeline does 0-1 mid-context drains plus
one residual drain at ``__exit__``.

``extending()`` is a symmetric alias for ``appending()``; ``extend()``
is the same alias on the append call itself.

Strict context semantics (the same as
:meth:`CompressedImageHDU.extending`): only :meth:`append` /
:meth:`extend` is permitted inside the ``with`` block; any :meth:`read` /
``__getitem__`` / :meth:`write` / ``__setitem__`` / :meth:`repack` /
:meth:`add_checksum` / :meth:`verify_checksum` raises ``ValueError``
(exit the context first).

Uncompressed ``TableHDU.append`` doesn't need the context (there is no
merge tax), but a no-op ``appending()`` / ``extending()`` on the
uncompressed types is on the roadmap so generic code that doesn't know
the HDU type works uniformly.

The rest of this section is the underlying behavior: how ``append``
itself performs across the four variants (the context wraps it, so
understanding the per-call cost is still useful when ZTILELEN is small
or the streaming-loop budget is tight).

``perf/perf-table-append.py`` measures wall time + peak RSS of the
per-call ``append`` loop against the equivalent one-shot
``write_table(N rows)`` across four variants:
``{uncompressed, ZTABLE} × {fixed-only, with VLA}``. Each build runs in
its own subprocess for a clean per-build ``ru_maxrss``.

Sample numbers from the reference machine (N=100,000 rows, 34-column
type-exhaustive catalog at ~600 B/row, ZTILELEN ≈ 16 k rows).
The first table is the rustfits self-comparison (append vs write-once across all four variants); the second is the rustfits-vs-fitsio cross- tool comparison on the uncompressed variants, where pairing is possible (fitsio's Python API cannot write ZTABLE). In the self-comparison table, ``vs rf write-once`` reports each row's wall time / RSS divided by the rustfits write-once measurement for that variant; values < 1 mean the row was faster / lighter than the one-shot rustfits write. .. list-table:: Table append (N=100,000) — rustfits self :widths: 24 26 14 12 22 :header-rows: 1 * - variant - regime - build - peak RSS - vs rf write-once * - uncompressed, fixed-only - rustfits write-once - 49.6 ms - 163 MB - (ref) * - - rustfits append C=1k (K=100) - 63.1 ms - 163 MB - 1.27× time, ≈ RAM * - - rustfits append C=10k (K=10) - 50.5 ms - 163 MB - 1.02× time, ≈ RAM * - uncompressed, with VLA - rustfits write-once - 174.8 ms - 216 MB - (ref) * - - rustfits append C=1k (K=100) - 199.9 ms - 165 MB - 1.14× time, 1.3× less RAM * - - rustfits append C=10k (K=10) - 156.4 ms - 165 MB - 0.89× time, 1.3× less RAM * - ZTABLE, fixed-only - rustfits write-once - 2.03 s - 174 MB - (ref) * - - rustfits append C=1k (K=100) - 20.45 s - 163 MB - 10.1× time, ≈ RAM * - - rustfits append C=10k (K=10) - 3.73 s - 163 MB - 1.8× time, ≈ RAM * - ZTABLE, with VLA - rustfits write-once - 5.63 s - 213 MB - (ref) * - - rustfits append C=1k (K=100) - 29.70 s - 170 MB - 5.3× time, 1.3× less RAM * - - rustfits append C=10k (K=10) - 7.94 s - 173 MB - 1.4× time, 1.2× less RAM Cross-tool comparison on the uncompressed variants, paired row-by-row (``vs fitsio`` = ``fitsio_time / rustfits_time``; > 1.0 means rustfits is faster): .. 
list-table:: Table append (N=100,000) — rustfits vs fitsio (uncompressed) :widths: 24 26 14 14 22 :header-rows: 1 * - variant - operation - rustfits - fitsio - vs fitsio * - fixed-only - write-once - 49.6 ms - 166.7 ms - :perf-fast:`3.36×` * - - append C=1k (K=100) - 63.1 ms - 171.0 ms - :perf-fast:`2.71×` * - - append C=10k (K=10) - 50.5 ms - 153.9 ms - :perf-fast:`3.05×` * - with VLA - write-once - 174.8 ms - 495.3 ms - :perf-fast:`2.83×` * - - append C=1k (K=100) - 199.9 ms - **40.08 s** - :perf-fast:`200×` (see note) * - - append C=10k (K=10) - 156.4 ms - **32.79 s** - :perf-fast:`210×` (see note) Five things to take away from the per-call ``append`` numbers above (the ``appending()`` context wraps these, so it inherits the wins and erases the losses): 1. **Uncompressed write-once: rustfits is ~3× fitsio.** The one-shot ``write_table`` call beats fitsio's equivalent by 3.36× on the fixed-only variant and 2.83× on the VLA variant. (fitsio doesn't write ZTABLE.) 2. **Uncompressed append is essentially free.** At chunk=10 k the fixed-only rustfits append loop runs within 3 % of the one-shot rustfits write, and the VLA append loop is *faster* than the one-shot write (the write-once path plans the whole VLA heap layout in RAM up front; the append loop pays it incrementally). 3. **rustfits append crushes fitsio append on VLA — ~200×.** On the fixed-only variant rustfits append is ~2.7–3.0× faster than fitsio append (in line with the write-once ratio). On VLA the gap blows out to **200× faster** at chunk=1k and **210× faster** at chunk=10k — fitsio takes ~33–40 s where rustfits takes ~160–200 ms. See the note below for the root cause. 4. **VLA append wins on peak RSS.** The rustfits whole-table write holds ~216 MB resident; the append loop holds ~165 MB. That's the bounded-memory story repeating from the image side — incremental write keeps live memory near the chunk size instead of the full output. 5. 
**ZTABLE ``append`` (without the context) has a chunk-size cliff at
   ZTILELEN.** At chunk=10 k (close to the default ZTILELEN of ~17 k
   rows for this 588 B row width), append is only 1.8× the rustfits
   ZTABLE write-once for fixed-only and 1.4× for VLA. At chunk=1 k (well
   below ZTILELEN), the partial-last-tile re-encode tax kicks in: every
   append decompresses + merges into the trailing tile and re-encodes
   it, costing 10× for fixed-only and 5× for VLA. The fix is the
   ``appending()`` context (lead of this section); the alternative is to
   pick chunks that are an integer multiple of ZTILELEN.

.. note::

   The 1.8× / 1.4× / 10× / 5× numbers above reflect a 2026-05-31 fix in
   ``create_table_hdu``: prior to that commit, the ``nrows=0``
   streaming-create pattern
   (``create_table_hdu(nrows=0, compress=True[, ztilelen=K])`` +
   repeated ``append(chunk)``) silently collapsed ZTILELEN to 1
   (regardless of what the user passed), forcing every appended row into
   its own single-row, independently gzip-compressed tile. The pre-fix
   numbers were 14–15× for fixed-only and 6–7× for VLA at *any* chunk
   size, since the bug masked the true per-call cost. Resolved in
   CLAUDE.md TODO #10 (commit 957f94a).

.. note::

   **Known fitsio issue behind the 200× VLA-append gap.** fitsio's
   ``write_var_column`` Python wrapper calls ``fits_flush_file`` after
   every per-column write (``fitsio_pywrap.c``, line 2710), and
   cfitsio's ``fits_flush_file`` (``ffflus`` in ``buffers.c``) is
   *close-current-HDU + flush-buffers + re-open-current-HDU*. Each
   reopen re-walks the header and re-parses column descriptors. With 3
   VLA columns and 100 appends at chunk=1k that's 300 close-and-reopen
   cycles, or about 40 s of overhead unrelated to the actual data write.
   The fix is in fitsio (the underlying cfitsio ``fits_write_col``
   doesn't need the flush); the gap will close once that wrapper is
   patched.
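Takeaway 5 names "pick chunks that are an integer multiple of ZTILELEN"
as the alternative to the ``appending()`` context. A minimal sketch of
that re-chunking in user code follows, assuming batches arrive as NumPy
structured arrays. ``rechunk_to_tile_multiple`` is a hypothetical helper
(not part of rustfits), and the final yielded array may still be a
partial tile:

.. code-block:: python

   import numpy as np


   def rechunk_to_tile_multiple(batches, ztilelen):
       """Yield arrays whose length is a multiple of ``ztilelen``.

       Hypothetical helper, not a rustfits API. Rows are buffered until
       at least one full tile is available; the remainder is carried
       over. Only the last yielded array may be a partial tile.
       """
       pending = []     # buffered arrays not yet emitted
       n_pending = 0    # total rows held in ``pending``
       for batch in batches:
           pending.append(batch)
           n_pending += len(batch)
           if n_pending >= ztilelen:
               rows = np.concatenate(pending)
               n_emit = (n_pending // ztilelen) * ztilelen
               yield rows[:n_emit]
               rest = rows[n_emit:]
               pending = [rest] if len(rest) else []
               n_pending = len(rest)
       if n_pending:
           yield np.concatenate(pending)


   # Example: 25 batches of 1,000 rows re-chunked for ZTILELEN = 10,000.
   dtype = np.dtype([("x", "f4"), ("y", "f4")])
   batches = [np.zeros(1_000, dtype=dtype) for _ in range(25)]
   sizes = [len(c) for c in rechunk_to_tile_multiple(batches, 10_000)]

Each yielded array would then be passed to ``hdu.append(...)``. Unlike
the context manager, this keeps the buffering (up to one tile of rows
plus the current batch) in Python rather than inside rustfits.
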
2-D image extend — uncompressed mosaic build ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The mosaic / strip-build pattern: append per-detector frames (or per-night strips) to a growing image via ``hdu.extend(strip)`` (rustfits) or ``hdu.write(strip, start=(row, 0))`` (fitsio). Both grow the slowest-varying axis — same primitive, two APIs. Unlike ZTABLE/ZIMAGE, fitsio CAN extend uncompressed images, so this is a true cross-tool comparison. ``perf/perf-image-extend-2d.py`` measures wall + peak RSS for write-once vs extend at two chunk sizes (100 rows and 1000 rows) on a (20,000 × 4,000) ``f4`` image (~320 MB). .. list-table:: Uncompressed 2-D image extend (20 k × 4 k f4) — N=320 MB :widths: 38 12 12 22 :header-rows: 1 * - regime - build - peak RSS - vs rf write-once * - fitsio write-once - 174.7 ms - 440 MB - :perf-fast:`1.48× time, 1.2× more RAM` (fitsio) * - rustfits write-once - 117.9 ms - 363 MB - (ref) * - rustfits extend C=100 rows (K=200) - 119.0 ms - 70 MB - 1.01× time, 5.2× less RAM * - fitsio extend C=100 rows (K=200) - 231.4 ms - 70 MB - :perf-fast:`1.96× time` (fitsio), 5.2× less RAM * - rustfits extend C=1000 rows (K=20) - 117.1 ms - 74 MB - 0.99× time, 4.9× less RAM * - fitsio extend C=1000 rows (K=20) - 239.7 ms - 74 MB - :perf-fast:`2.03× time` (fitsio), 4.9× less RAM Three takeaways: 1. **Bounded memory works in both tools.** Either ``extend`` path keeps peak RSS at ~70–75 MB regardless of the final image size (dominated by Python + numpy baseline; the actual chunk's data is only 1.6–16 MB). write-once needs the whole image resident, plus per-tool overhead. Both tools deliver the same ~5× RAM win. 2. **rustfits extend ≈ rustfits write-once on time.** Both chunk sizes match the write-once baseline within 2 %. No per-call overhead worth worrying about — for uncompressed incremental builds the only cost vs write-once is the bookkeeping for the NAXIS2 header update. 3. 
**fitsio extend is ~2× slower than rustfits extend.** At both chunk
   sizes fitsio's per-extend cost is roughly double rustfits' (1.96× and
   2.03×). The write-once gap (1.48×) points the same way, though it is
   smaller, so this looks like general fitsio overhead rather than
   something specific to ``start=`` writes. fitsio also holds ~80 MB
   more RAM during write-once (440 vs 363 MB), suggesting it keeps an
   extra byteswapped copy that rustfits avoids.

2-D compressed image extend — mosaic build with GZIP_2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Same shape as the uncompressed bench (20,000 × 4,000 f4, ~320 MB raw)
but tile-compressed with GZIP_2 and tile shape ``(100, cols)`` so the
chunk-row axis maps directly to "tile-rows per append". fitsio cannot
extend a compressed image (cfitsio returns
``status = 107: tried to move past end of file`` on the second write),
so the extend rows are rustfits-only; fitsio appears only as a
write-once reference. The bench script is
``perf/perf-compressed-image-extend-2d.py``.

.. list-table:: Compressed 2-D image extend (20 k × 4 k f4, GZIP_2 tile=(100, cols))
   :widths: 52 12 12 22
   :header-rows: 1

   * - regime
     - build
     - peak RSS
     - vs rf write-once
   * - fitsio write-once
     - 6.44 s
     - 440 MB
     - :perf-fast:`2.98× time` (fitsio)
   * - rustfits write-once
     - 2.17 s
     - 626 MB
     - (ref)
   * - rustfits extend C=50 rows (K=400, sub-tile)
     - 48.30 s
     - 454 MB
     - :perf-slow:`22.3× time`, 1.4× less RAM
   * - rustfits extending() C=50 rows (K=400, sub-tile)
     - 2.73 s
     - 364 MB
     - :perf-fast:`1.26× time, 1.7× less RAM`
   * - rustfits extend C=100 rows (K=200, exact tile)
     - 16.68 s
     - 321 MB
     - :perf-slow:`7.7× time`, 1.9× less RAM
   * - rustfits extending() C=100 rows (K=200, exact tile)
     - 2.74 s
     - 365 MB
     - :perf-fast:`1.26× time, 1.7× less RAM`
   * - rustfits extend C=1000 rows (K=20, 10 tiles)
     - 3.34 s
     - 338 MB
     - :perf-slow:`1.54× time`, 1.8× less RAM
   * - rustfits extending() C=1000 rows (K=20, 10 tiles)
     - 2.49 s
     - 394 MB
     - :perf-fast:`1.15× time, 1.6× less RAM`

Four takeaways:

1.
**Multi-tile chunks are nearly free.** At chunk=1000 rows (10 tiles per call) extend is 1.54× write-once — the per-extend overhead is one heap-relocate-forward and a PCOUNT bump, amortized across the 10 tiles' encode work. For mosaic builds that can buffer multi-tile strips, this is the regime to aim for. 2. **Exact-tile chunks are moderate.** At chunk=100 rows (1 tile per call) extend costs 7.7× write-once — the per-call overhead dominates because there's only one tile's worth of "real" encode work per call but the same bookkeeping. 3. **Sub-tile chunks pay heavily.** At chunk=50 rows (½ tile) every append decompresses + merges into the partial last tile then re-encodes it: 22.3× write-once — the same mirror-pattern as the ZTABLE small-chunk re-encode finding in the table-append section above. For compressed 2-D mosaic builds, **align chunks to a multiple of tile-rows** (or buffer to that size in user code) to skip the re-encode tax — or use the ``extending()`` context (takeaway 4) to do the buffering automatically. 4. **The ``extending()`` context manager wins at every chunk size.** Wrapping the extend loop in ``with hdu.extending():`` buffers chunks in RAM and drains in tile-aligned bursts (triggered by a 32 MB soft cap), collapsing N partial-tile re-encodes into 1:: with hdu.extending(): for batch in batches: hdu.extend(batch) At sub-tile chunks the speedup is dramatic: 22.3× → 1.26× write-once (18× faster than the unbuffered extend) AND peak RSS drops to 364 MB — 1.7× *less* RAM than write-once because the cap bounds the buffer below what write-once's whole-array path holds. Even at exact-tile chunks the context helps (7.7× → 1.26×). At multi-tile chunks it's a modest win (1.54× → 1.15×). **Memory cap.** The mid-context drain is triggered at ``MAX_PENDING_BYTES = 32 MB`` of accumulated input, chosen as a soft cap that fits many tiles per drain while staying small enough to not balloon RSS during long streaming loops. 
Sized as a code constant; not currently user-configurable — if a workload needs a different cap please open an issue. **Strict context semantics.** Inside ``with hdu.extending():`` only :meth:`extend` is permitted; any :meth:`read`, ``__getitem__``, :meth:`write`, ``__setitem__``, :meth:`repack`, :meth:`add_checksum`, or :meth:`verify_checksum` call raises ``ValueError``. Exit the context first. The natural nested-``with`` pattern with ``FITS`` composes correctly (Python guarantees the inner ``__exit__`` runs before the outer ``close``), so this restriction only surfaces in mixed-operation loops that should be restructured anyway. Scattered reads on compressed images — tune the tile cache ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The compressed-read benches above all walk the file in chunks of increasing size — sequential access, where the tile cache holds whatever was most recently decoded. Scattered / random-access workloads behave differently: every read may revisit a tile decoded much earlier, so cache hit rate dominates. ``perf/perf-compressed-image-read-1d-scattered.py`` measures 1000 random 1000-row windows against a 64-tile 1-D ``f8`` GZIP_2 image (537 MB raw, 15 MB compressed, healsparse-like). With random sampling across 64 tiles, every tile is touched ~16× per iteration, so the cache size you pick matters a lot. The two regimes: .. 
list-table:: Scattered compressed-1D read (1000 × 1k-row windows, 64-tile f8 GZIP_2)
   :widths: 40 14 14 14 18
   :header-rows: 1

   * - cache
     - rustfits
     - fitsio
     - rf vs fi
     - note
   * - default (rf=32 MB, fi=unbounded)
     - 5.34 s
     - 1.73 s
     - :perf-slow:`0.32×`
     - rf holds 4/64 tiles → thrashes
   * - large (rf=528 MB, fi=unbounded)
     - 483 ms
     - 1.74 s
     - :perf-fast:`3.60×`
     - both backends fully cache

What's going on:

* **rustfits's default cache is 32 MB (4 tiles for an 8 MB tile).** On
  scattered access across 64 unique tiles, every read evicts something
  we still need, so the workload collapses to "decode every read" and
  the headline ratio looks bad.
* **fitsio's cache is unbounded.** It caches every decoded tile forever,
  so on this small file it implicitly covers the whole working set with
  no user tuning. The unbounded cache is what makes fitsio look ~3×
  faster than rustfits in the default-cache regime.
* **At equal coverage, rustfits is 3.6× faster** because the per-read
  lookup-and-slice path is leaner (same shape as the small-chunk wins in
  the sequential-read benches above).

If your access pattern is scattered, the fix is one line: size the tile
cache to the working set (here 528 MB, the "large" row above, which
covers all 64 tiles)::

   hdu = fits[1]
   hdu.set_tile_cache_size(528 * 1024 * 1024)  # 528 MB
   for row in random_rows:
       chunk = hdu[row:row + 1000]
       ...

.. important::

   **rustfits's cache is bounded for a reason.** fitsio's "remember
   every tile forever" works great when the compressed image fits in RAM
   (the case in this bench), but **on a multi-GB compressed image
   fitsio's unbounded cache will OOM** because there's no knob to bound
   it. rustfits's bounded LRU degrades gracefully: above the cap, the
   per-tile decode cost replaces the cache hit, but the process keeps
   running. The cache size is a memory-vs-speed knob you tune for your
   workload; there's no universally-good value.
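The thrash-versus-full-coverage behavior in the table can be reproduced
with a toy model. The sketch below simulates an LRU tile cache under
uniformly random reads. It is a simplification, not rustfits code: each
read is assumed to touch exactly one tile (real 1000-row windows can
straddle two), and ``count_decodes`` is an illustrative helper name.

.. code-block:: python

   import random
   from collections import OrderedDict


   def count_decodes(n_tiles, capacity, n_reads, seed=0):
       """Count cache misses (tile decodes) for random reads under LRU."""
       rng = random.Random(seed)
       cache = OrderedDict()  # tile index -> None, ordered oldest-first
       decodes = 0
       for _ in range(n_reads):
           tile = rng.randrange(n_tiles)
           if tile in cache:
               cache.move_to_end(tile)        # hit: mark most recently used
           else:
               decodes += 1                   # miss: tile must be decoded
               cache[tile] = None
               if len(cache) > capacity:
                   cache.popitem(last=False)  # evict least recently used
       return decodes


   # 64 tiles, 1000 reads: a 4-tile cache vs a cache covering every tile.
   small = count_decodes(n_tiles=64, capacity=4, n_reads=1000)
   full = count_decodes(n_tiles=64, capacity=64, n_reads=1000)

With 4 of 64 tiles cached, the steady-state hit rate under uniform
random access is about 4/64 ≈ 6 %, so nearly every read decodes. With
all 64 tiles cached, the decode count can never exceed 64 no matter how
many reads follow.
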
Practical tuning rules of thumb (until a real workload disagrees):

* **Sequential / chunked reads**: the default 32 MB is fine. LRU
  naturally holds the few most-recent tiles, and the chunked-read
  benches already show large wins.
* **Scattered reads, file fits in RAM**: bump ``set_tile_cache_size`` to
  roughly the size of the working set's unique tiles (or the whole
  compressed-image data section if you can spare the RSS).
* **Scattered reads, file doesn't fit in RAM**: pick a cache size you
  can afford; the bench above shows that even at 256 MB / 32 tiles,
  partial coverage halves the cost vs. default.

Compressed-image ``__setitem__`` — per-tile re-encode tax
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A ``hdu[selection] = value`` on a tile-compressed image decodes every
tile the selection touches, modifies it in numpy, and re-encodes /
appends to the heap. A "patch a few pixels" workflow (interactive
masking, bad-pixel fix-up, hot-pixel flagging) therefore pays a full
tile re-encode for every tile the selection covers, even for a
single-pixel write.

``perf/perf-compressed-image-setitem.py`` measures that per-call cost
across all five algorithms + the unquantized / quantized float paths,
with an uncompressed memcpy floor for reference. Setup: (256 × 256)
image with (32, 32) tiles (an 8×8 tile grid). Four selection shapes:

* **single pixel**: one ``int`` per axis (touches 1 tile)
* **1 tile aligned**: a 32×32 slice on a tile boundary (1 tile)
* **4 tiles spanning**: an 8×8 slice straddling a tile corner (touches
  4 tiles)
* **16 tiles aligned**: a 4×4 block of tiles (16 tiles)

Per-call costs span 3 µs to roughly 700 µs (3 µs to 200 µs for the
single-pixel rows), which at the low end is below scheduler / cache /
IRQ jitter on a typical Linux box (early prototypes of this bench
wobbled 55% run-to-run on the 3 µs uncompressed-single-pixel row).
The bench mitigates by auto-calibrating the inner loop size per (algo, sel): a 20-call probe estimates per-call cost, then the timed loop is sized so each window takes at least 100 ms. Result: 30 000 inner calls for the cheap rows down to 200 for the expensive ones, every row stable to within 1% run-to-run. GC is disabled during each timed window; reported value is the median across 5 windows. PCOUNT grows monotonically across iters but per-call cost is constant (``__setitem__`` just appends new tile bytes to the heap; it doesn't repack). .. list-table:: Compressed-image ``__setitem__`` cost (256×256, tile 32×32) :widths: 26 18 7 12 12 13 :header-rows: 1 * - algorithm - selection - tiles - per call - per tile - per pixel * - uncompressed i4 - single pixel - 1 - 3.0 µs - 3.0 µs - 3.0 µs * - uncompressed i4 - 1 tile aligned - 1 - 17.5 µs - 17.5 µs - 0.0 µs * - uncompressed i4 - 4 tiles spanning - 4 - 6.7 µs - 1.7 µs - 0.1 µs * - uncompressed i4 - 16 tiles aligned - 16 - 64.8 µs - 4.0 µs - 0.0 µs * - GZIP_1 i4 - single pixel - 1 - 99.5 µs - 99.5 µs - 99.5 µs * - GZIP_1 i4 - 1 tile aligned - 1 - 37.6 µs - 37.6 µs - 0.0 µs * - GZIP_1 i4 - 4 tiles spanning - 4 - 361.6 µs - 90.4 µs - 5.7 µs * - GZIP_1 i4 - 16 tiles aligned - 16 - 401.4 µs - 25.1 µs - 0.0 µs * - GZIP_2 i4 - single pixel - 1 - 81.5 µs - 81.5 µs - 81.5 µs * - GZIP_2 i4 - 1 tile aligned - 1 - 40.8 µs - 40.8 µs - 0.0 µs * - GZIP_2 i4 - 4 tiles spanning - 4 - 263.0 µs - 65.8 µs - 4.1 µs * - GZIP_2 i4 - 16 tiles aligned - 16 - 452.4 µs - 28.3 µs - 0.0 µs * - RICE_1 i4 - single pixel - 1 - 39.3 µs - 39.3 µs - 39.3 µs * - RICE_1 i4 - 1 tile aligned - 1 - 22.8 µs - 22.8 µs - 0.0 µs * - RICE_1 i4 - 4 tiles spanning - 4 - 103.2 µs - 25.8 µs - 1.6 µs * - RICE_1 i4 - 16 tiles aligned - 16 - 139.4 µs - 8.7 µs - 0.0 µs * - HCOMPRESS_1 i4 - single pixel - 1 - 139.8 µs - 139.8 µs - 139.8 µs * - HCOMPRESS_1 i4 - 1 tile aligned - 1 - 31.0 µs - 31.0 µs - 0.0 µs * - HCOMPRESS_1 i4 - 4 tiles spanning - 4 - 461.4 µs - 115.4 µs - 
     - 7.2 µs
   * - HCOMPRESS_1 i4
     - 16 tiles aligned
     - 16
     - 275.3 µs
     - 17.2 µs
     - 0.0 µs
   * - PLIO_1 i4
     - single pixel
     - 1
     - 18.8 µs
     - 18.8 µs
     - 18.8 µs
   * - PLIO_1 i4
     - 1 tile aligned
     - 1
     - 20.1 µs
     - 20.1 µs
     - 0.0 µs
   * - PLIO_1 i4
     - 4 tiles spanning
     - 4
     - 40.5 µs
     - 10.1 µs
     - 0.6 µs
   * - PLIO_1 i4
     - 16 tiles aligned
     - 16
     - 124.2 µs
     - 7.8 µs
     - 0.0 µs
   * - GZIP_1 f4 unquantized
     - single pixel
     - 1
     - 93.2 µs
     - 93.2 µs
     - 93.2 µs
   * - GZIP_1 f4 unquantized
     - 1 tile aligned
     - 1
     - 37.4 µs
     - 37.4 µs
     - 0.0 µs
   * - GZIP_1 f4 unquantized
     - 4 tiles spanning
     - 4
     - 281.3 µs
     - 70.4 µs
     - 4.4 µs
   * - GZIP_1 f4 unquantized
     - 16 tiles aligned
     - 16
     - 402.6 µs
     - 25.2 µs
     - 0.0 µs
   * - GZIP_1 f4 quantized
     - single pixel
     - 1
     - 195.4 µs
     - 195.4 µs
     - 195.4 µs
   * - GZIP_1 f4 quantized
     - 1 tile aligned
     - 1
     - 51.2 µs
     - 51.2 µs
     - 0.0 µs
   * - GZIP_1 f4 quantized
     - 4 tiles spanning
     - 4
     - 680.7 µs
     - 170.2 µs
     - 10.6 µs
   * - GZIP_1 f4 quantized
     - 16 tiles aligned
     - 16
     - 560.9 µs
     - 35.1 µs
     - 0.0 µs

Four takeaways:

1. **Per-call cost is dominated by per-tile re-encode.** A single-pixel
   write costs 19–200 µs depending on algorithm, roughly 6–65× the
   uncompressed memcpy (3 µs). The user's budget for "patch one pixel"
   is the per-tile cost of their chosen algorithm; touching N tiles
   costs roughly N × the per-tile rate plus a small constant.

2. **PLIO_1 is fastest, quantized float is slowest.** At one tile
   touched, PLIO costs ~20 µs / tile (it's encoded as
   ``(value, run-length)`` pairs so decode + re-encode are trivial);
   quantized f4 costs ~200 µs / tile because every touched tile has to
   re-quantize against its existing per-tile bscale/bzero/seed. RICE_1
   and HCOMPRESS_1 fall between (40–145 µs single-pixel).

3. **Full-tile-aligned writes are CHEAPER than single-pixel writes.**
   Counterintuitive but real: on the compressed rows, a "1 tile aligned"
   selection that covers the whole tile (32×32) costs 20–50 µs, where a
   single-pixel write on the same tile costs 19–200 µs. PLIO_1 is the
   one near-tie (18.8 µs single-pixel vs 20.1 µs aligned); every other
   algorithm is 1.7–4.5× cheaper on the aligned write. The reason is
   that an aligned full-tile write doesn't need to decode the existing
   tile first: every pixel is being replaced, so the read-modify-write
   collapses to just write. Single-pixel writes pay the full decode +
   modify + encode cycle.

4. **Per-pixel rate amortizes with selection size.** For batched writes
   (the 16-tile rows) the per-pixel rate drops to <0.1 µs / pixel,
   comparable to uncompressed. If you can buffer your patches into
   rectangular tile-aligned blocks, the per-tile re-encode tax amortizes
   away. For true scattered single-pixel workflows there's no shortcut
   (use PLIO_1 if your data fits its non-negative integer constraint).

Practical guidance: for interactive masking on a compressed image,
budget ~50–200 µs per single-pixel touch (algorithm-dependent), which
works out to roughly 5,000–20,000 single-pixel patches per second. For
bulk masking, group patches into rectangular selections that cover whole
tiles where possible.

``repack()`` — peak RSS scaling at large heap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Three repack flavors clean up heap orphans accumulated by
``__setitem__`` / ``extend`` calls:

* ``TableHDU.repack()`` for BINTABLE VLA columns,
* ``CompressedImageHDU.repack()`` for ZIMAGE compressed images,
* ``CompressedTableHDU.repack()`` for ZTABLE compressed tables.

They share a goal (rebuild the heap with only live bytes, shrink the
file) but the implementations differ. This bench measures the peak RSS
at 10 / 100 / 1000 MB heap sizes for each.

Each fixture is built in the parent process; the repack runs in a fresh
subprocess so the reported peak RSS is the repack alone, read via
``/proc/self/status:VmHWM`` (the kernel's per-task high-water). Using
``resource.getrusage(RUSAGE_SELF).ru_maxrss`` here would silently report
the parent's peak, because on Linux ``ru_maxrss`` is accumulated in
``signal_struct`` and inherited across ``fork+exec``; the shared
``h.vm_hwm_kb`` helper is the immune path.

..
list-table:: ``repack()`` peak RSS vs PCOUNT :widths: 36 12 12 14 16 :header-rows: 1 * - flavor - PCOUNT - repack t - peak RSS - notes * - BINTABLE VLA (uncompressed) - 10 MB - 8.2 ms - 50 MB - streaming + staging impl * - BINTABLE VLA - 100 MB - 16.4 ms - **50 MB** - **flat** — no PCOUNT scaling * - BINTABLE VLA - 1,000 MB - 94.2 ms - **50 MB** - **flat** at 1 GB heap * - ZIMAGE compressed image - 11 MB - 2.0 ms - 49 MB - streaming + staging impl * - ZIMAGE compressed image - 119 MB - 19.6 ms - **49 MB** - **flat** — no PCOUNT scaling * - ZIMAGE compressed image - 1,215 MB - 189.6 ms - **50 MB** - **flat** at 1 GB heap * - ZTABLE compressed table - 7 MB - 1.8 ms - 50 MB - streaming + staging impl * - ZTABLE compressed table - 102 MB - 18.3 ms - **50 MB** - **flat** — no PCOUNT scaling * - ZTABLE compressed table - 1,049 MB - 160.4 ms - **50 MB** - **flat** at 1 GB heap Two takeaways: 1. **All three repack flavors are bounded-memory.** RSS stays constant at ~50 MB (Python + rustfits baseline) regardless of heap size, from 10 MB to 1 GB, for all three: BINTABLE VLA, ZIMAGE, and ZTABLE. All three share the same streaming + staging implementation (a ~1 MiB chunk plus the descriptor table + a small move-plan vector; documented in CLAUDE.md under "Heap repack"). The shared primitives ``stream_copy_in_file`` and ``grow_file_to_at_least`` live in ``src/common.rs``. 2. **The rewrite cycle (2026-05-31) was driven by a real in-place-modify-large-compressed-image workload** needing ``repack()`` on a memory-constrained worker. Before the rewrite, BINTABLE VLA repack scaled at ~1.05× PCOUNT and ZIMAGE at ~1.5× PCOUNT. The numbers: .. 
list-table:: Repack rewrite: 1 GB heap, before vs after
   :widths: 28 26 26 20
   :header-rows: 1

   * - flavor
     - before
     - after
     - improvement
   * - BINTABLE VLA
     - 1.05 GB peak, 451 ms
     - **50 MB peak, 94 ms**
     - 21× less RAM, 4.8× faster
   * - ZIMAGE
     - 1.79 GB peak, 819 ms
     - **50 MB peak, 190 ms**
     - 36× less RAM, 4.3× faster
   * - ZTABLE
     - (already streaming)
     - 50 MB peak, ~160 ms
     - (unchanged)

The time win on the two new streaming paths comes from eliminating the
entire Vec-allocate + memcpy + drop cycle on the heap bytes; the
streaming approach reads + writes 1 MiB chunks straight through the file
handle.

Other self-comparisons + RSS benches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZTABLE read/write (rustfits compressed vs rustfits uncompressed) and the
image-extend RSS benches.

.. include:: _perf_tables_self.rst

EXTNAME lookup (``name in fits`` / ``fits[name]``) — linear in HDU count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``name in fits`` and ``fits["NAME"]`` walk the HDU list until a match is
found (or fall through and return False / raise on miss), reading and
parsing each HDU's EXTNAME card along the way. Cost scales linearly at
roughly **0.14 µs per HDU** on the reference machine, or about 1.4 ms
for a 10,000-HDU file when the name lives at the end (or doesn't exist).
fitsio caches a name→index dict and returns in ~10 µs regardless of HDU
count.

For typical files (tens to a few hundred HDUs) the gap is invisible.
Workloads that do many name lookups on a multi-thousand-HDU file will
notice; integer indexing is O(1) in both libraries, so the workaround
until this is fixed is to build the name→index dict yourself once::

   name_to_idx = {h.extname: i for i, h in enumerate(fits) if h.extname}

This could be re-engineered to match fitsio (file-level name→index cache
with version-stamped invalidation) if a real workload asks for it.
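The "name→index cache with version-stamped invalidation" idea can be
sketched in a few lines. This illustrates the technique only and is not
rustfits code: ``NameIndexCache`` and the ``version`` counter are
hypothetical, and the HDU list is modeled as a plain list of EXTNAME
strings.

.. code-block:: python

   class NameIndexCache:
       """Lazily built EXTNAME -> index map, invalidated by a version stamp."""

       def __init__(self):
           self._map = {}
           self._built_for = None  # version the map was built against

       def lookup(self, extnames, version, name):
           """Return the index of ``name`` or raise KeyError.

           ``extnames`` is the current list of EXTNAME strings (None or
           "" for unnamed HDUs). ``version`` must change whenever the
           HDU list changes; a changed version triggers one O(n) rebuild.
           """
           if self._built_for != version:
               self._map = {}
               for i, ext in enumerate(extnames):
                   # Keep the first match, like a front-to-back scan.
                   if ext and ext not in self._map:
                       self._map[ext] = i
               self._built_for = version
           return self._map[name]


   cache = NameIndexCache()
   names = [None, "SCI", "ERR", "SCI"]
   first = cache.lookup(names, version=1, name="SCI")

   names.append("DQ")                                  # HDU list changed,
   second = cache.lookup(names, version=2, name="DQ")  # so bump the version

One design point worth noting: a plain dict comprehension keeps the
*last* index when a name repeats, whereas the loop above keeps the
*first*, which matches a scan that stops at the first hit.
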