mcarchive-org/CHANGELOG.md
Ryan Malloy a3c7b69ba8
Release 2026.4.21.1: refresh project URLs after warehack.ing transfer
PyPI metadata is immutable per version, so this post-release exists solely
to refresh the [project.urls] block: Homepage / Repository / Bug Tracker /
Changelog now point at git.supported.systems/warehack.ing/mcarchive-org
(the new canonical home after the org transfer).

No code changes. Same wheel contents as 2026.4.21, only METADATA URLs
differ.
2026-04-21 22:17:50 -06:00


Changelog

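Versioning is date-based: YYYY.MM.DD for normal releases, YYYY.MM.DD.N for same-day fixes. The trailing N is an extra PEP 440 release segment (not a formal .postN post-release), so 2026.4.21.1 sorts after 2026.4.21. PEP 440 drops leading zeros during normalization, which is why 2026.04.21 and 2026.4.21 name the same version.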
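The ordering this scheme relies on can be sketched with plain integer tuples. This is a simplified illustration rather than a full PEP 440 parser: `release_key` is a hypothetical helper, and it handles only dot-separated numeric versions like the ones in this changelog.

```python
def release_key(version: str) -> tuple:
    """Split a dotted numeric version into a tuple of ints for comparison."""
    # int() drops leading zeros, so "04" and "4" compare equal.
    return tuple(int(part) for part in version.split("."))


# The zero-padded and unpadded spellings produce the same key.
print(release_key("2026.04.21") == release_key("2026.4.21"))  # True

# A same-day fix carries a fourth segment and sorts after the base release.
print(release_key("2026.4.21") < release_key("2026.4.21.1"))  # True
```

Unlike real PEP 440 comparison, this sketch does not zero-pad shorter versions, so it should not be used as a general version comparator.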

2026.4.21.1 — metadata refresh

Project URLs in package metadata updated to point at the new canonical home: git.supported.systems/warehack.ing/mcarchive-org. No code changes — same wheel contents, just refreshed Project-URL fields. PyPI metadata is immutable per version, hence the post-release bump rather than an in-place edit.

2026.04.21 — initial release

First public release. An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the Internet Archive. No API key required.

Tools

  • search_items — Solr-style search via advancedsearch.php (1200 rows, paginated)
  • scrape_items — bulk cursor-paginated search via the Scrape API (count ≥ 100)
  • get_item_metadata — item metadata; skips the (potentially huge) files list by default
  • list_files — files array with optional format / fnmatch glob filtering, includes pre-built download_url per file
  • get_file_url — build a canonical download URL without hitting the network
  • download_file — stream a file to disk with HTTP Range resume + optional MD5 verification
  • get_download_root — report current download root and its source (env var vs default)
  • set_download_root — change the download root mid-session (useful in stdio mode where env vars can't be re-exported)

Plus an MCP resource template: archive://item/{identifier}.
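As a rough illustration of what "without hitting the network" means for get_file_url, the sketch below builds a download URL with string handling alone. The function name `build_download_url` and the constant `DOWNLOAD_BASE` are hypothetical, and the escaping rules are an assumption; the server's actual implementation may differ.

```python
from urllib.parse import quote

# Assumed base: archive.org serves item files under /download/<identifier>/<filename>.
DOWNLOAD_BASE = "https://archive.org/download"


def build_download_url(identifier: str, filename: str) -> str:
    """Hypothetical stand-in for get_file_url: pure string work, no I/O."""
    # Keep "/" unescaped in the filename so files in subdirectories keep their path shape.
    return f"{DOWNLOAD_BASE}/{quote(identifier, safe='')}/{quote(filename, safe='/')}"


print(build_download_url("example-item", "scans/page 01.jpg"))
# https://archive.org/download/example-item/scans/page%2001.jpg
```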

Reliability features

  • Input validation: identifiers must match ^[A-Za-z0-9._-]+$; filenames reject .. components, absolute paths, NUL bytes, and Windows drive letters before any FS or network I/O
  • Path confinement: download destinations are resolved and asserted to live under MCARCHIVE_DOWNLOAD_ROOT; symlinks at the destination are refused
  • O_NOFOLLOW: defense-in-depth against symlink-substitution races on the destination file
  • Range-correctness check: when resuming, the server's response must be HTTP 206 with a matching Content-Range start byte — otherwise the download aborts before any byte is written, eliminating silent file corruption
  • Atomic write staging: downloads write to <dest>.part and are renamed to <dest> only on successful completion (POSIX-atomic). Failed downloads leave only .part, never an empty dest
  • Already-complete short-circuit: re-downloading an already-complete file skips the network entirely (still re-verifies MD5 if asked)
  • Retry with backoff: 429/502/503/504 responses are retried up to 3 times, with Retry-After honored (delta-seconds and HTTP-date forms) and exponential backoff with jitter, capped at 30 s. Retries happen before any bytes are yielded, so a retry can never corrupt a partial write
  • Concurrent-download serialization: per-(identifier, filename) asyncio.Lock prevents two parallel calls from racing on the same destination file. Different files still download in parallel
  • Stream-abort surfacing: httpx.ReadError/RemoteProtocolError/ConnectError/ReadTimeout mid-stream are caught and re-raised as ArchiveError with a byte-count context so the caller knows where the partial download ended
  • Error body surfacing: 4xx/5xx responses include a body preview in the exception message — invaluable for an LLM trying to fix a bad query
  • Process-wide shared httpx.AsyncClient: one connection pool reused across the server's lifetime (no TCP+TLS handshake per tool call)
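Three of the checks above (filename validation, the Range-correctness check, and Retry-After parsing) can be sketched with the standard library alone. All function names here are hypothetical, and the exact rules are assumptions inferred from the bullet descriptions, not the project's actual code.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

CONTENT_RANGE_RE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")
MAX_DELAY_SECONDS = 30.0  # cap quoted in the changelog


def validate_filename(filename: str) -> str:
    """Reject unsafe filenames before any filesystem or network I/O."""
    if not filename or "\x00" in filename:
        raise ValueError("empty filename or NUL byte")
    if filename.startswith(("/", "\\")) or re.match(r"[A-Za-z]:", filename):
        raise ValueError("absolute path or Windows drive letter")
    if ".." in re.split(r"[/\\]", filename):
        raise ValueError("'..' path component")
    return filename


def check_resume(status_code: int, content_range: Optional[str], offset: int) -> None:
    """Raise unless the response is a 206 whose range starts exactly at offset."""
    if status_code != 206:
        raise ValueError(f"expected 206 for resume, got {status_code}")
    match = CONTENT_RANGE_RE.fullmatch(content_range or "")
    if match is None or int(match.group(1)) != offset:
        raise ValueError(f"Content-Range {content_range!r} does not start at {offset}")


def retry_after_seconds(value: str, now: datetime) -> float:
    """Parse Retry-After in delta-seconds or HTTP-date form, clamped to 0..30 s."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        delay = float(value)
    else:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT" (timezone-aware result).
        delay = (parsedate_to_datetime(value) - now).total_seconds()
    return max(0.0, min(delay, MAX_DELAY_SECONDS))


now = datetime(2015, 10, 21, 7, 27, 50, tzinfo=timezone.utc)
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now))  # 10.0
```

Calling `check_resume` before opening the `.part` file for append is what lets a mismatched response abort with nothing written: a server that ignores the Range header and answers 200 with the full body is rejected up front.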

Output normalization

  • collection field is always list[str] (archive.org returns string OR list inconsistently)
  • Every search doc / metadata response includes a derived is_collection: bool so LLMs can route collection containers vs. real media items without re-querying
  • File entries always include a ready-to-use download_url plus size_human ("12.3 MB") alongside raw size in bytes
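The normalization rules above might look roughly like the following. The helper names are hypothetical, and two details are assumptions not stated in this changelog: that is_collection is derived from a mediatype of "collection", and that size_human uses decimal (1000-based) units.

```python
from typing import Any, Dict, List


def normalize_collection(value: Any) -> List[str]:
    """Coerce archive.org's string-or-list collection field to a list of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


def size_human(num_bytes: int) -> str:
    """Format a byte count using decimal units (assumed; could equally be 1024-based)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{num_bytes} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1000
    raise AssertionError("unreachable: the last unit always returns")


def normalize_doc(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a search doc with the normalized and derived fields."""
    out = dict(doc)
    out["collection"] = normalize_collection(doc.get("collection"))
    # Assumption: collection containers are marked by mediatype == "collection".
    out["is_collection"] = doc.get("mediatype") == "collection"
    return out


print(normalize_doc({"identifier": "example-item", "collection": "opensource", "mediatype": "texts"}))
print(size_human(12_300_000))  # 12.3 MB
```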

Tests

66 tests total (4 live integration against archive.org + 62 mock-transport regression tests). Mock tests cover every reliability claim above so future refactors can't silently regress safety.