PyPI metadata is immutable per version, so this post-release exists solely to refresh the [project.urls] block: Homepage / Repository / Bug Tracker / Changelog now point at git.supported.systems/warehack.ing/mcarchive-org (the new canonical home after the org transfer). No code changes. Same wheel contents as 2026.4.21, only METADATA URLs differ.
Changelog
Versioning is date-based: YYYY.MM.DD for normal releases, YYYY.MM.DD.N for same-day fixes. N plays the role of a post-release: under PEP 440 it is an extra release segment, so 2026.4.21.1 sorts after 2026.4.21.
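The scheme can be sketched in a few lines of Python. This is only an illustration: the helper names release_version, normalized, and sort_key are hypothetical and not part of the package. The sketch also shows why 2026.04.21 and 2026.4.21 name the same release (PEP 440 normalization drops leading zeros) and why a fourth segment sorts after the base version.

```python
from datetime import date
from typing import Optional


def release_version(day: date, post: Optional[int] = None) -> str:
    """Format a date-based version string (hypothetical helper)."""
    base = f"{day.year}.{day.month:02d}.{day.day:02d}"
    return base if post is None else f"{base}.{post}"


def normalized(version: str) -> str:
    """Drop leading zeros from each numeric segment, as PEP 440 normalization does.

    Only handles purely numeric, dot-separated versions.
    """
    return ".".join(str(int(part)) for part in version.split("."))


def sort_key(version: str) -> tuple[int, ...]:
    """Ordering key for purely numeric versions.

    Trailing zero segments are stripped so that 1.0 and 1.0.0 compare equal,
    matching PEP 440's zero-padding rule for release segments.
    """
    parts = [int(part) for part in version.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


if __name__ == "__main__":
    first = release_version(date(2026, 4, 21))
    fix = release_version(date(2026, 4, 21), post=1)
    print(first, "->", normalized(first))
    print(fix, "->", normalized(fix))
    print(sort_key(fix) > sort_key(first))
```

For anything beyond purely numeric versions, a full PEP 440 parser is the safer choice; the tuple comparison here is enough only for this date-based scheme.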
2026.4.21.1 — metadata refresh
Project URLs in package metadata updated to point at the new canonical home: git.supported.systems/warehack.ing/mcarchive-org. No code changes — same wheel contents, just refreshed Project-URL fields. PyPI metadata is immutable per version, hence the post-release bump rather than an in-place edit.
2026.04.21 — initial release
First public release. An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the Internet Archive. No API key required.
Tools
- search_items: Solr-style search via advancedsearch.php (1–200 rows, paginated)
- scrape_items: bulk cursor-paginated search via the Scrape API (count ≥ 100)
- get_item_metadata: item metadata; skips the (potentially huge) files list by default
- list_files: files array with optional format / fnmatch glob filtering; includes a pre-built download_url per file
- get_file_url: build a canonical download URL without hitting the network
- download_file: stream a file to disk with HTTP Range resume + optional MD5 verification
- get_download_root: report current download root and its source (env var vs default)
- set_download_root: change the download root mid-session (useful in stdio mode where env vars can't be re-exported)
Plus an MCP resource template: archive://item/{identifier}.
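As an illustration of what "without hitting the network" means for get_file_url, here is a minimal sketch of pure string construction. It assumes the public archive.org download layout (https://archive.org/download/IDENTIFIER/FILENAME) and percent-encoding of the path; the function name build_download_url and the constant are hypothetical, and the server's actual implementation may differ.

```python
from urllib.parse import quote

# Assumption: the public archive.org download endpoint.
ARCHIVE_DOWNLOAD_BASE = "https://archive.org/download"


def build_download_url(identifier: str, filename: str) -> str:
    """Build a download URL from its parts; performs no network I/O.

    Slashes in the filename are kept because items can contain
    files in subdirectories; everything else unsafe is percent-encoded.
    """
    safe_identifier = quote(identifier, safe="")
    safe_filename = quote(filename, safe="/")
    return f"{ARCHIVE_DOWNLOAD_BASE}/{safe_identifier}/{safe_filename}"


if __name__ == "__main__":
    print(build_download_url("example-item", "disc 1/track01.mp3"))
```

Because the URL is derived only from the identifier and filename, it can be computed for any file listed by list_files without an extra request.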
Reliability features
- Input validation: identifiers must match ^[A-Za-z0-9._-]+$; filenames reject .. components, absolute paths, NUL bytes, and Windows drive letters before any FS or network I/O
- Path confinement: download destinations are resolved and asserted to live under MCARCHIVE_DOWNLOAD_ROOT; symlinks at the destination are refused
- O_NOFOLLOW: defense-in-depth against symlink-substitution races on the destination file
- Range-correctness check: when resuming, the server's response must be HTTP 206 with a matching Content-Range start byte; otherwise the download aborts before any byte is written, eliminating silent file corruption
- Atomic write staging: downloads write to <dest>.part and are renamed to <dest> only on successful completion (POSIX-atomic). Failed downloads leave only .part, never an empty dest
- Already-complete short-circuit: re-downloading an already-complete file skips the network entirely (still re-verifies MD5 if asked)
- Retry with backoff: 429/502/503/504 retried up to 3 times with Retry-After honored (delta-seconds and HTTP-date forms), exponential backoff with jitter, capped at 30s. Retries happen before any bytes are yielded, so retry can never corrupt a partial write
- Concurrent-download serialization: per-(identifier, filename) asyncio.Lock prevents two parallel calls from racing on the same destination file. Different files still download in parallel
- Stream-abort surfacing: httpx.ReadError / RemoteProtocolError / ConnectError / ReadTimeout mid-stream are caught and re-raised as ArchiveError with a byte-count context so the caller knows where the partial download ended
- Error body surfacing: 4xx/5xx responses include a body preview in the exception message, which is invaluable for an LLM trying to fix a bad query
- Process-wide shared httpx.AsyncClient: one connection pool reused across the server's lifetime (no TCP+TLS handshake per tool call)
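Two of the checks above, input validation and the Range-correctness check, can be sketched with the standard library alone. This is a sketch under stated assumptions, not the server's code: the function names, the use of ValueError (the server raises its own ArchiveError), and the extra rejection of leading backslashes are all hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Identifier pattern from the changelog. fullmatch() is used so that a
# trailing newline cannot slip past the "$" anchor.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9._-]+")
DRIVE_LETTER_RE = re.compile(r"^[A-Za-z]:")
CONTENT_RANGE_RE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def validate_identifier(identifier: str) -> str:
    """Return the identifier unchanged, or raise ValueError if it is unsafe."""
    if not IDENTIFIER_RE.fullmatch(identifier):
        raise ValueError(f"invalid identifier: {identifier!r}")
    return identifier


def validate_filename(filename: str) -> str:
    """Reject NUL bytes, absolute paths, drive letters, and '..' components."""
    if not filename or "\x00" in filename:
        raise ValueError("filename is empty or contains a NUL byte")
    # Assumption: leading backslashes are treated as absolute too.
    if filename.startswith(("/", "\\")) or DRIVE_LETTER_RE.match(filename):
        raise ValueError(f"absolute path not allowed: {filename!r}")
    if ".." in re.split(r"[\\/]", filename):
        raise ValueError(f"'..' component not allowed: {filename!r}")
    return filename


def check_resume_response(
    status_code: int, content_range: Optional[str], offset: int
) -> None:
    """Raise unless the response really resumes at `offset`.

    Intended to be called before the first byte is appended to the
    partial file, so a mismatch cannot corrupt existing data.
    """
    if status_code != 206:
        raise ValueError(f"expected HTTP 206 for a resume, got {status_code}")
    match = CONTENT_RANGE_RE.fullmatch(content_range or "")
    if match is None or int(match.group(1)) != offset:
        raise ValueError(
            f"Content-Range {content_range!r} does not start at byte {offset}"
        )
```

A server that ignores the Range header typically answers 200 with the full body; appending that body to a partial file would duplicate the first bytes, which is the corruption the 206 check is meant to rule out.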
Output normalization
- collection field is always list[str] (archive.org returns string OR list inconsistently)
- Every search doc / metadata response includes a derived is_collection: bool so LLMs can route collection containers vs. real media items without re-querying
- File entries always include a ready-to-use download_url plus size_human ("12.3 MB") alongside raw size in bytes
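The normalization rules above might look roughly like the following. The function names are hypothetical, and two details are assumptions rather than facts from this changelog: that is_collection is derived from mediatype == "collection", and that size_human uses decimal (1000-based) units.

```python
from typing import Any


def normalize_collection(value: Any) -> list[str]:
    """Coerce the 'collection' field to a list of strings.

    archive.org may return a single string, a list, or nothing at all.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


def is_collection(doc: dict) -> bool:
    """Assumption: collection containers carry mediatype == 'collection'."""
    return doc.get("mediatype") == "collection"


def size_human(num_bytes: int) -> str:
    """Render a byte count such as 12300000 as '12.3 MB' (decimal units assumed)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1000 or unit == "TB":
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1000
    raise AssertionError("unreachable")


if __name__ == "__main__":
    doc = {"identifier": "example-item", "collection": "opensource"}
    doc["collection"] = normalize_collection(doc.get("collection"))
    doc["is_collection"] = is_collection(doc)
    print(doc, size_human(12_300_000))
```

Doing this once on the server side means the caller never has to branch on whether a field is a string or a list.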
Tests
66 tests total (4 live integration against archive.org + 62 mock-transport regression tests). Mock tests cover every reliability claim above so future refactors can't silently regress safety.