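# Changelog

Versioning is date-based: `YYYY.MM.DD` for normal releases, `YYYY.MM.DD.N` for same-day fixes. The extra `.N` segment is used as a post-release: PEP 440 sorts it after `YYYY.MM.DD`, although the formal PEP 440 post-release spelling is `.postN`. PEP 440 also drops leading zeros on normalization, so `2026.04.21` and `2026.4.21` name the same release.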
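As a rough illustration of the scheme above, the sketch below builds version strings from a date and compares them as integer tuples. The helper names `release_version` and `sort_key` are hypothetical, and the tuple comparison is a simplification that only handles purely numeric versions; it is not a full PEP 440 implementation.

```python
from datetime import date


def release_version(day: date, fix: int = 0) -> str:
    """Return the date-based version, plus a .N segment for follow-up fixes."""
    base = f"{day.year}.{day.month}.{day.day}"
    return base if fix == 0 else f"{base}.{fix}"


def sort_key(version: str) -> tuple[int, ...]:
    """Compare purely numeric dotted versions as integer tuples.

    int() discards leading zeros, which mirrors PEP 440 integer normalization.
    """
    return tuple(int(part) for part in version.split("."))


initial = release_version(date(2026, 4, 21))
refresh = release_version(date(2026, 4, 21), fix=1)
print(initial, refresh)
print(sort_key(refresh) > sort_key(initial))        # the .N release sorts later
print(sort_key("2026.04.21") == sort_key(initial))  # leading zeros do not matter
```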

## 2026.4.21.1 — metadata refresh

Project URLs in package metadata (Homepage, Repository, Bug Tracker, Changelog) now point at the new canonical home after the org transfer: `git.supported.systems/warehack.ing/mcarchive-org`. No code changes: the wheel contents are the same as 2026.4.21, and only the `Project-URL` fields in `METADATA` differ. PyPI metadata is immutable per version, hence the post-release bump rather than an in-place edit.

## 2026.04.21 — initial release

First public release. An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the [Internet Archive](https://archive.org). No API key required.

### Tools

- `search_items` — Solr-style search via `advancedsearch.php` (1–200 rows, paginated)
- `scrape_items` — bulk cursor-paginated search via the Scrape API (count ≥ 100)
- `get_item_metadata` — item metadata; skips the (potentially huge) files list by default
- `list_files` — files array with optional format / fnmatch glob filtering; each file includes a pre-built `download_url`
- `get_file_url` — build a canonical download URL without hitting the network
- `download_file` — stream a file to disk with HTTP Range resume + optional MD5 verification
- `get_download_root` — report current download root and its source (env var vs default)
- `set_download_root` — change the download root mid-session (useful in stdio mode where env vars can't be re-exported)

Plus an MCP resource template: `archive://item/{identifier}`.
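To make the `get_file_url` and `list_files` descriptions concrete, here is a minimal sketch of offline URL building and glob filtering. The function names `build_download_url` and `filter_files` are hypothetical and are not taken from the package; the sketch assumes the usual `https://archive.org/download/<identifier>/<filename>` layout and file entries shaped like dicts with `name` and `format` keys.

```python
from fnmatch import fnmatchcase
from urllib.parse import quote


def build_download_url(identifier: str, filename: str) -> str:
    """Build a download URL from its parts without any network access."""
    # "/" stays literal in the filename so files in subdirectories still resolve.
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename, safe='/')}"
    )


def filter_files(identifier, files, fmt=None, pattern=None):
    """Keep entries matching an optional format name and an optional glob."""
    selected = []
    for entry in files:
        if fmt is not None and entry.get("format") != fmt:
            continue
        # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
        if pattern is not None and not fnmatchcase(entry["name"], pattern):
            continue
        url = build_download_url(identifier, entry["name"])
        selected.append({**entry, "download_url": url})
    return selected


# Made-up sample data, for illustration only.
sample = [
    {"name": "disc 1/track 01.mp3", "format": "VBR MP3"},
    {"name": "disc 1/track 01.flac", "format": "Flac"},
]
for item in filter_files("example-item", sample, pattern="*.mp3"):
    print(item["download_url"])
```

Attaching the URL to each entry up front means a caller never has to assemble or escape paths itself.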

### Reliability features

- **Input validation**: identifiers must match `^[A-Za-z0-9._-]+$`; filenames containing `..` components, absolute paths, NUL bytes, or Windows drive letters are rejected before any filesystem or network I/O
- **Path confinement**: download destinations are resolved and asserted to live under `MCARCHIVE_DOWNLOAD_ROOT`; symlinks at the destination are refused
- **`O_NOFOLLOW`**: defense-in-depth against symlink-substitution races on the destination file
- **Range-correctness check**: when resuming, the server's response must be HTTP 206 with a `Content-Range` start byte that matches the resume offset; otherwise the download aborts before any byte is written, so a resumed file cannot be silently corrupted
- **Atomic write staging**: downloads write to `<dest>.part` and are renamed to `<dest>` only on successful completion (POSIX-atomic). Failed downloads leave only `<dest>.part`, never an empty `<dest>`
- **Already-complete short-circuit**: re-downloading an already-complete file skips the network entirely (still re-verifies MD5 if asked)
- **Retry with backoff**: HTTP 429/502/503/504 responses are retried up to 3 times, with `Retry-After` honored (delta-seconds and HTTP-date forms) and exponential backoff with jitter, capped at 30s. Retries happen *before* any bytes are yielded, so a retry can never corrupt a partial write
- **Concurrent-download serialization**: a per-`(identifier, filename)` `asyncio.Lock` prevents two parallel calls from racing on the same destination file. Different files still download in parallel
- **Stream-abort surfacing**: `httpx.ReadError`/`RemoteProtocolError`/`ConnectError`/`ReadTimeout` raised mid-stream are caught and re-raised as `ArchiveError` with the byte count reached, so the caller knows where the partial download ended
- **Error body surfacing**: 4xx/5xx responses include a body preview in the exception message, which helps an LLM correct a bad query
- **Process-wide shared `httpx.AsyncClient`**: one connection pool reused across the server's lifetime (no TCP+TLS handshake per tool call)
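The input-validation and path-confinement bullets can be sketched roughly as follows. The function names and the `root/identifier/filename` layout are assumptions made for illustration, not the package's actual code.

```python
import re
from pathlib import Path

IDENTIFIER_PATTERN = r"[A-Za-z0-9._-]+"  # same character class as the rule above
DRIVE_LETTER = re.compile(r"^[A-Za-z]:")


def validate_identifier(identifier: str) -> str:
    # fullmatch avoids the quirk where "$" also matches before a trailing newline.
    if not re.fullmatch(IDENTIFIER_PATTERN, identifier):
        raise ValueError(f"invalid identifier: {identifier!r}")
    return identifier


def validate_filename(filename: str) -> str:
    if not filename or "\x00" in filename:
        raise ValueError("filename is empty or contains a NUL byte")
    if filename.startswith(("/", "\\")) or DRIVE_LETTER.match(filename):
        raise ValueError("absolute paths and drive letters are rejected")
    if ".." in filename.replace("\\", "/").split("/"):
        raise ValueError("'..' components are rejected")
    return filename


def confine(root: Path, identifier: str, filename: str) -> Path:
    """Return the resolved destination, or raise if it leaves the root."""
    root = root.resolve()
    candidate = root / validate_identifier(identifier) / validate_filename(filename)
    # Check for a symlink before resolving, because resolve() follows it.
    if candidate.is_symlink():
        raise ValueError("destination is a symlink")
    resolved = candidate.resolve()
    if not resolved.is_relative_to(root):  # Python 3.9+
        raise ValueError("destination escapes the download root")
    return resolved
```

In this sketch the confinement check is what rejects an identifier of `..`, which the character-class pattern alone would accept, so the two layers are worth keeping separate.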
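The staging behaviour described under atomic write staging and `O_NOFOLLOW` might look something like the sketch below. `write_atomically` is a hypothetical helper: it always starts the `.part` file from scratch (no Range resume), and applying `O_NOFOLLOW` to the staging file is an assumption about where the flag belongs.

```python
import os
from pathlib import Path


def write_atomically(dest: Path, chunks) -> int:
    """Write chunks to <dest>.part, then rename onto <dest> on success."""
    part = dest.with_name(dest.name + ".part")
    # O_NOFOLLOW makes open() fail if the final path component is a symlink.
    # getattr keeps the sketch importable on platforms without the flag.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(part, flags, 0o644)
    written = 0
    with os.fdopen(fd, "wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
            written += len(chunk)
        handle.flush()
        os.fsync(handle.fileno())
    # Reached only if every chunk was written; an exception above leaves
    # the .part file behind and never creates dest.
    os.replace(part, dest)
    return written
```

Because the rename is the last step, a reader of `<dest>` sees either the previous state or the complete file, never a half-written one.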
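The range-correctness check and the retry delay can be sketched with the standard library alone. Both function names are hypothetical. Applying the 30s cap to `Retry-After` values as well as to the backoff, and using "full jitter" (a uniform draw between zero and the exponential bound), are assumptions; the changelog does not say which jitter scheme or cap scope the package uses.

```python
import random
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def check_resume_response(status: int, content_range, offset: int) -> None:
    """Raise unless the server really resumed at the requested offset."""
    if status != 206:
        raise ValueError(f"expected HTTP 206 for a resume, got {status}")
    match = CONTENT_RANGE.match(content_range or "")
    if match is None or int(match.group(1)) != offset:
        raise ValueError(f"Content-Range {content_range!r} does not start at {offset}")


def retry_delay(attempt: int, retry_after=None, now=None, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        value = retry_after.strip()
        if value.isdigit():  # delta-seconds form
            return min(float(value), cap)
        try:  # HTTP-date form
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return min(max((when - now).total_seconds(), 0.0), cap)
    # No usable header: exponential backoff with full jitter, capped.
    return random.uniform(0.0, min(cap, 2.0 ** attempt))
```

Running the check before opening the file for append is what keeps a server that ignores the `Range` header (and answers 200 with the whole body) from corrupting the partial file.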
### Output normalization

- `collection` field is always `list[str]` (archive.org returns either a string or a list)
- Every search doc / metadata response includes a derived `is_collection: bool` so LLMs can route collection containers vs. real media items without re-querying
- File entries always include a ready-to-use `download_url` plus `size_human` ("12.3 MB") alongside raw `size` in bytes
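A minimal sketch of these normalizations follows. The function names are hypothetical. Deriving `is_collection` from `mediatype == "collection"` and formatting sizes with decimal (1000-based) units are both assumptions; the changelog only gives the field names and the "12.3 MB" example.

```python
def normalize_collection(value) -> list[str]:
    """Coerce a missing, string, or list value to list[str]."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


def size_human(size: int) -> str:
    """Format a byte count with decimal units (an assumption, see above)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    amount = float(size)
    for unit in units:
        if amount < 1000 or unit == units[-1]:
            return f"{int(amount)} B" if unit == "B" else f"{amount:.1f} {unit}"
        amount /= 1000
    raise AssertionError("unreachable: the last unit always returns")


def normalize_doc(doc: dict) -> dict:
    out = dict(doc)
    out["collection"] = normalize_collection(doc.get("collection"))
    # Assumption: collection containers carry mediatype "collection".
    out["is_collection"] = doc.get("mediatype") == "collection"
    return out


def normalize_file(entry: dict) -> dict:
    out = dict(entry)
    out["size"] = int(entry["size"])  # the raw value may arrive as a string
    out["size_human"] = size_human(out["size"])
    return out


print(normalize_doc({"identifier": "example-item", "collection": "opensource_audio"}))
print(normalize_file({"name": "track01.mp3", "size": "12300000"}))
```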

### Tests

66 tests total: 4 live integration tests against archive.org and 62 mock-transport regression tests. The mock tests cover every reliability claim above so future refactors can't silently regress safety.