Ryan Malloy 25a34cd24d Add atomic .part staging + runtime download root tools
Atomic write pattern (tier-3 polish from headless test finding):
- download_to_file now writes to <dest>.part and renames to <dest> only on
  successful stream completion (os.replace is atomic on POSIX when both paths
  are on the same filesystem, which holds here because the .part file sits
  next to dest). Failed downloads leave only the .part file, so no misleading
  0-byte dest files appear in the user's downloads directory.
- Resume logic reads from <dest>.part instead of <dest>; the user's
  directory only ever contains complete files or clearly-marked .part files.
- New `already_complete` short-circuit: if dest exists and no .part, skip
  the network entirely (still re-verify MD5 if requested). The headless
  Claude test confirmed this avoids redundant CDN load.
- Symlink rejection re-added at the new code path: even though os.replace
  would only replace (not follow) a symlink at dest, predictable refusal
  beats silent symlink removal.
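
The staging pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the names `part_path`, `is_already_complete`, and `stage_and_commit` are hypothetical, and the byte chunks stand in for whatever the HTTP stream yields.

```python
import os
from pathlib import Path
from typing import Iterable


def part_path(dest: Path) -> Path:
    """Return the staging path <dest>.part next to dest (hypothetical helper)."""
    return dest.with_name(dest.name + ".part")


def is_already_complete(dest: Path) -> bool:
    """A finished file with no leftover .part means the network can be skipped."""
    return dest.is_file() and not part_path(dest).exists()


def stage_and_commit(dest: Path, chunks: Iterable[bytes]) -> Path:
    """Append chunks to <dest>.part, then rename to <dest> once the stream ends."""
    if dest.is_symlink():
        # Predictable refusal instead of silently replacing the symlink.
        raise ValueError(f"refusing to replace symlink: {dest}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = part_path(dest)
    # Append mode lets a resumed download continue from the existing .part bytes.
    with open(part, "ab") as fh:
        for chunk in chunks:
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())
    # Only reached if the stream finished without raising.
    os.replace(part, dest)
    return dest
```

If the chunk iterator raises partway through, the exception propagates before `os.replace` runs, so only the `.part` file is left behind and a later call can append to it.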

Runtime download root tools (for stdio MCP mode):
- get_download_root(): reports current root, source (env var vs default),
  existence, writability.
- set_download_root(path): change MCARCHIVE_DOWNLOAD_ROOT mid-session.
  Expands ~, creates the dir, refuses system paths
  (/, /etc, /usr, /bin, /sbin, /var, /sys, /proc, /dev, /boot, /root).
  The lazy-resolved root means the change takes effect on the next
  download_file call without restarting the server.
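
A rough sketch of how such root tools might behave is shown below. It is an assumption-laden illustration: the return shape, the `ENV_VAR` and `SYSTEM_PATHS` names, and the choice to refuse only exact matches of the listed paths (checked both before and after symlink resolution) are not taken from the project's code.

```python
import os
from pathlib import Path

ENV_VAR = "MCARCHIVE_DOWNLOAD_ROOT"
DEFAULT_ROOT = "./downloads"
# Paths listed in the commit message; exact-match refusal is an assumption.
SYSTEM_PATHS = {
    "/", "/etc", "/usr", "/bin", "/sbin", "/var",
    "/sys", "/proc", "/dev", "/boot", "/root",
}


def get_download_root() -> dict:
    """Resolve the root lazily, so environment changes are seen on the next call."""
    raw = os.environ.get(ENV_VAR)
    root = Path(raw or DEFAULT_ROOT).expanduser()
    return {
        "root": str(root),
        "source": "env" if raw else "default",
        "exists": root.is_dir(),
        "writable": root.is_dir() and os.access(root, os.W_OK),
    }


def set_download_root(path: str) -> dict:
    """Expand ~, refuse system paths, create the directory, update the env var."""
    expanded = os.path.abspath(os.path.expanduser(path))
    resolved = Path(expanded).resolve()
    # Check both forms so a symlinked system directory is still refused.
    if expanded in SYSTEM_PATHS or str(resolved) in SYSTEM_PATHS:
        raise ValueError(f"refusing system path: {path}")
    resolved.mkdir(parents=True, exist_ok=True)
    os.environ[ENV_VAR] = str(resolved)
    return get_download_root()
```

Because `get_download_root` reads the environment on every call rather than caching at import time, a download issued after `set_download_root` picks up the new directory without a restart.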

14 new tests (66 total, all green, ruff clean):
- 4 staging tests: failed download leaves no dest, success leaves no .part,
  already_complete short-circuit, MD5 verification on existing files
- 6 root-tools tests: env reporting, default reporting, ~ expansion,
  system-dir refusal (parametrized), set→download takes effect immediately
- 4 existing tests rewritten to use .part as the resume staging file

Headless Claude smoke test verified end-to-end: get_download_root →
set_download_root → search → list → download → second download
short-circuits with already_complete=true and zero network bytes.
2026-04-21 21:11:56 -06:00

mcarchive-org

An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the Internet Archive.

Built on FastMCP + httpx. No API key required — archive.org's read endpoints are public.

Tools

| Tool | Purpose |
|------|---------|
| search_items | Small Solr-style search via advancedsearch.php (1200 rows, paginated) |
| scrape_items | Bulk cursor-paginated search via Scrape API (count ≥ 100) |
| get_item_metadata | Metadata for one item; skips the (possibly huge) files list by default |
| list_files | Files array with optional format / glob filtering; includes download_url per file |
| get_file_url | Build a canonical download URL without hitting the network |
| download_file | Stream a file to disk with resume support and optional MD5 verification |
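
As an illustration of what building a download URL without a network call can look like, here is a minimal sketch. It assumes the public `https://archive.org/download/<identifier>/<filename>` URL layout; the function name `build_download_url` is hypothetical and is not the server's actual implementation.

```python
from urllib.parse import quote


def build_download_url(identifier: str, filename: str) -> str:
    """Compose a download URL purely from strings, with no HTTP request."""
    # Keep "/" in filenames so files inside item subdirectories still resolve.
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename, safe='/')}"
    )
```

Percent-encoding matters mostly for filenames containing spaces or non-ASCII characters; plain identifiers pass through unchanged.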

Also exposes an MCP resource template: archive://item/{identifier}.

Install & run

# From a checkout:
uv sync
uv run mcarchive-org

# Or from PyPI (once published):
uvx mcarchive-org

Register with Claude Code:

claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org

Environment

| Variable | Default | Purpose |
|----------|---------|---------|
| MCARCHIVE_DOWNLOAD_ROOT | ./downloads | Base directory for download_file |

Example flow

search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
  → identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)

list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
  → [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]

download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
  → { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
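
The `md5_ok` result in the flow above comes from comparing the downloaded file's digest with the `md5` value reported by `list_files`. A minimal sketch of such a check, using only the standard library, might look like this; the helper names `file_md5` and `md5_matches` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large downloads are never loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path: Path, expected_hex: str) -> bool:
    """Compare case-insensitively, since hex digests may be upper or lower case."""
    return file_md5(path) == expected_hex.strip().lower()
```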

Query syntax notes

archive.org uses a Solr/Lucene dialect:

  • mediatype:(audio OR movies) — restrict to media types
  • collection:etree — items in a specific collection
  • date:[1977-01-01 TO 1977-12-31] — date ranges
  • creator:"Grateful Dead" — phrase match
  • -subject:bootleg — exclusion
  • Sort by downloads desc, date asc, addeddate desc, etc.

See archive.org's search docs for the full grammar.
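
To show how these query fragments end up in a request, the sketch below assembles an advancedsearch.php URL. The parameter names (`q`, `fl[]`, `sort[]`, `rows`, `page`, `output`) are those of archive.org's public advanced search endpoint; the function name and its defaults are hypothetical and may differ from what search_items actually sends.

```python
from typing import Sequence
from urllib.parse import urlencode


def build_search_url(
    query: str,
    fields: Sequence[str] = ("identifier", "title"),
    sort: Sequence[str] = ("downloads desc",),
    rows: int = 50,
    page: int = 1,
) -> str:
    """Assemble an advancedsearch.php URL; repeated keys carry the list values."""
    params = [("q", query)]
    params += [("fl[]", name) for name in fields]
    params += [("sort[]", order) for order in sort]
    params += [("rows", str(rows)), ("page", str(page)), ("output", "json")]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = build_search_url(
    'collection:etree AND creator:"Grateful Dead" AND date:[1977-01-01 TO 1977-12-31]'
)
```

Passing the whole Lucene expression as a single `q` value and letting `urlencode` handle quoting avoids hand-escaping the colons, brackets, and quotes.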

License

MIT

Description
MCP server for searching and downloading files from the Internet Archive (archive.org)
Readme 185 KiB
2026-04-22 04:18:06 +00:00
Languages
Python 100%