Ryan Malloy 25a34cd24d Add atomic .part staging + runtime download root tools

Atomic write pattern (tier-3 polish from headless test finding):
- download_to_file now writes to <dest>.part and renames to <dest> only on
  successful stream completion (os.replace is POSIX-atomic). Failed
  downloads leave only the .part file — no misleading 0-byte dest files
  in the user's downloads directory.
- Resume logic reads from <dest>.part instead of <dest>; the user's
  directory only ever contains complete files or clearly-marked .part files.
- New `already_complete` short-circuit: if dest exists and no .part, skip
  the network entirely (still re-verify MD5 if requested). The headless
  Claude test confirmed this avoids redundant CDN load.
- Symlink rejection re-added at the new code path: even though os.replace
  would only replace (not follow) a symlink at dest, predictable refusal
  beats silent symlink removal.
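
The staging behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/mcarchive_org`: the function name `stage_download`, the `fetch(offset)` callable, and the returned dictionary keys are assumptions made for the example, and MD5 re-verification is left out.

```python
import os
from pathlib import Path
from typing import Callable, Iterable


def stage_download(dest: Path, fetch: Callable[[int], Iterable[bytes]]) -> dict:
    """Write to <dest>.part and rename onto dest only after the stream completes.

    `fetch(offset)` is a hypothetical callable that yields the remaining
    bytes of the file starting at `offset` (e.g. via an HTTP Range request).
    """
    # Refuse symlinks explicitly instead of letting os.replace swap them out.
    if dest.is_symlink():
        raise ValueError(f"refusing to replace symlink: {dest}")

    part = dest.with_name(dest.name + ".part")

    # Short-circuit: a finished file with no staging file needs no network.
    if dest.exists() and not part.exists():
        return {"path": str(dest), "already_complete": True}

    # Resume from whatever an earlier, interrupted attempt left behind.
    offset = part.stat().st_size if part.exists() else 0
    with open(part, "ab") as fh:
        for chunk in fetch(offset):
            fh.write(chunk)

    # Only reached if the stream finished without raising.
    os.replace(part, dest)
    return {"path": str(dest), "already_complete": False}
```

If `fetch` raises partway through, the exception propagates before `os.replace` runs, so only the `.part` file remains and the next call resumes from its size.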

Runtime download root tools (for stdio MCP mode):
- get_download_root(): reports current root, source (env var vs default),
  existence, writability.
- set_download_root(path): change MCARCHIVE_DOWNLOAD_ROOT mid-session.
  Expands ~, creates the dir, refuses system paths
  (/, /etc, /usr, /bin, /sbin, /var, /sys, /proc, /dev, /boot, /root).
  The lazy-resolved root means the change takes effect on the next
  download_file call without restarting the server.
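
As a rough sketch of how lazy resolution and the system-path refusal could fit together: the helper names `describe_root` and `change_root`, the returned keys, and the exact-match refusal rule are assumptions for illustration, not the server's actual implementation. The example also assumes a POSIX filesystem.

```python
import os
from pathlib import Path

ENV_VAR = "MCARCHIVE_DOWNLOAD_ROOT"
DEFAULT_ROOT = "./downloads"
# Paths listed in the commit message; matched exactly here (an assumption).
REFUSED_ROOTS = frozenset(
    {"/", "/etc", "/usr", "/bin", "/sbin", "/var", "/sys", "/proc", "/dev", "/boot", "/root"}
)


def resolve_root() -> Path:
    """Read the root on every call, so a changed env var applies immediately."""
    return Path(os.environ.get(ENV_VAR, DEFAULT_ROOT)).expanduser()


def describe_root() -> dict:
    """Report the current root, where it came from, and whether it is usable."""
    root = resolve_root()
    return {
        "root": str(root),
        "source": "env" if ENV_VAR in os.environ else "default",
        "exists": root.is_dir(),
        "writable": os.access(root, os.W_OK),
    }


def change_root(path: str) -> dict:
    """Expand ~, refuse system paths, create the directory, update the env var."""
    expanded = os.path.abspath(os.path.expanduser(path))
    if expanded in REFUSED_ROOTS:
        raise ValueError(f"refusing system path: {expanded}")
    os.makedirs(expanded, exist_ok=True)
    os.environ[ENV_VAR] = expanded
    return describe_root()
```

Because nothing caches the result of `resolve_root()`, any code that calls it at download time picks up the new value without a restart.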

14 new tests (66 total, all green, ruff clean):
- 4 staging tests: failed download leaves no dest, success leaves no .part,
  already_complete short-circuit, MD5 verification on existing files
- 6 root-tools tests: env reporting, default reporting, ~ expansion,
  system-dir refusal (parametrized), set→download takes effect immediately
- 4 existing tests rewritten to use .part as the resume staging file

Headless Claude smoke test verified end-to-end: get_download_root →
set_download_root → search → list → download → second download
short-circuits with already_complete=true and zero network bytes.

2026-04-21 21:11:56 -06:00

src/mcarchive_org

Add atomic .part staging + runtime download root tools

2026-04-21 21:11:56 -06:00

tests

Add atomic .part staging + runtime download root tools

2026-04-21 21:11:56 -06:00

.gitignore

Initial mcarchive-org MCP server

2026-04-21 09:41:20 -06:00

pyproject.toml

Hardening: address Hamilton review ship-blockers

2026-04-21 15:34:30 -06:00

README.md

Initial mcarchive-org MCP server

2026-04-21 09:41:20 -06:00

uv.lock

Initial mcarchive-org MCP server

2026-04-21 09:41:20 -06:00

README.md

mcarchive-org

An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the Internet Archive.

Built on FastMCP + httpx. No API key required — archive.org's read endpoints are public.

Tools

| Tool | Purpose |
| --- | --- |
| `search_items` | Small Solr-style search via `advancedsearch.php` (1–200 rows, paginated) |
| `scrape_items` | Bulk cursor-paginated search via Scrape API (count ≥ 100) |
| `get_item_metadata` | Metadata for one item; skips the (possibly huge) files list by default |
| `list_files` | Files array with optional format / glob filtering; includes `download_url` per file |
| `get_file_url` | Build a canonical download URL without hitting the network |
| `download_file` | Stream a file to disk with resume support and optional MD5 verification |
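
To illustrate what building a download URL without hitting the network can look like, here is a minimal sketch. It relies on archive.org's public `https://archive.org/download/<identifier>/<filename>` URL layout; the function name `build_download_url` is hypothetical and the server's `get_file_url` may handle escaping differently.

```python
from urllib.parse import quote


def build_download_url(identifier: str, filename: str) -> str:
    """Compose a download URL from an item identifier and a file name.

    Pure string work: no request is made, so the file is not checked
    for existence.
    """
    # Keep "/" unescaped in the file name, since files may sit in subdirectories.
    return (
        "https://archive.org/download/"
        f"{quote(identifier, safe='')}/{quote(filename, safe='/')}"
    )
```

For example, `build_download_url('gd77-05-08.sbd.hicks.4982.sbeok.shnf', 'gd1977-05-08d1t01.mp3')` yields `https://archive.org/download/gd77-05-08.sbd.hicks.4982.sbeok.shnf/gd1977-05-08d1t01.mp3`.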

Also exposes an MCP resource template: archive://item/{identifier}.

Install & run

# From a checkout:
uv sync
uv run mcarchive-org

# Or from PyPI (once published):
uvx mcarchive-org

claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org

Environment

| Variable | Default | Purpose |
| --- | --- | --- |
| `MCARCHIVE_DOWNLOAD_ROOT` | `./downloads` | Base directory for `download_file` |

Example flow

search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
  → identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)

list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
  → [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]

download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
  → { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
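
The `verify_md5` step in the flow above amounts to hashing the file on disk and comparing the digest with the `md5` value reported by `list_files`. A minimal sketch, assuming a hypothetical helper `md5_matches` (the server's own check may be structured differently):

```python
import hashlib
from pathlib import Path


def md5_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in fixed-size chunks and compare against the expected digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files never need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    # Compare case-insensitively; hexdigest() is always lowercase.
    return digest.hexdigest() == expected_hex.strip().lower()
```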

Query syntax notes

archive.org uses a Solr/Lucene dialect:

- `mediatype:(audio OR movies)`: restrict to media types
- `collection:etree`: items in a specific collection
- `date:[1977-01-01 TO 1977-12-31]`: date ranges
- `creator:"Grateful Dead"`: phrase match
- `-subject:bootleg`: exclusion
- Sort by `downloads desc`, `date asc`, `addeddate desc`, etc.

See archive.org's search docs for the full grammar.
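
To show how such a query travels over the wire, the sketch below composes an `advancedsearch.php` URL using that endpoint's `q`, `fl[]`, `sort[]`, `rows`, `page`, and `output` parameters. The function name `build_search_url` and its defaults are assumptions; the 1–200 row bound mirrors the limit stated for `search_items` in the Tools table.

```python
from typing import Sequence
from urllib.parse import urlencode


def build_search_url(
    query: str,
    fields: Sequence[str] = ("identifier",),
    sort: Sequence[str] = (),
    rows: int = 50,
    page: int = 1,
) -> str:
    """Compose an advancedsearch.php URL; no request is sent."""
    if not 1 <= rows <= 200:
        raise ValueError("rows must be between 1 and 200")
    params = [("q", query)]
    # Repeated fl[] / sort[] keys carry multi-valued parameters.
    params += [("fl[]", name) for name in fields]
    params += [("sort[]", clause) for clause in sort]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)
```

`urlencode` takes care of escaping the colons, quotes, and spaces in the Lucene query, so the query string can be passed exactly as written in the notes above.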

License

MIT
