mcarchive-org/README.md
Ryan Malloy 5265a6440b Initial mcarchive-org MCP server
FastMCP server wrapping archive.org's public read APIs:
- search_items / scrape_items: advanced search + bulk cursor pagination
- get_item_metadata / list_files: progressive disclosure with filtering
- get_file_url / download_file: canonical URLs and streaming downloads
  with HTTP Range resume + optional MD5 verification

Smoke-tested end-to-end via claude -p headless MCP and pytest against
live archive.org endpoints.
2026-04-21 09:41:20 -06:00

mcarchive-org

An MCP (Model Context Protocol) server that lets an LLM search, inspect, and download content from the Internet Archive.

Built on FastMCP + httpx. No API key required — archive.org's read endpoints are public.

Tools

| Tool | Purpose |
| --- | --- |
| search_items | Small Solr-style search via advancedsearch.php (1200 rows, paginated) |
| scrape_items | Bulk cursor-paginated search via Scrape API (count ≥ 100) |
| get_item_metadata | Metadata for one item; skips the (possibly huge) files list by default |
| list_files | Files array with optional format / glob filtering; includes download_url per file |
| get_file_url | Build a canonical download URL without hitting the network |
| download_file | Stream a file to disk with resume support and optional MD5 verification |

Also exposes an MCP resource template: archive://item/{identifier}.
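
The practical difference between search_items and scrape_items is page-number versus cursor pagination. The following is a minimal sketch of a cursor loop, not this server's actual code. It assumes each Scrape API response is a JSON object with an `items` list and a `cursor` field that is absent on the last page. `fetch_page`, `iter_scrape_items`, and the fake pages are hypothetical names used only for illustration; `fetch_page` stands in for the HTTP request so the loop can be read and run without network access.

```python
from typing import Callable, Iterator, Optional

# Hypothetical callable: takes the cursor (None for the first page) and
# returns the decoded JSON body of one Scrape API response.
FetchPage = Callable[[Optional[str]], dict]


def iter_scrape_items(fetch_page: FetchPage, max_items: Optional[int] = None) -> Iterator[dict]:
    """Yield items page by page until no cursor is returned or max_items is hit."""
    cursor: Optional[str] = None
    yielded = 0
    while max_items is None or yielded < max_items:
        page = fetch_page(cursor)
        for item in page.get("items", []):
            yield item
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return
        cursor = page.get("cursor")
        if not cursor:  # assumed convention: no cursor means last page
            return


# Stand-in for the network: two fake pages keyed by cursor.
_FAKE_PAGES = {
    None: {"items": [{"identifier": "a"}, {"identifier": "b"}], "cursor": "c1"},
    "c1": {"items": [{"identifier": "c"}]},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return _FAKE_PAGES[cursor]


identifiers = [item["identifier"] for item in iter_scrape_items(fake_fetch)]
```

Because the loop is a generator, a caller can stop early without fetching pages it never reads, which matters when a query matches a very large number of items.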

Install & run

# From a checkout:
uv sync
uv run mcarchive-org

# Or from PyPI (once published):
uvx mcarchive-org

Register with Claude Code:

claude mcp add archive-org -- uvx mcarchive-org
# or, from a local checkout:
claude mcp add archive-org -- uv run --directory /path/to/mcarchive-org mcarchive-org

Environment

| Variable | Default | Purpose |
| --- | --- | --- |
| MCARCHIVE_DOWNLOAD_ROOT | ./downloads | Base directory for download_file |
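
As an illustration of how a download root like this is typically applied, here is a short sketch that resolves a target path under MCARCHIVE_DOWNLOAD_ROOT. The `<root>/<identifier>/<filename>` layout is inferred from the example output in this README. The function name `resolve_download_path` and the path-escape check are assumptions for the sketch, not a description of the server's internals.

```python
import os
from pathlib import Path


def resolve_download_path(identifier: str, filename: str) -> Path:
    """Return <root>/<identifier>/<filename>, refusing paths that escape the root."""
    # Fall back to the documented default when the variable is unset.
    root = Path(os.environ.get("MCARCHIVE_DOWNLOAD_ROOT", "./downloads")).resolve()
    target = (root / identifier / filename).resolve()
    # Assumed safety check (not confirmed by the source): reject "../" tricks.
    if not target.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{identifier}/{filename} escapes the download root")
    return target
```

Resolving both paths before comparing them means a filename containing `..` segments is caught even when the intermediate directories do not exist yet.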

Example flow

search_items(query='mediatype:audio AND creator:"Grateful Dead"', sort=['downloads desc'])
  → identifier 'gd77-05-08.sbd.hicks.4982.sbeok.shnf' (among others)

list_files(identifier='gd77-05-08.sbd.hicks.4982.sbeok.shnf', formats=['VBR MP3'])
  → [{ name: 'gd1977-05-08d1t01.mp3', size: 6342912, md5: '…', download_url: '…' }, …]

download_file(identifier='gd77-…', filename='gd1977-05-08d1t01.mp3', verify_md5='…')
  → { path: './downloads/gd77-…/gd1977-…mp3', bytes: 6342912, md5_ok: True }
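
The last two steps of the flow rest on three small pieces: the download URL, an HTTP Range header for resuming, and an MD5 check. The sketch below shows one way to write them using only the standard library. The helper names (`download_url`, `resume_headers`, `md5_matches`) are hypothetical and chosen for this example. The URL pattern `https://archive.org/download/<identifier>/<filename>` is archive.org's public download path; everything else is an assumption about how such a tool could work rather than a copy of download_file.

```python
import hashlib
from pathlib import Path
from urllib.parse import quote


def download_url(identifier: str, filename: str) -> str:
    """Build the download URL; '/' is kept so files in subdirectories still work."""
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"


def resume_headers(partial: Path) -> dict:
    """Ask the server only for the bytes that are not on disk yet."""
    offset = partial.stat().st_size if partial.exists() else 0
    return {"Range": f"bytes={offset}-"} if offset else {}


def md5_matches(path: Path, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks so a large download never sits fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

When sending the Range header, a client should append to the partial file only if the server answers 206 Partial Content. A plain 200 means the server ignored the range and is sending the whole file, so the partial file has to be overwritten instead.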

Query syntax notes

archive.org uses a Solr/Lucene dialect:

  • mediatype:(audio OR movies) — restrict to media types
  • collection:etree — items in a specific collection
  • date:[1977-01-01 TO 1977-12-31] — date ranges
  • creator:"Grateful Dead" — phrase match
  • -subject:bootleg — exclusion
  • Sort by downloads desc, date asc, addeddate desc, etc.

See archive.org's search docs for the full grammar.
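
To show how such a query reaches the server, here is a rough sketch that turns a query string and sort order into an advancedsearch.php URL. It uses the endpoint's `q`, `fl[]`, `sort[]`, `rows`, `page`, and `output` parameters. The function name `advancedsearch_url` and its defaults are hypothetical, and the sketch says nothing about how search_items itself builds requests.

```python
from typing import List
from urllib.parse import urlencode


def advancedsearch_url(
    query: str,
    fields: List[str],
    sort: List[str],
    rows: int = 50,
    page: int = 1,
) -> str:
    """Encode a Lucene-style query plus paging options as a GET URL."""
    # A list of pairs (not a dict) so repeated keys like fl[] survive encoding.
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]
    params += [("sort[]", order) for order in sort]
    params += [("rows", str(rows)), ("page", str(page)), ("output", "json")]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = advancedsearch_url(
    query='mediatype:audio AND creator:"Grateful Dead"',
    fields=["identifier", "title"],
    sort=["downloads desc"],
)
```

Leaving the quoting to `urlencode` keeps the phrase quotes, colons, and brackets in the query intact, so the query can be written exactly as in the bullets above.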

License

MIT