Ryan Malloy 964fd14a26 docs: cover markdown_to_pdf, [markdown] extra, uvx + pacman install

README:
- bump tool count 46 → 47, add Format Conversion bullet
- fix `claude mcp add` syntax (needs `--` separator before uvx)
- show `uvx --from "mcp-pdf[markdown]" mcp-pdf` for the new tool
- note about uvx caching + `--refresh`
- new "Format Conversion" tools subsection (markdown_to_pdf alongside pdf_to_markdown)
- new "Optional Extras" section explaining [forms], [tables], [markdown], [all]
- expand System Dependencies with Arch (pacman) and macOS (brew) recipes for
  pandoc + a PDF engine

QUICKSTART:
- replace stale `mcp-pdf-tools` package name with current `mcp-pdf`
- add uvx as the recommended end-user install path
- add pip install patterns including all optional extras
- add pacman block alongside apt-get and brew
- add markdown_to_pdf troubleshooting (mktexfmt errors, engine fallback)
- add a smoke-test snippet using the new tool

2026-05-05 16:27:28 -06:00


📄 MCP PDF

A FastMCP server for PDF processing

47 tools for text extraction, OCR, tables, forms, annotations, markdown↔PDF, and more

Works great with MCP Office Tools

What It Does

MCP PDF extracts content from PDFs using multiple libraries with automatic fallbacks. If one method fails, it tries another.

Core capabilities:

- Text extraction via PyMuPDF, pdfplumber, or pypdf (auto-fallback)
- Table extraction via Camelot, pdfplumber, or Tabula (auto-fallback)
- OCR for scanned documents via Tesseract
- Form handling: extract, fill, and create PDF forms
- Document assembly: merge, split, reorder pages
- Annotations: sticky notes, highlights, stamps
- Vector graphics: extract to SVG for schematics and technical drawings
- Format conversion: PDF ↔ Markdown (PDF→MD via PyMuPDF, MD→PDF via pandoc)

Quick Start

# Run from PyPI (one-shot, no permanent install)
uvx mcp-pdf

# Add to Claude Code — note the `--` separator before uvx
claude mcp add pdf-tools -- uvx mcp-pdf

# Include the markdown_to_pdf tool (requires pandoc on host)
claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf

uvx caches tool installs aggressively. After upgrading to a new release, force a refresh with `uvx --refresh mcp-pdf` (or `uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf` if you're using extras).

Development Installation

git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
uv sync

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript

# For markdown_to_pdf:
sudo apt-get install pandoc texlive-xetex   # or: weasyprint, wkhtmltopdf

# Verify
uv run python examples/verify_installation.py

Tools

Content Extraction

Tool	What it does
`extract_text`	Pull text from PDF pages with automatic chunking for large files
`extract_tables`	Extract tables to JSON, CSV, or Markdown
`extract_images`	Extract embedded images
`extract_links`	Get all hyperlinks with page filtering
`ocr_pdf`	OCR scanned documents using Tesseract
`extract_vector_graphics`	Export vector graphics to SVG (schematics, charts, drawings)

Format Conversion

Tool	What it does
`pdf_to_markdown`	Convert PDF to markdown preserving structure; extracts images and SVG vectors to disk
`markdown_to_pdf`	Convert `.md` files (or inline text) to PDF via pandoc with auto-detected engine

`markdown_to_pdf` requires `pip install "mcp-pdf[markdown]"` plus the `pandoc` binary and at least one PDF engine (`xelatex`, `pdflatex`, `tectonic`, `weasyprint`, or `wkhtmltopdf`) on `PATH`. The tool auto-detects which engines are available and uses the highest-quality one. Pass `pdf_engine=` to override the choice, or `extra_args=` to pass raw pandoc options.
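The engine auto-detection can be pictured as a simple `PATH` lookup. The following is a minimal sketch, not the server's actual code: `pick_pdf_engine` is a hypothetical helper, and the preference order is an assumption that simply follows the order in which the engines are listed above.

```python
import shutil
from typing import Callable, Optional, Sequence

# Assumed preference order (mirrors the README's list; the real server may rank differently).
ENGINE_PREFERENCE = ("xelatex", "pdflatex", "tectonic", "weasyprint", "wkhtmltopdf")


def pick_pdf_engine(
    preference: Sequence[str] = ENGINE_PREFERENCE,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first engine in `preference` found on PATH, or None."""
    for engine in preference:
        if which(engine) is not None:
            return engine
    return None
```

Calling `pick_pdf_engine()` with no arguments checks the real `PATH`. The `which` parameter exists only so the lookup can be swapped out in tests.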

Document Analysis

Tool	What it does
`extract_metadata`	Get title, author, creation date, page count, etc.
`get_document_structure`	Extract table of contents and bookmarks
`analyze_layout`	Detect columns, headers, footers
`is_scanned_pdf`	Check if PDF needs OCR
`compare_pdfs`	Diff two PDFs by text, structure, or metadata
`analyze_pdf_health`	Check for corruption, optimization opportunities
`analyze_pdf_security`	Report encryption, permissions, signatures

Forms

Tool	What it does
`extract_form_data`	Get form field names and values
`fill_form_pdf`	Fill form fields from JSON
`create_form_pdf`	Create new forms with text fields, checkboxes, dropdowns
`add_form_fields`	Add fields to existing PDFs

Permit Forms (Coordinate-Based)

For scanned PDFs or forms without interactive fields, these tools draw text at (x, y) coordinates.

Tool	What it does
`fill_permit_form`	Fill any PDF by drawing at coordinates (works with scanned forms)
`get_field_schema`	Get field definitions for validation or UI generation
`validate_permit_form_data`	Check data against field schema before filling
`preview_field_positions`	Generate PDF showing field boundaries (debugging)
`insert_attachment_pages`	Insert image/text pages with "See page X" references

Requires: `pip install "mcp-pdf[forms]"` (adds the `reportlab` dependency)

Document Assembly

Tool	What it does
`merge_pdfs`	Combine multiple PDFs with bookmark preservation
`split_pdf_by_pages`	Split by page ranges
`split_pdf_by_bookmarks`	Split at chapter/section boundaries
`reorder_pdf_pages`	Rearrange pages in custom order

Annotations

Tool	What it does
`add_sticky_notes`	Add comment annotations
`add_highlights`	Highlight text regions
`add_stamps`	Add Approved/Draft/Confidential stamps
`extract_all_annotations`	Export annotations to JSON

How Fallbacks Work

The server tries multiple libraries for each operation:

Text extraction:

1. PyMuPDF (fastest)
2. pdfplumber (better for complex layouts)
3. pypdf (most compatible)

Table extraction:

1. Camelot (best accuracy, requires Ghostscript)
2. pdfplumber (no extra system dependencies)
3. Tabula (requires Java)

If a PDF fails with one library, the next is tried automatically.
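The fallback behaviour can be sketched as a loop over candidate extractors. This is an illustration under assumptions, not the server's implementation: `extract_with_fallback` and the `Extractor` type are hypothetical names, and the real server may handle errors differently.

```python
from typing import Callable, Iterable, List, Tuple

# An extractor takes a file path and returns the extracted text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each (name, extractor) in order; return (name, text) from the first success."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any library failure moves on to the next candidate
            errors.append(f"{name}: {exc}")
            continue
        return name, text
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Returning the name of the library that succeeded alongside the text makes it possible to tell the caller which method produced the result. Collecting the individual errors keeps the final failure message useful when every candidate fails.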

Token Management

Large PDFs can overflow MCP response limits. The server handles this:

- Automatic chunking splits large documents into page groups
- Table row limits prevent huge tables from blowing up responses
- Summary mode returns structure without full content

# Get first 10 pages
result = await extract_text("huge.pdf", pages="1-10")

# Limit table rows
tables = await extract_tables("data.pdf", max_rows_per_table=50)

# Structure only
tables = await extract_tables("data.pdf", summary_only=True)
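To walk a large document in page groups from the client side, the page ranges can be generated up front. This is a hedged sketch: `page_ranges` is a hypothetical helper, the chunk size of 10 is an arbitrary choice, and the `"start-end"` format is taken from the `pages="1-10"` call above.

```python
from typing import List


def page_ranges(total_pages: int, pages_per_chunk: int = 10) -> List[str]:
    """Split pages 1..total_pages into consecutive range strings such as "1-10"."""
    if total_pages < 0 or pages_per_chunk < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk >= 1")
    ranges = []
    for start in range(1, total_pages + 1, pages_per_chunk):
        # The last chunk may be shorter than pages_per_chunk.
        end = min(start + pages_per_chunk - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges
```

Each string can then be passed as the `pages` argument of `extract_text`, one call per chunk. The total page count would come from a metadata call such as `extract_metadata`.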

URL Processing

PDFs can be fetched directly from HTTPS URLs:

result = await extract_text("https://example.com/report.pdf")

Files are cached locally for subsequent operations.
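One common way to implement this kind of cache is to derive a stable file name from the URL. The sketch below shows that idea only. It is an assumption, not a description of how the server stores files, and `cache_path_for` is a hypothetical helper.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map an HTTPS URL to a stable local file path inside cache_dir."""
    if urlparse(url).scheme != "https":
        raise ValueError("only HTTPS URLs are accepted")
    # Hashing the URL gives a fixed-length, filesystem-safe name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.pdf"
```

A download routine would check whether the returned path already exists before fetching, so repeated operations on the same URL reuse the local copy.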

System Dependencies

Some features require system packages:

Feature	Dependency
OCR	`tesseract-ocr`
Camelot tables	`ghostscript`
Tabula tables	`default-jre-headless`
PDF to images	`poppler-utils`
`markdown_to_pdf`	`pandoc` + one of: `texlive-xetex`, `texlive-latex-base`, `tectonic`, `weasyprint`, `wkhtmltopdf`

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless

# For markdown_to_pdf — pandoc plus at least one PDF engine
sudo apt-get install pandoc texlive-xetex

Arch Linux:

sudo pacman -S tesseract tesseract-data-eng poppler ghostscript jre-openjdk-headless

# For markdown_to_pdf — pandoc plus at least one PDF engine
sudo pacman -S pandoc texlive-xetex
# Lighter alternatives (pick one): tectonic, wkhtmltopdf (AUR), or pip install weasyprint

macOS (Homebrew):

brew install tesseract poppler ghostscript

# For markdown_to_pdf
brew install pandoc
brew install --cask mactex-no-gui   # for xelatex/pdflatex
# Or a lighter engine:
brew install weasyprint

Optional Extras

The base install stays lean. Heavy or niche dependencies are gated behind extras:

Extra	Adds	When to install
`mcp-pdf[forms]`	`reportlab`	Form creation tools (`create_form_pdf`, permit forms)
`mcp-pdf[tables]`	`camelot-py`, `tabula-py`	Higher-accuracy table extraction (also needs Java + Ghostscript)
`mcp-pdf[markdown]`	`pypandoc`	`markdown_to_pdf` tool (also needs pandoc binary)
`mcp-pdf[all]`	All of the above	When you want every optional tool
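To see which extras are present in the current environment, you can probe for the packages' import names. This is a small sketch rather than part of the project: `installed_extras` is a hypothetical helper, and the import names used (`reportlab`, `camelot`, `tabula`, `pypandoc`) are the usual ones for the packages in the table.

```python
from importlib.util import find_spec
from typing import Dict, Tuple

# Extra name -> import names of the packages that extra adds.
EXTRA_MODULES: Dict[str, Tuple[str, ...]] = {
    "forms": ("reportlab",),
    "tables": ("camelot", "tabula"),
    "markdown": ("pypandoc",),
}


def installed_extras() -> Dict[str, bool]:
    """Report, per extra, whether all of its modules can be found."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in EXTRA_MODULES.items()
    }
```

This only checks the Python packages. System requirements such as the pandoc binary, Java, and Ghostscript still have to be verified separately.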

Configuration

Optional environment variables:

Variable	Purpose
`MCP_PDF_ALLOWED_PATHS`	Colon-separated directories for file output
`PDF_TEMP_DIR`	Temp directory for processing (default: `/tmp/mcp-pdf-processing`)
`TESSDATA_PREFIX`	Tesseract language data location
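A colon-separated path list like `MCP_PDF_ALLOWED_PATHS` is typically parsed and checked along these lines. This is a sketch under assumptions: `allowed_dirs` and `is_allowed` are hypothetical helpers, and how the server behaves when the variable is unset is not covered here.

```python
import os
from pathlib import Path
from typing import List


def allowed_dirs(env_value: str) -> List[Path]:
    """Parse a colon-separated directory list, skipping empty entries."""
    return [Path(p).resolve() for p in env_value.split(":") if p]


def is_allowed(target: str, dirs: List[Path]) -> bool:
    """True if target resolves to one of dirs or to a location inside one."""
    resolved = Path(target).resolve()
    return any(resolved == d or d in resolved.parents for d in dirs)


# Read the variable from the environment; an unset variable yields an empty list.
configured = allowed_dirs(os.environ.get("MCP_PDF_ALLOWED_PATHS", ""))
```

Resolving both sides before comparing means a path such as `/srv/out/../secret.pdf` is judged by where it actually points, not by its prefix.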

Development

# Run tests
uv run pytest

# With coverage
uv run pytest --cov=mcp_pdf

# Format
uv run black src/ tests/

# Lint
uv run ruff check src/ tests/

License

MIT
