docs: cover markdown_to_pdf, [markdown] extra, uvx + pacman install

README:
- bump tool count 46 → 47, add Format Conversion bullet
- fix `claude mcp add` syntax (needs `--` separator before uvx)
- show `uvx --from "mcp-pdf[markdown]" mcp-pdf` for the new tool
- note about uvx caching + `--refresh`
- new "Format Conversion" tools subsection (markdown_to_pdf alongside pdf_to_markdown)
- new "Optional Extras" section explaining [forms], [tables], [markdown], [all]
- expand System Dependencies with Arch (pacman) and macOS (brew) recipes for
  pandoc + a PDF engine

QUICKSTART:
- replace stale `mcp-pdf-tools` package name with current `mcp-pdf`
- add uvx as the recommended end-user install path
- add pip install patterns including all optional extras
- add pacman block alongside apt-get and brew
- add markdown_to_pdf troubleshooting (mktexfmt errors, engine fallback)
- add a smoke-test snippet using the new tool
Ryan Malloy 2026-05-05 16:27:28 -06:00
parent b2d9073f04
commit 964fd14a26
2 changed files with 173 additions and 37 deletions

QUICKSTART.md

@@ -2,12 +2,38 @@
## 1. Installation
### Option A: Using UV (Recommended for Development)
### Option A: Run from PyPI with uvx (Recommended for end users)
No clone required — `uvx` fetches and runs in an isolated cached venv:
```bash
# Bare install
uvx mcp-pdf
# With markdown_to_pdf support (requires pandoc on host)
uvx --from "mcp-pdf[markdown]" mcp-pdf
# Force a refresh after a new release
uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf
```
### Option B: pip install from PyPI
```bash
pip install mcp-pdf
# Or with optional extras:
pip install "mcp-pdf[markdown]" # adds markdown_to_pdf
pip install "mcp-pdf[forms]" # adds form creation tools
pip install "mcp-pdf[tables]" # adds Camelot/Tabula table extraction
pip install "mcp-pdf[all]" # everything
```
### Option C: Local development with uv
```bash
# Clone the repository
git clone https://github.com/rpm/mcp-pdf-tools
cd mcp-pdf-tools
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
# Install with uv
uv sync
@@ -16,41 +42,63 @@ uv sync
uv run python examples/verify_installation.py
```
### Option B: Using Docker
### Option D: Using Docker
```bash
# Clone the repository
git clone https://github.com/rpm/mcp-pdf-tools
cd mcp-pdf-tools
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf
# Build and run with Docker
docker-compose build
docker-compose run --rm mcp-pdf-tools python examples/verify_installation.py
```
### Option C: From PyPI
```bash
pip install mcp-pdf-tools
docker compose build
docker compose run --rm mcp-pdf python examples/verify_installation.py
```
## 2. System Dependencies
`uvx` and `pip` only handle Python deps. Some tools call out to system binaries that you'll need to install separately:
| Binary | Required for |
|--------|-------------|
| `tesseract` | `ocr_pdf` |
| `ghostscript` | Camelot table extraction |
| `java` (JRE) | Tabula table extraction |
| `poppler` | PDF→image conversion |
| `pandoc` | `markdown_to_pdf` |
| `xelatex` / `pdflatex` / `tectonic` / `weasyprint` / `wkhtmltopdf` | `markdown_to_pdf` (need at least one) |
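A quick way to see which of the binaries in the table are missing before reaching for a package manager. This is a minimal sketch, not part of mcp-pdf: `missing_binaries` is a hypothetical helper, and `gs` and `pdftoppm` are assumed to be the executable names that Ghostscript and Poppler install on Linux and macOS (Windows builds use different names). The PDF engines from the last row are left out because only one of them is needed.

```python
import shutil
from typing import Callable, Optional

# Binary -> feature, taken from the table above.
# Assumption: `gs` is the Ghostscript executable and `pdftoppm` is the
# Poppler tool used for PDF->image conversion.
REQUIRED_BINARIES = {
    "tesseract": "ocr_pdf",
    "gs": "Camelot table extraction",
    "java": "Tabula table extraction",
    "pdftoppm": "PDF->image conversion",
    "pandoc": "markdown_to_pdf",
}


def missing_binaries(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict[str, str]:
    """Return {binary: feature} for every binary that `which` cannot find."""
    return {
        name: feature
        for name, feature in REQUIRED_BINARIES.items()
        if which(name) is None
    }


if __name__ == "__main__":
    for name, feature in missing_binaries().items():
        print(f"missing {name}: needed for {feature}")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` searches `PATH`.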
### Ubuntu/Debian
```bash
sudo apt-get update
sudo apt-get install -y \
tesseract-ocr \
tesseract-ocr-eng \
poppler-utils \
ghostscript \
python3-tk \
default-jre-headless
tesseract-ocr tesseract-ocr-eng \
poppler-utils ghostscript \
python3-tk default-jre-headless
# For markdown_to_pdf
sudo apt-get install -y pandoc texlive-xetex
```
### macOS
### Arch Linux
```bash
sudo pacman -S \
tesseract tesseract-data-eng \
poppler ghostscript \
jre-openjdk-headless tk
# For markdown_to_pdf
sudo pacman -S pandoc texlive-xetex
# Lighter alternative engines: tectonic (official repo),
# wkhtmltopdf (AUR), or `pip install weasyprint` (works in any venv)
```
### macOS (Homebrew)
```bash
brew install tesseract poppler ghostscript
# For markdown_to_pdf
brew install pandoc
brew install --cask mactex-no-gui # full TeX with xelatex/pdflatex
# Or lighter:
brew install weasyprint
```
### Windows
@@ -58,18 +106,33 @@ brew install tesseract poppler ghostscript
- Install Poppler: http://blog.alivate.com.au/poppler-windows/
- Install Ghostscript: https://www.ghostscript.com/download/gsdnld.html
- Install Java: https://www.java.com/download/
- Install Pandoc (for `markdown_to_pdf`): https://pandoc.org/installing.html
- Install MiKTeX or wkhtmltopdf for the PDF engine
## 3. Claude Desktop Configuration
## 3. Adding to Claude Code / Claude Desktop
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
### Easiest — `claude mcp add` with uvx
```bash
# Bare
claude mcp add pdf-tools -- uvx mcp-pdf
# With markdown_to_pdf support
claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf
```
The `--` separator is required so the Claude CLI doesn't try to parse `--from` as one of its own flags.
### Manual config (Claude Desktop)
Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `~/.config/Claude/claude_desktop_config.json` (Linux):
```json
{
"mcpServers": {
"pdf-tools": {
"command": "uv",
"args": ["run", "mcp-pdf-tools"],
"cwd": "/home/rpm/claude/mcp-pdf-tools"
"command": "uvx",
"args": ["--from", "mcp-pdf[markdown]", "mcp-pdf"]
}
}
}
@@ -86,14 +149,20 @@ uv run python examples/test_pdf_tools.py /path/to/your/document.pdf
### OCR not working
- Check Tesseract is installed: `tesseract --version`
- Install language packs: `sudo apt-get install tesseract-ocr-[lang]`
- Install language packs: `sudo apt-get install tesseract-ocr-[lang]` (Debian) or `sudo pacman -S tesseract-data-[lang]` (Arch)
### Table extraction failing
- Check Java is installed: `java -version`
- For Camelot issues, ensure Ghostscript is installed
- For Camelot issues, ensure Ghostscript is installed: `gs --version`
### `markdown_to_pdf` errors
- "pandoc binary not found" → install pandoc (see System Dependencies)
- "No PDF engine found" → install at least one of `xelatex`, `pdflatex`, `tectonic`, `weasyprint`, `wkhtmltopdf`
- "Pandoc died with exitcode 43" + `mktexfmt` errors → your TeX install is missing format files; rebuild with `sudo fmtutil-sys --all` or use a different engine via `pdf_engine="weasyprint"`
- The tool reports `detected_engines` in its response — check that field to see what's actually available
### Large PDF issues
- Process specific pages: `pages=[0, 1, 2]`
- Process specific pages: `pages="1-10"` or `pages="1,3,5"`
- Increase memory: `export JAVA_OPTS="-Xmx2g"`
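The `pages="1-10"` and `pages="1,3,5"` forms can be read as a comma-separated list of single pages and inclusive ranges. The sketch below shows one way such a string could be expanded; it is an illustration only, `parse_pages` is a hypothetical name, and treating page numbers as 1-based is an assumption, since the actual parsing inside mcp-pdf is not shown here.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-3,7" into a sorted list of page numbers.

    Assumption: pages are 1-based and ranges are inclusive.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

For a very large PDF, passing a narrow range like this keeps each call working on a handful of pages instead of the whole document.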
## 6. Example Usage in Claude
@@ -105,6 +174,21 @@ Once configured, you can ask Claude:
- "Extract all tables from /path/to/report.pdf and format as markdown"
- "Convert /path/to/document.pdf to markdown format"
- "Extract images from the first 5 pages of /path/to/presentation.pdf"
- "Build a PDF from /path/to/notes.md with a table of contents"
## 7. Verify the Built-in Test
Convert this quickstart guide itself to PDF as a smoke test once everything is wired up:
```python
markdown_to_pdf(
markdown_path="QUICKSTART.md",
output_path="/tmp/quickstart.pdf",
toc=True,
)
```
The response includes `detected_engines` so you can see exactly what's installed on your host.
## Need Help?

README.md

@@ -6,7 +6,7 @@
**A FastMCP server for PDF processing**
*46 tools for text extraction, OCR, tables, forms, annotations, and more*
*47 tools for text extraction, OCR, tables, forms, annotations, markdown↔PDF, and more*
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
[![FastMCP](https://img.shields.io/badge/FastMCP-2.0+-green.svg?style=flat-square)](https://github.com/jlowin/fastmcp)
@@ -31,19 +31,25 @@ MCP PDF extracts content from PDFs using multiple libraries with automatic fallb
- **Document assembly** - merge, split, reorder pages
- **Annotations** - sticky notes, highlights, stamps
- **Vector graphics** - extract to SVG for schematics and technical drawings
- **Format conversion** - PDF ↔ Markdown (PDF→MD via PyMuPDF, MD→PDF via pandoc)
---
## Quick Start
```bash
# Install from PyPI
# Run from PyPI (one-shot, no permanent install)
uvx mcp-pdf
# Or add to Claude Code
claude mcp add pdf-tools uvx mcp-pdf
# Add to Claude Code — note the `--` separator before uvx
claude mcp add pdf-tools -- uvx mcp-pdf
# Include the markdown_to_pdf tool (requires pandoc on host)
claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf
```
> `uvx` caches tool installs aggressively. After upgrading to a new release, force a refresh with `uvx --refresh mcp-pdf` (or `uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf` if you're using extras).
<details>
<summary><b>Development Installation</b></summary>
@@ -55,6 +61,9 @@ uv sync
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
# For markdown_to_pdf:
sudo apt-get install pandoc texlive-xetex # or: weasyprint, wkhtmltopdf
# Verify
uv run python examples/verify_installation.py
```
@@ -73,10 +82,18 @@ uv run python examples/verify_installation.py
| `extract_tables` | Extract tables to JSON, CSV, or Markdown |
| `extract_images` | Extract embedded images |
| `extract_links` | Get all hyperlinks with page filtering |
| `pdf_to_markdown` | Convert PDF to markdown preserving structure |
| `ocr_pdf` | OCR scanned documents using Tesseract |
| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |
### Format Conversion
| Tool | What it does |
|------|-------------|
| `pdf_to_markdown` | Convert PDF to markdown preserving structure; extracts images and SVG vectors to disk |
| `markdown_to_pdf` | Convert `.md` files (or inline text) to PDF via pandoc with auto-detected engine |
**`markdown_to_pdf` requires:** `pip install "mcp-pdf[markdown]"` plus the pandoc binary and at least one PDF engine (`xelatex`, `pdflatex`, `tectonic`, `weasyprint`, or `wkhtmltopdf`) on your `PATH`. The tool auto-detects what's available and uses the highest-quality one. Pass `pdf_engine=` to override the choice, or `extra_args=` to pass raw pandoc options.
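To make the auto-detection and override behaviour concrete, here is a rough sketch of how an engine could be chosen and turned into a pandoc command line. It is not the tool's actual implementation: `pick_engine` and `pandoc_argv` are hypothetical helpers, and the preference order is assumed to be the order in which the engines are listed above. The pandoc options used (`-o`, `--pdf-engine`, `--toc`) are standard pandoc flags.

```python
import shutil
from typing import Callable, Optional, Sequence

# Assumption: "highest-quality" follows the order the engines are listed in.
ENGINE_PREFERENCE = ("xelatex", "pdflatex", "tectonic", "weasyprint", "wkhtmltopdf")


def pick_engine(
    pdf_engine: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the explicit override, else the first preferred engine on PATH."""
    if pdf_engine is not None:
        return pdf_engine
    for engine in ENGINE_PREFERENCE:
        if which(engine) is not None:
            return engine
    raise RuntimeError("No PDF engine found on PATH")


def pandoc_argv(
    source: str,
    output: str,
    engine: str,
    toc: bool = False,
    extra_args: Sequence[str] = (),
) -> list[str]:
    """Build the pandoc command line for a Markdown -> PDF conversion."""
    argv = ["pandoc", source, "-o", output, f"--pdf-engine={engine}"]
    if toc:
        argv.append("--toc")
    argv.extend(extra_args)  # raw pandoc options are passed through untouched
    return argv
```

With only `weasyprint` installed, `pick_engine()` would fall through the LaTeX engines and return `"weasyprint"`; passing `pdf_engine="tectonic"` skips detection entirely.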
### Document Analysis
| Tool | What it does |
@@ -193,12 +210,47 @@ Some features require system packages:
| Camelot tables | `ghostscript` |
| Tabula tables | `default-jre-headless` |
| PDF to images | `poppler-utils` |
| `markdown_to_pdf` | `pandoc` + one of: `texlive-xetex`, `texlive-latex-base`, `tectonic`, `weasyprint`, `wkhtmltopdf` |
Ubuntu/Debian:
**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
# For markdown_to_pdf — pandoc plus at least one PDF engine
sudo apt-get install pandoc texlive-xetex
```
**Arch Linux:**
```bash
sudo pacman -S tesseract tesseract-data-eng poppler ghostscript jre-openjdk-headless
# For markdown_to_pdf — pandoc plus at least one PDF engine
sudo pacman -S pandoc texlive-xetex
# Lighter alternatives (pick one): tectonic, wkhtmltopdf (AUR), or pip install weasyprint
```
**macOS (Homebrew):**
```bash
brew install tesseract poppler ghostscript
# For markdown_to_pdf
brew install pandoc
brew install --cask mactex-no-gui # for xelatex/pdflatex
# Or a lighter engine:
brew install weasyprint
```
## Optional Extras
The base install stays lean. Heavy or niche dependencies are gated behind extras:
| Extra | Adds | When to install |
|-------|------|----------------|
| `mcp-pdf[forms]` | `reportlab` | Form creation tools (`create_form_pdf`, permit forms) |
| `mcp-pdf[tables]` | `camelot-py`, `tabula-py` | Higher-accuracy table extraction (also needs Java + Ghostscript) |
| `mcp-pdf[markdown]` | `pypandoc` | `markdown_to_pdf` tool (also needs pandoc binary) |
| `mcp-pdf[all]` | All of the above | Want everything |
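If it is unclear which extras ended up in an environment, the Python packages from the table can be probed without importing them. This is a small sketch under assumptions: `installed_extras` is a hypothetical helper, and the import names (`reportlab`, `camelot`, `tabula`, `pypandoc`) are the usual module names for the listed distributions rather than something stated in this README.

```python
import importlib.util
from typing import Callable

# Extra -> importable module names for the packages in the table above.
# Assumption: camelot-py imports as `camelot`, tabula-py as `tabula`.
EXTRA_MODULES = {
    "forms": ("reportlab",),
    "tables": ("camelot", "tabula"),
    "markdown": ("pypandoc",),
}


def installed_extras(
    find_spec: Callable[[str], object] = importlib.util.find_spec,
) -> dict[str, bool]:
    """Map each extra to True when all of its modules can be located."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, present in installed_extras().items():
        print(f"mcp-pdf[{extra}]: {'installed' if present else 'not installed'}")
```

This only checks the Python side; the system binaries each extra relies on (Java, Ghostscript, pandoc) still need to be present separately.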
---
## Configuration