Compare commits
No commits in common. "31b8b2e6d40325603f5ac142b0b40ab88e596a6f" and "b2d9073f04979f1fa45a6ed5eded46da4d7ab1ec" have entirely different histories.
31b8b2e6d4
...
b2d9073f04
149
QUICKSTART.md
@@ -2,38 +2,12 @@
|
|||||||
|
|
||||||
## 1. Installation
|
## 1. Installation
|
||||||
|
|
||||||
### Option A: Run from PyPI with uvx (Recommended for end users)
|
### Option A: Using UV (Recommended for Development)
|
||||||
|
|
||||||
No clone required — `uvx` fetches and runs in an isolated cached venv:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Bare install
|
|
||||||
uvx mcp-pdf
|
|
||||||
|
|
||||||
# With markdown_to_pdf support (requires pandoc on host)
|
|
||||||
uvx --from "mcp-pdf[markdown]" mcp-pdf
|
|
||||||
|
|
||||||
# Force a refresh after a new release
|
|
||||||
uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf
|
|
||||||
```
|
|
||||||
|
|
||||||
### Option B: pip install from PyPI
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install mcp-pdf
|
|
||||||
# Or with optional extras:
|
|
||||||
pip install "mcp-pdf[markdown]" # adds markdown_to_pdf
|
|
||||||
pip install "mcp-pdf[forms]" # adds form creation tools
|
|
||||||
pip install "mcp-pdf[tables]" # adds Camelot/Tabula table extraction
|
|
||||||
pip install "mcp-pdf[all]" # everything
|
|
||||||
```
|
|
||||||
|
|
||||||
### Option C: Local development with uv
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Clone the repository
|
# Clone the repository
|
||||||
git clone https://github.com/rsp2k/mcp-pdf
|
git clone https://github.com/rpm/mcp-pdf-tools
|
||||||
cd mcp-pdf
|
cd mcp-pdf-tools
|
||||||
|
|
||||||
# Install with uv
|
# Install with uv
|
||||||
uv sync
|
uv sync
|
||||||
@@ -42,66 +16,41 @@ uv sync
|
|||||||
uv run python examples/verify_installation.py
|
uv run python examples/verify_installation.py
|
||||||
```
|
```
|
||||||
|
|
||||||
### Option D: Using Docker
|
### Option B: Using Docker
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git clone https://github.com/rsp2k/mcp-pdf
|
# Clone the repository
|
||||||
cd mcp-pdf
|
git clone https://github.com/rpm/mcp-pdf-tools
|
||||||
|
cd mcp-pdf-tools
|
||||||
|
|
||||||
docker compose build
|
# Build and run with Docker
|
||||||
docker compose run --rm mcp-pdf python examples/verify_installation.py
|
docker-compose build
|
||||||
|
docker-compose run --rm mcp-pdf-tools python examples/verify_installation.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Option C: From PyPI
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install mcp-pdf-tools
|
||||||
```
|
```
|
||||||
|
|
||||||
## 2. System Dependencies
|
## 2. System Dependencies
|
||||||
|
|
||||||
`uvx` and `pip` only handle Python deps. Some tools call out to system binaries that you'll need to install separately:
|
|
||||||
|
|
||||||
| Binary | Required for |
|
|
||||||
|--------|-------------|
|
|
||||||
| `tesseract` | `ocr_pdf` |
|
|
||||||
| `ghostscript` | Camelot table extraction |
|
|
||||||
| `java` (JRE) | Tabula table extraction |
|
|
||||||
| `poppler` | PDF→image conversion |
|
|
||||||
| `pandoc` | `markdown_to_pdf` |
|
|
||||||
| `xelatex` / `pdflatex` / `tectonic` / `weasyprint` / `wkhtmltopdf` | `markdown_to_pdf` (need at least one) |
|
|
||||||
|
|
||||||
> **Note on the LaTeX engine:** `texlive-xetex` alone is often not enough for real markdown docs — pandoc's default template needs LaTeX packages (`lastpage`, `xcolor`, `framed`, `fancyhdr`, etc.) that live in `texlive-latex-extra` (Debian) / `texlive-latexextra` (Arch). If you don't already use TeX, **`tectonic` is a much better choice** — it's a ~30 MB static binary that downloads packages on demand. See the README's "Picking a PDF engine" table for details.
|
|
||||||
|
|
||||||
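The binary table above can be turned into a quick host check. The following is a minimal sketch, not part of the project: the executable names `gs` (Ghostscript) and `pdftoppm` (Poppler) are assumptions about what each package installs on PATH, and `check_system_dependencies` is a hypothetical helper name.

```python
import shutil

# Feature -> executable expected on PATH.
# `gs` and `pdftoppm` are assumed executable names for Ghostscript and Poppler.
SINGLE_BINARIES = {
    "ocr_pdf": "tesseract",
    "Camelot table extraction": "gs",
    "Tabula table extraction": "java",
    "PDF to image conversion": "pdftoppm",
    "markdown_to_pdf (pandoc)": "pandoc",
}

# markdown_to_pdf needs at least one of these engines.
PDF_ENGINES = ["xelatex", "pdflatex", "tectonic", "weasyprint", "wkhtmltopdf"]


def check_system_dependencies(which=shutil.which):
    """Return (missing, engines_found) for the current host.

    `which` is injectable so the check can be exercised without
    touching the real PATH.
    """
    missing = {
        feature: binary
        for feature, binary in SINGLE_BINARIES.items()
        if which(binary) is None
    }
    engines_found = [engine for engine in PDF_ENGINES if which(engine) is not None]
    if not engines_found:
        missing["markdown_to_pdf (PDF engine)"] = " | ".join(PDF_ENGINES)
    return missing, engines_found


if __name__ == "__main__":
    missing, engines = check_system_dependencies()
    for feature, binary in missing.items():
        print(f"missing for {feature}: {binary}")
    print("PDF engines found:", ", ".join(engines) or "none")
```

Run it once after installing system packages; anything it reports as missing only matters if you plan to use the corresponding tool.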
### Ubuntu/Debian
|
### Ubuntu/Debian
|
||||||
```bash
|
```bash
|
||||||
sudo apt-get update
|
sudo apt-get update
|
||||||
sudo apt-get install -y \
|
sudo apt-get install -y \
|
||||||
tesseract-ocr tesseract-ocr-eng \
|
tesseract-ocr \
|
||||||
poppler-utils ghostscript \
|
tesseract-ocr-eng \
|
||||||
python3-tk default-jre-headless
|
poppler-utils \
|
||||||
|
ghostscript \
|
||||||
# For markdown_to_pdf, pick one of:
|
python3-tk \
|
||||||
sudo apt-get install -y pandoc # then install tectonic separately
|
default-jre-headless
|
||||||
sudo apt-get install -y pandoc texlive-xetex texlive-latex-extra texlive-fonts-extra # full TeX
|
|
||||||
sudo apt-get install -y pandoc && pip install weasyprint # skip TeX
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Arch Linux
|
### macOS
|
||||||
```bash
|
|
||||||
sudo pacman -S \
|
|
||||||
tesseract tesseract-data-eng \
|
|
||||||
poppler ghostscript \
|
|
||||||
jre-openjdk-headless tk
|
|
||||||
|
|
||||||
# For markdown_to_pdf, pick one of:
|
|
||||||
sudo pacman -S pandoc tectonic # recommended
|
|
||||||
sudo pacman -S pandoc texlive-xetex texlive-latexextra texlive-fontsextra # full TeX
|
|
||||||
sudo pacman -S pandoc && pip install weasyprint # skip TeX
|
|
||||||
```
|
|
||||||
|
|
||||||
### macOS (Homebrew)
|
|
||||||
```bash
|
```bash
|
||||||
brew install tesseract poppler ghostscript
|
brew install tesseract poppler ghostscript
|
||||||
|
|
||||||
# For markdown_to_pdf, pick one of:
|
|
||||||
brew install pandoc tectonic # recommended
|
|
||||||
brew install pandoc && brew install --cask mactex-no-gui # full TeX
|
|
||||||
brew install pandoc weasyprint # skip TeX
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Windows
|
### Windows
|
||||||
@@ -109,33 +58,18 @@ brew install pandoc weasyprint # skip TeX
|
|||||||
- Install Poppler: http://blog.alivate.com.au/poppler-windows/
|
- Install Poppler: http://blog.alivate.com.au/poppler-windows/
|
||||||
- Install Ghostscript: https://www.ghostscript.com/download/gsdnld.html
|
- Install Ghostscript: https://www.ghostscript.com/download/gsdnld.html
|
||||||
- Install Java: https://www.java.com/download/
|
- Install Java: https://www.java.com/download/
|
||||||
- Install Pandoc (for `markdown_to_pdf`): https://pandoc.org/installing.html
|
|
||||||
- Install MiKTeX or wkhtmltopdf for the PDF engine
|
|
||||||
|
|
||||||
## 3. Adding to Claude Code / Claude Desktop
|
## 3. Claude Desktop Configuration
|
||||||
|
|
||||||
### Easiest — `claude mcp add` with uvx
|
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
|
||||||
|
|
||||||
```bash
|
|
||||||
# Bare
|
|
||||||
claude mcp add pdf-tools -- uvx mcp-pdf
|
|
||||||
|
|
||||||
# With markdown_to_pdf support
|
|
||||||
claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf
|
|
||||||
```
|
|
||||||
|
|
||||||
The `--` separator is required so the Claude CLI doesn't try to parse `--from` as one of its own flags.
|
|
||||||
|
|
||||||
### Manual config (Claude Desktop)
|
|
||||||
|
|
||||||
Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `~/.config/Claude/claude_desktop_config.json` (Linux):
|
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"mcpServers": {
|
"mcpServers": {
|
||||||
"pdf-tools": {
|
"pdf-tools": {
|
||||||
"command": "uvx",
|
"command": "uv",
|
||||||
"args": ["--from", "mcp-pdf[markdown]", "mcp-pdf"]
|
"args": ["run", "mcp-pdf-tools"],
|
||||||
|
"cwd": "/home/rpm/claude/mcp-pdf-tools"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
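If you would rather script the manual config edit than do it by hand, the sketch below merges the `pdf-tools` entry from the new side of this diff into an existing config file without touching other servers. It is an illustration under assumptions: `add_pdf_tools_server` is a hypothetical helper, and it assumes the file, when present, already contains valid JSON.

```python
import json
from pathlib import Path

# Entry copied from the uvx-based config shown in the diff.
PDF_TOOLS_ENTRY = {
    "command": "uvx",
    "args": ["--from", "mcp-pdf[markdown]", "mcp-pdf"],
}


def add_pdf_tools_server(config_path: Path) -> dict:
    """Add or overwrite the "pdf-tools" server, preserving other entries."""
    if config_path.exists():
        config = json.loads(config_path.read_text())
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    servers["pdf-tools"] = dict(PDF_TOOLS_ENTRY)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Example (macOS path from the text above):
# add_pdf_tools_server(
#     Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
# )
```

Back up the config file first; the helper rewrites it in place.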
@@ -152,20 +86,14 @@ uv run python examples/test_pdf_tools.py /path/to/your/document.pdf
|
|||||||
|
|
||||||
### OCR not working
|
### OCR not working
|
||||||
- Check Tesseract is installed: `tesseract --version`
|
- Check Tesseract is installed: `tesseract --version`
|
||||||
- Install language packs: `sudo apt-get install tesseract-ocr-[lang]` (Debian) or `sudo pacman -S tesseract-data-[lang]` (Arch)
|
- Install language packs: `sudo apt-get install tesseract-ocr-[lang]`
|
||||||
|
|
||||||
### Table extraction failing
|
### Table extraction failing
|
||||||
- Check Java is installed: `java -version`
|
- Check Java is installed: `java -version`
|
||||||
- For Camelot issues, ensure Ghostscript is installed: `gs --version`
|
- For Camelot issues, ensure Ghostscript is installed
|
||||||
|
|
||||||
### `markdown_to_pdf` errors
|
|
||||||
- "pandoc binary not found" → install pandoc (see System Dependencies)
|
|
||||||
- "No PDF engine found" → install at least one of `xelatex`, `pdflatex`, `tectonic`, `weasyprint`, `wkhtmltopdf`
|
|
||||||
- "Pandoc died with exitcode 43" + `mktexfmt` errors → your TeX install is missing format files; rebuild with `sudo fmtutil-sys --all` or use a different engine via `pdf_engine="weasyprint"`
|
|
||||||
- The tool reports `detected_engines` in its response — check that field to see what's actually available
|
|
||||||
|
|
||||||
### Large PDF issues
|
### Large PDF issues
|
||||||
- Process specific pages: `pages="1-10"` or `pages="1,3,5"`
|
- Process specific pages: `pages=[0, 1, 2]`
|
||||||
- Increase memory: `export JAVA_OPTS="-Xmx2g"`
|
- Increase memory: `export JAVA_OPTS="-Xmx2g"`
|
||||||
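The new `pages` syntax mixes ranges and comma lists. As a rough illustration of how such a string expands, here is a hypothetical parser; it is not the server's implementation, and whether the server treats page numbers as 0- or 1-based is not stated in this diff, so the numbers are returned exactly as written.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "1-10" or "1,3,5" into a list of page numbers.

    Ranges are inclusive. Numbers are returned as written; no index
    base conversion is applied.
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages
```

For example, `parse_pages("1-3,7")` expands to pages 1, 2, 3 and 7.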
|
|
||||||
## 6. Example Usage in Claude
|
## 6. Example Usage in Claude
|
||||||
@@ -177,21 +105,6 @@ Once configured, you can ask Claude:
|
|||||||
- "Extract all tables from /path/to/report.pdf and format as markdown"
|
- "Extract all tables from /path/to/report.pdf and format as markdown"
|
||||||
- "Convert /path/to/document.pdf to markdown format"
|
- "Convert /path/to/document.pdf to markdown format"
|
||||||
- "Extract images from the first 5 pages of /path/to/presentation.pdf"
|
- "Extract images from the first 5 pages of /path/to/presentation.pdf"
|
||||||
- "Build a PDF from /path/to/notes.md with a table of contents"
|
|
||||||
|
|
||||||
## 7. Verify the Built-in Test
|
|
||||||
|
|
||||||
Convert this README itself to PDF as a smoke test once everything is wired up:
|
|
||||||
|
|
||||||
```python
|
|
||||||
markdown_to_pdf(
|
|
||||||
markdown_path="QUICKSTART.md",
|
|
||||||
output_path="/tmp/quickstart.pdf",
|
|
||||||
toc=True,
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
The response includes `detected_engines` so you can see exactly what's installed on your host.
|
|
||||||
|
|
||||||
## Need Help?
|
## Need Help?
|
||||||
|
|
||||||
|
|||||||
106
README.md
@@ -6,7 +6,7 @@
|
|||||||
|
|
||||||
**A FastMCP server for PDF processing**
|
**A FastMCP server for PDF processing**
|
||||||
|
|
||||||
*47 tools for text extraction, OCR, tables, forms, annotations, markdown↔PDF, and more*
|
*46 tools for text extraction, OCR, tables, forms, annotations, and more*
|
||||||
|
|
||||||
[](https://www.python.org/downloads/)
|
[](https://www.python.org/downloads/)
|
||||||
[](https://github.com/jlowin/fastmcp)
|
[](https://github.com/jlowin/fastmcp)
|
||||||
@@ -31,25 +31,19 @@ MCP PDF extracts content from PDFs using multiple libraries with automatic fallb
|
|||||||
- **Document assembly** - merge, split, reorder pages
|
- **Document assembly** - merge, split, reorder pages
|
||||||
- **Annotations** - sticky notes, highlights, stamps
|
- **Annotations** - sticky notes, highlights, stamps
|
||||||
- **Vector graphics** - extract to SVG for schematics and technical drawings
|
- **Vector graphics** - extract to SVG for schematics and technical drawings
|
||||||
- **Format conversion** - PDF ↔ Markdown (PDF→MD via PyMuPDF, MD→PDF via pandoc)
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Run from PyPI (one-shot, no permanent install)
|
# Install from PyPI
|
||||||
uvx mcp-pdf
|
uvx mcp-pdf
|
||||||
|
|
||||||
# Add to Claude Code — note the `--` separator before uvx
|
# Or add to Claude Code
|
||||||
claude mcp add pdf-tools -- uvx mcp-pdf
|
claude mcp add pdf-tools uvx mcp-pdf
|
||||||
|
|
||||||
# Include the markdown_to_pdf tool (requires pandoc on host)
|
|
||||||
claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf
|
|
||||||
```
|
```
|
||||||
|
|
||||||
> `uvx` caches tool installs aggressively. After upgrading to a new release, force a refresh with `uvx --refresh mcp-pdf` (or `uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf` if you're using extras).
|
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><b>Development Installation</b></summary>
|
<summary><b>Development Installation</b></summary>
|
||||||
|
|
||||||
@@ -61,11 +55,6 @@ uv sync
|
|||||||
# System dependencies (Ubuntu/Debian)
|
# System dependencies (Ubuntu/Debian)
|
||||||
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
|
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
|
||||||
|
|
||||||
# For markdown_to_pdf — pick one PDF-engine route:
|
|
||||||
sudo apt-get install pandoc tectonic # recommended (small)
|
|
||||||
# or: sudo apt-get install pandoc texlive-xetex texlive-latex-extra # full TeX
|
|
||||||
# or: sudo apt-get install pandoc && pip install weasyprint # skip TeX
|
|
||||||
|
|
||||||
# Verify
|
# Verify
|
||||||
uv run python examples/verify_installation.py
|
uv run python examples/verify_installation.py
|
||||||
```
|
```
|
||||||
@@ -84,18 +73,10 @@ uv run python examples/verify_installation.py
|
|||||||
| `extract_tables` | Extract tables to JSON, CSV, or Markdown |
|
| `extract_tables` | Extract tables to JSON, CSV, or Markdown |
|
||||||
| `extract_images` | Extract embedded images |
|
| `extract_images` | Extract embedded images |
|
||||||
| `extract_links` | Get all hyperlinks with page filtering |
|
| `extract_links` | Get all hyperlinks with page filtering |
|
||||||
|
| `pdf_to_markdown` | Convert PDF to markdown preserving structure |
|
||||||
| `ocr_pdf` | OCR scanned documents using Tesseract |
|
| `ocr_pdf` | OCR scanned documents using Tesseract |
|
||||||
| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |
|
| `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |
|
||||||
|
|
||||||
### Format Conversion
|
|
||||||
|
|
||||||
| Tool | What it does |
|
|
||||||
|------|-------------|
|
|
||||||
| `pdf_to_markdown` | Convert PDF to markdown preserving structure; extracts images and SVG vectors to disk |
|
|
||||||
| `markdown_to_pdf` | Convert `.md` files (or inline text) to PDF via pandoc with auto-detected engine |
|
|
||||||
|
|
||||||
**`markdown_to_pdf` requires:** `pip install mcp-pdf[markdown]` plus the pandoc binary and at least one PDF engine (`xelatex`, `pdflatex`, `tectonic`, `weasyprint`, or `wkhtmltopdf`) on PATH. The tool auto-detects what's available and uses the highest-quality one. Pass `pdf_engine=` to override or `extra_args=` for raw pandoc options.
|
|
||||||
|
|
||||||
### Document Analysis
|
### Document Analysis
|
||||||
|
|
||||||
| Tool | What it does |
|
| Tool | What it does |
|
||||||
@@ -212,87 +193,12 @@ Some features require system packages:
|
|||||||
| Camelot tables | `ghostscript` |
|
| Camelot tables | `ghostscript` |
|
||||||
| Tabula tables | `default-jre-headless` |
|
| Tabula tables | `default-jre-headless` |
|
||||||
| PDF to images | `poppler-utils` |
|
| PDF to images | `poppler-utils` |
|
||||||
| `markdown_to_pdf` | `pandoc` + one of: `tectonic`, `texlive-xetex` (+ `texlive-latex-extra`), `weasyprint`, `wkhtmltopdf` |
|
|
||||||
|
|
||||||
### Picking a PDF engine for `markdown_to_pdf`
|
Ubuntu/Debian:
|
||||||
|
|
||||||
Pandoc takes markdown → HTML or LaTeX → PDF. The LaTeX path produces the most polished output but needs a TeX install. Trade-offs:
|
|
||||||
|
|
||||||
| Engine | Disk size | Notes |
|
|
||||||
|--------|----------|-------|
|
|
||||||
| **`tectonic`** | ~30 MB | **Recommended for new installs.** Single static binary. Downloads LaTeX packages on demand — no upfront mass-install. |
|
|
||||||
| `xelatex` + `texlive-latex-extra` | ~500 MB | Best output once installed. Use if you already run TeX. The `-extra` package matters: pandoc's default template needs `lastpage`, `xcolor`, `framed`, `fancyhdr`, etc. — all of which live there, **not** in `texlive-xetex`. |
|
|
||||||
| `xelatex` alone (just `texlive-xetex`) | ~200 MB | **Often breaks.** Expect `! LaTeX Error: File 'X.sty' not found` on real docs. |
|
|
||||||
| `weasyprint` | ~40 MB | Pure-Python (`pip install weasyprint`) + cairo/pango system libs. HTML/CSS path — no LaTeX. Good for simple docs; weaker on math, footnotes, citations. |
|
|
||||||
| `wkhtmltopdf` | ~40 MB | Older HTML-to-PDF tool. Adequate but less actively maintained. |
|
|
||||||
|
|
||||||
**Ubuntu/Debian:**
|
|
||||||
```bash
|
```bash
|
||||||
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
|
sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
|
||||||
|
|
||||||
# For markdown_to_pdf — pick one engine route:
|
|
||||||
|
|
||||||
# Option A — tectonic (smallest, downloads packages on demand)
|
|
||||||
sudo apt-get install pandoc
|
|
||||||
# tectonic isn't in apt — install via cargo or download static binary:
|
|
||||||
# https://tectonic-typesetting.github.io/en-US/install.html
|
|
||||||
|
|
||||||
# Option B — full TeX (best quality, large download)
|
|
||||||
sudo apt-get install pandoc texlive-xetex texlive-latex-extra texlive-fonts-extra
|
|
||||||
|
|
||||||
# Option C — weasyprint (skip TeX entirely)
|
|
||||||
sudo apt-get install pandoc
|
|
||||||
pip install weasyprint
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Arch Linux:**
|
|
||||||
```bash
|
|
||||||
sudo pacman -S tesseract tesseract-data-eng poppler ghostscript jre-openjdk-headless
|
|
||||||
|
|
||||||
# For markdown_to_pdf — pick one engine route:
|
|
||||||
|
|
||||||
# Option A — tectonic (recommended for new installs, in official repo)
|
|
||||||
sudo pacman -S pandoc tectonic
|
|
||||||
|
|
||||||
# Option B — full TeX (best output, ~500 MB)
|
|
||||||
sudo pacman -S pandoc texlive-xetex texlive-latexextra texlive-fontsextra
|
|
||||||
|
|
||||||
# Option C — weasyprint (skip TeX)
|
|
||||||
sudo pacman -S pandoc
|
|
||||||
pip install weasyprint # or: uv pip install weasyprint
|
|
||||||
|
|
||||||
# Option D — wkhtmltopdf (from AUR)
|
|
||||||
yay -S wkhtmltopdf-static
|
|
||||||
```
|
|
||||||
|
|
||||||
**macOS (Homebrew):**
|
|
||||||
```bash
|
|
||||||
brew install tesseract poppler ghostscript
|
|
||||||
|
|
||||||
# For markdown_to_pdf — pick one engine route:
|
|
||||||
|
|
||||||
# Option A — tectonic (recommended)
|
|
||||||
brew install pandoc tectonic
|
|
||||||
|
|
||||||
# Option B — full TeX (mactex-no-gui includes the latex-extra equivalent)
|
|
||||||
brew install pandoc
|
|
||||||
brew install --cask mactex-no-gui
|
|
||||||
|
|
||||||
# Option C — weasyprint
|
|
||||||
brew install pandoc weasyprint
|
|
||||||
```
|
|
||||||
|
|
||||||
## Optional Extras
|
|
||||||
|
|
||||||
The base install stays lean. Heavy or niche dependencies are gated behind extras:
|
|
||||||
|
|
||||||
| Extra | Adds | When to install |
|
|
||||||
|-------|------|----------------|
|
|
||||||
| `mcp-pdf[forms]` | `reportlab` | Form creation tools (`create_form_pdf`, permit forms) |
|
|
||||||
| `mcp-pdf[tables]` | `camelot-py`, `tabula-py` | Higher-accuracy table extraction (also needs Java + Ghostscript) |
|
|
||||||
| `mcp-pdf[markdown]` | `pypandoc` | `markdown_to_pdf` tool (also needs pandoc binary) |
|
|
||||||
| `mcp-pdf[all]` | All of the above | Want everything |
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Configuration
|
## Configuration
|
||||||
|
|||||||