diff --git a/QUICKSTART.md b/QUICKSTART.md
index 36fa35d..1e2dbdc 100644
--- a/QUICKSTART.md
+++ b/QUICKSTART.md
@@ -2,12 +2,38 @@
 
 ## 1. Installation
 
-### Option A: Using UV (Recommended for Development)
+### Option A: Run from PyPI with uvx (Recommended for end users)
+
+No clone required — `uvx` fetches and runs in an isolated cached venv:
+
+```bash
+# Bare install
+uvx mcp-pdf
+
+# With markdown_to_pdf support (requires pandoc on host)
+uvx --from "mcp-pdf[markdown]" mcp-pdf
+
+# Force a refresh after a new release
+uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf
+```
+
+### Option B: pip install from PyPI
+
+```bash
+pip install mcp-pdf
+# Or with optional extras:
+pip install "mcp-pdf[markdown]" # adds markdown_to_pdf
+pip install "mcp-pdf[forms]" # adds form creation tools
+pip install "mcp-pdf[tables]" # adds Camelot/Tabula table extraction
+pip install "mcp-pdf[all]" # everything
+```
+
+### Option C: Local development with uv
 
 ```bash
 # Clone the repository
-git clone https://github.com/rpm/mcp-pdf-tools
-cd mcp-pdf-tools
+git clone https://github.com/rsp2k/mcp-pdf
+cd mcp-pdf
 
 # Install with uv
 uv sync
@@ -16,41 +42,63 @@ uv sync
 uv run python examples/verify_installation.py
 ```
 
-### Option B: Using Docker
+### Option D: Using Docker
 
 ```bash
-# Clone the repository
-git clone https://github.com/rpm/mcp-pdf-tools
-cd mcp-pdf-tools
+git clone https://github.com/rsp2k/mcp-pdf
+cd mcp-pdf
 
-# Build and run with Docker
-docker-compose build
-docker-compose run --rm mcp-pdf-tools python examples/verify_installation.py
-```
-
-### Option C: From PyPI
-
-```bash
-pip install mcp-pdf-tools
+docker compose build
+docker compose run --rm mcp-pdf python examples/verify_installation.py
 ```
 
 ## 2. System Dependencies
 
+`uvx` and `pip` only handle Python deps. Some tools call out to system binaries that you'll need to install separately:
+
+| Binary | Required for |
+|--------|-------------|
+| `tesseract` | `ocr_pdf` |
+| `ghostscript` | Camelot table extraction |
+| `java` (JRE) | Tabula table extraction |
+| `poppler` | PDF→image conversion |
+| `pandoc` | `markdown_to_pdf` |
+| `xelatex` / `pdflatex` / `tectonic` / `weasyprint` / `wkhtmltopdf` | `markdown_to_pdf` (need at least one) |
+
 ### Ubuntu/Debian
 ```bash
 sudo apt-get update
 sudo apt-get install -y \
-  tesseract-ocr \
-  tesseract-ocr-eng \
-  poppler-utils \
-  ghostscript \
-  python3-tk \
-  default-jre-headless
+  tesseract-ocr tesseract-ocr-eng \
+  poppler-utils ghostscript \
+  python3-tk default-jre-headless
+
+# For markdown_to_pdf
+sudo apt-get install -y pandoc texlive-xetex
 ```
 
-### macOS
+### Arch Linux
+```bash
+sudo pacman -S \
+  tesseract tesseract-data-eng \
+  poppler ghostscript \
+  jre-openjdk-headless tk
+
+# For markdown_to_pdf
+sudo pacman -S pandoc texlive-xetex
+# Lighter alternative engines: tectonic (official repo),
+# wkhtmltopdf (AUR), or `pip install weasyprint` (works in any venv)
+```
+
+### macOS (Homebrew)
 ```bash
 brew install tesseract poppler ghostscript
+
+# For markdown_to_pdf
+brew install pandoc
+brew install --cask mactex-no-gui # full TeX with xelatex/pdflatex
+# Or lighter:
+brew install weasyprint
 ```
 
 ### Windows
@@ -58,18 +106,33 @@ brew install tesseract poppler ghostscript
 - Install Poppler: http://blog.alivate.com.au/poppler-windows/
 - Install Ghostscript: https://www.ghostscript.com/download/gsdnld.html
 - Install Java: https://www.java.com/download/
+- Install Pandoc (for `markdown_to_pdf`): https://pandoc.org/installing.html
+- Install MiKTeX or wkhtmltopdf for the PDF engine
 
-## 3. Claude Desktop Configuration
+## 3. Adding to Claude Code / Claude Desktop
 
-Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
+### Easiest — `claude mcp add` with uvx
+
+```bash
+# Bare
+claude mcp add pdf-tools -- uvx mcp-pdf
+
+# With markdown_to_pdf support
+claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf
+```
+
+The `--` separator is required so the Claude CLI doesn't try to parse `--from` as one of its own flags.
+
+### Manual config (Claude Desktop)
+
+Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `~/.config/Claude/claude_desktop_config.json` (Linux):
 
 ```json
 {
   "mcpServers": {
     "pdf-tools": {
-      "command": "uv",
-      "args": ["run", "mcp-pdf-tools"],
-      "cwd": "/home/rpm/claude/mcp-pdf-tools"
+      "command": "uvx",
+      "args": ["--from", "mcp-pdf[markdown]", "mcp-pdf"]
     }
   }
 }
@@ -86,14 +149,20 @@ uv run python examples/test_pdf_tools.py /path/to/your/document.pdf
 
 ### OCR not working
 - Check Tesseract is installed: `tesseract --version`
-- Install language packs: `sudo apt-get install tesseract-ocr-[lang]`
+- Install language packs: `sudo apt-get install tesseract-ocr-[lang]` (Debian) or `sudo pacman -S tesseract-data-[lang]` (Arch)
 
 ### Table extraction failing
 - Check Java is installed: `java -version`
-- For Camelot issues, ensure Ghostscript is installed
+- For Camelot issues, ensure Ghostscript is installed: `gs --version`
+
+### `markdown_to_pdf` errors
+- "pandoc binary not found" → install pandoc (see System Dependencies)
+- "No PDF engine found" → install at least one of `xelatex`, `pdflatex`, `tectonic`, `weasyprint`, `wkhtmltopdf`
+- "Pandoc died with exitcode 43" + `mktexfmt` errors → your TeX install is missing format files; rebuild with `sudo fmtutil-sys --all` or use a different engine via `pdf_engine="weasyprint"`
+- The tool reports `detected_engines` in its response — check that field to see what's actually available
 
 ### Large PDF issues
-- Process specific pages: `pages=[0, 1, 2]`
+- Process specific pages: `pages="1-10"` or `pages="1,3,5"`
 - Increase memory: `export JAVA_OPTS="-Xmx2g"`
 
 ## 6. Example Usage in Claude
@@ -105,6 +174,21 @@ Once configured, you can ask Claude:
 - "Extract all tables from /path/to/report.pdf and format as markdown"
 - "Convert /path/to/document.pdf to markdown format"
 - "Extract images from the first 5 pages of /path/to/presentation.pdf"
+- "Build a PDF from /path/to/notes.md with a table of contents"
+
+## 7. Verify the Built-in Test
+
+Convert this README itself to PDF as a smoke test once everything is wired up:
+
+```python
+markdown_to_pdf(
+    markdown_path="QUICKSTART.md",
+    output_path="/tmp/quickstart.pdf",
+    toc=True,
+)
+```
+
+The response includes `detected_engines` so you can see exactly what's installed on your host.
 
 ## Need Help?
 
diff --git a/README.md b/README.md
index bc89858..5e194e3 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 
 **A FastMCP server for PDF processing**
 
-*46 tools for text extraction, OCR, tables, forms, annotations, and more*
+*47 tools for text extraction, OCR, tables, forms, annotations, markdown↔PDF, and more*
 
 [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg?style=flat-square)](https://www.python.org/downloads/)
 [![FastMCP](https://img.shields.io/badge/FastMCP-2.0+-green.svg?style=flat-square)](https://github.com/jlowin/fastmcp)
@@ -31,19 +31,25 @@ MCP PDF extracts content from PDFs using multiple libraries with automatic fallb
 - **Document assembly** - merge, split, reorder pages
 - **Annotations** - sticky notes, highlights, stamps
 - **Vector graphics** - extract to SVG for schematics and technical drawings
+- **Format conversion** - PDF ↔ Markdown (PDF→MD via PyMuPDF, MD→PDF via pandoc)
 
 ---
 
 ## Quick Start
 
 ```bash
-# Install from PyPI
+# Run from PyPI (one-shot, no permanent install)
 uvx mcp-pdf
 
-# Or add to Claude Code
-claude mcp add pdf-tools uvx mcp-pdf
+# Add to Claude Code — note the `--` separator before uvx
+claude mcp add pdf-tools -- uvx mcp-pdf
+
+# Include the markdown_to_pdf tool (requires pandoc on host)
+claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf
 ```
 
+> `uvx` caches tool installs aggressively. After upgrading to a new release, force a refresh with `uvx --refresh mcp-pdf` (or `uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf` if you're using extras).
+
 
 Development Installation
 
@@ -55,6 +61,9 @@ uv sync
 # System dependencies (Ubuntu/Debian)
 sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript
 
+# For markdown_to_pdf:
+sudo apt-get install pandoc texlive-xetex # or: weasyprint, wkhtmltopdf
+
 # Verify
 uv run python examples/verify_installation.py
 ```
@@ -73,10 +82,18 @@ uv run python examples/verify_installation.py
 | `extract_tables` | Extract tables to JSON, CSV, or Markdown |
 | `extract_images` | Extract embedded images |
 | `extract_links` | Get all hyperlinks with page filtering |
-| `pdf_to_markdown` | Convert PDF to markdown preserving structure |
 | `ocr_pdf` | OCR scanned documents using Tesseract |
 | `extract_vector_graphics` | Export vector graphics to SVG (schematics, charts, drawings) |
 
+### Format Conversion
+
+| Tool | What it does |
+|------|-------------|
+| `pdf_to_markdown` | Convert PDF to markdown preserving structure; extracts images and SVG vectors to disk |
+| `markdown_to_pdf` | Convert `.md` files (or inline text) to PDF via pandoc with auto-detected engine |
+
+**`markdown_to_pdf` requires:** `pip install mcp-pdf[markdown]` plus the pandoc binary and at least one PDF engine (`xelatex`, `pdflatex`, `tectonic`, `weasyprint`, or `wkhtmltopdf`) on PATH. The tool auto-detects what's available and uses the highest-quality one. Pass `pdf_engine=` to override or `extra_args=` for raw pandoc options.
+
 ### Document Analysis
 
 | Tool | What it does |
@@ -193,12 +210,47 @@ Some features require system packages:
 | Camelot tables | `ghostscript` |
 | Tabula tables | `default-jre-headless` |
 | PDF to images | `poppler-utils` |
+| `markdown_to_pdf` | `pandoc` + one of: `texlive-xetex`, `texlive-latex-base`, `tectonic`, `weasyprint`, `wkhtmltopdf` |
 
-Ubuntu/Debian:
+**Ubuntu/Debian:**
 ```bash
 sudo apt-get install tesseract-ocr tesseract-ocr-eng poppler-utils ghostscript default-jre-headless
+
+# For markdown_to_pdf — pandoc plus at least one PDF engine
+sudo apt-get install pandoc texlive-xetex
 ```
 
+**Arch Linux:**
+```bash
+sudo pacman -S tesseract tesseract-data-eng poppler ghostscript jre-openjdk-headless
+
+# For markdown_to_pdf — pandoc plus at least one PDF engine
+sudo pacman -S pandoc texlive-xetex
+# Lighter alternatives (pick one): tectonic, wkhtmltopdf (AUR), or pip install weasyprint
+```
+
+**macOS (Homebrew):**
+```bash
+brew install tesseract poppler ghostscript
+
+# For markdown_to_pdf
+brew install pandoc
+brew install --cask mactex-no-gui # for xelatex/pdflatex
+# Or a lighter engine:
+brew install weasyprint
+```
+
+## Optional Extras
+
+The base install stays lean. Heavy or niche dependencies are gated behind extras:
+
+| Extra | Adds | When to install |
+|-------|------|----------------|
+| `mcp-pdf[forms]` | `reportlab` | Form creation tools (`create_form_pdf`, permit forms) |
+| `mcp-pdf[tables]` | `camelot-py`, `tabula-py` | Higher-accuracy table extraction (also needs Java + Ghostscript) |
+| `mcp-pdf[markdown]` | `pypandoc` | `markdown_to_pdf` tool (also needs pandoc binary) |
+| `mcp-pdf[all]` | All of the above | Want everything |
+
 ---
 
 ## Configuration