# Quick Start Guide

## 1. Installation

### Option A: Run from PyPI with uvx (Recommended for end users)

No clone is required. `uvx` fetches the package and runs it in an isolated, cached virtual environment:

```bash
# Bare install
uvx mcp-pdf

# With markdown_to_pdf support (requires pandoc on the host)
uvx --from "mcp-pdf[markdown]" mcp-pdf

# Force a refresh after a new release
uvx --refresh --from "mcp-pdf[markdown]" mcp-pdf
```

### Option B: pip install from PyPI

```bash
pip install mcp-pdf
# Or with optional extras:
pip install "mcp-pdf[markdown]"   # adds markdown_to_pdf
pip install "mcp-pdf[forms]"      # adds form creation tools
pip install "mcp-pdf[tables]"     # adds Camelot/Tabula table extraction
pip install "mcp-pdf[all]"        # everything
```

### Option C: Local development with uv

```bash
# Clone the repository
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf

# Install with uv
uv sync

# Verify installation
uv run python examples/verify_installation.py
```

### Option D: Using Docker

```bash
git clone https://github.com/rsp2k/mcp-pdf
cd mcp-pdf

docker compose build
docker compose run --rm mcp-pdf python examples/verify_installation.py
```

## 2. System Dependencies

`uvx` and `pip` install only Python dependencies. Some tools call out to system binaries that you need to install separately:

| Binary | Required for |
|--------|-------------|
| `tesseract` | `ocr_pdf` |
| `ghostscript` | Camelot table extraction |
| `java` (JRE) | Tabula table extraction |
| `poppler` | PDF→image conversion |
| `pandoc` | `markdown_to_pdf` |
| `xelatex` / `pdflatex` / `tectonic` / `weasyprint` / `wkhtmltopdf` | `markdown_to_pdf` (need at least one) |

> **Note on the LaTeX engine:** `texlive-xetex` alone is often not enough for real markdown docs. Pandoc's default template needs LaTeX packages (`lastpage`, `xcolor`, `framed`, `fancyhdr`, etc.) that live in `texlive-latex-extra` (Debian) / `texlive-latexextra` (Arch), and without them conversion fails with "File 'X.sty' not found". If you don't already use TeX, **`tectonic` is a much better choice**: it's a ~30 MB static binary that downloads packages on demand. See the README's "Picking a PDF engine" table for details.

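
To see which of the binaries in the table are already on your `PATH` before installing anything, a small check along these lines may help. It is a minimal sketch, not part of mcp-pdf: the executable names `gs` (Ghostscript) and `pdftoppm` (Poppler) are assumptions that hold for typical Linux/macOS installs and may differ on Windows, and `find_binaries` and `report` are illustrative helper names.

```python
import shutil

# Executable names are assumptions: Ghostscript is usually `gs` and Poppler
# ships `pdftoppm` on Linux/macOS; Windows builds may use other names.
REQUIRED_BINARIES = {
    "tesseract": "ocr_pdf",
    "gs": "Camelot table extraction",
    "java": "Tabula table extraction",
    "pdftoppm": "PDF to image conversion",
    "pandoc": "markdown_to_pdf",
}
PDF_ENGINES = ["xelatex", "pdflatex", "tectonic", "weasyprint", "wkhtmltopdf"]


def find_binaries(names, which=shutil.which):
    """Map each name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in names}


def report(which=shutil.which):
    """Build one status line per required binary plus a PDF engine summary."""
    lines = []
    for name, path in find_binaries(REQUIRED_BINARIES, which).items():
        status = path or "MISSING"
        lines.append(f"{name:<12} {status}  (needed for {REQUIRED_BINARIES[name]})")
    engines = [n for n, p in find_binaries(PDF_ENGINES, which).items() if p]
    lines.append("PDF engines: " + (", ".join(engines) if engines else "none found"))
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

A `MISSING` entry only matters if you plan to use the tool listed next to it; `markdown_to_pdf` needs `pandoc` plus at least one of the PDF engines.
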
### Ubuntu/Debian

```bash
sudo apt-get update
sudo apt-get install -y \
    tesseract-ocr tesseract-ocr-eng \
    poppler-utils ghostscript \
    python3-tk default-jre-headless

# For markdown_to_pdf, run ONE of the following lines:
sudo apt-get install -y pandoc    # tectonic route (recommended): install tectonic separately
sudo apt-get install -y pandoc texlive-xetex texlive-latex-extra texlive-fonts-extra    # full TeX
sudo apt-get install -y pandoc && pip install weasyprint    # skip TeX
```

### Arch Linux

```bash
sudo pacman -S \
    tesseract tesseract-data-eng \
    poppler ghostscript \
    jre-openjdk-headless tk

# For markdown_to_pdf, run ONE of the following lines:
sudo pacman -S pandoc tectonic    # recommended
sudo pacman -S pandoc texlive-xetex texlive-latexextra texlive-fontsextra    # full TeX
sudo pacman -S pandoc && pip install weasyprint    # skip TeX
```

### macOS (Homebrew)

```bash
brew install tesseract poppler ghostscript

# For markdown_to_pdf, run ONE of the following lines:
brew install pandoc tectonic    # recommended
brew install pandoc && brew install --cask mactex-no-gui    # full TeX
brew install pandoc weasyprint    # skip TeX
```

### Windows

- Install Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
- Install Poppler: http://blog.alivate.com.au/poppler-windows/
- Install Ghostscript: https://www.ghostscript.com/download/gsdnld.html
- Install Java: https://www.java.com/download/
- Install Pandoc (for `markdown_to_pdf`): https://pandoc.org/installing.html
- Install MiKTeX or wkhtmltopdf as the PDF engine

## 3. Adding to Claude Code / Claude Desktop

### Easiest: `claude mcp add` with uvx

```bash
# Bare
claude mcp add pdf-tools -- uvx mcp-pdf

# With markdown_to_pdf support
claude mcp add pdf-tools -- uvx --from "mcp-pdf[markdown]" mcp-pdf
```

The `--` separator is required so the Claude CLI doesn't try to parse `--from` as one of its own flags.

### Manual config (Claude Desktop)

Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `~/.config/Claude/claude_desktop_config.json` (Linux):

```json
{
  "mcpServers": {
    "pdf-tools": {
      "command": "uvx",
      "args": ["--from", "mcp-pdf[markdown]", "mcp-pdf"]
    }
  }
}
```

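
If the config file already lists other MCP servers, pasting the JSON over it would drop them. One way to avoid that is to merge the entry programmatically. The following is a minimal sketch under stated assumptions: `add_server` is a hypothetical helper, not something mcp-pdf or Claude ships, and it assumes the file is either absent or contains valid JSON.

```python
import json
from pathlib import Path

# Mirrors the "pdf-tools" entry from the JSON above.
PDF_TOOLS_ENTRY = {
    "command": "uvx",
    "args": ["--from", "mcp-pdf[markdown]", "mcp-pdf"],
}


def add_server(config_path, name="pdf-tools", entry=PDF_TOOLS_ENTRY):
    """Insert or update one mcpServers entry, keeping every other key intact."""
    path = Path(config_path)
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config.setdefault("mcpServers", {})[name] = entry
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

On macOS this could be called as `add_server(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json")`. Back up the file first.
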
## 4. Test the Tools

```bash
# Test with a sample PDF (run from a local clone, see Option C)
uv run python examples/test_pdf_tools.py /path/to/your/document.pdf
```

## 5. Common Issues

### OCR not working
- Check Tesseract is installed: `tesseract --version`
- Install language packs: `sudo apt-get install tesseract-ocr-[lang]` (Debian) or `sudo pacman -S tesseract-data-[lang]` (Arch)

### Table extraction failing
- Check Java is installed: `java -version`
- For Camelot issues, ensure Ghostscript is installed: `gs --version`

### `markdown_to_pdf` errors
- "pandoc binary not found" → install pandoc (see System Dependencies)
- "No PDF engine found" → install at least one of `xelatex`, `pdflatex`, `tectonic`, `weasyprint`, `wkhtmltopdf`
- "Pandoc died with exitcode 43" + `mktexfmt` errors → your TeX install is missing format files; rebuild them with `sudo fmtutil-sys --all` or switch to a different engine via `pdf_engine="weasyprint"`
- The tool reports `detected_engines` in its response. Check that field to see what's actually available

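
When several engines show up in `detected_engines`, you may want a deterministic way to choose the value passed as `pdf_engine`. The sketch below is only an illustration: it assumes `detected_engines` is a list of engine names (the actual response shape is not documented here), and both `pick_engine` and the preference order are hypothetical, loosely following this guide's recommendation of `tectonic`.

```python
# Preference order is an assumption based on this guide's recommendation:
# tectonic first, then the full TeX engines, then the non-TeX engines.
PREFERENCE = ["tectonic", "xelatex", "pdflatex", "weasyprint", "wkhtmltopdf"]


def pick_engine(detected_engines, preference=PREFERENCE):
    """Return the most preferred engine that was detected, or None."""
    available = set(detected_engines)
    for engine in preference:
        if engine in available:
            return engine
    return None
```

A result of `None` corresponds to the "No PDF engine found" case above.
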
### Large PDF issues
- Process specific pages: `pages="1-10"` or `pages="1,3,5"`
- Increase Java heap memory: `export JAVA_OPTS="-Xmx2g"`

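
The two `pages` forms shown above (a range and a comma-separated list) can be read as follows. This is a hypothetical sketch of how such a spec expands into page numbers, not the parser mcp-pdf actually uses; `expand_pages` is an illustrative name, and mixing both forms in one spec is an assumption.

```python
def expand_pages(spec):
    """Expand a spec such as "1-3,7" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

For a very large document, processing it in several smaller ranges keeps each call's memory use bounded.
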
## 6. Example Usage in Claude

Once configured, you can ask Claude:

- "Extract text from the PDF at /path/to/document.pdf"
- "Check if /path/to/scan.pdf is a scanned document"
- "Extract all tables from /path/to/report.pdf and format as markdown"
- "Convert /path/to/document.pdf to markdown format"
- "Extract images from the first 5 pages of /path/to/presentation.pdf"
- "Build a PDF from /path/to/notes.md with a table of contents"

## 7. Run a Built-in Smoke Test

Once everything is wired up, convert this guide itself to PDF as a smoke test:

```python
# MCP tool call as issued through the client, not a standalone script
markdown_to_pdf(
    markdown_path="QUICKSTART.md",
    output_path="/tmp/quickstart.pdf",
    toc=True,
)
```

The response includes `detected_engines` so you can see exactly what's installed on your host.

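
To confirm that the smoke test really produced a PDF rather than an empty or truncated file, you can check for the `%PDF-` signature that every PDF file starts with. This is a minimal sketch: `looks_like_pdf` is an illustrative helper, and it checks only the header, not whether the document renders correctly.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the path is a file whose first bytes are the PDF signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"
```

After the call above, `looks_like_pdf("/tmp/quickstart.pdf")` should return `True` if the conversion succeeded.
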
## Need Help?

- Check the full `README.md` for detailed documentation
- Run tests: `uv run pytest`
- Enable debug mode: set `DEBUG=true` in your `.env` file