Add markdown_to_pdf tool — convert .md to PDF via pandoc
Some checks are pending
Security Scan / security-scan (push) Waiting to run

New tool in ImageProcessingMixin (sibling of pdf_to_markdown). Accepts
either a markdown file path or inline markdown text, writes a PDF to a
caller-specified output path.

Engine selection auto-detects what's available on PATH, preferring quality:
xelatex > pdflatex > tectonic > weasyprint > wkhtmltopdf. Caller can force
a specific engine or pass raw pandoc args for advanced cases.

pypandoc is gated behind a new [markdown] optional extra so the base
install stays lean. The tool surfaces clear errors if pypandoc, pandoc,
or all PDF engines are missing.

Bumps to v2.2.0 (new feature, minor bump).
This commit is contained in:
Ryan Malloy 2026-05-05 16:21:09 -06:00
parent 0eea85f352
commit b2d9073f04
4 changed files with 229 additions and 5 deletions

View File

@ -91,7 +91,7 @@ uv publish
2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula 2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options 3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata` 4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
5. **Format Conversion**: `pdf_to_markdown` - Writes markdown + extracted raster images and vector graphics (SVG) to disk by default, returns path + preview. Images use relative `./images/` paths, vectors use `./vectors/` paths. Set `inline=True` for full markdown in response. Set `include_vectors=False` to skip vector extraction. Use `output_filename` to override the default .md filename. When `include_vectors=True`, returns `vector_diagnostics` showing which pages had drawings below the complexity threshold. Set `vector_fallback_raster=True` to render those sub-threshold pages as full-page raster images (PNG at 150 DPI) instead of skipping them. 5. **Format Conversion**: `pdf_to_markdown` - Writes markdown + extracted raster images and vector graphics (SVG) to disk by default, returns path + preview. Images use relative `./images/` paths, vectors use `./vectors/` paths. Set `inline=True` for full markdown in response. Set `include_vectors=False` to skip vector extraction. Use `output_filename` to override the default .md filename. When `include_vectors=True`, returns `vector_diagnostics` showing which pages had drawings below the complexity threshold. Set `vector_fallback_raster=True` to render those sub-threshold pages as full-page raster images (PNG at 150 DPI) instead of skipping them. `markdown_to_pdf` - Reverse direction: converts a `.md` file or inline markdown text to PDF using pandoc. Auto-detects available PDF engines (xelatex, pdflatex, tectonic, weasyprint, wkhtmltopdf) and picks the best one on PATH. Pass `pdf_engine` to override or `extra_args` for raw pandoc options. Requires `pip install mcp-pdf[markdown]` and the pandoc binary.
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output 6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization 7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management 8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management

View File

@ -1,6 +1,6 @@
[project] [project]
name = "mcp-pdf" name = "mcp-pdf"
version = "2.1.7" version = "2.2.0"
description = "Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more" description = "Secure FastMCP server for comprehensive PDF processing - text extraction, OCR, table extraction, forms, annotations, and more"
authors = [{name = "Ryan Malloy", email = "ryan@malloys.us"}] authors = [{name = "Ryan Malloy", email = "ryan@malloys.us"}]
readme = "README.md" readme = "README.md"
@ -68,11 +68,18 @@ tables = [
"tabula-py>=2.8.0", "tabula-py>=2.8.0",
] ]
# Markdown → PDF conversion (requires pandoc binary + a PDF engine such as
# xelatex, pdflatex, tectonic, weasyprint, or wkhtmltopdf)
markdown = [
"pypandoc>=1.13",
]
# All optional features # All optional features
all = [ all = [
"reportlab>=4.0.0", "reportlab>=4.0.0",
"camelot-py[cv]>=0.11.0", "camelot-py[cv]>=0.11.0",
"tabula-py>=2.8.0", "tabula-py>=2.8.0",
"pypandoc>=1.13",
] ]
# Development dependencies # Development dependencies

View File

@ -991,3 +991,205 @@ class ImageProcessingMixin(MCPMixin):
simplified = re.sub(r'-?\d+\.\d{3,}', reduce_precision, svg_content) simplified = re.sub(r'-?\d+\.\d{3,}', reduce_precision, svg_content)
return simplified return simplified
@mcp_tool(
name="markdown_to_pdf",
description=(
"Convert a Markdown file (or inline text) to PDF using pandoc. "
"Auto-detects available PDF engines (xelatex, pdflatex, tectonic, "
"weasyprint, wkhtmltopdf) and falls back through them in that order. "
"Pass pdf_engine to override, or extra_args for custom pandoc options "
"(e.g. ['-V', 'geometry:margin=1in']). Requires pandoc binary on host "
"and at least one PDF engine. Install with: pip install mcp-pdf[markdown]"
)
)
async def markdown_to_pdf(
self,
output_path: str,
markdown_path: Optional[str] = None,
markdown_text: Optional[str] = None,
pdf_engine: Optional[str] = None,
toc: bool = False,
title: Optional[str] = None,
author: Optional[str] = None,
date: Optional[str] = None,
base_path: Optional[str] = None,
extra_args: Optional[List[str]] = None,
) -> Dict[str, Any]:
"""
Convert Markdown to PDF using pandoc as the parser and one of several
available PDF rendering engines as the backend.
Provide either `markdown_path` (a .md file) or `markdown_text` (an
inline string), but not both. The output PDF is written to `output_path`.
Engine selection:
If `pdf_engine` is None, the first engine found on PATH is used,
preferring quality: xelatex > pdflatex > tectonic > weasyprint > wkhtmltopdf.
If a specific engine is requested but not on PATH, an error is returned
listing what *is* available.
Args:
output_path: Where to write the resulting PDF (required).
markdown_path: Path to a .md file. Mutually exclusive with markdown_text.
markdown_text: Inline markdown content. Mutually exclusive with markdown_path.
pdf_engine: Force a specific engine ("xelatex", "pdflatex", "tectonic",
"weasyprint", "wkhtmltopdf"). Default: auto-detect.
toc: Generate a table of contents from headings.
title: Document title (overrides any YAML frontmatter title).
author: Document author (overrides any YAML frontmatter author).
date: Document date string (overrides any YAML frontmatter date).
base_path: Resource resolution base for relative image references.
Defaults to the markdown file's directory when markdown_path is used.
extra_args: Additional raw pandoc CLI arguments (advanced).
Returns:
Dict with output_path, file_size, engine_used, conversion_time, and
(when applicable) detected_engines listing what's available on the host.
"""
import shutil
start_time = time.time()
ENGINE_PREFERENCE = ["xelatex", "pdflatex", "tectonic", "weasyprint", "wkhtmltopdf"]
try:
# Optional dep check — pypandoc is gated behind the [markdown] extra
try:
import pypandoc
except ImportError:
return {
"error": (
"pypandoc is not installed. Install with: "
"pip install mcp-pdf[markdown] (also requires the pandoc "
"binary and a PDF engine on PATH)"
),
"conversion_time": round(time.time() - start_time, 2),
}
# Verify pandoc binary is reachable — pypandoc raises OSError if missing
try:
pypandoc.get_pandoc_version()
except OSError:
return {
"error": (
"pandoc binary not found on PATH. Install pandoc: "
"https://pandoc.org/installing.html"
),
"conversion_time": round(time.time() - start_time, 2),
}
# Validate input — exactly one of markdown_path or markdown_text
if bool(markdown_path) == bool(markdown_text):
return {
"error": "Provide exactly one of markdown_path or markdown_text",
"conversion_time": round(time.time() - start_time, 2),
}
# Validate output path
output = validate_output_path(output_path)
output.parent.mkdir(parents=True, exist_ok=True)
# Detect available PDF engines on PATH
available_engines = [e for e in ENGINE_PREFERENCE if shutil.which(e)]
# Pick engine: explicit override or first available
if pdf_engine:
if not shutil.which(pdf_engine):
return {
"error": (
f"Requested PDF engine '{pdf_engine}' not found on PATH. "
f"Available engines: {available_engines or 'none'}"
),
"detected_engines": available_engines,
"conversion_time": round(time.time() - start_time, 2),
}
engine = pdf_engine
else:
if not available_engines:
return {
"error": (
"No PDF engine found on PATH. Install one of: "
+ ", ".join(ENGINE_PREFERENCE)
),
"conversion_time": round(time.time() - start_time, 2),
}
engine = available_engines[0]
# Build pandoc arguments
args: List[str] = [f"--pdf-engine={engine}"]
if toc:
args.append("--toc")
if title:
args.extend(["-M", f"title={title}"])
if author:
args.extend(["-M", f"author={author}"])
if date:
args.extend(["-M", f"date={date}"])
# Resource path for relative image refs — defaults to source dir
if base_path:
resource_dir = Path(base_path).resolve()
elif markdown_path:
resource_dir = Path(markdown_path).resolve().parent
else:
resource_dir = None
if resource_dir:
args.extend(["--resource-path", str(resource_dir)])
if extra_args:
args.extend(extra_args)
# Convert — file path or inline text
if markdown_path:
source_path = Path(markdown_path).resolve()
if not source_path.is_file():
return {
"error": f"Markdown file not found: {markdown_path}",
"conversion_time": round(time.time() - start_time, 2),
}
pypandoc.convert_file(
str(source_path),
to="pdf",
outputfile=str(output),
extra_args=args,
)
else:
pypandoc.convert_text(
markdown_text,
to="pdf",
format="md",
outputfile=str(output),
extra_args=args,
)
file_size = output.stat().st_size
return {
"output_path": str(output),
"file_size": file_size,
"file_size_kb": round(file_size / 1024, 2),
"engine_used": engine,
"detected_engines": available_engines,
"toc": toc,
"conversion_time": round(time.time() - start_time, 2),
}
except RuntimeError as e:
# pypandoc raises RuntimeError for pandoc subprocess failures —
# the message often contains the engine's stderr, which is the most
# useful signal a user can get for typesetting errors
error_msg = sanitize_error_message(str(e))
logger.error(f"Markdown to PDF conversion failed: {error_msg}")
return {
"error": f"Pandoc conversion failed: {error_msg}",
"conversion_time": round(time.time() - start_time, 2),
}
except Exception as e:
error_msg = sanitize_error_message(str(e))
logger.error(f"Markdown to PDF failed: {error_msg}")
return {
"error": error_msg,
"conversion_time": round(time.time() - start_time, 2),
}

19
uv.lock generated
View File

@ -1032,7 +1032,7 @@ wheels = [
[[package]] [[package]]
name = "mcp-pdf" name = "mcp-pdf"
version = "2.1.7" version = "2.2.0"
source = { editable = "." } source = { editable = "." }
dependencies = [ dependencies = [
{ name = "fastmcp" }, { name = "fastmcp" },
@ -1052,6 +1052,7 @@ dependencies = [
[package.optional-dependencies] [package.optional-dependencies]
all = [ all = [
{ name = "camelot-py", extra = ["cv"] }, { name = "camelot-py", extra = ["cv"] },
{ name = "pypandoc" },
{ name = "reportlab" }, { name = "reportlab" },
{ name = "tabula-py" }, { name = "tabula-py" },
] ]
@ -1069,6 +1070,9 @@ dev = [
forms = [ forms = [
{ name = "reportlab" }, { name = "reportlab" },
] ]
markdown = [
{ name = "pypandoc" },
]
tables = [ tables = [
{ name = "camelot-py", extra = ["cv"] }, { name = "camelot-py", extra = ["cv"] },
{ name = "tabula-py" }, { name = "tabula-py" },
@ -1102,6 +1106,8 @@ requires-dist = [
{ name = "pip-audit", marker = "extra == 'dev'", specifier = ">=2.0.0" }, { name = "pip-audit", marker = "extra == 'dev'", specifier = ">=2.0.0" },
{ name = "pydantic", specifier = ">=2.0.0" }, { name = "pydantic", specifier = ">=2.0.0" },
{ name = "pymupdf", specifier = ">=1.23.0" }, { name = "pymupdf", specifier = ">=1.23.0" },
{ name = "pypandoc", marker = "extra == 'all'", specifier = ">=1.13" },
{ name = "pypandoc", marker = "extra == 'markdown'", specifier = ">=1.13" },
{ name = "pypdf", specifier = ">=6.0.0" }, { name = "pypdf", specifier = ">=6.0.0" },
{ name = "pytesseract", specifier = ">=0.3.10" }, { name = "pytesseract", specifier = ">=0.3.10" },
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=7.0.0" }, { name = "pytest", marker = "extra == 'dev'", specifier = ">=7.0.0" },
@ -1115,7 +1121,7 @@ requires-dist = [
{ name = "tabula-py", marker = "extra == 'tables'", specifier = ">=2.8.0" }, { name = "tabula-py", marker = "extra == 'tables'", specifier = ">=2.8.0" },
{ name = "twine", marker = "extra == 'dev'", specifier = ">=4.0.0" }, { name = "twine", marker = "extra == 'dev'", specifier = ">=4.0.0" },
] ]
provides-extras = ["forms", "tables", "all", "dev"] provides-extras = ["forms", "tables", "markdown", "all", "dev"]
[package.metadata.requires-dev] [package.metadata.requires-dev]
dev = [ dev = [
@ -1938,6 +1944,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/4a/26/8c72973b8833a72785cedc3981eb59b8ac7075942718bbb7b69b352cdde4/pymupdf-1.26.3-cp39-abi3-win_amd64.whl", hash = "sha256:b4cd5124d05737944636cf45fc37ce5824f10e707b0342efe109c7b6bd37a9cc", size = 18735124, upload-time = "2025-07-02T21:31:10.992Z" }, { url = "https://files.pythonhosted.org/packages/4a/26/8c72973b8833a72785cedc3981eb59b8ac7075942718bbb7b69b352cdde4/pymupdf-1.26.3-cp39-abi3-win_amd64.whl", hash = "sha256:b4cd5124d05737944636cf45fc37ce5824f10e707b0342efe109c7b6bd37a9cc", size = 18735124, upload-time = "2025-07-02T21:31:10.992Z" },
] ]
[[package]]
name = "pypandoc"
version = "1.17"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/ea/d6/410615fc433e5d1eacc00db2044ae2a9c82302df0d35366fe2bd15de024d/pypandoc-1.17.tar.gz", hash = "sha256:51179abfd6e582a25ed03477541b48836b5bba5a4c3b282a547630793934d799", size = 69071, upload-time = "2026-03-14T22:39:07.21Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/0c/86/e2ffa604eacfbec3f430b1d850e7e04c4101eca1a5828f9ae54bf51dfba4/pypandoc-1.17-py3-none-any.whl", hash = "sha256:01fdbffa61edb9f8e82e8faad6954efcb7b6f8f0634aead4d89e322a00225a67", size = 23554, upload-time = "2026-03-14T22:38:46.007Z" },
]
[[package]] [[package]]
name = "pyparsing" name = "pyparsing"
version = "3.2.3" version = "3.2.3"