The PII audit run before this publish caught three files that have been
leaking operator-specific paths to PyPI in v2.1.6, v2.1.7, and v2.2.0:
- claude_desktop_config.json (personal Claude Desktop config snapshot)
- mcp-pdf-tools-launcher.sh (obsolete — uvx replaces it)
- mcp-config-example.json (had hardcoded /home/rpm path + old package name)
Fix:
- Delete the personal config and obsolete launcher
- Sanitize the example to use uvx with the [markdown] extra (matches docs)
- Add [tool.hatch.build.targets.sdist] exclude block per
~/.claude/rules/python.md to prevent recurrence — covers dev artifacts,
fixture PDFs, internal architecture notes, and CI scripts
Side benefit: sdist size dropped from 2.4 MB to 304 KB (8× reduction),
mostly from excluding examples/*.pdf and the tests/ fixture PDF.
The /home/rpm leaks in prior versions are not credentials, just operator
paths — not yanking. Going forward the unpacked-sdist grep is mandatory
before each publish.
No code changes — docs-only bump. Surfaces the rewritten README,
QUICKSTART, and LOCAL_DEVELOPMENT docs to anyone landing on
https://pypi.org/project/mcp-pdf/
CLAUDE_DESKTOP_SETUP.md was actively misleading — listed 8 tools (out of
47), referenced the old `mcp-pdf-tools` package name, and had hardcoded
user paths. README.md and QUICKSTART.md cover the same territory
correctly now, and nothing in the repo links to it.
LOCAL_DEVELOPMENT.md kept its structure (setup → wiring up → testing →
publishing → gotchas) but was updated to reflect current reality:
- `claude mcp add` syntax now uses the required `--` separator
- Three patterns shown (local source, pinned PyPI version, latest PyPI
with --refresh) since they each serve different dev workflows
- markdown_to_pdf added to manual verification checklist
- Publishing pipeline now matches what we actually do (clean dist/,
PII audit per global rules, twine for upload since uv publish
doesn't read ~/.pypirc)
- Common gotchas section: mktexfmt errors, FunctionTool test failures,
PyPI JSON caching — all real things hit during this session
- Removed claim that the server has "23 PDF tools"
texlive-xetex alone is rarely enough — pandoc's default template needs
packages from texlive-latex-extra (Debian) / texlive-latexextra (Arch):
lastpage, xcolor, framed, fancyhdr, etc. Real markdown docs fail with
"File 'X.sty' not found" without them.
Restructure system deps to present three engine routes per platform:
- tectonic (recommended): ~30 MB static binary, downloads packages on demand
- full TeX: texlive-xetex + texlive-latex-extra + texlive-fonts-extra
- weasyprint: skip TeX entirely, pip-installable
Add an engine comparison table in the README explaining the disk-size
and quality trade-offs so users can make an informed choice.
README:
- bump tool count 46 → 47, add Format Conversion bullet
- fix `claude mcp add` syntax (needs `--` separator before uvx)
- show `uvx --from "mcp-pdf[markdown]" mcp-pdf` for the new tool
- note about uvx caching + `--refresh`
- new "Format Conversion" tools subsection (markdown_to_pdf alongside pdf_to_markdown)
- new "Optional Extras" section explaining [forms], [tables], [markdown], [all]
- expand System Dependencies with Arch (pacman) and macOS (brew) recipes for
pandoc + a PDF engine
QUICKSTART:
- replace stale `mcp-pdf-tools` package name with current `mcp-pdf`
- add uvx as the recommended end-user install path
- add pip install patterns including all optional extras
- add pacman block alongside apt-get and brew
- add markdown_to_pdf troubleshooting (mktexfmt errors, engine fallback)
- add a smoke-test snippet using the new tool
New markdown_to_pdf tool in ImageProcessingMixin (sibling of
pdf_to_markdown). Accepts either a markdown file path or inline markdown
text, and writes a PDF to a caller-specified output path.
Engine selection auto-detects what's available on PATH, preferring quality:
xelatex > pdflatex > tectonic > weasyprint > wkhtmltopdf. Caller can force
a specific engine or pass raw pandoc args for advanced cases.
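The auto-detection order above can be sketched with `shutil.which`. This is an illustration under assumptions, not the tool's actual code: `pick_pdf_engine` and its `forced` and `which` parameters are hypothetical names, and `which` is injectable only so the logic can be exercised without the binaries installed.

```python
import shutil

# Preference order quoted from the commit message, best quality first.
ENGINE_PREFERENCE = ["xelatex", "pdflatex", "tectonic", "weasyprint", "wkhtmltopdf"]


def pick_pdf_engine(forced=None, which=shutil.which):
    """Return the engine to use, or raise if nothing usable is on PATH."""
    if forced is not None:
        # The caller asked for a specific engine: honor it or fail loudly.
        if which(forced) is None:
            raise RuntimeError(f"Requested PDF engine {forced!r} is not on PATH")
        return forced
    for engine in ENGINE_PREFERENCE:
        if which(engine) is not None:
            return engine
    raise RuntimeError("No PDF engine found; tried: " + ", ".join(ENGINE_PREFERENCE))
```

Raising with the full list of attempted engines keeps the "all PDF engines are missing" case easy to diagnose.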
pypandoc is gated behind a new [markdown] optional extra so the base
install stays lean. The tool surfaces clear errors if pypandoc, pandoc,
or all PDF engines are missing.
Bumps to v2.2.0 (new feature, minor bump).
- Capture total_pages before doc.close() in content_analysis,
security_analysis, annotations, and misc_tools mixins
- Fix invalid PyMuPDF font name "helv-bold" → "helv" in add_stamps
- Bump to v2.1.7
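The total_pages fix follows a general pattern: read everything needed from the handle, close it, then build the response from plain values. A minimal sketch, assuming a handle that rejects `len()` after `close()`; `FakeDocument` is a stand-in written for this example and is not the PyMuPDF class, and `analyze` is a hypothetical function name:

```python
class FakeDocument:
    """Stand-in for a PDF handle whose length is unavailable after close()."""

    def __init__(self, pages):
        self._pages = pages
        self._closed = False

    def __len__(self):
        if self._closed:
            raise ValueError("document closed")
        return self._pages

    def close(self):
        self._closed = True


def analyze(doc):
    """Capture page count while the handle is open, then close, then report."""
    total_pages = len(doc)  # must happen before doc.close()
    try:
        findings = {"has_pages": total_pages > 0}
    finally:
        doc.close()
    # Only plain Python values are touched from here on.
    return {"total_pages": total_pages, **findings}
```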
ocr_pdf: writes OCR text to file by default, returns path + preview
instead of full text dump (~17k tokens → ~500 tokens). inline=True
for old behavior.
split_pdf_by_structure: sections are now one-line summaries instead
of full path objects. Removed detected_structure dump from response.
Bookmark list was unbounded — a 346-bookmark parts manual produced
~12.5k tokens. Now returns indented bookmark preview (20 lines + count),
folds page_analysis and document_organization into structure_summary.
~406 tokens for the same document.
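A bounded, indented preview like the one described could look roughly like this. It is a sketch only: `bookmark_preview`, the `(level, title, page)` tuple shape, and the wording of the trailing count line are assumptions made for illustration.

```python
def bookmark_preview(bookmarks, max_lines=20):
    """Render at most max_lines bookmarks, indented by level, plus a count line.

    bookmarks: list of (level, title, page) tuples with level starting at 1.
    """
    lines = [
        f"{'  ' * (level - 1)}{title} (p. {page})"
        for level, title, page in bookmarks[:max_lines]
    ]
    remaining = len(bookmarks) - max_lines
    if remaining > 0:
        lines.append(f"... and {remaining} more bookmarks")
    return "\n".join(lines)
```

The response size then depends on `max_lines` rather than on how many bookmarks the document has.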
Preview capped at 10 sections as human-readable lines, detection_info
moved into the JSON file. Response went from ~22k tokens (inline) to
~1.6k (v2.1.2) to ~224 tokens now.
detect_structure now writes full JSON to disk and returns a compact
summary (~1k tokens) instead of the full structure tree (~20k tokens).
Prevents MCP context overflow on large documents. Set inline=True to
get full data in response (used internally by split_pdf_by_structure).
- Two-pass span collection includes sandwiched non-heading spans (e.g. ² in I²C)
so superscripts between heading-sized spans aren't dropped
- Join heading line parts without spaces ("".join) for proper glyph concatenation
- Cap numbering-pattern title at first newline + 80 chars with word boundary break
- Reduce _sanitize_dirname max from 80→50 chars with word-boundary truncation
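Word-boundary truncation for directory names can be sketched as below. This is not the project's `_sanitize_dirname`; the function name `sanitize_dirname`, the character whitelist, and the use of underscores as separators are all assumptions for the example.

```python
import re


def sanitize_dirname(title, max_len=50):
    """Turn a heading into a filesystem-friendly name of at most max_len chars."""
    # Replace every run of characters outside a conservative whitelist.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip()).strip("_")
    if len(name) <= max_len:
        return name
    cut = name[:max_len]
    # Break at the last separator so a word is not chopped in half.
    boundary = cut.rfind("_")
    if boundary > 0:
        cut = cut[:boundary]
    return cut.rstrip("_")
```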
New StructureDetectionMixin with 3 tools:
- detect_structure: finds chapters/sections via bookmarks, font-size
heuristics, numbering patterns, and user-supplied regex
- split_pdf_by_structure: auto-splits PDF into per-chapter directories
with markdown + images + vectors in one call
- batch_extract: process N user-specified page ranges from one PDF
Enhanced pdf_to_markdown:
- output_filename parameter for custom .md filenames
- vector_diagnostics reporting for skipped pages
- vector_fallback_raster: render sub-threshold pages as PNG at 150 DPI
Bumps version to 2.1.0
Centralize PDF size limit in security.py, controlled by MCP_PDF_MAX_SIZE
(in MB). Default: disabled (no limit). Set e.g. MCP_PDF_MAX_SIZE=500 to
cap at 500MB. Remove unused self.max_file_size from all 13 mixins.
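Reading the limit in one place might look like the following sketch. The helper names `max_pdf_bytes` and `check_size` are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption; the commit only says the value is in MB and that an unset variable means no limit.

```python
import os


def max_pdf_bytes(environ=os.environ):
    """Return the size cap in bytes, or None when MCP_PDF_MAX_SIZE is unset."""
    raw = environ.get("MCP_PDF_MAX_SIZE", "").strip()
    if not raw:
        return None  # default: disabled, no limit
    return int(float(raw) * 1024 * 1024)  # assumption: binary megabytes


def check_size(size_bytes, environ=os.environ):
    """Raise ValueError when a limit is configured and the file exceeds it."""
    limit = max_pdf_bytes(environ)
    if limit is not None and size_bytes > limit:
        raise ValueError(f"PDF is {size_bytes} bytes, over the {limit}-byte limit")
```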
Detect significant vector graphics (charts, schematics, diagrams) during
markdown conversion and extract them as full-page SVGs to vectors/ subdir.
Uses multi-tier heuristic (drawing count, path complexity, bounding box)
adapted from extract_charts to avoid false positives on decorative borders.
New params: include_vectors, vector_min_drawings, vector_min_complexity
pdf_to_markdown and extract_text now write to disk by default and return
a file path plus a short preview instead of the full content inline. This
prevents MCP context overflow on large PDFs. Set inline=True for the old
behavior.
pdf_to_markdown always extracts images to ./images/ with relative paths
(no more dead pdf-image:// URIs). extract_text writes a .txt file.
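The write-to-disk-and-preview response shape can be illustrated with a short sketch. `deliver_text`, its parameters, and the response keys are hypothetical; only the `inline=True` switch and the idea of returning a path plus a preview come from the commit message.

```python
import tempfile
from pathlib import Path


def deliver_text(text, output_path, inline=False, preview_chars=500):
    """Write text to disk and return a compact summary, or return it inline."""
    if inline:
        return {"text": text}  # old behavior: full content in the response
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return {
        "output_file": str(path),
        "characters": len(text),
        "preview": text[:preview_chars],
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        summary = deliver_text("page text " * 1000, Path(tmp) / "out" / "doc.txt")
        print(summary["characters"], len(summary["preview"]))
```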
pdf_to_markdown generated pdf-image:// URIs that never resolved — the
resource handler only existed in the legacy server. Add output_directory
parameter: when set, images extract to disk with relative ./images/ paths.
Without it, existing pdf-image:// behavior preserved for backward compat.
Also adds min_width/min_height filtering (matching extract_images),
save_markdown option, and fixes missing extract_vector_graphics in
list_capabilities.
Coordinate-based PDF form filling for scanned/flat PDFs.
reportlab is now optional and is only loaded when the permit tools are used.
Install with: pip install mcp-pdf[forms]
New extract_vector_graphics tool extracts vector graphics from PDF pages as
SVG files, supporting three modes: full_page (PyMuPDF native SVG),
drawings_only (raw vector paths), and both. Handles lines, curves,
rectangles, and quads with proper color space conversion (RGB, grayscale,
CMYK). No new dependencies.
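For the color handling, a rough sketch of mapping grayscale, RGB, and CMYK components to an SVG hex color follows. It assumes component values are floats in 0..1 and uses the naive CMYK formula without ICC profiles; `to_svg_hex` is a hypothetical helper and may differ from what the tool does.

```python
def to_svg_hex(color):
    """Convert a 1-, 3-, or 4-component color (floats in 0..1) to '#rrggbb'."""
    if len(color) == 1:      # grayscale
        r = g = b = color[0]
    elif len(color) == 3:    # RGB
        r, g, b = color
    elif len(color) == 4:    # CMYK, naive conversion (assumption: no ICC profile)
        c, m, y, k = color
        r, g, b = (1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k)
    else:
        raise ValueError(f"Unsupported color with {len(color)} components")
    # Clamp each channel before scaling to 0..255.
    channels = (round(max(0.0, min(1.0, v)) * 255) for v in (r, g, b))
    return "#" + "".join(f"{value:02x}" for value in channels)
```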
PROBLEM:
Table extraction from large PDFs was exceeding MCP's 25,000 token limit,
causing "response too large" errors. A 5-page PDF with large tables
generated 59,005 tokens, more than double the allowed limit.
SOLUTION:
Added flexible table data limiting with two new parameters:
- max_rows_per_table: Limit rows returned per table (prevents overflow)
- summary_only: Return only metadata without table data
IMPLEMENTATION:
1. Added new parameters to extract_tables() method signature
2. Created _process_table_data() helper for consistent limiting logic
3. Updated all 3 extraction methods (Camelot, pdfplumber, Tabula)
4. Enhanced table metadata with truncation tracking:
   - total_rows: Full row count from PDF
   - rows_returned: Actual rows in response (after limiting)
   - rows_truncated: Number of rows omitted (if limited)
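The limiting logic described in the steps above might be sketched like this. It is not the project's `_process_table_data`; the function name, the `data` key, and the choice to report `rows_truncated` in summary mode are assumptions. Only the two parameters and the three metadata fields come from the commit message.

```python
def process_table_data(rows, max_rows_per_table=None, summary_only=False):
    """Apply row limits to one table and attach truncation metadata."""
    total = len(rows)
    if summary_only:
        kept = []                           # metadata only, no row data
    elif max_rows_per_table is not None:
        kept = rows[:max_rows_per_table]    # first N rows
    else:
        kept = rows                         # default: everything
    result = {"total_rows": total, "rows_returned": len(kept)}
    if len(kept) < total:
        result["rows_truncated"] = total - len(kept)
    if not summary_only:
        result["data"] = kept
    return result
```

Keeping the limit in one helper lets every extraction backend report truncation the same way.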
USAGE EXAMPLES:
# Summary mode - metadata only (smallest response)
extract_tables(pdf_path, pages="1-5", summary_only=True)
# Limited data - first 100 rows per table
extract_tables(pdf_path, pages="1-5", max_rows_per_table=100)
# Full data (default behavior, may overflow on large tables)
extract_tables(pdf_path, pages="1-5")
BENEFITS:
- Prevents MCP token overflow errors
- Maintains backward compatibility (new params are optional)
- Clear guidance through metadata (shows when truncation occurred)
- Flexible - users choose between summary/limited/full modes
FILES MODIFIED:
- src/mcp_pdf/mixins_official/table_extraction.py (all changes)
- src/mcp_pdf/server.py (version bump to 2.0.7)
- pyproject.toml (version bump to 2.0.7)
VERSION: 2.0.7
PUBLISHED: https://pypi.org/project/mcp-pdf/2.0.7/
BREAKING ISSUE FIXED:
- Users reported "Output path not allowed: images" error
- extract_images tool was rejecting relative paths due to overly restrictive security
NEW SECURITY MODEL:
- MCP_PDF_ALLOWED_PATHS environment variable controls allowed output directories
- If unset: Allows any directory with "security theater" warnings
- If set: Restricts outputs to the specified paths, separated by the platform's path-list separator
- Cross-platform compatible (: on Unix, ; on Windows)
SECURITY PHILOSOPHY ENHANCED:
- "TRUST NO ONE" - honest about application-level security limitations
- Clear warnings that this is "security theater"
- Emphasis on OS-level permissions and process isolation
- Educational guidance on real security practices
TECHNICAL CHANGES:
- validate_output_path() rewritten with environment variable control
- Path validation uses relative_to() for proper containment checking
- Enhanced warning messages with security education
- Updated documentation with honest security assessment
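A containment check along these lines can be sketched with pathlib. This is an illustration, not the rewritten validate_output_path() itself: the signature and return value are assumptions, and the warning emitted when the variable is unset is reduced to a comment.

```python
import os
from pathlib import Path


def validate_output_path(path, environ=os.environ):
    """Return the resolved path if it is allowed, else raise ValueError."""
    resolved = Path(path).expanduser().resolve()
    raw = environ.get("MCP_PDF_ALLOWED_PATHS", "")
    # os.pathsep is ':' on Unix and ';' on Windows.
    allowed = [Path(p).expanduser().resolve() for p in raw.split(os.pathsep) if p]
    if not allowed:
        return resolved  # unset: any directory (the real tool warns here)
    for base in allowed:
        try:
            resolved.relative_to(base)  # raises ValueError when not contained
            return resolved
        except ValueError:
            continue
    raise ValueError(f"Output path not allowed: {path}")
```

Using `relative_to()` instead of a string prefix test means a sibling such as `/srv/output` is not mistaken for a child of `/srv/out`.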
DOCUMENTATION UPDATES:
- Added MCP_PDF_ALLOWED_PATHS to configuration section
- New "REAL Security" section with OS-level recommendations
- Clear explanation of security theater vs actual protection
Version: 1.1.1 (patch version for critical bugfix)
New Features:
- extract_links: Extract all PDF hyperlinks with advanced filtering
- Page-specific filtering (e.g., "1,3,5" or "1-5,8,10-12")
- Link type categorization: external URLs, internal pages, emails, documents
- Coordinate tracking for precise link positioning
- FastMCP integration with proper tool registration
- Version banner display following CLAUDE.md guidelines
Technical Improvements:
- Enhanced startup banner with package version display
- Updated documentation to reflect 24 specialized tools
- Proper FastMCP @mcp.tool() decorator usage
- Comprehensive error handling and security validation
Documentation Updates:
- README.md: Updated tool count and installation guides
- CLAUDE.md: Added link extraction to implemented features
- LOCAL_DEVELOPMENT.md: Enhanced with scoped installation commands
Version: 1.1.0 (minor version bump for new feature)
- Fix variable scope bug in extract_text function
- Add local development setup with claude-mcp-manager
- Update author information
- Add comprehensive local development documentation
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
**Package Rebranding:**
- Renamed package from mcp-pdf-tools to mcp-pdf (cleaner name)
- Updated version to 1.0.0 (production ready with security hardening)
- Updated all import paths and references throughout codebase
**PyPI Preparation:**
- Enhanced package description and metadata
- Added proper project URLs and homepage
- Updated CLI command from mcp-pdf-tools to mcp-pdf
- Built distribution packages (wheel + source)
**Testing & Validation:**
- All 20 security tests pass with new package structure
- Local installation and import tests successful
- CLI command working correctly
- Package ready for PyPI publication
The secure, production-ready PDF processing platform is now ready
for public distribution and installation via pip.
Implement complete collaboration toolkit with:
- add_sticky_notes: Comment annotations with color support
- add_highlights: Text highlighting with 8 color options
- add_stamps: Approval stamps (APPROVED, DRAFT, CONFIDENTIAL, etc.)
- extract_all_annotations: Export to JSON/CSV formats
Also includes document assembly features:
- merge_pdfs_advanced: Combine PDFs with bookmark preservation
- split_pdf_by_pages: Extract specific page ranges
- split_pdf_by_bookmarks: Auto-split by chapters/sections
- reorder_pdf_pages: Rearrange page sequences
All tools tested and working with proper error handling.
- Add complete PDF form lifecycle management
- Create new forms with text, checkbox, dropdown, signature fields
- Fill existing forms with JSON data and optional flattening
- Add fields to existing PDFs with flexible positioning
- Advanced field types: radio groups, textareas, date fields
- Comprehensive validation engine with regex patterns
- Email, phone, number, date format validation
- Required field checking and length constraints
- Visual validation cues with asterisks and format hints
- Multi-field error reporting with detailed feedback
- International character support and edge case handling
- Enterprise-ready for complex business forms
Implement proper MCP resource protocol for image access, eliminating the need
for clients to handle local file paths and enabling seamless image integration.
Key Features:
• MCP Resource Endpoint: pdf-image://{image_id} for direct image access
• extract_images(): Returns resource_uri field with MCP resource links
• pdf_to_markdown(): Embeds resource URIs in markdown image references
• Automatic MIME type detection (image/png, image/jpeg)
• Seamless client integration without file path handling
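One common way to detect the two MIME types named above is to inspect the file signature. A minimal sketch, assuming detection by magic bytes; the commit does not say how the server detects the type, and `detect_image_mime` is a hypothetical helper:

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"       # JPEG start-of-image marker


def detect_image_mime(data):
    """Return 'image/png', 'image/jpeg', or None for unrecognized data."""
    if data.startswith(PNG_MAGIC):
        return "image/png"
    if data.startswith(JPEG_MAGIC):
        return "image/jpeg"
    return None
```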
Benefits:
✅ Direct image access via MCP resource protocol
✅ No local file path dependencies for MCP clients
✅ Proper MIME type handling for image display
✅ Clean markdown with working image links
✅ Standards-compliant MCP resource implementation
Response Format Enhancement:
+ "resource_uri": "pdf-image://page_1_image_0"
+ Works in markdown image references that use the pdf-image:// URI as the link target
+ MIME Type: image/png or image/jpeg
+ Direct client access without file system dependencies
This resolves the limitation where extracted images were only available
as local file paths, making them truly accessible to MCP clients
through the standardized resource protocol.
Feature prominent Claude Code integration instructions:
- Add recommended one-line command for Claude Code users
- Update installation section with uvx commands
- Include git.supported.systems repository URLs
- Highlight seamless AI-powered document processing integration
Command for Claude Code users:
claude mcp add -s local legacy-files -- uvx --from git+https://git.supported.systems/MCP/mcp-legacy-files.git mcp-legacy-files
This enables direct access to all 9 vintage format processors within Claude Code
for seamless AI-enhanced document processing workflows.
**Major Enhancement**: Combined blog post storytelling with technical documentation
to create an engaging, comprehensive project showcase.
**What's New:**
📖 **Compelling Narrative**: Tells the complete story from 8 tools → 23 tools
🎯 **Real-World Examples**: Business intelligence, academic research, security workflows
🧠 **Technical Deep-Dives**: Architecture decisions, intelligent fallbacks, UX design
⚡ **Performance Insights**: Async architecture, caching strategies, resource management
🔧 **Complete Documentation**: Installation, usage, troubleshooting, contributing
**Key Sections Added:**
- "What We Built" - Project overview and use cases
- "Key Innovations" - Document intelligence, layout processing, web integration
- "Real-World Usage Examples" - 4 comprehensive workflow examples
- "Performance & Architecture" - Technical implementation details
- "Architecture Deep-Dive" - Code examples and design decisions
- "Why MCP PDF Tools?" - Value proposition and differentiators
**Impact**:
- Much more engaging for new users and contributors
- Showcases the full scope of capabilities (23 tools!)
- Provides clear guidance for different use cases
- Demonstrates technical sophistication and quality
- Perfect for sharing, contributing, and adoption
Now developers can understand not just HOW to use the tools, but WHY this
project exists and what makes it special in the PDF processing landscape.
**Problem**: Zero-based page numbers were confusing for users who naturally
think of pages starting from 1.
**Solution**:
- Updated `parse_pages_parameter()` to convert 1-based user input to 0-based internal representation
- All user-facing documentation now uses 1-based page numbering (page 1 = first page)
- Internal processing continues to use 0-based indexing for PyMuPDF compatibility
- Output page numbers are consistently displayed as 1-based for users
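The 1-based to 0-based conversion can be sketched as follows. This is not the project's `parse_pages_parameter()`; the function name `parse_pages`, the range syntax handling, and the error messages are assumptions for illustration.

```python
def parse_pages(spec):
    """Convert a 1-based spec such as '1-5,8,10-12' to sorted 0-based indexes."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        # User page N is internal index N - 1; range() excludes its end,
        # so range(start - 1, end) covers pages start..end inclusive.
        indexes.update(range(start - 1, end))
    return sorted(indexes)
```

Output shown to users would add 1 back to each index so that the first page is always reported as page 1.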
**Changes**:
- Enhanced documentation strings to clarify "1-based" page numbering
- Updated README examples with 1-based page numbers and clarifying comments
- Fixed split_pdf function to handle 1-based input correctly
- Updated test cases to verify 1-based -> 0-based conversion
- Added feature highlight: "User-Friendly: All page numbers use 1-based indexing"
**Impact**: Much more intuitive for users - no more confusion about which page is "page 0"!
Major expansion from 8 to 23 total tools covering:
**Document Analysis & Intelligence:**
- analyze_pdf_health: Comprehensive quality and health analysis
- analyze_pdf_security: Security features and vulnerability assessment
- classify_content: AI-powered document type classification
- summarize_content: Intelligent content summarization with key insights
- compare_pdfs: Advanced document comparison (text, structure, metadata)
**Layout & Visual Analysis:**
- analyze_layout: Page layout analysis with column detection
- extract_charts: Chart, diagram, and visual element extraction
- detect_watermarks: Watermark detection and analysis
**Content Manipulation:**
- extract_form_data: Interactive PDF form data extraction
- split_pdf: Split PDFs at specified pages
- merge_pdfs: Merge multiple PDFs into one
- rotate_pages: Rotate pages by 90°/180°/270°
**Optimization & Utilities:**
- convert_to_images: Convert PDF pages to image files
- optimize_pdf: File size optimization with quality levels
- repair_pdf: Corrupted PDF repair and recovery
**Technical Enhancements:**
- All tools support HTTPS URLs with intelligent caching
- Fixed MCP parameter validation for pages parameter
- Comprehensive error handling and validation
- Updated documentation with usage examples
Features:
- HTTPS URL support: Process PDFs directly from URLs with intelligent caching
- Smart caching: 1-hour cache to avoid repeated downloads
- Content validation: Verify downloads are actually PDF files
- Security: Proper User-Agent headers, HTTPS preferred over HTTP
- MCP parameter fixes: Handle pages parameter as string "[2,3]" format
- Backward compatibility: Still supports local file paths and list parameters
Technical changes:
- Added download_pdf_from_url() with caching and validation
- Updated validate_pdf_path() to handle URLs and local paths
- Added parse_pages_parameter() for flexible parameter parsing
- Updated all 8 tools to accept string pages parameters
- Enhanced error handling for network and validation issues
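The caching and content validation can be illustrated without any network access. In this sketch `DownloadCache`, `is_pdf`, and the injectable `fetch` and `clock` callables are hypothetical; the real download_pdf_from_url() may cache on disk and validate differently. Checking only the first bytes for `%PDF-` is a simplification.

```python
import time

CACHE_TTL_SECONDS = 3600  # the 1-hour cache window from the commit message


def is_pdf(data):
    """Simplified check: PDF files begin with the '%PDF-' signature."""
    return data.startswith(b"%PDF-")


class DownloadCache:
    """In-memory cache keyed by URL; fetch(url) must return bytes."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch
        self._clock = clock
        self._entries = {}  # url -> (fetched_at, data)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < CACHE_TTL_SECONDS:
            return entry[1]  # still fresh, skip the download
        data = self._fetch(url)
        if not is_pdf(data):
            raise ValueError(f"{url} did not return a PDF")
        self._entries[url] = (now, data)
        return data
```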
All tools now support:
- Local paths: "/path/to/file.pdf"
- HTTPS URLs: "https://example.com/document.pdf"
- Flexible pages: "[2,3]", "1,2,3", or [1,2,3]
Resolved README.md conflict by preserving comprehensive documentation
while maintaining repository structure from git.supported.systems
Features:
- 8 comprehensive PDF processing tools with intelligent fallbacks
- Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection)
- Table extraction (Camelot → pdfplumber → Tabula fallback chain)
- OCR processing with Tesseract and preprocessing options
- Document analysis (structure, metadata, scanned detection)
- Image extraction with filtering capabilities
- PDF to markdown conversion with metadata
- Built on FastMCP framework with full MCP protocol support
- Comprehensive error handling and user-friendly messages
- Docker support and cross-platform compatibility
- Complete test suite and examples
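The fallback chains mentioned above (for example Camelot → pdfplumber → Tabula) share one shape: try each backend in order and keep the first usable result. A generic sketch with hypothetical names; the extractor callables are placeholders and do not call the real libraries:

```python
def extract_with_fallback(extractors, pdf_path):
    """Try each (name, callable) in order; return (name, result) from the first success."""
    errors = []
    for name, extract in extractors:
        try:
            result = extract(pdf_path)
        except Exception as exc:  # a failing backend should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: no result")
    raise RuntimeError("All extractors failed: " + "; ".join(errors))
```

Collecting the per-backend errors makes the final message explain why every step in the chain was skipped.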