mcp-pdf-tools

Author	SHA1	Message	Date
Ryan Malloy	48c44e941c	v2.2.1: Republish for updated README on PyPI No code changes — docs-only bump. Surfaces the rewritten README, QUICKSTART, and LOCAL_DEVELOPMENT docs to anyone landing on https://pypi.org/project/mcp-pdf/	2026-05-05 17:36:39 -06:00
Ryan Malloy	b2d9073f04	Add markdown_to_pdf tool — convert .md to PDF via pandoc New tool in ImageProcessingMixin (sibling of pdf_to_markdown). Accepts either a markdown file path or inline markdown text, writes a PDF to a caller-specified output path. Engine selection auto-detects what's available on PATH, preferring quality: xelatex > pdflatex > tectonic > weasyprint > wkhtmltopdf. Caller can force a specific engine or pass raw pandoc args for advanced cases. pypandoc is gated behind a new [markdown] optional extra so the base install stays lean. The tool surfaces clear errors if pypandoc, pandoc, or all PDF engines are missing. Bumps to v2.2.0 (new feature, minor bump).	2026-05-05 16:21:09 -06:00
Ryan Malloy	0eea85f352	Sync uv.lock to v2.1.7	2026-04-25 10:47:43 -06:00
Ryan Malloy	b53d8ab998	Fix document-closed errors in 7 tools, fix stamp font name - Capture total_pages before doc.close() in content_analysis, security_analysis, annotations, and misc_tools mixins - Fix invalid PyMuPDF font name "helv-bold" → "helv" in add_stamps - Bump to v2.1.7	2026-04-07 04:19:20 -06:00
Ryan Malloy	057aa5be40	📉 File-first output for ocr_pdf, slim split_pdf_by_structure response ocr_pdf: writes OCR text to file by default, returns path + preview instead of full text dump (~17k tokens → ~500 tokens). inline=True for old behavior. split_pdf_by_structure: sections are now one-line summaries instead of full path objects. Removed detected_structure dump from response.	2026-03-08 05:30:57 -06:00
Ryan Malloy	d413438fea	📦 Make camelot-py and tabula-py optional dependencies Moves camelot-py[cv] and tabula-py from core to optional deps (pip install mcp-pdf[tables]). Fixes Python 3.14 install failure caused by pdftopng lacking cp314 wheels. - Lazy-import camelot/tabula in all extraction methods - Auto-fallback skips unavailable methods in table extraction - pdfplumber (pure Python, always available) handles tables by default - Also slims get_document_structure response (~12.5k → ~400 tokens)	2026-03-08 03:20:01 -06:00
Ryan Malloy	6af3104633	📉 Slim get_document_structure: cap bookmarks to 20 preview lines Bookmark list was unbounded — a 346-bookmark parts manual produced ~12.5k tokens. Now returns indented bookmark preview (20 lines + count), folds page_analysis and document_organization into structure_summary. ~406 tokens for the same document.	2026-03-06 21:26:30 -07:00
Ryan Malloy	81a3619144	📉 Slim detect_structure response to ~224 tokens Preview capped at 10 sections as human-readable lines, detection_info moved into the JSON file. Response went from ~22k tokens (inline) to ~1.6k (v2.1.2) to ~224 tokens now.	2026-03-04 17:15:32 -07:00
Ryan Malloy	a23fd8467a	📉 File-first output for detect_structure — 20× context reduction detect_structure now writes full JSON to disk and returns a compact summary (~1k tokens) instead of the full structure tree (~20k tokens). Prevents MCP context overflow on large documents. Set inline=True to get full data in response (used internally by split_pdf_by_structure).	2026-03-04 17:12:36 -07:00
Ryan Malloy	56ab8356bc	🐛 Fix superscript handling and directory name truncation in detect_structure - Two-pass span collection includes sandwiched non-heading spans (e.g. ² in I²C) so superscripts between heading-sized spans aren't dropped - Join heading line parts without spaces ("".join) for proper glyph concatenation - Cap numbering-pattern title at first newline + 80 chars with word boundary break - Reduce _sanitize_dirname max from 80→50 chars with word-boundary truncation	2026-03-02 02:14:26 -07:00
Ryan Malloy	5161a5f952	🚀 v2.0.14: Configurable PDF size limit via MCP_PDF_MAX_SIZE	2026-02-19 15:52:00 -07:00
Ryan Malloy	38af9ee2c9	🚀 v2.0.13: Smart vector extraction in pdf_to_markdown	2026-02-18 15:29:41 -07:00
Ryan Malloy	271e4c71d6	🔧 v2.0.9: Remove unreleased permit_forms mixin that broke PyPI install	2026-02-08 13:48:32 -07:00
Ryan Malloy	e4f77008bb	🚀 v2.0.8: Add extract_vector_graphics tool for PDF to SVG extraction New tool extracts vector graphics from PDF pages as SVG files, supporting three modes: full_page (PyMuPDF native SVG), drawings_only (raw vector paths), and both. Handles lines, curves, rectangles, quads with proper color space conversion (RGB, grayscale, CMYK). No new dependencies.	2026-02-02 13:56:17 -07:00
Ryan Malloy	dfbf3d1870	🔧 v2.0.7: Fix table extraction token overflow with smart limiting PROBLEM: Table extraction from large PDFs was exceeding MCP's 25,000 token limit, causing "response too large" errors. A 5-page PDF with large tables generated 59,005 tokens, more than double the allowed limit. SOLUTION: Added flexible table data limiting with two new parameters: - max_rows_per_table: Limit rows returned per table (prevents overflow) - summary_only: Return only metadata without table data IMPLEMENTATION: 1. Added new parameters to extract_tables() method signature 2. Created _process_table_data() helper for consistent limiting logic 3. Updated all 3 extraction methods (Camelot, pdfplumber, Tabula) 4. Enhanced table metadata with truncation tracking: - total_rows: Full row count from PDF - rows_returned: Actual rows in response (after limiting) - rows_truncated: Number of rows omitted (if limited) USAGE EXAMPLES: # Summary mode - metadata only (smallest response) extract_tables(pdf_path, pages="1-5", summary_only=True) # Limited data - first 100 rows per table extract_tables(pdf_path, pages="1-5", max_rows_per_table=100) # Full data (default behavior, may overflow on large tables) extract_tables(pdf_path, pages="1-5") BENEFITS: - Prevents MCP token overflow errors - Maintains backward compatibility (new params are optional) - Clear guidance through metadata (shows when truncation occurred) - Flexible - users choose between summary/limited/full modes FILES MODIFIED: - src/mcp_pdf/mixins_official/table_extraction.py (all changes) - src/mcp_pdf/server.py (version bump to 2.0.7) - pyproject.toml (version bump to 2.0.7) VERSION: 2.0.7 PUBLISHED: https://pypi.org/project/mcp-pdf/2.0.7/	2025-11-03 18:26:34 -07:00
Ryan Malloy	fa65fa6e0c	🔧 v2.0.6: Fix async/await bug in validate_output_path calls Remove incorrect 'await' keywords from validate_output_path() calls across all mixins. validate_output_path() is a synchronous function, not async. Fixed in 15 locations across 6 mixins: - advanced_forms.py (4 calls) - annotations.py (3 calls) - document_assembly.py (2 calls) - form_management.py (2 calls) - image_processing.py (1 call) - misc_tools.py (4 calls) Error: 'object PosixPath can't be used in 'await' expression' Root cause: Incorrectly awaiting synchronous Path validation function Fix: Removed await keyword from all validate_output_path() calls PyPI: https://pypi.org/project/mcp-pdf/2.0.6/	2025-11-03 18:03:34 -07:00
Ryan Malloy	3327137536	🚀 v2.0.5: Fix page range parsing across all PDF tools Major architectural improvements and bug fixes in the v2.0.x series: ## v2.0.5 - Page Range Parsing (Current Release) - Fix page range parsing bug affecting 6 mixins (e.g., "93-95" or "11-30") - Create shared parse_pages_parameter() utility function - Support mixed formats: "1,3-5,7,10-15" - Update: pdf_utilities, content_analysis, image_processing, misc_tools, table_extraction, text_extraction ## v2.0.4 - Chunk Hint Fix - Fix next_chunk_hint to show correct page ranges - Dynamic calculation based on actual pages being extracted - Example: "30-50" now correctly shows "40-49" for next chunk ## v2.0.3 - Initial Range Support - Add page range support to text extraction ("11-30") - Fix _parse_pages_parameter to handle ranges with Python's range() - Convert 1-based user input to 0-based internal indexing ## v2.0.2 - Lazy Import Fix - Fix ModuleNotFoundError for reportlab on startup - Implement lazy imports for optional dependencies - Graceful degradation with helpful error messages ## v2.0.1 - Dependency Restructuring - Move reportlab to optional [forms] extra - Document installation: uvx --with mcp-pdf[forms] mcp-pdf ## v2.0.0 - Official FastMCP Pattern Migration - Migrate to official fastmcp.contrib.mcp_mixin pattern - Create 12 specialized mixins with 42 tools total - Architecture: mixins_official/ using MCPMixin base class - Backwards compatibility: server_legacy.py preserved Technical Improvements: - Centralized utility functions (DRY principle) - Consistent behavior across all PDF tools - Better error messages with actionable instructions - Library-specific adapters for table extraction Files Changed: - New: src/mcp_pdf/mixins_official/utils.py (shared utilities) - Updated: 6 mixins with improved page parsing - Version: pyproject.toml, server.py → 2.0.5 PyPI: https://pypi.org/project/mcp-pdf/2.0.5/	2025-11-03 17:12:37 -07:00
Ryan Malloy	8cbf542df1	🔧 Fix output path security with MCP_PDF_ALLOWED_PATHS environment variable BREAKING ISSUE FIXED: - Users reported "Output path not allowed: images" error - extract_images tool was rejecting relative paths due to overly restrictive security NEW SECURITY MODEL: - MCP_PDF_ALLOWED_PATHS environment variable controls allowed output directories - If unset: Allows any directory with "security theater" warnings - If set: Restricts outputs to specified colon-separated paths - Cross-platform compatible (: on Unix, ; on Windows) SECURITY PHILOSOPHY ENHANCED: - "TRUST NO ONE" - honest about application-level security limitations - Clear warnings that this is "security theater" - Emphasis on OS-level permissions and process isolation - Educational guidance on real security practices TECHNICAL CHANGES: - validate_output_path() rewritten with environment variable control - Path validation uses relative_to() for proper containment checking - Enhanced warning messages with security education - Updated documentation with honest security assessment DOCUMENTATION UPDATES: - Added MCP_PDF_ALLOWED_PATHS to configuration section - New "REAL Security" section with OS-level recommendations - Clear explanation of security theater vs actual protection Version: 1.1.1 (patch version for critical bugfix)	2025-09-23 23:40:05 -06:00
Ryan Malloy	ebf6bb8a43	🚀 Release v1.0.1: Bug fixes and local development tools - Fix variable scope bug in extract_text function - Add local development setup with claude-mcp-manager - Update author information - Add comprehensive local development documentation 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-07 00:58:51 -06:00
Ryan Malloy	8d01c44d4f	🚀 Rename to mcp-pdf and prepare for PyPI publication Package Rebranding: - Renamed package from mcp-pdf-tools to mcp-pdf (cleaner name) - Updated version to 1.0.0 (production ready with security hardening) - Updated all import paths and references throughout codebase PyPI Preparation: - Enhanced package description and metadata - Added proper project URLs and homepage - Updated CLI command from mcp-pdf-tools to mcp-pdf - Built distribution packages (wheel + source) Testing & Validation: - All 20 security tests pass with new package structure - Local installation and import tests successful - CLI command working correctly - Package ready for PyPI publication The secure, production-ready PDF processing platform is now ready for public distribution and installation via pip. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:42:59 -06:00
Ryan Malloy	75f8548668	🔒 Comprehensive security hardening and vulnerability fixes Implemented extensive security improvements to prevent attacks and ensure production readiness: Critical Security Fixes: - Fixed path traversal vulnerability in get_pdf_image function - Added file size limits (100MB PDFs, 50MB images) to prevent DoS - Implemented secure output path validation with directory restrictions - Added page count limits (1000 pages max) for resource protection - Secured JSON parameter parsing with 10KB size limits Access Control & Validation: - URL allowlisting with SSRF protection (blocks localhost, internal IPs) - IPv6 security handling for comprehensive host blocking - Input validation framework with length limits and sanitization - Secure file permissions (0o700 dirs, 0o600 files) Error Handling & Privacy: - Sanitized error messages to prevent information disclosure - Automatic removal of sensitive patterns (paths, emails, SSNs) - Generic error responses for failed operations Infrastructure & Monitoring: - Added security scanning tools (safety, pip-audit) - GitHub Actions workflow for continuous vulnerability monitoring - Daily automated security assessments - Fixed pypdf vulnerability (5.9.0 → 6.0.0) Testing & Validation: - 20 comprehensive security tests (all passing) - Integration tests confirming functionality preservation - Zero known vulnerabilities in dependencies - Validated all security functions work correctly All security measures tested and verified. Project now production-ready with enterprise-grade security posture. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:35:31 -06:00
Ryan Malloy	95596e0236	✨ Add comprehensive PDF form creation and validation tools - Add complete PDF form lifecycle management - Create new forms with text, checkbox, dropdown, signature fields - Fill existing forms with JSON data and optional flattening - Add fields to existing PDFs with flexible positioning - Advanced field types: radio groups, textareas, date fields - Comprehensive validation engine with regex patterns - Email, phone, number, date format validation - Required field checking and length constraints - Visual validation cues with asterisks and format hints - Multi-field error reporting with detailed feedback - International character support and edge case handling - Enterprise-ready for complex business forms	2025-09-03 02:33:01 -06:00
Ryan Malloy	c902e81e4d	Initial commit: Complete MCP PDF Tools server implementation Features: - 8 comprehensive PDF processing tools with intelligent fallbacks - Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection) - Table extraction (Camelot → pdfplumber → Tabula fallback chain) - OCR processing with Tesseract and preprocessing options - Document analysis (structure, metadata, scanned detection) - Image extraction with filtering capabilities - PDF to markdown conversion with metadata - Built on FastMCP framework with full MCP protocol support - Comprehensive error handling and user-friendly messages - Docker support and cross-platform compatibility - Complete test suite and examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 16:36:21 -06:00

23 Commits
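
Commit 3327137536 (v2.0.5) describes page specs such as "1,3-5,7,10-15" being parsed by a shared utility, with 1-based user input converted to 0-based internal indexing. The sketch below illustrates that idea only. The function name `parse_pages`, its return type, and its error handling are assumptions made for illustration; this is not the project's `parse_pages_parameter()` implementation, which is not shown in this log.

```python
def parse_pages(spec: str) -> list[int]:
    """Turn a 1-based page spec like "1,3-5,7" into sorted, unique 0-based indices.

    Hypothetical helper for illustration; not taken from mcp-pdf.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            # A range like "3-5" is inclusive on both ends for the user.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            # range() excludes its stop value, so (start - 1, end)
            # yields the 0-based indices for pages start..end.
            pages.update(range(start - 1, end))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page - 1)
    return sorted(pages)


print(parse_pages("1,3-5,7,10-15"))
print(parse_pages("93-95"))
```

Collecting indices in a set before sorting means overlapping specs such as "1-3,2-4" do not yield duplicate pages. Whether the real utility deduplicates or validates input this way is not stated in the commit message.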