mcp-pdf-tools

Author	SHA1	Message	Date
Ryan Malloy	a23fd8467a	📉 File-first output for detect_structure — 20× context reduction detect_structure now writes full JSON to disk and returns a compact summary (~1k tokens) instead of the full structure tree (~20k tokens). Prevents MCP context overflow on large documents. Set inline=True to get full data in response (used internally by split_pdf_by_structure).	2026-03-04 17:12:36 -07:00
Ryan Malloy	823318ec15	✨ Chapter-aware PDF extraction: detect_structure, split_pdf_by_structure, batch_extract New StructureDetectionMixin with 3 tools: - detect_structure: finds chapters/sections via bookmarks, font-size heuristics, numbering patterns, and user-supplied regex - split_pdf_by_structure: auto-splits PDF into per-chapter directories with markdown + images + vectors in one call - batch_extract: process N user-specified page ranges from one PDF Enhanced pdf_to_markdown: - output_filename parameter for custom .md filenames - vector_diagnostics reporting for skipped pages - vector_fallback_raster: render sub-threshold pages as PNG at 150 DPI Bumps version to 2.1.0	2026-03-01 23:52:15 -07:00
Ryan Malloy	62d9b176c8	🔧 Replace hardcoded 100MB PDF limit with MCP_PDF_MAX_SIZE env var Centralize PDF size limit in security.py, controlled by MCP_PDF_MAX_SIZE (in MB). Default: disabled (no limit). Set e.g. MCP_PDF_MAX_SIZE=500 to cap at 500MB. Remove unused self.max_file_size from all 13 mixins.	2026-02-19 15:51:41 -07:00
Ryan Malloy	f759634687	✨ Smart vector extraction in pdf_to_markdown Detect significant vector graphics (charts, schematics, diagrams) during markdown conversion and extract them as full-page SVGs to vectors/ subdir. Uses multi-tier heuristic (drawing count, path complexity, bounding box) adapted from extract_charts to avoid false positives on decorative borders. New params: include_vectors, vector_min_drawings, vector_min_complexity	2026-02-18 15:29:25 -07:00
Ryan Malloy	772bcac0df	🐛 File-first output for extract_text and pdf_to_markdown Both tools now write to disk by default and return file path + short preview instead of full content inline. Prevents MCP context overflow on large PDFs. Set inline=True for the old behavior. pdf_to_markdown always extracts images to ./images/ with relative paths (no more dead pdf-image:// URIs). extract_text writes a .txt file.	2026-02-18 15:01:43 -07:00
Ryan Malloy	8b5783585f	🐛 Fix pdf_to_markdown broken image references pdf_to_markdown generated pdf-image:// URIs that never resolved — the resource handler only existed in the legacy server. Add output_directory parameter: when set, images extract to disk with relative ./images/ paths. Without it, existing pdf-image:// behavior preserved for backward compat. Also adds min_width/min_height filtering (matching extract_images), save_markdown option, and fixes missing extract_vector_graphics in list_capabilities.	2026-02-12 20:24:19 -07:00
Ryan Malloy	8cbf542df1	🔧 Fix output path security with MCP_PDF_ALLOWED_PATHS environment variable BREAKING ISSUE FIXED: - Users reported "Output path not allowed: images" error - extract_images tool was rejecting relative paths due to overly restrictive security NEW SECURITY MODEL: - MCP_PDF_ALLOWED_PATHS environment variable controls allowed output directories - If unset: Allows any directory with "security theater" warnings - If set: Restricts outputs to specified colon-separated paths - Cross-platform compatible (: on Unix, ; on Windows) SECURITY PHILOSOPHY ENHANCED: - "TRUST NO ONE" - honest about application-level security limitations - Clear warnings that this is "security theater" - Emphasis on OS-level permissions and process isolation - Educational guidance on real security practices TECHNICAL CHANGES: - validate_output_path() rewritten with environment variable control - Path validation uses relative_to() for proper containment checking - Enhanced warning messages with security education - Updated documentation with honest security assessment DOCUMENTATION UPDATES: - Added MCP_PDF_ALLOWED_PATHS to configuration section - New "REAL Security" section with OS-level recommendations - Clear explanation of security theater vs actual protection Version: 1.1.1 (patch version for critical bugfix)	2025-09-23 23:40:05 -06:00
Ryan Malloy	856dd41996	✨ Add comprehensive link extraction tool (24th PDF tool) New Features: - extract_links: Extract all PDF hyperlinks with advanced filtering - Page-specific filtering (e.g., "1,3,5" or "1-5,8,10-12") - Link type categorization: external URLs, internal pages, emails, documents - Coordinate tracking for precise link positioning - FastMCP integration with proper tool registration - Version banner display following CLAUDE.md guidelines Technical Improvements: - Enhanced startup banner with package version display - Updated documentation to reflect 24 specialized tools - Proper FastMCP @mcp.tool() decorator usage - Comprehensive error handling and security validation Documentation Updates: - README.md: Updated tool count and installation guides - CLAUDE.md: Added link extraction to implemented features - LOCAL_DEVELOPMENT.md: Enhanced with scoped installation commands Version: 1.1.0 (minor version bump for new feature)	2025-09-23 20:41:16 -06:00
Ryan Malloy	8d01c44d4f	🚀 Rename to mcp-pdf and prepare for PyPI publication Package Rebranding: - Renamed package from mcp-pdf-tools to mcp-pdf (cleaner name) - Updated version to 1.0.0 (production ready with security hardening) - Updated all import paths and references throughout codebase PyPI Preparation: - Enhanced package description and metadata - Added proper project URLs and homepage - Updated CLI command from mcp-pdf-tools to mcp-pdf - Built distribution packages (wheel + source) Testing & Validation: - All 20 security tests pass with new package structure - Local installation and import tests successful - CLI command working correctly - Package ready for PyPI publication The secure, production-ready PDF processing platform is now ready for public distribution and installation via pip. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:42:59 -06:00
Ryan Malloy	75f8548668	🔒 Comprehensive security hardening and vulnerability fixes Implemented extensive security improvements to prevent attacks and ensure production readiness: Critical Security Fixes: - Fixed path traversal vulnerability in get_pdf_image function - Added file size limits (100MB PDFs, 50MB images) to prevent DoS - Implemented secure output path validation with directory restrictions - Added page count limits (1000 pages max) for resource protection - Secured JSON parameter parsing with 10KB size limits Access Control & Validation: - URL allowlisting with SSRF protection (blocks localhost, internal IPs) - IPv6 security handling for comprehensive host blocking - Input validation framework with length limits and sanitization - Secure file permissions (0o700 dirs, 0o600 files) Error Handling & Privacy: - Sanitized error messages to prevent information disclosure - Automatic removal of sensitive patterns (paths, emails, SSNs) - Generic error responses for failed operations Infrastructure & Monitoring: - Added security scanning tools (safety, pip-audit) - GitHub Actions workflow for continuous vulnerability monitoring - Daily automated security assessments - Fixed pypdf vulnerability (5.9.0 → 6.0.0) Testing & Validation: - 20 comprehensive security tests (all passing) - Integration tests confirming functionality preservation - Zero known vulnerabilities in dependencies - Validated all security functions work correctly All security measures tested and verified. Project now production-ready with enterprise-grade security posture. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:35:31 -06:00
Ryan Malloy	ab1d9ed13e	✨ Add comprehensive PDF annotations and markup tools Implement complete collaboration toolkit with: - add_sticky_notes: Comment annotations with color support - add_highlights: Text highlighting with 8 color options - add_stamps: Approval stamps (APPROVED, DRAFT, CONFIDENTIAL, etc.) - extract_all_annotations: Export to JSON/CSV formats Also includes document assembly features: - merge_pdfs_advanced: Combine PDFs with bookmark preservation - split_pdf_by_pages: Extract specific page ranges - split_pdf_by_bookmarks: Auto-split by chapters/sections - reorder_pdf_pages: Rearrange page sequences All tools tested and working with proper error handling.	2025-09-04 17:18:06 -06:00
Ryan Malloy	95596e0236	✨ Add comprehensive PDF form creation and validation tools - Add complete PDF form lifecycle management - Create new forms with text, checkbox, dropdown, signature fields - Fill existing forms with JSON data and optional flattening - Add fields to existing PDFs with flexible positioning - Advanced field types: radio groups, textareas, date fields - Comprehensive validation engine with regex patterns - Email, phone, number, date format validation - Required field checking and length constraints - Visual validation cues with asterisks and format hints - Multi-field error reporting with detailed feedback - International character support and edge case handling - Enterprise-ready for complex business forms	2025-09-03 02:33:01 -06:00
Ryan Malloy	ae80388ec4	🎯 Add custom output paths and clean summary for image extraction Enhance extract_images with user-specified output directories and concise summary responses to improve user control and reduce context window clutter. Key Features: • Custom Output Directory: Users can specify where images are saved • Clean Summary Output: Concise extraction results instead of verbose metadata • Automatic Directory Creation: Creates output directories as needed • File-Level Details: Individual file info with human-readable sizes • Extraction Summary: Quick overview with total size and file count New Parameters: + output_directory: Optional custom path for saving extracted images + Defaults to cache directory if not specified + Creates directories automatically with proper permissions Response Format: - Removed: Verbose image metadata arrays that fill context windows + Added: Clean summary with extraction statistics + Added: File list with essential details (filename, path, size, dimensions) + Added: Human-readable extraction summary Benefits: ✅ User control over image file locations ✅ Reduced context window pollution ✅ Essential information without verbosity ✅ Better integration with user workflows ✅ Maintains MCP resource compatibility for cached images Example Response: { "success": true, "images_extracted": 3, "total_size": "2.4 MB", "output_directory": "/path/to/custom/dir", "files": [{"filename": "page_1_image_0.png", "path": "/path/...", "size": "800 KB", "dimensions": "1920x1080"}] } 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 13:50:09 -06:00
Ryan Malloy	e087a3b7a0	✨ Add MCP resource URIs for extracted PDF images Implement proper MCP resource protocol for image access, eliminating the need for clients to handle local file paths and enabling seamless image integration. Key Features: • MCP Resource Endpoint: pdf-image://{image_id} for direct image access • extract_images(): Returns resource_uri field with MCP resource links • pdf_to_markdown(): Embeds resource URIs in markdown image references • Automatic MIME type detection (image/png, image/jpeg) • Seamless client integration without file path handling Benefits: ✅ Direct image access via MCP resource protocol ✅ No local file path dependencies for MCP clients ✅ Proper MIME type handling for image display ✅ Clean markdown with working image links ✅ Standards-compliant MCP resource implementation Response Format Enhancement: + "resource_uri": "pdf-image://page_1_image_0" + Works in markdown: \![Image](pdf-image://page_1_image_0) + MIME Type: image/png or image/jpeg + Direct client access without file system dependencies This resolves the limitation where extracted images were only available as local file paths, making them truly accessible to MCP clients through the standardized resource protocol. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:42:46 -06:00
Ryan Malloy	374339a15d	🔧 Fix verbose base64 output in image extraction functions Resolve MCP client context overflow by saving images to files instead of returning base64-encoded data that fills client message windows. Key Changes: • extract_images(): Save images to CACHE_DIR with file paths in response • pdf_to_markdown(): Save embedded images to files with path references • Add format_file_size() utility for human-readable file sizes • Update function descriptions to clarify file-based output Benefits: ✅ Prevents context message window overflow in MCP clients ✅ Returns clean, concise metadata with file paths ✅ Maintains full image access through saved files ✅ Improves user experience with readable file sizes ✅ Reduces memory usage and response payload sizes Response Format Changes: - Remove: "data": "<base64_string>" (verbose) + Add: "file_path": "/tmp/mcp-pdf-processing/image.png" + Add: "filename": "page_1_image_0.png" + Add: "size_bytes": 12345 + Add: "size_human": "12.1 KB" This resolves the issue where image extraction caused excessive verbose output that overwhelmed MCP client interfaces. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:34:42 -06:00
Ryan Malloy	c902e81e4d	Initial commit: Complete MCP PDF Tools server implementation Features: - 8 comprehensive PDF processing tools with intelligent fallbacks - Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection) - Table extraction (Camelot → pdfplumber → Tabula fallback chain) - OCR processing with Tesseract and preprocessing options - Document analysis (structure, metadata, scanned detection) - Image extraction with filtering capabilities - PDF to markdown conversion with metadata - Built on FastMCP framework with full MCP protocol support - Comprehensive error handling and user-friendly messages - Docker support and cross-platform compatibility - Complete test suite and examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 16:36:21 -06:00
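
Commit 8cbf542df1 above describes output-path validation driven by MCP_PDF_ALLOWED_PATHS, with relative_to() used for containment checking. The following is a minimal sketch of that idea, not the project's actual validate_output_path(). The helper names parse_allowed_paths and is_path_allowed are hypothetical, and splitting on os.pathsep is an assumption based on the commit's ": on Unix, ; on Windows" note.

```python
import os
from pathlib import Path


def parse_allowed_paths(raw):
    """Split an MCP_PDF_ALLOWED_PATHS-style value into resolved directories.

    Hypothetical helper. Returns None when the value is unset or empty,
    which this sketch treats as "no restriction".
    """
    if not raw:
        return None
    # os.pathsep is ":" on Unix and ";" on Windows.
    return [Path(p).expanduser().resolve() for p in raw.split(os.pathsep) if p.strip()]


def is_path_allowed(candidate, allowed_dirs):
    """Return True if candidate resolves to a location inside an allowed directory."""
    if allowed_dirs is None:
        return True
    # Resolve first so ".." segments cannot escape the allowed directory.
    resolved = Path(candidate).resolve()
    for base in allowed_dirs:
        try:
            # relative_to() raises ValueError when resolved is not under base.
            resolved.relative_to(base)
            return True
        except ValueError:
            continue
    return False
```

Comparing resolved paths with relative_to() avoids the false positive a plain string-prefix check would give for a sibling directory such as out2 when only out is allowed. As the commit itself stresses, a check like this is application-level only and does not replace OS-level permissions.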

16 Commits
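
Commit 856dd41996 mentions page-specific filtering with specs such as "1,3,5" or "1-5,8,10-12". The sketch below shows one way such a spec could be expanded into page numbers. The function name parse_page_spec is hypothetical, and treating pages as 1-based with inclusive ranges is an assumption; the log does not say how extract_links interprets the string.

```python
def parse_page_spec(spec):
    """Expand a spec like "1-5,8,10-12" into a sorted list of unique page numbers.

    Hypothetical helper: assumes 1-based pages and inclusive ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and whitespace
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range start exceeds end: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers must be 1 or greater")
    return sorted(pages)


print(parse_page_spec("1-5,8,10-12"))
```

Collecting into a set before sorting means overlapping entries such as "1-3,2-4" yield each page once.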