mcp-pdf-tools

Author	SHA1	Message	Date
Ryan Malloy	4090c788a2	Strip operator-private files from sdist + add structural defense The PII audit run before this publish caught three files that have been leaking operator-specific paths to PyPI in v2.1.6, v2.1.7, and v2.2.0: - claude_desktop_config.json (personal Claude Desktop config snapshot) - mcp-pdf-tools-launcher.sh (obsolete — uvx replaces it) - mcp-config-example.json (had hardcoded /home/rpm path + old package name) Fix: - Delete the personal config and obsolete launcher - Sanitize the example to use uvx with the [markdown] extra (matches docs) - Add [tool.hatch.build.targets.sdist] exclude block per ~/.claude/rules/python.md to prevent recurrence — covers dev artifacts, fixture PDFs, internal architecture notes, and CI scripts Side benefit: sdist size dropped from 2.4 MB to 304 KB (8× reduction), mostly from excluding examples/*.pdf and the tests/ fixture PDF. The /home/rpm leaks in prior versions are not credentials, just operator paths — not yanking. Going forward the unpacked-sdist grep is mandatory before each publish.	2026-05-05 17:38:13 -06:00
Ryan Malloy	48c44e941c	v2.2.1: Republish for updated README on PyPI No code changes — docs-only bump. Surfaces the rewritten README, QUICKSTART, and LOCAL_DEVELOPMENT docs to anyone landing on https://pypi.org/project/mcp-pdf/	2026-05-05 17:36:39 -06:00
Ryan Malloy	c3dd788120	docs: rewrite LOCAL_DEVELOPMENT.md, delete stale CLAUDE_DESKTOP_SETUP.md CLAUDE_DESKTOP_SETUP.md was actively misleading — listed 8 tools (out of 47), referenced the old `mcp-pdf-tools` package name, and had hardcoded user paths. README.md and QUICKSTART.md cover the same territory correctly now, and nothing in the repo links to it. LOCAL_DEVELOPMENT.md kept its structure (setup → wiring up → testing → publishing → gotchas) but updated to reflect current reality: - `claude mcp add` syntax now uses the required `--` separator - Three patterns shown (local source, pinned PyPI version, latest PyPI with --refresh) since they each serve different dev workflows - markdown_to_pdf added to manual verification checklist - Publishing pipeline now matches what we actually do (clean dist/, PII audit per global rules, twine for upload since uv publish doesn't read ~/.pypirc) - Common gotchas section: mktexfmt errors, FunctionTool test failures, PyPI JSON caching — all real things hit during this session - Removed claim that the server has "23 PDF tools"	2026-05-05 17:23:14 -06:00
Ryan Malloy	31b8b2e6d4	docs: flag texlive-latex-extra requirement, recommend tectonic texlive-xetex alone is rarely enough — pandoc's default template needs packages from texlive-latex-extra (Debian) / texlive-latexextra (Arch): lastpage, xcolor, framed, fancyhdr, etc. Real markdown docs fail with "File 'X.sty' not found" without them. Restructure system deps to present three engine routes per platform: - tectonic (recommended): ~30 MB static binary, downloads packages on demand - full TeX: texlive-xetex + texlive-latex-extra + texlive-fonts-extra - weasyprint: skip TeX entirely, pip-installable Add an engine comparison table in the README explaining the disk-size and quality trade-offs so users can pick informed.	2026-05-05 16:29:05 -06:00
Ryan Malloy	964fd14a26	docs: cover markdown_to_pdf, [markdown] extra, uvx + pacman install README: - bump tool count 46 → 47, add Format Conversion bullet - fix `claude mcp add` syntax (needs `--` separator before uvx) - show `uvx --from "mcp-pdf[markdown]" mcp-pdf` for the new tool - note about uvx caching + `--refresh` - new "Format Conversion" tools subsection (markdown_to_pdf alongside pdf_to_markdown) - new "Optional Extras" section explaining [forms], [tables], [markdown], [all] - expand System Dependencies with Arch (pacman) and macOS (brew) recipes for pandoc + a PDF engine QUICKSTART: - replace stale `mcp-pdf-tools` package name with current `mcp-pdf` - add uvx as the recommended end-user install path - add pip install patterns including all optional extras - add pacman block alongside apt-get and brew - add markdown_to_pdf troubleshooting (mktexfmt errors, engine fallback) - add a smoke-test snippet using the new tool	2026-05-05 16:27:28 -06:00
Ryan Malloy	b2d9073f04	Add markdown_to_pdf tool — convert .md to PDF via pandoc New tool in ImageProcessingMixin (sibling of pdf_to_markdown). Accepts either a markdown file path or inline markdown text, writes a PDF to a caller-specified output path. Engine selection auto-detects what's available on PATH, preferring quality: xelatex > pdflatex > tectonic > weasyprint > wkhtmltopdf. Caller can force a specific engine or pass raw pandoc args for advanced cases. pypandoc is gated behind a new [markdown] optional extra so the base install stays lean. The tool surfaces clear errors if pypandoc, pandoc, or all PDF engines are missing. Bumps to v2.2.0 (new feature, minor bump).	2026-05-05 16:21:09 -06:00
Ryan Malloy	0eea85f352	Sync uv.lock to v2.1.7	2026-04-25 10:47:43 -06:00
Ryan Malloy	b53d8ab998	Fix document-closed errors in 7 tools, fix stamp font name - Capture total_pages before doc.close() in content_analysis, security_analysis, annotations, and misc_tools mixins - Fix invalid PyMuPDF font name "helv-bold" → "helv" in add_stamps - Bump to v2.1.7	2026-04-07 04:19:20 -06:00
Ryan Malloy	057aa5be40	📉 File-first output for ocr_pdf, slim split_pdf_by_structure response ocr_pdf: writes OCR text to file by default, returns path + preview instead of full text dump (~17k tokens → ~500 tokens). inline=True for old behavior. split_pdf_by_structure: sections are now one-line summaries instead of full path objects. Removed detected_structure dump from response.	2026-03-08 05:30:57 -06:00
Ryan Malloy	d413438fea	📦 Make camelot-py and tabula-py optional dependencies Moves camelot-py[cv] and tabula-py from core to optional deps (pip install mcp-pdf[tables]). Fixes Python 3.14 install failure caused by pdftopng lacking cp314 wheels. - Lazy-import camelot/tabula in all extraction methods - Auto-fallback skips unavailable methods in table extraction - pdfplumber (pure Python, always available) handles tables by default - Also slims get_document_structure response (~12.5k → ~400 tokens)	2026-03-08 03:20:01 -06:00
Ryan Malloy	6af3104633	📉 Slim get_document_structure: cap bookmarks to 20 preview lines Bookmark list was unbounded — a 346-bookmark parts manual produced ~12.5k tokens. Now returns indented bookmark preview (20 lines + count), folds page_analysis and document_organization into structure_summary. ~406 tokens for the same document.	2026-03-06 21:26:30 -07:00
Ryan Malloy	a1aa3f7363	🚀 v2.1.3: bump version for PyPI (2.1.2 was already published)	2026-03-04 17:15:48 -07:00
Ryan Malloy	81a3619144	📉 Slim detect_structure response to ~224 tokens Preview capped at 10 sections as human-readable lines, detection_info moved into the JSON file. Response went from ~22k tokens (inline) to ~1.6k (v2.1.2) to ~224 tokens now.	2026-03-04 17:15:32 -07:00
Ryan Malloy	a23fd8467a	📉 File-first output for detect_structure — 20× context reduction detect_structure now writes full JSON to disk and returns a compact summary (~1k tokens) instead of the full structure tree (~20k tokens). Prevents MCP context overflow on large documents. Set inline=True to get full data in response (used internally by split_pdf_by_structure).	2026-03-04 17:12:36 -07:00
Ryan Malloy	56ab8356bc	🐛 Fix superscript handling and directory name truncation in detect_structure - Two-pass span collection includes sandwiched non-heading spans (e.g. ² in I²C) so superscripts between heading-sized spans aren't dropped - Join heading line parts without spaces ("".join) for proper glyph concatenation - Cap numbering-pattern title at first newline + 80 chars with word boundary break - Reduce _sanitize_dirname max from 80→50 chars with word-boundary truncation	2026-03-02 02:14:26 -07:00
Ryan Malloy	823318ec15	✨ Chapter-aware PDF extraction: detect_structure, split_pdf_by_structure, batch_extract New StructureDetectionMixin with 3 tools: - detect_structure: finds chapters/sections via bookmarks, font-size heuristics, numbering patterns, and user-supplied regex - split_pdf_by_structure: auto-splits PDF into per-chapter directories with markdown + images + vectors in one call - batch_extract: process N user-specified page ranges from one PDF Enhanced pdf_to_markdown: - output_filename parameter for custom .md filenames - vector_diagnostics reporting for skipped pages - vector_fallback_raster: render sub-threshold pages as PNG at 150 DPI Bumps version to 2.1.0	2026-03-01 23:52:15 -07:00
Ryan Malloy	5161a5f952	🚀 v2.0.14: Configurable PDF size limit via MCP_PDF_MAX_SIZE	2026-02-19 15:52:00 -07:00
Ryan Malloy	62d9b176c8	🔧 Replace hardcoded 100MB PDF limit with MCP_PDF_MAX_SIZE env var Centralize PDF size limit in security.py, controlled by MCP_PDF_MAX_SIZE (in MB). Default: disabled (no limit). Set e.g. MCP_PDF_MAX_SIZE=500 to cap at 500MB. Remove unused self.max_file_size from all 13 mixins.	2026-02-19 15:51:41 -07:00
Ryan Malloy	38af9ee2c9	🚀 v2.0.13: Smart vector extraction in pdf_to_markdown	2026-02-18 15:29:41 -07:00
Ryan Malloy	f759634687	✨ Smart vector extraction in pdf_to_markdown Detect significant vector graphics (charts, schematics, diagrams) during markdown conversion and extract them as full-page SVGs to vectors/ subdir. Uses multi-tier heuristic (drawing count, path complexity, bounding box) adapted from extract_charts to avoid false positives on decorative borders. New params: include_vectors, vector_min_drawings, vector_min_complexity	2026-02-18 15:29:25 -07:00
Ryan Malloy	213a721949	🚀 v2.0.12: File-first output for extract_text and pdf_to_markdown	2026-02-18 15:02:04 -07:00
Ryan Malloy	772bcac0df	🐛 File-first output for extract_text and pdf_to_markdown Both tools now write to disk by default and return file path + short preview instead of full content inline. Prevents MCP context overflow on large PDFs. Set inline=True for the old behavior. pdf_to_markdown always extracts images to ./images/ with relative paths (no more dead pdf-image:// URIs). extract_text writes a .txt file.	2026-02-18 15:01:43 -07:00
Ryan Malloy	2d5f7e241d	🚀 v2.0.11: Fix pdf_to_markdown broken image references	2026-02-12 20:24:40 -07:00
Ryan Malloy	8b5783585f	🐛 Fix pdf_to_markdown broken image references pdf_to_markdown generated pdf-image:// URIs that never resolved — the resource handler only existed in the legacy server. Add output_directory parameter: when set, images extract to disk with relative ./images/ paths. Without it, existing pdf-image:// behavior preserved for backward compat. Also adds min_width/min_height filtering (matching extract_images), save_markdown option, and fixes missing extract_vector_graphics in list_capabilities.	2026-02-12 20:24:19 -07:00
Ryan Malloy	febe6dae13	🔧 Add permit_forms with lazy reportlab imports Coordinate-based PDF form filling for scanned/flat PDFs. reportlab is now optional - only loaded when permit tools used. Install with: pip install mcp-pdf[forms]	2026-02-08 13:59:48 -07:00
Ryan Malloy	271e4c71d6	🔧 v2.0.9: Remove unreleased permit_forms mixin that broke PyPI install	2026-02-08 13:48:32 -07:00
Ryan Malloy	f32a014909	📝 Rewrite README: remove marketing fluff, describe what tools do	2026-02-06 22:43:02 -07:00
Ryan Malloy	e4f77008bb	🚀 v2.0.8: Add extract_vector_graphics tool for PDF to SVG extraction New tool extracts vector graphics from PDF pages as SVG files, supporting three modes: full_page (PyMuPDF native SVG), drawings_only (raw vector paths), and both. Handles lines, curves, rectangles, quads with proper color space conversion (RGB, grayscale, CMYK). No new dependencies.	2026-02-02 13:56:17 -07:00
Ryan Malloy	19bdeddcdf	📝 Update README: 40 tools, v2.0.7 table features, token management	2025-11-08 20:12:40 -07:00
Ryan Malloy	dfbf3d1870	🔧 v2.0.7: Fix table extraction token overflow with smart limiting PROBLEM: Table extraction from large PDFs was exceeding MCP's 25,000 token limit, causing "response too large" errors. A 5-page PDF with large tables generated 59,005 tokens, more than double the allowed limit. SOLUTION: Added flexible table data limiting with two new parameters: - max_rows_per_table: Limit rows returned per table (prevents overflow) - summary_only: Return only metadata without table data IMPLEMENTATION: 1. Added new parameters to extract_tables() method signature 2. Created _process_table_data() helper for consistent limiting logic 3. Updated all 3 extraction methods (Camelot, pdfplumber, Tabula) 4. Enhanced table metadata with truncation tracking: - total_rows: Full row count from PDF - rows_returned: Actual rows in response (after limiting) - rows_truncated: Number of rows omitted (if limited) USAGE EXAMPLES: # Summary mode - metadata only (smallest response) extract_tables(pdf_path, pages="1-5", summary_only=True) # Limited data - first 100 rows per table extract_tables(pdf_path, pages="1-5", max_rows_per_table=100) # Full data (default behavior, may overflow on large tables) extract_tables(pdf_path, pages="1-5") BENEFITS: - Prevents MCP token overflow errors - Maintains backward compatibility (new params are optional) - Clear guidance through metadata (shows when truncation occurred) - Flexible - users choose between summary/limited/full modes FILES MODIFIED: - src/mcp_pdf/mixins_official/table_extraction.py (all changes) - src/mcp_pdf/server.py (version bump to 2.0.7) - pyproject.toml (version bump to 2.0.7) VERSION: 2.0.7 PUBLISHED: https://pypi.org/project/mcp-pdf/2.0.7/	2025-11-03 18:26:34 -07:00
Ryan Malloy	fa65fa6e0c	🔧 v2.0.6: Fix async/await bug in validate_output_path calls Remove incorrect 'await' keywords from validate_output_path() calls across all mixins. validate_output_path() is a synchronous function, not async. Fixed in 15 locations across 6 mixins: - advanced_forms.py (4 calls) - annotations.py (3 calls) - document_assembly.py (2 calls) - form_management.py (2 calls) - image_processing.py (1 call) - misc_tools.py (4 calls) Error: 'object PosixPath can't be used in 'await' expression' Root cause: Incorrectly awaiting synchronous Path validation function Fix: Removed await keyword from all validate_output_path() calls PyPI: https://pypi.org/project/mcp-pdf/2.0.6/	2025-11-03 18:03:34 -07:00
Ryan Malloy	3327137536	🚀 v2.0.5: Fix page range parsing across all PDF tools Major architectural improvements and bug fixes in the v2.0.x series: ## v2.0.5 - Page Range Parsing (Current Release) - Fix page range parsing bug affecting 6 mixins (e.g., "93-95" or "11-30") - Create shared parse_pages_parameter() utility function - Support mixed formats: "1,3-5,7,10-15" - Update: pdf_utilities, content_analysis, image_processing, misc_tools, table_extraction, text_extraction ## v2.0.4 - Chunk Hint Fix - Fix next_chunk_hint to show correct page ranges - Dynamic calculation based on actual pages being extracted - Example: "30-50" now correctly shows "40-49" for next chunk ## v2.0.3 - Initial Range Support - Add page range support to text extraction ("11-30") - Fix _parse_pages_parameter to handle ranges with Python's range() - Convert 1-based user input to 0-based internal indexing ## v2.0.2 - Lazy Import Fix - Fix ModuleNotFoundError for reportlab on startup - Implement lazy imports for optional dependencies - Graceful degradation with helpful error messages ## v2.0.1 - Dependency Restructuring - Move reportlab to optional [forms] extra - Document installation: uvx --with mcp-pdf[forms] mcp-pdf ## v2.0.0 - Official FastMCP Pattern Migration - Migrate to official fastmcp.contrib.mcp_mixin pattern - Create 12 specialized mixins with 42 tools total - Architecture: mixins_official/ using MCPMixin base class - Backwards compatibility: server_legacy.py preserved Technical Improvements: - Centralized utility functions (DRY principle) - Consistent behavior across all PDF tools - Better error messages with actionable instructions - Library-specific adapters for table extraction Files Changed: - New: src/mcp_pdf/mixins_official/utils.py (shared utilities) - Updated: 6 mixins with improved page parsing - Version: pyproject.toml, server.py → 2.0.5 PyPI: https://pypi.org/project/mcp-pdf/2.0.5/	2025-11-03 17:12:37 -07:00
Ryan Malloy	8cbf542df1	🔧 Fix output path security with MCP_PDF_ALLOWED_PATHS environment variable BREAKING ISSUE FIXED: - Users reported "Output path not allowed: images" error - extract_images tool was rejecting relative paths due to overly restrictive security NEW SECURITY MODEL: - MCP_PDF_ALLOWED_PATHS environment variable controls allowed output directories - If unset: Allows any directory with "security theater" warnings - If set: Restricts outputs to specified colon-separated paths - Cross-platform compatible (: on Unix, ; on Windows) SECURITY PHILOSOPHY ENHANCED: - "TRUST NO ONE" - honest about application-level security limitations - Clear warnings that this is "security theater" - Emphasis on OS-level permissions and process isolation - Educational guidance on real security practices TECHNICAL CHANGES: - validate_output_path() rewritten with environment variable control - Path validation uses relative_to() for proper containment checking - Enhanced warning messages with security education - Updated documentation with honest security assessment DOCUMENTATION UPDATES: - Added MCP_PDF_ALLOWED_PATHS to configuration section - New "REAL Security" section with OS-level recommendations - Clear explanation of security theater vs actual protection Version: 1.1.1 (patch version for critical bugfix)	2025-09-23 23:40:05 -06:00
Ryan Malloy	856dd41996	✨ Add comprehensive link extraction tool (24th PDF tool) New Features: - extract_links: Extract all PDF hyperlinks with advanced filtering - Page-specific filtering (e.g., "1,3,5" or "1-5,8,10-12") - Link type categorization: external URLs, internal pages, emails, documents - Coordinate tracking for precise link positioning - FastMCP integration with proper tool registration - Version banner display following CLAUDE.md guidelines Technical Improvements: - Enhanced startup banner with package version display - Updated documentation to reflect 24 specialized tools - Proper FastMCP @mcp.tool() decorator usage - Comprehensive error handling and security validation Documentation Updates: - README.md: Updated tool count and installation guides - CLAUDE.md: Added link extraction to implemented features - LOCAL_DEVELOPMENT.md: Enhanced with scoped installation commands Version: 1.1.0 (minor version bump for new feature)	2025-09-23 20:41:16 -06:00
Ryan Malloy	ebf6bb8a43	🚀 Release v1.0.1: Bug fixes and local development tools - Fix variable scope bug in extract_text function - Add local development setup with claude-mcp-manager - Update author information - Add comprehensive local development documentation 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-07 00:58:51 -06:00
Ryan Malloy	8d01c44d4f	🚀 Rename to mcp-pdf and prepare for PyPI publication Package Rebranding: - Renamed package from mcp-pdf-tools to mcp-pdf (cleaner name) - Updated version to 1.0.0 (production ready with security hardening) - Updated all import paths and references throughout codebase PyPI Preparation: - Enhanced package description and metadata - Added proper project URLs and homepage - Updated CLI command from mcp-pdf-tools to mcp-pdf - Built distribution packages (wheel + source) Testing & Validation: - All 20 security tests pass with new package structure - Local installation and import tests successful - CLI command working correctly - Package ready for PyPI publication The secure, production-ready PDF processing platform is now ready for public distribution and installation via pip. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:42:59 -06:00
Ryan Malloy	75f8548668	🔒 Comprehensive security hardening and vulnerability fixes Implemented extensive security improvements to prevent attacks and ensure production readiness: Critical Security Fixes: - Fixed path traversal vulnerability in get_pdf_image function - Added file size limits (100MB PDFs, 50MB images) to prevent DoS - Implemented secure output path validation with directory restrictions - Added page count limits (1000 pages max) for resource protection - Secured JSON parameter parsing with 10KB size limits Access Control & Validation: - URL allowlisting with SSRF protection (blocks localhost, internal IPs) - IPv6 security handling for comprehensive host blocking - Input validation framework with length limits and sanitization - Secure file permissions (0o700 dirs, 0o600 files) Error Handling & Privacy: - Sanitized error messages to prevent information disclosure - Automatic removal of sensitive patterns (paths, emails, SSNs) - Generic error responses for failed operations Infrastructure & Monitoring: - Added security scanning tools (safety, pip-audit) - GitHub Actions workflow for continuous vulnerability monitoring - Daily automated security assessments - Fixed pypdf vulnerability (5.9.0 → 6.0.0) Testing & Validation: - 20 comprehensive security tests (all passing) - Integration tests confirming functionality preservation - Zero known vulnerabilities in dependencies - Validated all security functions work correctly All security measures tested and verified. Project now production-ready with enterprise-grade security posture. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-06 15:35:31 -06:00
Ryan Malloy	ab1d9ed13e	✨ Add comprehensive PDF annotations and markup tools Implement complete collaboration toolkit with: - add_sticky_notes: Comment annotations with color support - add_highlights: Text highlighting with 8 color options - add_stamps: Approval stamps (APPROVED, DRAFT, CONFIDENTIAL, etc.) - extract_all_annotations: Export to JSON/CSV formats Also includes document assembly features: - merge_pdfs_advanced: Combine PDFs with bookmark preservation - split_pdf_by_pages: Extract specific page ranges - split_pdf_by_bookmarks: Auto-split by chapters/sections - reorder_pdf_pages: Rearrange page sequences All tools tested and working with proper error handling.	2025-09-04 17:18:06 -06:00
Ryan Malloy	95596e0236	✨ Add comprehensive PDF form creation and validation tools - Add complete PDF form lifecycle management - Create new forms with text, checkbox, dropdown, signature fields - Fill existing forms with JSON data and optional flattening - Add fields to existing PDFs with flexible positioning - Advanced field types: radio groups, textareas, date fields - Comprehensive validation engine with regex patterns - Email, phone, number, date format validation - Required field checking and length constraints - Visual validation cues with asterisks and format hints - Multi-field error reporting with detailed feedback - International character support and edge case handling - Enterprise-ready for complex business forms	2025-09-03 02:33:01 -06:00
Ryan Malloy	ae80388ec4	🎯 Add custom output paths and clean summary for image extraction Enhance extract_images with user-specified output directories and concise summary responses to improve user control and reduce context window clutter. Key Features: • Custom Output Directory: Users can specify where images are saved • Clean Summary Output: Concise extraction results instead of verbose metadata • Automatic Directory Creation: Creates output directories as needed • File-Level Details: Individual file info with human-readable sizes • Extraction Summary: Quick overview with total size and file count New Parameters: + output_directory: Optional custom path for saving extracted images + Defaults to cache directory if not specified + Creates directories automatically with proper permissions Response Format: - Removed: Verbose image metadata arrays that fill context windows + Added: Clean summary with extraction statistics + Added: File list with essential details (filename, path, size, dimensions) + Added: Human-readable extraction summary Benefits: ✅ User control over image file locations ✅ Reduced context window pollution ✅ Essential information without verbosity ✅ Better integration with user workflows ✅ Maintains MCP resource compatibility for cached images Example Response: { "success": true, "images_extracted": 3, "total_size": "2.4 MB", "output_directory": "/path/to/custom/dir", "files": [{"filename": "page_1_image_0.png", "path": "/path/...", "size": "800 KB", "dimensions": "1920x1080"}] } 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 13:50:09 -06:00
Ryan Malloy	e087a3b7a0	✨ Add MCP resource URIs for extracted PDF images Implement proper MCP resource protocol for image access, eliminating the need for clients to handle local file paths and enabling seamless image integration. Key Features: • MCP Resource Endpoint: pdf-image://{image_id} for direct image access • extract_images(): Returns resource_uri field with MCP resource links • pdf_to_markdown(): Embeds resource URIs in markdown image references • Automatic MIME type detection (image/png, image/jpeg) • Seamless client integration without file path handling Benefits: ✅ Direct image access via MCP resource protocol ✅ No local file path dependencies for MCP clients ✅ Proper MIME type handling for image display ✅ Clean markdown with working image links ✅ Standards-compliant MCP resource implementation Response Format Enhancement: + "resource_uri": "pdf-image://page_1_image_0" + Works in markdown: ![Image](pdf-image://page_1_image_0) + MIME Type: image/png or image/jpeg + Direct client access without file system dependencies This resolves the limitation where extracted images were only available as local file paths, making them truly accessible to MCP clients through the standardized resource protocol. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:42:46 -06:00
Ryan Malloy	374339a15d	🔧 Fix verbose base64 output in image extraction functions Resolve MCP client context overflow by saving images to files instead of returning base64-encoded data that fills client message windows. Key Changes: • extract_images(): Save images to CACHE_DIR with file paths in response • pdf_to_markdown(): Save embedded images to files with path references • Add format_file_size() utility for human-readable file sizes • Update function descriptions to clarify file-based output Benefits: ✅ Prevents context message window overflow in MCP clients ✅ Returns clean, concise metadata with file paths ✅ Maintains full image access through saved files ✅ Improves user experience with readable file sizes ✅ Reduces memory usage and response payload sizes Response Format Changes: - Remove: "data": "<base64_string>" (verbose) + Add: "file_path": "/tmp/mcp-pdf-processing/image.png" + Add: "filename": "page_1_image_0.png" + Add: "size_bytes": 12345 + Add: "size_human": "12.1 KB" This resolves the issue where image extraction caused excessive verbose output that overwhelmed MCP client interfaces. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-20 11:34:42 -06:00
Ryan Malloy	10ef5028eb	📖 Add Claude Code integration command to documentation Feature prominent Claude Code integration instructions: - Add recommended one-line command for Claude Code users - Update installation section with uvx commands - Include git.supported.systems repository URLs - Highlight seamless AI-powered document processing integration Command for Claude Code users: claude mcp add -s local -- legacy-files uvx --from git+https://git.supported.systems/MCP/mcp-legacy-files.git mcp-legacy-files This enables direct access to all 9 vintage format processors within Claude Code for seamless AI-enhanced document processing workflows. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 23:11:28 -06:00
Ryan Malloy	78a8c40e71	Transform README into comprehensive project showcase Major Enhancement: Combined blog post storytelling with technical documentation to create an engaging, comprehensive project showcase. What's New: 📖 Compelling Narrative: Tells the complete story from 8 tools → 23 tools 🎯 Real-World Examples: Business intelligence, academic research, security workflows 🧠 Technical Deep-Dives: Architecture decisions, intelligent fallbacks, UX design ⚡ Performance Insights: Async architecture, caching strategies, resource management 🔧 Complete Documentation: Installation, usage, troubleshooting, contributing Key Sections Added: - "What We Built" - Project overview and use cases - "Key Innovations" - Document intelligence, layout processing, web integration - "Real-World Usage Examples" - 4 comprehensive workflow examples - "Performance & Architecture" - Technical implementation details - "Architecture Deep-Dive" - Code examples and design decisions - "Why MCP PDF Tools?" - Value proposition and differentiators Impact: - Much more engaging for new users and contributors - Showcases the full scope of capabilities (23 tools!) - Provides clear guidance for different use cases - Demonstrates technical sophistication and quality - Perfect for sharing, contributing, and adoption Now developers can understand not just HOW to use the tools, but WHY this project exists and what makes it special in the PDF processing landscape. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-12 08:40:59 -06:00
Ryan Malloy	f601d44d99	Fix page numbering: Switch to user-friendly 1-based indexing Problem: Zero-based page numbers were confusing for users who naturally think of pages starting from 1. Solution: - Updated `parse_pages_parameter()` to convert 1-based user input to 0-based internal representation - All user-facing documentation now uses 1-based page numbering (page 1 = first page) - Internal processing continues to use 0-based indexing for PyMuPDF compatibility - Output page numbers are consistently displayed as 1-based for users Changes: - Enhanced documentation strings to clarify "1-based" page numbering - Updated README examples with 1-based page numbers and clarifying comments - Fixed split_pdf function to handle 1-based input correctly - Updated test cases to verify 1-based -> 0-based conversion - Added feature highlight: "User-Friendly: All page numbers use 1-based indexing" Impact: Much more intuitive for users - no more confusion about which page is "page 0"! 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 04:32:20 -06:00
Ryan Malloy	f0365a0d75	Implement comprehensive PDF processing suite with 15 additional advanced tools Major expansion from 8 to 23 total tools covering: Document Analysis & Intelligence: - analyze_pdf_health: Comprehensive quality and health analysis - analyze_pdf_security: Security features and vulnerability assessment - classify_content: AI-powered document type classification - summarize_content: Intelligent content summarization with key insights - compare_pdfs: Advanced document comparison (text, structure, metadata) Layout & Visual Analysis: - analyze_layout: Page layout analysis with column detection - extract_charts: Chart, diagram, and visual element extraction - detect_watermarks: Watermark detection and analysis Content Manipulation: - extract_form_data: Interactive PDF form data extraction - split_pdf: Split PDFs at specified pages - merge_pdfs: Merge multiple PDFs into one - rotate_pages: Rotate pages by 90°/180°/270° Optimization & Utilities: - convert_to_images: Convert PDF pages to image files - optimize_pdf: File size optimization with quality levels - repair_pdf: Corrupted PDF repair and recovery Technical Enhancements: - All tools support HTTPS URLs with intelligent caching - Fixed MCP parameter validation for pages parameter - Comprehensive error handling and validation - Updated documentation with usage examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 04:27:04 -06:00
Ryan Malloy	58d43851b9	Add HTTPS URL support and fix MCP parameter validation Features: - HTTPS URL support: Process PDFs directly from URLs with intelligent caching - Smart caching: 1-hour cache to avoid repeated downloads - Content validation: Verify downloads are actually PDF files - Security: Proper User-Agent headers, HTTPS preferred over HTTP - MCP parameter fixes: Handle pages parameter as string "[2,3]" format - Backward compatibility: Still supports local file paths and list parameters Technical changes: - Added download_pdf_from_url() with caching and validation - Updated validate_pdf_path() to handle URLs and local paths - Added parse_pages_parameter() for flexible parameter parsing - Updated all 8 tools to accept string pages parameters - Enhanced error handling for network and validation issues All tools now support: - Local paths: "/path/to/file.pdf" - HTTPS URLs: "https://example.com/document.pdf" - Flexible pages: "[2,3]", "1,2,3", or [1,2,3] 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-11 02:25:53 -06:00
Ryan Malloy	478ab41b1f	Merge remote repository with local MCP PDF Tools implementation Resolved README.md conflict by preserving comprehensive documentation while maintaining repository structure from git.supported.systems 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 17:00:49 -06:00
Ryan Malloy	dfc6fe1149	Initial commit	2025-08-10 22:59:46 +00:00
Ryan Malloy	c902e81e4d	Initial commit: Complete MCP PDF Tools server implementation Features: - 8 comprehensive PDF processing tools with intelligent fallbacks - Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection) - Table extraction (Camelot → pdfplumber → Tabula fallback chain) - OCR processing with Tesseract and preprocessing options - Document analysis (structure, metadata, scanned detection) - Image extraction with filtering capabilities - PDF to markdown conversion with metadata - Built on FastMCP framework with full MCP protocol support - Comprehensive error handling and user-friendly messages - Docker support and cross-platform compatibility - Complete test suite and examples 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-10 16:36:21 -06:00
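
Several entries in the log (58d43851b9, f601d44d99, 3327137536) describe the same page-selection contract: callers pass 1-based pages as a list, a bracketed string such as "[2,3]", or a mixed string such as "1,3-5,7,10-15", and the tools work on 0-based indices internally. The sketch below illustrates that conversion under those stated rules. It is not the project's `parse_pages_parameter()`; the names `parse_pages` and `_to_index` are hypothetical, and treating `None` as "no page filter" is an assumption.

```python
from typing import List, Optional, Union


def _to_index(page: int) -> int:
    """Convert a 1-based page number to a 0-based index; reject page 0 and below."""
    if page < 1:
        raise ValueError(f"page numbers start at 1, got {page}")
    return page - 1


def parse_pages(pages: Union[str, List[int], None]) -> Optional[List[int]]:
    """Hypothetical helper: return sorted, de-duplicated 0-based page indices."""
    if pages is None:
        return None  # assumption: None means "no page filter"
    if isinstance(pages, list):
        return sorted({_to_index(int(p)) for p in pages})

    indices = set()
    # Accept both "1,2,3" and the bracketed "[2,3]" form.
    for part in pages.strip().strip("[]").split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if last < first:
                raise ValueError(f"invalid range: {part}")
            # Both ends are inclusive for the user, so the 0-based stop is `last`.
            indices.update(range(_to_index(first), last))
        else:
            indices.add(_to_index(int(part)))
    return sorted(indices)


print(parse_pages("1,3-5,7,10-15"))
```

Returning a sorted list without duplicates is a design choice of this sketch, so that overlapping specs such as "1-3,2" select each page once.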

50 Commits
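
Commit 8cbf542df1 describes the output-path model in prose: `MCP_PDF_ALLOWED_PATHS` holds a list of directories separated by the platform path separator, an unset variable allows any directory, and containment is checked with `relative_to()`. A minimal sketch of that check follows, assuming only what the commit message states. The function name `is_output_allowed` and its `allowed` parameter are hypothetical, the warnings the commit mentions are omitted, and this is not the project's `validate_output_path()`.

```python
import os
from pathlib import Path
from typing import Optional


def is_output_allowed(output_path: str, allowed: Optional[str] = None) -> bool:
    """Hypothetical check: is output_path inside one of the allowed directories?"""
    if allowed is None:
        # Read the variable named in the commit message when no override is given.
        allowed = os.environ.get("MCP_PDF_ALLOWED_PATHS", "")

    # os.pathsep is ":" on Unix and ";" on Windows.
    roots = [Path(p).resolve() for p in allowed.split(os.pathsep) if p.strip()]
    if not roots:
        return True  # unset or empty: no restriction is applied

    # Resolve first so that ".." segments cannot step outside an allowed root.
    target = Path(output_path).resolve()
    for root in roots:
        try:
            target.relative_to(root)  # raises ValueError if target is not under root
            return True
        except ValueError:
            continue
    return False
```

As the commit itself notes, a check like this runs at the application level only; OS-level permissions and process isolation remain the real boundary.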