16 Commits

Author SHA1 Message Date
a23fd8467a 📉 File-first output for detect_structure — 20× context reduction
detect_structure now writes full JSON to disk and returns a compact
summary (~1k tokens) instead of the full structure tree (~20k tokens).
Prevents MCP context overflow on large documents. Set inline=True to
get full data in response (used internally by split_pdf_by_structure).
2026-03-04 17:12:36 -07:00
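The file-first pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `summarize_structure`, the `sections` key, and the summary fields are assumptions made for the example; only the `inline` flag comes from the commit message.

```python
import json
from pathlib import Path


def summarize_structure(structure: dict, output_path: Path, inline: bool = False) -> dict:
    """Write the full structure tree to disk and return a compact summary.

    Hypothetical helper: the real tool's return shape is not shown in the log.
    """
    # The full (potentially large) tree always goes to disk.
    output_path.write_text(json.dumps(structure, indent=2), encoding="utf-8")

    sections = structure.get("sections", [])
    summary = {
        "output_file": str(output_path),
        "section_count": len(sections),
        # Only a short preview of titles is returned, to keep the response small.
        "titles": [s.get("title", "") for s in sections[:10]],
    }
    if inline:
        # Opt-in: include the full data in the response as well.
        summary["structure"] = structure
    return summary
```

The caller gets a pointer to the file plus a bounded preview, so the response size no longer grows with the document.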
823318ec15 Chapter-aware PDF extraction: detect_structure, split_pdf_by_structure, batch_extract
New StructureDetectionMixin with 3 tools:
- detect_structure: finds chapters/sections via bookmarks, font-size
  heuristics, numbering patterns, and user-supplied regex
- split_pdf_by_structure: auto-splits PDF into per-chapter directories
  with markdown + images + vectors in one call
- batch_extract: process N user-specified page ranges from one PDF

Enhanced pdf_to_markdown:
- output_filename parameter for custom .md filenames
- vector_diagnostics reporting for skipped pages
- vector_fallback_raster: render sub-threshold pages as PNG at 150 DPI

Bumps version to 2.1.0
2026-03-01 23:52:15 -07:00
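As an illustration of the "numbering patterns" and "user-supplied regex" detection paths named above, here is a small sketch that scans text lines for heading-like patterns. The patterns and the `find_headings` name are assumptions for the example; the actual heuristics in `detect_structure` are not shown in this log and also involve bookmarks and font sizes.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical built-in patterns: "Chapter 3: Title" and "2.1 Title".
HEADING_PATTERNS = [
    re.compile(r"^(?:Chapter|CHAPTER)\s+(\d+)\b[\s:.-]*(.*)$"),
    re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].*)$"),
]


def find_headings(lines: List[str], extra_pattern: Optional[str] = None) -> List[Tuple[int, str]]:
    """Return (line_index, text) for every line that looks like a heading."""
    patterns = list(HEADING_PATTERNS)
    if extra_pattern:
        # A user-supplied regex is tried before the built-in patterns.
        patterns.insert(0, re.compile(extra_pattern))
    found = []
    for index, line in enumerate(lines):
        text = line.strip()
        if any(pattern.match(text) for pattern in patterns):
            found.append((index, text))
    return found
```

Pattern matching alone produces false positives (a body line such as "2026 Revenue grew" would match), which is presumably why the commit combines it with other signals.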
62d9b176c8 🔧 Replace hardcoded 100MB PDF limit with MCP_PDF_MAX_SIZE env var
Centralize PDF size limit in security.py, controlled by MCP_PDF_MAX_SIZE
(in MB). Default: disabled (no limit). Set e.g. MCP_PDF_MAX_SIZE=500 to
cap at 500MB. Remove unused self.max_file_size from all 13 mixins.
2026-02-19 15:51:41 -07:00
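A minimal sketch of an environment-controlled size limit like the one described here. Only the variable name `MCP_PDF_MAX_SIZE`, its unit (MB), and the "unset means no limit" default come from the commit message; the function names, the 1024-based megabyte, and the handling of invalid values are assumptions.

```python
import os
from typing import Mapping, Optional


def get_max_pdf_size_bytes(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Return the configured limit in bytes, or None when no limit applies."""
    env = os.environ if env is None else env
    raw = env.get("MCP_PDF_MAX_SIZE", "").strip()
    if not raw:
        return None  # unset: limit disabled
    try:
        megabytes = int(raw)
    except ValueError:
        return None  # assumption: an unparsable value is treated as unset
    if megabytes <= 0:
        return None  # assumption: non-positive values also disable the limit
    return megabytes * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def is_size_allowed(size_bytes: int, env: Optional[Mapping[str, str]] = None) -> bool:
    """True when the file fits under the limit or no limit is configured."""
    limit = get_max_pdf_size_bytes(env)
    return limit is None or size_bytes <= limit
```

Reading the variable at call time rather than import time keeps the check in one place, which matches the commit's goal of removing per-mixin `max_file_size` attributes.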
f759634687 Smart vector extraction in pdf_to_markdown
Detect significant vector graphics (charts, schematics, diagrams) during
markdown conversion and extract them as full-page SVGs to vectors/ subdir.
Uses multi-tier heuristic (drawing count, path complexity, bounding box)
adapted from extract_charts to avoid false positives on decorative borders.

New params: include_vectors, vector_min_drawings, vector_min_complexity
2026-02-18 15:29:25 -07:00
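The multi-tier heuristic mentioned above might look something like the sketch below. It operates on precomputed per-page statistics so that it stays library-independent. The parameter names `vector_min_drawings` and `vector_min_complexity` come from the commit; the default thresholds, the `PageVectorStats` type, the `max_border_area_ratio` parameter, and the way the bounding-box tier is applied are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PageVectorStats:
    drawing_count: int      # number of vector drawing objects on the page
    path_items: int         # total path segments across those drawings
    bbox_area_ratio: float  # fraction of the page covered by the drawings' bounding box


def has_significant_vectors(
    stats: PageVectorStats,
    vector_min_drawings: int = 10,       # hypothetical default
    vector_min_complexity: int = 50,     # hypothetical default
    max_border_area_ratio: float = 0.9,  # hypothetical: near-full-page boxes look like borders
) -> bool:
    """Apply the tiers in order; a page must pass all of them."""
    # Tier 1: too few drawings to be a chart or schematic.
    if stats.drawing_count < vector_min_drawings:
        return False
    # Tier 2: drawings exist but are trivially simple (rules, boxes).
    if stats.path_items < vector_min_complexity:
        return False
    # Tier 3: a bounding box hugging the whole page with modest complexity
    # is more likely a decorative frame than a diagram.
    if stats.bbox_area_ratio > max_border_area_ratio and stats.path_items < 4 * vector_min_complexity:
        return False
    return True
```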
772bcac0df 🐛 File-first output for extract_text and pdf_to_markdown
Both tools now write to disk by default and return file path + short
preview instead of full content inline. Prevents MCP context overflow
on large PDFs. Set inline=True for the old behavior.

pdf_to_markdown always extracts images to ./images/ with relative paths
(no more dead pdf-image:// URIs). extract_text writes a .txt file.
2026-02-18 15:01:43 -07:00
8b5783585f 🐛 Fix pdf_to_markdown broken image references
pdf_to_markdown generated pdf-image:// URIs that never resolved — the
resource handler only existed in the legacy server. Add output_directory
parameter: when set, images extract to disk with relative ./images/ paths.
Without it, existing pdf-image:// behavior preserved for backward compat.

Also adds min_width/min_height filtering (matching extract_images),
save_markdown option, and fixes missing extract_vector_graphics in
list_capabilities.
2026-02-12 20:24:19 -07:00
8cbf542df1 🔧 Fix output path security with MCP_PDF_ALLOWED_PATHS environment variable
BREAKING ISSUE FIXED:
- Users reported "Output path not allowed: images" error
- extract_images tool was rejecting relative paths due to overly restrictive security

NEW SECURITY MODEL:
- MCP_PDF_ALLOWED_PATHS environment variable controls allowed output directories
- If unset: Allows any directory with "security theater" warnings
- If set: Restricts outputs to specified colon-separated paths
- Cross-platform compatible (: on Unix, ; on Windows)

SECURITY PHILOSOPHY ENHANCED:
- "TRUST NO ONE" - honest about application-level security limitations
- Clear warnings that this is "security theater"
- Emphasis on OS-level permissions and process isolation
- Educational guidance on real security practices

TECHNICAL CHANGES:
- validate_output_path() rewritten with environment variable control
- Path validation uses relative_to() for proper containment checking
- Enhanced warning messages with security education
- Updated documentation with honest security assessment

DOCUMENTATION UPDATES:
- Added MCP_PDF_ALLOWED_PATHS to configuration section
- New "REAL Security" section with OS-level recommendations
- Clear explanation of security theater vs actual protection

Version: 1.1.1 (patch version for critical bugfix)
2025-09-23 23:40:05 -06:00
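The containment check described under TECHNICAL CHANGES can be sketched as below. The function name `validate_output_path`, the `MCP_PDF_ALLOWED_PATHS` variable, and the use of `relative_to()` come from the commit message; the signature, the `allowed` override parameter, and the error text are assumptions, and the warning output for the unset case is omitted.

```python
import os
from pathlib import Path
from typing import Optional


def validate_output_path(path: str, allowed: Optional[str] = None) -> Path:
    """Resolve `path` and check it against the allowed directories.

    `allowed` overrides the environment variable (an assumption added so the
    sketch is easy to exercise).
    """
    resolved = Path(path).expanduser().resolve()
    if allowed is None:
        allowed = os.environ.get("MCP_PDF_ALLOWED_PATHS", "")
    if not allowed:
        return resolved  # unset: any directory is accepted

    # os.pathsep is ":" on Unix and ";" on Windows.
    for entry in allowed.split(os.pathsep):
        if not entry:
            continue
        base = Path(entry).expanduser().resolve()
        try:
            # relative_to() raises ValueError unless `resolved` is inside `base`.
            resolved.relative_to(base)
            return resolved
        except ValueError:
            continue
    raise ValueError(f"Output path not allowed: {path}")
```

Resolving before comparing is what defeats `..` traversal: `base/../elsewhere` resolves outside `base` and fails the `relative_to()` check. As the commit itself stresses, this is an application-level check and not a substitute for OS-level permissions.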
856dd41996 Add comprehensive link extraction tool (24th PDF tool)
New Features:
- extract_links: Extract all PDF hyperlinks with advanced filtering
- Page-specific filtering (e.g., "1,3,5" or "1-5,8,10-12")
- Link type categorization: external URLs, internal pages, emails, documents
- Coordinate tracking for precise link positioning
- FastMCP integration with proper tool registration
- Version banner display following CLAUDE.md guidelines

Technical Improvements:
- Enhanced startup banner with package version display
- Updated documentation to reflect 24 specialized tools
- Proper FastMCP @mcp.tool() decorator usage
- Comprehensive error handling and security validation

Documentation Updates:
- README.md: Updated tool count and installation guides
- CLAUDE.md: Added link extraction to implemented features
- LOCAL_DEVELOPMENT.md: Enhanced with scoped installation commands

Version: 1.1.0 (minor version bump for new feature)
2025-09-23 20:41:16 -06:00
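The page-filter strings quoted above ("1,3,5" or "1-5,8,10-12") could be parsed along these lines. The function name `parse_page_spec` and the choice of 1-based page numbers are assumptions; the log does not show how `extract_links` parses the filter.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec such as "1-5,8,10-12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"Invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"Invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```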
8d01c44d4f 🚀 Rename to mcp-pdf and prepare for PyPI publication
**Package Rebranding:**
- Renamed package from mcp-pdf-tools to mcp-pdf (cleaner name)
- Updated version to 1.0.0 (production ready with security hardening)
- Updated all import paths and references throughout codebase

**PyPI Preparation:**
- Enhanced package description and metadata
- Added proper project URLs and homepage
- Updated CLI command from mcp-pdf-tools to mcp-pdf
- Built distribution packages (wheel + source)

**Testing & Validation:**
- All 20 security tests pass with new package structure
- Local installation and import tests successful
- CLI command working correctly
- Package ready for PyPI publication

The secure, production-ready PDF processing platform is now ready
for public distribution and installation via pip.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-06 15:42:59 -06:00
75f8548668 🔒 Comprehensive security hardening and vulnerability fixes
Implemented extensive security improvements to prevent attacks and ensure
production readiness:

**Critical Security Fixes:**
- Fixed path traversal vulnerability in get_pdf_image function
- Added file size limits (100MB PDFs, 50MB images) to prevent DoS
- Implemented secure output path validation with directory restrictions
- Added page count limits (1000 pages max) for resource protection
- Secured JSON parameter parsing with 10KB size limits

**Access Control & Validation:**
- URL allowlisting with SSRF protection (blocks localhost, internal IPs)
- IPv6 security handling for comprehensive host blocking
- Input validation framework with length limits and sanitization
- Secure file permissions (0o700 dirs, 0o600 files)

**Error Handling & Privacy:**
- Sanitized error messages to prevent information disclosure
- Automatic removal of sensitive patterns (paths, emails, SSNs)
- Generic error responses for failed operations

**Infrastructure & Monitoring:**
- Added security scanning tools (safety, pip-audit)
- GitHub Actions workflow for continuous vulnerability monitoring
- Daily automated security assessments
- Fixed pypdf vulnerability (5.9.0 → 6.0.0)

**Testing & Validation:**
- 20 comprehensive security tests (all passing)
- Integration tests confirming functionality preservation
- Zero known vulnerabilities in dependencies
- Validated all security functions work correctly

All security measures tested and verified. Project now production-ready
with enterprise-grade security posture.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-06 15:35:31 -06:00
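One way to approach the SSRF protection listed above (blocking localhost and internal IPs, including IPv6) is a literal-address check with the standard library. This is a sketch under assumptions: the function name `is_blocked_url` is hypothetical, and the check covers only IP literals and the `localhost` name. A fuller defense would also resolve DNS names and check the resulting addresses.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_url(url: str) -> bool:
    """Return True when the URL should be refused before any request is made."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return True  # refuse file://, ftp://, and anything else
    host = parsed.hostname
    if not host:
        return True
    host = host.lower()
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; not resolved in this sketch
    # Treat IPv4-mapped IPv6 (::ffff:a.b.c.d) as the embedded IPv4 address.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return (
        addr.is_loopback
        or addr.is_private
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )
```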
ab1d9ed13e Add comprehensive PDF annotations and markup tools
Implement complete collaboration toolkit with:
- add_sticky_notes: Comment annotations with color support
- add_highlights: Text highlighting with 8 color options
- add_stamps: Approval stamps (APPROVED, DRAFT, CONFIDENTIAL, etc.)
- extract_all_annotations: Export to JSON/CSV formats

Also includes document assembly features:
- merge_pdfs_advanced: Combine PDFs with bookmark preservation
- split_pdf_by_pages: Extract specific page ranges
- split_pdf_by_bookmarks: Auto-split by chapters/sections
- reorder_pdf_pages: Rearrange page sequences

All tools tested and working with proper error handling.
2025-09-04 17:18:06 -06:00
95596e0236 Add comprehensive PDF form creation and validation tools
- Add complete PDF form lifecycle management
- Create new forms with text, checkbox, dropdown, signature fields
- Fill existing forms with JSON data and optional flattening
- Add fields to existing PDFs with flexible positioning
- Advanced field types: radio groups, textareas, date fields
- Comprehensive validation engine with regex patterns
- Email, phone, number, date format validation
- Required field checking and length constraints
- Visual validation cues with asterisks and format hints
- Multi-field error reporting with detailed feedback
- International character support and edge case handling
- Enterprise-ready for complex business forms
2025-09-03 02:33:01 -06:00
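The validation features listed above (regex formats, required fields, length constraints, multi-field error reporting) can be illustrated with a small sketch. The rule schema, the `validate_form` name, and the specific regular expressions are assumptions for the example and are deliberately simplistic.

```python
import re
from typing import Dict, List

# Hypothetical format patterns; real-world email/phone validation is looser or stricter.
FORMATS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[0-9][0-9\s().-]{6,}$"),
    "number": re.compile(r"^-?\d+(\.\d+)?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def validate_form(data: Dict[str, str], rules: Dict[str, dict]) -> Dict[str, List[str]]:
    """Check every field against its rule and collect all problems per field."""
    errors: Dict[str, List[str]] = {}
    for field, rule in rules.items():
        value = (data.get(field) or "").strip()
        problems = []
        if not value:
            if rule.get("required"):
                problems.append("required")
        else:
            max_length = rule.get("max_length")
            if max_length is not None and len(value) > max_length:
                problems.append(f"longer than {max_length} characters")
            fmt = rule.get("format")
            if fmt and not FORMATS[fmt].match(value):
                problems.append(f"invalid {fmt}")
        if problems:
            errors[field] = problems
    return errors
```

Collecting errors for all fields in one pass, rather than stopping at the first failure, is what makes multi-field reporting possible.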
ae80388ec4 🎯 Add custom output paths and clean summary for image extraction
Enhance extract_images with user-specified output directories and concise
summary responses to improve user control and reduce context window clutter.

Key Features:
• Custom Output Directory: Users can specify where images are saved
• Clean Summary Output: Concise extraction results instead of verbose metadata
• Automatic Directory Creation: Creates output directories as needed
• File-Level Details: Individual file info with human-readable sizes
• Extraction Summary: Quick overview with total size and file count

New Parameters:
+ output_directory: Optional custom path for saving extracted images
+ Defaults to cache directory if not specified
+ Creates directories automatically with proper permissions

Response Format:
- Removed: Verbose image metadata arrays that fill context windows
+ Added: Clean summary with extraction statistics
+ Added: File list with essential details (filename, path, size, dimensions)
+ Added: Human-readable extraction summary

Benefits:
• User control over image file locations
• Reduced context window pollution
• Essential information without verbosity
• Better integration with user workflows
• Maintains MCP resource compatibility for cached images

Example Response:
{
  "success": true,
  "images_extracted": 3,
  "total_size": "2.4 MB",
  "output_directory": "/path/to/custom/dir",
  "files": [{"filename": "page_1_image_0.png", "path": "/path/...", "size": "800 KB", "dimensions": "1920x1080"}]
}

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-20 13:50:09 -06:00
e087a3b7a0 Add MCP resource URIs for extracted PDF images
Implement proper MCP resource protocol for image access, eliminating the need
for clients to handle local file paths and enabling seamless image integration.

Key Features:
• MCP Resource Endpoint: pdf-image://{image_id} for direct image access
• extract_images(): Returns resource_uri field with MCP resource links
• pdf_to_markdown(): Embeds resource URIs in markdown image references
• Automatic MIME type detection (image/png, image/jpeg)
• Seamless client integration without file path handling

Benefits:
• Direct image access via MCP resource protocol
• No local file path dependencies for MCP clients
• Proper MIME type handling for image display
• Clean markdown with working image links
• Standards-compliant MCP resource implementation

Response Format Enhancement:
+ "resource_uri": "pdf-image://page_1_image_0"
+ Works in markdown: ![Image](pdf-image://page_1_image_0)
+ MIME Type: image/png or image/jpeg
+ Direct client access without file system dependencies

This resolves the limitation where extracted images were only available
as local file paths, making them truly accessible to MCP clients
through the standardized resource protocol.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-20 11:42:46 -06:00
374339a15d 🔧 Fix verbose base64 output in image extraction functions
Resolve MCP client context overflow by saving images to files instead of
returning base64-encoded data that fills client message windows.

Key Changes:
• extract_images(): Save images to CACHE_DIR with file paths in response
• pdf_to_markdown(): Save embedded images to files with path references
• Add format_file_size() utility for human-readable file sizes
• Update function descriptions to clarify file-based output

Benefits:
• Prevents context message window overflow in MCP clients
• Returns clean, concise metadata with file paths
• Maintains full image access through saved files
• Improves user experience with readable file sizes
• Reduces memory usage and response payload sizes

Response Format Changes:
- Remove: "data": "<base64_string>" (verbose)
+ Add: "file_path": "/tmp/mcp-pdf-processing/image.png"
+ Add: "filename": "page_1_image_0.png"
+ Add: "size_bytes": 12345
+ Add: "size_human": "12.1 KB"

This resolves the issue where image extraction caused excessive verbose
output that overwhelmed MCP client interfaces.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-20 11:34:42 -06:00
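A utility like the `format_file_size()` mentioned above might be written as follows. The implementation is an assumption (the log only names the function and shows `12345` bytes rendered as `"12.1 KB"`); this version uses 1024-based units, which reproduces that example.

```python
def format_file_size(size_bytes: int) -> str:
    """Render a byte count as a short human-readable string."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(size_bytes)
    index = 0
    # Step up one unit at a time until the value drops below 1024.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    if index == 0:
        return f"{int(size)} B"  # whole bytes, no decimals
    return f"{size:.1f} {units[index]}"
```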
c902e81e4d Initial commit: Complete MCP PDF Tools server implementation
Features:
- 8 comprehensive PDF processing tools with intelligent fallbacks
- Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection)
- Table extraction (Camelot → pdfplumber → Tabula fallback chain)
- OCR processing with Tesseract and preprocessing options
- Document analysis (structure, metadata, scanned detection)
- Image extraction with filtering capabilities
- PDF to markdown conversion with metadata
- Built on FastMCP framework with full MCP protocol support
- Comprehensive error handling and user-friendly messages
- Docker support and cross-platform compatibility
- Complete test suite and examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-10 16:36:21 -06:00
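The fallback chain named in this commit (Camelot → pdfplumber → Tabula for tables) follows a general pattern that can be sketched without touching any of those libraries. Everything here is hypothetical scaffolding: `extract_with_fallback` and the `(name, callable)` pairing are assumptions, and the extractors are passed in by the caller; no real library calls are shown.

```python
from typing import Any, Callable, List, Sequence, Tuple

Extractor = Callable[[str], List[Any]]


def extract_with_fallback(
    extractors: Sequence[Tuple[str, Extractor]], pdf_path: str
) -> Tuple[str, List[Any]]:
    """Try each extractor in order; return the first non-empty result with its name."""
    failures = []
    for name, extractor in extractors:
        try:
            result = extractor(pdf_path)
        except Exception as exc:  # any backend error moves on to the next backend
            failures.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        failures.append(f"{name}: no results")
    raise RuntimeError("All extractors failed (" + "; ".join(failures) + ")")
```

Returning the name of the backend that succeeded lets the caller report which method was used, and keeping the failure reasons makes the final error message actionable.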