23 Commits

Author SHA1 Message Date
48c44e941c v2.2.1: Republish for updated README on PyPI
No code changes — docs-only bump. Surfaces the rewritten README,
QUICKSTART, and LOCAL_DEVELOPMENT docs to anyone landing on
https://pypi.org/project/mcp-pdf/
2026-05-05 17:36:39 -06:00
b2d9073f04 Add markdown_to_pdf tool — convert .md to PDF via pandoc
New tool in ImageProcessingMixin (sibling of pdf_to_markdown). Accepts
either a markdown file path or inline markdown text, writes a PDF to a
caller-specified output path.

Engine selection auto-detects what's available on PATH, preferring quality:
xelatex > pdflatex > tectonic > weasyprint > wkhtmltopdf. Caller can force
a specific engine or pass raw pandoc args for advanced cases.

pypandoc is gated behind a new [markdown] optional extra so the base
install stays lean. The tool surfaces clear errors if pypandoc, pandoc,
or all PDF engines are missing.

Bumps to v2.2.0 (new feature, minor bump).
2026-05-05 16:21:09 -06:00
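The engine auto-detection described in the markdown_to_pdf commit can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `pick_pdf_engine` and the injectable `which` parameter are assumptions made for the example; only the preference order comes from the commit message.

```python
import shutil
from typing import Callable, Optional, Sequence

# Preference order quoted from the commit message, best quality first.
ENGINE_PREFERENCE = ("xelatex", "pdflatex", "tectonic", "weasyprint", "wkhtmltopdf")


def pick_pdf_engine(
    forced: Optional[str] = None,
    preference: Sequence[str] = ENGINE_PREFERENCE,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the forced engine if given, else the first preferred engine on PATH.

    Hypothetical helper: `which` is injectable only so the lookup can be tested.
    """
    if forced is not None:
        if which(forced) is None:
            raise RuntimeError(f"Requested PDF engine {forced!r} was not found on PATH")
        return forced
    for engine in preference:
        if which(engine) is not None:
            return engine
    raise RuntimeError("No PDF engine found on PATH; tried: " + ", ".join(preference))
```

Raising a descriptive error when nothing is found mirrors the "clear errors" behaviour the commit mentions, though the exact messages here are invented.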
0eea85f352 Sync uv.lock to v2.1.7
2026-04-25 10:47:43 -06:00
Ryan Malloy
b53d8ab998 Fix document-closed errors in 7 tools, fix stamp font name
- Capture total_pages before doc.close() in content_analysis,
  security_analysis, annotations, and misc_tools mixins
- Fix invalid PyMuPDF font name "helv-bold" → "helv" in add_stamps
- Bump to v2.1.7
2026-04-07 04:19:20 -06:00
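The document-closed fix above follows a common pattern: read everything needed from the handle before closing it. The sketch below uses a stand-in `FakeDocument` class (an assumption for illustration, not the PyMuPDF API) so that it runs without any PDF library installed.

```python
class FakeDocument:
    """Stand-in for a PDF document handle; hypothetical, NOT the PyMuPDF API."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._closed = False

    def __len__(self):
        # Mimic a handle that refuses to answer once closed.
        if self._closed:
            raise ValueError("document closed")
        return len(self._pages)

    def close(self):
        self._closed = True


def summarize(doc):
    # Capture the page count while the handle is still open...
    total_pages = len(doc)
    doc.close()
    # ...so the response can be built safely after close().
    return {"total_pages": total_pages}
```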
057aa5be40 📉 File-first output for ocr_pdf, slim split_pdf_by_structure response
ocr_pdf: writes OCR text to file by default, returns path + preview
instead of full text dump (~17k tokens → ~500 tokens). inline=True
for old behavior.

split_pdf_by_structure: sections are now one-line summaries instead
of full path objects. Removed detected_structure dump from response.
2026-03-08 05:30:57 -06:00
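The file-first pattern described for ocr_pdf can be sketched as below. The function name, the response keys, and the 200-character preview length are all assumptions for illustration; only the idea (write the full text to disk, return a path plus a short preview, and keep an `inline` switch for the old behaviour) comes from the commit message.

```python
import tempfile
from pathlib import Path
from typing import Optional


def file_first_result(
    text: str,
    output_path: Optional[Path] = None,
    preview_chars: int = 200,
    inline: bool = False,
) -> dict:
    """Write text to disk and return a compact response (hypothetical helper)."""
    if output_path is None:
        # Assumed default: a fresh temporary directory.
        output_path = Path(tempfile.mkdtemp()) / "ocr_output.txt"
    output_path.write_text(text, encoding="utf-8")

    result = {"output_path": str(output_path), "total_chars": len(text)}
    if inline:
        result["text"] = text  # old behaviour: full text in the response
    else:
        result["preview"] = text[:preview_chars]  # new default: short preview only
    return result
```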
d413438fea 📦 Make camelot-py and tabula-py optional dependencies
Moves camelot-py[cv] and tabula-py from core to optional deps
(pip install mcp-pdf[tables]). Fixes Python 3.14 install failure
caused by pdftopng lacking cp314 wheels.

- Lazy-import camelot/tabula in all extraction methods
- Auto-fallback skips unavailable methods in table extraction
- pdfplumber (pure Python, always available) handles tables by default
- Also slims get_document_structure response (~12.5k → ~400 tokens)
2026-03-08 03:20:01 -06:00
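A lazy-import scheme like the one described for the optional table dependencies might look like this. The helper names are hypothetical; the `mcp-pdf[tables]` extra is taken from the commit message.

```python
import importlib
import importlib.util


def lazy_import(module_name: str, extra: str):
    """Import an optional dependency at call time, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is not installed; try: pip install mcp-pdf[{extra}]"
        ) from exc


def available_methods(candidates):
    """Return the names of (name, top_level_module) pairs that can be imported.

    Order is preserved, so an auto-fallback loop can simply skip missing methods.
    """
    return [
        name
        for name, module in candidates
        if importlib.util.find_spec(module) is not None
    ]
```

Because the import happens inside the extraction call rather than at module load, a missing wheel for one backend no longer prevents the package from installing or starting.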
6af3104633 📉 Slim get_document_structure: cap bookmarks to 20 preview lines
Bookmark list was unbounded — a 346-bookmark parts manual produced
~12.5k tokens. Now returns indented bookmark preview (20 lines + count),
folds page_analysis and document_organization into structure_summary.
~406 tokens for the same document.
2026-03-06 21:26:30 -07:00
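Capping the bookmark list could be done along these lines. The input shape (a list of `(level, title)` tuples with levels starting at 1) and the response keys are assumptions made for the sketch; the 20-line cap and the "preview plus count" idea come from the commit message.

```python
def bookmark_preview(bookmarks, max_lines=20):
    """Build an indented preview of at most max_lines bookmarks plus a total count."""
    lines = [
        "  " * (level - 1) + title  # two spaces of indent per nesting level
        for level, title in bookmarks[:max_lines]
    ]
    return {
        "bookmark_count": len(bookmarks),
        "preview": lines,
        "truncated": len(bookmarks) > max_lines,
    }
```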
81a3619144 📉 Slim detect_structure response to ~224 tokens
Preview capped at 10 sections as human-readable lines, detection_info
moved into the JSON file. Response went from ~22k tokens (inline) to
~1.6k (v2.1.2) to ~224 tokens now.
2026-03-04 17:15:32 -07:00
a23fd8467a 📉 File-first output for detect_structure — 20× context reduction
detect_structure now writes full JSON to disk and returns a compact
summary (~1k tokens) instead of the full structure tree (~20k tokens).
Prevents MCP context overflow on large documents. Set inline=True to
get full data in response (used internally by split_pdf_by_structure).
2026-03-04 17:12:36 -07:00
56ab8356bc 🐛 Fix superscript handling and directory name truncation in detect_structure
- Two-pass span collection includes sandwiched non-heading spans (e.g. ² in I²C)
  so superscripts between heading-sized spans aren't dropped
- Join heading line parts without spaces ("".join) for proper glyph concatenation
- Cap numbering-pattern title at first newline + 80 chars with word boundary break
- Reduce _sanitize_dirname max from 80→50 chars with word-boundary truncation
2026-03-02 02:14:26 -07:00
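Word-boundary truncation of a directory name, as described for _sanitize_dirname, can be sketched like this. The character whitelist and the "untitled" fallback are assumptions; only the 50-character limit and the word-boundary rule come from the commit message.

```python
import re


def sanitize_dirname(title: str, max_len: int = 50) -> str:
    """Reduce a heading to a filesystem-friendly name of at most max_len chars."""
    lines = title.strip().splitlines()
    first_line = lines[0] if lines else ""
    # Assumed whitelist: word characters, whitespace, dots and hyphens.
    cleaned = re.sub(r"[^\w\s.-]", "", first_line)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if len(cleaned) <= max_len:
        return cleaned or "untitled"

    cut = cleaned[:max_len]
    # If the limit falls inside a word, back up to the previous space.
    if cleaned[max_len] != " " and " " in cut:
        cut = cut[: cut.rindex(" ")]
    return cut.rstrip()
```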
5161a5f952 🚀 v2.0.14: Configurable PDF size limit via MCP_PDF_MAX_SIZE
2026-02-19 15:52:00 -07:00
38af9ee2c9 🚀 v2.0.13: Smart vector extraction in pdf_to_markdown
2026-02-18 15:29:41 -07:00
271e4c71d6 🔧 v2.0.9: Remove unreleased permit_forms mixin that broke PyPI install
2026-02-08 13:48:32 -07:00
e4f77008bb 🚀 v2.0.8: Add extract_vector_graphics tool for PDF to SVG extraction
New tool extracts vector graphics from PDF pages as SVG files, supporting
three modes: full_page (PyMuPDF native SVG), drawings_only (raw vector
paths), and both. Handles lines, curves, rectangles, quads with proper
color space conversion (RGB, grayscale, CMYK). No new dependencies.
2026-02-02 13:56:17 -07:00
dfbf3d1870 🔧 v2.0.7: Fix table extraction token overflow with smart limiting
PROBLEM:
Table extraction from large PDFs was exceeding MCP's 25,000 token limit,
causing "response too large" errors. A 5-page PDF with large tables
generated 59,005 tokens, more than double the allowed limit.

SOLUTION:
Added flexible table data limiting with two new parameters:
- max_rows_per_table: Limit rows returned per table (prevents overflow)
- summary_only: Return only metadata without table data

IMPLEMENTATION:
1. Added new parameters to extract_tables() method signature
2. Created _process_table_data() helper for consistent limiting logic
3. Updated all 3 extraction methods (Camelot, pdfplumber, Tabula)
4. Enhanced table metadata with truncation tracking:
   - total_rows: Full row count from PDF
   - rows_returned: Actual rows in response (after limiting)
   - rows_truncated: Number of rows omitted (if limited)

USAGE EXAMPLES:
# Summary mode - metadata only (smallest response)
extract_tables(pdf_path, pages="1-5", summary_only=True)

# Limited data - first 100 rows per table
extract_tables(pdf_path, pages="1-5", max_rows_per_table=100)

# Full data (default behavior, may overflow on large tables)
extract_tables(pdf_path, pages="1-5")

BENEFITS:
- Prevents MCP token overflow errors
- Maintains backward compatibility (new params are optional)
- Clear guidance through metadata (shows when truncation occurred)
- Flexible - users choose between summary/limited/full modes

FILES MODIFIED:
- src/mcp_pdf/mixins_official/table_extraction.py (all changes)
- src/mcp_pdf/server.py (version bump to 2.0.7)
- pyproject.toml (version bump to 2.0.7)

VERSION: 2.0.7
PUBLISHED: https://pypi.org/project/mcp-pdf/2.0.7/
2025-11-03 18:26:34 -07:00
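The row-limiting helper described in the v2.0.7 commit could look roughly like the sketch below. It is not the project's `_process_table_data`; the signature, the `data` key, and the choice to report every row as truncated in summary mode are assumptions. The three metadata fields are the ones named in the commit message.

```python
def process_table_data(rows, max_rows_per_table=None, summary_only=False):
    """Apply row limiting to one table and record what was omitted."""
    total_rows = len(rows)
    if summary_only:
        returned = []  # metadata only, no table data
    elif max_rows_per_table is not None:
        returned = rows[:max_rows_per_table]
    else:
        returned = rows  # default: full data

    result = {
        "total_rows": total_rows,
        "rows_returned": len(returned),
        "rows_truncated": total_rows - len(returned),
    }
    if not summary_only:
        result["data"] = returned
    return result
```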
fa65fa6e0c 🔧 v2.0.6: Fix async/await bug in validate_output_path calls
Remove incorrect 'await' keywords from validate_output_path() calls across all mixins.
validate_output_path() is a synchronous function, not async.

Fixed in 15 locations across 6 mixins:
- advanced_forms.py (4 calls)
- annotations.py (3 calls)
- document_assembly.py (2 calls)
- form_management.py (2 calls)
- image_processing.py (1 call)
- misc_tools.py (4 calls)

Error: "object PosixPath can't be used in 'await' expression"
Root cause: Incorrectly awaiting synchronous Path validation function
Fix: Removed await keyword from all validate_output_path() calls

PyPI: https://pypi.org/project/mcp-pdf/2.0.6/
2025-11-03 18:03:34 -07:00
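The v2.0.6 bug is easy to reproduce in isolation. In the sketch below, `validate_output_path` is a hypothetical stand-in that only resolves the path; the point is that awaiting the return value of a synchronous function raises `TypeError` at runtime.

```python
import asyncio
from pathlib import Path


def validate_output_path(path: str) -> Path:
    """Hypothetical synchronous validator: just resolves the path."""
    return Path(path).resolve()


async def buggy(path: str) -> Path:
    # Wrong: the function returns a Path, which is not awaitable.
    return await validate_output_path(path)


async def fixed(path: str) -> Path:
    # Right: call the synchronous function directly.
    return validate_output_path(path)
```

The exact wording of the `TypeError` message differs between Python versions, so code that handles it should not match on the text.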
3327137536 🚀 v2.0.5: Fix page range parsing across all PDF tools
Major architectural improvements and bug fixes in the v2.0.x series:

## v2.0.5 - Page Range Parsing (Current Release)
- Fix page range parsing bug affecting 6 mixins (e.g., "93-95" or "11-30")
- Create shared parse_pages_parameter() utility function
- Support mixed formats: "1,3-5,7,10-15"
- Update: pdf_utilities, content_analysis, image_processing, misc_tools, table_extraction, text_extraction

## v2.0.4 - Chunk Hint Fix
- Fix next_chunk_hint to show correct page ranges
- Dynamic calculation based on actual pages being extracted
- Example: "30-50" now correctly shows "40-49" for next chunk

## v2.0.3 - Initial Range Support
- Add page range support to text extraction ("11-30")
- Fix _parse_pages_parameter to handle ranges with Python's range()
- Convert 1-based user input to 0-based internal indexing

## v2.0.2 - Lazy Import Fix
- Fix ModuleNotFoundError for reportlab on startup
- Implement lazy imports for optional dependencies
- Graceful degradation with helpful error messages

## v2.0.1 - Dependency Restructuring
- Move reportlab to optional [forms] extra
- Document installation: uvx --with mcp-pdf[forms] mcp-pdf

## v2.0.0 - Official FastMCP Pattern Migration
- Migrate to official fastmcp.contrib.mcp_mixin pattern
- Create 12 specialized mixins with 42 tools total
- Architecture: mixins_official/ using MCPMixin base class
- Backwards compatibility: server_legacy.py preserved

Technical Improvements:
- Centralized utility functions (DRY principle)
- Consistent behavior across all PDF tools
- Better error messages with actionable instructions
- Library-specific adapters for table extraction

Files Changed:
- New: src/mcp_pdf/mixins_official/utils.py (shared utilities)
- Updated: 6 mixins with improved page parsing
- Version: pyproject.toml, server.py → 2.0.5

PyPI: https://pypi.org/project/mcp-pdf/2.0.5/
2025-11-03 17:12:37 -07:00
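A shared page-range parser of the kind described in v2.0.5 might be sketched as follows. This is not the project's `parse_pages_parameter`; the name `parse_pages` and the error handling are assumptions. It accepts the mixed format from the commit message and converts 1-based input to 0-based indices.

```python
def parse_pages(spec: str) -> list[int]:
    """Parse a spec such as "1,3-5,7,10-15" into 0-based page indices."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"Invalid page range: {part!r}")
            # Inclusive 1-based range becomes 0-based indices start-1 .. end-1.
            pages.extend(range(start - 1, end))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"Invalid page number: {part!r}")
            pages.append(page - 1)
    return pages
```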
8cbf542df1 🔧 Fix output path security with MCP_PDF_ALLOWED_PATHS environment variable
BREAKING ISSUE FIXED:
- Users reported "Output path not allowed: images" error
- extract_images tool was rejecting relative paths due to overly restrictive security

NEW SECURITY MODEL:
- MCP_PDF_ALLOWED_PATHS environment variable controls allowed output directories
- If unset: Allows any directory with "security theater" warnings
- If set: Restricts outputs to specified colon-separated paths
- Cross-platform compatible (: on Unix, ; on Windows)

SECURITY PHILOSOPHY ENHANCED:
- "TRUST NO ONE" - honest about application-level security limitations
- Clear warnings that this is "security theater"
- Emphasis on OS-level permissions and process isolation
- Educational guidance on real security practices

TECHNICAL CHANGES:
- validate_output_path() rewritten with environment variable control
- Path validation uses relative_to() for proper containment checking
- Enhanced warning messages with security education
- Updated documentation with honest security assessment

DOCUMENTATION UPDATES:
- Added MCP_PDF_ALLOWED_PATHS to configuration section
- New "REAL Security" section with OS-level recommendations
- Clear explanation of security theater vs actual protection

Version: 1.1.1 (patch version for critical bugfix)
2025-09-23 23:40:05 -06:00
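The containment check described above can be sketched with `Path.relative_to()` and `os.pathsep`. The function name and the boolean return value are assumptions (the real `validate_output_path()` presumably returns a path and emits warnings); the environment variable semantics follow the commit message.

```python
import os
from pathlib import Path
from typing import Optional


def is_output_allowed(output_path: str, allowed_env: Optional[str]) -> bool:
    """Check output_path against an MCP_PDF_ALLOWED_PATHS-style value."""
    if not allowed_env:
        # Unset: any directory is allowed (the commit says a warning is shown).
        return True

    target = Path(output_path).resolve()
    # os.pathsep is ":" on Unix and ";" on Windows.
    for entry in allowed_env.split(os.pathsep):
        if not entry:
            continue
        try:
            target.relative_to(Path(entry).resolve())
            return True  # target is the allowed directory or lies inside it
        except ValueError:
            continue  # not under this directory; try the next one
    return False
```

Resolving both sides before comparing is what makes `..` segments ineffective as an escape. In real use the second argument would come from `os.environ.get("MCP_PDF_ALLOWED_PATHS")`.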
ebf6bb8a43 🚀 Release v1.0.1: Bug fixes and local development tools
- Fix variable scope bug in extract_text function
- Add local development setup with claude-mcp-manager
- Update author information
- Add comprehensive local development documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-07 00:58:51 -06:00
8d01c44d4f 🚀 Rename to mcp-pdf and prepare for PyPI publication
**Package Rebranding:**
- Renamed package from mcp-pdf-tools to mcp-pdf (cleaner name)
- Updated version to 1.0.0 (production ready with security hardening)
- Updated all import paths and references throughout codebase

**PyPI Preparation:**
- Enhanced package description and metadata
- Added proper project URLs and homepage
- Updated CLI command from mcp-pdf-tools to mcp-pdf
- Built distribution packages (wheel + source)

**Testing & Validation:**
- All 20 security tests pass with new package structure
- Local installation and import tests successful
- CLI command working correctly
- Package ready for PyPI publication

The secure, production-ready PDF processing platform is now ready
for public distribution and installation via pip.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-06 15:42:59 -06:00
75f8548668 🔒 Comprehensive security hardening and vulnerability fixes
Implemented extensive security improvements to prevent attacks and ensure
production readiness:

**Critical Security Fixes:**
- Fixed path traversal vulnerability in get_pdf_image function
- Added file size limits (100MB PDFs, 50MB images) to prevent DoS
- Implemented secure output path validation with directory restrictions
- Added page count limits (1000 pages max) for resource protection
- Secured JSON parameter parsing with 10KB size limits

**Access Control & Validation:**
- URL allowlisting with SSRF protection (blocks localhost, internal IPs)
- IPv6 security handling for comprehensive host blocking
- Input validation framework with length limits and sanitization
- Secure file permissions (0o700 dirs, 0o600 files)

**Error Handling & Privacy:**
- Sanitized error messages to prevent information disclosure
- Automatic removal of sensitive patterns (paths, emails, SSNs)
- Generic error responses for failed operations

**Infrastructure & Monitoring:**
- Added security scanning tools (safety, pip-audit)
- GitHub Actions workflow for continuous vulnerability monitoring
- Daily automated security assessments
- Fixed pypdf vulnerability (5.9.0 → 6.0.0)

**Testing & Validation:**
- 20 comprehensive security tests (all passing)
- Integration tests confirming functionality preservation
- Zero known vulnerabilities in dependencies
- Validated all security functions work correctly

All security measures tested and verified. Project now production-ready
with enterprise-grade security posture.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-06 15:35:31 -06:00
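The SSRF host blocking mentioned in the hardening commit (localhost, internal IPs, IPv6) can be illustrated with the standard `ipaddress` module. This is a simplified sketch with an assumed function name. It inspects only literal hosts and does not resolve DNS, so a hostname that points at an internal address would pass; a production check would need to handle that as well.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_host(url: str) -> bool:
    """Return True if the URL targets localhost or a non-public IP literal."""
    host = urlparse(url).hostname
    if host is None:
        return True  # no host at all: refuse
    if host.lower() == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)  # handles IPv4 and IPv6 literals
    except ValueError:
        return False  # a DNS name; NOT resolved in this sketch
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_unspecified
    )
```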
95596e0236 Add comprehensive PDF form creation and validation tools
- Add complete PDF form lifecycle management
- Create new forms with text, checkbox, dropdown, signature fields
- Fill existing forms with JSON data and optional flattening
- Add fields to existing PDFs with flexible positioning
- Advanced field types: radio groups, textareas, date fields
- Comprehensive validation engine with regex patterns
- Email, phone, number, date format validation
- Required field checking and length constraints
- Visual validation cues with asterisks and format hints
- Multi-field error reporting with detailed feedback
- International character support and edge case handling
- Enterprise-ready for complex business forms
2025-09-03 02:33:01 -06:00
c902e81e4d Initial commit: Complete MCP PDF Tools server implementation
Features:
- 8 comprehensive PDF processing tools with intelligent fallbacks
- Text extraction (PyMuPDF, pdfplumber, pypdf with auto-selection)
- Table extraction (Camelot → pdfplumber → Tabula fallback chain)
- OCR processing with Tesseract and preprocessing options
- Document analysis (structure, metadata, scanned detection)
- Image extraction with filtering capabilities
- PDF to markdown conversion with metadata
- Built on FastMCP framework with full MCP protocol support
- Comprehensive error handling and user-friendly messages
- Docker support and cross-platform compatibility
- Complete test suite and examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-10 16:36:21 -06:00
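The extraction fallback chain from the initial commit (Camelot → pdfplumber → Tabula) follows a generic pattern: try each backend in order and return the first usable result. The sketch below takes the backends as plain callables, so no PDF library is needed to run it; the function name and response keys are assumptions.

```python
def extract_with_fallback(pdf_path, extractors):
    """Try (name, callable) extractors in order; return the first non-empty result."""
    errors = {}
    for name, extractor in extractors:
        try:
            tables = extractor(pdf_path)
        except Exception as exc:  # any backend failure moves on to the next one
            errors[name] = str(exc)
            continue
        if tables:
            return {"method": name, "tables": tables, "errors": errors}
        errors[name] = "no tables found"
    return {"method": None, "tables": [], "errors": errors}
```

Recording why each earlier backend was skipped keeps the final response debuggable without raising on the first failure.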