mcp-pdf-tools

Author	SHA1	Message	Date
Ryan Malloy	81a3619144	📉 Slim detect_structure response to ~224 tokens. Preview capped at 10 sections as human-readable lines; detection_info moved into the JSON file. Response went from ~22k tokens (inline) to ~1.6k (v2.1.2) to ~224 tokens now.	2026-03-04 17:15:32 -07:00
Ryan Malloy	a23fd8467a	📉 File-first output for detect_structure: 20× context reduction. detect_structure now writes full JSON to disk and returns a compact summary (~1k tokens) instead of the full structure tree (~20k tokens). Prevents MCP context overflow on large documents. Set inline=True to get full data in response (used internally by split_pdf_by_structure).	2026-03-04 17:12:36 -07:00
Ryan Malloy	56ab8356bc	🐛 Fix superscript handling and directory name truncation in detect_structure. Changes: two-pass span collection includes sandwiched non-heading spans (e.g. ² in I²C) so superscripts between heading-sized spans aren't dropped; join heading line parts without spaces ("".join) for proper glyph concatenation; cap numbering-pattern title at first newline + 80 chars with word-boundary break; reduce _sanitize_dirname max from 80→50 chars with word-boundary truncation.	2026-03-02 02:14:26 -07:00
Ryan Malloy	823318ec15	✨ Chapter-aware PDF extraction: detect_structure, split_pdf_by_structure, batch_extract. New StructureDetectionMixin with 3 tools: detect_structure finds chapters/sections via bookmarks, font-size heuristics, numbering patterns, and user-supplied regex; split_pdf_by_structure auto-splits a PDF into per-chapter directories with markdown + images + vectors in one call; batch_extract processes N user-specified page ranges from one PDF. Enhanced pdf_to_markdown: output_filename parameter for custom .md filenames; vector_diagnostics reporting for skipped pages; vector_fallback_raster renders sub-threshold pages as PNG at 150 DPI. Bumps version to 2.1.0.	2026-03-01 23:52:15 -07:00

4 Commits
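
Commit 56ab8356bc mentions capping directory names at 50 characters with word-boundary truncation. The log does not show the actual code, so the following is only a rough sketch of that idea. The function name `sanitize_dirname`, the character whitelist, and the use of underscores as word separators are all assumptions, not the project's `_sanitize_dirname`.

```python
import re


def sanitize_dirname(title: str, max_len: int = 50) -> str:
    """Hypothetical stand-in for _sanitize_dirname; not the project's code."""
    # Replace anything that is not a word character, whitespace, or hyphen.
    cleaned = re.sub(r"[^\w\s-]", " ", title)
    # Collapse whitespace runs into single underscores (assumed separator).
    cleaned = re.sub(r"\s+", "_", cleaned.strip())
    if len(cleaned) <= max_len:
        return cleaned
    truncated = cleaned[:max_len]
    # If the cut already falls at the end of a word, keep it as-is.
    if cleaned[max_len] == "_":
        return truncated.rstrip("_-")
    # Otherwise back up to the last separator so no word is cut in half.
    head, sep, _ = truncated.rpartition("_")
    # A single over-long word has no separator; fall back to a hard cut.
    return (head if sep else truncated).rstrip("_-")
```

Backing up to the last separator trades a few characters of length for names that stay readable, which seems to be the intent behind the 80→50 change described in the commit.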