Preview is now capped at 10 sections, each rendered as a human-readable
line, and detection_info has moved into the JSON file. Response size went
from ~22k tokens (inline) to ~1.6k (v2.1.2) to ~224 tokens now.
detect_structure now writes full JSON to disk and returns a compact
summary (~1k tokens) instead of the full structure tree (~20k tokens).
This prevents MCP context overflow on large documents. Set inline=True to
get the full data in the response (used internally by split_pdf_by_structure).
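
The write-to-disk pattern described above can be sketched roughly as
follows. This is a minimal illustration, not the project's actual code:
the function name, the "sections" / "title" / "start_page" keys, and the
preview_limit parameter are all assumptions; only the inline flag and the
idea of a compact summary come from the notes above.

```python
import json
import tempfile
from pathlib import Path


def summarize_structure(structure, out_path, inline=False, preview_limit=10):
    """Write the full structure to disk and return a compact summary.

    Hypothetical helper: names and dict keys are illustrative only.
    """
    out_path = Path(out_path)
    # The full tree always goes to disk, so nothing is lost.
    out_path.write_text(json.dumps(structure, indent=2), encoding="utf-8")

    sections = structure.get("sections", [])
    summary = {
        "json_path": str(out_path),
        "section_count": len(sections),
        # One short human-readable line per section, capped at preview_limit.
        "preview": [
            f"{s['title']} (p. {s['start_page']})"
            for s in sections[:preview_limit]
        ],
    }
    if inline:
        # Callers that need the whole tree opt in explicitly.
        summary["structure"] = structure
    return summary
```

With 25 sections, the returned preview holds 10 lines while section_count
still reports 25; the rest stays in the JSON file unless inline=True.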
- Two-pass span collection includes sandwiched non-heading spans (e.g. ² in I²C),
  so superscripts between heading-sized spans aren't dropped
- Join heading line parts without spaces ("".join) for proper glyph concatenation
- Cut numbering-pattern titles at the first newline, then cap them at 80 chars
  with a word-boundary break
- Reduce _sanitize_dirname max from 80 to 50 chars, truncating at a word boundary
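
Word-boundary truncation, as used in the last two items above, might look
something like this. The helper name and its fallback behaviour (a hard
cut when there is no space to break on) are assumptions for illustration;
the real _sanitize_dirname may differ.

```python
def truncate_at_word_boundary(text, max_len=50):
    """Shorten text to at most max_len chars without splitting a word.

    Hypothetical helper; not the project's _sanitize_dirname.
    """
    text = text.strip()
    if len(text) <= max_len:
        return text
    cut = text[:max_len]
    # If the next character is whitespace, the cut already ends on a whole word.
    if text[max_len].isspace():
        return cut.rstrip()
    head, sep, _ = cut.rpartition(" ")
    # With no space to break on, fall back to a hard cut.
    return head.rstrip() if sep else cut
```

Breaking at the last space keeps directory names readable, at the cost of
sometimes returning noticeably fewer than max_len characters.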
New StructureDetectionMixin with 3 tools:
- detect_structure: finds chapters/sections via bookmarks, font-size
  heuristics, numbering patterns, and user-supplied regex
- split_pdf_by_structure: auto-splits a PDF into per-chapter directories
  with markdown + images + vectors in one call
- batch_extract: processes N user-specified page ranges from one PDF
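
As a rough illustration of the numbering-pattern strategy mentioned for
detect_structure, the sketch below matches lines such as "1 Introduction"
or "2.3 Scope" and derives a nesting level from the dots. The regex, the
function name, and the returned tuple shape are assumptions, not the
tool's actual behaviour; a user-supplied pattern would be passed in place
of the default and is assumed here to define "number" and "title" groups.

```python
import re

# Hypothetical default: optional "Chapter"/"Section" prefix, dotted number,
# then a title starting with a capital letter.
NUMBERED_HEADING = re.compile(
    r"^(?:(?:Chapter|Section)\s+)?"
    r"(?P<number>\d+(?:\.\d+)*)\.?\s+"
    r"(?P<title>[A-Z].*)$"
)


def find_numbered_headings(lines, pattern=NUMBERED_HEADING):
    """Return (line_index, number, level, title) for each matching line."""
    headings = []
    for index, line in enumerate(lines):
        match = pattern.match(line.strip())
        if match:
            number = match.group("number")
            # "2" is level 1, "2.3" is level 2, and so on.
            level = number.count(".") + 1
            headings.append((index, number, level, match.group("title")))
    return headings
```

Requiring a capitalized title is a cheap way to skip body text that
happens to start with a number, such as "3.3 volts is typical".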
Enhanced pdf_to_markdown:
- output_filename parameter for custom .md filenames
- vector_diagnostics reporting for skipped pages
- vector_fallback_raster: render sub-threshold pages as PNG at 150 DPI
Bumps version to 2.1.0
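
For the vector_fallback_raster option listed above, the page selection
and the 150 DPI scaling could be sketched as below. The notes do not say
what the threshold measures; this sketch assumes it is a minimum count of
vector drawing operations per page. The function names and the
min_vectors parameter are hypothetical. The only fixed fact used is that
PDF user space defaults to 72 points per inch.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 pt = 1/72 inch


def plan_raster_fallback(vector_counts, min_vectors, dpi=150):
    """Map each sub-threshold page number to the zoom needed for `dpi`.

    vector_counts: {page_number: number_of_vector_ops} (assumed input shape).
    """
    zoom = dpi / POINTS_PER_INCH
    return {
        page: zoom
        for page, count in sorted(vector_counts.items())
        if count < min_vectors
    }


def pixel_size(width_pt, height_pt, dpi=150):
    """Pixel dimensions of a page rendered at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )
```

At 150 DPI a US Letter page (612 x 792 pt) renders to 1275 x 1650 pixels,
which gives a sense of the PNG sizes this fallback produces.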