4 Commits

Author SHA1 Message Date
81a3619144 📉 Slim detect_structure response to ~224 tokens
Preview is capped at 10 sections, rendered as human-readable lines, and
detection_info now lives in the JSON file. Response size went from ~22k
tokens (inline) to ~1.6k (v2.1.2) to ~224 tokens now.
2026-03-04 17:15:32 -07:00
a23fd8467a 📉 File-first output for detect_structure — 20× context reduction
detect_structure now writes full JSON to disk and returns a compact
summary (~1k tokens) instead of the full structure tree (~20k tokens).
Prevents MCP context overflow on large documents. Set inline=True to
get full data in response (used internally by split_pdf_by_structure).
2026-03-04 17:12:36 -07:00
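The file-first pattern described in a23fd8467a can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `write_structure_result`, the `sections` shape (dicts with `title` and `page`), and the summary keys are all hypothetical. Only the `inline` flag and the idea of writing full JSON to disk while returning a compact summary come from the commit message.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def write_structure_result(
    sections: List[Dict[str, Any]],
    out_path: Path,
    inline: bool = False,
    preview_limit: int = 10,
) -> Dict[str, Any]:
    """Write the full structure to disk and return a compact summary.

    Hypothetical helper: names and summary keys are assumptions made
    for illustration only.
    """
    # The full payload always goes to disk, however large it is.
    payload = {"sections": sections}
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

    # The response carries only a pointer to the file plus a short preview.
    summary: Dict[str, Any] = {
        "output_file": str(out_path),
        "section_count": len(sections),
        "preview": [
            f"{s['title']} (p. {s['page']})" for s in sections[:preview_limit]
        ],
    }

    # Callers that need the whole tree in the response opt in explicitly.
    if inline:
        summary["sections"] = sections
    return summary
```

Keeping the full tree behind an explicit opt-in means the default response size stays bounded by `preview_limit` rather than growing with the document.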
56ab8356bc 🐛 Fix superscript handling and directory name truncation in detect_structure
- Two-pass span collection includes sandwiched non-heading spans (e.g. ² in I²C)
  so superscripts between heading-sized spans aren't dropped
- Join heading line parts without spaces ("".join) for proper glyph concatenation
- Cap numbering-pattern title at first newline + 80 chars with word boundary break
- Reduce _sanitize_dirname max from 80→50 chars with word-boundary truncation
2026-03-02 02:14:26 -07:00
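Two of the fixes in 56ab8356bc can be illustrated with a small sketch. It assumes a simplified span model of `(text, font_size)` tuples and uses hypothetical function names (`collect_heading_text`, `truncate_at_word_boundary`); the real `_sanitize_dirname` and span collection code may differ. Only the ideas come from the commit message: keep small spans sandwiched between heading-sized ones, join parts with `"".join`, and truncate at a word boundary with a 50-character cap.

```python
from typing import List, Tuple

# Hypothetical simplified span: (text, font size in points).
Span = Tuple[str, float]


def collect_heading_text(spans: List[Span], min_heading_size: float) -> str:
    """Join heading-sized spans, keeping small spans sandwiched between them."""
    # Pass 1: mark spans that qualify as heading-sized on their own.
    is_heading = [size >= min_heading_size for _, size in spans]

    # Pass 2: also keep a smaller span (e.g. a superscript) when both of
    # its neighbours are heading-sized.
    keep = list(is_heading)
    for i in range(1, len(spans) - 1):
        if not is_heading[i] and is_heading[i - 1] and is_heading[i + 1]:
            keep[i] = True

    # Join without separators so glyphs such as "I", "²", "C" stay adjacent.
    return "".join(text for (text, _), k in zip(spans, keep) if k)


def truncate_at_word_boundary(name: str, max_len: int = 50) -> str:
    """Shorten name to at most max_len characters without splitting a word."""
    if len(name) <= max_len:
        return name
    cut = name[:max_len]
    # If the cut lands mid-word, back up to the last space inside the cut.
    if name[max_len] != " " and " " in cut:
        cut = cut[: cut.rindex(" ")]
    return cut.rstrip()
```

A single word longer than `max_len` has no boundary to fall back on, so this sketch hard-cuts it at the limit.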
823318ec15 Chapter-aware PDF extraction: detect_structure, split_pdf_by_structure, batch_extract
New StructureDetectionMixin with 3 tools:
- detect_structure: finds chapters/sections via bookmarks, font-size
  heuristics, numbering patterns, and user-supplied regex
- split_pdf_by_structure: auto-splits PDF into per-chapter directories
  with markdown + images + vectors in one call
- batch_extract: process N user-specified page ranges from one PDF

Enhanced pdf_to_markdown:
- output_filename parameter for custom .md filenames
- vector_diagnostics reporting for skipped pages
- vector_fallback_raster: render sub-threshold pages as PNG at 150 DPI

Bumps version to 2.1.0
2026-03-01 23:52:15 -07:00