✨ Smart vector extraction in pdf_to_markdown
Detect significant vector graphics (charts, schematics, diagrams) during markdown conversion and extract each qualifying page as a full-page SVG in the vectors/ subdirectory. Uses a multi-tier heuristic (drawing count, path complexity, bounding box) adapted from extract_charts to avoid false positives on decorative borders. New params: include_vectors, vector_min_drawings, vector_min_complexity.
parent 213a721949
commit f759634687
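The heuristic named in the commit message can be sketched outside the mixin as follows. This is a minimal illustration, not the committed code: it assumes each drawing is a plain dictionary with an `items` list and a `rect` given as an `(x0, y0, x1, y1)` tuple, whereas the commit works on the dictionaries returned by `page.get_drawings()` and reads the size from a `fitz.Rect`. The function name and the sample data are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a drawing record: "items" holds the path
# segments, "rect" is (x0, y0, x1, y1). The commit uses fitz.Rect instead.
Drawing = Dict[str, Any]


def is_vector_significant(
    drawings: List[Drawing],
    min_drawings: int = 5,
    min_complexity: int = 50,
) -> bool:
    """Mirror the three tiers in the same order as the commit."""
    # Tier 1: drawing count gate. Pages with only a few strokes
    # (for example a decorative border) stop here.
    if len(drawings) < min_drawings:
        return False

    # Tier 2: total path complexity across all drawings on the page.
    total_items = sum(len(d.get("items", [])) for d in drawings)
    if total_items >= min_complexity:
        return True

    # Tier 3: one complex drawing with a large bounding box.
    for d in drawings:
        x0, y0, x1, y1 = d.get("rect", (0, 0, 0, 0))
        if len(d.get("items", [])) > 20 and (x1 - x0 > 200 or y1 - y0 > 150):
            return True
    return False


# Illustrative data only.
border = [{"items": ["line"], "rect": (0, 0, 600, 1)}] * 4
chart = [{"items": ["line"] * 12, "rect": (50, 50, 300, 250)}] * 5
big = {"items": ["curve"] * 25, "rect": (0, 0, 250, 100)}
small = {"items": ["line"], "rect": (0, 0, 10, 10)}

print(is_vector_significant(border))             # False: fails the count gate
print(is_vector_significant(chart))              # True: 60 items in total
print(is_vector_significant([big] + [small] * 4))  # True: one large, complex drawing
print(is_vector_significant([big]))              # False: count gate runs first
```

Because the count gate runs first, the third tier only applies to pages that already contain at least `min_drawings` drawings. Lowering `vector_min_drawings` is the way to let a page with a single large diagram through.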
@ -91,7 +91,7 @@ uv publish
|
|||||||
2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
|
2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
|
||||||
3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
|
3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
|
||||||
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
|
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
|
||||||
5. **Format Conversion**: `pdf_to_markdown` - Writes markdown + extracted images to disk by default, returns path + preview. Images use relative `./images/` paths. Set `inline=True` for full markdown in response.
|
5. **Format Conversion**: `pdf_to_markdown` - Writes markdown + extracted raster images and vector graphics (SVG) to disk by default, returns path + preview. Images use relative `./images/` paths, vectors use `./vectors/` paths. Set `inline=True` for full markdown in response. Set `include_vectors=False` to skip vector extraction.
|
||||||
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
|
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
|
||||||
7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
|
7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
|
||||||
8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
|
8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
|
||||||
@ -104,7 +104,8 @@ uv publish
|
|||||||
- **Custom Output Paths**: `extract_images` allows users to specify where images are saved
|
- **Custom Output Paths**: `extract_images` allows users to specify where images are saved
|
||||||
- **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata
|
- **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata
|
||||||
- **File-First Output**: `extract_text` and `pdf_to_markdown` write results to files by default, returning paths + short previews instead of full content — prevents MCP context overflow on large PDFs
|
- **File-First Output**: `extract_text` and `pdf_to_markdown` write results to files by default, returning paths + short previews instead of full content — prevents MCP context overflow on large PDFs
|
||||||
- **Disk-Based Images**: `pdf_to_markdown` always extracts images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers
|
- **Disk-Based Images**: `pdf_to_markdown` extracts raster images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers
|
||||||
|
- **Vector Graphics**: `pdf_to_markdown` auto-detects significant vector content (charts, schematics, diagrams) and extracts full-page SVGs to `{output_directory}/vectors/` with relative `./vectors/` paths. Controlled by `include_vectors` parameter (default: True)
|
||||||
- **Inline Escape Hatch**: Both tools accept `inline=True` to return full content in the response for small queries
|
- **Inline Escape Hatch**: Both tools accept `inline=True` to return full content in the response for small queries
|
||||||
- **User Control**: Flexible output directory support with automatic directory creation
|
- **User Control**: Flexible output directory support with automatic directory creation
|
||||||
|
|
||||||
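The output layout described in the README bullets above can be sketched with `pathlib`. The file name pattern (`{stem}_page_{n}.svg`) and the `vectors/` subdirectory come from this commit; the helper name `vector_paths` and the alt text in the markdown reference are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Tuple


def vector_paths(output_directory: str, pdf_path: str, page_num: int) -> Tuple[Path, str]:
    """Return the on-disk SVG path and a relative markdown reference.

    page_num is zero-based, matching the page loop in the conversion code.
    """
    svg_filename = f"{Path(pdf_path).stem}_page_{page_num + 1}.svg"
    svg_path = Path(output_directory) / "vectors" / svg_filename
    # Relative link, so the .md file and vectors/ can be moved together.
    # The alt text here is hypothetical.
    markdown_ref = f"![Vector graphics, page {page_num + 1}](./vectors/{svg_filename})"
    return svg_path, markdown_ref


svg_path, ref = vector_paths("out", "docs/report.pdf", 2)
print(svg_path.name)  # report_page_3.svg
print(ref)            # ![Vector graphics, page 3](./vectors/report_page_3.svg)
```

Keeping the reference relative is what makes the output usable in Starlight, browsers, and standard renderers without rewriting paths.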
|
|||||||
@ -215,8 +215,9 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
@mcp_tool(
|
@mcp_tool(
|
||||||
name="pdf_to_markdown",
|
name="pdf_to_markdown",
|
||||||
description=(
|
description=(
|
||||||
"Convert PDF to markdown and write to a .md file. Images are extracted "
|
"Convert PDF to markdown and write to a .md file. Raster images are "
|
||||||
"to {output_directory}/images/ with relative ./images/ paths. Returns "
|
"extracted to {output_directory}/images/ and vector graphics (charts, "
|
||||||
|
"schematics, diagrams) to {output_directory}/vectors/ as SVG. Returns "
|
||||||
"the output file path and a short preview — full markdown is in the file. "
|
"the output file path and a short preview — full markdown is in the file. "
|
||||||
"Set inline=True to get full markdown in the response instead."
|
"Set inline=True to get full markdown in the response instead."
|
||||||
)
|
)
|
||||||
@ -231,19 +232,24 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
min_width: int = 100,
|
min_width: int = 100,
|
||||||
min_height: int = 100,
|
min_height: int = 100,
|
||||||
image_format: str = "png",
|
image_format: str = "png",
|
||||||
inline: bool = False
|
inline: bool = False,
|
||||||
|
include_vectors: bool = True,
|
||||||
|
vector_min_drawings: int = 5,
|
||||||
|
vector_min_complexity: int = 50,
|
||||||
) -> Dict[str, Any]:
|
) -> Dict[str, Any]:
|
||||||
"""
|
"""
|
||||||
Convert PDF to clean markdown format and write to file.
|
Convert PDF to clean markdown format and write to file.
|
||||||
|
|
||||||
By default, writes markdown to a file and extracts images to an images/
|
By default, writes markdown to a file, extracts raster images to an images/
|
||||||
subdirectory with relative paths. Returns file path + summary to avoid
|
subdirectory, and extracts significant vector graphics (charts, schematics,
|
||||||
filling the MCP context window. Set inline=True for full markdown in response.
|
diagrams) to a vectors/ subdirectory as SVG. Returns file path + summary to
|
||||||
|
avoid filling the MCP context window. Set inline=True for full markdown in
|
||||||
|
response.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
pdf_path: Path to PDF file or HTTPS URL
|
pdf_path: Path to PDF file or HTTPS URL
|
||||||
pages: Page numbers to convert (comma-separated, 1-based), None for all
|
pages: Page numbers to convert (comma-separated, 1-based), None for all
|
||||||
include_images: Whether to include images in markdown
|
include_images: Whether to include raster images in markdown
|
||||||
include_metadata: Whether to include document metadata
|
include_metadata: Whether to include document metadata
|
||||||
output_directory: Directory for output .md file and images/ subdirectory.
|
output_directory: Directory for output .md file and images/ subdirectory.
|
||||||
Defaults to a temp directory if not specified.
|
Defaults to a temp directory if not specified.
|
||||||
@ -251,6 +257,10 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
min_height: Minimum image height to extract (filters small decorative images)
|
min_height: Minimum image height to extract (filters small decorative images)
|
||||||
image_format: Image format - "png" or "jpg"
|
image_format: Image format - "png" or "jpg"
|
||||||
inline: Return full markdown in response instead of writing to file
|
inline: Return full markdown in response instead of writing to file
|
||||||
|
include_vectors: Extract significant vector graphics as SVG (default: True).
|
||||||
|
Detects charts, schematics, and technical drawings automatically.
|
||||||
|
vector_min_drawings: Minimum drawing count per page to consider (default: 5)
|
||||||
|
vector_min_complexity: Minimum total path items for extraction (default: 50)
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Dictionary with output_file path and summary, or full markdown if inline=True
|
Dictionary with output_file path and summary, or full markdown if inline=True
|
||||||
@ -275,7 +285,9 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
# Setup output directory — always needed (file output is the default)
|
# Setup output directory — always needed (file output is the default)
|
||||||
images_extracted = 0
|
images_extracted = 0
|
||||||
images_skipped = 0
|
images_skipped = 0
|
||||||
|
vectors_extracted = 0
|
||||||
extracted_image_info = []
|
extracted_image_info = []
|
||||||
|
extracted_vector_info = []
|
||||||
|
|
||||||
if output_directory:
|
if output_directory:
|
||||||
output_dir = validate_output_path(output_directory)
|
output_dir = validate_output_path(output_directory)
|
||||||
@ -284,6 +296,9 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
output_dir.mkdir(parents=True, exist_ok=True)
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
images_dir = output_dir / "images"
|
images_dir = output_dir / "images"
|
||||||
images_dir.mkdir(parents=True, exist_ok=True)
|
images_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
if include_vectors:
|
||||||
|
vectors_dir = output_dir / "vectors"
|
||||||
|
vectors_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
markdown_parts = []
|
markdown_parts = []
|
||||||
|
|
||||||
@ -361,6 +376,37 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
logger.warning(f"Failed to process image {img_index + 1} on page {page_num + 1}: {e}")
|
logger.warning(f"Failed to process image {img_index + 1} on page {page_num + 1}: {e}")
|
||||||
images_skipped += 1
|
images_skipped += 1
|
||||||
|
|
||||||
|
# Extract significant vector graphics as SVG
|
||||||
|
if include_vectors:
|
||||||
|
try:
|
||||||
|
drawings = page.get_drawings()
|
||||||
|
if self._is_vector_significant(
|
||||||
|
drawings, vector_min_drawings, vector_min_complexity
|
||||||
|
):
|
||||||
|
base_name = input_pdf_path.stem
|
||||||
|
svg_content = page.get_svg_image(text_as_path=False)
|
||||||
|
svg_filename = f"{base_name}_page_{page_num + 1}.svg"
|
||||||
|
svg_path = vectors_dir / svg_filename
|
||||||
|
with open(svg_path, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(svg_content)
|
||||||
|
file_size = svg_path.stat().st_size
|
||||||
|
extracted_vector_info.append({
|
||||||
|
"filename": svg_filename,
|
||||||
|
"path": str(svg_path),
|
||||||
|
"page": page_num + 1,
|
||||||
|
"drawing_count": len(drawings),
|
||||||
|
"total_items": sum(
|
||||||
|
len(d.get("items", [])) for d in drawings
|
||||||
|
),
|
||||||
|
"size_bytes": file_size,
|
||||||
|
})
|
||||||
|
vectors_extracted += 1
|
||||||
|
markdown_parts.append(
|
||||||
|
f"\n\n"
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Failed to extract vectors from page {page_num + 1}: {e}")
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"Failed to process page {page_num + 1}: {e}")
|
logger.warning(f"Failed to process page {page_num + 1}: {e}")
|
||||||
markdown_parts.append(f"*[Error processing page {page_num + 1}: {str(e)[:100]}]*\n\n")
|
markdown_parts.append(f"*[Error processing page {page_num + 1}: {str(e)[:100]}]*\n\n")
|
||||||
@ -382,25 +428,33 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
"line_count": line_count,
|
"line_count": line_count,
|
||||||
"character_count": char_count,
|
"character_count": char_count,
|
||||||
"images_extracted": images_extracted,
|
"images_extracted": images_extracted,
|
||||||
"images_skipped": images_skipped
|
"images_skipped": images_skipped,
|
||||||
|
"vectors_extracted": vectors_extracted,
|
||||||
}
|
}
|
||||||
|
|
||||||
# Inline mode: return full markdown in response
|
# Inline mode: return full markdown in response
|
||||||
if inline:
|
if inline:
|
||||||
return {
|
result = {
|
||||||
"success": True,
|
"success": True,
|
||||||
"markdown": full_markdown,
|
"markdown": full_markdown,
|
||||||
"conversion_summary": conversion_summary,
|
"conversion_summary": conversion_summary,
|
||||||
"image_output": {
|
"image_output": {
|
||||||
"images_directory": str(images_dir),
|
"images_directory": str(images_dir),
|
||||||
"images": extracted_image_info
|
"images": extracted_image_info,
|
||||||
},
|
},
|
||||||
"file_info": {
|
"file_info": {
|
||||||
"input_path": str(input_pdf_path),
|
"input_path": str(input_pdf_path),
|
||||||
"pages_processed": pages or "all"
|
"pages_processed": pages or "all",
|
||||||
},
|
},
|
||||||
"conversion_time": round(time.time() - start_time, 2)
|
"conversion_time": round(time.time() - start_time, 2),
|
||||||
}
|
}
|
||||||
|
if include_vectors and extracted_vector_info:
|
||||||
|
result["vector_output"] = {
|
||||||
|
"vectors_directory": str(vectors_dir),
|
||||||
|
"vectors_extracted": vectors_extracted,
|
||||||
|
"vectors": extracted_vector_info,
|
||||||
|
}
|
||||||
|
return result
|
||||||
|
|
||||||
# File output mode (default): write .md file, return path + summary
|
# File output mode (default): write .md file, return path + summary
|
||||||
md_path = output_dir / f"{input_pdf_path.stem}.md"
|
md_path = output_dir / f"{input_pdf_path.stem}.md"
|
||||||
@ -415,7 +469,7 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
preview = preview[:last_period + 1]
|
preview = preview[:last_period + 1]
|
||||||
preview += " [...]"
|
preview += " [...]"
|
||||||
|
|
||||||
return {
|
result = {
|
||||||
"success": True,
|
"success": True,
|
||||||
"output_file": str(md_path),
|
"output_file": str(md_path),
|
||||||
"markdown_preview": preview,
|
"markdown_preview": preview,
|
||||||
@ -427,17 +481,24 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
"filter_settings": {
|
"filter_settings": {
|
||||||
"min_width": min_width,
|
"min_width": min_width,
|
||||||
"min_height": min_height,
|
"min_height": min_height,
|
||||||
"image_format": image_format
|
"image_format": image_format,
|
||||||
},
|
},
|
||||||
"images": extracted_image_info
|
"images": extracted_image_info,
|
||||||
},
|
},
|
||||||
"file_info": {
|
"file_info": {
|
||||||
"input_path": str(input_pdf_path),
|
"input_path": str(input_pdf_path),
|
||||||
"output_directory": str(output_dir),
|
"output_directory": str(output_dir),
|
||||||
"pages_processed": pages or "all"
|
"pages_processed": pages or "all",
|
||||||
},
|
},
|
||||||
"conversion_time": round(time.time() - start_time, 2)
|
"conversion_time": round(time.time() - start_time, 2),
|
||||||
}
|
}
|
||||||
|
if include_vectors and extracted_vector_info:
|
||||||
|
result["vector_output"] = {
|
||||||
|
"vectors_directory": str(vectors_dir),
|
||||||
|
"vectors_extracted": vectors_extracted,
|
||||||
|
"vectors": extracted_vector_info,
|
||||||
|
}
|
||||||
|
return result
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
error_msg = sanitize_error_message(str(e))
|
error_msg = sanitize_error_message(str(e))
|
||||||
@ -482,6 +543,26 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
markdown_patterns = ['# ', '## ', '### ', '* ', '- ', '1. ', '**', '__']
|
markdown_patterns = ['# ', '## ', '### ', '* ', '- ', '1. ', '**', '__']
|
||||||
return any(pattern in line for pattern in markdown_patterns)
|
return any(pattern in line for pattern in markdown_patterns)
|
||||||
|
|
||||||
|
def _is_vector_significant(self, drawings, min_drawings=5, min_complexity=50):
|
||||||
|
"""Detect if a page's drawings represent meaningful vector content (charts, schematics).
|
||||||
|
|
||||||
|
Uses a multi-tier heuristic adapted from extract_charts:
|
||||||
|
1. Drawing count gate: filters pages with only border lines
|
||||||
|
2. Total path complexity: charts and schematics have many path items
|
||||||
|
3. Single complex drawing: catches a large diagram on pages that pass the count gate but fall short of the total complexity threshold
|
||||||
|
"""
|
||||||
|
if len(drawings) < min_drawings:
|
||||||
|
return False
|
||||||
|
total_items = sum(len(d.get("items", [])) for d in drawings)
|
||||||
|
if total_items >= min_complexity:
|
||||||
|
return True
|
||||||
|
for d in drawings:
|
||||||
|
items = d.get("items", [])
|
||||||
|
rect = d.get("rect", fitz.Rect(0, 0, 0, 0))
|
||||||
|
if len(items) > 20 and (rect.width > 200 or rect.height > 150):
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
@mcp_tool(
|
@mcp_tool(
|
||||||
name="extract_vector_graphics",
|
name="extract_vector_graphics",
|
||||||
description="Extract vector graphics from PDF to SVG format. Ideal for schematics, charts, and technical drawings."
|
description="Extract vector graphics from PDF to SVG format. Ideal for schematics, charts, and technical drawings."
|
||||||
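A caller might read the new `vector_output` block along these lines. This is a sketch under assumptions: the key names are the ones added in this commit, but the helper `summarize_vectors` and the sample values are hypothetical. Since `vector_output` is only attached when `include_vectors` is set and at least one SVG was written, the lookup uses `.get()` rather than indexing.

```python
from typing import Any, Dict, List


def summarize_vectors(result: Dict[str, Any]) -> List[str]:
    """Return one line per extracted SVG, or an empty list if none were written."""
    vector_output = result.get("vector_output")
    if not vector_output:
        return []
    return [
        f"page {v['page']}: {v['filename']} "
        f"({v['drawing_count']} drawings, {v['total_items']} items, "
        f"{v['size_bytes']} bytes)"
        for v in vector_output["vectors"]
    ]


# Illustrative result only; the values are made up.
example_result = {
    "success": True,
    "vector_output": {
        "vectors_directory": "out/vectors",
        "vectors_extracted": 1,
        "vectors": [
            {
                "filename": "report_page_3.svg",
                "path": "out/vectors/report_page_3.svg",
                "page": 3,
                "drawing_count": 42,
                "total_items": 310,
                "size_bytes": 18234,
            }
        ],
    },
}

for line in summarize_vectors(example_result):
    print(line)
# page 3: report_page_3.svg (42 drawings, 310 items, 18234 bytes)
```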