🐛 File-first output for extract_text and pdf_to_markdown

Both tools now write to disk by default and return a file path plus a
short preview instead of the full content inline. This prevents MCP
context overflow on large PDFs. Set inline=True for the old behavior.

pdf_to_markdown always extracts images to ./images/ with relative paths
(no more dead pdf-image:// URIs). extract_text writes a .txt file.
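
For callers that previously read the full content straight from the response, a minimal sketch of handling both response shapes might look like the following. The helper name `full_content` is hypothetical and not part of this commit; the keys `text`, `markdown`, and `output_file` are the ones returned by the tools in the diff below.

```python
from pathlib import Path
from typing import Any, Dict


def full_content(result: Dict[str, Any]) -> str:
    """Return the complete text or markdown for a tool result (hypothetical helper)."""
    # inline=True responses carry the content directly, under "text"
    # (extract_text) or "markdown" (pdf_to_markdown).
    for key in ("text", "markdown"):
        if key in result:
            return result[key]
    # Default file-first responses carry only a path and a short preview,
    # so the full content has to be read back from disk.
    return Path(result["output_file"]).read_text(encoding="utf-8")
```

The preview fields (`text_preview`, `markdown_preview`) are deliberately ignored here, since they are truncated and only meant for a quick look.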
Ryan Malloy 2026-02-18 15:01:43 -07:00
parent 2d5f7e241d
commit 772bcac0df
3 changed files with 205 additions and 140 deletions


@ -87,11 +87,11 @@ uv publish
### Tool Categories ### Tool Categories
1. **Text Extraction**: `extract_text` - Intelligent method selection with automatic chunking for large files 1. **Text Extraction**: `extract_text` - Writes extracted text to a .txt file by default, returns path + preview. Set `inline=True` for full text in response.
2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula 2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options 3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata` 4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
5. **Format Conversion**: `pdf_to_markdown` - Convert PDF to markdown. With `output_directory`, extracts images to disk with relative `./images/` paths. Without, uses `pdf-image://` MCP resource URIs (legacy). 5. **Format Conversion**: `pdf_to_markdown` - Writes markdown + extracted images to disk by default, returns path + preview. Images use relative `./images/` paths. Set `inline=True` for full markdown in response.
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output 6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization 7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management 8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
@ -103,9 +103,9 @@ uv publish
**Optimized for MCP Context Management:** **Optimized for MCP Context Management:**
- **Custom Output Paths**: `extract_images` allows users to specify where images are saved - **Custom Output Paths**: `extract_images` allows users to specify where images are saved
- **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata - **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata
- **Resource URIs**: `pdf_to_markdown` uses `pdf-image://{image_id}` protocol when no `output_directory` is set (legacy mode) - **File-First Output**: `extract_text` and `pdf_to_markdown` write results to files by default, returning paths + short previews instead of full content — prevents MCP context overflow on large PDFs
- **Disk-Based Images**: When `output_directory` is provided, `pdf_to_markdown` extracts images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers - **Disk-Based Images**: `pdf_to_markdown` always extracts images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers
- **Prevents Context Overflow**: Avoids verbose output that fills client message windows - **Inline Escape Hatch**: Both tools accept `inline=True` to return full content in the response for small queries
- **User Control**: Flexible output directory support with automatic directory creation - **User Control**: Flexible output directory support with automatic directory creation
### Intelligent Fallbacks and Token Management ### Intelligent Fallbacks and Token Management

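The output directory handling described in the README notes above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `prepare_output_dirs` is a hypothetical name, and plain `Path()` stands in for the project's `validate_output_path()`, whose checks are not shown in this diff.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def prepare_output_dirs(output_directory: Optional[str] = None,
                        prefix: str = "pdf_markdown_") -> Tuple[Path, Path]:
    """Return (output_dir, images_dir), creating both if needed (hypothetical helper)."""
    if output_directory:
        # Assumption: the real tool validates this path with
        # validate_output_path(); Path() is only a stand-in here.
        output_dir = Path(output_directory)
    else:
        # No directory given: fall back to a fresh temp directory.
        output_dir = Path(tempfile.mkdtemp(prefix=prefix))
    images_dir = output_dir / "images"
    # parents=True also creates output_dir when it does not exist yet.
    images_dir.mkdir(parents=True, exist_ok=True)
    return output_dir, images_dir
```

Because the markdown refers to images as `./images/<filename>`, keeping the `.md` file and the `images/` folder under the same parent is what makes the relative links resolve.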

@ -215,10 +215,10 @@ class ImageProcessingMixin(MCPMixin):
@mcp_tool( @mcp_tool(
name="pdf_to_markdown", name="pdf_to_markdown",
description=( description=(
"Convert PDF to markdown. When output_directory is provided, images are " "Convert PDF to markdown and write to a .md file. Images are extracted "
"extracted to {output_directory}/images/ with relative ./images/ paths in " "to {output_directory}/images/ with relative ./images/ paths. Returns "
"the markdown — ready for Starlight, browsers, or any renderer. " "the output file path and a short preview — full markdown is in the file. "
"Without output_directory, images use pdf-image:// MCP resource URIs." "Set inline=True to get full markdown in the response instead."
) )
) )
async def pdf_to_markdown( async def pdf_to_markdown(
@ -231,30 +231,29 @@ class ImageProcessingMixin(MCPMixin):
min_width: int = 100, min_width: int = 100,
min_height: int = 100, min_height: int = 100,
image_format: str = "png", image_format: str = "png",
save_markdown: bool = False inline: bool = False
) -> Dict[str, Any]: ) -> Dict[str, Any]:
""" """
Convert PDF to clean markdown format. Convert PDF to clean markdown format and write to file.
Two image modes: By default, writes markdown to a file and extracts images to an images/
- With output_directory: extracts images to disk, uses relative paths in markdown. subdirectory with relative paths. Returns file path + summary to avoid
Images are filtered by min_width/min_height (matching extract_images behavior). filling the MCP context window. Set inline=True for full markdown in response.
- Without output_directory: uses pdf-image:// MCP resource URIs (legacy behavior).
Args: Args:
pdf_path: Path to PDF file or HTTPS URL pdf_path: Path to PDF file or HTTPS URL
pages: Page numbers to convert (comma-separated, 1-based), None for all pages: Page numbers to convert (comma-separated, 1-based), None for all
include_images: Whether to include images in markdown include_images: Whether to include images in markdown
include_metadata: Whether to include document metadata include_metadata: Whether to include document metadata
output_directory: Directory for extracted images and optional markdown file. output_directory: Directory for output .md file and images/ subdirectory.
When set, images go to {output_directory}/images/ with relative paths. Defaults to a temp directory if not specified.
min_width: Minimum image width to extract (only when output_directory is set) min_width: Minimum image width to extract (filters small decorative images)
min_height: Minimum image height to extract (only when output_directory is set) min_height: Minimum image height to extract (filters small decorative images)
image_format: Image format - "png" or "jpg" (only when output_directory is set) image_format: Image format - "png" or "jpg"
save_markdown: Save markdown to {output_directory}/{filename}.md inline: Return full markdown in response instead of writing to file
Returns: Returns:
Dictionary containing markdown content and metadata Dictionary with output_file path and summary, or full markdown if inline=True
""" """
start_time = time.time() start_time = time.time()
@ -273,17 +272,18 @@ class ImageProcessingMixin(MCPMixin):
pages_to_process = parsed_pages if parsed_pages else list(range(total_pages)) pages_to_process = parsed_pages if parsed_pages else list(range(total_pages))
pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages] pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages]
# Setup output directory for image extraction # Setup output directory — always needed (file output is the default)
images_dir = None
images_extracted = 0 images_extracted = 0
images_skipped = 0 images_skipped = 0
extracted_image_info = [] extracted_image_info = []
if output_directory: if output_directory:
output_dir = validate_output_path(output_directory) output_dir = validate_output_path(output_directory)
output_dir.mkdir(parents=True, exist_ok=True) else:
images_dir = output_dir / "images" output_dir = Path(tempfile.mkdtemp(prefix="pdf_markdown_"))
images_dir.mkdir(parents=True, exist_ok=True) output_dir.mkdir(parents=True, exist_ok=True)
images_dir = output_dir / "images"
images_dir.mkdir(parents=True, exist_ok=True)
markdown_parts = [] markdown_parts = []
@ -322,53 +322,44 @@ class ImageProcessingMixin(MCPMixin):
for img_index, img in enumerate(image_list): for img_index, img in enumerate(image_list):
try: try:
alt_text = f"Image {img_index + 1} from page {page_num + 1}" alt_text = f"Image {img_index + 1} from page {page_num + 1}"
xref = img[0]
pix = fitz.Pixmap(doc, xref)
if images_dir: if pix.width < min_width or pix.height < min_height:
# Disk mode: extract image, filter by size, save to images/ images_skipped += 1
xref = img[0]
pix = fitz.Pixmap(doc, xref)
if pix.width < min_width or pix.height < min_height:
images_skipped += 1
pix = None
continue
# Convert CMYK to RGB if necessary
if pix.n - pix.alpha >= 4:
pix = fitz.Pixmap(fitz.csRGB, pix)
base_name = input_pdf_path.stem
filename = f"{base_name}_page_{page_num + 1}_img_{img_index + 1}.{image_format}"
img_path = images_dir / filename
if image_format.lower() in ["jpg", "jpeg"]:
pix.save(str(img_path), "JPEG")
else:
pix.save(str(img_path), "PNG")
file_size = img_path.stat().st_size
extracted_image_info.append({
"filename": filename,
"path": str(img_path),
"page": page_num + 1,
"width": pix.width,
"height": pix.height,
"size_bytes": file_size
})
images_extracted += 1
pix = None pix = None
continue
markdown_parts.append(f"![{alt_text}](./images/{filename})\n\n") # Convert CMYK to RGB if necessary
if pix.n - pix.alpha >= 4:
pix = fitz.Pixmap(fitz.csRGB, pix)
base_name = input_pdf_path.stem
filename = f"{base_name}_page_{page_num + 1}_img_{img_index + 1}.{image_format}"
img_path = images_dir / filename
if image_format.lower() in ["jpg", "jpeg"]:
pix.save(str(img_path), "JPEG")
else: else:
# Legacy mode: pdf-image:// MCP resource URI pix.save(str(img_path), "PNG")
image_id = f"page_{page_num + 1}_img_{img_index + 1}"
mcp_uri = f"pdf-image://{image_id}" file_size = img_path.stat().st_size
markdown_parts.append(f"![{alt_text}]({mcp_uri})\n\n") extracted_image_info.append({
"filename": filename,
"path": str(img_path),
"page": page_num + 1,
"width": pix.width,
"height": pix.height,
"size_bytes": file_size
})
images_extracted += 1
pix = None
markdown_parts.append(f"![{alt_text}](./images/{filename})\n\n")
except Exception as e: except Exception as e:
logger.warning(f"Failed to process image {img_index + 1} on page {page_num + 1}: {e}") logger.warning(f"Failed to process image {img_index + 1} on page {page_num + 1}: {e}")
if images_dir: images_skipped += 1
images_skipped += 1
except Exception as e: except Exception as e:
logger.warning(f"Failed to process page {page_num + 1}: {e}") logger.warning(f"Failed to process page {page_num + 1}: {e}")
@ -379,42 +370,57 @@ class ImageProcessingMixin(MCPMixin):
# Combine all markdown parts # Combine all markdown parts
full_markdown = "".join(markdown_parts) full_markdown = "".join(markdown_parts)
# Save markdown file if requested
markdown_path = None
if save_markdown and output_directory:
md_path = output_dir / f"{input_pdf_path.stem}.md"
with open(md_path, 'w', encoding='utf-8') as f:
f.write(full_markdown)
markdown_path = str(md_path)
# Calculate statistics # Calculate statistics
word_count = len(full_markdown.split()) word_count = len(full_markdown.split())
line_count = len(full_markdown.split('\n')) line_count = len(full_markdown.split('\n'))
char_count = len(full_markdown) char_count = len(full_markdown)
result = { conversion_summary = {
"success": True, "pages_converted": len(pages_to_process),
"markdown": full_markdown, "total_pages": total_pages,
"conversion_summary": { "word_count": word_count,
"pages_converted": len(pages_to_process), "line_count": line_count,
"total_pages": total_pages, "character_count": char_count,
"word_count": word_count, "images_extracted": images_extracted,
"line_count": line_count, "images_skipped": images_skipped
"character_count": char_count,
"includes_images": include_images,
"includes_metadata": include_metadata,
"images_extracted": images_extracted,
"images_skipped": images_skipped
},
"file_info": {
"input_path": str(input_pdf_path),
"pages_processed": pages or "all"
},
"conversion_time": round(time.time() - start_time, 2)
} }
if images_dir: # Inline mode: return full markdown in response
result["image_output"] = { if inline:
return {
"success": True,
"markdown": full_markdown,
"conversion_summary": conversion_summary,
"image_output": {
"images_directory": str(images_dir),
"images": extracted_image_info
},
"file_info": {
"input_path": str(input_pdf_path),
"pages_processed": pages or "all"
},
"conversion_time": round(time.time() - start_time, 2)
}
# File output mode (default): write .md file, return path + summary
md_path = output_dir / f"{input_pdf_path.stem}.md"
with open(md_path, 'w', encoding='utf-8') as f:
f.write(full_markdown)
# Build preview (first ~500 chars at sentence boundary)
preview = full_markdown[:500]
if len(full_markdown) > 500:
last_period = preview.rfind('.')
if last_period > 300:
preview = preview[:last_period + 1]
preview += " [...]"
return {
"success": True,
"output_file": str(md_path),
"markdown_preview": preview,
"conversion_summary": conversion_summary,
"image_output": {
"images_directory": str(images_dir), "images_directory": str(images_dir),
"images_extracted": images_extracted, "images_extracted": images_extracted,
"images_skipped": images_skipped, "images_skipped": images_skipped,
@ -424,17 +430,14 @@ class ImageProcessingMixin(MCPMixin):
"image_format": image_format "image_format": image_format
}, },
"images": extracted_image_info "images": extracted_image_info
} },
else: "file_info": {
result["mcp_integration"] = { "input_path": str(input_pdf_path),
"image_uri_format": "pdf-image://{image_id}", "output_directory": str(output_dir),
"description": "Images use MCP resource URIs. Set output_directory for disk-based images with relative paths." "pages_processed": pages or "all"
} },
"conversion_time": round(time.time() - start_time, 2)
if markdown_path: }
result["markdown_path"] = markdown_path
return result
except Exception as e: except Exception as e:
error_msg = sanitize_error_message(str(e)) error_msg = sanitize_error_message(str(e))

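The preview logic in the hunk above (and repeated in `extract_text` below) can be read in isolation as a small function. The sketch below mirrors the diff's thresholds of 500 and 300 characters; the function name `build_preview` and its parameters are hypothetical, since the commit inlines this logic rather than factoring it out.

```python
def build_preview(text: str, limit: int = 500, min_cut: int = 300) -> str:
    """Return up to `limit` chars of text, cut at a sentence end when possible."""
    preview = text[:limit]
    if len(text) > limit:
        # Prefer ending on the last period, but only if that keeps
        # more than `min_cut` characters of preview.
        last_period = preview.rfind(".")
        if last_period > min_cut:
            preview = preview[: last_period + 1]
        # Mark the preview as truncated.
        preview += " [...]"
    return preview
```

Text at or under the limit is returned unchanged, so the `" [...]"` marker appears only when the file holds more than the preview shows.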

@ -3,8 +3,8 @@ Text Extraction Mixin - PDF text extraction, OCR, and scanned PDF detection
Uses official fastmcp.contrib.mcp_mixin pattern Uses official fastmcp.contrib.mcp_mixin pattern
""" """
import asyncio
import time import time
import tempfile
from pathlib import Path from pathlib import Path
from typing import Dict, Any, Optional, List from typing import Dict, Any, Optional, List
import logging import logging
@ -18,7 +18,7 @@ import io
# Official FastMCP mixin # Official FastMCP mixin
from fastmcp.contrib.mcp_mixin import MCPMixin, mcp_tool from fastmcp.contrib.mcp_mixin import MCPMixin, mcp_tool
from ..security import validate_pdf_path, sanitize_error_message from ..security import validate_pdf_path, validate_output_path, sanitize_error_message
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -36,30 +36,43 @@ class TextExtractionMixin(MCPMixin):
@mcp_tool( @mcp_tool(
name="extract_text", name="extract_text",
description="Extract text from PDF with intelligent method selection and automatic chunking for large files" description=(
"Extract text from PDF and write to a .txt file. Returns the output "
"file path and a short preview — full text is in the file, not in the "
"response. Use output_directory to control where the file is saved, "
"or set inline=True to get full text in the response instead."
)
) )
async def extract_text( async def extract_text(
self, self,
pdf_path: str, pdf_path: str,
pages: Optional[str] = None, pages: Optional[str] = None,
method: str = "auto", method: str = "auto",
preserve_layout: bool = False,
output_directory: Optional[str] = None,
inline: bool = False,
chunk_pages: int = 10, chunk_pages: int = 10,
max_tokens: int = 20000, max_tokens: int = 20000
preserve_layout: bool = False
) -> Dict[str, Any]: ) -> Dict[str, Any]:
""" """
Extract text from PDF with intelligent method selection. Extract text from PDF with intelligent method selection.
By default, writes extracted text to a file and returns the path with
a short preview. This prevents large extractions from filling the MCP
context window. Set inline=True for the old behavior (full text in response).
Args: Args:
pdf_path: Path to PDF file or HTTPS URL pdf_path: Path to PDF file or HTTPS URL
pages: Page numbers to extract (comma-separated, 1-based), None for all pages: Page numbers to extract (comma-separated, 1-based), None for all
method: Extraction method ("auto", "pymupdf", "pdfplumber", "pypdf") method: Extraction method ("auto", "pymupdf", "pdfplumber", "pypdf")
chunk_pages: Number of pages per chunk for large files
max_tokens: Maximum tokens per response to prevent overflow
preserve_layout: Whether to preserve text layout and formatting preserve_layout: Whether to preserve text layout and formatting
output_directory: Directory to save the text file (default: temp directory)
inline: Return full text in response instead of writing to file
chunk_pages: Pages per chunk when inline=True (ignored for file output)
max_tokens: Max chars when inline=True (ignored for file output)
Returns: Returns:
Dictionary containing extracted text and metadata Dictionary with output_file path and summary, or full text if inline=True
""" """
start_time = time.time() start_time = time.time()
@ -84,44 +97,93 @@ class TextExtractionMixin(MCPMixin):
"extraction_time": 0 "extraction_time": 0
} }
# Check if chunking is needed # Inline mode: old behavior with chunking/truncation
if len(pages_to_extract) > chunk_pages: if inline:
return await self._extract_text_chunked( if len(pages_to_extract) > chunk_pages:
doc, path, pages_to_extract, method, chunk_pages, return await self._extract_text_chunked(
max_tokens, preserve_layout, start_time doc, path, pages_to_extract, method, chunk_pages,
) max_tokens, preserve_layout, start_time
)
# Extract text from specified pages extraction_result = await self._extract_text_from_pages(
doc, pages_to_extract, method, preserve_layout
)
doc.close()
if len(extraction_result["text"]) > max_tokens:
truncated_text = extraction_result["text"][:max_tokens]
last_period = truncated_text.rfind('.')
if last_period > max_tokens * 0.8:
truncated_text = truncated_text[:last_period + 1]
extraction_result["text"] = truncated_text
extraction_result["truncated"] = True
extraction_result["truncation_reason"] = f"Response too large (>{max_tokens} chars)"
extraction_result.update({
"success": True,
"file_info": {
"path": str(path),
"total_pages": total_pages,
"pages_extracted": len(pages_to_extract),
"pages_requested": pages or "all"
},
"extraction_time": round(time.time() - start_time, 2)
})
return extraction_result
# File output mode (default): extract all requested pages, write to file
extraction_result = await self._extract_text_from_pages( extraction_result = await self._extract_text_from_pages(
doc, pages_to_extract, method, preserve_layout doc, pages_to_extract, method, preserve_layout
) )
doc.close() doc.close()
# Check token limit and truncate if necessary full_text = extraction_result["text"]
if len(extraction_result["text"]) > max_tokens:
truncated_text = extraction_result["text"][:max_tokens]
# Try to truncate at sentence boundary
last_period = truncated_text.rfind('.')
if last_period > max_tokens * 0.8: # If we can find a good break point
truncated_text = truncated_text[:last_period + 1]
extraction_result["text"] = truncated_text # Setup output directory
extraction_result["truncated"] = True if output_directory:
extraction_result["truncation_reason"] = f"Response too large (>{max_tokens} chars)" output_dir = validate_output_path(output_directory)
output_dir.mkdir(parents=True, exist_ok=True)
else:
output_dir = Path(tempfile.mkdtemp(prefix="pdf_text_"))
extraction_result.update({ # Write text to file
output_filename = f"{path.stem}.txt"
output_path = output_dir / output_filename
with open(output_path, 'w', encoding='utf-8') as f:
f.write(full_text)
# Build preview (first ~500 chars at sentence boundary)
preview = full_text[:500]
if len(full_text) > 500:
last_period = preview.rfind('.')
if last_period > 300:
preview = preview[:last_period + 1]
preview += " [...]"
word_count = len(full_text.split())
char_count = len(full_text)
file_size = output_path.stat().st_size
return {
"success": True, "success": True,
"file_info": { "output_file": str(output_path),
"path": str(path), "text_preview": preview,
"total_pages": total_pages, "extraction_summary": {
"word_count": word_count,
"character_count": char_count,
"file_size_bytes": file_size,
"file_size_kb": round(file_size / 1024, 1),
"pages_extracted": len(pages_to_extract), "pages_extracted": len(pages_to_extract),
"total_pages": total_pages,
"method_used": extraction_result.get("method_used", method)
},
"file_info": {
"input_path": str(path),
"total_pages": total_pages,
"pages_requested": pages or "all" "pages_requested": pages or "all"
}, },
"extraction_time": round(time.time() - start_time, 2) "extraction_time": round(time.time() - start_time, 2)
}) }
return extraction_result
except Exception as e: except Exception as e:
error_msg = sanitize_error_message(str(e)) error_msg = sanitize_error_message(str(e))