🐛 File-first output for extract_text and pdf_to_markdown

Both tools now write to disk by default and return file path + short preview instead of full content inline. Prevents MCP context overflow on large PDFs. Set inline=True for the old behavior. pdf_to_markdown always extracts images to ./images/ with relative paths (no more dead pdf-image:// URIs). extract_text writes a .txt file.
2026-02-18 15:01:43 -07:00 · 2026-02-18 15:01:43 -07:00 · 772bcac0df
commit 772bcac0df
parent 2d5f7e241d
3 changed files with 205 additions and 140 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@ -87,11 +87,11 @@ uv publish
 ### Tool Categories
-1. **Text Extraction**: `extract_text` - Intelligent method selection with automatic chunking for large files
+1. **Text Extraction**: `extract_text` - Writes extracted text to a .txt file by default, returns path + preview. Set `inline=True` for full text in response.
 2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
 3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
 4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
-5. **Format Conversion**: `pdf_to_markdown` - Convert PDF to markdown. With `output_directory`, extracts images to disk with relative `./images/` paths. Without, uses `pdf-image://` MCP resource URIs (legacy).
+5. **Format Conversion**: `pdf_to_markdown` - Writes markdown + extracted images to disk by default, returns path + preview. Images use relative `./images/` paths. Set `inline=True` for full markdown in response.
 6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
 7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
 8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
@ -103,9 +103,9 @@ uv publish
 **Optimized for MCP Context Management:**
 - **Custom Output Paths**: `extract_images` allows users to specify where images are saved
 - **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata
- **Resource URIs**: `pdf_to_markdown` uses `pdf-image://{image_id}` protocol when no `output_directory` is set (legacy mode)
+- **File-First Output**: `extract_text` and `pdf_to_markdown` write results to files by default, returning paths + short previews instead of full content — prevents MCP context overflow on large PDFs
- **Disk-Based Images**: When `output_directory` is provided, `pdf_to_markdown` extracts images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers  
+- **Disk-Based Images**: `pdf_to_markdown` always extracts images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers
- **Prevents Context Overflow**: Avoids verbose output that fills client message windows
+- **Inline Escape Hatch**: Both tools accept `inline=True` to return full content in the response for small queries
 - **User Control**: Flexible output directory support with automatic directory creation
 ### Intelligent Fallbacks and Token Management
--- a/src/mcp_pdf/mixins_official/image_processing.py
+++ b/src/mcp_pdf/mixins_official/image_processing.py
@ -215,10 +215,10 @@ class ImageProcessingMixin(MCPMixin):
    @mcp_tool(
        name="pdf_to_markdown",
        description=(
-            "Convert PDF to markdown. When output_directory is provided, images are "
+            "Convert PDF to markdown and write to a .md file. Images are extracted "
-            "extracted to {output_directory}/images/ with relative ./images/ paths in "
+            "to {output_directory}/images/ with relative ./images/ paths. Returns "
-            "the markdown — ready for Starlight, browsers, or any renderer. "
+            "the output file path and a short preview — full markdown is in the file. "
-            "Without output_directory, images use pdf-image:// MCP resource URIs."
+            "Set inline=True to get full markdown in the response instead."
        )
    )
    async def pdf_to_markdown(
@ -231,30 +231,29 @@ class ImageProcessingMixin(MCPMixin):
        min_width: int = 100,
        min_height: int = 100,
        image_format: str = "png",
-        save_markdown: bool = False
+        inline: bool = False
    ) -> Dict[str, Any]:
        """
-        Convert PDF to clean markdown format.
+        Convert PDF to clean markdown format and write to file.
-        Two image modes:
+        By default, writes markdown to a file and extracts images to an images/
-        - With output_directory: extracts images to disk, uses relative paths in markdown.
+        subdirectory with relative paths. Returns file path + summary to avoid
-          Images are filtered by min_width/min_height (matching extract_images behavior).
+        filling the MCP context window. Set inline=True for full markdown in response.
        - Without output_directory: uses pdf-image:// MCP resource URIs (legacy behavior).
        Args:
            pdf_path: Path to PDF file or HTTPS URL
            pages: Page numbers to convert (comma-separated, 1-based), None for all
            include_images: Whether to include images in markdown
            include_metadata: Whether to include document metadata
-            output_directory: Directory for extracted images and optional markdown file.
+            output_directory: Directory for output .md file and images/ subdirectory.
-                When set, images go to {output_directory}/images/ with relative paths.
+                Defaults to a temp directory if not specified.
-            min_width: Minimum image width to extract (only when output_directory is set)
+            min_width: Minimum image width to extract (filters small decorative images)
-            min_height: Minimum image height to extract (only when output_directory is set)
+            min_height: Minimum image height to extract (filters small decorative images)
-            image_format: Image format - "png" or "jpg" (only when output_directory is set)
+            image_format: Image format - "png" or "jpg"
-            save_markdown: Save markdown to {output_directory}/{filename}.md
+            inline: Return full markdown in response instead of writing to file
        Returns:
-            Dictionary containing markdown content and metadata
+            Dictionary with output_file path and summary, or full markdown if inline=True
        """
        start_time = time.time()
@ -273,17 +272,18 @@ class ImageProcessingMixin(MCPMixin):
            pages_to_process = parsed_pages if parsed_pages else list(range(total_pages))
            pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages]
-            # Setup output directory for image extraction
+            # Setup output directory — always needed (file output is the default)
            images_dir = None
            images_extracted = 0
            images_skipped = 0
            extracted_image_info = []
            if output_directory:
                output_dir = validate_output_path(output_directory)
-                output_dir.mkdir(parents=True, exist_ok=True)
+            else:
-                images_dir = output_dir / "images"
+                output_dir = Path(tempfile.mkdtemp(prefix="pdf_markdown_"))
-                images_dir.mkdir(parents=True, exist_ok=True)
+            output_dir.mkdir(parents=True, exist_ok=True)
            images_dir = output_dir / "images"
            images_dir.mkdir(parents=True, exist_ok=True)
            markdown_parts = []
@ -322,53 +322,44 @@ class ImageProcessingMixin(MCPMixin):
                        for img_index, img in enumerate(image_list):
                            try:
                                alt_text = f"Image {img_index + 1} from page {page_num + 1}"
                                xref = img[0]
                                pix = fitz.Pixmap(doc, xref)
-                                if images_dir:
+                                if pix.width < min_width or pix.height < min_height:
-                                    # Disk mode: extract image, filter by size, save to images/
+                                    images_skipped += 1
                                    xref = img[0]
                                    pix = fitz.Pixmap(doc, xref)
                                    if pix.width < min_width or pix.height < min_height:
                                        images_skipped += 1
                                        pix = None
                                        continue
                                    # Convert CMYK to RGB if necessary
                                    if pix.n - pix.alpha >= 4:
                                        pix = fitz.Pixmap(fitz.csRGB, pix)
                                    base_name = input_pdf_path.stem
                                    filename = f"{base_name}_page_{page_num + 1}_img_{img_index + 1}.{image_format}"
                                    img_path = images_dir / filename
                                    if image_format.lower() in ["jpg", "jpeg"]:
                                        pix.save(str(img_path), "JPEG")
                                    else:
                                        pix.save(str(img_path), "PNG")
                                    file_size = img_path.stat().st_size
                                    extracted_image_info.append({
                                        "filename": filename,
                                        "path": str(img_path),
                                        "page": page_num + 1,
                                        "width": pix.width,
                                        "height": pix.height,
                                        "size_bytes": file_size
                                    })
                                    images_extracted += 1
                                    pix = None
                                    continue
-                                    markdown_parts.append(f"![{alt_text}](./images/{filename})\n\n")
+                                # Convert CMYK to RGB if necessary
                                if pix.n - pix.alpha >= 4:
                                    pix = fitz.Pixmap(fitz.csRGB, pix)
                                base_name = input_pdf_path.stem
                                filename = f"{base_name}_page_{page_num + 1}_img_{img_index + 1}.{image_format}"
                                img_path = images_dir / filename
                                if image_format.lower() in ["jpg", "jpeg"]:
                                    pix.save(str(img_path), "JPEG")
                                else:
-                                    # Legacy mode: pdf-image:// MCP resource URI
+                                    pix.save(str(img_path), "PNG")
-                                    image_id = f"page_{page_num + 1}_img_{img_index + 1}"
+
-                                    mcp_uri = f"pdf-image://{image_id}"
+                                file_size = img_path.stat().st_size
-                                    markdown_parts.append(f"![{alt_text}]({mcp_uri})\n\n")
+                                extracted_image_info.append({
                                    "filename": filename,
                                    "path": str(img_path),
                                    "page": page_num + 1,
                                    "width": pix.width,
                                    "height": pix.height,
                                    "size_bytes": file_size
                                })
                                images_extracted += 1
                                pix = None
                                markdown_parts.append(f"![{alt_text}](./images/{filename})\n\n")
                            except Exception as e:
                                logger.warning(f"Failed to process image {img_index + 1} on page {page_num + 1}: {e}")
-                                if images_dir:
+                                images_skipped += 1
                                    images_skipped += 1
                except Exception as e:
                    logger.warning(f"Failed to process page {page_num + 1}: {e}")
@ -379,42 +370,57 @@ class ImageProcessingMixin(MCPMixin):
            # Combine all markdown parts
            full_markdown = "".join(markdown_parts)
            # Save markdown file if requested
            markdown_path = None
            if save_markdown and output_directory:
                md_path = output_dir / f"{input_pdf_path.stem}.md"
                with open(md_path, 'w', encoding='utf-8') as f:
                    f.write(full_markdown)
                markdown_path = str(md_path)
            # Calculate statistics
            word_count = len(full_markdown.split())
            line_count = len(full_markdown.split('\n'))
            char_count = len(full_markdown)
-            result = {
+            conversion_summary = {
-                "success": True,
+                "pages_converted": len(pages_to_process),
-                "markdown": full_markdown,
+                "total_pages": total_pages,
-                "conversion_summary": {
+                "word_count": word_count,
-                    "pages_converted": len(pages_to_process),
+                "line_count": line_count,
-                    "total_pages": total_pages,
+                "character_count": char_count,
-                    "word_count": word_count,
+                "images_extracted": images_extracted,
-                    "line_count": line_count,
+                "images_skipped": images_skipped
                    "character_count": char_count,
                    "includes_images": include_images,
                    "includes_metadata": include_metadata,
                    "images_extracted": images_extracted,
                    "images_skipped": images_skipped
                },
                "file_info": {
                    "input_path": str(input_pdf_path),
                    "pages_processed": pages or "all"
                },
                "conversion_time": round(time.time() - start_time, 2)
            }
-            if images_dir:
+            # Inline mode: return full markdown in response
-                result["image_output"] = {
+            if inline:
                return {
                    "success": True,
                    "markdown": full_markdown,
                    "conversion_summary": conversion_summary,
                    "image_output": {
                        "images_directory": str(images_dir),
                        "images": extracted_image_info
                    },
                    "file_info": {
                        "input_path": str(input_pdf_path),
                        "pages_processed": pages or "all"
                    },
                    "conversion_time": round(time.time() - start_time, 2)
                }
            # File output mode (default): write .md file, return path + summary
            md_path = output_dir / f"{input_pdf_path.stem}.md"
            with open(md_path, 'w', encoding='utf-8') as f:
                f.write(full_markdown)
            # Build preview (first ~500 chars at sentence boundary)
            preview = full_markdown[:500]
            if len(full_markdown) > 500:
                last_period = preview.rfind('.')
                if last_period > 300:
                    preview = preview[:last_period + 1]
                preview += " [...]"
            return {
                "success": True,
                "output_file": str(md_path),
                "markdown_preview": preview,
                "conversion_summary": conversion_summary,
                "image_output": {
                    "images_directory": str(images_dir),
                    "images_extracted": images_extracted,
                    "images_skipped": images_skipped,
@ -424,17 +430,14 @@ class ImageProcessingMixin(MCPMixin):
                        "image_format": image_format
                    },
                    "images": extracted_image_info
-                }
+                },
-            else:
+                "file_info": {
-                result["mcp_integration"] = {
+                    "input_path": str(input_pdf_path),
-                    "image_uri_format": "pdf-image://{image_id}",
+                    "output_directory": str(output_dir),
-                    "description": "Images use MCP resource URIs. Set output_directory for disk-based images with relative paths."
+                    "pages_processed": pages or "all"
-                }
+                },
-
+                "conversion_time": round(time.time() - start_time, 2)
-            if markdown_path:
+            }
                result["markdown_path"] = markdown_path
            return result
        except Exception as e:
            error_msg = sanitize_error_message(str(e))
--- a/src/mcp_pdf/mixins_official/text_extraction.py
+++ b/src/mcp_pdf/mixins_official/text_extraction.py
@ -3,8 +3,8 @@ Text Extraction Mixin - PDF text extraction, OCR, and scanned PDF detection
 Uses official fastmcp.contrib.mcp_mixin pattern
 """
 import asyncio
 import time
 import tempfile
 from pathlib import Path
 from typing import Dict, Any, Optional, List
 import logging
@ -18,7 +18,7 @@ import io
 # Official FastMCP mixin
 from fastmcp.contrib.mcp_mixin import MCPMixin, mcp_tool
-from ..security import validate_pdf_path, sanitize_error_message
+from ..security import validate_pdf_path, validate_output_path, sanitize_error_message
 logger = logging.getLogger(__name__)
@ -36,30 +36,43 @@ class TextExtractionMixin(MCPMixin):
    @mcp_tool(
        name="extract_text",
-        description="Extract text from PDF with intelligent method selection and automatic chunking for large files"
+        description=(
            "Extract text from PDF and write to a .txt file. Returns the output "
            "file path and a short preview — full text is in the file, not in the "
            "response. Use output_directory to control where the file is saved, "
            "or set inline=True to get full text in the response instead."
        )
    )
    async def extract_text(
        self,
        pdf_path: str,
        pages: Optional[str] = None,
        method: str = "auto",
        preserve_layout: bool = False,
        output_directory: Optional[str] = None,
        inline: bool = False,
        chunk_pages: int = 10,
-        max_tokens: int = 20000,
+        max_tokens: int = 20000
        preserve_layout: bool = False
    ) -> Dict[str, Any]:
        """
        Extract text from PDF with intelligent method selection.
        By default, writes extracted text to a file and returns the path with
        a short preview. This prevents large extractions from filling the MCP
        context window. Set inline=True for the old behavior (full text in response).
        Args:
            pdf_path: Path to PDF file or HTTPS URL
            pages: Page numbers to extract (comma-separated, 1-based), None for all
            method: Extraction method ("auto", "pymupdf", "pdfplumber", "pypdf")
            chunk_pages: Number of pages per chunk for large files
            max_tokens: Maximum tokens per response to prevent overflow
            preserve_layout: Whether to preserve text layout and formatting
            output_directory: Directory to save the text file (default: temp directory)
            inline: Return full text in response instead of writing to file
            chunk_pages: Pages per chunk when inline=True (ignored for file output)
            max_tokens: Max chars when inline=True (ignored for file output)
        Returns:
-            Dictionary containing extracted text and metadata
+            Dictionary with output_file path and summary, or full text if inline=True
        """
        start_time = time.time()
@ -84,44 +97,93 @@ class TextExtractionMixin(MCPMixin):
                    "extraction_time": 0
                }
-            # Check if chunking is needed
+            # Inline mode: old behavior with chunking/truncation
-            if len(pages_to_extract) > chunk_pages:
+            if inline:
-                return await self._extract_text_chunked(
+                if len(pages_to_extract) > chunk_pages:
-                    doc, path, pages_to_extract, method, chunk_pages,
+                    return await self._extract_text_chunked(
-                    max_tokens, preserve_layout, start_time
+                        doc, path, pages_to_extract, method, chunk_pages,
-                )
+                        max_tokens, preserve_layout, start_time
                    )
-            # Extract text from specified pages
+                extraction_result = await self._extract_text_from_pages(
                    doc, pages_to_extract, method, preserve_layout
                )
                doc.close()
                if len(extraction_result["text"]) > max_tokens:
                    truncated_text = extraction_result["text"][:max_tokens]
                    last_period = truncated_text.rfind('.')
                    if last_period > max_tokens * 0.8:
                        truncated_text = truncated_text[:last_period + 1]
                    extraction_result["text"] = truncated_text
                    extraction_result["truncated"] = True
                    extraction_result["truncation_reason"] = f"Response too large (>{max_tokens} chars)"
                extraction_result.update({
                    "success": True,
                    "file_info": {
                        "path": str(path),
                        "total_pages": total_pages,
                        "pages_extracted": len(pages_to_extract),
                        "pages_requested": pages or "all"
                    },
                    "extraction_time": round(time.time() - start_time, 2)
                })
                return extraction_result
            # File output mode (default): extract all requested pages, write to file
            extraction_result = await self._extract_text_from_pages(
                doc, pages_to_extract, method, preserve_layout
            )
            doc.close()
-            # Check token limit and truncate if necessary
+            full_text = extraction_result["text"]
            if len(extraction_result["text"]) > max_tokens:
                truncated_text = extraction_result["text"][:max_tokens]
                # Try to truncate at sentence boundary
                last_period = truncated_text.rfind('.')
                if last_period > max_tokens * 0.8:  # If we can find a good break point
                    truncated_text = truncated_text[:last_period + 1]
-                extraction_result["text"] = truncated_text
+            # Setup output directory
-                extraction_result["truncated"] = True
+            if output_directory:
-                extraction_result["truncation_reason"] = f"Response too large (>{max_tokens} chars)"
+                output_dir = validate_output_path(output_directory)
                output_dir.mkdir(parents=True, exist_ok=True)
            else:
                output_dir = Path(tempfile.mkdtemp(prefix="pdf_text_"))
-            extraction_result.update({
+            # Write text to file
            output_filename = f"{path.stem}.txt"
            output_path = output_dir / output_filename
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(full_text)
            # Build preview (first ~500 chars at sentence boundary)
            preview = full_text[:500]
            if len(full_text) > 500:
                last_period = preview.rfind('.')
                if last_period > 300:
                    preview = preview[:last_period + 1]
                preview += " [...]"
            word_count = len(full_text.split())
            char_count = len(full_text)
            file_size = output_path.stat().st_size
            return {
                "success": True,
-                "file_info": {
+                "output_file": str(output_path),
-                    "path": str(path),
+                "text_preview": preview,
-                    "total_pages": total_pages,
+                "extraction_summary": {
                    "word_count": word_count,
                    "character_count": char_count,
                    "file_size_bytes": file_size,
                    "file_size_kb": round(file_size / 1024, 1),
                    "pages_extracted": len(pages_to_extract),
                    "total_pages": total_pages,
                    "method_used": extraction_result.get("method_used", method)
                },
                "file_info": {
                    "input_path": str(path),
                    "total_pages": total_pages,
                    "pages_requested": pages or "all"
                },
                "extraction_time": round(time.time() - start_time, 2)
-            })
+            }
            return extraction_result
        except Exception as e:
            error_msg = sanitize_error_message(str(e))