🐛 File-first output for extract_text and pdf_to_markdown
Both tools now write to disk by default and return file path + short preview instead of full content inline. Prevents MCP context overflow on large PDFs. Set inline=True for the old behavior. pdf_to_markdown always extracts images to ./images/ with relative paths (no more dead pdf-image:// URIs). extract_text writes a .txt file.
This commit is contained in:
parent
2d5f7e241d
commit
772bcac0df
10
CLAUDE.md
10
CLAUDE.md
@ -87,11 +87,11 @@ uv publish
|
|||||||
|
|
||||||
### Tool Categories
|
### Tool Categories
|
||||||
|
|
||||||
1. **Text Extraction**: `extract_text` - Intelligent method selection with automatic chunking for large files
|
1. **Text Extraction**: `extract_text` - Writes extracted text to a .txt file by default, returns path + preview. Set `inline=True` for full text in response.
|
||||||
2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
|
2. **Table Extraction**: `extract_tables` - Auto-fallback through Camelot → pdfplumber → Tabula
|
||||||
3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
|
3. **OCR Processing**: `ocr_pdf` - Tesseract with preprocessing options
|
||||||
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
|
4. **Document Analysis**: `is_scanned_pdf`, `get_document_structure`, `extract_metadata`
|
||||||
5. **Format Conversion**: `pdf_to_markdown` - Convert PDF to markdown. With `output_directory`, extracts images to disk with relative `./images/` paths. Without, uses `pdf-image://` MCP resource URIs (legacy).
|
5. **Format Conversion**: `pdf_to_markdown` - Writes markdown + extracted images to disk by default, returns path + preview. Images use relative `./images/` paths. Set `inline=True` for full markdown in response.
|
||||||
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
|
6. **Image Processing**: `extract_images` - Extract images with custom output paths and clean summary output
|
||||||
7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
|
7. **Link Extraction**: `extract_links` - Extract all hyperlinks with page filtering and type categorization
|
||||||
8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
|
8. **PDF Forms**: `extract_form_data`, `create_form_pdf`, `fill_form_pdf`, `add_form_fields` - Complete form lifecycle management
|
||||||
@ -103,9 +103,9 @@ uv publish
|
|||||||
**Optimized for MCP Context Management:**
|
**Optimized for MCP Context Management:**
|
||||||
- **Custom Output Paths**: `extract_images` allows users to specify where images are saved
|
- **Custom Output Paths**: `extract_images` allows users to specify where images are saved
|
||||||
- **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata
|
- **Clean Summary Output**: Returns concise extraction summary instead of verbose image metadata
|
||||||
- **Resource URIs**: `pdf_to_markdown` uses `pdf-image://{image_id}` protocol when no `output_directory` is set (legacy mode)
|
- **File-First Output**: `extract_text` and `pdf_to_markdown` write results to files by default, returning paths + short previews instead of full content — prevents MCP context overflow on large PDFs
|
||||||
- **Disk-Based Images**: When `output_directory` is provided, `pdf_to_markdown` extracts images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers
|
- **Disk-Based Images**: `pdf_to_markdown` always extracts images to `{output_directory}/images/` with relative `./images/` paths — compatible with Starlight, browsers, and standard renderers
|
||||||
- **Prevents Context Overflow**: Avoids verbose output that fills client message windows
|
- **Inline Escape Hatch**: Both tools accept `inline=True` to return full content in the response for small queries
|
||||||
- **User Control**: Flexible output directory support with automatic directory creation
|
- **User Control**: Flexible output directory support with automatic directory creation
|
||||||
|
|
||||||
### Intelligent Fallbacks and Token Management
|
### Intelligent Fallbacks and Token Management
|
||||||
|
|||||||
@ -215,10 +215,10 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
@mcp_tool(
|
@mcp_tool(
|
||||||
name="pdf_to_markdown",
|
name="pdf_to_markdown",
|
||||||
description=(
|
description=(
|
||||||
"Convert PDF to markdown. When output_directory is provided, images are "
|
"Convert PDF to markdown and write to a .md file. Images are extracted "
|
||||||
"extracted to {output_directory}/images/ with relative ./images/ paths in "
|
"to {output_directory}/images/ with relative ./images/ paths. Returns "
|
||||||
"the markdown — ready for Starlight, browsers, or any renderer. "
|
"the output file path and a short preview — full markdown is in the file. "
|
||||||
"Without output_directory, images use pdf-image:// MCP resource URIs."
|
"Set inline=True to get full markdown in the response instead."
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
async def pdf_to_markdown(
|
async def pdf_to_markdown(
|
||||||
@ -231,30 +231,29 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
min_width: int = 100,
|
min_width: int = 100,
|
||||||
min_height: int = 100,
|
min_height: int = 100,
|
||||||
image_format: str = "png",
|
image_format: str = "png",
|
||||||
save_markdown: bool = False
|
inline: bool = False
|
||||||
) -> Dict[str, Any]:
|
) -> Dict[str, Any]:
|
||||||
"""
|
"""
|
||||||
Convert PDF to clean markdown format.
|
Convert PDF to clean markdown format and write to file.
|
||||||
|
|
||||||
Two image modes:
|
By default, writes markdown to a file and extracts images to an images/
|
||||||
- With output_directory: extracts images to disk, uses relative paths in markdown.
|
subdirectory with relative paths. Returns file path + summary to avoid
|
||||||
Images are filtered by min_width/min_height (matching extract_images behavior).
|
filling the MCP context window. Set inline=True for full markdown in response.
|
||||||
- Without output_directory: uses pdf-image:// MCP resource URIs (legacy behavior).
|
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
pdf_path: Path to PDF file or HTTPS URL
|
pdf_path: Path to PDF file or HTTPS URL
|
||||||
pages: Page numbers to convert (comma-separated, 1-based), None for all
|
pages: Page numbers to convert (comma-separated, 1-based), None for all
|
||||||
include_images: Whether to include images in markdown
|
include_images: Whether to include images in markdown
|
||||||
include_metadata: Whether to include document metadata
|
include_metadata: Whether to include document metadata
|
||||||
output_directory: Directory for extracted images and optional markdown file.
|
output_directory: Directory for output .md file and images/ subdirectory.
|
||||||
When set, images go to {output_directory}/images/ with relative paths.
|
Defaults to a temp directory if not specified.
|
||||||
min_width: Minimum image width to extract (only when output_directory is set)
|
min_width: Minimum image width to extract (filters small decorative images)
|
||||||
min_height: Minimum image height to extract (only when output_directory is set)
|
min_height: Minimum image height to extract (filters small decorative images)
|
||||||
image_format: Image format - "png" or "jpg" (only when output_directory is set)
|
image_format: Image format - "png" or "jpg"
|
||||||
save_markdown: Save markdown to {output_directory}/{filename}.md
|
inline: Return full markdown in response instead of writing to file
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Dictionary containing markdown content and metadata
|
Dictionary with output_file path and summary, or full markdown if inline=True
|
||||||
"""
|
"""
|
||||||
start_time = time.time()
|
start_time = time.time()
|
||||||
|
|
||||||
@ -273,17 +272,18 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
pages_to_process = parsed_pages if parsed_pages else list(range(total_pages))
|
pages_to_process = parsed_pages if parsed_pages else list(range(total_pages))
|
||||||
pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages]
|
pages_to_process = [p for p in pages_to_process if 0 <= p < total_pages]
|
||||||
|
|
||||||
# Setup output directory for image extraction
|
# Setup output directory — always needed (file output is the default)
|
||||||
images_dir = None
|
|
||||||
images_extracted = 0
|
images_extracted = 0
|
||||||
images_skipped = 0
|
images_skipped = 0
|
||||||
extracted_image_info = []
|
extracted_image_info = []
|
||||||
|
|
||||||
if output_directory:
|
if output_directory:
|
||||||
output_dir = validate_output_path(output_directory)
|
output_dir = validate_output_path(output_directory)
|
||||||
output_dir.mkdir(parents=True, exist_ok=True)
|
else:
|
||||||
images_dir = output_dir / "images"
|
output_dir = Path(tempfile.mkdtemp(prefix="pdf_markdown_"))
|
||||||
images_dir.mkdir(parents=True, exist_ok=True)
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
images_dir = output_dir / "images"
|
||||||
|
images_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
markdown_parts = []
|
markdown_parts = []
|
||||||
|
|
||||||
@ -322,53 +322,44 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
for img_index, img in enumerate(image_list):
|
for img_index, img in enumerate(image_list):
|
||||||
try:
|
try:
|
||||||
alt_text = f"Image {img_index + 1} from page {page_num + 1}"
|
alt_text = f"Image {img_index + 1} from page {page_num + 1}"
|
||||||
|
xref = img[0]
|
||||||
|
pix = fitz.Pixmap(doc, xref)
|
||||||
|
|
||||||
if images_dir:
|
if pix.width < min_width or pix.height < min_height:
|
||||||
# Disk mode: extract image, filter by size, save to images/
|
images_skipped += 1
|
||||||
xref = img[0]
|
|
||||||
pix = fitz.Pixmap(doc, xref)
|
|
||||||
|
|
||||||
if pix.width < min_width or pix.height < min_height:
|
|
||||||
images_skipped += 1
|
|
||||||
pix = None
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Convert CMYK to RGB if necessary
|
|
||||||
if pix.n - pix.alpha >= 4:
|
|
||||||
pix = fitz.Pixmap(fitz.csRGB, pix)
|
|
||||||
|
|
||||||
base_name = input_pdf_path.stem
|
|
||||||
filename = f"{base_name}_page_{page_num + 1}_img_{img_index + 1}.{image_format}"
|
|
||||||
img_path = images_dir / filename
|
|
||||||
|
|
||||||
if image_format.lower() in ["jpg", "jpeg"]:
|
|
||||||
pix.save(str(img_path), "JPEG")
|
|
||||||
else:
|
|
||||||
pix.save(str(img_path), "PNG")
|
|
||||||
|
|
||||||
file_size = img_path.stat().st_size
|
|
||||||
extracted_image_info.append({
|
|
||||||
"filename": filename,
|
|
||||||
"path": str(img_path),
|
|
||||||
"page": page_num + 1,
|
|
||||||
"width": pix.width,
|
|
||||||
"height": pix.height,
|
|
||||||
"size_bytes": file_size
|
|
||||||
})
|
|
||||||
images_extracted += 1
|
|
||||||
pix = None
|
pix = None
|
||||||
|
continue
|
||||||
|
|
||||||
markdown_parts.append(f"\n\n")
|
# Convert CMYK to RGB if necessary
|
||||||
|
if pix.n - pix.alpha >= 4:
|
||||||
|
pix = fitz.Pixmap(fitz.csRGB, pix)
|
||||||
|
|
||||||
|
base_name = input_pdf_path.stem
|
||||||
|
filename = f"{base_name}_page_{page_num + 1}_img_{img_index + 1}.{image_format}"
|
||||||
|
img_path = images_dir / filename
|
||||||
|
|
||||||
|
if image_format.lower() in ["jpg", "jpeg"]:
|
||||||
|
pix.save(str(img_path), "JPEG")
|
||||||
else:
|
else:
|
||||||
# Legacy mode: pdf-image:// MCP resource URI
|
pix.save(str(img_path), "PNG")
|
||||||
image_id = f"page_{page_num + 1}_img_{img_index + 1}"
|
|
||||||
mcp_uri = f"pdf-image://{image_id}"
|
file_size = img_path.stat().st_size
|
||||||
markdown_parts.append(f"\n\n")
|
extracted_image_info.append({
|
||||||
|
"filename": filename,
|
||||||
|
"path": str(img_path),
|
||||||
|
"page": page_num + 1,
|
||||||
|
"width": pix.width,
|
||||||
|
"height": pix.height,
|
||||||
|
"size_bytes": file_size
|
||||||
|
})
|
||||||
|
images_extracted += 1
|
||||||
|
pix = None
|
||||||
|
|
||||||
|
markdown_parts.append(f"\n\n")
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"Failed to process image {img_index + 1} on page {page_num + 1}: {e}")
|
logger.warning(f"Failed to process image {img_index + 1} on page {page_num + 1}: {e}")
|
||||||
if images_dir:
|
images_skipped += 1
|
||||||
images_skipped += 1
|
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"Failed to process page {page_num + 1}: {e}")
|
logger.warning(f"Failed to process page {page_num + 1}: {e}")
|
||||||
@ -379,42 +370,57 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
# Combine all markdown parts
|
# Combine all markdown parts
|
||||||
full_markdown = "".join(markdown_parts)
|
full_markdown = "".join(markdown_parts)
|
||||||
|
|
||||||
# Save markdown file if requested
|
|
||||||
markdown_path = None
|
|
||||||
if save_markdown and output_directory:
|
|
||||||
md_path = output_dir / f"{input_pdf_path.stem}.md"
|
|
||||||
with open(md_path, 'w', encoding='utf-8') as f:
|
|
||||||
f.write(full_markdown)
|
|
||||||
markdown_path = str(md_path)
|
|
||||||
|
|
||||||
# Calculate statistics
|
# Calculate statistics
|
||||||
word_count = len(full_markdown.split())
|
word_count = len(full_markdown.split())
|
||||||
line_count = len(full_markdown.split('\n'))
|
line_count = len(full_markdown.split('\n'))
|
||||||
char_count = len(full_markdown)
|
char_count = len(full_markdown)
|
||||||
|
|
||||||
result = {
|
conversion_summary = {
|
||||||
"success": True,
|
"pages_converted": len(pages_to_process),
|
||||||
"markdown": full_markdown,
|
"total_pages": total_pages,
|
||||||
"conversion_summary": {
|
"word_count": word_count,
|
||||||
"pages_converted": len(pages_to_process),
|
"line_count": line_count,
|
||||||
"total_pages": total_pages,
|
"character_count": char_count,
|
||||||
"word_count": word_count,
|
"images_extracted": images_extracted,
|
||||||
"line_count": line_count,
|
"images_skipped": images_skipped
|
||||||
"character_count": char_count,
|
|
||||||
"includes_images": include_images,
|
|
||||||
"includes_metadata": include_metadata,
|
|
||||||
"images_extracted": images_extracted,
|
|
||||||
"images_skipped": images_skipped
|
|
||||||
},
|
|
||||||
"file_info": {
|
|
||||||
"input_path": str(input_pdf_path),
|
|
||||||
"pages_processed": pages or "all"
|
|
||||||
},
|
|
||||||
"conversion_time": round(time.time() - start_time, 2)
|
|
||||||
}
|
}
|
||||||
|
|
||||||
if images_dir:
|
# Inline mode: return full markdown in response
|
||||||
result["image_output"] = {
|
if inline:
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"markdown": full_markdown,
|
||||||
|
"conversion_summary": conversion_summary,
|
||||||
|
"image_output": {
|
||||||
|
"images_directory": str(images_dir),
|
||||||
|
"images": extracted_image_info
|
||||||
|
},
|
||||||
|
"file_info": {
|
||||||
|
"input_path": str(input_pdf_path),
|
||||||
|
"pages_processed": pages or "all"
|
||||||
|
},
|
||||||
|
"conversion_time": round(time.time() - start_time, 2)
|
||||||
|
}
|
||||||
|
|
||||||
|
# File output mode (default): write .md file, return path + summary
|
||||||
|
md_path = output_dir / f"{input_pdf_path.stem}.md"
|
||||||
|
with open(md_path, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(full_markdown)
|
||||||
|
|
||||||
|
# Build preview (first ~500 chars at sentence boundary)
|
||||||
|
preview = full_markdown[:500]
|
||||||
|
if len(full_markdown) > 500:
|
||||||
|
last_period = preview.rfind('.')
|
||||||
|
if last_period > 300:
|
||||||
|
preview = preview[:last_period + 1]
|
||||||
|
preview += " [...]"
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"output_file": str(md_path),
|
||||||
|
"markdown_preview": preview,
|
||||||
|
"conversion_summary": conversion_summary,
|
||||||
|
"image_output": {
|
||||||
"images_directory": str(images_dir),
|
"images_directory": str(images_dir),
|
||||||
"images_extracted": images_extracted,
|
"images_extracted": images_extracted,
|
||||||
"images_skipped": images_skipped,
|
"images_skipped": images_skipped,
|
||||||
@ -424,17 +430,14 @@ class ImageProcessingMixin(MCPMixin):
|
|||||||
"image_format": image_format
|
"image_format": image_format
|
||||||
},
|
},
|
||||||
"images": extracted_image_info
|
"images": extracted_image_info
|
||||||
}
|
},
|
||||||
else:
|
"file_info": {
|
||||||
result["mcp_integration"] = {
|
"input_path": str(input_pdf_path),
|
||||||
"image_uri_format": "pdf-image://{image_id}",
|
"output_directory": str(output_dir),
|
||||||
"description": "Images use MCP resource URIs. Set output_directory for disk-based images with relative paths."
|
"pages_processed": pages or "all"
|
||||||
}
|
},
|
||||||
|
"conversion_time": round(time.time() - start_time, 2)
|
||||||
if markdown_path:
|
}
|
||||||
result["markdown_path"] = markdown_path
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
error_msg = sanitize_error_message(str(e))
|
error_msg = sanitize_error_message(str(e))
|
||||||
|
|||||||
@ -3,8 +3,8 @@ Text Extraction Mixin - PDF text extraction, OCR, and scanned PDF detection
|
|||||||
Uses official fastmcp.contrib.mcp_mixin pattern
|
Uses official fastmcp.contrib.mcp_mixin pattern
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import asyncio
|
|
||||||
import time
|
import time
|
||||||
|
import tempfile
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import Dict, Any, Optional, List
|
from typing import Dict, Any, Optional, List
|
||||||
import logging
|
import logging
|
||||||
@ -18,7 +18,7 @@ import io
|
|||||||
# Official FastMCP mixin
|
# Official FastMCP mixin
|
||||||
from fastmcp.contrib.mcp_mixin import MCPMixin, mcp_tool
|
from fastmcp.contrib.mcp_mixin import MCPMixin, mcp_tool
|
||||||
|
|
||||||
from ..security import validate_pdf_path, sanitize_error_message
|
from ..security import validate_pdf_path, validate_output_path, sanitize_error_message
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@ -36,30 +36,43 @@ class TextExtractionMixin(MCPMixin):
|
|||||||
|
|
||||||
@mcp_tool(
|
@mcp_tool(
|
||||||
name="extract_text",
|
name="extract_text",
|
||||||
description="Extract text from PDF with intelligent method selection and automatic chunking for large files"
|
description=(
|
||||||
|
"Extract text from PDF and write to a .txt file. Returns the output "
|
||||||
|
"file path and a short preview — full text is in the file, not in the "
|
||||||
|
"response. Use output_directory to control where the file is saved, "
|
||||||
|
"or set inline=True to get full text in the response instead."
|
||||||
|
)
|
||||||
)
|
)
|
||||||
async def extract_text(
|
async def extract_text(
|
||||||
self,
|
self,
|
||||||
pdf_path: str,
|
pdf_path: str,
|
||||||
pages: Optional[str] = None,
|
pages: Optional[str] = None,
|
||||||
method: str = "auto",
|
method: str = "auto",
|
||||||
|
preserve_layout: bool = False,
|
||||||
|
output_directory: Optional[str] = None,
|
||||||
|
inline: bool = False,
|
||||||
chunk_pages: int = 10,
|
chunk_pages: int = 10,
|
||||||
max_tokens: int = 20000,
|
max_tokens: int = 20000
|
||||||
preserve_layout: bool = False
|
|
||||||
) -> Dict[str, Any]:
|
) -> Dict[str, Any]:
|
||||||
"""
|
"""
|
||||||
Extract text from PDF with intelligent method selection.
|
Extract text from PDF with intelligent method selection.
|
||||||
|
|
||||||
|
By default, writes extracted text to a file and returns the path with
|
||||||
|
a short preview. This prevents large extractions from filling the MCP
|
||||||
|
context window. Set inline=True for the old behavior (full text in response).
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
pdf_path: Path to PDF file or HTTPS URL
|
pdf_path: Path to PDF file or HTTPS URL
|
||||||
pages: Page numbers to extract (comma-separated, 1-based), None for all
|
pages: Page numbers to extract (comma-separated, 1-based), None for all
|
||||||
method: Extraction method ("auto", "pymupdf", "pdfplumber", "pypdf")
|
method: Extraction method ("auto", "pymupdf", "pdfplumber", "pypdf")
|
||||||
chunk_pages: Number of pages per chunk for large files
|
|
||||||
max_tokens: Maximum tokens per response to prevent overflow
|
|
||||||
preserve_layout: Whether to preserve text layout and formatting
|
preserve_layout: Whether to preserve text layout and formatting
|
||||||
|
output_directory: Directory to save the text file (default: temp directory)
|
||||||
|
inline: Return full text in response instead of writing to file
|
||||||
|
chunk_pages: Pages per chunk when inline=True (ignored for file output)
|
||||||
|
max_tokens: Max chars when inline=True (ignored for file output)
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Dictionary containing extracted text and metadata
|
Dictionary with output_file path and summary, or full text if inline=True
|
||||||
"""
|
"""
|
||||||
start_time = time.time()
|
start_time = time.time()
|
||||||
|
|
||||||
@ -84,44 +97,93 @@ class TextExtractionMixin(MCPMixin):
|
|||||||
"extraction_time": 0
|
"extraction_time": 0
|
||||||
}
|
}
|
||||||
|
|
||||||
# Check if chunking is needed
|
# Inline mode: old behavior with chunking/truncation
|
||||||
if len(pages_to_extract) > chunk_pages:
|
if inline:
|
||||||
return await self._extract_text_chunked(
|
if len(pages_to_extract) > chunk_pages:
|
||||||
doc, path, pages_to_extract, method, chunk_pages,
|
return await self._extract_text_chunked(
|
||||||
max_tokens, preserve_layout, start_time
|
doc, path, pages_to_extract, method, chunk_pages,
|
||||||
)
|
max_tokens, preserve_layout, start_time
|
||||||
|
)
|
||||||
|
|
||||||
# Extract text from specified pages
|
extraction_result = await self._extract_text_from_pages(
|
||||||
|
doc, pages_to_extract, method, preserve_layout
|
||||||
|
)
|
||||||
|
doc.close()
|
||||||
|
|
||||||
|
if len(extraction_result["text"]) > max_tokens:
|
||||||
|
truncated_text = extraction_result["text"][:max_tokens]
|
||||||
|
last_period = truncated_text.rfind('.')
|
||||||
|
if last_period > max_tokens * 0.8:
|
||||||
|
truncated_text = truncated_text[:last_period + 1]
|
||||||
|
extraction_result["text"] = truncated_text
|
||||||
|
extraction_result["truncated"] = True
|
||||||
|
extraction_result["truncation_reason"] = f"Response too large (>{max_tokens} chars)"
|
||||||
|
|
||||||
|
extraction_result.update({
|
||||||
|
"success": True,
|
||||||
|
"file_info": {
|
||||||
|
"path": str(path),
|
||||||
|
"total_pages": total_pages,
|
||||||
|
"pages_extracted": len(pages_to_extract),
|
||||||
|
"pages_requested": pages or "all"
|
||||||
|
},
|
||||||
|
"extraction_time": round(time.time() - start_time, 2)
|
||||||
|
})
|
||||||
|
return extraction_result
|
||||||
|
|
||||||
|
# File output mode (default): extract all requested pages, write to file
|
||||||
extraction_result = await self._extract_text_from_pages(
|
extraction_result = await self._extract_text_from_pages(
|
||||||
doc, pages_to_extract, method, preserve_layout
|
doc, pages_to_extract, method, preserve_layout
|
||||||
)
|
)
|
||||||
|
|
||||||
doc.close()
|
doc.close()
|
||||||
|
|
||||||
# Check token limit and truncate if necessary
|
full_text = extraction_result["text"]
|
||||||
if len(extraction_result["text"]) > max_tokens:
|
|
||||||
truncated_text = extraction_result["text"][:max_tokens]
|
|
||||||
# Try to truncate at sentence boundary
|
|
||||||
last_period = truncated_text.rfind('.')
|
|
||||||
if last_period > max_tokens * 0.8: # If we can find a good break point
|
|
||||||
truncated_text = truncated_text[:last_period + 1]
|
|
||||||
|
|
||||||
extraction_result["text"] = truncated_text
|
# Setup output directory
|
||||||
extraction_result["truncated"] = True
|
if output_directory:
|
||||||
extraction_result["truncation_reason"] = f"Response too large (>{max_tokens} chars)"
|
output_dir = validate_output_path(output_directory)
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
else:
|
||||||
|
output_dir = Path(tempfile.mkdtemp(prefix="pdf_text_"))
|
||||||
|
|
||||||
extraction_result.update({
|
# Write text to file
|
||||||
|
output_filename = f"{path.stem}.txt"
|
||||||
|
output_path = output_dir / output_filename
|
||||||
|
with open(output_path, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(full_text)
|
||||||
|
|
||||||
|
# Build preview (first ~500 chars at sentence boundary)
|
||||||
|
preview = full_text[:500]
|
||||||
|
if len(full_text) > 500:
|
||||||
|
last_period = preview.rfind('.')
|
||||||
|
if last_period > 300:
|
||||||
|
preview = preview[:last_period + 1]
|
||||||
|
preview += " [...]"
|
||||||
|
|
||||||
|
word_count = len(full_text.split())
|
||||||
|
char_count = len(full_text)
|
||||||
|
file_size = output_path.stat().st_size
|
||||||
|
|
||||||
|
return {
|
||||||
"success": True,
|
"success": True,
|
||||||
"file_info": {
|
"output_file": str(output_path),
|
||||||
"path": str(path),
|
"text_preview": preview,
|
||||||
"total_pages": total_pages,
|
"extraction_summary": {
|
||||||
|
"word_count": word_count,
|
||||||
|
"character_count": char_count,
|
||||||
|
"file_size_bytes": file_size,
|
||||||
|
"file_size_kb": round(file_size / 1024, 1),
|
||||||
"pages_extracted": len(pages_to_extract),
|
"pages_extracted": len(pages_to_extract),
|
||||||
|
"total_pages": total_pages,
|
||||||
|
"method_used": extraction_result.get("method_used", method)
|
||||||
|
},
|
||||||
|
"file_info": {
|
||||||
|
"input_path": str(path),
|
||||||
|
"total_pages": total_pages,
|
||||||
"pages_requested": pages or "all"
|
"pages_requested": pages or "all"
|
||||||
},
|
},
|
||||||
"extraction_time": round(time.time() - start_time, 2)
|
"extraction_time": round(time.time() - start_time, 2)
|
||||||
})
|
}
|
||||||
|
|
||||||
return extraction_result
|
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
error_msg = sanitize_error_message(str(e))
|
error_msg = sanitize_error_message(str(e))
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user