---
title: Your First Extraction
description: Extract text from an Office document in 60 seconds.
---

import { Aside, Steps, Code, Tabs, TabItem } from '@astrojs/starlight/components';

> *"I'll be honest with you, I love extracting documents. I do. I'm a mcwaddams fan."*

Let's get you extracting documents faster than you can say "TPS report cover sheet."

---

## Prerequisites

Make sure you have mcwaddams installed and configured:

<Tabs>
<TabItem label="Claude Code">
```bash
claude mcp add mcwaddams "uvx mcwaddams"
```

Restart Claude Code, and you're ready.
</TabItem>
<TabItem label="Claude Desktop">
Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "mcwaddams": {
      "command": "uvx",
      "args": ["mcwaddams"]
    }
  }
}
```

Restart Claude Desktop.
</TabItem>
</Tabs>

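If you would rather script the Claude Desktop edit than paste it by hand, the merge can be sketched in Python. This is a minimal sketch and not part of mcwaddams: the function name `add_mcwaddams_server` is hypothetical, and because the location of `claude_desktop_config.json` differs by platform, the path is passed in rather than guessed. Existing entries under `mcpServers` are left untouched.

```python
import json
from pathlib import Path


def add_mcwaddams_server(config_path: Path) -> dict:
    """Add the mcwaddams entry to a Claude Desktop config, keeping other servers.

    Hypothetical helper for illustration; the entry itself is copied from the
    JSON snippet in this tutorial.
    """
    # Treat a missing or empty file as an empty config.
    text = config_path.read_text(encoding="utf-8") if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}

    # Create "mcpServers" if needed, then add (or overwrite) only our entry.
    servers = config.setdefault("mcpServers", {})
    servers["mcwaddams"] = {"command": "uvx", "args": ["mcwaddams"]}

    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```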

---

## Step 1: Find a Document

Grab any Office document you have lying around:

- A `.docx` report
- An `.xlsx` spreadsheet
- A `.pptx` presentation
- Even a crusty `.doc` from 2005

<Aside type="tip">
Don't have one handy? You can also use a URL:
```
https://example.com/sample-report.docx
```
</Aside>

---

## Step 2: Ask for Extraction

Just tell your AI assistant what you want:

```
Extract text from /path/to/quarterly-report.docx
```

That's it. No configuration, no options, no ceremony.

---

## Step 3: Get Results

mcwaddams returns structured data:

```json
{
  "text": "Q4 2024 Financial Summary\n\nRevenue increased by 15%...",
  "metadata": {
    "format": "Word Document (DOCX)",
    "extraction_method": "python-docx",
    "extraction_time": 0.042,
    "word_count": 3421
  }
}
```

The AI can now use this content to answer your questions, summarize, analyze, or whatever you need.

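If you ever handle a result like this in your own script, reading it is plain dictionary access. The sketch below assumes the result has already been parsed into a Python dict with exactly the fields shown above; `summarize` is a hypothetical helper, and treating `extraction_time` as seconds is an assumption.

```python
def summarize(result: dict) -> str:
    """Build a one-line summary from an extraction result (hypothetical helper)."""
    meta = result["metadata"]
    # Assumption: extraction_time is in seconds, so convert to milliseconds.
    millis = meta["extraction_time"] * 1000
    return (
        f'{meta["format"]}: {meta["word_count"]} words '
        f'via {meta["extraction_method"]} in {millis:.0f} ms'
    )


# Sample payload copied from the tutorial's example response.
sample = {
    "text": "Q4 2024 Financial Summary\n\nRevenue increased by 15%...",
    "metadata": {
        "format": "Word Document (DOCX)",
        "extraction_method": "python-docx",
        "extraction_time": 0.042,
        "word_count": 3421,
    },
}

print(summarize(sample))
```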

---

## What Just Happened?

Behind the scenes, mcwaddams:

<Steps>
1. **Detected the format**: Identified `.docx` as a modern Word document

2. **Selected the best method**: Used `python-docx` for optimal extraction

3. **Extracted the content**: Pulled text while preserving structure

4. **Added metadata**: Included timing and method information
</Steps>

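The four steps above can be sketched as a small pipeline. This is an illustration of the flow, not mcwaddams's actual implementation: `METHODS`, `choose_method`, and `extract_with_metadata` are hypothetical names, only the `.docx` to `python-docx` pairing comes from this tutorial, and counting words by splitting on whitespace is an assumption. The extractor is passed in as a callable so the sketch runs without any document library installed.

```python
import time
from pathlib import Path
from typing import Callable

# Only the ".docx" row is taken from the tutorial; the table shape is illustrative.
METHODS = {".docx": ("Word Document (DOCX)", "python-docx")}


def choose_method(path: str) -> dict:
    """Steps 1 and 2: detect the format from the extension and pick a method."""
    ext = Path(path).suffix.lower()
    if ext not in METHODS:
        raise ValueError(f"Unsupported format: {ext}")
    fmt, method = METHODS[ext]
    return {"format": fmt, "extraction_method": method}


def extract_with_metadata(path: str, extractor: Callable[[str], str]) -> dict:
    """Steps 3 and 4: run the extractor, then attach timing and method info."""
    metadata = choose_method(path)
    start = time.perf_counter()
    text = extractor(path)
    metadata["extraction_time"] = round(time.perf_counter() - start, 3)
    # Assumption: a word is any whitespace-separated token.
    metadata["word_count"] = len(text.split())
    return {"text": text, "metadata": metadata}


# Usage with a stand-in extractor that returns fixed text.
result = extract_with_metadata(
    "quarterly-report.docx", lambda _path: "Q4 2024 Financial Summary"
)
print(result["metadata"]["extraction_method"])
```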

---

## Try Different Formats

The same command works for all supported formats:

### Word Documents

```
Extract text from contract.docx
Extract text from legacy-proposal.doc
```

### Excel Spreadsheets

```
Extract text from sales-data.xlsx
Extract text from budget-2019.xls
```

### PowerPoint Presentations

```
Extract text from quarterly-deck.pptx
Extract text from old-presentation.ppt
```

### CSV Files

```
Extract text from export.csv
```

---

## Working with Large Documents

Documents over 25,000 tokens get automatically paginated:

```json
{
  "text": "Chapter 1: Introduction...",
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "cursor_id": "abc123"
  }
}
```

To get the next page:

```
Continue extracting (cursor: abc123)
```

<Aside type="note">
The AI handles pagination automatically in most cases. You'll see all the content without manually fetching pages.
</Aside>

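For the curious, the loop the AI runs on your behalf looks roughly like the sketch below. It assumes every page carries the `pagination` fields shown above and that passing the previous `cursor_id` returns the next page. `iter_pages` and the `fetch` callable are hypothetical stand-ins for however your client actually calls the tool.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Yield the text of each page, following cursor_id until the last page.

    `fetch(None)` is assumed to return the first page and `fetch(cursor_id)`
    the page after the one that produced that cursor.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch(cursor)
        yield page["text"]
        info = page.get("pagination")
        # Small documents may have no pagination block at all.
        if not info or info["current_page"] >= info["total_pages"]:
            return
        cursor = info["cursor_id"]


# Usage with a fake three-page document standing in for real tool calls.
_pages = ["Chapter 1", "Chapter 2", "Chapter 3"]


def fake_fetch(cursor: Optional[str]) -> dict:
    index = 0 if cursor is None else int(cursor)
    return {
        "text": _pages[index],
        "pagination": {
            "current_page": index + 1,
            "total_pages": len(_pages),
            "cursor_id": str(index + 1),
        },
    }


full_text = "\n".join(iter_pages(fake_fetch))
```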

---

## Common Options

You can be more specific about what you want:

### Include Images

```
Extract text and images from report.docx
```

### Get Metadata Only

```
Get metadata from mystery-file.doc
```

### Convert to Markdown

```
Convert presentation.pptx to markdown
```

### Analyze Structure

```
Show me the structure of thesis.docx
```

---

## Error Messages

mcwaddams provides clear errors when something goes wrong:

### File Not Found

```json
{
  "error": "File not found",
  "path": "/path/to/missing.docx",
  "hint": "Check that the file path exists and is accessible"
}
```

### Unsupported Format

```json
{
  "error": "Unsupported format",
  "extension": ".xyz",
  "hint": "Use get_supported_formats to see all supported types"
}
```

### Password Protected

```json
{
  "error": "Document is password-protected",
  "hint": "Remove password protection or provide an unencrypted version"
}
```

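All three payloads share an `error` field and a `hint`, so a script can handle them uniformly. The sketch below assumes a failed result is a dict containing an `error` key and a successful one is not; `describe_failure` is a hypothetical helper, and the message layout is just one reasonable choice.

```python
from typing import Optional


def describe_failure(result: dict) -> Optional[str]:
    """Return a readable message for an error payload, or None on success."""
    if "error" not in result:
        return None
    message = result["error"]
    # Include whichever detail field is present: "path" or "extension".
    target = result.get("path") or result.get("extension")
    if target:
        message += f" ({target})"
    hint = result.get("hint")
    if hint:
        message += f". {hint}."
    return message


# Usage with the "File Not Found" payload from this section.
not_found = {
    "error": "File not found",
    "path": "/path/to/missing.docx",
    "hint": "Check that the file path exists and is accessible",
}
print(describe_failure(not_found))
```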

---

## Next Steps

Now that you've extracted your first document:

- **[Working with Legacy Formats](/tutorials/legacy-formats/)**: Handle `.doc`, `.xls`, `.ppt`
- **[Indexing Large Documents](/tutorials/indexing/)**: Efficient access to huge files
- **[Extract Tables](/how-to/extract-tables/)**: Structured table extraction
- **[All Tools Reference](/reference/tools/)**: Complete tool documentation

---

<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
"Looks like someone has a case of the Mondays."
<br/>
<small>Not anymore. Your documents are extracted.</small>
</div>