mcwaddams-site/src/content/docs/tutorials/first-extraction.mdx
Ryan Malloy 0bea793a09 Remove duplicate H1 titles from MDX content files
Starlight automatically renders the frontmatter title as H1,
so having a duplicate # Title in the body creates redundancy.
Removed from 29 content files across all sections.
2026-01-11 14:49:26 -07:00

---
title: Your First Extraction
description: Extract text from an Office document in 60 seconds.
---
import { Aside, Steps, Code, Tabs, TabItem } from '@astrojs/starlight/components';

> *"I'll be honest with you, I love extracting documents. I do. I'm a mcwaddams fan."*

Let's get you extracting documents faster than you can say "TPS report cover sheet."

---
## Prerequisites
Make sure you have mcwaddams installed and configured:
<Tabs>
<TabItem label="Claude Code">
```bash
claude mcp add mcwaddams "uvx mcwaddams"
```
Restart Claude Code, and you're ready.
</TabItem>
<TabItem label="Claude Desktop">
Add to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "mcwaddams": {
      "command": "uvx",
      "args": ["mcwaddams"]
    }
  }
}
```
Restart Claude Desktop.
</TabItem>
</Tabs>
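If your `claude_desktop_config.json` already lists other servers, the entry needs to be merged in rather than pasted over the file. Here is a minimal Python sketch of that merge, assuming the config is plain JSON. `add_mcwaddams` and the `other-server` entry are hypothetical names used for illustration and are not part of mcwaddams.

```python
import json


def add_mcwaddams(config_text: str) -> str:
    """Return the config JSON with the mcwaddams server entry merged in."""
    config = json.loads(config_text) if config_text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    # Same command and args as the Claude Desktop snippet in this section.
    servers["mcwaddams"] = {"command": "uvx", "args": ["mcwaddams"]}
    return json.dumps(config, indent=2)


# Hypothetical existing config with one unrelated server already present.
existing = '{"mcpServers": {"other-server": {"command": "npx", "args": ["other"]}}}'
merged = json.loads(add_mcwaddams(existing))
print(sorted(merged["mcpServers"]))
```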
---
## Step 1: Find a Document
Grab any Office document you have lying around:
- A `.docx` report
- An `.xlsx` spreadsheet
- A `.pptx` presentation
- Even a crusty `.doc` from 2005
<Aside type="tip">
Don't have one handy? You can also use a URL:
```
https://example.com/sample-report.docx
```
</Aside>
---
## Step 2: Ask for Extraction
Just tell your AI assistant what you want:
```
Extract text from /path/to/quarterly-report.docx
```
That's it. No configuration, no required options, no ceremony.
---
## Step 3: Get Results
mcwaddams returns structured data:
```json
{
  "text": "Q4 2024 Financial Summary\n\nRevenue increased by 15%...",
  "metadata": {
    "format": "Word Document (DOCX)",
    "extraction_method": "python-docx",
    "extraction_time": 0.042,
    "word_count": 3421
  }
}
```
The AI can now use this content to answer your questions, summarize the document, or analyze it however you need.
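If you ever handle a result like this yourself, the fields are easy to pick apart. The sketch below parses the sample response from this step and builds a one-line summary. `summarize_result` is a hypothetical helper, and treating `extraction_time` as seconds is an assumption.

```python
import json

# Sample response copied from the example in this step.
response_text = """
{
  "text": "Q4 2024 Financial Summary\\n\\nRevenue increased by 15%...",
  "metadata": {
    "format": "Word Document (DOCX)",
    "extraction_method": "python-docx",
    "extraction_time": 0.042,
    "word_count": 3421
  }
}
"""


def summarize_result(result: dict) -> str:
    """Build a one-line summary from the text and metadata fields."""
    meta = result.get("metadata", {})
    first_line = result.get("text", "").split("\n", 1)[0]
    # Assumption: extraction_time is reported in seconds.
    millis = meta.get("extraction_time", 0.0) * 1000
    return (
        f"{first_line} | {meta.get('format', 'unknown format')} | "
        f"{meta.get('word_count', 0)} words via "
        f"{meta.get('extraction_method', 'unknown')} in {millis:.0f} ms"
    )


summary = summarize_result(json.loads(response_text))
print(summary)
```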
---
## What Just Happened?
Behind the scenes, mcwaddams:
<Steps>
1. **Detected the format** — Identified `.docx` as a modern Word document
2. **Selected the best method** — Used `python-docx` for optimal extraction
3. **Extracted the content** — Pulled text while preserving structure
4. **Added metadata** — Included timing and method information
</Steps>
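The four steps above can be sketched as a small dispatch pipeline. This is an illustration only, not mcwaddams' actual code: the `REGISTRY`, `register`, and `extract` names are hypothetical, format detection is assumed to go by file extension, and a stand-in extractor replaces the real `python-docx` call so the example runs without a file.

```python
import time
from pathlib import Path
from typing import Callable

# Hypothetical registry: extension -> (format label, method name, extractor).
Extractor = Callable[[str], str]
REGISTRY: dict[str, tuple[str, str, Extractor]] = {}


def register(ext: str, label: str, method: str, extractor: Extractor) -> None:
    REGISTRY[ext] = (label, method, extractor)


def extract(path: str) -> dict:
    ext = Path(path).suffix.lower()            # 1. detect the format
    if ext not in REGISTRY:
        return {"error": "Unsupported format", "extension": ext}
    label, method, extractor = REGISTRY[ext]   # 2. select the method
    start = time.perf_counter()
    text = extractor(path)                     # 3. extract the content
    elapsed = time.perf_counter() - start
    return {                                   # 4. add metadata
        "text": text,
        "metadata": {
            "format": label,
            "extraction_method": method,
            "extraction_time": round(elapsed, 3),
            "word_count": len(text.split()),
        },
    }


# Stand-in extractor: returns fixed text instead of reading a real document.
register(".docx", "Word Document (DOCX)", "python-docx",
         lambda path: "Q4 2024 Financial Summary")
result = extract("quarterly-report.docx")
print(result["metadata"]["extraction_method"])
```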
---
## Try Different Formats
The same request works for all supported formats:
### Word Documents
```
Extract text from contract.docx
Extract text from legacy-proposal.doc
```
### Excel Spreadsheets
```
Extract text from sales-data.xlsx
Extract text from budget-2019.xls
```
### PowerPoint Presentations
```
Extract text from quarterly-deck.pptx
Extract text from old-presentation.ppt
```
### CSV Files
```
Extract text from export.csv
```
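If you have a folder full of files, you can generate these requests in bulk. A rough sketch, assuming only the extensions named in this tutorial (the real supported list may be longer); `build_prompts` is a hypothetical helper.

```python
from pathlib import Path

# Extensions mentioned in this tutorial; mcwaddams may support more.
TUTORIAL_EXTENSIONS = {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt", ".csv"}


def build_prompts(paths):
    """Split paths into extraction prompts and files to skip."""
    prompts, skipped = [], []
    for p in paths:
        # Compare case-insensitively so "REPORT.DOCX" still matches.
        if Path(p).suffix.lower() in TUTORIAL_EXTENSIONS:
            prompts.append(f"Extract text from {p}")
        else:
            skipped.append(p)
    return prompts, skipped


prompts, skipped = build_prompts(["contract.docx", "budget-2019.XLS", "notes.xyz"])
print(prompts)
print(skipped)
```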
---
## Working with Large Documents
Documents over 25,000 tokens get automatically paginated:
```json
{
  "text": "Chapter 1: Introduction...",
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "cursor_id": "abc123"
  }
}
```
To get the next page:
```
Continue extracting (cursor: abc123)
```
<Aside type="note">
The AI handles pagination automatically in most cases. You'll see all the content without manually fetching pages.
</Aside>
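For the curious, following a cursor by hand looks roughly like the loop below. It is a sketch under assumptions: `fetch_page` is a hypothetical callable standing in for whatever issues the request, and each response is assumed to carry the `cursor_id` to pass for the next page. A fake three-page document keeps it runnable.

```python
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict]) -> str:
    """Follow cursors until the last page, then join the page texts."""
    parts = []
    cursor = None  # No cursor on the first request.
    while True:
        page = fetch_page(cursor)
        parts.append(page["text"])
        info = page.get("pagination")
        # Stop when the response is unpaginated or this was the final page.
        if not info or info["current_page"] >= info["total_pages"]:
            break
        cursor = info["cursor_id"]
    return "\n".join(parts)


def fake_fetch(cursor: Optional[str]) -> dict:
    """Fake three-page document used only to exercise the loop."""
    page_for_cursor = {None: 1, "abc123": 2, "def456": 3}
    next_cursor = {1: "abc123", 2: "def456", 3: "def456"}
    n = page_for_cursor[cursor]
    return {
        "text": f"page {n}",
        "pagination": {
            "current_page": n,
            "total_pages": 3,
            "cursor_id": next_cursor[n],
        },
    }


full_text = collect_pages(fake_fetch)
print(full_text)
```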
---
## Common Options
You can be more specific about what you want:
### Include Images
```
Extract text and images from report.docx
```
### Get Metadata Only
```
Get metadata from mystery-file.doc
```
### Convert to Markdown
```
Convert presentation.pptx to markdown
```
### Analyze Structure
```
Show me the structure of thesis.docx
```
---
## Error Messages
mcwaddams provides clear errors when something goes wrong:
### File Not Found
```json
{
  "error": "File not found",
  "path": "/path/to/missing.docx",
  "hint": "Check that the file path exists and is accessible"
}
```
### Unsupported Format
```json
{
  "error": "Unsupported format",
  "extension": ".xyz",
  "hint": "Use get_supported_formats to see all supported types"
}
```
### Password Protected
```json
{
  "error": "Document is password-protected",
  "hint": "Remove password protection or provide an unencrypted version"
}
```
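All three errors share an `error` field and a `hint`, so a caller can treat them uniformly. A minimal sketch, assuming successful responses carry `text` and failures carry `error`; `ExtractionError` and `unwrap` are hypothetical names, not part of mcwaddams.

```python
class ExtractionError(Exception):
    """Raised when a response contains an "error" field."""

    def __init__(self, message, hint=None):
        super().__init__(message)
        self.hint = hint


def unwrap(response: dict) -> str:
    """Return the extracted text, or raise with the error and its hint."""
    if "error" in response:
        raise ExtractionError(response["error"], response.get("hint"))
    return response["text"]


# Error payload copied from the password-protected example above.
failed = {
    "error": "Document is password-protected",
    "hint": "Remove password protection or provide an unencrypted version",
}

try:
    unwrap(failed)
except ExtractionError as exc:
    message = f"{exc}. {exc.hint}"
    print(message)
```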
---
## Next Steps
Now that you've extracted your first document:
- **[Working with Legacy Formats](/tutorials/legacy-formats/)** — Handle `.doc`, `.xls`, `.ppt`
- **[Indexing Large Documents](/tutorials/indexing/)** — Efficient access to huge files
- **[Extract Tables](/how-to/extract-tables/)** — Structured table extraction
- **[All Tools Reference](/reference/tools/)** — Complete tool documentation
---
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
"Looks like someone has a case of the Mondays."
<br/>
<small>Not anymore. Your documents are extracted.</small>
</div>