---
title: Your First Extraction
description: Extract text from an Office document in 60 seconds.
---

import { Aside, Steps, Code, Tabs, TabItem } from '@astrojs/starlight/components';

> *"I'll be honest with you, I love extracting documents. I do. I'm a mcwaddams fan."*

Let's get you extracting documents faster than you can say "TPS report cover sheet."

---

## Prerequisites

Make sure you have mcwaddams installed and configured:

<Tabs>
  <TabItem label="Claude Code">
    ```bash
    claude mcp add mcwaddams -- uvx mcwaddams
    ```

    Restart Claude Code, and you're ready.
  </TabItem>
  <TabItem label="Claude Desktop">
    Add to your `claude_desktop_config.json`:

    ```json
    {
      "mcpServers": {
        "mcwaddams": {
          "command": "uvx",
          "args": ["mcwaddams"]
        }
      }
    }
    ```

    Restart Claude Desktop.
  </TabItem>
</Tabs>
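
If you would rather script the Claude Desktop change than edit JSON by hand, the merge can be sketched in Python. This is a minimal sketch and not part of mcwaddams: `add_mcwaddams` and `update_config_file` are hypothetical helpers, and you pass in the path to your own `claude_desktop_config.json` because its location varies by platform.

```python
import json
from pathlib import Path


def add_mcwaddams(config: dict) -> dict:
    """Return a copy of a Claude Desktop config with the mcwaddams server added."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    # Same entry as in the Claude Desktop tab; other servers are left untouched.
    servers["mcwaddams"] = {"command": "uvx", "args": ["mcwaddams"]}
    updated["mcpServers"] = servers
    return updated


def update_config_file(path: Path) -> None:
    """Read, update, and rewrite the config file at `path` (hypothetical helper)."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(add_mcwaddams(config), indent=2) + "\n", encoding="utf-8")
```

Working on a copy keeps the original dictionary intact, so you can compare the two before writing anything to disk.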

---

## Step 1: Find a Document

Grab any Office document you have lying around:

- A `.docx` report
- An `.xlsx` spreadsheet
- A `.pptx` presentation
- Even a crusty `.doc` from 2005

<Aside type="tip">
Don't have one handy? You can also use a URL:
```
https://example.com/sample-report.docx
```
</Aside>
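
To make the two kinds of input concrete, here is a small Python sketch that decides whether a string looks like a URL or a local path, and whether its extension is one of the formats named in this tutorial. The `classify_source` helper and the `SUPPORTED_EXTENSIONS` set are illustrative assumptions, not mcwaddams internals; the server's real list comes from its own tooling.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions mentioned in this tutorial; an assumption, not the server's full list.
SUPPORTED_EXTENSIONS = {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt", ".csv"}


def classify_source(source: str) -> tuple[str, bool]:
    """Return (kind, supported), where kind is "url" or "path"."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        kind, name = "url", parsed.path
    else:
        kind, name = "path", source
    extension = PurePosixPath(name).suffix.lower()
    return kind, extension in SUPPORTED_EXTENSIONS


print(classify_source("https://example.com/sample-report.docx"))  # ('url', True)
print(classify_source("/path/to/quarterly-report.docx"))          # ('path', True)
print(classify_source("notes.xyz"))                                # ('path', False)
```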

---

## Step 2: Ask for Extraction

Just tell your AI assistant what you want:

```
Extract text from /path/to/quarterly-report.docx
```

That's it. No configuration, no required options, no ceremony.

---

## Step 3: Get Results

mcwaddams returns structured data:

```json
{
  "text": "Q4 2024 Financial Summary\n\nRevenue increased by 15%...",
  "metadata": {
    "format": "Word Document (DOCX)",
    "extraction_method": "python-docx",
    "extraction_time": 0.042,
    "word_count": 3421
  }
}
```

The AI can now use this content to answer questions, summarize, analyze, or do whatever else you need.
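
If you ever handle this payload in your own code instead of through the assistant, reading it takes a few lines. A minimal Python sketch, assuming the response has exactly the shape of the example JSON; `summarize_result` is a hypothetical helper, not a mcwaddams function.

```python
import json


def summarize_result(raw: str, preview_chars: int = 40) -> str:
    """Build a one-line summary from an extraction result (shape assumed from the example)."""
    result = json.loads(raw)
    meta = result.get("metadata", {})
    # Use the first line of the text as a short preview.
    preview = result["text"].split("\n", 1)[0][:preview_chars]
    fmt = meta.get("format", "unknown format")
    words = meta.get("word_count", "?")
    method = meta.get("extraction_method", "?")
    return f"{fmt}: {words} words via {method}, starts with {preview!r}"


raw = json.dumps({
    "text": "Q4 2024 Financial Summary\n\nRevenue increased by 15%...",
    "metadata": {
        "format": "Word Document (DOCX)",
        "extraction_method": "python-docx",
        "extraction_time": 0.042,
        "word_count": 3421,
    },
})
print(summarize_result(raw))
```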

---

## What Just Happened?

Behind the scenes, mcwaddams:

<Steps>
1. **Detected the format**: identified `.docx` as a modern Word document

2. **Selected the extraction method**: chose `python-docx`, the method reported in `extraction_method`

3. **Extracted the content**: pulled the text while preserving the document's structure

4. **Added metadata**: recorded the format, extraction method, timing, and word count
</Steps>
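
The four steps above can be sketched in Python. This illustrates the flow only and is not the actual mcwaddams code: `detect_format`, `extract`, and the `FORMATS` registry are hypothetical, the extractor is passed in as a plain function so the sketch runs without third-party libraries, and the metadata keys mirror the example response.

```python
import time
from pathlib import Path
from typing import Callable

ZIP_MAGIC = b"PK\x03\x04"  # .docx, .xlsx, and .pptx files are ZIP archives

# Hypothetical registry: extension -> (format label, method name).
# Only the python-docx pairing comes from the example response.
FORMATS = {
    ".docx": ("Word Document (DOCX)", "python-docx"),
}


def detect_format(name: str, header: bytes) -> tuple[str, str]:
    """Steps 1 and 2: identify the format and pick an extraction method."""
    extension = Path(name).suffix.lower()
    if extension not in FORMATS:
        raise ValueError(f"Unsupported format: {extension}")
    if extension == ".docx" and not header.startswith(ZIP_MAGIC):
        raise ValueError("File has a .docx extension but is not a ZIP archive")
    return FORMATS[extension]


def extract(name: str, data: bytes, extractor: Callable[[bytes], str]) -> dict:
    """Steps 3 and 4: run the extractor, then attach metadata."""
    label, method = detect_format(name, data[:8])
    started = time.perf_counter()
    text = extractor(data)
    elapsed = time.perf_counter() - started
    return {
        "text": text,
        "metadata": {
            "format": label,
            "extraction_method": method,
            "extraction_time": round(elapsed, 3),
            "word_count": len(text.split()),
        },
    }


# Stand-in extractor so the sketch runs anywhere; a real one would parse the archive.
fake_docx = ZIP_MAGIC + b"...archive bytes..."
result = extract("quarterly-report.docx", fake_docx, lambda _data: "Q4 2024 Financial Summary")
print(result["metadata"]["format"], result["metadata"]["word_count"])
```

Checking the file header as well as the extension catches renamed files early, before an extractor fails with a less helpful message.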

---

## Try Different Formats

The same request works for every supported format:

### Word Documents

```
Extract text from contract.docx
Extract text from legacy-proposal.doc
```

### Excel Spreadsheets

```
Extract text from sales-data.xlsx
Extract text from budget-2019.xls
```

### PowerPoint Presentations

```
Extract text from quarterly-deck.pptx
Extract text from old-presentation.ppt
```

### CSV Files

```
Extract text from export.csv
```

---

## Working with Large Documents

Documents over 25,000 tokens get automatically paginated:

```json
{
  "text": "Chapter 1: Introduction...",
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "cursor_id": "abc123"
  }
}
```

To get the next page:

```
Continue extracting (cursor: abc123)
```

<Aside type="note">
The AI handles pagination automatically in most cases. You'll see all the content without manually fetching pages.
</Aside>
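
When a client does fetch pages itself, the loop is short. The Python sketch below assumes each response carries the `pagination` object from the example and that a later page is requested by passing its `cursor_id` back. `fetch_page` is a hypothetical callable standing in for however your client invokes the tool, and whether the cursor changes between pages is not specified here.

```python
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict]) -> str:
    """Fetch every page and join the text. `fetch_page(None)` returns page 1."""
    parts = []
    cursor = None
    while True:
        response = fetch_page(cursor)
        parts.append(response["text"])
        pagination = response.get("pagination")
        # Small documents have no pagination block; the last page ends the loop.
        if not pagination or pagination["current_page"] >= pagination["total_pages"]:
            break
        cursor = pagination["cursor_id"]
    return "".join(parts)


# Fake three-page document to exercise the loop.
pages = ["Chapter 1. ", "Chapter 2. ", "Chapter 3."]
state = {"next": 0}


def fake_fetch(cursor: Optional[str]) -> dict:
    index = state["next"]
    state["next"] += 1
    return {
        "text": pages[index],
        "pagination": {
            "current_page": index + 1,
            "total_pages": len(pages),
            "cursor_id": "abc123",
        },
    }


print(collect_pages(fake_fetch))  # Chapter 1. Chapter 2. Chapter 3.
```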

---

## Common Options

You can be more specific about what you want:

### Include Images

```
Extract text and images from report.docx
```

### Get Metadata Only

```
Get metadata from mystery-file.doc
```

### Convert to Markdown

```
Convert presentation.pptx to markdown
```

### Analyze Structure

```
Show me the structure of thesis.docx
```

---

## Error Messages

When something goes wrong, mcwaddams returns a structured error with a hint:

### File Not Found

```json
{
  "error": "File not found",
  "path": "/path/to/missing.docx",
  "hint": "Check that the file path exists and is accessible"
}
```

### Unsupported Format

```json
{
  "error": "Unsupported format",
  "extension": ".xyz",
  "hint": "Use get_supported_formats to see all supported types"
}
```

### Password Protected

```json
{
  "error": "Document is password-protected",
  "hint": "Remove password protection or provide an unencrypted version"
}
```
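
Because each error payload above carries an `error` field and a `hint`, client code can treat them uniformly. A minimal Python sketch, assuming errors arrive as JSON objects with those keys; `ExtractionError` and `raise_for_error` are hypothetical names, not part of mcwaddams.

```python
from typing import Optional


class ExtractionError(Exception):
    """Raised when a response contains an `error` field."""

    def __init__(self, message: str, hint: str = "", details: Optional[dict] = None):
        super().__init__(f"{message}. {hint}" if hint else message)
        self.hint = hint
        self.details = details or {}


def raise_for_error(response: dict) -> dict:
    """Return the response unchanged, or raise if it describes an error."""
    if "error" not in response:
        return response
    # Everything besides `error` and `hint` (such as `path` or `extension`) is kept as detail.
    details = {k: v for k, v in response.items() if k not in ("error", "hint")}
    raise ExtractionError(response["error"], response.get("hint", ""), details)


try:
    raise_for_error({
        "error": "Unsupported format",
        "extension": ".xyz",
        "hint": "Use get_supported_formats to see all supported types",
    })
except ExtractionError as exc:
    print(exc)          # Unsupported format. Use get_supported_formats to see all supported types
    print(exc.details)  # {'extension': '.xyz'}
```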

---

## Next Steps

Now that you've extracted your first document:

- **[Working with Legacy Formats](/tutorials/legacy-formats/)** — Handle `.doc`, `.xls`, `.ppt`
- **[Indexing Large Documents](/tutorials/indexing/)** — Efficient access to huge files
- **[Extract Tables](/how-to/extract-tables/)** — Structured table extraction
- **[All Tools Reference](/reference/tools/)** — Complete tool documentation

---

<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
"Looks like someone has a case of the Mondays."
<br/>
<small>Not anymore. Your documents are extracted.</small>
</div>