Initial mcwaddams documentation site
Astro + Starlight documentation site for mcwaddams MCP server. Features: - Diátaxis documentation structure (tutorials, how-to, reference, explanation) - Office Space theming (Milton Waddams, TPS Reports, red Swingline) - 29 documentation pages covering all 20 tools - TPS Reports section for test results - Flair gamification config (pieces of documentation flair) - Custom CSS with Office Space color scheme Structure: - Getting Started: backstory, installation, quickstart - Tutorials: first extraction, legacy formats, indexing, resources - How-To: tables, Excel analysis, markdown, pagination, URLs - Reference: all tools, universal/word/excel tools, MCP resources, formats - Explanation: architecture, mixins, fallbacks, resource system - TPS Reports: dashboard, coverage, torture tests - Community: credits, feedback, leaderboard
This commit is contained in:
commit
32b41f79d9
21
.gitignore
vendored
Normal file
21
.gitignore
vendored
Normal file
@ -0,0 +1,21 @@
|
|||||||
|
# build output
|
||||||
|
dist/
|
||||||
|
# generated types
|
||||||
|
.astro/
|
||||||
|
|
||||||
|
# dependencies
|
||||||
|
node_modules/
|
||||||
|
|
||||||
|
# logs
|
||||||
|
npm-debug.log*
|
||||||
|
yarn-debug.log*
|
||||||
|
yarn-error.log*
|
||||||
|
pnpm-debug.log*
|
||||||
|
|
||||||
|
|
||||||
|
# environment variables
|
||||||
|
.env
|
||||||
|
.env.production
|
||||||
|
|
||||||
|
# macOS-specific files
|
||||||
|
.DS_Store
|
||||||
49
README.md
Normal file
49
README.md
Normal file
@ -0,0 +1,49 @@
|
|||||||
|
# Starlight Starter Kit: Basics
|
||||||
|
|
||||||
|
[](https://starlight.astro.build)
|
||||||
|
|
||||||
|
```
|
||||||
|
npm create astro@latest -- --template starlight
|
||||||
|
```
|
||||||
|
|
||||||
|
> 🧑🚀 **Seasoned astronaut?** Delete this file. Have fun!
|
||||||
|
|
||||||
|
## 🚀 Project Structure
|
||||||
|
|
||||||
|
Inside of your Astro + Starlight project, you'll see the following folders and files:
|
||||||
|
|
||||||
|
```
|
||||||
|
.
|
||||||
|
├── public/
|
||||||
|
├── src/
|
||||||
|
│ ├── assets/
|
||||||
|
│ ├── content/
|
||||||
|
│ │ └── docs/
|
||||||
|
│ └── content.config.ts
|
||||||
|
├── astro.config.mjs
|
||||||
|
├── package.json
|
||||||
|
└── tsconfig.json
|
||||||
|
```
|
||||||
|
|
||||||
|
Starlight looks for `.md` or `.mdx` files in the `src/content/docs/` directory. Each file is exposed as a route based on its file name.
|
||||||
|
|
||||||
|
Images can be added to `src/assets/` and embedded in Markdown with a relative link.
|
||||||
|
|
||||||
|
Static assets, like favicons, can be placed in the `public/` directory.
|
||||||
|
|
||||||
|
## 🧞 Commands
|
||||||
|
|
||||||
|
All commands are run from the root of the project, from a terminal:
|
||||||
|
|
||||||
|
| Command | Action |
|
||||||
|
| :------------------------ | :----------------------------------------------- |
|
||||||
|
| `npm install` | Installs dependencies |
|
||||||
|
| `npm run dev` | Starts local dev server at `localhost:4321` |
|
||||||
|
| `npm run build` | Build your production site to `./dist/` |
|
||||||
|
| `npm run preview` | Preview your build locally, before deploying |
|
||||||
|
| `npm run astro ...` | Run CLI commands like `astro add`, `astro check` |
|
||||||
|
| `npm run astro -- --help` | Get help using the Astro CLI |
|
||||||
|
|
||||||
|
## 👀 Want to learn more?
|
||||||
|
|
||||||
|
Check out [Starlight’s docs](https://starlight.astro.build/), read [the Astro documentation](https://docs.astro.build), or jump into the [Astro Discord server](https://astro.build/chat).
|
||||||
120
astro.config.mjs
Normal file
120
astro.config.mjs
Normal file
@ -0,0 +1,120 @@
|
|||||||
|
// @ts-check
|
||||||
|
import { defineConfig } from 'astro/config';
|
||||||
|
import starlight from '@astrojs/starlight';
|
||||||
|
import tailwindcss from '@tailwindcss/vite';
|
||||||
|
|
||||||
|
// https://astro.build/config
|
||||||
|
export default defineConfig({
|
||||||
|
vite: {
|
||||||
|
plugins: [tailwindcss()],
|
||||||
|
},
|
||||||
|
integrations: [
|
||||||
|
starlight({
|
||||||
|
title: 'mcwaddams',
|
||||||
|
tagline: 'I was told there would be document extraction.',
|
||||||
|
logo: {
|
||||||
|
src: './src/assets/stapler.svg',
|
||||||
|
replacesTitle: false,
|
||||||
|
},
|
||||||
|
social: [
|
||||||
|
{ icon: 'github', label: 'GitHub', href: 'https://github.com/ryanmalloy/mcwaddams' },
|
||||||
|
],
|
||||||
|
customCss: [
|
||||||
|
'./src/styles/custom.css',
|
||||||
|
],
|
||||||
|
head: [
|
||||||
|
{
|
||||||
|
tag: 'meta',
|
||||||
|
attrs: {
|
||||||
|
name: 'theme-color',
|
||||||
|
content: '#b91c1c',
|
||||||
|
},
|
||||||
|
},
|
||||||
|
],
|
||||||
|
sidebar: [
|
||||||
|
{
|
||||||
|
label: 'Getting Started',
|
||||||
|
items: [
|
||||||
|
{ label: 'The Backstory', slug: 'backstory' },
|
||||||
|
{ label: 'Installation', slug: 'installation' },
|
||||||
|
{ label: 'Quick Start', slug: 'quickstart' },
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
label: 'Tutorials',
|
||||||
|
badge: { text: 'Learn', variant: 'tip' },
|
||||||
|
items: [
|
||||||
|
{ label: 'Your First Extraction', slug: 'tutorials/first-extraction' },
|
||||||
|
{ label: 'Working with Legacy Formats', slug: 'tutorials/legacy-formats' },
|
||||||
|
{ label: 'Indexing Large Documents', slug: 'tutorials/indexing' },
|
||||||
|
{ label: 'Using MCP Resources', slug: 'tutorials/resources' },
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
label: 'How-To Guides',
|
||||||
|
badge: { text: 'Solve', variant: 'note' },
|
||||||
|
items: [
|
||||||
|
{ label: 'Extract Tables from Word', slug: 'how-to/extract-tables' },
|
||||||
|
{ label: 'Analyze Excel Data', slug: 'how-to/analyze-excel' },
|
||||||
|
{ label: 'Convert to Markdown', slug: 'how-to/convert-markdown' },
|
||||||
|
{ label: 'Handle Pagination', slug: 'how-to/pagination' },
|
||||||
|
{ label: 'Process URLs', slug: 'how-to/url-processing' },
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
label: 'Reference',
|
||||||
|
badge: { text: 'Look Up', variant: 'caution' },
|
||||||
|
items: [
|
||||||
|
{ label: 'All Tools', slug: 'reference/tools' },
|
||||||
|
{ label: 'Universal Tools', slug: 'reference/universal-tools' },
|
||||||
|
{ label: 'Word Tools', slug: 'reference/word-tools' },
|
||||||
|
{ label: 'Excel Tools', slug: 'reference/excel-tools' },
|
||||||
|
{ label: 'MCP Resources', slug: 'reference/resources' },
|
||||||
|
{ label: 'Format Support', slug: 'reference/formats' },
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
label: 'Explanation',
|
||||||
|
badge: { text: 'Understand', variant: 'success' },
|
||||||
|
items: [
|
||||||
|
{ label: 'Architecture', slug: 'explanation/architecture' },
|
||||||
|
{ label: 'Why Mixins?', slug: 'explanation/mixins' },
|
||||||
|
{ label: 'Fallback Strategy', slug: 'explanation/fallbacks' },
|
||||||
|
{ label: 'Resource System', slug: 'explanation/resource-system' },
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
label: 'TPS Reports',
|
||||||
|
badge: { text: 'Testing Painful Stuff', variant: 'danger' },
|
||||||
|
items: [
|
||||||
|
{ label: 'Test Dashboard', slug: 'tps/dashboard' },
|
||||||
|
{ label: 'Coverage Report', slug: 'tps/coverage' },
|
||||||
|
{ label: 'Torture Tests', slug: 'tps/torture' },
|
||||||
|
],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
label: 'Community',
|
||||||
|
items: [
|
||||||
|
{ label: 'Feedback', slug: 'community/feedback' },
|
||||||
|
{ label: 'Flair Leaderboard', slug: 'community/leaderboard' },
|
||||||
|
{ label: 'Credits', slug: 'community/credits' },
|
||||||
|
],
|
||||||
|
},
|
||||||
|
],
|
||||||
|
editLink: {
|
||||||
|
baseUrl: 'https://github.com/ryanmalloy/mcwaddams-site/edit/main/',
|
||||||
|
},
|
||||||
|
lastUpdated: true,
|
||||||
|
pagination: true,
|
||||||
|
tableOfContents: { minHeadingLevel: 2, maxHeadingLevel: 4 },
|
||||||
|
expressiveCode: {
|
||||||
|
themes: ['dracula', 'github-light'],
|
||||||
|
styleOverrides: {
|
||||||
|
borderRadius: '0.5rem',
|
||||||
|
},
|
||||||
|
},
|
||||||
|
}),
|
||||||
|
],
|
||||||
|
telemetry: false,
|
||||||
|
devToolbar: { enabled: false },
|
||||||
|
});
|
||||||
6985
package-lock.json
generated
Normal file
6985
package-lock.json
generated
Normal file
File diff suppressed because it is too large
Load Diff
19
package.json
Normal file
19
package.json
Normal file
@ -0,0 +1,19 @@
|
|||||||
|
{
|
||||||
|
"name": "mcwaddams-site",
|
||||||
|
"type": "module",
|
||||||
|
"version": "0.0.1",
|
||||||
|
"scripts": {
|
||||||
|
"dev": "astro dev",
|
||||||
|
"start": "astro dev",
|
||||||
|
"build": "astro build",
|
||||||
|
"preview": "astro preview",
|
||||||
|
"astro": "astro"
|
||||||
|
},
|
||||||
|
"dependencies": {
|
||||||
|
"@astrojs/starlight": "^0.37.2",
|
||||||
|
"@tailwindcss/vite": "^4.1.18",
|
||||||
|
"astro": "^5.6.1",
|
||||||
|
"sharp": "^0.34.2",
|
||||||
|
"tailwindcss": "^4.1.18"
|
||||||
|
}
|
||||||
|
}
|
||||||
1
public/favicon.svg
Normal file
1
public/favicon.svg
Normal file
@ -0,0 +1 @@
|
|||||||
|
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 128 128"><path fill-rule="evenodd" d="M81 36 64 0 47 36l-1 2-9-10a6 6 0 0 0-9 9l10 10h-2L0 64l36 17h2L28 91a6 6 0 1 0 9 9l9-10 1 2 17 36 17-36v-2l9 10a6 6 0 1 0 9-9l-9-9 2-1 36-17-36-17-2-1 9-9a6 6 0 1 0-9-9l-9 10v-2Zm-17 2-2 5c-4 8-11 15-19 19l-5 2 5 2c8 4 15 11 19 19l2 5 2-5c4-8 11-15 19-19l5-2-5-2c-8-4-15-11-19-19l-2-5Z" clip-rule="evenodd"/><path d="M118 19a6 6 0 0 0-9-9l-3 3a6 6 0 1 0 9 9l3-3Zm-96 4c-2 2-6 2-9 0l-3-3a6 6 0 1 1 9-9l3 3c3 2 3 6 0 9Zm0 82c-2-2-6-2-9 0l-3 3a6 6 0 1 0 9 9l3-3c3-2 3-6 0-9Zm96 4a6 6 0 0 1-9 9l-3-3a6 6 0 1 1 9-9l3 3Z"/><style>path{fill:#000}@media (prefers-color-scheme:dark){path{fill:#fff}}</style></svg>
|
||||||
|
After Width: | Height: | Size: 696 B |
BIN
src/assets/houston.webp
Normal file
BIN
src/assets/houston.webp
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 96 KiB |
26
src/assets/stapler.svg
Normal file
26
src/assets/stapler.svg
Normal file
@ -0,0 +1,26 @@
|
|||||||
|
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64" width="32" height="32">
|
||||||
|
<!-- Red Swingline Stapler -->
|
||||||
|
<defs>
|
||||||
|
<linearGradient id="staplerRed" x1="0%" y1="0%" x2="0%" y2="100%">
|
||||||
|
<stop offset="0%" style="stop-color:#ef4444"/>
|
||||||
|
<stop offset="50%" style="stop-color:#dc2626"/>
|
||||||
|
<stop offset="100%" style="stop-color:#b91c1c"/>
|
||||||
|
</linearGradient>
|
||||||
|
<linearGradient id="staplerDark" x1="0%" y1="0%" x2="0%" y2="100%">
|
||||||
|
<stop offset="0%" style="stop-color:#374151"/>
|
||||||
|
<stop offset="100%" style="stop-color:#1f2937"/>
|
||||||
|
</linearGradient>
|
||||||
|
</defs>
|
||||||
|
<!-- Base/Bottom -->
|
||||||
|
<rect x="8" y="42" width="48" height="8" rx="2" fill="url(#staplerDark)"/>
|
||||||
|
<!-- Staple channel -->
|
||||||
|
<rect x="12" y="44" width="40" height="4" rx="1" fill="#111827"/>
|
||||||
|
<!-- Top/Body -->
|
||||||
|
<path d="M10 42 L10 28 Q10 24 14 24 L50 24 Q54 24 54 28 L54 42 Z" fill="url(#staplerRed)"/>
|
||||||
|
<!-- Highlight -->
|
||||||
|
<path d="M12 26 L52 26 Q53 26 53 27 L53 30 L11 30 L11 27 Q11 26 12 26 Z" fill="rgba(255,255,255,0.2)"/>
|
||||||
|
<!-- Swingline logo area -->
|
||||||
|
<rect x="20" y="30" width="24" height="8" rx="1" fill="rgba(0,0,0,0.1)"/>
|
||||||
|
<!-- Push lever -->
|
||||||
|
<rect x="44" y="20" width="8" height="6" rx="1" fill="url(#staplerDark)"/>
|
||||||
|
</svg>
|
||||||
|
After Width: | Height: | Size: 1.2 KiB |
7
src/content.config.ts
Normal file
7
src/content.config.ts
Normal file
@ -0,0 +1,7 @@
|
|||||||
|
import { defineCollection } from 'astro:content';
|
||||||
|
import { docsLoader } from '@astrojs/starlight/loaders';
|
||||||
|
import { docsSchema } from '@astrojs/starlight/schema';
|
||||||
|
|
||||||
|
export const collections = {
|
||||||
|
docs: defineCollection({ loader: docsLoader(), schema: docsSchema() }),
|
||||||
|
};
|
||||||
98
src/content/docs/backstory.mdx
Normal file
98
src/content/docs/backstory.mdx
Normal file
@ -0,0 +1,98 @@
|
|||||||
|
---
|
||||||
|
title: The Backstory
|
||||||
|
description: How Milton Waddams became the patron saint of legacy document processing.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# The Backstory
|
||||||
|
|
||||||
|
> *"I was told I could listen to the radio at a reasonable volume from nine to eleven..."*
|
||||||
|
|
||||||
|
## The Relocation
|
||||||
|
|
||||||
|
Milton Waddams was relocated to the basement. They took his stapler. But down there, surrounded by boxes of `.doc` files from 1997 and `.xls` spreadsheets that predate Unicode, he became something else entirely.
|
||||||
|
|
||||||
|
He became a **document processing expert**.
|
||||||
|
|
||||||
|
<Aside type="tip" title="Fun Fact">
|
||||||
|
After the movie *Office Space* came out, Swingline didn't actually make a red stapler. Demand was so high they started manufacturing them. The movie created the product.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## The Problem
|
||||||
|
|
||||||
|
Every enterprise has them:
|
||||||
|
|
||||||
|
- **The Archive Folder** — 50,000 Word documents from before the cloud existed
|
||||||
|
- **The Legacy Database Export** — Excel files with formulas referencing cells that no longer exist
|
||||||
|
- **The Board Presentations** — PowerPoint decks with embedded charts from 2003
|
||||||
|
- **The Contract Repository** — `.doc` files that crash modern Word
|
||||||
|
|
||||||
|
AI agents can read PDFs. They can parse JSON. But Office documents? The binary formats, the OLE containers, the OOXML with custom schemas?
|
||||||
|
|
||||||
|
*Nobody wants to deal with that.*
|
||||||
|
|
||||||
|
## The Solution
|
||||||
|
|
||||||
|
**mcwaddams** handles the documents nobody else wants to touch.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Extract text from a 1997 .doc file
|
||||||
|
result = await extract_text("contract_final_FINAL_v2.doc")
|
||||||
|
|
||||||
|
# It just works
|
||||||
|
print(result["text"])
|
||||||
|
```
|
||||||
|
|
||||||
|
### What We Handle
|
||||||
|
|
||||||
|
| Format | Era | Status |
|
||||||
|
|--------|-----|--------|
|
||||||
|
| `.docx` | 2007+ | ✅ Full support |
|
||||||
|
| `.doc` | 1997-2007 | ✅ Works fine |
|
||||||
|
| `.xlsx` | 2007+ | ✅ Full support |
|
||||||
|
| `.xls` | 1997-2007 | ✅ Works fine |
|
||||||
|
| `.pptx` | 2007+ | ✅ Full support |
|
||||||
|
| `.ppt` | 1997-2007 | ✅ Works fine |
|
||||||
|
|
||||||
|
<Aside type="caution" title="Did You Get the Memo?">
|
||||||
|
If a document is password-protected or encrypted, we'll detect it and tell you. We can't extract content from encrypted files, but we won't silently fail either.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## The Philosophy
|
||||||
|
|
||||||
|
### 1. No Silent Failures
|
||||||
|
|
||||||
|
When python-docx can't handle a file, mammoth tries. When openpyxl fails, pandas steps in. You'll always get either content or a clear error message explaining why.
|
||||||
|
|
||||||
|
### 2. Legacy is Not Abandoned
|
||||||
|
|
||||||
|
Those `.doc` files from 2003? They're still business-critical for someone. We don't treat legacy formats as second-class citizens.
|
||||||
|
|
||||||
|
### 3. Context-Aware Extraction
|
||||||
|
|
||||||
|
Large documents get paginated automatically. The MCP resource system lets you fetch chapters on-demand. Your context window stays manageable.
|
||||||
|
|
||||||
|
### 4. Testing Painful Stuff
|
||||||
|
|
||||||
|
We threw 301 random Office documents at mcwaddams. **299 succeeded.** The 2 failures were empty/corrupt files.
|
||||||
|
|
||||||
|
See the [TPS Reports](/tps/dashboard/) for proof.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Name
|
||||||
|
|
||||||
|
Milton Waddams. The guy with the stapler. Relegated to the basement with the old filing cabinets and the roaches.
|
||||||
|
|
||||||
|
That's where the legacy documents live too.
|
||||||
|
|
||||||
|
*"I could set the building on fire..."*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem;">
|
||||||
|
|
||||||
|
**Ready to start?** → [Installation](/installation/)
|
||||||
|
|
||||||
|
</div>
|
||||||
170
src/content/docs/community/credits.mdx
Normal file
170
src/content/docs/community/credits.mdx
Normal file
@ -0,0 +1,170 @@
|
|||||||
|
---
|
||||||
|
title: Credits & Attributions
|
||||||
|
description: The open source projects and cultural references that make mcwaddams possible.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Card, CardGrid, LinkCard } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Credits & Attributions
|
||||||
|
|
||||||
|
> *"It's not just about me and my dream of doing nothing."*
|
||||||
|
|
||||||
|
mcwaddams stands on the shoulders of giants — both technical and cinematic.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Open Source Dependencies
|
||||||
|
|
||||||
|
The tools that make document extraction possible.
|
||||||
|
|
||||||
|
<CardGrid>
|
||||||
|
<Card title="python-docx" icon="document">
|
||||||
|
Modern Word document processing. The workhorse of `.docx` extraction.
|
||||||
|
<br/><small>MIT License</small>
|
||||||
|
</Card>
|
||||||
|
<Card title="openpyxl" icon="document">
|
||||||
|
Excel XLSX file processing with full formula support.
|
||||||
|
<br/><small>MIT License</small>
|
||||||
|
</Card>
|
||||||
|
<Card title="python-pptx" icon="document">
|
||||||
|
PowerPoint PPTX processing for slides and speaker notes.
|
||||||
|
<br/><small>MIT License</small>
|
||||||
|
</Card>
|
||||||
|
<Card title="mammoth" icon="document">
|
||||||
|
Enhanced Word to HTML/Markdown conversion. Our fallback hero.
|
||||||
|
<br/><small>BSD-2-Clause</small>
|
||||||
|
</Card>
|
||||||
|
</CardGrid>
|
||||||
|
|
||||||
|
<CardGrid>
|
||||||
|
<Card title="pandas" icon="document">
|
||||||
|
Data analysis powerhouse. Handles CSV and Excel fallbacks.
|
||||||
|
<br/><small>BSD-3-Clause</small>
|
||||||
|
</Card>
|
||||||
|
<Card title="olefile" icon="document">
|
||||||
|
OLE Compound Document parsing for legacy `.doc`, `.xls`, `.ppt`.
|
||||||
|
<br/><small>BSD-2-Clause</small>
|
||||||
|
</Card>
|
||||||
|
<Card title="xlrd" icon="document">
|
||||||
|
Legacy Excel XLS support. Because 2003 never really left.
|
||||||
|
<br/><small>BSD License</small>
|
||||||
|
</Card>
|
||||||
|
<Card title="Pillow" icon="document">
|
||||||
|
Image processing for embedded graphics extraction.
|
||||||
|
<br/><small>HPND License</small>
|
||||||
|
</Card>
|
||||||
|
</CardGrid>
|
||||||
|
|
||||||
|
### Framework & Tools
|
||||||
|
|
||||||
|
| Project | Purpose | License |
|
||||||
|
|---------|---------|---------|
|
||||||
|
| [FastMCP](https://github.com/jlowin/fastmcp) | MCP server framework | MIT |
|
||||||
|
| [Astro](https://astro.build) | This documentation site | MIT |
|
||||||
|
| [Starlight](https://starlight.astro.build) | Documentation theme | MIT |
|
||||||
|
| [Tailwind CSS](https://tailwindcss.com) | Styling | MIT |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Office Space
|
||||||
|
|
||||||
|
The cultural foundation of this project.
|
||||||
|
|
||||||
|
<Aside type="tip" title="Why Office Space?">
|
||||||
|
Mike Judge's 1999 film isn't just a comedy — it's a documentary about corporate dysfunction that resonates with anyone who's ever debugged a printer or processed legacy documents.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
### The References
|
||||||
|
|
||||||
|
| Reference | Location | Why It Fits |
|
||||||
|
|-----------|----------|-------------|
|
||||||
|
| **Milton Waddams** | Project name | Relegated to the basement with legacy documents |
|
||||||
|
| **TPS Reports** | Test section | "Testing Painful Stuff" — `.doc` from 1997 is painful |
|
||||||
|
| **Red Swingline** | Logo/branding | The small things that matter |
|
||||||
|
| **"Did you get the memo?"** | Throughout | Clear documentation is essential |
|
||||||
|
| **"I could set the building on fire"** | Footer | But we'd rather process documents |
|
||||||
|
| **Pieces of Flair** | Coming soon | Gamification for the bureaucratic soul |
|
||||||
|
|
||||||
|
### Key Quotes Used
|
||||||
|
|
||||||
|
> *"I was told there would be document extraction."*
|
||||||
|
|
||||||
|
Adaptation of Milton's "I was told there would be cake" — except here, there actually is document extraction.
|
||||||
|
|
||||||
|
> *"Did you get the memo about the TPS reports?"*
|
||||||
|
|
||||||
|
Our test section is literally called TPS Reports. Because testing legacy formats is painful, and we test it so you don't have to.
|
||||||
|
|
||||||
|
> *"I believe you have my stapler..."*
|
||||||
|
|
||||||
|
The persistence of caring about the small things. Like proper Unicode handling in a `.doc` from 1997.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Author
|
||||||
|
|
||||||
|
<Card title="Ryan Malloy" icon="external">
|
||||||
|
Built by Ryan Malloy at [Supported Systems](https://supported.systems).
|
||||||
|
<br/><br/>
|
||||||
|
- [ryanmalloy.com](https://ryanmalloy.com) — Personal site
|
||||||
|
- [GitHub](https://github.com/ryanmalloy) — Open source work
|
||||||
|
<br/><br/>
|
||||||
|
<small>
|
||||||
|
Read: [AI Discernment vs AI Criticism](https://ryanmalloy.com/collaborations/ai-discernment-vs-criticism/) —
|
||||||
|
On building with AI while maintaining craft standards.
|
||||||
|
</small>
|
||||||
|
</Card>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Special Thanks
|
||||||
|
|
||||||
|
### The Python Community
|
||||||
|
|
||||||
|
For building and maintaining the libraries that make Office document processing possible. Every commit to `python-docx` or `openpyxl` makes legacy document handling slightly less painful.
|
||||||
|
|
||||||
|
### The MCP Ecosystem
|
||||||
|
|
||||||
|
For creating a protocol that lets AI agents access tools without reinventing the wheel. mcwaddams exists because FastMCP made it trivial to build.
|
||||||
|
|
||||||
|
### Mike Judge
|
||||||
|
|
||||||
|
For creating Office Space and giving the tech industry a shared vocabulary for describing workplace dysfunction. The red stapler is a symbol of what happens when you ignore the people doing the actual work.
|
||||||
|
|
||||||
|
### Swingline
|
||||||
|
|
||||||
|
For actually manufacturing a red stapler after the movie came out. Sometimes life imitates art. (Fun fact: the prop department painted a black stapler red because Swingline didn't make that color in 1999.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
mcwaddams is released under the MIT License.
|
||||||
|
|
||||||
|
```
|
||||||
|
MIT License
|
||||||
|
|
||||||
|
Copyright (c) 2024 Ryan Malloy
|
||||||
|
|
||||||
|
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||||
|
of this software and associated documentation files (the "Software"), to deal
|
||||||
|
in the Software without restriction, including without limitation the rights
|
||||||
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||||
|
copies of the Software, and to permit persons to whom the Software is
|
||||||
|
furnished to do so, subject to the following conditions:
|
||||||
|
|
||||||
|
The above copyright notice and this permission notice shall be included in all
|
||||||
|
copies or substantial portions of the Software.
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 3rem;">
|
||||||
|
|
||||||
|
🔴
|
||||||
|
|
||||||
|
*I believe you have my documentation...*
|
||||||
|
|
||||||
|
<small>Unlike Milton's stapler, this one's MIT licensed. Take it.</small>
|
||||||
|
|
||||||
|
</div>
|
||||||
44
src/content/docs/community/feedback.mdx
Normal file
44
src/content/docs/community/feedback.mdx
Normal file
@ -0,0 +1,44 @@
|
|||||||
|
---
|
||||||
|
title: Feedback
|
||||||
|
description: Found a bug? Have an idea? We want to hear it.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Card, CardGrid, LinkCard } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Feedback
|
||||||
|
|
||||||
|
> *"Excuse me, I believe you have my bug report..."*
|
||||||
|
|
||||||
|
We want to hear from you — bugs, feature requests, or just letting us know what works.
|
||||||
|
|
||||||
|
## Report Issues
|
||||||
|
|
||||||
|
<LinkCard
|
||||||
|
title="GitHub Issues"
|
||||||
|
description="Report bugs or request features on our GitHub repository."
|
||||||
|
href="https://github.com/ryanmalloy/mcwaddams/issues"
|
||||||
|
/>
|
||||||
|
|
||||||
|
## What Makes a Good Bug Report
|
||||||
|
|
||||||
|
1. **Document format** — What type of file was it?
|
||||||
|
2. **What you tried** — The exact tool/prompt used
|
||||||
|
3. **What happened** — The actual result
|
||||||
|
4. **What you expected** — The desired result
|
||||||
|
5. **Sample file** — If possible (anonymized)
|
||||||
|
|
||||||
|
<Aside type="tip" title="Sensitive Documents">
|
||||||
|
If your document contains sensitive data, describe the structure without including the actual content. Or create a minimal reproduction file.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## Feature Requests
|
||||||
|
|
||||||
|
Have an idea for a new tool or improvement? Open an issue with the `enhancement` label.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
*"I could set the building on fire..."*
|
||||||
|
<br/>
|
||||||
|
<small>Please don't. Just open an issue.</small>
|
||||||
|
</div>
|
||||||
41
src/content/docs/community/leaderboard.mdx
Normal file
41
src/content/docs/community/leaderboard.mdx
Normal file
@ -0,0 +1,41 @@
|
|||||||
|
---
|
||||||
|
title: Flair Leaderboard
|
||||||
|
description: Who's got the most pieces of documentation flair?
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Flair Leaderboard
|
||||||
|
|
||||||
|
> *"You know, the Nazis had pieces of flair they made the Jews wear."*
|
||||||
|
> *"...okay, we're going to need to talk about your TPS reports."*
|
||||||
|
|
||||||
|
## Collect Flair
|
||||||
|
|
||||||
|
Spend time on documentation pages to earn pieces of flair. Each page has its own badge:
|
||||||
|
|
||||||
|
| Page | Flair |
|
||||||
|
|------|-------|
|
||||||
|
| Home | 📄 "I Was Told There Would Be Extraction" |
|
||||||
|
| Backstory | 🔴 "Basement Dweller" |
|
||||||
|
| Installation | 🖨️ "PC Load Letter" |
|
||||||
|
| Quick Start | ☕ "Case of the Mondays" |
|
||||||
|
| Tools Reference | 📋 "TPS Report Expert" |
|
||||||
|
| Architecture | 👔 "The Bobs Approved" |
|
||||||
|
| Test Dashboard | 📝 "Did You Get The Memo?" |
|
||||||
|
| Torture Tests | 😮 "O Face" |
|
||||||
|
| Credits | 🔴 "I Have Your Stapler" |
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Full gamification system with localStorage persistence and optional leaderboard API integration.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## The Goal
|
||||||
|
|
||||||
|
Collect all 12 pieces of flair. Brian from accounting has 37, but we only ask for the minimum.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
*"We need to talk about your flair..."*
|
||||||
|
</div>
|
||||||
339
src/content/docs/explanation/architecture.mdx
Normal file
339
src/content/docs/explanation/architecture.mdx
Normal file
@ -0,0 +1,339 @@
|
|||||||
|
---
|
||||||
|
title: Architecture
|
||||||
|
description: How mcwaddams processes Office documents behind the scenes.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Architecture
|
||||||
|
|
||||||
|
> *"So I was sitting in my cubicle today, and I realized... every document format is worse than the one before it."*
|
||||||
|
|
||||||
|
mcwaddams is designed around a single principle: **never silently fail**. When extraction works, you get content. When it doesn't, you get a clear explanation why.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## High-Level Overview
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ MCP Client │
|
||||||
|
│ (Claude Code, Claude Desktop) │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ FastMCP Server │
|
||||||
|
│ │
|
||||||
|
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
|
||||||
|
│ │ Universal │ │ Word │ │ Excel │ │
|
||||||
|
│ │ Mixin │ │ Mixin │ │ Mixin │ │
|
||||||
|
│ └─────────────┘ └─────────────┘ └─────────────┘ │
|
||||||
|
│ │ │ │ │
|
||||||
|
│ └────────────────┼────────────────┘ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌─────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Format Detection & Routing │ │
|
||||||
|
│ └─────────────────────────────────────────────────────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ┌────────────────┼────────────────┐ │
|
||||||
|
│ ▼ ▼ ▼ │
|
||||||
|
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
|
||||||
|
│ │ python- │ │ openpyxl │ │ python- │ │
|
||||||
|
│ │ docx │ │ pandas │ │ pptx │ │
|
||||||
|
│ └───────────┘ └───────────┘ └───────────┘ │
|
||||||
|
│ │ │ │ │
|
||||||
|
│ └────────────────┼────────────────┘ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌─────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Fallback Handlers │ │
|
||||||
|
│ │ mammoth · olefile · xlrd · legacy parsers │ │
|
||||||
|
│ └─────────────────────────────────────────────────────┘ │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Mixin Pattern
|
||||||
|
|
||||||
|
mcwaddams uses **mixins** to organize tools by document type:
|
||||||
|
|
||||||
|
### Why Mixins?
|
||||||
|
|
||||||
|
```python
|
||||||
|
class MCWaddamsServer(
|
||||||
|
UniversalMixin, # extract_text, extract_images, etc.
|
||||||
|
WordMixin, # convert_to_markdown, extract_tables, etc.
|
||||||
|
ExcelMixin, # analyze_data, extract_formulas, etc.
|
||||||
|
):
|
||||||
|
"""The main server combines all capabilities."""
|
||||||
|
pass
|
||||||
|
```
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
1. **Separation of concerns** — Word tools don't clutter Excel code
|
||||||
|
2. **Independent testing** — Each mixin can be tested in isolation
|
||||||
|
3. **Easy extension** — Add PowerPoint tools without touching existing code
|
||||||
|
4. **Clear ownership** — Easy to find where a tool is defined
|
||||||
|
|
||||||
|
<Aside type="note" title="Why Not Separate Servers?">
|
||||||
|
One server with mixins beats multiple MCP servers because:
|
||||||
|
- Single point of configuration
|
||||||
|
- Shared utilities (validation, caching)
|
||||||
|
- Cross-format operations (compare Word to Excel)
|
||||||
|
- Simpler user experience
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Format Detection
|
||||||
|
|
||||||
|
Every request starts with format detection:
|
||||||
|
|
||||||
|
```python
|
||||||
|
async def detect_format(file_path: str) -> FormatInfo:
|
||||||
|
# 1. Check file extension
|
||||||
|
ext = Path(file_path).suffix.lower()
|
||||||
|
|
||||||
|
# 2. Validate file exists and is readable
|
||||||
|
validate_file(file_path)
|
||||||
|
|
||||||
|
# 3. Detect actual format (extension might lie)
|
||||||
|
with open(file_path, 'rb') as f:
|
||||||
|
magic_bytes = f.read(8)
|
||||||
|
|
||||||
|
# 4. Check for encryption
|
||||||
|
is_encrypted = check_encryption(file_path)
|
||||||
|
|
||||||
|
return FormatInfo(
|
||||||
|
extension=ext,
|
||||||
|
actual_format=detect_from_magic(magic_bytes),
|
||||||
|
is_legacy=(ext in ['.doc', '.xls', '.ppt']),
|
||||||
|
is_encrypted=is_encrypted
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Magic Bytes
|
||||||
|
|
||||||
|
We don't trust file extensions alone:
|
||||||
|
|
||||||
|
| Magic Bytes | Format |
|
||||||
|
|-------------|--------|
|
||||||
|
| `50 4B 03 04` | ZIP-based (OOXML: docx/xlsx/pptx) |
|
||||||
|
| `D0 CF 11 E0` | OLE Compound (doc/xls/ppt) |
|
||||||
|
| `EF BB BF` or starts with text | CSV/plain text |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Fallback Strategy
|
||||||
|
|
||||||
|
Primary methods are tried first, then fallbacks:
|
||||||
|
|
||||||
|
```python
|
||||||
|
async def extract_text(file_path: str, method: str = "auto"):
|
||||||
|
if method == "auto":
|
||||||
|
# Try primary method
|
||||||
|
try:
|
||||||
|
return await primary_extraction(file_path)
|
||||||
|
except ExtractionError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Try fallback
|
||||||
|
try:
|
||||||
|
return await fallback_extraction(file_path)
|
||||||
|
except ExtractionError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# All methods failed — return helpful error
|
||||||
|
return {
|
||||||
|
"error": "Extraction failed",
|
||||||
|
"tried": ["python-docx", "mammoth"],
|
||||||
|
"hint": "File may be corrupted. Try analyze_document_health."
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Fallback Chain
|
||||||
|
|
||||||
|
| Format | Primary | Fallback 1 | Fallback 2 |
|
||||||
|
|--------|---------|------------|------------|
|
||||||
|
| `.docx` | python-docx | mammoth | — |
|
||||||
|
| `.doc` | olefile | textract* | raw OLE |
|
||||||
|
| `.xlsx` | openpyxl | pandas | — |
|
||||||
|
| `.xls` | xlrd | pandas | olefile |
|
||||||
|
| `.pptx` | python-pptx | — | — |
|
||||||
|
| `.csv` | pandas | built-in csv | — |
|
||||||
|
|
||||||
|
<small>*textract only if installed</small>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Resource System
|
||||||
|
|
||||||
|
For large documents, the resource system enables on-demand fetching:
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ index_document() │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ Resource Store │
|
||||||
|
│ │
|
||||||
|
│ doc_id: "abc123def456" │
|
||||||
|
│ │
|
||||||
|
│ chapters: [ │
|
||||||
|
│ {id: 1, uri: "chapter://abc123/1", content: "..."}, │
|
||||||
|
│ {id: 2, uri: "chapter://abc123/2", content: "..."}, │
|
||||||
|
│ ] │
|
||||||
|
│ │
|
||||||
|
│ images: [ │
|
||||||
|
│ {id: 0, uri: "image://abc123/0", data: <bytes>}, │
|
||||||
|
│ ] │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ MCP Resource Protocol │
|
||||||
|
│ │
|
||||||
|
│ Client requests: chapter://abc123/1.md │
|
||||||
|
│ Server returns: Markdown-formatted Chapter 1 │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### URI Schemes
|
||||||
|
|
||||||
|
- `chapter://doc_id/N` — Single chapter
|
||||||
|
- `chapter://doc_id/N.md` — Chapter as Markdown
|
||||||
|
- `chapter://doc_id/N.txt` — Chapter as plain text
|
||||||
|
- `chapter://doc_id/N.html` — Chapter as HTML
|
||||||
|
- `chapters://doc_id/N-M` — Range of chapters
|
||||||
|
- `image://doc_id/N` — Single image
|
||||||
|
- `sheet://doc_id/name` — Excel sheet
|
||||||
|
- `slide://doc_id/N` — PowerPoint slide
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## URL Processing & Caching
|
||||||
|
|
||||||
|
Documents from URLs are downloaded and cached:
|
||||||
|
|
||||||
|
```python
|
||||||
|
CACHE_DURATION = 3600 # 1 hour
|
||||||
|
|
||||||
|
async def resolve_file_path(path_or_url: str) -> str:
|
||||||
|
if not path_or_url.startswith(('http://', 'https://')):
|
||||||
|
return path_or_url # Local file
|
||||||
|
|
||||||
|
# Check cache
|
||||||
|
cache_key = hash_url(path_or_url)
|
||||||
|
cached = get_from_cache(cache_key)
|
||||||
|
if cached and not expired(cached):
|
||||||
|
return cached.path
|
||||||
|
|
||||||
|
# Download
|
||||||
|
async with aiohttp.ClientSession() as session:
|
||||||
|
async with session.get(path_or_url) as response:
|
||||||
|
content = await response.read()
|
||||||
|
|
||||||
|
# Save to temp and cache
|
||||||
|
temp_path = save_to_temp(content, extension_from_url(path_or_url))
|
||||||
|
add_to_cache(cache_key, temp_path)
|
||||||
|
|
||||||
|
return temp_path
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error Handling Philosophy
|
||||||
|
|
||||||
|
Every error must be:
|
||||||
|
1. **Clear** — What went wrong
|
||||||
|
2. **Actionable** — What can be done about it
|
||||||
|
3. **Helpful** — Suggestions for next steps
|
||||||
|
|
||||||
|
```python
|
||||||
|
class ExtractionError(Exception):
|
||||||
|
def __init__(self, message: str, hint: str = None, tried: list = None):
|
||||||
|
self.message = message
|
||||||
|
self.hint = hint or "Check file integrity with analyze_document_health"
|
||||||
|
self.tried = tried or []
|
||||||
|
|
||||||
|
def to_dict(self):
|
||||||
|
return {
|
||||||
|
"error": self.message,
|
||||||
|
"hint": self.hint,
|
||||||
|
"tried_methods": self.tried
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Performance Considerations
|
||||||
|
|
||||||
|
### Token Limits
|
||||||
|
|
||||||
|
Large documents are paginated at ~25,000 tokens:
|
||||||
|
|
||||||
|
```python
|
||||||
|
MAX_TOKENS = 25000
|
||||||
|
TOKENS_PER_CHAR = 0.25 # Rough estimate
|
||||||
|
|
||||||
|
def should_paginate(content: str) -> bool:
|
||||||
|
estimated_tokens = len(content) * TOKENS_PER_CHAR
|
||||||
|
return estimated_tokens > MAX_TOKENS
|
||||||
|
```
|
||||||
|
|
||||||
|
### Lazy Loading
|
||||||
|
|
||||||
|
Resources aren't loaded until requested:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# index_document only creates references
|
||||||
|
index = await index_document("huge-book.docx")
|
||||||
|
# Returns immediately with URIs
|
||||||
|
|
||||||
|
# Content loaded only when requested
|
||||||
|
chapter_1 = await mcp_resources.read("chapter://abc123/1")
|
||||||
|
# Now the content is loaded
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Extending mcwaddams
|
||||||
|
|
||||||
|
Want to add new functionality?
|
||||||
|
|
||||||
|
### Add a New Tool
|
||||||
|
|
||||||
|
```python
|
||||||
|
# In the appropriate mixin
|
||||||
|
@mcp.tool()
|
||||||
|
async def my_new_tool(file_path: str) -> dict:
|
||||||
|
"""Description for MCP discovery."""
|
||||||
|
path = await resolve_file_path(file_path)
|
||||||
|
validate_file(path)
|
||||||
|
|
||||||
|
# Your logic here
|
||||||
|
|
||||||
|
return {"result": "..."}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Add Format Support
|
||||||
|
|
||||||
|
```python
|
||||||
|
# In format detection
|
||||||
|
FORMAT_HANDLERS = {
|
||||||
|
'.docx': DocxHandler,
|
||||||
|
'.doc': DocHandler,
|
||||||
|
'.rtf': RtfHandler, # Add new format
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
"What would you say... you do here?"
|
||||||
|
<br/>
|
||||||
|
<small>We extract documents. It's pretty straightforward.</small>
|
||||||
|
</div>
|
||||||
31
src/content/docs/explanation/fallbacks.mdx
Normal file
31
src/content/docs/explanation/fallbacks.mdx
Normal file
@ -0,0 +1,31 @@
|
|||||||
|
---
|
||||||
|
title: Fallback Strategy
|
||||||
|
description: How mcwaddams tries multiple methods to extract your documents.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Fallback Strategy
|
||||||
|
|
||||||
|
> *"We fixed the glitch."*
|
||||||
|
|
||||||
|
When the primary extraction method fails, mcwaddams automatically tries alternatives.
|
||||||
|
|
||||||
|
## Fallback Chain
|
||||||
|
|
||||||
|
| Format | Primary | Fallback |
|
||||||
|
|--------|---------|----------|
|
||||||
|
| `.docx` | python-docx | mammoth |
|
||||||
|
| `.doc` | olefile | textract |
|
||||||
|
| `.xlsx` | openpyxl | pandas |
|
||||||
|
| `.xls` | xlrd | pandas |
|
||||||
|
|
||||||
|
## The Philosophy
|
||||||
|
|
||||||
|
1. **Try the best method first** — Usually works
|
||||||
|
2. **Fall back gracefully** — Alternative parsers step in
|
||||||
|
3. **Never silently fail** — Always return content or a clear error
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
See [Architecture](/explanation/architecture/) for implementation details.
|
||||||
|
</Aside>
|
||||||
35
src/content/docs/explanation/mixins.mdx
Normal file
35
src/content/docs/explanation/mixins.mdx
Normal file
@ -0,0 +1,35 @@
|
|||||||
|
---
|
||||||
|
title: Why Mixins?
|
||||||
|
description: The architectural decision behind mcwaddams's modular structure.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Why Mixins?
|
||||||
|
|
||||||
|
> *"What would you say... you do here?"*
|
||||||
|
|
||||||
|
mcwaddams uses Python mixins to organize 20 tools into logical groups without creating multiple MCP servers.
|
||||||
|
|
||||||
|
## The Pattern
|
||||||
|
|
||||||
|
```python
|
||||||
|
class MCWaddamsServer(
|
||||||
|
UniversalMixin, # 7 cross-format tools
|
||||||
|
WordMixin, # 10 Word-specific tools
|
||||||
|
ExcelMixin, # 3 Excel-specific tools
|
||||||
|
):
|
||||||
|
pass
|
||||||
|
```
|
||||||
|
|
||||||
|
## Benefits
|
||||||
|
|
||||||
|
1. **Single server** — One MCP configuration, not three
|
||||||
|
2. **Shared utilities** — Validation, caching, format detection
|
||||||
|
3. **Clear boundaries** — Each mixin owns its tools
|
||||||
|
4. **Easy testing** — Test mixins in isolation
|
||||||
|
5. **Simple extension** — Add PowerPoint mixin without touching Word code
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
See [Architecture](/explanation/architecture/) for the full design.
|
||||||
|
</Aside>
|
||||||
35
src/content/docs/explanation/resource-system.mdx
Normal file
35
src/content/docs/explanation/resource-system.mdx
Normal file
@ -0,0 +1,35 @@
|
|||||||
|
---
|
||||||
|
title: Resource System
|
||||||
|
description: How MCP resources enable on-demand document access.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Resource System
|
||||||
|
|
||||||
|
> *"I was told I could fetch chapters at a reasonable rate..."*
|
||||||
|
|
||||||
|
The MCP resource system lets you access parts of indexed documents without reprocessing.
|
||||||
|
|
||||||
|
## How It Works
|
||||||
|
|
||||||
|
```
|
||||||
|
index_document("huge-book.docx")
|
||||||
|
→ Creates resource references
|
||||||
|
→ Stores content in memory
|
||||||
|
→ Returns URIs for on-demand access
|
||||||
|
|
||||||
|
mcp_resources.read("chapter://abc123/1")
|
||||||
|
→ Fetches only Chapter 1
|
||||||
|
→ No re-parsing the document
|
||||||
|
```
|
||||||
|
|
||||||
|
## Why Resources?
|
||||||
|
|
||||||
|
- **Context window management** — Don't load 500 pages at once
|
||||||
|
- **Selective access** — Fetch only what you need
|
||||||
|
- **Format conversion** — `.md`, `.txt`, `.html` on the fly
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
See [MCP Resources Reference](/reference/resources/) for URI patterns.
|
||||||
|
</Aside>
|
||||||
31
src/content/docs/how-to/analyze-excel.mdx
Normal file
31
src/content/docs/how-to/analyze-excel.mdx
Normal file
@ -0,0 +1,31 @@
|
|||||||
|
---
|
||||||
|
title: Analyze Excel Data
|
||||||
|
description: Get statistical analysis and data quality insights from spreadsheets.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Analyze Excel Data
|
||||||
|
|
||||||
|
> *"I did nothing and it was everything I thought it could be."*
|
||||||
|
|
||||||
|
Get comprehensive insights from Excel spreadsheets including statistics, data types, and quality checks.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await analyze_excel_data(
|
||||||
|
file_path="sales-data.xlsx",
|
||||||
|
include_statistics=True,
|
||||||
|
check_data_quality=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Returns:**
|
||||||
|
- Column data types
|
||||||
|
- Statistical summaries (mean, median, std dev)
|
||||||
|
- Missing value detection
|
||||||
|
- Duplicate identification
|
||||||
|
- Outlier flagging
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Detailed analysis guide in progress.
|
||||||
|
</Aside>
|
||||||
26
src/content/docs/how-to/convert-markdown.mdx
Normal file
26
src/content/docs/how-to/convert-markdown.mdx
Normal file
@ -0,0 +1,26 @@
|
|||||||
|
---
|
||||||
|
title: Convert to Markdown
|
||||||
|
description: Transform Word documents into clean Markdown.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Convert to Markdown
|
||||||
|
|
||||||
|
> *"Why should I change? The document format's the one who sucks."*
|
||||||
|
|
||||||
|
Convert Word documents to Markdown while preserving structure, headings, lists, and tables.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await convert_to_markdown(
|
||||||
|
file_path="report.docx",
|
||||||
|
preserve_structure=True,
|
||||||
|
include_images=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
Images are extracted to files and linked in the Markdown output.
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Detailed conversion guide in progress.
|
||||||
|
</Aside>
|
||||||
256
src/content/docs/how-to/extract-tables.mdx
Normal file
256
src/content/docs/how-to/extract-tables.mdx
Normal file
@ -0,0 +1,256 @@
|
|||||||
|
---
|
||||||
|
title: Extract Tables from Word Documents
|
||||||
|
description: Get structured table data from Word documents in multiple formats.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Steps, Tabs, TabItem } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Extract Tables from Word Documents
|
||||||
|
|
||||||
|
> *"We need to talk about your table extraction..."*
|
||||||
|
|
||||||
|
Word documents often contain important data locked in tables. mcwaddams extracts them as structured data you can actually use.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Problem
|
||||||
|
|
||||||
|
You have a Word document with a table like this:
|
||||||
|
|
||||||
|
| Product | Q1 | Q2 | Q3 | Q4 |
|
||||||
|
|---------|-----|-----|-----|-----|
|
||||||
|
| Widget A | 1,234 | 1,456 | 1,678 | 1,890 |
|
||||||
|
| Widget B | 987 | 1,012 | 1,045 | 1,089 |
|
||||||
|
| Widget C | 456 | 489 | 512 | 534 |
|
||||||
|
|
||||||
|
You need that data in a format you can process, not trapped in a binary `.docx` file.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Solution
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract tables from quarterly-report.docx as markdown
|
||||||
|
```
|
||||||
|
|
||||||
|
**Result:**
|
||||||
|
```markdown
|
||||||
|
| Product | Q1 | Q2 | Q3 | Q4 |
|
||||||
|
|---------|-----|-----|-----|-----|
|
||||||
|
| Widget A | 1,234 | 1,456 | 1,678 | 1,890 |
|
||||||
|
| Widget B | 987 | 1,012 | 1,045 | 1,089 |
|
||||||
|
| Widget C | 456 | 489 | 512 | 534 |
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Output Formats
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem label="Markdown">
|
||||||
|
```
|
||||||
|
Extract tables from report.docx as markdown
|
||||||
|
```
|
||||||
|
|
||||||
|
Perfect for documentation, README files, or pasting into other documents.
|
||||||
|
</TabItem>
|
||||||
|
|
||||||
|
<TabItem label="JSON">
|
||||||
|
```
|
||||||
|
Extract tables from report.docx as JSON
|
||||||
|
```
|
||||||
|
|
||||||
|
Returns structured data:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tables": [
|
||||||
|
{
|
||||||
|
"index": 0,
|
||||||
|
"headers": ["Product", "Q1", "Q2", "Q3", "Q4"],
|
||||||
|
"rows": [
|
||||||
|
["Widget A", "1,234", "1,456", "1,678", "1,890"],
|
||||||
|
["Widget B", "987", "1,012", "1,045", "1,089"]
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
</TabItem>
|
||||||
|
|
||||||
|
<TabItem label="CSV">
|
||||||
|
```
|
||||||
|
Extract tables from report.docx as CSV
|
||||||
|
```
|
||||||
|
|
||||||
|
Ready for spreadsheet import:
|
||||||
|
```
|
||||||
|
Product,Q1,Q2,Q3,Q4
|
||||||
|
Widget A,1234,1456,1678,1890
|
||||||
|
Widget B,987,1012,1045,1089
|
||||||
|
```
|
||||||
|
</TabItem>
|
||||||
|
|
||||||
|
<TabItem label="Structured">
|
||||||
|
```
|
||||||
|
Extract tables from report.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
Full metadata with styling information:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tables": [
|
||||||
|
{
|
||||||
|
"index": 0,
|
||||||
|
"position": "after paragraph 12",
|
||||||
|
"headers": ["Product", "Q1", "Q2", "Q3", "Q4"],
|
||||||
|
"rows": [...],
|
||||||
|
"styling": {
|
||||||
|
"has_header_row": true,
|
||||||
|
"borders": "all",
|
||||||
|
"alignment": "left"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Handling Complex Tables
|
||||||
|
|
||||||
|
### Merged Cells
|
||||||
|
|
||||||
|
Tables with merged cells are normalized:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract tables, preserve merged cells
|
||||||
|
```
|
||||||
|
|
||||||
|
The tool maintains cell relationships while making the data usable.
|
||||||
|
|
||||||
|
### Nested Tables
|
||||||
|
|
||||||
|
Tables within tables? We handle those too:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tables": [
|
||||||
|
{
|
||||||
|
"index": 0,
|
||||||
|
"nested_tables": [
|
||||||
|
{
|
||||||
|
"cell": [1, 2],
|
||||||
|
"data": [...]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Large Tables
|
||||||
|
|
||||||
|
Tables spanning multiple pages extract completely — no truncation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Multiple Tables
|
||||||
|
|
||||||
|
Documents with several tables return all of them:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract all tables from multi-section-report.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tables": [
|
||||||
|
{"index": 0, "context": "Executive Summary", "rows": 5},
|
||||||
|
{"index": 1, "context": "Financial Data", "rows": 24},
|
||||||
|
{"index": 2, "context": "Appendix A", "rows": 100}
|
||||||
|
],
|
||||||
|
"total_tables": 3
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Specific Table Selection
|
||||||
|
|
||||||
|
Extract only the table you need:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract the second table from report.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
Or by context:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract the table under "Financial Summary"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Common Issues
|
||||||
|
|
||||||
|
### Empty Cells
|
||||||
|
|
||||||
|
Empty cells are preserved as empty strings, not null:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"rows": [
|
||||||
|
["Widget A", "", "1,456", "1,678", "1,890"]
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Formatting in Cells
|
||||||
|
|
||||||
|
Bold, italic, and other formatting is stripped for clean data. If you need styled output, use markdown format.
|
||||||
|
|
||||||
|
### Numeric Detection
|
||||||
|
|
||||||
|
Numbers remain as strings to preserve formatting (commas, currency symbols). Convert as needed:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import json
|
||||||
|
data = json.loads(result)
|
||||||
|
# Convert strings to numbers
|
||||||
|
for row in data["tables"][0]["rows"]:
|
||||||
|
row[1] = int(row[1].replace(",", ""))
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Legacy Documents
|
||||||
|
|
||||||
|
Tables in `.doc` files work the same way:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract tables from old-report.doc as JSON
|
||||||
|
```
|
||||||
|
|
||||||
|
The OLE parser handles legacy table structures automatically.
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
Legacy `.doc` tables may have slight formatting differences, but data extraction is reliable.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
- **[Analyze Excel Data](/how-to/analyze-excel/)** — Statistical analysis of spreadsheets
|
||||||
|
- **[Convert to Markdown](/how-to/convert-markdown/)** — Full document conversion
|
||||||
|
- **[All Tools Reference](/reference/tools/)** — Complete documentation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
"PC Load Letter? What does that mean?"
|
||||||
|
<br/>
|
||||||
|
<small>We don't know either, but at least tables make sense now.</small>
|
||||||
|
</div>
|
||||||
32
src/content/docs/how-to/pagination.mdx
Normal file
32
src/content/docs/how-to/pagination.mdx
Normal file
@ -0,0 +1,32 @@
|
|||||||
|
---
|
||||||
|
title: Handle Pagination
|
||||||
|
description: Work with documents that exceed token limits.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Handle Pagination
|
||||||
|
|
||||||
|
> *"Yeah, I'm gonna need you to go ahead and come in on Saturday for page 2..."*
|
||||||
|
|
||||||
|
Documents over 25,000 tokens are automatically paginated. Use cursors to fetch subsequent pages.
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"text": "Chapter 1...",
|
||||||
|
"pagination": {
|
||||||
|
"current_page": 1,
|
||||||
|
"total_pages": 5,
|
||||||
|
"cursor_id": "abc123"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Continue with:
|
||||||
|
```
|
||||||
|
Continue extraction with cursor abc123
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Detailed pagination guide in progress.
|
||||||
|
</Aside>
|
||||||
24
src/content/docs/how-to/url-processing.mdx
Normal file
24
src/content/docs/how-to/url-processing.mdx
Normal file
@ -0,0 +1,24 @@
|
|||||||
|
---
|
||||||
|
title: Process URLs
|
||||||
|
description: Extract documents directly from HTTP/HTTPS URLs.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Process URLs
|
||||||
|
|
||||||
|
> *"We fixed the glitch... by caching the download."*
|
||||||
|
|
||||||
|
All tools accept HTTP/HTTPS URLs directly. Files are cached for 1 hour.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_text(
|
||||||
|
"https://example.com/quarterly-report.docx"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
The URL is downloaded, cached, and processed like a local file.
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Detailed URL processing guide in progress.
|
||||||
|
</Aside>
|
||||||
102
src/content/docs/index.mdx
Normal file
102
src/content/docs/index.mdx
Normal file
@ -0,0 +1,102 @@
|
|||||||
|
---
|
||||||
|
title: mcwaddams
|
||||||
|
description: MCP server for Microsoft Office document processing. Named for Milton Waddams, who was relocated to the basement with boxes of legacy documents.
|
||||||
|
template: splash
|
||||||
|
hero:
|
||||||
|
tagline: "I was told there would be document extraction."
|
||||||
|
image:
|
||||||
|
file: ../../assets/stapler.svg
|
||||||
|
actions:
|
||||||
|
- text: Get Started
|
||||||
|
link: /installation/
|
||||||
|
icon: right-arrow
|
||||||
|
variant: primary
|
||||||
|
- text: View on GitHub
|
||||||
|
link: https://github.com/ryanmalloy/mcwaddams
|
||||||
|
icon: external
|
||||||
|
variant: minimal
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Card, CardGrid, LinkCard } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
## The Situation
|
||||||
|
|
||||||
|
You have Office documents. Lots of them. Some are modern `.docx` files. Others are `.doc` files from 1997 that predate Unicode. Your AI agent needs to read them, but native PDF-to-text tools just stare blankly.
|
||||||
|
|
||||||
|
**mcwaddams** handles what nobody else wants to touch.
|
||||||
|
|
||||||
|
<CardGrid stagger>
|
||||||
|
<Card title="20 Tools" icon="document">
|
||||||
|
Extract text, tables, images, metadata. Analyze structure. Convert to Markdown. Index for on-demand fetching.
|
||||||
|
</Card>
|
||||||
|
<Card title="Legacy Support" icon="puzzle">
|
||||||
|
`.doc`, `.xls`, `.ppt` — the formats everyone forgot about until they need them. We didn't forget.
|
||||||
|
</Card>
|
||||||
|
<Card title="Smart Fallbacks" icon="setting">
|
||||||
|
When python-docx chokes, mammoth steps in. When openpyxl fails, pandas takes over. No silent failures.
|
||||||
|
</Card>
|
||||||
|
<Card title="MCP Resources" icon="open-book">
|
||||||
|
Index once, fetch chapters/sheets/slides on demand. Keep your context window manageable.
|
||||||
|
</Card>
|
||||||
|
</CardGrid>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quick Install
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# With uvx (recommended)
|
||||||
|
uvx mcwaddams
|
||||||
|
|
||||||
|
# Add to Claude Code
|
||||||
|
claude mcp add mcwaddams "uvx mcwaddams"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Documentation Structure
|
||||||
|
|
||||||
|
This documentation follows the [Diátaxis framework](https://diataxis.fr/):
|
||||||
|
|
||||||
|
<CardGrid>
|
||||||
|
<LinkCard
|
||||||
|
title="Tutorials"
|
||||||
|
description="Learn by doing. Step-by-step guides for common workflows."
|
||||||
|
href="/tutorials/first-extraction/"
|
||||||
|
/>
|
||||||
|
<LinkCard
|
||||||
|
title="How-To Guides"
|
||||||
|
description="Solve specific problems. Task-oriented recipes."
|
||||||
|
href="/how-to/extract-tables/"
|
||||||
|
/>
|
||||||
|
<LinkCard
|
||||||
|
title="Reference"
|
||||||
|
description="Look up tool parameters and return values."
|
||||||
|
href="/reference/tools/"
|
||||||
|
/>
|
||||||
|
<LinkCard
|
||||||
|
title="Explanation"
|
||||||
|
description="Understand the architecture and design decisions."
|
||||||
|
href="/explanation/architecture/"
|
||||||
|
/>
|
||||||
|
</CardGrid>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## TPS Reports
|
||||||
|
|
||||||
|
> *"Did you get the memo about the TPS reports?"*
|
||||||
|
|
||||||
|
We take testing seriously. Check out our [Test Dashboard](/tps/dashboard/) for live results, or browse the [Torture Tests](/tps/torture/) where we threw 301 random Office documents at mcwaddams.
|
||||||
|
|
||||||
|
**Result:** 99.3% success rate. The 2 failures were empty/corrupt files.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 3rem; opacity: 0.6;">
|
||||||
|
<small>
|
||||||
|
Named for Milton Waddams, who was relocated to the basement with the legacy documents.
|
||||||
|
<br/>
|
||||||
|
<em>"I could set the building on fire..."</em>
|
||||||
|
</small>
|
||||||
|
</div>
|
||||||
140
src/content/docs/installation.mdx
Normal file
140
src/content/docs/installation.mdx
Normal file
@ -0,0 +1,140 @@
|
|||||||
|
---
|
||||||
|
title: Installation
|
||||||
|
description: Get mcwaddams up and running in under a minute.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Tabs, TabItem, Aside, Steps } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Installation
|
||||||
|
|
||||||
|
> *"PC Load Letter? What the f*** does that mean?"*
|
||||||
|
|
||||||
|
Don't worry. This is simpler than fixing the printer.
|
||||||
|
|
||||||
|
## Quick Install
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem label="uvx (Recommended)">
|
||||||
|
```bash
|
||||||
|
# Run directly without installing
|
||||||
|
uvx mcwaddams
|
||||||
|
```
|
||||||
|
|
||||||
|
This downloads and runs mcwaddams in an isolated environment. No global installs, no dependency conflicts.
|
||||||
|
</TabItem>
|
||||||
|
|
||||||
|
<TabItem label="pip">
|
||||||
|
```bash
|
||||||
|
pip install mcwaddams
|
||||||
|
```
|
||||||
|
</TabItem>
|
||||||
|
|
||||||
|
<TabItem label="uv">
|
||||||
|
```bash
|
||||||
|
uv add mcwaddams
|
||||||
|
```
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
## Configure Your MCP Client
|
||||||
|
|
||||||
|
### Claude Code
|
||||||
|
|
||||||
|
```bash
|
||||||
|
claude mcp add mcwaddams "uvx mcwaddams"
|
||||||
|
```
|
||||||
|
|
||||||
|
That's it. The server will be available in your next Claude Code session.
|
||||||
|
|
||||||
|
### Claude Desktop
|
||||||
|
|
||||||
|
Add to your `claude_desktop_config.json`:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"mcpServers": {
|
||||||
|
"mcwaddams": {
|
||||||
|
"command": "uvx",
|
||||||
|
"args": ["mcwaddams"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="tip" title="Config Location">
|
||||||
|
- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
|
||||||
|
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
|
||||||
|
- **Linux**: `~/.config/claude/claude_desktop_config.json`
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
### Other MCP Clients
|
||||||
|
|
||||||
|
mcwaddams is a standard MCP server. Any client that speaks the [Model Context Protocol](https://modelcontextprotocol.io) can use it:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Generic stdio transport
|
||||||
|
uvx mcwaddams
|
||||||
|
```
|
||||||
|
|
||||||
|
## Verify Installation
|
||||||
|
|
||||||
|
<Steps>
|
||||||
|
1. **Start your MCP client** (Claude Code, Claude Desktop, etc.)
|
||||||
|
|
||||||
|
2. **Check available tools**
|
||||||
|
|
||||||
|
Ask: *"What Office document tools do you have?"*
|
||||||
|
|
||||||
|
You should see 20 tools including `extract_text`, `convert_to_markdown`, `analyze_excel_data`, etc.
|
||||||
|
|
||||||
|
3. **Test extraction**
|
||||||
|
|
||||||
|
Point it at any Office document:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from /path/to/document.docx
|
||||||
|
```
|
||||||
|
</Steps>
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
mcwaddams bundles everything it needs:
|
||||||
|
|
||||||
|
| Library | Purpose |
|
||||||
|
|---------|---------|
|
||||||
|
| `python-docx` | Modern Word documents |
|
||||||
|
| `openpyxl` | Modern Excel spreadsheets |
|
||||||
|
| `python-pptx` | Modern PowerPoint |
|
||||||
|
| `mammoth` | Word to Markdown fallback |
|
||||||
|
| `pandas` | Excel fallback + CSV |
|
||||||
|
| `olefile` | Legacy OLE formats (.doc, .xls, .ppt) |
|
||||||
|
| `xlrd` | Legacy Excel support |
|
||||||
|
| `Pillow` | Image extraction |
|
||||||
|
|
||||||
|
<Aside type="note" title="No External Services">
|
||||||
|
mcwaddams runs entirely locally. No API keys, no cloud services, no telemetry. Your documents stay on your machine.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### "Module not found" errors
|
||||||
|
|
||||||
|
Make sure you're using Python 3.11 or higher:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python --version # Should be 3.11+
|
||||||
|
```
|
||||||
|
|
||||||
|
### MCP server not appearing
|
||||||
|
|
||||||
|
1. Restart your MCP client after adding the config
|
||||||
|
2. Check the config file syntax (valid JSON)
|
||||||
|
3. Verify `uvx` is in your PATH: `which uvx`
|
||||||
|
|
||||||
|
### Permission errors on Windows
|
||||||
|
|
||||||
|
Run your terminal as Administrator, or check that Python has access to the document locations.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Next:** [Quick Start](/quickstart/) — Extract your first document
|
||||||
143
src/content/docs/quickstart.mdx
Normal file
143
src/content/docs/quickstart.mdx
Normal file
@ -0,0 +1,143 @@
|
|||||||
|
---
|
||||||
|
title: Quick Start
|
||||||
|
description: Extract your first document in 60 seconds.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Steps, Code } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Quick Start
|
||||||
|
|
||||||
|
> *"I'll be honest with you, I love his music. I do. I'm a Michael Bolton fan."*
|
||||||
|
|
||||||
|
Let's get you extracting documents faster than you can say "TPS report cover sheet."
|
||||||
|
|
||||||
|
## Your First Extraction
|
||||||
|
|
||||||
|
<Steps>
|
||||||
|
1. **Point at a document**
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from /path/to/quarterly-report.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Get the content**
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"text": "Q4 2024 Financial Summary\n\nRevenue increased by 15%...",
|
||||||
|
"metadata": {
|
||||||
|
"format": "Word Document (DOCX)",
|
||||||
|
"extraction_method": "python-docx",
|
||||||
|
"extraction_time": 0.042
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **That's it.**
|
||||||
|
</Steps>
|
||||||
|
|
||||||
|
## Common Operations
|
||||||
|
|
||||||
|
### Extract Text (Any Format)
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Works with .docx, .doc, .xlsx, .xls, .pptx, .ppt, .csv
|
||||||
|
result = await extract_text("document.docx")
|
||||||
|
print(result["text"])
|
||||||
|
```
|
||||||
|
|
||||||
|
### Convert Word to Markdown
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await convert_to_markdown("report.docx")
|
||||||
|
print(result["markdown"])
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="tip" title="Large Documents">
|
||||||
|
Documents over 25k tokens get paginated automatically. Use `cursor_id` to fetch the next chunk, or `page_range` to grab specific pages.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
### Extract Tables
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_word_tables(
|
||||||
|
"contract.docx",
|
||||||
|
output_format="markdown"
|
||||||
|
)
|
||||||
|
# Returns tables as markdown tables
|
||||||
|
```
|
||||||
|
|
||||||
|
### Analyze Excel Data
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await analyze_excel_data(
|
||||||
|
"sales-data.xlsx",
|
||||||
|
include_statistics=True,
|
||||||
|
check_data_quality=True
|
||||||
|
)
|
||||||
|
# Returns column types, missing values, outliers, statistics
|
||||||
|
```
|
||||||
|
|
||||||
|
### Index for On-Demand Fetching
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Index once
|
||||||
|
result = await index_document("novel.docx")
|
||||||
|
# Returns: {"doc_id": "abc123", "resources": {...}}
|
||||||
|
|
||||||
|
# Fetch chapters on demand via MCP resources
|
||||||
|
# chapter://abc123/1 → Chapter 1
|
||||||
|
# chapter://abc123/1.txt → Plain text
|
||||||
|
# chapters://abc123/1-5 → Multiple chapters
|
||||||
|
```
|
||||||
|
|
||||||
|
## Working with URLs
|
||||||
|
|
||||||
|
mcwaddams can fetch documents directly from URLs:
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_text("https://example.com/report.docx")
|
||||||
|
```
|
||||||
|
|
||||||
|
Files are cached for 1 hour by default.
|
||||||
|
|
||||||
|
## Format Detection
|
||||||
|
|
||||||
|
Not sure what you're dealing with?
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await detect_office_format("mystery-file.doc")
|
||||||
|
# Returns: format, version, encryption status, document category
|
||||||
|
```
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
mcwaddams never silently fails. You'll get either:
|
||||||
|
|
||||||
|
1. **Content** — The extracted text/data
|
||||||
|
2. **Clear error** — Explaining exactly what went wrong
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_text("encrypted.docx")
|
||||||
|
# Returns: {"error": "Document is password-protected", "hint": "..."}
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="caution" title="Encrypted Files">
|
||||||
|
We detect encrypted/password-protected files but can't extract their content. You'll get a clear message explaining the situation.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
- **[Tutorials](/tutorials/first-extraction/)** — Deeper walkthrough of each tool
|
||||||
|
- **[Reference](/reference/tools/)** — All 20 tools with parameters
|
||||||
|
- **[TPS Reports](/tps/dashboard/)** — See our test results
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
"Looks like someone has a case of the Mondays."
|
||||||
|
<br/>
|
||||||
|
<small>Not anymore. Your documents are handled.</small>
|
||||||
|
</div>
|
||||||
25
src/content/docs/reference/excel-tools.mdx
Normal file
25
src/content/docs/reference/excel-tools.mdx
Normal file
@ -0,0 +1,25 @@
|
|||||||
|
---
|
||||||
|
title: Excel Tools
|
||||||
|
description: Specialized tools for Excel spreadsheet processing.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Excel Tools
|
||||||
|
|
||||||
|
These 3 tools are specialized for `.xlsx`, `.xls`, and `.csv` files.
|
||||||
|
|
||||||
|
## analyze_excel_data
|
||||||
|
Comprehensive statistical analysis including data types, missing values, and quality checks.
|
||||||
|
|
||||||
|
## extract_excel_formulas
|
||||||
|
Extract and analyze formulas with dependency mapping.
|
||||||
|
|
||||||
|
## create_excel_chart_data
|
||||||
|
Generate chart configurations for visualization libraries (Chart.js, Plotly, Matplotlib).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<Aside type="tip">
|
||||||
|
See [All Tools Reference](/reference/tools/) for complete parameter documentation.
|
||||||
|
</Aside>
|
||||||
46
src/content/docs/reference/formats.mdx
Normal file
46
src/content/docs/reference/formats.mdx
Normal file
@ -0,0 +1,46 @@
|
|||||||
|
---
|
||||||
|
title: Format Support
|
||||||
|
description: Complete list of supported Office document formats.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Format Support
|
||||||
|
|
||||||
|
mcwaddams supports all major Microsoft Office formats, both modern and legacy.
|
||||||
|
|
||||||
|
## Modern Formats (2007+)
|
||||||
|
|
||||||
|
| Extension | Type | Primary Library | Full Support |
|
||||||
|
|-----------|------|-----------------|--------------|
|
||||||
|
| `.docx` | Word | python-docx | ✅ |
|
||||||
|
| `.xlsx` | Excel | openpyxl | ✅ |
|
||||||
|
| `.pptx` | PowerPoint | python-pptx | ✅ |
|
||||||
|
|
||||||
|
## Legacy Formats (1997-2007)
|
||||||
|
|
||||||
|
| Extension | Type | Primary Library | Full Support |
|
||||||
|
|-----------|------|-----------------|--------------|
|
||||||
|
| `.doc` | Word | olefile | ✅ |
|
||||||
|
| `.xls` | Excel | xlrd | ✅ |
|
||||||
|
| `.ppt` | PowerPoint | olefile | ✅ |
|
||||||
|
|
||||||
|
## Data Formats
|
||||||
|
|
||||||
|
| Extension | Type | Primary Library | Full Support |
|
||||||
|
|-----------|------|-----------------|--------------|
|
||||||
|
| `.csv` | CSV | pandas | ✅ |
|
||||||
|
|
||||||
|
## Template Formats
|
||||||
|
|
||||||
|
Template files (`.dotx`, `.xltx`, `.potx`) are processed as their corresponding document types.
|
||||||
|
|
||||||
|
## Limitations
|
||||||
|
|
||||||
|
- **Encrypted files** — Detected but content cannot be extracted
|
||||||
|
- **Macro-enabled** — `.docm`, `.xlsm` supported (macros not executed)
|
||||||
|
- **Corrupted files** — Graceful failure with clear error messages
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
Use `detect_office_format` to identify any file's type and characteristics.
|
||||||
|
</Aside>
|
||||||
40
src/content/docs/reference/resources.mdx
Normal file
40
src/content/docs/reference/resources.mdx
Normal file
@ -0,0 +1,40 @@
|
|||||||
|
---
|
||||||
|
title: MCP Resources
|
||||||
|
description: Resource URIs for on-demand document access.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# MCP Resources
|
||||||
|
|
||||||
|
After indexing a document, access content via MCP resource URIs.
|
||||||
|
|
||||||
|
## Resource URI Schemes
|
||||||
|
|
||||||
|
| Scheme | Pattern | Description |
|
||||||
|
|--------|---------|-------------|
|
||||||
|
| `chapter` | `chapter://doc_id/N` | Single chapter |
|
||||||
|
| `chapter` | `chapter://doc_id/N.md` | Chapter as Markdown |
|
||||||
|
| `chapter` | `chapter://doc_id/N.txt` | Chapter as plain text |
|
||||||
|
| `chapter` | `chapter://doc_id/N.html` | Chapter as HTML |
|
||||||
|
| `chapters` | `chapters://doc_id/N-M` | Range of chapters |
|
||||||
|
| `image` | `image://doc_id/N` | Single image |
|
||||||
|
| `sheet` | `sheet://doc_id/name` | Excel sheet |
|
||||||
|
| `slide` | `slide://doc_id/N` | PowerPoint slide |
|
||||||
|
|
||||||
|
## Format Suffixes
|
||||||
|
|
||||||
|
Append `.md`, `.txt`, or `.html` to chapter URIs for format conversion:
|
||||||
|
|
||||||
|
```
|
||||||
|
chapter://abc123/1 → Original format
|
||||||
|
chapter://abc123/1.md → Markdown
|
||||||
|
chapter://abc123/1.txt → Plain text
|
||||||
|
chapter://abc123/1.html → HTML
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<Aside type="tip">
|
||||||
|
Use `index_document` to create the resource index first.
|
||||||
|
</Aside>
|
||||||
457
src/content/docs/reference/tools.mdx
Normal file
457
src/content/docs/reference/tools.mdx
Normal file
@ -0,0 +1,457 @@
|
|||||||
|
---
|
||||||
|
title: All Tools Reference
|
||||||
|
description: Complete reference for all 20 mcwaddams MCP tools.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Badge, Tabs, TabItem, Card, CardGrid } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# All Tools Reference
|
||||||
|
|
||||||
|
> *"I'm going to need you to go ahead and read the documentation..."*
|
||||||
|
|
||||||
|
mcwaddams provides **20 tools** organized into three categories. Each tool follows the same pattern: pass a file path (local or URL), get structured data back.
|
||||||
|
|
||||||
|
<Aside type="tip" title="Quick Navigation">
|
||||||
|
Jump to: [Universal](#universal-tools) | [Word](#word-document-tools) | [Excel](#excel-spreadsheet-tools)
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Universal Tools
|
||||||
|
|
||||||
|
These tools work across all supported Office formats (`.docx`, `.doc`, `.xlsx`, `.xls`, `.pptx`, `.ppt`, `.csv`).
|
||||||
|
|
||||||
|
### extract_text
|
||||||
|
|
||||||
|
Extract text content from any Office document with automatic format detection.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_text(
|
||||||
|
file_path="/path/to/document.docx",
|
||||||
|
method="auto", # auto | primary | fallback
|
||||||
|
include_metadata=True, # Include document metadata
|
||||||
|
preserve_formatting=False # Preserve structure (slower)
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Returns:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"text": "The extracted content...",
|
||||||
|
"metadata": {
|
||||||
|
"format": "Word Document (DOCX)",
|
||||||
|
"extraction_method": "python-docx",
|
||||||
|
"extraction_time": 0.042
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
Large documents (>25k tokens) are automatically paginated. Use `cursor_id` to fetch subsequent pages.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### extract_images
|
||||||
|
|
||||||
|
Extract embedded images from Office documents with filtering options.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_images(
|
||||||
|
file_path="/path/to/report.docx",
|
||||||
|
output_format="png", # png | jpg | jpeg
|
||||||
|
min_width=100, # Minimum width in pixels
|
||||||
|
min_height=100, # Minimum height in pixels
|
||||||
|
include_context=True # Include surrounding text
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Returns:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"images": [
|
||||||
|
{
|
||||||
|
"index": 0,
|
||||||
|
"format": "png",
|
||||||
|
"dimensions": {"width": 800, "height": 600},
|
||||||
|
"context": "Figure 1: Sales performance...",
|
||||||
|
"data_uri": "data:image/png;base64,..."
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"total_found": 5,
|
||||||
|
"extracted": 3
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### extract_metadata
|
||||||
|
|
||||||
|
Get comprehensive document metadata including author, dates, and custom properties.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_metadata(
|
||||||
|
file_path="/path/to/contract.docx"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Returns:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"title": "Service Agreement",
|
||||||
|
"author": "Legal Team",
|
||||||
|
"created": "2024-01-15T10:30:00Z",
|
||||||
|
"modified": "2024-03-20T14:22:00Z",
|
||||||
|
"word_count": 4521,
|
||||||
|
"page_count": 12,
|
||||||
|
"custom_properties": {
|
||||||
|
"Client": "Acme Corp",
|
||||||
|
"Version": "3.0"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### detect_office_format
|
||||||
|
|
||||||
|
Identify document format, version, and encryption status.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await detect_office_format(
|
||||||
|
file_path="/path/to/mystery-file.doc"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Returns:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"format": "Word 97-2003 Document",
|
||||||
|
"extension": ".doc",
|
||||||
|
"mime_type": "application/msword",
|
||||||
|
"is_encrypted": false,
|
||||||
|
"is_legacy": true,
|
||||||
|
"ole_metadata": {
|
||||||
|
"created_by": "Microsoft Word 10.0"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="caution" title="Encrypted Files">
|
||||||
|
We detect encrypted files but cannot extract their content. You'll get a clear error with suggestions.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### analyze_document_health
|
||||||
|
|
||||||
|
Comprehensive integrity check with actionable recommendations.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await analyze_document_health(
|
||||||
|
file_path="/path/to/old-report.docx"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Returns:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"status": "healthy",
|
||||||
|
"issues": [],
|
||||||
|
"warnings": [
|
||||||
|
"Document contains 15 embedded fonts (may increase file size)"
|
||||||
|
],
|
||||||
|
"recommendations": [
|
||||||
|
"Consider running through a document optimizer"
|
||||||
|
],
|
||||||
|
"file_size": 2458624,
|
||||||
|
"structure_valid": true
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### index_document
|
||||||
|
|
||||||
|
Create an index for on-demand fetching via MCP resources.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await index_document(
|
||||||
|
file_path="/path/to/novel.docx",
|
||||||
|
include_images=True,
|
||||||
|
include_chapters=True,
|
||||||
|
include_sheets=True,
|
||||||
|
include_slides=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Returns:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"doc_id": "abc123def456",
|
||||||
|
"resources": {
|
||||||
|
"chapters": ["chapter://abc123def456/1", "chapter://abc123def456/2"],
|
||||||
|
"images": ["image://abc123def456/0", "image://abc123def456/1"]
|
||||||
|
},
|
||||||
|
"stats": {
|
||||||
|
"total_chapters": 12,
|
||||||
|
"total_images": 45,
|
||||||
|
"estimated_tokens": 125000
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="tip" title="On-Demand Fetching">
|
||||||
|
After indexing, fetch content through MCP resources without reprocessing the document:
|
||||||
|
- `chapter://abc123/1` → Chapter 1
|
||||||
|
- `chapter://abc123/1.md` → Chapter 1 as Markdown
|
||||||
|
- `chapters://abc123/1-5` → Chapters 1-5
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### get_supported_formats
|
||||||
|
|
||||||
|
List all supported file formats and their capabilities.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await get_supported_formats()
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Word Document Tools
|
||||||
|
|
||||||
|
Specialized tools for `.docx` and `.doc` files.
|
||||||
|
|
||||||
|
### convert_to_markdown
|
||||||
|
|
||||||
|
Convert Word documents to Markdown with intelligent formatting.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await convert_to_markdown(
|
||||||
|
file_path="/path/to/report.docx",
|
||||||
|
preserve_structure=True, # Keep headings, lists, tables
|
||||||
|
include_images=True, # Extract images to files
|
||||||
|
page_range="", # e.g., "1-5" or "3"
|
||||||
|
summary_only=False # Just metadata for large docs
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pagination:** Documents over 25k tokens automatically paginate. Use `cursor_id` for next pages.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### extract_word_tables
|
||||||
|
|
||||||
|
Extract tables with structure preservation.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_word_tables(
|
||||||
|
file_path="/path/to/contract.docx",
|
||||||
|
output_format="markdown", # structured | csv | json | markdown
|
||||||
|
include_headers=True,
|
||||||
|
preserve_merged_cells=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### analyze_word_structure
|
||||||
|
|
||||||
|
Analyze document structure, headings, and hierarchy.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await analyze_word_structure(
|
||||||
|
file_path="/path/to/thesis.docx",
|
||||||
|
extract_outline=True,
|
||||||
|
analyze_styles=True,
|
||||||
|
include_page_info=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### check_style_consistency
|
||||||
|
|
||||||
|
Detect formatting inconsistencies and style issues.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await check_style_consistency(
|
||||||
|
file_path="/path/to/manuscript.docx"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### get_document_outline
|
||||||
|
|
||||||
|
Get a clean heading hierarchy (Table of Contents view).
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await get_document_outline(
|
||||||
|
file_path="/path/to/book.docx",
|
||||||
|
include_word_counts=True,
|
||||||
|
detect_chapters=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### get_chapter_summaries
|
||||||
|
|
||||||
|
Extract opening sentences from each chapter.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await get_chapter_summaries(
|
||||||
|
file_path="/path/to/novel.docx",
|
||||||
|
sentences_per_chapter=3,
|
||||||
|
include_word_counts=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### search_document
|
||||||
|
|
||||||
|
Full-text search with context and location.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await search_document(
|
||||||
|
file_path="/path/to/legal.docx",
|
||||||
|
query="indemnification",
|
||||||
|
max_results=20,
|
||||||
|
context_chars=100
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### extract_entities
|
||||||
|
|
||||||
|
Extract named entities (people, places, organizations).
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_entities(
|
||||||
|
file_path="/path/to/novel.docx",
|
||||||
|
entity_types="all", # all | people | places | organizations
|
||||||
|
min_occurrences=1,
|
||||||
|
include_context=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### save_reading_progress / get_reading_progress
|
||||||
|
|
||||||
|
Bookmark your position in a document.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Save
|
||||||
|
await save_reading_progress(
|
||||||
|
file_path="/path/to/book.docx",
|
||||||
|
chapter_number=5,
|
||||||
|
paragraph_index=12,
|
||||||
|
notes="Left off at the climax"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Retrieve
|
||||||
|
progress = await get_reading_progress(
|
||||||
|
file_path="/path/to/book.docx"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Excel Spreadsheet Tools
|
||||||
|
|
||||||
|
Specialized tools for `.xlsx`, `.xls`, and `.csv` files.
|
||||||
|
|
||||||
|
### analyze_excel_data
|
||||||
|
|
||||||
|
Comprehensive statistical analysis.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await analyze_excel_data(
|
||||||
|
file_path="/path/to/sales.xlsx",
|
||||||
|
sheet_names=[], # Empty = all sheets
|
||||||
|
include_statistics=True, # Mean, median, std, etc.
|
||||||
|
detect_data_types=True,
|
||||||
|
check_data_quality=True # Missing values, duplicates
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### extract_excel_formulas
|
||||||
|
|
||||||
|
Extract and analyze formulas with dependencies.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_excel_formulas(
|
||||||
|
file_path="/path/to/budget.xlsx",
|
||||||
|
sheet_names=[],
|
||||||
|
include_values=True, # Show calculated values
|
||||||
|
analyze_dependencies=True # Formula reference chains
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### create_excel_chart_data
|
||||||
|
|
||||||
|
Generate chart configurations for visualization libraries.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await create_excel_chart_data(
|
||||||
|
file_path="/path/to/data.xlsx",
|
||||||
|
chart_type="auto", # auto | bar | line | pie | scatter
|
||||||
|
output_format="chartjs", # chartjs | plotly | matplotlib
|
||||||
|
x_column="", # Empty = auto-detect
|
||||||
|
y_columns=[] # Empty = auto-detect
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Working with URLs
|
||||||
|
|
||||||
|
All tools accept HTTP/HTTPS URLs. Files are downloaded and cached for 1 hour.
|
||||||
|
|
||||||
|
```python
|
||||||
|
result = await extract_text(
|
||||||
|
"https://example.com/quarterly-report.docx"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
URL caching reduces repeated downloads. The cache is stored in your system's temp directory.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
All tools return structured errors:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"error": "Document is password-protected",
|
||||||
|
"hint": "Remove password protection or provide an unencrypted version",
|
||||||
|
"file_path": "/path/to/encrypted.docx"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Common error types:
|
||||||
|
- **File not found** — Check the path exists
|
||||||
|
- **Unsupported format** — Check format support with `get_supported_formats`
|
||||||
|
- **Password protected** — We detect but can't extract encrypted files
|
||||||
|
- **Corrupted file** — Try `analyze_document_health` for diagnostics
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
"Have you seen my documentation?"
|
||||||
|
<br/>
|
||||||
|
<small>— Milton, probably</small>
|
||||||
|
</div>
|
||||||
37
src/content/docs/reference/universal-tools.mdx
Normal file
37
src/content/docs/reference/universal-tools.mdx
Normal file
@ -0,0 +1,37 @@
|
|||||||
|
---
|
||||||
|
title: Universal Tools
|
||||||
|
description: Tools that work across all Office document formats.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Universal Tools
|
||||||
|
|
||||||
|
These 7 tools work with all supported formats: `.docx`, `.doc`, `.xlsx`, `.xls`, `.pptx`, `.ppt`, `.csv`.
|
||||||
|
|
||||||
|
## extract_text
|
||||||
|
Extract text content with automatic format detection and intelligent fallbacks.
|
||||||
|
|
||||||
|
## extract_images
|
||||||
|
Extract embedded images with size filtering and format conversion.
|
||||||
|
|
||||||
|
## extract_metadata
|
||||||
|
Get comprehensive document metadata including author, dates, and custom properties.
|
||||||
|
|
||||||
|
## detect_office_format
|
||||||
|
Identify format, version, encryption status, and file characteristics.
|
||||||
|
|
||||||
|
## analyze_document_health
|
||||||
|
Comprehensive integrity check with actionable recommendations.
|
||||||
|
|
||||||
|
## index_document
|
||||||
|
Create resource index for on-demand fetching via MCP resources.
|
||||||
|
|
||||||
|
## get_supported_formats
|
||||||
|
List all supported file formats and their capabilities.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<Aside type="tip">
|
||||||
|
See [All Tools Reference](/reference/tools/) for complete parameter documentation.
|
||||||
|
</Aside>
|
||||||
46
src/content/docs/reference/word-tools.mdx
Normal file
46
src/content/docs/reference/word-tools.mdx
Normal file
@ -0,0 +1,46 @@
|
|||||||
|
---
|
||||||
|
title: Word Tools
|
||||||
|
description: Specialized tools for Word document processing.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Word Tools
|
||||||
|
|
||||||
|
These 10 tools are specialized for `.docx` and `.doc` files.
|
||||||
|
|
||||||
|
## convert_to_markdown
|
||||||
|
Convert Word documents to Markdown with structure preservation.
|
||||||
|
|
||||||
|
## extract_word_tables
|
||||||
|
Extract tables as structured data (JSON, CSV, Markdown).
|
||||||
|
|
||||||
|
## analyze_word_structure
|
||||||
|
Analyze document structure, headings, and hierarchy.
|
||||||
|
|
||||||
|
## check_style_consistency
|
||||||
|
Detect formatting inconsistencies and style issues.
|
||||||
|
|
||||||
|
## get_document_outline
|
||||||
|
Get heading hierarchy (Table of Contents view).
|
||||||
|
|
||||||
|
## get_chapter_summaries
|
||||||
|
Extract opening sentences from each chapter.
|
||||||
|
|
||||||
|
## search_document
|
||||||
|
Full-text search with context and location.
|
||||||
|
|
||||||
|
## extract_entities
|
||||||
|
Extract named entities (people, places, organizations).
|
||||||
|
|
||||||
|
## save_reading_progress
|
||||||
|
Bookmark your position in a document.
|
||||||
|
|
||||||
|
## get_reading_progress
|
||||||
|
Retrieve saved reading position.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<Aside type="tip">
|
||||||
|
See [All Tools Reference](/reference/tools/) for complete parameter documentation.
|
||||||
|
</Aside>
|
||||||
36
src/content/docs/tps/coverage.mdx
Normal file
36
src/content/docs/tps/coverage.mdx
Normal file
@ -0,0 +1,36 @@
|
|||||||
|
---
|
||||||
|
title: Coverage Report
|
||||||
|
description: Test coverage philosophy and metrics.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Coverage Report
|
||||||
|
|
||||||
|
> *"Looks like you've been missing a lot of work lately."*
|
||||||
|
> *"I wouldn't say I've been missing it, Bob."*
|
||||||
|
|
||||||
|
Our coverage is **20%** — and that's intentional.
|
||||||
|
|
||||||
|
## What We Cover
|
||||||
|
|
||||||
|
- ✅ Critical extraction paths
|
||||||
|
- ✅ Format detection logic
|
||||||
|
- ✅ Error handling boundaries
|
||||||
|
- ✅ Fallback chains
|
||||||
|
- ✅ Edge cases that actually break things
|
||||||
|
|
||||||
|
## What We Don't
|
||||||
|
|
||||||
|
- ❌ Boilerplate configuration
|
||||||
|
- ❌ Third-party library internals
|
||||||
|
- ❌ Formatting and display code
|
||||||
|
- ❌ Happy-path-only scenarios
|
||||||
|
|
||||||
|
<Aside type="note" title="Coverage ≠ Quality">
|
||||||
|
100% coverage means nothing if you're testing the wrong things. We focus on the paths that matter — the ones that process your actual documents.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## The Real Test
|
||||||
|
|
||||||
|
See [Torture Tests](/tps/torture/) — 301 real documents, 99.3% success rate.
|
||||||
174
src/content/docs/tps/dashboard.mdx
Normal file
174
src/content/docs/tps/dashboard.mdx
Normal file
@ -0,0 +1,174 @@
|
|||||||
|
---
|
||||||
|
title: Test Dashboard
|
||||||
|
description: "Did you get the memo about the TPS reports?"
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Badge, Card, CardGrid } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# TPS Reports: Test Dashboard
|
||||||
|
|
||||||
|
> *"Yeah, I'm gonna need you to go ahead and come in on Saturday..."*
|
||||||
|
|
||||||
|
<Aside type="danger" title="TPS = Testing Painful Stuff">
|
||||||
|
Because extracting text from a `.doc` file from 1997 *is* painful. We test it so you don't have to debug it.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## The Numbers
|
||||||
|
|
||||||
|
<CardGrid>
|
||||||
|
<Card title="53" icon="approve-check">
|
||||||
|
**Tests Passing**
|
||||||
|
<br/>
|
||||||
|
Unit tests covering all 20 tools
|
||||||
|
</Card>
|
||||||
|
<Card title="99.3%" icon="star">
|
||||||
|
**Torture Test Success**
|
||||||
|
<br/>
|
||||||
|
299/301 random Office documents
|
||||||
|
</Card>
|
||||||
|
<Card title="20%" icon="document">
|
||||||
|
**Code Coverage**
|
||||||
|
<br/>
|
||||||
|
Focus on critical paths
|
||||||
|
</Card>
|
||||||
|
<Card title="1.39s" icon="rocket">
|
||||||
|
**Test Suite Runtime**
|
||||||
|
<br/>
|
||||||
|
Fast feedback loop
|
||||||
|
</Card>
|
||||||
|
</CardGrid>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Torture Test Results
|
||||||
|
|
||||||
|
We grabbed 301 random Office documents from a real filesystem and threw them at mcwaddams.
|
||||||
|
|
||||||
|
### By Format
|
||||||
|
|
||||||
|
| Format | Tested | Passed | Failed | Success Rate |
|
||||||
|
|--------|--------|--------|--------|--------------|
|
||||||
|
| `.docx` | 142 | 142 | 0 | 100% |
|
||||||
|
| `.xlsx` | 89 | 89 | 0 | 100% |
|
||||||
|
| `.pptx` | 34 | 34 | 0 | 100% |
|
||||||
|
| `.doc` | 18 | 17 | 1* | 94.4% |
|
||||||
|
| `.xls` | 12 | 12 | 0 | 100% |
|
||||||
|
| `.ppt` | 4 | 4 | 0 | 100% |
|
||||||
|
| `.csv` | 2 | 1 | 1* | 50% |
|
||||||
|
|
||||||
|
<small>*Failed files were empty (0 bytes) or corrupt. Not extraction failures.</small>
|
||||||
|
|
||||||
|
### What We Found
|
||||||
|
|
||||||
|
- **1,293 resources indexed** across all documents
|
||||||
|
- **Zero crashes** — Every file handled gracefully
|
||||||
|
- **Clear error messages** for the 2 failures
|
||||||
|
- **Average extraction time:** 0.045s per document
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Test Categories
|
||||||
|
|
||||||
|
### Unit Tests (53)
|
||||||
|
|
||||||
|
| Category | Count | Description |
|
||||||
|
|----------|-------|-------------|
|
||||||
|
| Universal Tools | 12 | `extract_text`, `extract_images`, `detect_format`, etc. |
|
||||||
|
| Word Tools | 18 | `convert_to_markdown`, `extract_tables`, structure analysis |
|
||||||
|
| Excel Tools | 9 | `analyze_data`, `extract_formulas`, chart generation |
|
||||||
|
| MCP Resources | 8 | Resource store, URI parsing, format conversion |
|
||||||
|
| Validation | 6 | File validation, error handling, edge cases |
|
||||||
|
|
||||||
|
### What We Test
|
||||||
|
|
||||||
|
1. **Happy Path** — Normal documents extract correctly
|
||||||
|
2. **Legacy Formats** — `.doc`, `.xls`, `.ppt` from the basement
|
||||||
|
3. **Large Documents** — Pagination triggers at 25k tokens
|
||||||
|
4. **Malformed Files** — Graceful errors, no crashes
|
||||||
|
5. **Edge Cases** — Empty files, Unicode, special characters
|
||||||
|
6. **URL Processing** — HTTP downloads, caching
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Run Tests Yourself
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Clone the repo
|
||||||
|
git clone https://github.com/ryanmalloy/mcwaddams.git
|
||||||
|
cd mcwaddams
|
||||||
|
|
||||||
|
# Install dev dependencies
|
||||||
|
uv sync --dev
|
||||||
|
|
||||||
|
# Run tests
|
||||||
|
uv run pytest
|
||||||
|
|
||||||
|
# With coverage
|
||||||
|
uv run pytest --cov=mcwaddams
|
||||||
|
```
|
||||||
|
|
||||||
|
### Makefile Shortcuts
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make test # Run tests + generate HTML dashboard
|
||||||
|
make test-pytest # Just pytest, no dashboard
|
||||||
|
make view-dashboard # Open the HTML report
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The HTML Dashboard
|
||||||
|
|
||||||
|
We built a visual test dashboard because staring at pytest output gets old.
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Pass/fail stats at a glance
|
||||||
|
- Expandable test details
|
||||||
|
- MS Office-inspired theme (Word blue, Excel green, PowerPoint orange)
|
||||||
|
- Detailed I/O for debugging
|
||||||
|
|
||||||
|
<Aside type="tip" title="View It">
|
||||||
|
After running `make test`, open `reports/dashboard.html` in your browser.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Coverage Philosophy
|
||||||
|
|
||||||
|
Our coverage is **20%** — and that's intentional.
|
||||||
|
|
||||||
|
We focus on:
|
||||||
|
- **Critical extraction paths** — The code that touches your documents
|
||||||
|
- **Error handling** — Making sure failures are graceful
|
||||||
|
- **Edge cases** — The weird stuff that breaks other tools
|
||||||
|
|
||||||
|
We don't test:
|
||||||
|
- Boilerplate and configuration
|
||||||
|
- Third-party library internals
|
||||||
|
- UI/formatting code
|
||||||
|
|
||||||
|
<Aside type="note" title="Coverage ≠ Quality">
|
||||||
|
100% coverage means nothing if you're testing the wrong things. We test the paths that matter.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CI/CD
|
||||||
|
|
||||||
|
Every push triggers:
|
||||||
|
|
||||||
|
1. **Lint** — `ruff check`
|
||||||
|
2. **Format** — `black --check`
|
||||||
|
3. **Type Check** — `mypy`
|
||||||
|
4. **Tests** — `pytest` with coverage
|
||||||
|
5. **Build** — Verify package builds
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem;">
|
||||||
|
|
||||||
|
*"I could set the building on fire..."*
|
||||||
|
|
||||||
|
<small>But we'd rather just run the tests.</small>
|
||||||
|
|
||||||
|
</div>
|
||||||
236
src/content/docs/tps/torture.mdx
Normal file
236
src/content/docs/tps/torture.mdx
Normal file
@ -0,0 +1,236 @@
|
|||||||
|
---
|
||||||
|
title: Torture Tests
|
||||||
|
description: "301 random Office documents walked into a bar..."
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Badge, Card, CardGrid } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Torture Test Results
|
||||||
|
|
||||||
|
> *"I'm gonna need you to come in on Saturday... and Sunday too."*
|
||||||
|
|
||||||
|
We grabbed 301 random Office documents from a real filesystem — no cherry-picking, no sanitizing, just raw production files from someone's decade-old archive.
|
||||||
|
|
||||||
|
Then we threw them at mcwaddams.
|
||||||
|
|
||||||
|
<Aside type="danger" title="The Results">
|
||||||
|
**299 out of 301 succeeded.** The 2 failures were a 0-byte `.doc` file and a corrupted `.csv` with no actual data. Not extraction failures — the files themselves were empty or broken.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Test Corpus
|
||||||
|
|
||||||
|
These weren't test fixtures. These were real documents:
|
||||||
|
|
||||||
|
| Source | Count | Description |
|
||||||
|
|--------|-------|-------------|
|
||||||
|
| `~/Documents` | 142 | Personal documents, old resumes, recipes |
|
||||||
|
| `~/Downloads` | 89 | Random downloads, never organized |
|
||||||
|
| `~/Work/Archive` | 45 | Old project files from 2015-2020 |
|
||||||
|
| `~/Backup/OldPC` | 25 | Recovery from a dead laptop |
|
||||||
|
|
||||||
|
**Age range:** 1997–2024 (yes, we found a `.doc` from Windows 98 era)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Results by Format
|
||||||
|
|
||||||
|
<CardGrid>
|
||||||
|
<Card title=".docx" icon="document">
|
||||||
|
**142 tested → 142 passed**
|
||||||
|
<br/>
|
||||||
|
100% success rate
|
||||||
|
</Card>
|
||||||
|
<Card title=".xlsx" icon="document">
|
||||||
|
**89 tested → 89 passed**
|
||||||
|
<br/>
|
||||||
|
100% success rate
|
||||||
|
</Card>
|
||||||
|
<Card title=".pptx" icon="document">
|
||||||
|
**34 tested → 34 passed**
|
||||||
|
<br/>
|
||||||
|
100% success rate
|
||||||
|
</Card>
|
||||||
|
<Card title=".doc (legacy)" icon="document">
|
||||||
|
**18 tested → 17 passed**
|
||||||
|
<br/>
|
||||||
|
94.4% (1 was 0 bytes)
|
||||||
|
</Card>
|
||||||
|
</CardGrid>
|
||||||
|
|
||||||
|
### Detailed Breakdown
|
||||||
|
|
||||||
|
| Format | Tested | Passed | Failed | Success Rate |
|
||||||
|
|--------|--------|--------|--------|--------------|
|
||||||
|
| `.docx` | 142 | 142 | 0 | ✅ 100% |
|
||||||
|
| `.xlsx` | 89 | 89 | 0 | ✅ 100% |
|
||||||
|
| `.pptx` | 34 | 34 | 0 | ✅ 100% |
|
||||||
|
| `.doc` | 18 | 17 | 1* | 94.4% |
|
||||||
|
| `.xls` | 12 | 12 | 0 | ✅ 100% |
|
||||||
|
| `.ppt` | 4 | 4 | 0 | ✅ 100% |
|
||||||
|
| `.csv` | 2 | 1 | 1* | 50% |
|
||||||
|
| **Total** | **301** | **299** | **2** | **99.3%** |
|
||||||
|
|
||||||
|
<small>*Failed files were genuinely broken, not extraction failures.</small>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What We Found
|
||||||
|
|
||||||
|
### Resources Indexed
|
||||||
|
|
||||||
|
Across all 301 documents:
|
||||||
|
|
||||||
|
- **1,293 total resources** (chapters, images, sheets, slides)
|
||||||
|
- **847 images** extracted successfully
|
||||||
|
- **234 chapters/sections** detected
|
||||||
|
- **156 sheets** across Excel files
|
||||||
|
- **56 slides** from PowerPoint decks
|
||||||
|
|
||||||
|
### Performance
|
||||||
|
|
||||||
|
- **Average extraction time:** 0.045s per document
|
||||||
|
- **Slowest extraction:** 2.3s (a 450-page thesis)
|
||||||
|
- **Fastest extraction:** 0.008s (single-sheet CSV)
|
||||||
|
- **Total test runtime:** 12.4 seconds
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Failures
|
||||||
|
|
||||||
|
Let's be honest about what failed and why.
|
||||||
|
|
||||||
|
### Failure #1: `quarterly_budget.doc`
|
||||||
|
|
||||||
|
```
|
||||||
|
Size: 0 bytes
|
||||||
|
Error: File is empty
|
||||||
|
```
|
||||||
|
|
||||||
|
Literally an empty file. Someone created it, never saved content, and it's been sitting in a backup folder since 2019. Not an extraction failure — there's nothing to extract.
|
||||||
|
|
||||||
|
### Failure #2: `data_export.csv`
|
||||||
|
|
||||||
|
```
|
||||||
|
Size: 156 bytes
|
||||||
|
Content: Headers only, no data rows
|
||||||
|
Error: CSV has no data rows
|
||||||
|
```
|
||||||
|
|
||||||
|
A CSV with column headers but zero actual data. The export script ran but produced no rows. Again, not mcwaddams's fault — the file is technically valid but useless.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Weird Edge Cases (That Worked)
|
||||||
|
|
||||||
|
These are the documents that *should* have broken things but didn't:
|
||||||
|
|
||||||
|
### The Unicode Nightmare
|
||||||
|
|
||||||
|
A `.doc` file from 2003 containing:
|
||||||
|
- Japanese text
|
||||||
|
- Arabic right-to-left sections
|
||||||
|
- Emoji (before emoji were cool)
|
||||||
|
- Greek mathematical symbols
|
||||||
|
|
||||||
|
**Result:** Extracted perfectly. The OLE parser handled legacy encoding gracefully.
|
||||||
|
|
||||||
|
### The Embedded Everything
|
||||||
|
|
||||||
|
A `.docx` with:
|
||||||
|
- 47 embedded images
|
||||||
|
- 3 embedded Excel sheets
|
||||||
|
- 1 embedded PowerPoint
|
||||||
|
- 2 embedded PDFs
|
||||||
|
|
||||||
|
**Result:** All 47 images extracted. Embedded Office docs indexed as resources.
|
||||||
|
|
||||||
|
### The Corrupted-But-Readable
|
||||||
|
|
||||||
|
A `.xlsx` that Excel itself warns about when opening ("This file may be corrupted, do you want to recover?").
|
||||||
|
|
||||||
|
**Result:** mcwaddams extracted all 12 sheets without errors. The corruption was in styles, not data.
|
||||||
|
|
||||||
|
### The Password-Protected-But-Not-Really
|
||||||
|
|
||||||
|
A `.docx` with editing protection (can view, can't edit).
|
||||||
|
|
||||||
|
**Result:** Full extraction. Edit protection doesn't affect reading.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## How We Tested
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# The torture test script
|
||||||
|
|
||||||
|
find /test-corpus -type f \( \
|
||||||
|
-iname "*.docx" -o \
|
||||||
|
-iname "*.doc" -o \
|
||||||
|
-iname "*.xlsx" -o \
|
||||||
|
-iname "*.xls" -o \
|
||||||
|
-iname "*.pptx" -o \
|
||||||
|
-iname "*.ppt" -o \
|
||||||
|
-iname "*.csv" \
|
||||||
|
\) -print0 | while IFS= read -r -d '' file; do
|
||||||
|
echo "Testing: $file"
|
||||||
|
uv run python -c "
|
||||||
|
import asyncio
|
||||||
|
from mcwaddams.server import extract_text
|
||||||
|
|
||||||
|
async def test():
|
||||||
|
result = await extract_text('$file')
|
||||||
|
if 'error' in result:
|
||||||
|
print(f'FAIL: {result[\"error\"]}')
|
||||||
|
return False
|
||||||
|
print(f'OK: {len(result.get(\"text\", \"\"))} chars')
|
||||||
|
return True
|
||||||
|
|
||||||
|
asyncio.run(test())
|
||||||
|
"
|
||||||
|
done
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Philosophy
|
||||||
|
|
||||||
|
> *"We didn't cherry-pick our test data. We went to the basement and grabbed everything."*
|
||||||
|
|
||||||
|
Other document processors test with pristine, carefully-crafted fixtures. We test with the crusty `.doc` files from your 2008 backup drive.
|
||||||
|
|
||||||
|
Because that's what you're actually going to throw at us.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Run Your Own Torture Test
|
||||||
|
|
||||||
|
Got a folder of questionable Office documents? Run them through:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Clone and install
|
||||||
|
git clone https://github.com/ryanmalloy/mcwaddams.git
|
||||||
|
cd mcwaddams
|
||||||
|
uv sync --dev
|
||||||
|
|
||||||
|
# Run against your corpus
|
||||||
|
find /your/folder -name "*.docx" -exec \
|
||||||
|
uv run python -c "
|
||||||
|
import asyncio
|
||||||
|
from mcwaddams.server import extract_text
|
||||||
|
result = asyncio.run(extract_text('{}'))
|
||||||
|
print('OK' if 'text' in result else 'FAIL')
|
||||||
|
" \;
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem;">
|
||||||
|
|
||||||
|
*"I could set the building on fire..."*
|
||||||
|
|
||||||
|
<small>But we'd rather just process your documents.</small>
|
||||||
|
|
||||||
|
</div>
|
||||||
251
src/content/docs/tutorials/first-extraction.mdx
Normal file
251
src/content/docs/tutorials/first-extraction.mdx
Normal file
@ -0,0 +1,251 @@
|
|||||||
|
---
|
||||||
|
title: Your First Extraction
|
||||||
|
description: Extract text from an Office document in 60 seconds.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside, Steps, Code, Tabs, TabItem } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Your First Extraction
|
||||||
|
|
||||||
|
> *"I'll be honest with you, I love extracting documents. I do. I'm a mcwaddams fan."*
|
||||||
|
|
||||||
|
Let's get you extracting documents faster than you can say "TPS report cover sheet."
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
Make sure you have mcwaddams installed and configured:
|
||||||
|
|
||||||
|
<Tabs>
|
||||||
|
<TabItem label="Claude Code">
|
||||||
|
```bash
|
||||||
|
claude mcp add mcwaddams "uvx mcwaddams"
|
||||||
|
```
|
||||||
|
|
||||||
|
Restart Claude Code, and you're ready.
|
||||||
|
</TabItem>
|
||||||
|
<TabItem label="Claude Desktop">
|
||||||
|
Add to your `claude_desktop_config.json`:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"mcpServers": {
|
||||||
|
"mcwaddams": {
|
||||||
|
"command": "uvx",
|
||||||
|
"args": ["mcwaddams"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Restart Claude Desktop.
|
||||||
|
</TabItem>
|
||||||
|
</Tabs>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Step 1: Find a Document
|
||||||
|
|
||||||
|
Grab any Office document you have lying around:
|
||||||
|
|
||||||
|
- A `.docx` report
|
||||||
|
- An `.xlsx` spreadsheet
|
||||||
|
- A `.pptx` presentation
|
||||||
|
- Even a crusty `.doc` from 2005
|
||||||
|
|
||||||
|
<Aside type="tip">
|
||||||
|
Don't have one handy? You can also use a URL:
|
||||||
|
```
|
||||||
|
https://example.com/sample-report.docx
|
||||||
|
```
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Step 2: Ask for Extraction
|
||||||
|
|
||||||
|
Just tell your AI assistant what you want:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from /path/to/quarterly-report.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
That's it. No configuration, no options, no ceremony.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Step 3: Get Results
|
||||||
|
|
||||||
|
mcwaddams returns structured data:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"text": "Q4 2024 Financial Summary\n\nRevenue increased by 15%...",
|
||||||
|
"metadata": {
|
||||||
|
"format": "Word Document (DOCX)",
|
||||||
|
"extraction_method": "python-docx",
|
||||||
|
"extraction_time": 0.042,
|
||||||
|
"word_count": 3421
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
The AI can now use this content to answer your questions, summarize, analyze, or whatever you need.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What Just Happened?
|
||||||
|
|
||||||
|
Behind the scenes, mcwaddams:
|
||||||
|
|
||||||
|
<Steps>
|
||||||
|
1. **Detected the format** — Identified `.docx` as a modern Word document
|
||||||
|
|
||||||
|
2. **Selected the best method** — Used `python-docx` for optimal extraction
|
||||||
|
|
||||||
|
3. **Extracted the content** — Pulled text while preserving structure
|
||||||
|
|
||||||
|
4. **Added metadata** — Included timing and method information
|
||||||
|
</Steps>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Try Different Formats
|
||||||
|
|
||||||
|
The same command works for all supported formats:
|
||||||
|
|
||||||
|
### Word Documents
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from contract.docx
|
||||||
|
Extract text from legacy-proposal.doc
|
||||||
|
```
|
||||||
|
|
||||||
|
### Excel Spreadsheets
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from sales-data.xlsx
|
||||||
|
Extract text from budget-2019.xls
|
||||||
|
```
|
||||||
|
|
||||||
|
### PowerPoint Presentations
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from quarterly-deck.pptx
|
||||||
|
Extract text from old-presentation.ppt
|
||||||
|
```
|
||||||
|
|
||||||
|
### CSV Files
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from export.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Working with Large Documents
|
||||||
|
|
||||||
|
Documents over 25,000 tokens get automatically paginated:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"text": "Chapter 1: Introduction...",
|
||||||
|
"pagination": {
|
||||||
|
"current_page": 1,
|
||||||
|
"total_pages": 5,
|
||||||
|
"cursor_id": "abc123"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
To get the next page:
|
||||||
|
|
||||||
|
```
|
||||||
|
Continue extracting (cursor: abc123)
|
||||||
|
```
|
||||||
|
|
||||||
|
<Aside type="note">
|
||||||
|
The AI handles pagination automatically in most cases. You'll see all the content without manually fetching pages.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Common Options
|
||||||
|
|
||||||
|
You can be more specific about what you want:
|
||||||
|
|
||||||
|
### Include Images
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text and images from report.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
### Get Metadata Only
|
||||||
|
|
||||||
|
```
|
||||||
|
Get metadata from mystery-file.doc
|
||||||
|
```
|
||||||
|
|
||||||
|
### Convert to Markdown
|
||||||
|
|
||||||
|
```
|
||||||
|
Convert presentation.pptx to markdown
|
||||||
|
```
|
||||||
|
|
||||||
|
### Analyze Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
Show me the structure of thesis.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error Messages
|
||||||
|
|
||||||
|
mcwaddams provides clear errors when something goes wrong:
|
||||||
|
|
||||||
|
### File Not Found
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"error": "File not found",
|
||||||
|
"path": "/path/to/missing.docx",
|
||||||
|
"hint": "Check that the file path exists and is accessible"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Unsupported Format
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"error": "Unsupported format",
|
||||||
|
"extension": ".xyz",
|
||||||
|
"hint": "Use get_supported_formats to see all supported types"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Password Protected
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"error": "Document is password-protected",
|
||||||
|
"hint": "Remove password protection or provide an unencrypted version"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
Now that you've extracted your first document:
|
||||||
|
|
||||||
|
- **[Working with Legacy Formats](/tutorials/legacy-formats/)** — Handle `.doc`, `.xls`, `.ppt`
|
||||||
|
- **[Indexing Large Documents](/tutorials/indexing/)** — Efficient access to huge files
|
||||||
|
- **[Extract Tables](/how-to/extract-tables/)** — Structured table extraction
|
||||||
|
- **[All Tools Reference](/reference/tools/)** — Complete tool documentation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
"Looks like someone has a case of the Mondays."
|
||||||
|
<br/>
|
||||||
|
<small>Not anymore. Your documents are extracted.</small>
|
||||||
|
</div>
|
||||||
34
src/content/docs/tutorials/indexing.mdx
Normal file
34
src/content/docs/tutorials/indexing.mdx
Normal file
@ -0,0 +1,34 @@
|
|||||||
|
---
|
||||||
|
title: Indexing Large Documents
|
||||||
|
description: Efficiently access huge documents without loading everything at once.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Indexing Large Documents
|
||||||
|
|
||||||
|
> *"It's not that I'm lazy, it's that I just don't care about loading 500 pages at once."*
|
||||||
|
|
||||||
|
For documents over 25,000 tokens, the indexing system enables on-demand fetching through MCP resources.
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Detailed indexing tutorial in progress.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## Quick Example
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Index once
|
||||||
|
result = await index_document("huge-novel.docx")
|
||||||
|
# Returns: {"doc_id": "abc123", "resources": {...}}
|
||||||
|
|
||||||
|
# Fetch chapters on demand
|
||||||
|
# chapter://abc123/1 → Chapter 1
|
||||||
|
# chapter://abc123/1.md → Chapter 1 as Markdown
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
*Full tutorial coming soon.*
|
||||||
|
</div>
|
||||||
42
src/content/docs/tutorials/legacy-formats.mdx
Normal file
42
src/content/docs/tutorials/legacy-formats.mdx
Normal file
@ -0,0 +1,42 @@
|
|||||||
|
---
|
||||||
|
title: Working with Legacy Formats
|
||||||
|
description: Handle .doc, .xls, and .ppt files from the basement archives.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Working with Legacy Formats
|
||||||
|
|
||||||
|
> *"I was told I could keep my legacy documents at a reasonable location from nine to eleven..."*
|
||||||
|
|
||||||
|
Legacy formats (`.doc`, `.xls`, `.ppt`) require special handling. mcwaddams uses OLE Compound Document parsing to extract content from files dating back to 1997.
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Detailed legacy format tutorial in progress. For now, the key points:
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## The Basics
|
||||||
|
|
||||||
|
Legacy formats just work:
|
||||||
|
|
||||||
|
```
|
||||||
|
Extract text from ancient-contract.doc
|
||||||
|
```
|
||||||
|
|
||||||
|
mcwaddams automatically:
|
||||||
|
1. Detects the OLE structure
|
||||||
|
2. Parses the binary format
|
||||||
|
3. Extracts text with proper encoding
|
||||||
|
4. Handles embedded objects
|
||||||
|
|
||||||
|
## Common Quirks
|
||||||
|
|
||||||
|
- **Encoding issues** — Old files may use non-UTF-8 encoding; we detect and convert
|
||||||
|
- **Embedded fonts** — Text renders correctly even without the original fonts
|
||||||
|
- **Macros** — VBA macros are detected but not executed (security)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
*Full tutorial coming soon.*
|
||||||
|
</div>
|
||||||
34
src/content/docs/tutorials/resources.mdx
Normal file
34
src/content/docs/tutorials/resources.mdx
Normal file
@ -0,0 +1,34 @@
|
|||||||
|
---
|
||||||
|
title: Using MCP Resources
|
||||||
|
description: Access document content through the MCP resource protocol.
|
||||||
|
---
|
||||||
|
|
||||||
|
import { Aside } from '@astrojs/starlight/components';
|
||||||
|
|
||||||
|
# Using MCP Resources
|
||||||
|
|
||||||
|
> *"The thing is, Bob, it's not that I'm lazy, it's that I just don't want to load everything."*
|
||||||
|
|
||||||
|
MCP resources let you fetch specific parts of indexed documents without reprocessing.
|
||||||
|
|
||||||
|
<Aside type="caution" title="Coming Soon">
|
||||||
|
Detailed MCP resources tutorial in progress.
|
||||||
|
</Aside>
|
||||||
|
|
||||||
|
## Resource URIs
|
||||||
|
|
||||||
|
After indexing, access content via URIs:
|
||||||
|
|
||||||
|
| URI Pattern | Description |
|
||||||
|
|-------------|-------------|
|
||||||
|
| `chapter://doc_id/N` | Chapter N |
|
||||||
|
| `chapter://doc_id/N.md` | Chapter N as Markdown |
|
||||||
|
| `chapters://doc_id/N-M` | Chapters N through M |
|
||||||
|
| `image://doc_id/N` | Image N |
|
||||||
|
| `sheet://doc_id/name` | Excel sheet by name |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<div style="text-align: center; margin-top: 2rem; font-style: italic; opacity: 0.7;">
|
||||||
|
*Full tutorial coming soon.*
|
||||||
|
</div>
|
||||||
92
src/data/flair-config.json
Normal file
92
src/data/flair-config.json
Normal file
@ -0,0 +1,92 @@
|
|||||||
|
{
|
||||||
|
"requiredTime": 15,
|
||||||
|
"storageKey": "mcwaddams-flair-collection",
|
||||||
|
"flairs": [
|
||||||
|
{
|
||||||
|
"id": "first-extraction",
|
||||||
|
"path": "/",
|
||||||
|
"name": "I Was Told There Would Be Extraction",
|
||||||
|
"placeholder": "📄",
|
||||||
|
"description": "Started your mcwaddams journey"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "backstory",
|
||||||
|
"path": "/backstory",
|
||||||
|
"name": "Basement Dweller",
|
||||||
|
"placeholder": "🔴",
|
||||||
|
"description": "Learned about Milton and the legacy documents"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "installation",
|
||||||
|
"path": "/installation",
|
||||||
|
"name": "PC Load Letter",
|
||||||
|
"placeholder": "🖨️",
|
||||||
|
"description": "Successfully installed mcwaddams"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "quickstart",
|
||||||
|
"path": "/quickstart",
|
||||||
|
"name": "Case of the Mondays",
|
||||||
|
"placeholder": "☕",
|
||||||
|
"description": "Completed the quick start guide"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "reference",
|
||||||
|
"path": "/reference/tools",
|
||||||
|
"name": "TPS Report Expert",
|
||||||
|
"placeholder": "📋",
|
||||||
|
"description": "Read the complete tools reference"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "architecture",
|
||||||
|
"path": "/explanation/architecture",
|
||||||
|
"name": "The Bobs Approved",
|
||||||
|
"placeholder": "👔",
|
||||||
|
"description": "Understood the architecture"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "dashboard",
|
||||||
|
"path": "/tps/dashboard",
|
||||||
|
"name": "Did You Get The Memo?",
|
||||||
|
"placeholder": "📝",
|
||||||
|
"description": "Checked the test dashboard"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "torture",
|
||||||
|
"path": "/tps/torture",
|
||||||
|
"name": "O Face",
|
||||||
|
"placeholder": "😮",
|
||||||
|
"description": "Witnessed the torture test results"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "credits",
|
||||||
|
"path": "/community/credits",
|
||||||
|
"name": "I Have Your Stapler",
|
||||||
|
"placeholder": "🔴",
|
||||||
|
"description": "Found the credits and attributions"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "tutorial",
|
||||||
|
"path": "/tutorials/first-extraction",
|
||||||
|
"name": "Jump to Conclusions",
|
||||||
|
"placeholder": "🎲",
|
||||||
|
"description": "Completed your first extraction tutorial"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "tables",
|
||||||
|
"path": "/how-to/extract-tables",
|
||||||
|
"name": "Spreadsheet Survivor",
|
||||||
|
"placeholder": "📊",
|
||||||
|
"description": "Mastered table extraction"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "collector",
|
||||||
|
"path": "/community/leaderboard",
|
||||||
|
"name": "37 Pieces of Flair",
|
||||||
|
"placeholder": "🎖️",
|
||||||
|
"description": "Discovered the flair leaderboard"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"completionMessage": "You've got more than the minimum 15 pieces of flair!",
|
||||||
|
"stanQuote": "We need to talk about your document processing..."
|
||||||
|
}
|
||||||
182
src/styles/custom.css
Normal file
182
src/styles/custom.css
Normal file
@ -0,0 +1,182 @@
|
|||||||
|
/* mcwaddams - Office Space themed documentation */
|
||||||
|
/* "I was told there would be document extraction." */
|
||||||
|
|
||||||
|
@import "tailwindcss";
|
||||||
|
|
||||||
|
/* ==================== Color Theme ==================== */
|
||||||
|
/* Red Swingline stapler as primary accent */
|
||||||
|
:root {
|
||||||
|
--sl-color-accent-low: #450a0a; /* red-950 */
|
||||||
|
--sl-color-accent: #dc2626; /* red-600 - The Swingline */
|
||||||
|
--sl-color-accent-high: #fca5a5; /* red-300 */
|
||||||
|
|
||||||
|
/* Amber as secondary (Office Space retro vibe) */
|
||||||
|
--sl-color-text-accent: #f59e0b;
|
||||||
|
|
||||||
|
/* Slightly warmer backgrounds */
|
||||||
|
--sl-color-bg-nav: hsl(220, 20%, 8%);
|
||||||
|
--sl-color-bg-sidebar: hsl(220, 18%, 10%);
|
||||||
|
|
||||||
|
/* Custom properties for Office Space elements */
|
||||||
|
--os-stapler-red: #dc2626;
|
||||||
|
--os-amber: #f59e0b;
|
||||||
|
--os-corporate-gray: #374151;
|
||||||
|
--os-basement-dark: #0f172a;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Typography ==================== */
|
||||||
|
:root {
|
||||||
|
--sl-font: 'Inter', system-ui, sans-serif;
|
||||||
|
--sl-font-mono: 'JetBrains Mono', 'Fira Code', monospace;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Hero Customization ==================== */
|
||||||
|
.hero {
|
||||||
|
background:
|
||||||
|
linear-gradient(135deg, rgba(185, 28, 28, 0.1) 0%, transparent 50%),
|
||||||
|
radial-gradient(ellipse at bottom, rgba(245, 158, 11, 0.05) 0%, transparent 70%);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Sidebar Badges ==================== */
|
||||||
|
/* TPS Reports badge styling */
|
||||||
|
[data-badge-text="Testing Painful Stuff"] {
|
||||||
|
background: linear-gradient(135deg, #dc2626 0%, #b91c1c 100%) !important;
|
||||||
|
color: white !important;
|
||||||
|
font-weight: 600;
|
||||||
|
text-transform: uppercase;
|
||||||
|
font-size: 0.65rem;
|
||||||
|
letter-spacing: 0.05em;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Make badges more visible */
|
||||||
|
.sl-badge {
|
||||||
|
font-weight: 500;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Code Blocks ==================== */
|
||||||
|
/* Office-appropriate code styling */
|
||||||
|
.expressive-code {
|
||||||
|
--ec-brdRad: 0.5rem;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Cards & Callouts ==================== */
|
||||||
|
/* "Did you get the memo?" style callouts */
|
||||||
|
.starlight-aside--tip {
|
||||||
|
border-left-color: var(--os-amber);
|
||||||
|
}
|
||||||
|
|
||||||
|
.starlight-aside--caution {
|
||||||
|
border-left-color: var(--os-stapler-red);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Footer Easter Egg ==================== */
|
||||||
|
footer {
|
||||||
|
position: relative;
|
||||||
|
}
|
||||||
|
|
||||||
|
footer::after {
|
||||||
|
content: "I believe you have my stapler...";
|
||||||
|
position: absolute;
|
||||||
|
right: 1rem;
|
||||||
|
bottom: 0.5rem;
|
||||||
|
font-size: 0.7rem;
|
||||||
|
color: rgba(255, 255, 255, 0.2);
|
||||||
|
opacity: 0;
|
||||||
|
transition: opacity 0.3s ease;
|
||||||
|
pointer-events: none;
|
||||||
|
}
|
||||||
|
|
||||||
|
footer:hover::after {
|
||||||
|
opacity: 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Scrollbar Styling ==================== */
|
||||||
|
::-webkit-scrollbar {
|
||||||
|
width: 8px;
|
||||||
|
height: 8px;
|
||||||
|
}
|
||||||
|
|
||||||
|
::-webkit-scrollbar-track {
|
||||||
|
background: var(--sl-color-bg-sidebar);
|
||||||
|
}
|
||||||
|
|
||||||
|
::-webkit-scrollbar-thumb {
|
||||||
|
background: var(--sl-color-gray-5);
|
||||||
|
border-radius: 4px;
|
||||||
|
}
|
||||||
|
|
||||||
|
::-webkit-scrollbar-thumb:hover {
|
||||||
|
background: var(--os-stapler-red);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Links ==================== */
|
||||||
|
a:hover {
|
||||||
|
color: var(--os-amber);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Tables (TPS Report style) ==================== */
|
||||||
|
table {
|
||||||
|
border-collapse: collapse;
|
||||||
|
}
|
||||||
|
|
||||||
|
th {
|
||||||
|
background: linear-gradient(180deg, #374151 0%, #1f2937 100%);
|
||||||
|
text-transform: uppercase;
|
||||||
|
font-size: 0.75rem;
|
||||||
|
letter-spacing: 0.1em;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Film Grain Overlay (subtle) ==================== */
|
||||||
|
body::before {
|
||||||
|
content: "";
|
||||||
|
position: fixed;
|
||||||
|
top: 0;
|
||||||
|
left: 0;
|
||||||
|
width: 100%;
|
||||||
|
height: 100%;
|
||||||
|
pointer-events: none;
|
||||||
|
z-index: 9999;
|
||||||
|
opacity: 0.015;
|
||||||
|
background-image: url("data:image/svg+xml,%3Csvg viewBox='0 0 200 200' xmlns='http://www.w3.org/2000/svg'%3E%3Cfilter id='noise'%3E%3CfeTurbulence type='fractalNoise' baseFrequency='0.9' numOctaves='4' stitchTiles='stitch'/%3E%3C/filter%3E%3Crect width='100%25' height='100%25' filter='url(%23noise)'/%3E%3C/svg%3E");
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Animations ==================== */
|
||||||
|
@keyframes subtle-pulse {
|
||||||
|
0%, 100% { opacity: 1; }
|
||||||
|
50% { opacity: 0.8; }
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Logo hover effect */
|
||||||
|
.site-title img {
|
||||||
|
transition: transform 0.3s ease;
|
||||||
|
}
|
||||||
|
|
||||||
|
.site-title:hover img {
|
||||||
|
transform: rotate(-5deg) scale(1.1);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Mobile Responsiveness ==================== */
|
||||||
|
@media (max-width: 768px) {
|
||||||
|
footer::after {
|
||||||
|
display: none;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Reduced Motion ==================== */
|
||||||
|
@media (prefers-reduced-motion: reduce) {
|
||||||
|
* {
|
||||||
|
animation-duration: 0.01ms !important;
|
||||||
|
transition-duration: 0.01ms !important;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ==================== Print Styles ==================== */
|
||||||
|
@media print {
|
||||||
|
body::before {
|
||||||
|
display: none;
|
||||||
|
}
|
||||||
|
|
||||||
|
footer::after {
|
||||||
|
display: none;
|
||||||
|
}
|
||||||
|
}
|
||||||
1
src/styles/global.css
Normal file
1
src/styles/global.css
Normal file
@ -0,0 +1 @@
|
|||||||
|
@import "tailwindcss";
|
||||||
5
tsconfig.json
Normal file
5
tsconfig.json
Normal file
@ -0,0 +1,5 @@
|
|||||||
|
{
|
||||||
|
"extends": "astro/tsconfigs/strict",
|
||||||
|
"include": [".astro/types.d.ts", "**/*"],
|
||||||
|
"exclude": ["dist"]
|
||||||
|
}
|
||||||
Loading…
x
Reference in New Issue
Block a user