mcaxl/tests/test_css_impact.py
Ryan Malloy dee5fdacda Hamilton review fixes: validator literal preservation, cache cluster id, CSS impact partial-failure reporting
Three findings from a margaret-hamilton-style review of the MCP server,
fixed with regression tests written first (red → green). One bonus
finding (huntpilotqueue column name) was surfaced by the third fix
itself — exactly the audit-trust failure mode that fix exists to expose.

CRITICAL #1 — sql_validator: comment-strip mutated string literals.

The cleaned query returned by validate_select() is what travels to AXL.
Previously, the comment-strip pass ran before the literal-aware pass,
so `--` or `/* */` markers inside a string literal were silently eaten:

  input:  WHERE description = 'Smith -- old line'
  to AXL: WHERE description = 'Smith    (truncated mid-literal)

The LLM saw rows that looked plausible but were not what its query
asked for. "Confidently wrong" is exactly the failure mode the review
was hunting.

Fix: only strip comments on the analysis-only copy used for keyword
detection. The cleaned output preserves the input verbatim (modulo
trailing semicolon and outer whitespace). 6 new tests covering literal
preservation across `--`, `/* */`, LIKE patterns with embedded comment
markers, and forbidden keywords inside real comments.

CRITICAL #2 — cache key omitted cluster identity.

The on-disk cache key was `method::args_json`. An operator swapping
AXL_URL between test and prod (or between two clusters) would silently
serve stale data from cluster A as if from cluster B. The audit
report would be confidently wrong with no signal anything happened.

Fix: AxlCache now takes cluster_id and prefixes all keys with it.
Server bootstrap derives cluster_id as a 12-char SHA-256 prefix of
AXL_URL. cache_stats() surfaces both the current cluster_id and a
`foreign_cluster_entries` count so an env-swap is visible. Schema
migration handles pre-fix cache files via PRAGMA table_info introspection
plus a one-shot ALTER TABLE ADD COLUMN. 5 new tests covering isolation,
shared-id sharing, stats reporting, legacy DB upgrade, and per-cluster
clear() scoping.

MAJOR #3 — find_devices_using_css summary undercounted partial failures.

The function is per-category resilient (one failed query doesn't kill
the whole impact analysis), but the resilience never propagated up to
the response. total_returned and any_truncated only reflected SUCCESSFUL
categories. An LLM consuming "47 references" had no way to know 5
categories errored and the real number was likely much higher.

Fix: response now includes complete: bool, categories_with_errors: int,
and error_categories: [list]. The LLM/auditor sees the partial-failure
state and can decide whether to act on incomplete data. 5 new tests
using a FakeAxlClient stand-in to simulate per-category failures.

BONUS finding (uncovered by Major #3 fix): huntpilotqueue join used
the wrong column. Three CSS impact categories (huntpilot_max_wait_css,
huntpilot_no_agent_css, huntpilot_queue_full_css) were silently
erroring with "Column (fknumplan) not found" because huntpilotqueue
joins via fknumplan_pilot, not fknumplan. With the Major #3 fix in
place, this surfaced immediately as `complete: False, error_categories:
[3 huntpilot_*]` against the live cluster. Fixed inline; live re-run
now reports `complete: True, total_returned: 163` for Internal-CSS.

87 unit tests passing (up from 70). Live cluster smoke test
(cucm-pub.binghammemorial.org, CUCM 15.0.1.12900-234) verifies all
three fixes plus the bonus finding work end-to-end.
2026-04-25 23:09:55 -06:00

120 lines
4.7 KiB
Python

"""Hamilton review MAJOR #3: find_devices_using_css must surface partial failures.
The function is per-category resilient by design — if one schema query fails,
the others still produce results. But the top-level summary previously hid
that some categories errored out: `total_returned` and `any_truncated` only
reflected the SUCCESSFUL categories. An LLM consuming "47 references, low
impact" wouldn't know that 5 categories errored and the real number is
likely much higher.
After the fix: the response includes `complete: bool`, `categories_with_errors`,
and `error_categories`, so an LLM (or human auditor) can see the partial-failure
state and act on it.
"""
import pytest
from mcp_cucm_axl.route_plan import find_devices_using_css
class FakeAxlClient:
"""Minimal stand-in for AxlClient that lets us simulate per-query failures.
Returns a fake CSS pkid for the lookup query, then either a single fake row
or an exception based on substring matching.
"""
def __init__(self, error_on_columns: list[str] | None = None):
self.error_on_columns = error_on_columns or []
self.queries: list[str] = []
def execute_sql_query(self, sql: str) -> dict:
self.queries.append(sql)
# The CSS lookup query — return a fake pkid
if "callingsearchspace WHERE name" in sql:
return {"row_count": 1, "rows": [{"pkid": "fake-css-pkid"}]}
# Any query referencing an "error trigger" column → simulate failure
for trigger in self.error_on_columns:
if trigger in sql:
raise RuntimeError(f"simulated cluster failure on {trigger}")
# Otherwise return one fake reference row so the category isn't empty
return {
"row_count": 1,
"rows": [{"name": "FakeRef", "context": "FakePart", "description": "fake"}],
}
def test_no_errors_reports_complete():
"""Baseline: when every category succeeds, complete=True and no error fields populated."""
client = FakeAxlClient()
result = find_devices_using_css(client, "Some-CSS")
assert result["complete"] is True
assert result["categories_with_errors"] == 0
assert result["error_categories"] == []
# And total_returned reflects the successful categories
assert result["total_returned"] >= 1
def test_one_errored_category_marks_incomplete():
"""The audit-trust failure mode: one category errors out and the summary lies.
Fix: complete=False, categories_with_errors >= 1.
"""
client = FakeAxlClient(error_on_columns=["fkcallingsearchspace_cgpnunknown"])
result = find_devices_using_css(client, "Some-CSS")
assert result["complete"] is False, (
"complete must be False when any category errored"
)
assert result["categories_with_errors"] >= 1
assert "device_cgpn_unknown_css" in result["error_categories"]
def test_multiple_errors_all_listed():
"""All errored categories must be enumerated in error_categories."""
client = FakeAxlClient(
error_on_columns=[
"fkcallingsearchspace_cgpnunknown",
"fkcallingsearchspace_reroute",
"fkcallingsearchspace_pilotqueuefull",
]
)
result = find_devices_using_css(client, "Some-CSS")
assert result["complete"] is False
assert result["categories_with_errors"] == 3
assert set(result["error_categories"]) == {
"device_cgpn_unknown_css",
"device_reroute_css",
"huntpilot_queue_full_css",
}
def test_total_returned_does_not_include_error_categories():
"""An errored category contributes 0 to total_returned (correct behavior).
What's NEW: the response also flags that the count is partial.
"""
client = FakeAxlClient(error_on_columns=["fkcallingsearchspace_cgpnunknown"])
result = find_devices_using_css(client, "Some-CSS")
# The count itself is unchanged from before — what's new is the warning
assert result["complete"] is False
# The error category has no rows in references_by_category
err_cat = result["references_by_category"].get("device_cgpn_unknown_css", {})
assert "error" in err_cat
def test_css_not_found_returns_error_not_partial():
"""If the CSS lookup itself fails (CSS doesn't exist), we return the
'not found' error early, NOT a partial-failure response. Distinct
failure modes deserve distinct shapes.
"""
class CssNotFoundClient:
def execute_sql_query(self, sql):
if "callingsearchspace WHERE name" in sql:
return {"row_count": 0, "rows": []}
return {"row_count": 1, "rows": [{}]}
result = find_devices_using_css(CssNotFoundClient(), "Nonexistent-CSS")
assert "error" in result
assert "complete" not in result, (
"CSS-not-found is a hard error; we shouldn't dress it up as partial"
)