Closes the four remaining findings from the margaret-hamilton review. 13 new regression tests; all 100 pass; live cluster smoke verified. MAJOR #4 — wildcard regex catastrophic backtracking + silent malformed. Two changes to _wildcard_to_regex(): a) Bounded the `!` and `@` wildcards to \d{1,50} (was \d+). Adjacent `!` patterns previously compiled to (\d+)(\d+)... which has exponential backtracking on near-miss inputs. CUCM dial strings are practically capped well below 50 digits; the bound keeps complexity polynomial without losing real-world coverage. Verified: 10 adjacent `!` against a 30-digit near-miss now finishes in ~240ms (was unbounded; could have been minutes on real pathological cases). b) Unclosed `[` now raises ValueError instead of silently treating the bracket as a literal. _pattern_matches_number catches the error and returns False so a single bad pattern doesn't crash translation_chain — but the bad pattern is no longer invisibly producing wrong matches. The previous silent fallback meant a pattern like `[0-9` (typo, missing `]`) would match input containing the literal characters `[` `0` `-` `9`. 3 new tests covering: bounded-regex shape (`\d{1,N}`), pathological input completes quickly, unclosed bracket raises explicitly, well-formed character class still works. MAJOR #5 — distinguish config errors from operational errors. Pre-fix: any first-time connection failure set `_connection_error` and pinned it forever. A transient network blip or session timeout required restarting the MCP server. Hamilton's framing: Apollo's software was *designed* to recover from transient faults; pinning forever is the antithesis of "design the error path first." Fix: split into two state fields: _config_error — permanent until restart (missing env vars only) _last_error — last operational failure, NOT a pin Operational failures (zeep Client construction, network, TLS, session) clear from the next call's perspective: the next call attempts fresh. Configuration errors (missing AXL_URL etc.) stay pinned because they don't get better on retry. Added _ConfigError as a private subclass to make the distinction explicit at the raise site, and connection_status() to expose connected/connected_at/config_error/last_error for diagnostic transparency. 3 new tests: config errors pin, operational errors don't pin, connection_status() reports state. MINOR #6 — _to_int silent coercion of bad data. Pre-fix: a non-numeric value from the cluster (data corruption, schema drift across CUCM versions) silently became None, which downstream sort logic defaulted to 0 — jumbling the failover order in the displayed result with no warning. Fix: still returns None on bad data (caller error path unchanged), but logs the offending value to stderr so an operator notices something's wrong at the data layer. None itself is silent (legitimately-unset column). 2 new tests: real None is silent, bad string logs to stderr with the offending value visible. MINOR #7 — standardize tool failure shapes; add health() tool. Pre-fix: cache_stats and cache_clear returned `{"error": "..."}` when _cache was None, while AXL-touching tools raised RuntimeError. LLM consumers had to handle two shapes. Fix: _require_cache() helper raises RuntimeError consistently with _client(). All tool failures now use the same exception shape. Added health() tool that reports cache/axl/docs initialization status plus the AXL connection_status — gives operators a self-diagnostic when something fails at bootstrap. 3 new tests: cache_stats raises, cache_clear raises, health() reports each subsystem.
84 lines
3.3 KiB
Python
84 lines
3.3 KiB
Python
"""Hamilton review MAJOR #5: connection recovery and config-vs-operational errors.
|
|
|
|
Pre-fix: any connection failure set `_connection_error` and pinned it forever.
|
|
A transient network blip required restarting the MCP server. Fix: distinguish
|
|
*configuration* errors (missing env, bad WSDL) which are pinned, from
|
|
*operational* errors (network, TLS, session timeout) which can be retried
|
|
on the next call.
|
|
"""
|
|
|
|
from pathlib import Path
|
|
|
|
import pytest
|
|
|
|
from mcp_cucm_axl.cache import AxlCache
|
|
from mcp_cucm_axl.client import AxlClient
|
|
|
|
|
|
@pytest.fixture
|
|
def cache(tmp_path: Path) -> AxlCache:
|
|
return AxlCache(tmp_path / "test.sqlite", default_ttl=60, cluster_id="test")
|
|
|
|
|
|
def test_config_error_is_pinned(cache: AxlCache, monkeypatch):
|
|
"""Missing AXL_URL is a config error — it doesn't get better on retry,
|
|
and the next call should still raise the same clear message."""
|
|
monkeypatch.delenv("AXL_URL", raising=False)
|
|
monkeypatch.delenv("AXL_USER", raising=False)
|
|
monkeypatch.delenv("AXL_PASS", raising=False)
|
|
client = AxlClient(cache)
|
|
|
|
with pytest.raises(RuntimeError, match="AXL_URL"):
|
|
client._ensure_connected()
|
|
# Second call: same config error, pinned
|
|
with pytest.raises(RuntimeError, match="AXL_URL"):
|
|
client._ensure_connected()
|
|
|
|
|
|
def test_operational_error_is_not_pinned(cache: AxlCache, monkeypatch):
|
|
"""A transient operational error (zeep Client construction failing,
|
|
network blip, etc.) should NOT pin the client forever. The next call
|
|
must be allowed to retry."""
|
|
monkeypatch.setenv("AXL_URL", "https://test.invalid:8443/axl")
|
|
monkeypatch.setenv("AXL_USER", "test")
|
|
monkeypatch.setenv("AXL_PASS", "test")
|
|
monkeypatch.setenv("AXL_VERIFY_TLS", "false")
|
|
|
|
# Force the zeep Client constructor inside _ensure_connected to raise.
|
|
# This simulates "WSDL fetch failed", "TLS handshake error", etc. —
|
|
# transient operational failures.
|
|
from mcp_cucm_axl import client as client_mod
|
|
|
|
def boom(*args, **kwargs):
|
|
raise ConnectionError("simulated transient network failure")
|
|
|
|
monkeypatch.setattr(client_mod, "Client", boom)
|
|
|
|
client = AxlClient(cache)
|
|
with pytest.raises(RuntimeError, match="simulated transient"):
|
|
client._ensure_connected()
|
|
|
|
# Hamilton review MAJOR #5: operational errors must NOT set _config_error.
|
|
# _config_error is the permanent pin; only set on missing env vars / config
|
|
# mistakes. A failed network connection is operational and the next call
|
|
# must be allowed to retry.
|
|
assert client._config_error is None, (
|
|
"operational errors must not set _config_error (the pin); "
|
|
"only configuration errors (missing env vars, bad WSDL) should pin"
|
|
)
|
|
# _last_error is set for diagnostics, but it does not block retries.
|
|
assert client._last_error is not None, (
|
|
"_last_error should record the operational failure for diagnostics"
|
|
)
|
|
assert "simulated transient" in client._last_error
|
|
|
|
|
|
def test_health_diagnostic_includes_connection_state(cache: AxlCache):
|
|
"""The client should expose its connection age / last-attempt info
|
|
so an operator can see what's going on without reading sys.stderr."""
|
|
client = AxlClient(cache)
|
|
info = client.connection_status()
|
|
assert "connected" in info
|
|
assert info["connected"] is False # never tried yet
|
|
assert "last_error" in info
|