# Proxy Pool Governor
An **online governor** that continuously steers request traffic toward the
best-performing proxy providers. For every service it watches the live health of
each proxy pool and adjusts that pool's routing **weight** — the share of traffic
it receives — pulling weight away from pools that start to degrade and handing it
back as they recover. The goal is to keep overall success rates high and latency low
without a human having to react to every incident.
The decision logic is a rule-based expert system written in **CLIPS** (`gov.clp`).
A small C harness (`gov-sim`) feeds it observations one cycle at a time and applies
the weight changes it emits. Because the policy is just rules, it can be read,
audited, and tuned without touching the harness.
## How it works
Routing is expressed as integer weights per (pool, service): the more weight a pool
has for a service, the larger its share of that service's traffic. Each pool has a
**base weight** (the operator's default preference for that pool) and an
**effective weight** (what the governor is currently using). The governor only ever
moves the effective weight within `min_weight ≤ effective_weight ≤ base_weight`,
so it can throttle a bad pool down to a floor and restore it up to — but never
above — its baseline.
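The bound above can be sketched as a small clamp. This is an illustration only: `clamp_weight` is a hypothetical helper name, not a function taken from the harness or from `gov.clp`.

```c
#include <assert.h>

/* Clamp a proposed effective weight into [min_weight, base_weight].
 * Hypothetical helper; parameter names mirror the terms used above. */
int clamp_weight(int proposed, int min_weight, int base_weight)
{
    assert(min_weight <= base_weight);
    if (proposed < min_weight)
        return min_weight;   /* never throttle below the floor */
    if (proposed > base_weight)
        return base_weight;  /* never boost above the baseline */
    return proposed;
}
```

With `min_weight = 10` and `base_weight = 100`, a proposed weight of 150 comes back as 100 and a proposed weight of 3 comes back as 10.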
The governor runs as a loop. Each **cycle** corresponds to one observation
timestamp:
1. Fresh per-pool/per-service observations are injected as facts: success rate,
response time, timeout rate, SSL-error rate, plus the rolling average and
standard deviation of success rate and response time.
2. CLIPS fires its rules to detect degradation, roll degradation up to a
service-wide view, reduce weight on degrading pools, and restore weight on
recovered ones.
3. The harness reads the `weight-adjustment` and `alert` facts the rules produced,
applies the new effective weights (carrying them into the next cycle), and
surfaces alerts.
CLIPS owns the operational state across cycles (degradation status, how long a
pool has been healthy); the harness owns the effective-weight matrix and carries it
forward. Statistics (moving average / stddev) are computed outside the rules and
supplied with each observation — the rules consume them, they don't maintain a
history window themselves.
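As a rough illustration of the statistics supplied with each observation, the sketch below computes a mean and standard deviation over a window of recent samples. The window handling, the choice of population (rather than sample) standard deviation, and the names `window_stats` and `newton_sqrt` are all assumptions; the actual harness may compute these differently.

```c
#include <assert.h>
#include <stddef.h>

struct stats {
    double avg;
    double stddev;
};

/* Square root by Newton's method, kept local so the sketch builds without
 * linking libm. Real code would normally call sqrt() from <math.h>. */
double newton_sqrt(double x)
{
    if (x <= 0.0)
        return 0.0;
    double guess = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; i++)
        guess = 0.5 * (guess + x / guess);
    return guess;
}

/* Mean and population standard deviation of the n samples in the window. */
struct stats window_stats(const double *samples, size_t n)
{
    struct stats s = {0.0, 0.0};
    assert(samples != NULL && n > 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    s.avg = sum / (double)n;

    double sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = samples[i] - s.avg;
        sq += d * d;
    }
    s.stddev = newton_sqrt(sq / (double)n);
    return s;
}
```

The resulting `avg` and `stddev` would then travel with the observation for that pool and service, one pair for success rate and one for response time.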
### Detecting a degrading provider
A pool is flagged as degraded for a service when any of these fire:
| Signal | Trigger |
|--------|---------|
| Response time | `response_time` exceeds `avg + sigma_threshold × stddev`, **or** exceeds `max_response_time` |
| Success rate | `rate_success` falls below `avg − sigma_threshold × stddev`, **or** below `min_success_rate` |
| SSL errors | SSL error rate exceeds 5% |
| Timeouts | Timeout rate exceeds 10% |
Each detection carries a **severity** derived from how far the metric has moved
(in standard deviations for the statistical checks). Severity drives how hard the
governor reacts.
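The four triggers can be expressed as one predicate. The sketch below is written in C for illustration; the real checks live in `gov.clp`. The struct layouts, the representation of rates as fractions in [0, 1], and the choice to report severity as the largest sigma distance among the statistical checks that fired (leaving it at 0 when only a fixed threshold fired) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SSL_ERROR_LIMIT 0.05 /* 5%, fixed in the rules */
#define TIMEOUT_LIMIT   0.10 /* 10%, fixed in the rules */

struct observation {            /* hypothetical layout */
    double rate_success, response_time, timeout_rate, ssl_error_rate;
    double avg_success, sd_success; /* rolling stats for success rate */
    double avg_rt, sd_rt;           /* rolling stats for response time */
};

struct limits {                 /* subset of the per-service parameters */
    double sigma_threshold, min_success_rate, max_response_time;
};

/* Distance from the average in standard deviations (0 if stddev is 0). */
static double sigmas(double value, double avg, double stddev)
{
    if (stddev <= 0.0)
        return 0.0;
    double d = (value - avg) / stddev;
    return d < 0.0 ? -d : d;
}

/* Returns true if any trigger fires; *severity gets the sigma distance. */
bool is_degraded(const struct observation *o, const struct limits *l,
                 double *severity)
{
    assert(o != NULL && l != NULL && severity != NULL);
    bool degraded = false;
    *severity = 0.0;

    if (o->response_time > o->avg_rt + l->sigma_threshold * o->sd_rt) {
        degraded = true;
        *severity = sigmas(o->response_time, o->avg_rt, o->sd_rt);
    }
    if (o->rate_success < o->avg_success - l->sigma_threshold * o->sd_success) {
        double s = sigmas(o->rate_success, o->avg_success, o->sd_success);
        degraded = true;
        if (s > *severity)
            *severity = s;
    }
    if (o->response_time > l->max_response_time ||
        o->rate_success < l->min_success_rate ||
        o->ssl_error_rate > SSL_ERROR_LIMIT ||
        o->timeout_rate > TIMEOUT_LIMIT)
        degraded = true;

    return degraded;
}
```

For example, a response time of 300 against a rolling average of 210 and a standard deviation of 20 exceeds a 2-sigma threshold (250) and yields a severity of 4.5.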
### Shifting traffic away
When a pool is degraded *and the service as a whole is still healthy*, the governor
reduces that pool's effective weight:
- First reduction multiplies the weight by `weight_reduction` (e.g. 0.5 halves it),
clamped to `[min_weight, base_weight]`.
- An already-reduced pool whose severity exceeds 3σ is cut again (halved).
Reducing one pool's weight naturally shifts its traffic to its healthier peers —
that is the core "move traffic to the best providers" behavior.
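A minimal sketch of the reduction arithmetic follows. It assumes that "already reduced" means the effective weight is below the base weight, that fractional results truncate toward zero, and that traffic share is proportional to weight; none of these details are confirmed by the rules, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One reduction step for a degraded pool on a healthy service. */
int reduce_weight(int effective, int base_weight, int min_weight,
                  double weight_reduction, double severity)
{
    assert(min_weight <= base_weight);
    int next = effective;

    if (effective >= base_weight)
        next = (int)(effective * weight_reduction); /* first reduction */
    else if (severity > 3.0)
        next = effective / 2;                       /* repeat cut above 3 sigma */

    if (next < min_weight)
        next = min_weight;
    if (next > base_weight)
        next = base_weight;
    return next;
}

/* Share of a service's traffic going to pool i, assuming shares are
 * proportional to effective weight. */
double traffic_share(const int *weights, size_t n, size_t i)
{
    assert(weights != NULL && i < n);
    long total = 0;
    for (size_t k = 0; k < n; k++)
        total += weights[k];
    return total > 0 ? (double)weights[i] / (double)total : 0.0;
}
```

With two pools at weight 100 each and `weight_reduction = 0.5`, cutting the first pool to 50 moves its share from one half to one third, and the healthy peer rises to two thirds without its own weight changing.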
**Service-wide safety:** if too many of a service's pools are degraded at once
(degraded fraction ≥ `service_degrade_threshold`), the service is marked degraded,
an alert is raised, and weight reductions are **suspended**. When everything is
already bad there is no "good" pool to shift toward, so the governor stops cutting
rather than gutting the whole service.
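The service-wide gate reduces to a fraction comparison. The sketch below assumes the degraded fraction is simply degraded pools divided by total pools for that service; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when the degraded fraction reaches service_degrade_threshold. */
bool service_is_degraded(size_t degraded_pools, size_t total_pools,
                         double service_degrade_threshold)
{
    assert(total_pools > 0 && degraded_pools <= total_pools);
    double fraction = (double)degraded_pools / (double)total_pools;
    return fraction >= service_degrade_threshold;
}

/* Weight cuts apply only to a degraded pool on a still-healthy service. */
bool may_reduce(bool pool_degraded, bool service_degraded)
{
    return pool_degraded && !service_degraded;
}
```

With a threshold of 0.5, two degraded pools out of four mark the service degraded and suspend cuts, while one out of four does not.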
### Restoring traffic
A pool that is no longer degraded is timestamped as healthy. Once it has stayed
healthy for longer than `restore_cooldown`, the governor steps its weight back up by
`base_weight / 4` per cycle (at least 1), never exceeding `base_weight`. Restoration
is deliberately gradual so a flapping pool doesn't immediately reclaim full traffic.
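The restoration step can be sketched as follows. The function names are hypothetical, integer division is assumed for `base_weight / 4`, and the timestamps and `restore_cooldown` are assumed to share one unit.

```c
#include <assert.h>
#include <stdbool.h>

/* True once the pool has been healthy for longer than restore_cooldown. */
bool restore_due(long now, long healthy_since, long restore_cooldown)
{
    return now - healthy_since > restore_cooldown;
}

/* One restoration step: add base_weight / 4 (at least 1), capped at base. */
int restore_step(int effective, int base_weight)
{
    assert(base_weight >= 0);
    int step = base_weight / 4;
    if (step < 1)
        step = 1;
    int next = effective + step;
    return next > base_weight ? base_weight : next;
}
```

Under these assumptions a pool with `base_weight = 100` that was cut to 50 returns to full weight in two cycles (50, 75, 100), and one cut to 10 takes four (10, 35, 60, 85, 100).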
## Tunable parameters
Tuning is per service, via `service-config` facts (seeded from `main.c`). The
parameters the rules actually read:
| Parameter | Role |
|-----------|------|
| `sigma_threshold` | How many standard deviations count as anomalous |
| `min_success_rate` | Hard floor below which a pool is degraded regardless of its own baseline |
| `max_response_time` | Hard ceiling above which a pool is degraded |
| `weight_reduction` | Multiplier applied on first reduction |
| `min_weight` | Floor the governor will not throttle below (e.g. a contractual minimum) |
| `restore_cooldown` | How long a pool must stay healthy before restoration begins |
| `service_degrade_threshold` | Fraction of degraded pools that marks the whole service degraded |
The SSL (5%) and timeout (10%) detection thresholds are currently constants in the
rules rather than per-service config.
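For illustration, one way a harness could hold these parameters and render them as a fact string is sketched below. The struct, the function name, the example values, and above all the fact layout (slot names and order) are assumptions; the authoritative `service-config` template is whatever `gov.clp` defines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct service_config {          /* hypothetical C-side mirror */
    const char *service;
    double sigma_threshold;
    double min_success_rate;
    double max_response_time;
    double weight_reduction;
    int min_weight;
    int restore_cooldown;
    double service_degrade_threshold;
};

/* Render the config as a fact string. Returns the string length, or -1 if
 * the service name is unsafe as a bare symbol or the buffer is too small. */
int format_service_config(char *buf, size_t len, const struct service_config *c)
{
    assert(buf != NULL && c != NULL && c->service != NULL);
    if (c->service[0] == '\0' || strpbrk(c->service, " ()\"") != NULL)
        return -1;

    int n = snprintf(buf, len,
        "(service-config (service %s) (sigma_threshold %.2f)"
        " (min_success_rate %.2f) (max_response_time %.1f)"
        " (weight_reduction %.2f) (min_weight %d)"
        " (restore_cooldown %d) (service_degrade_threshold %.2f))",
        c->service, c->sigma_threshold, c->min_success_rate,
        c->max_response_time, c->weight_reduction, c->min_weight,
        c->restore_cooldown, c->service_degrade_threshold);

    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

A string built this way could be handed to the rule engine when seeding a service, provided the slot names are adjusted to match the real template.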
## Building and running
This targets **OpenBSD**: the build uses BSD `make` and the binary calls
`pledge(2)`/`unveil(2)`.
```sh
# CLIPS_DIR must point at a built CLIPS core (libclips.a + clips.h).
make CLIPS_DIR=/path/to/clips/core
./gov-sim # synthetic simulation, 60 cycles (default)
./gov-sim -s 200 # synthetic simulation, 200 cycles (-s N sets the cycle count)
./gov-sim -r scenarios/foo.pps # replay recorded observations from a scenario file
```
- **Simulation mode** generates synthetic pool health (periodic incidents and
recoveries) to exercise the rules end to end.
- **Replay mode** drives the governor from a recorded `.pps` scenario file, so the
same observation stream can be replayed against different parameter settings.
See `scenario.h` for the file format and `scenario.R` for tooling to read, write,
generate, and analyze scenarios.
Each cycle prints the weight adjustments and alerts it produced, and the weight
matrix is printed periodically and at the end.
## Operational notes
- **Pool semantics:** response times degrade non-linearly as a pool saturates, and
different providers degrade differently — some gracefully, some sharply. The
statistical (sigma) checks adapt to each pool's own baseline; the hard floors
(`min_success_rate`, `max_response_time`) catch absolute-bad behavior.
- **Restart:** effective weights are the only state intended to be durable; the
in-memory facts (healthy-since timers, degradation status) are rebuilt from the next
few cycles of observations. Behavior after restart is conservative: a still-degraded
pool is re-detected quickly, while restoration timers simply start over. Persisting
the weights themselves is still listed under design intent below.
## Design intent (not yet implemented)
The following are part of the governor's intended direction but are **not** in the
current rules. They are listed so the gap between intent and implementation is
explicit:
- **Capacity awareness:** respect each pool's soft/hard request limits — reduce
weight as a pool nears its limit even if quality is fine, block restoration into a
near-full pool, and allow boosting a healthy pool *above* its base weight (up to a
ceiling, e.g. 1.5×) to absorb load shed from constrained peers.
- **Proactive, time-of-day shifts:** anticipate the daily traffic cycle
(trough → ramp-up → peak → ramp-down) and pre-shift toward load-efficient pools
before peak rather than only reacting after degradation.
- **Probation / flap handling:** a `healthy → degraded → probation → healthy` state
machine with flap detection and extended probation for pools that oscillate.
- **Trend detection:** act on multi-cycle trends, not just single-cycle snapshots.
- **Persistence and scale:** sourcing observations from a production metrics store
on a fixed poll interval and persisting effective weights across restarts.
When this document and `gov.clp` disagree, `gov.clp` is what actually runs.
## License
Copyright (C) 2026 SWGY, Inc.
This program is free software: you can redistribute it and/or modify it under the
terms of the GNU Affero General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later version.
See [`LICENSE`](LICENSE) for the full text.