Process Monitoring
The agent monitors configured processes every 10 seconds, detecting crashes, stalls, and exits. When a process goes down, the agent automatically restarts it (if autolaunch is enabled).
Process State Machine
Every configured process is in one of five states:
                    ┌──────────┐
           launch   │ RUNNING  │  crash/exit
         ┌─────────▶│          │──────────┐
         │          └──────────┘          │
         │               │                ▼
    ┌──────────┐   stall detected   ┌──────────┐
    │ STOPPED  │         │          │  KILLED  │
    │          │         ▼          │          │
    └──────────┘    ┌──────────┐    └──────────┘
         ▲          │ STALLED  │          │
         │          │          │          │ auto-restart
         │          └──────────┘          │ (if autolaunch)
         │               │                │
         │      kill after confirm        │
         └───────────────┘◀───────────────┘
State Definitions
| State | Description | Dashboard Indicator |
|---|---|---|
| RUNNING | Process is alive and responsive | Green |
| STALLED | Process exists but is not responding (hang detected) | Yellow |
| KILLED | Process was terminated (manually or by agent) | Red |
| STOPPED | Process is not running, autolaunch disabled | Grey |
| INACTIVE | Process is configured but its executable was not found | Grey (dimmed) |
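
The states in the table can be modelled as an enum with an explicit transition table. The sketch below is illustrative only: `ProcessState`, `ALLOWED`, and `transition` are hypothetical names, and the transition set is one plausible reading of the diagram (the diagram shows no transitions for INACTIVE, so none are assumed here).

```python
from enum import Enum


class ProcessState(Enum):
    RUNNING = "running"
    STALLED = "stalled"
    KILLED = "killed"
    STOPPED = "stopped"
    INACTIVE = "inactive"


# Assumed transition table, read off the state diagram; not the agent's actual rules.
ALLOWED = {
    ProcessState.STOPPED: {ProcessState.RUNNING},                         # launch
    ProcessState.RUNNING: {ProcessState.STALLED, ProcessState.KILLED},    # stall / crash or exit
    ProcessState.STALLED: {ProcessState.KILLED},                          # kill after confirm
    ProcessState.KILLED: {ProcessState.RUNNING, ProcessState.STOPPED},    # auto-restart, or stay down
    ProcessState.INACTIVE: set(),                                         # executable not found
}


def transition(current: ProcessState, target: ProcessState) -> ProcessState:
    """Return the target state if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the allowed moves in one table makes it easy to reject an update that skips a step, for example marking an INACTIVE process as RUNNING.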
Monitoring Loop
Every 10 seconds, the agent runs through all configured processes:
1. Check if Process is Running
The agent validates the process by:
- PID check: Is there a process with the stored PID?
- Path verification: Does the running process match the configured exe_path? (prevents PID reuse false positives)
- Status update: Set state to RUNNING or detect crash
2. Crash Detection
A process is considered crashed when:
- Its PID no longer exists
- The PID exists but belongs to a different executable (PID was reused by the OS)
- The process exit code indicates abnormal termination
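
The PID and path checks from steps 1 and 2 might look roughly like this. It is a sketch under assumptions, not the agent's code: `check_process` and `psutil_exe_lookup` are hypothetical names, the result strings are invented for illustration, and the exit-code condition is left out because it depends on how the process was launched. `psutil` is imported lazily so the comparison logic can be exercised with a fake lookup.

```python
import os
from typing import Callable, Optional


def psutil_exe_lookup(pid: int) -> Optional[str]:
    """Return the executable path of `pid`, or None if no such process exists."""
    import psutil  # third-party; imported lazily so the logic below runs without it

    try:
        return psutil.Process(pid).exe()
    except psutil.NoSuchProcess:
        return None


def _norm(path: str) -> str:
    # Normalise separators and, on Windows, case before comparing paths.
    return os.path.normcase(os.path.normpath(path))


def check_process(
    pid: int,
    exe_path: str,
    lookup_exe: Callable[[int], Optional[str]] = psutil_exe_lookup,
) -> str:
    """Classify a stored PID as 'running', 'pid_missing', or 'pid_reused'."""
    actual = lookup_exe(pid)
    if actual is None:
        return "pid_missing"   # the PID no longer exists
    if _norm(actual) != _norm(exe_path):
        return "pid_reused"    # the OS gave the PID to a different executable
    return "running"
```

Both `pid_missing` and `pid_reused` would count as a crash in the terms used above; only `running` leaves the state unchanged.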
3. Hang Detection (Multi-Stage)
The agent uses a progressive approach to detect frozen applications:
| Stage | Time | Action |
|---|---|---|
| Detection | 0-10s | owlette_scout.py sends WM_NULL to the process window |
| Wait | 10-15s | If no response, wait for possible recovery |
| Confirmation | 15s+ | If still unresponsive, mark as STALLED |
WM_NULL is a harmless Windows message — if the process responds, it's alive. If it doesn't respond within the timeout, the process is likely hung.
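
The staged timing in the table can be expressed as a small tracker that is fed the result of each WM_NULL probe. This is a minimal sketch: `HangTracker` and its return values are hypothetical, the probe itself is not shown, and the clock is assumed to start at the first missed response.

```python
from dataclasses import dataclass
from typing import Optional

CONFIRM_AFTER_S = 15.0  # from the table: still unresponsive at 15 s means STALLED


@dataclass
class HangTracker:
    # Timestamp of the first missed response, or None while the process is responsive.
    unresponsive_since: Optional[float] = None

    def update(self, responded: bool, now: float) -> str:
        """Feed one probe result; return 'responsive', 'waiting', or 'stalled'."""
        if responded:
            self.unresponsive_since = None  # recovery resets the timer
            return "responsive"
        if self.unresponsive_since is None:
            self.unresponsive_since = now   # assumption: timing starts at the first miss
        elapsed = now - self.unresponsive_since
        return "stalled" if elapsed >= CONFIRM_AFTER_S else "waiting"
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the tracker deterministic and easy to test.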
4. Auto-Restart
When a crash is detected and autolaunch is enabled:
- Agent increments the relaunch counter
- If under the limit (relaunch_attempts), restart the process
- Wait launch_delay seconds before starting
- Wait init_time seconds before monitoring responsiveness
- If at the limit, show a reboot prompt to the user
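
The restart sequence can be sketched as a single function. The names `handle_crash`, `launch`, and `sleep` are hypothetical, and the boundary rule (restart while the incremented counter is below relaunch_attempts, prompt once it reaches the limit) is an assumption drawn from the wording of the list above.

```python
import time
from typing import Callable, Tuple


def handle_crash(
    relaunch_count: int,
    relaunch_attempts: int,
    launch_delay: float,
    init_time: float,
    launch: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Return (new_count, action), where action is 'restarted' or 'prompt_reboot'."""
    relaunch_count += 1
    if relaunch_count >= relaunch_attempts:
        # Assumed boundary: reaching the limit triggers the reboot prompt instead.
        return relaunch_count, "prompt_reboot"
    sleep(launch_delay)  # wait before starting the process
    launch()
    sleep(init_time)     # give the process time to initialise before probing it
    return relaunch_count, "restarted"
```

Injecting `sleep` and `launch` keeps the ordering of the two waits visible and lets the logic run in tests without real delays.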
Process Launch Methods
The agent uses a two-stage launch strategy:
Primary: Task Scheduler
Agent creates one-time scheduled task
→ Task runs under logged-in user account
→ Agent finds the new PID
→ Task is deleted (cleanup)
Advantages: Processes survive service restarts (not killed by NSSM job objects).
Fallback: CreateProcessAsUser
If Task Scheduler fails, the agent falls back to CreateProcessAsUser via pywin32:
Agent gets user token (WTSQueryUserToken)
→ CreateProcessAsUser with the token
→ Process runs under user session
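
The two-stage strategy amounts to trying launchers in order and returning the first PID obtained. The sketch below shows only that control flow: `launch_with_fallback` and the `Launcher` type are hypothetical, and the real Task Scheduler and CreateProcessAsUser calls are assumed to be wrapped in functions that take an executable path and return a PID.

```python
from typing import Callable, List, Sequence, Tuple

Launcher = Callable[[str], int]  # takes exe_path, returns the new PID


def launch_with_fallback(
    exe_path: str, launchers: Sequence[Tuple[str, Launcher]]
) -> Tuple[str, int]:
    """Try each (name, launcher) in order; return (name, pid) of the first that works."""
    errors: List[str] = []
    for name, launcher in launchers:
        try:
            return name, launcher(exe_path)
        except Exception as exc:  # any failure moves on to the next method
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all launch methods failed: " + "; ".join(errors))
```

A caller would pass the Task Scheduler wrapper first and the CreateProcessAsUser wrapper second; the returned name records which method actually started the process.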
PID Recovery
When the service restarts, it doesn't re-launch processes that are already running. Instead, it recovers existing PIDs:
- For each configured process, scan running processes for matching exe_path
- If found, adopt the PID and mark as RUNNING without relaunching
- If not found and autolaunch is enabled, start the process
This prevents duplicate instances after service restarts or crashes.
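
A sketch of the recovery scan follows, assuming the configuration is a mapping from process name to exe_path. `recover_pids` and `running_processes` are hypothetical names; `psutil.process_iter` is imported lazily so the matching logic can be tested with a plain list.

```python
import os
from typing import Dict, Iterable, Iterator, Optional, Tuple


def _norm(path: str) -> str:
    # Normalise separators and, on Windows, case before comparing paths.
    return os.path.normcase(os.path.normpath(path))


def running_processes() -> Iterator[Tuple[int, Optional[str]]]:
    """Yield (pid, exe) for every visible process; exe is None when access is denied."""
    import psutil  # third-party; imported lazily

    for proc in psutil.process_iter(["pid", "exe"]):
        yield proc.info["pid"], proc.info["exe"]


def recover_pids(
    configured: Dict[str, str],
    running: Iterable[Tuple[int, Optional[str]]],
) -> Dict[str, Optional[int]]:
    """Map each configured name to an adopted PID, or None if it is not running."""
    by_exe: Dict[str, int] = {}
    for pid, exe in running:
        if exe:
            by_exe.setdefault(_norm(exe), pid)  # keep the first match per executable
    return {name: by_exe.get(_norm(path)) for name, path in configured.items()}
```

Entries that come back as `None` are the ones the agent would start, provided autolaunch is enabled for them.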
Relaunch Limits
Each process has a configurable relaunch_attempts limit (default: 5). When the limit is reached:
- The agent stops trying to restart the process
- A reboot countdown prompt appears on screen (prompt_restart.py)
- The user can dismiss the prompt or allow the reboot
- The relaunch counter resets after a successful process start or manual intervention
Crash alerts
When a process crashes, the agent reports the event to the web dashboard via the alert API. If email alerts are configured for the site, the dashboard sends a process crash alert email including the process name, machine name, and error details. Webhooks are also triggered if configured.
Metrics Collection
Every 60 seconds, the agent collects and reports:
| Metric | Source | Description |
|---|---|---|
| CPU | psutil.cpu_percent() | Overall CPU usage percentage |
| Memory | psutil.virtual_memory() | RAM usage percentage |
| Disk | psutil.disk_usage('/') | Primary disk usage percentage |
| GPU | WinTmp / nvidia-ml-py | GPU usage percentage (if available) |
| CPU Model | Registry/psutil | CPU model name (e.g., "Intel Core i9-9900X") |
| Processes | Per-process | Status, PID, uptime for each configured process |
GPU monitoring uses a fallback chain:
- NVIDIA GPUs: GPUtil or pynvml (NVML) for load and temperature
- Other GPUs: WinTmp/LibreHardwareMonitor for basic metrics
- No GPU: Gracefully returns 0
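
The fallback chain can be sketched as an ordered list of probes, each of which may fail or return nothing. `gpu_usage_percent` and `nvml_probe` are hypothetical names; the NVML probe uses the `pynvml` module's standard calls but is only an illustration of what one link in the chain might look like, and it reads the first GPU only.

```python
from typing import Callable, Optional, Sequence

GpuProbe = Callable[[], Optional[float]]


def nvml_probe() -> Optional[float]:
    """Return GPU load in percent for device 0 via NVML (assumes pynvml is installed)."""
    import pynvml  # third-party; imported lazily so a missing module just fails this probe

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    finally:
        pynvml.nvmlShutdown()


def gpu_usage_percent(probes: Sequence[GpuProbe]) -> float:
    """Return the first usable reading from the probes, or 0.0 if none succeeds."""
    for probe in probes:
        try:
            value = probe()
        except Exception:
            continue  # missing library or unsupported hardware: try the next probe
        if value is not None:
            return float(value)
    return 0.0  # no GPU or no working probe
```

Calling `gpu_usage_percent([nvml_probe, other_probe])` would try NVML first and fall through to the next probe, with `other_probe` standing in for a WinTmp or LibreHardwareMonitor reader.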