added netbird-watcher script
All checks were successful
Terraform / terraform (push) Successful in 7s

Prox
2026-02-15 19:11:39 +02:00
parent ec0d96f6a0
commit ca546ff6d8
10 changed files with 803 additions and 275 deletions


@@ -1,204 +1,128 @@
# NetBird GitOps - Remaining Pain Points
# NetBird GitOps - Pain Points Status
This document captures challenges discovered during the POC that need resolution before production use.
## Summary
## Context
**Use case:** ~100+ operators, each with 2 devices (BlastPilot + BlastGS-Agent)
**Workflow:** Ticket-based onboarding, engineer creates PR, merge triggers setup key creation
**Current pain:** Manual setup key creation and peer renaming in dashboard
| # | Pain Point | Status |
|---|------------|--------|
| 1 | Peer naming after enrollment | **SOLVED** - Watcher service |
| 2 | Per-user vs per-role setup keys | **SOLVED** - One-off keys per user |
| 3 | Secure key distribution | Documented workflow |
---
## Pain Point 1: Peer Naming After Enrollment
## Pain Point 1: Peer Naming After Enrollment - SOLVED
### Problem
When a peer enrolls using a setup key, it appears in the NetBird dashboard with its hostname (e.g., `DESKTOP-ABC123` or `raspberrypi`). These hostnames are:
- Often generic and meaningless
- Not controllable via IaC (peer generates its own keypair locally)
- Confusing when managing 100+ devices
When a peer enrolls using a setup key, it appears with its hostname (e.g., `DESKTOP-ABC123`), not a meaningful name.
**Desired state:** Peer appears as `pilot-ivanov` or `gs-unit-042` immediately after enrollment.
### Solution
### Root Cause
**Watcher service** automatically renames peers:
NetBird's architecture requires peers to self-enroll:
1. Setup key defines which groups the peer joins
2. Peer runs `netbird up --setup-key <key>`
3. Peer generates WireGuard keypair locally
4. Peer registers with management server using its local hostname
5. **No API link between "which setup key was used" and "which peer enrolled"**
1. Setup key name = desired peer name (e.g., `pilot-ivanov`)
2. Operator enrolls -> peer appears as `DESKTOP-ABC123`
3. Watcher detects consumed key via API polling (every 30s)
4. Watcher finds the peer created within 60 seconds of the key's `last_used` timestamp
5. Watcher renames peer to match key name -> `pilot-ivanov`
### Options
**Implementation:** `watcher/netbird_watcher.py`
| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual rename** | Engineer renames peer in dashboard after enrollment | Zero | 30 seconds per device, human in loop |
| **B. Polling service** | Service watches for new peers, matches by timing/IP, renames | Medium | More infrastructure, heuristic matching |
| **C. Per-user tracking groups** | Unique group per user, find peer by group membership | High | Group sprawl, cleanup needed |
| **D. Installer modification** | Modify BlastPilot/BlastGS-Agent to set hostname before enrollment | N/A | Code freeze constraint |
**Deployment:**
```bash
cd ansible/netbird-watcher
ansible-playbook -i poc-inventory.yml playbook.yml -e vault_netbird_token=<TOKEN>
```
### Recommendation
**Option A** is acceptable for ~100 operators with ticket-based workflow:
- Ticket arrives -> engineer creates PR -> PR merges -> engineer sends setup key -> operator enrolls -> **engineer renames peer (30 sec)**
- Total engineer time per onboarding: ~5 minutes
- No additional infrastructure
**Option B** worth considering if:
- Onboarding volume increases significantly
- Full automation is required (no human in loop)
**How correlation works:**
- Watcher polls `GET /api/setup-keys` for keys with `used_times > 0`
- Gets `last_used` timestamp from the key
- Polls `GET /api/peers` for peers created within 60 seconds of that timestamp
- Renames matching peer via `PUT /api/peers/{id}`
- Marks key as processed to avoid re-processing
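The correlation steps above can be sketched as a pure function over already-fetched API responses. This is an illustrative sketch, not the contents of `watcher/netbird_watcher.py`: the function name `plan_renames`, the peer field `created_at`, and the choice to skip ambiguous matches are assumptions; only `used_times`, `last_used`, and the 60-second window come from the description above.

```python
from datetime import datetime, timedelta

# Matching window taken from the description above (60 seconds).
MATCH_WINDOW = timedelta(seconds=60)


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def plan_renames(setup_keys, peers, processed_key_ids):
    """Return (peer_id, new_name, key_id) tuples for consumed, unprocessed keys.

    setup_keys / peers are lists of dicts shaped like the JSON returned by
    GET /api/setup-keys and GET /api/peers. The peer field "created_at" is
    an assumed name for the peer creation timestamp.
    """
    renames = []
    claimed_peers = set()
    for key in setup_keys:
        # Only keys that were used and not handled in an earlier poll.
        if key["used_times"] == 0 or key["id"] in processed_key_ids:
            continue
        used_at = parse_ts(key["last_used"])
        candidates = [
            peer for peer in peers
            if peer["id"] not in claimed_peers
            and abs(parse_ts(peer["created_at"]) - used_at) <= MATCH_WINDOW
        ]
        # Assumed policy: with zero or several candidates the timing
        # heuristic is ambiguous, so leave the key for a later poll or a human.
        if len(candidates) != 1:
            continue
        peer = candidates[0]
        claimed_peers.add(peer["id"])
        renames.append((peer["id"], key["name"], key["id"]))
    return renames
```

Each returned tuple would then drive one `PUT /api/peers/{id}` call, after which the key ID is added to the processed set. Keeping the matching logic free of network calls makes the heuristic easy to unit-test against recorded API responses.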
---
## Pain Point 2: Per-User vs Per-Role Setup Keys
## Pain Point 2: Per-User vs Per-Role Setup Keys - SOLVED
### Current State
### Problem
Setup keys are defined per-role in `terraform/setup_keys.tf`:
```hcl
resource "netbird_setup_key" "gs_onboarding" {
name = "ground-station-onboarding"
type = "reusable"
auto_groups = [netbird_group.ground_stations.id]
...
}
```
Reusable per-role keys (e.g., `pilot-onboarding`) don't provide:
- Audit trail (who enrolled which device?)
- Individual revocation
- Usage attribution
This means:
- One reusable key per role
- Key is shared across all operators of that role
- No way to track "this key was issued to Ivanov"
### Solution
### Problems
1. **No audit trail** - Can't answer "who enrolled device X?"
2. **Revocation is all-or-nothing** - Revoking `pilot-onboarding` affects everyone
3. **No usage attribution** - Can't enforce "one device per operator"
### Options
| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Accept per-role keys** | Current state, manual tracking in ticket system | Zero | No IaC-level audit trail |
| **B. Per-user setup keys** | Create key per onboarding request | Low | More keys to manage, cleanup needed |
| **C. One-off keys** | Each key has `usage_limit = 1` | Low | Key destroyed after use, good for audit |
### Recommendation
**Option C (one-off keys)** provides the best tradeoff:
- Create unique key per onboarding ticket
- Key is consumed after first use
- Clear audit trail: key name links to ticket number
- Easy to implement:
**One-off keys per user/device:**
```hcl
# Example: ticket-based one-off key
resource "netbird_setup_key" "ticket_1234_pilot" {
name = "ticket-1234-pilot-ivanov"
type = "one-off"
resource "netbird_setup_key" "pilot_ivanov" {
name = "pilot-ivanov"
type = "one-off" # Single use
auto_groups = [netbird_group.pilots.id]
usage_limit = 1
ephemeral = false
}
```
**Workflow:**
1. Ticket ACHILLES-1234: "Onboard pilot Ivanov"
2. Engineer adds setup key `ticket-1234-pilot-ivanov` to Terraform
3. PR merged, key created
4. Engineer sends key to operator (see Pain Point 3)
5. Operator uses key, it's consumed
6. After enrollment, engineer renames peer to `pilot-ivanov`
**Benefits:**
- Key name = audit trail (linked to ticket/user)
- Key is consumed after single use
- Individual keys can be revoked before use
- Watcher uses key name as peer name automatically
---
## Pain Point 3: Secure Key Distribution
### Problem
### Current Workflow
After CI/CD creates a setup key, how does it reach the operator?
1. CI/CD creates setup key
2. Engineer retrieves key locally: `terraform output -raw pilot_ivanov_key`
3. Engineer sends key to operator via secure channel (Signal, encrypted email)
4. Operator uses key within expiry window
Setup keys are sensitive:
- Anyone with the key can enroll a device into the network
- Keys may be reusable (depends on configuration)
- Keys should be transmitted securely
### Considerations
### Current State
- Keys are sensitive - anyone with key can enroll a device
- One-off keys mitigate risk - single use, can't be reused if leaked
- Short expiry (7 days) limits exposure window
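As a rough illustration of the exposure window described above, the sketch below computes how long a leaked one-off key could still be used. The function name `exposure_remaining` and its parameters are hypothetical; only the single-use behaviour and the 7-day expiry come from the considerations listed here.

```python
from datetime import datetime, timedelta, timezone


def exposure_remaining(created_at, now, used_times, expiry=timedelta(days=7)):
    """Return how long a leaked one-off key could still enroll a device."""
    if used_times >= 1:
        # A one-off key is consumed by its first use, so a leak is harmless.
        return timedelta(0)
    # Otherwise the key is usable until it expires (never negative).
    return max(created_at + expiry - now, timedelta(0))


created = datetime(2026, 2, 15, tzinfo=timezone.utc)
two_days_later = created + timedelta(days=2)
print(exposure_remaining(created, two_days_later, used_times=0))  # 5 days, 0:00:00
```

The window closes at whichever comes first: the operator's enrollment or the expiry date.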
Setup keys are output by Terraform:
```bash
terraform output -raw gs_setup_key
```
### Future Improvements (If Needed)
But:
- Requires local Terraform access
- No automated distribution mechanism
- Keys in state file (committed to git in POC - not ideal)
| Option | Description |
|--------|-------------|
| Ticket integration | CI posts key directly to ticket system |
| Secrets manager | Store in Vault/1Password, notify engineer |
| Self-service portal | Operator requests key, gets it directly |
### Options
| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual retrieval** | Engineer runs `terraform output` locally | Zero | Requires CLI access, manual process |
| **B. CI output to ticket** | CI posts key to ticket system via API | Medium | Keys in ticket history (audit trail) |
| **C. Secrets manager** | Store keys in Vault/1Password, notify engineer | Medium | Another system to integrate |
| **D. Encrypted email** | CI encrypts key, emails to operator | High | Key management complexity |
### Recommendation
**Option A** for now (consistent with manual rename):
- Engineer retrieves key after CI completes
- Engineer sends key to operator via secure channel (Signal, encrypted email)
- Ticket updated with "key sent" status
**Option B** worth implementing if:
- Volume increases
- Want full automation
- Ticket system has secure "hidden fields" feature
For ~100 operators with ticket-based workflow, manual retrieval is acceptable.
---
## Summary: Recommended Workflow
Given the constraints (code freeze, ~100 operators, ticket-based), the pragmatic workflow is:
## Final Workflow
```
1. Ticket created: "Onboard pilot Ivanov with BlastPilot + GS"
1. Ticket: "Onboard pilot Ivanov with BlastPilot"
2. Engineer adds to Terraform:
- ticket-1234-pilot (one-off, 7 days)
- ticket-1234-gs (one-off, 7 days)
2. Engineer adds to terraform/setup_keys.tf:
- netbird_setup_key.pilot_ivanov (one-off, 7 days)
3. Engineer creates PR, gets review, merges
3. Engineer creates PR -> CI shows plan
4. CI/CD applies changes, keys created
4. PR merged -> CI applies -> key created
5. Engineer retrieves keys:
terraform output -raw ticket_1234_pilot_key
5. Engineer retrieves: terraform output -raw pilot_ivanov_key
6. Engineer sends keys to operator via secure channel
6. Engineer sends key to operator via secure channel (Signal, encrypted email)
7. Operator enrolls both devices
7. Operator installs NetBird, enrolls with key
8. Engineer renames peers in dashboard:
DESKTOP-ABC123 -> pilot-ivanov
raspberrypi -> gs-ivanov
8. Watcher auto-renames peer to "pilot-ivanov"
9. Engineer closes ticket
9. Ticket closed
```
**Total engineer time:** ~10 minutes per onboarding (pair of devices)
**Automation level:** Groups, policies, key creation automated; naming and distribution manual
---
## Future Improvements (If Needed)
1. **Webhook listener** for peer enrollment events -> auto-rename based on timing correlation
2. **Ticket system integration** for automated key distribution
3. **Custom installer** that prompts for device name before enrollment
4. **Batch onboarding tool** for multiple operators at once
These can be addressed incrementally as the operation scales.
**Engineer time:** ~2 minutes (Terraform edit + key retrieval + send)
**Automation:** Groups, policies, key creation, and peer naming are automated; key retrieval and distribution remain manual