added netbird-watcher script
All checks were successful
Terraform / terraform (push) Successful in 7s
222
PAIN_POINTS.md
@@ -1,204 +1,128 @@
# NetBird GitOps - Remaining Pain Points
# NetBird GitOps - Pain Points Status

This document captures challenges discovered during the POC that need resolution before production use.
## Summary

## Context

**Use case:** ~100+ operators, each with 2 devices (BlastPilot + BlastGS-Agent)
**Workflow:** Ticket-based onboarding, engineer creates PR, merge triggers setup key creation
**Current pain:** Manual setup key creation and peer renaming in dashboard
| # | Pain Point | Status |
|---|------------|--------|
| 1 | Peer naming after enrollment | **SOLVED** - Watcher service |
| 2 | Per-user vs per-role setup keys | **SOLVED** - One-off keys per user |
| 3 | Secure key distribution | Documented workflow |

---

## Pain Point 1: Peer Naming After Enrollment
## Pain Point 1: Peer Naming After Enrollment - SOLVED

### Problem

When a peer enrolls using a setup key, it appears in the NetBird dashboard with its hostname (e.g., `DESKTOP-ABC123` or `raspberrypi`). These hostnames are:
- Often generic and meaningless
- Not controllable via IaC (peer generates its own keypair locally)
- Confusing when managing 100+ devices
When a peer enrolls using a setup key, it appears with its hostname (e.g., `DESKTOP-ABC123`), not a meaningful name.

**Desired state:** Peer appears as `pilot-ivanov` or `gs-unit-042` immediately after enrollment.
### Solution

### Root Cause
**Watcher service** automatically renames peers:

NetBird's architecture requires peers to self-enroll:
1. Setup key defines which groups the peer joins
2. Peer runs `netbird up --setup-key <key>`
3. Peer generates WireGuard keypair locally
4. Peer registers with management server using its local hostname
5. **No API link between "which setup key was used" and "which peer enrolled"**
1. Setup key name = desired peer name (e.g., `pilot-ivanov`)
2. Operator enrolls -> peer appears as `DESKTOP-ABC123`
3. Watcher detects consumed key via API polling (every 30s)
4. Watcher finds peer created around key usage time
5. Watcher renames peer to match key name -> `pilot-ivanov`
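
The watcher steps above can be sketched as a single polling cycle. This is a minimal illustration, not the contents of `watcher/netbird_watcher.py`: the function name, the injected `find_peer` and `rename_peer` callables, and the `id` and `name` dictionary fields are assumptions, and the HTTP calls are left to the caller.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple


def process_consumed_keys(
    setup_keys: Iterable[dict],
    find_peer: Callable[[dict], Optional[dict]],
    rename_peer: Callable[[str, str], None],
    processed_key_ids: Set[str],
) -> List[Tuple[str, str]]:
    """Run one polling cycle and return the (peer_id, new_name) renames performed."""
    renamed = []
    for key in setup_keys:
        if key.get("used_times", 0) == 0:
            continue  # key has not been consumed yet
        if key["id"] in processed_key_ids:
            continue  # already handled in an earlier cycle
        peer = find_peer(key)
        if peer is None:
            continue  # no matching peer yet; try again on the next poll
        if peer["name"] != key["name"]:
            rename_peer(peer["id"], key["name"])
            renamed.append((peer["id"], key["name"]))
        processed_key_ids.add(key["id"])
    return renamed


# Demo with in-memory data instead of real API calls (all values hypothetical).
demo_keys = [
    {"id": "k1", "name": "pilot-ivanov", "used_times": 1},
    {"id": "k2", "name": "gs-unit-042", "used_times": 0},
]
demo_peers = {"k1": {"id": "p1", "name": "DESKTOP-ABC123"}}
demo_processed: Set[str] = set()
demo_calls: List[Tuple[str, str]] = []

demo_result = process_consumed_keys(
    demo_keys,
    find_peer=lambda key: demo_peers.get(key["id"]),
    rename_peer=lambda peer_id, name: demo_calls.append((peer_id, name)),
    processed_key_ids=demo_processed,
)
print(demo_result)
```

A key is only marked as processed once a peer has been found for it, so a key consumed moments before a poll is retried on the next cycle rather than dropped.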

### Options
**Implementation:** `watcher/netbird_watcher.py`

| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual rename** | Engineer renames peer in dashboard after enrollment | Zero | 30 seconds per device, human in loop |
| **B. Polling service** | Service watches for new peers, matches by timing/IP, renames | Medium | More infrastructure, heuristic matching |
| **C. Per-user tracking groups** | Unique group per user, find peer by group membership | High | Group sprawl, cleanup needed |
| **D. Installer modification** | Modify BlastPilot/BlastGS-Agent to set hostname before enrollment | N/A | Code freeze constraint |
**Deployment:**
```bash
cd ansible/netbird-watcher
ansible-playbook -i poc-inventory.yml playbook.yml -e vault_netbird_token=<TOKEN>
```

### Recommendation

**Option A** is acceptable for ~100 operators with ticket-based workflow:
- Ticket arrives -> engineer creates PR -> PR merges -> engineer sends setup key -> operator enrolls -> **engineer renames peer (30 sec)**
- Total engineer time per onboarding: ~5 minutes
- No additional infrastructure

**Option B** worth considering if:
- Onboarding volume increases significantly
- Full automation is required (no human in loop)
**How correlation works:**
- Watcher polls `GET /api/setup-keys` for keys with `used_times > 0`
- Gets `last_used` timestamp from the key
- Polls `GET /api/peers` for peers created within 60 seconds of that timestamp
- Renames matching peer via `PUT /api/peers/{id}`
- Marks key as processed to avoid re-processing
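
The time-window matching described above might look like the following. This is a sketch under stated assumptions, not the watcher's actual code: the peer field name `created_at`, the helper names, and the choice to pick the closest peer when several fall inside the window are all illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_peer_for_key(
    setup_key: dict, peers: List[dict], window_seconds: int = 60
) -> Optional[dict]:
    """Return the peer created closest to the key's last_used time, within the window."""
    used_at = parse_ts(setup_key["last_used"])
    window = timedelta(seconds=window_seconds)

    def distance(peer: dict) -> timedelta:
        # "created_at" is an assumed field name for the peer creation time.
        return abs(parse_ts(peer["created_at"]) - used_at)

    candidates = [peer for peer in peers if distance(peer) <= window]
    if not candidates:
        return None
    # Heuristic: with several candidates, take the one nearest in time.
    return min(candidates, key=distance)


# Hypothetical sample data.
sample_key = {"name": "pilot-ivanov", "last_used": "2024-01-01T12:00:30Z"}
sample_peers = [
    {"id": "p1", "name": "DESKTOP-ABC123", "created_at": "2024-01-01T12:00:20Z"},
    {"id": "p2", "name": "raspberrypi", "created_at": "2024-01-01T12:05:00Z"},
]
match = find_peer_for_key(sample_key, sample_peers)
print(match["id"] if match else None)
```

Because the match is based on timing alone, two operators enrolling within the same minute could be confused with each other; a stricter variant could refuse to rename when more than one candidate falls inside the window.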

---

## Pain Point 2: Per-User vs Per-Role Setup Keys
## Pain Point 2: Per-User vs Per-Role Setup Keys - SOLVED

### Current State
### Problem

Setup keys are defined per-role in `terraform/setup_keys.tf`:
```hcl
resource "netbird_setup_key" "gs_onboarding" {
  name        = "ground-station-onboarding"
  type        = "reusable"
  auto_groups = [netbird_group.ground_stations.id]
  ...
}
```
Reusable per-role keys (e.g., `pilot-onboarding`) don't provide:
- Audit trail (who enrolled which device?)
- Individual revocation
- Usage attribution

This means:
- One reusable key per role
- Key is shared across all operators of that role
- No way to track "this key was issued to Ivanov"
### Solution

### Problems

1. **No audit trail** - Can't answer "who enrolled device X?"
2. **Revocation is all-or-nothing** - Revoking `pilot-onboarding` affects everyone
3. **No usage attribution** - Can't enforce "one device per operator"

### Options

| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Accept per-role keys** | Current state, manual tracking in ticket system | Zero | No IaC-level audit trail |
| **B. Per-user setup keys** | Create key per onboarding request | Low | More keys to manage, cleanup needed |
| **C. One-off keys** | Each key has `usage_limit = 1` | Low | Key destroyed after use, good for audit |

### Recommendation

**Option C (one-off keys)** provides the best tradeoff:
- Create unique key per onboarding ticket
- Key auto-expires after first use
- Clear audit trail: key name links to ticket number
- Easy to implement:
**One-off keys per user/device:**

```hcl
# Example: ticket-based one-off key
resource "netbird_setup_key" "ticket_1234_pilot" {
  name = "ticket-1234-pilot-ivanov"
  type = "one-off"
resource "netbird_setup_key" "pilot_ivanov" {
  name = "pilot-ivanov"
  type = "one-off" # Single use
  auto_groups = [netbird_group.pilots.id]
  usage_limit = 1
  ephemeral = false
}
```

**Workflow:**
1. Ticket ACHILLES-1234: "Onboard pilot Ivanov"
2. Engineer adds setup key `ticket-1234-pilot-ivanov` to Terraform
3. PR merged, key created
4. Engineer sends key to operator (see Pain Point 3)
5. Operator uses key, it's consumed
6. After enrollment, engineer renames peer to `pilot-ivanov`
**Benefits:**
- Key name = audit trail (linked to ticket/user)
- Key is consumed after single use
- Individual keys can be revoked before use
- Watcher uses key name as peer name automatically

---

## Pain Point 3: Secure Key Distribution

### Problem
### Current Workflow

After CI/CD creates a setup key, how does it reach the operator?
1. CI/CD creates setup key
2. Engineer retrieves key locally: `terraform output -raw pilot_ivanov_key`
3. Engineer sends key to operator via secure channel (Signal, encrypted email)
4. Operator uses key within expiry window
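
The retrieval step in the workflow above can be wrapped in a small helper so the key never has to be copied from terminal scrollback. This is a sketch only: the helper names and the `terraform` working directory are assumptions, and it needs the same local Terraform state access as the manual command.

```python
import subprocess
from typing import List


def terraform_output_command(output_name: str) -> List[str]:
    """Build the CLI invocation for reading one raw Terraform output value."""
    return ["terraform", "output", "-raw", output_name]


def read_setup_key(output_name: str, workdir: str = "terraform") -> str:
    """Run `terraform output -raw <name>` in workdir and return the value.

    The working directory default is an assumption about the repository layout.
    The returned key is sensitive: avoid printing or logging it.
    """
    result = subprocess.run(
        terraform_output_command(output_name),
        cwd=workdir,
        check=True,          # raise CalledProcessError if terraform fails
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


# Shows the command that would be executed; does not call Terraform itself.
print(" ".join(terraform_output_command("pilot_ivanov_key")))
```

Calling `read_setup_key("pilot_ivanov_key")` would then return the key as a string, ready to be passed to whatever secure channel is used for delivery.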

Setup keys are sensitive:
- Anyone with the key can enroll a device into the network
- Keys may be reusable (depends on configuration)
- Keys should be transmitted securely
### Considerations

### Current State
- Keys are sensitive - anyone with key can enroll a device
- One-off keys mitigate risk - single use, can't be reused if leaked
- Short expiry (7 days) limits exposure window

Setup keys are output by Terraform:
```bash
terraform output -raw gs_setup_key
```
### Future Improvements (If Needed)

But:
- Requires local Terraform access
- No automated distribution mechanism
- Keys in state file (committed to git in POC - not ideal)
| Option | Description |
|--------|-------------|
| Ticket integration | CI posts key directly to ticket system |
| Secrets manager | Store in Vault/1Password, notify engineer |
| Self-service portal | Operator requests key, gets it directly |

### Options

| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual retrieval** | Engineer runs `terraform output` locally | Zero | Requires CLI access, manual process |
| **B. CI output to ticket** | CI posts key to ticket system via API | Medium | Keys in ticket history (audit trail) |
| **C. Secrets manager** | Store keys in Vault/1Password, notify engineer | Medium | Another system to integrate |
| **D. Encrypted email** | CI encrypts key, emails to operator | High | Key management complexity |

### Recommendation

**Option A** for now (consistent with manual rename):
- Engineer retrieves key after CI completes
- Engineer sends key to operator via secure channel (Signal, encrypted email)
- Ticket updated with "key sent" status

**Option B** worth implementing if:
- Volume increases
- Want full automation
- Ticket system has secure "hidden fields" feature
For ~100 operators with ticket-based workflow, manual retrieval is acceptable.

---

## Summary: Recommended Workflow

Given the constraints (code freeze, ~100 operators, ticket-based), the pragmatic workflow is:
## Final Workflow

```
1. Ticket created: "Onboard pilot Ivanov with BlastPilot + GS"
1. Ticket: "Onboard pilot Ivanov with BlastPilot"

2. Engineer adds to Terraform:
   - ticket-1234-pilot (one-off, 7 days)
   - ticket-1234-gs (one-off, 7 days)
2. Engineer adds to terraform/setup_keys.tf:
   - netbird_setup_key.pilot_ivanov (one-off, 7 days)

3. Engineer creates PR, gets review, merges
3. Engineer creates PR -> CI shows plan

4. CI/CD applies changes, keys created
4. PR merged -> CI applies -> key created

5. Engineer retrieves keys:
   terraform output -raw ticket_1234_pilot_key
5. Engineer retrieves: terraform output -raw pilot_ivanov_key

6. Engineer sends keys to operator via secure channel
6. Engineer sends key to operator via Signal/email

7. Operator enrolls both devices
7. Operator installs NetBird, enrolls with key

8. Engineer renames peers in dashboard:
   DESKTOP-ABC123 -> pilot-ivanov
   raspberrypi -> gs-ivanov
8. Watcher auto-renames peer to "pilot-ivanov"

9. Engineer closes ticket
9. Ticket closed
```

**Total engineer time:** ~10 minutes per onboarding (pair of devices)
**Automation level:** Groups, policies, key creation automated; naming and distribution manual

---

## Future Improvements (If Needed)

1. **Webhook listener** for peer enrollment events -> auto-rename based on timing correlation
2. **Ticket system integration** for automated key distribution
3. **Custom installer** that prompts for device name before enrollment
4. **Batch onboarding tool** for multiple operators at once

These can be addressed incrementally as the operation scales.
**Engineer time:** ~2 minutes (Terraform edit + key retrieval + send)
**Automation:** Full - groups, policies, keys, peer naming all automated