netbird-iac/PAIN_POINTS.md
2026-02-15 18:37:15 +02:00

# NetBird GitOps - Remaining Pain Points
This document captures challenges discovered during the POC that need resolution before production use.
## Context
- **Use case:** ~100+ operators, each with 2 devices (BlastPilot + BlastGS-Agent)
- **Workflow:** Ticket-based onboarding, engineer creates PR, merge triggers setup key creation
- **Current pain:** Manual setup key creation and peer renaming in dashboard
---
## Pain Point 1: Peer Naming After Enrollment
### Problem
When a peer enrolls using a setup key, it appears in the NetBird dashboard with its hostname (e.g., `DESKTOP-ABC123` or `raspberrypi`). These hostnames are:
- Often generic and meaningless
- Not controllable via IaC (the peer generates its own keypair and reports its own hostname at registration)
- Confusing when managing 100+ devices

**Desired state:** The peer appears as `pilot-ivanov` or `gs-unit-042` immediately after enrollment.
### Root Cause
NetBird's architecture requires peers to self-enroll:
1. Setup key defines which groups the peer joins
2. Peer runs `netbird up --setup-key <key>`
3. Peer generates WireGuard keypair locally
4. Peer registers with management server using its local hostname
5. **No API link between "which setup key was used" and "which peer enrolled"**
### Options
| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual rename** | Engineer renames peer in dashboard after enrollment | Zero | 30 seconds per device, human in loop |
| **B. Polling service** | Service watches for new peers, matches by timing/IP, renames | Medium | More infrastructure, heuristic matching |
| **C. Per-user tracking groups** | Unique group per user, find peer by group membership | High | Group sprawl, cleanup needed |
| **D. Installer modification** | Modify BlastPilot/BlastGS-Agent to set hostname before enrollment | N/A | Code freeze constraint |
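
As an illustration of Option C, the sketch below gives one operator a dedicated tracking group and attaches it to that operator's setup key, so the enrolled peer can later be found by group membership. The resource names `track_ivanov` and `pilot_ivanov` are hypothetical, and the `name` argument on `netbird_group` is an assumption; check both against the provider in use.

```hcl
# Hypothetical per-user tracking group. The peer that enrolls with the key
# below should be its only member, which identifies the peer.
resource "netbird_group" "track_ivanov" {
  name = "track-ivanov" # argument name assumed
}

resource "netbird_setup_key" "pilot_ivanov" {
  name        = "pilot-ivanov"
  type        = "one-off"
  usage_limit = 1
  auto_groups = [
    netbird_group.pilots.id,       # role group
    netbird_group.track_ivanov.id, # tracking group for this operator
  ]
}
```

Each onboarding adds a group that has to be removed once the peer is renamed, which is the group sprawl noted in the table.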
### Recommendation
**Option A** is acceptable for ~100 operators with a ticket-based workflow:
- Ticket arrives -> engineer creates PR -> PR merges -> engineer sends setup key -> operator enrolls -> **engineer renames peer (30 sec)**
- Total engineer time per onboarding: ~5 minutes
- No additional infrastructure

**Option B** is worth considering if:
- Onboarding volume increases significantly
- Full automation is required (no human in loop)
---
## Pain Point 2: Per-User vs Per-Role Setup Keys
### Current State
Setup keys are defined per-role in `terraform/setup_keys.tf`:
```hcl
resource "netbird_setup_key" "gs_onboarding" {
  name        = "ground-station-onboarding"
  type        = "reusable"
  auto_groups = [netbird_group.ground_stations.id]
  ...
}
```
This means:
- One reusable key per role
- Key is shared across all operators of that role
- No way to track "this key was issued to Ivanov"
### Problems
1. **No audit trail** - Can't answer "who enrolled device X?"
2. **Revocation is all-or-nothing** - Revoking `pilot-onboarding` affects everyone
3. **No usage attribution** - Can't enforce "one device per operator"
### Options
| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Accept per-role keys** | Current state, manual tracking in ticket system | Zero | No IaC-level audit trail |
| **B. Per-user setup keys** | Create key per onboarding request | Low | More keys to manage, cleanup needed |
| **C. One-off keys** | Each key has `usage_limit = 1` | Low | Key is consumed after a single use, good for audit |
### Recommendation
**Option C (one-off keys)** provides the best tradeoff:
- Create unique key per onboarding ticket
- Key becomes unusable after its first use
- Clear audit trail: key name links to ticket number
- Easy to implement:
```hcl
# Example: ticket-based one-off key
resource "netbird_setup_key" "ticket_1234_pilot" {
  name        = "ticket-1234-pilot-ivanov"
  type        = "one-off"
  auto_groups = [netbird_group.pilots.id]
  usage_limit = 1
  ephemeral   = false
}
```
**Workflow:**
1. Ticket ACHILLES-1234: "Onboard pilot Ivanov"
2. Engineer adds setup key `ticket-1234-pilot-ivanov` to Terraform
3. PR merged, key created
4. Engineer sends key to operator (see Pain Point 3)
5. Operator enrolls with the key, which is then consumed
6. After enrollment, engineer renames peer to `pilot-ivanov`
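
To keep each onboarding PR to a small change, the per-ticket keys could be generated from a map instead of one resource block per ticket. This is a minimal sketch, not part of the POC: the `onboarding` and `role_groups` locals and the resource name `ticket` are hypothetical, and the key attributes mirror the example above.

```hcl
locals {
  # Hypothetical registry: one entry per onboarding ticket and device.
  onboarding = {
    "1234-pilot" = { operator = "ivanov", role = "pilot" }
    "1234-gs"    = { operator = "ivanov", role = "gs" }
  }

  # Maps a role to the group its peers should join.
  role_groups = {
    pilot = netbird_group.pilots.id
    gs    = netbird_group.ground_stations.id
  }
}

resource "netbird_setup_key" "ticket" {
  for_each = local.onboarding

  # Produces names such as "ticket-1234-pilot-ivanov".
  name        = "ticket-${each.key}-${each.value.operator}"
  type        = "one-off"
  auto_groups = [local.role_groups[each.value.role]]
  usage_limit = 1
  ephemeral   = false
}
```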
---
## Pain Point 3: Secure Key Distribution
### Problem
After CI/CD creates a setup key, how does it reach the operator?
Setup keys are sensitive:
- Anyone with the key can enroll a device into the network
- Keys may be reusable (depends on configuration)
- Keys should be transmitted securely
### Current State
Setup keys are output by Terraform:
```bash
terraform output -raw gs_setup_key
```
But:
- Requires local Terraform access
- No automated distribution mechanism
- Keys are stored in plaintext in the Terraform state file, which the POC commits to git; this needs to change before production use
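
Two of these points can be eased with standard Terraform features. The sketch below marks the key output as sensitive and moves state to a remote backend so it no longer lives in git. The `key` attribute name, the choice of the S3 backend, and the bucket, path, and region values are assumptions for illustration. The key is still stored in the state itself, so access to the backend has to be restricted.

```hcl
# Marking the output sensitive redacts it in plan/apply logs;
# `terraform output -raw gs_setup_key` still prints the value.
output "gs_setup_key" {
  value     = netbird_setup_key.gs_onboarding.key # attribute name assumed
  sensitive = true
}

# Remote, encrypted state instead of a state file committed to git.
# Backend type and all values below are placeholders.
terraform {
  backend "s3" {
    bucket  = "example-netbird-tfstate"
    key     = "netbird/terraform.tfstate"
    region  = "eu-central-1"
    encrypt = true
  }
}
```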
### Options
| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual retrieval** | Engineer runs `terraform output` locally | Zero | Requires CLI access, manual process |
| **B. CI output to ticket** | CI posts key to ticket system via API | Medium | Keys in ticket history (audit trail) |
| **C. Secrets manager** | Store keys in Vault/1Password, notify engineer | Medium | Another system to integrate |
| **D. Encrypted email** | CI encrypts key, emails to operator | High | Key management complexity |
### Recommendation
**Option A** for now (consistent with manual rename):
- Engineer retrieves key after CI completes
- Engineer sends key to operator via secure channel (Signal, encrypted email)
- Ticket updated with "key sent" status

**Option B** is worth implementing if:
- Volume increases
- Full automation is required
- Ticket system has secure "hidden fields" feature
---
## Summary: Recommended Workflow
Given the constraints (code freeze, ~100 operators, ticket-based), the pragmatic workflow is:
```
1. Ticket created: "Onboard pilot Ivanov with BlastPilot + GS"
2. Engineer adds to Terraform:
   - ticket-1234-pilot (one-off, 7 days)
   - ticket-1234-gs (one-off, 7 days)
3. Engineer creates PR, gets review, merges
4. CI/CD applies changes, keys created
5. Engineer retrieves keys:
   terraform output -raw ticket_1234_pilot_key
6. Engineer sends keys to operator via secure channel
7. Operator enrolls both devices
8. Engineer renames peers in dashboard:
   DESKTOP-ABC123 -> pilot-ivanov
   raspberrypi    -> gs-ivanov
9. Engineer closes ticket
```
- **Total engineer time:** ~10 minutes per onboarding (pair of devices)
- **Automation level:** Groups, policies, key creation automated; naming and distribution manual
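
Step 2 of the workflow might look like the following for one ticket. This is a sketch under assumptions: `expiry_seconds` is assumed to be the provider's argument for key lifetime (604800 seconds is 7 days), the `key` attribute is assumed to expose the secret, and the resource names are hypothetical.

```hcl
resource "netbird_setup_key" "ticket_1234_pilot" {
  name           = "ticket-1234-pilot"
  type           = "one-off"
  usage_limit    = 1
  expiry_seconds = 604800 # 7 days; argument name assumed
  auto_groups    = [netbird_group.pilots.id]
}

resource "netbird_setup_key" "ticket_1234_gs" {
  name           = "ticket-1234-gs"
  type           = "one-off"
  usage_limit    = 1
  expiry_seconds = 604800 # 7 days; argument name assumed
  auto_groups    = [netbird_group.ground_stations.id]
}

# Read in step 5 with: terraform output -raw ticket_1234_pilot_key
output "ticket_1234_pilot_key" {
  value     = netbird_setup_key.ticket_1234_pilot.key # attribute name assumed
  sensitive = true
}
```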
---
## Future Improvements (If Needed)
1. **Webhook listener** for peer enrollment events -> auto-rename based on timing correlation
2. **Ticket system integration** for automated key distribution
3. **Custom installer** that prompts for device name before enrollment
4. **Batch onboarding tool** for multiple operators at once
These can be addressed incrementally as the operation scales.