# NetBird GitOps - Remaining Pain Points

This document captures challenges discovered during the POC that need resolution before production use.

## Context

- **Use case:** ~100+ operators, each with 2 devices (BlastPilot + BlastGS-Agent)
- **Workflow:** Ticket-based onboarding, engineer creates PR, merge triggers setup key creation
- **Current pain:** Manual setup key creation and peer renaming in dashboard

---

## Pain Point 1: Peer Naming After Enrollment

### Problem

When a peer enrolls using a setup key, it appears in the NetBird dashboard with its hostname (e.g., `DESKTOP-ABC123` or `raspberrypi`). These hostnames are:
- Often generic and meaningless
- Not controllable via IaC (the peer self-registers and reports its own local hostname)
- Confusing when managing 100+ devices

**Desired state:** Peer appears as `pilot-ivanov` or `gs-unit-042` immediately after enrollment.

### Root Cause

NetBird's architecture requires peers to self-enroll:
1. Setup key defines which groups the peer joins
2. Peer runs `netbird up --setup-key <key>`
3. Peer generates WireGuard keypair locally
4. Peer registers with management server using its local hostname
5. **No API link between "which setup key was used" and "which peer enrolled"**

### Options

| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual rename** | Engineer renames peer in dashboard after enrollment | Zero | 30 seconds per device, human in loop |
| **B. Polling service** | Service watches for new peers, matches by timing/IP, renames | Medium | More infrastructure, heuristic matching |
| **C. Per-user tracking groups** | Unique group per user, find peer by group membership | High | Group sprawl, cleanup needed |
| **D. Installer modification** | Modify BlastPilot/BlastGS-Agent to set hostname before enrollment | N/A | Code freeze constraint |

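
To make the "heuristic matching" tradeoff of Option B concrete, here is a minimal sketch of timing-based correlation. It is not an existing service: the `IssuedKey` and `NewPeer` records, the `plan_renames` function, and the 24-hour window are all hypothetical, and fetching peers from NetBird or applying the rename is deliberately left out. A peer is renamed only when exactly one unclaimed key was sent shortly before it enrolled. Note that when two keys are sent together (as with a pilot/GS pair), timing alone cannot tell the devices apart, so both fall through to manual review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IssuedKey:
    """A setup key sent to an operator (hypothetical record, e.g. from the ticket)."""
    desired_name: str   # name the peer should end up with, e.g. "pilot-ivanov"
    sent_at: datetime   # when the engineer sent the key


@dataclass(frozen=True)
class NewPeer:
    """A newly enrolled peer as a polling service might see it (hypothetical shape)."""
    peer_id: str
    hostname: str       # self-reported, e.g. "DESKTOP-ABC123"
    enrolled_at: datetime


def plan_renames(keys, peers, window=timedelta(hours=24)):
    """Return ({peer_id: desired_name}, [peers needing manual review]).

    A peer is matched only if exactly one unclaimed key was sent within
    `window` before its enrollment. Zero or several candidates is ambiguous.
    """
    renames = {}
    unresolved = []
    unclaimed = list(keys)
    for peer in sorted(peers, key=lambda p: p.enrolled_at):
        candidates = [
            k for k in unclaimed
            if timedelta(0) <= peer.enrolled_at - k.sent_at <= window
        ]
        if len(candidates) == 1:
            renames[peer.peer_id] = candidates[0].desired_name
            unclaimed.remove(candidates[0])  # each key names at most one peer
        else:
            unresolved.append(peer)
    return renames, unresolved
```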
### Recommendation

**Option A** is acceptable for ~100 operators with a ticket-based workflow:
- Ticket arrives -> engineer creates PR -> PR merges -> engineer sends setup key -> operator enrolls -> **engineer renames peer (30 sec)**
- Total engineer time per onboarding: ~5 minutes
- No additional infrastructure

**Option B** is worth considering if:
- Onboarding volume increases significantly
- Full automation is required (no human in loop)

---

## Pain Point 2: Per-User vs Per-Role Setup Keys

### Current State

Setup keys are defined per-role in `terraform/setup_keys.tf`:
```hcl
resource "netbird_setup_key" "gs_onboarding" {
  name        = "ground-station-onboarding"
  type        = "reusable"
  auto_groups = [netbird_group.ground_stations.id]
  ...
}
```

This means:
- One reusable key per role
- Key is shared across all operators of that role
- No way to track "this key was issued to Ivanov"

### Problems

1. **No audit trail** - Can't answer "who enrolled device X?"
2. **Revocation is all-or-nothing** - Revoking `pilot-onboarding` affects everyone
3. **No usage attribution** - Can't enforce "one device per operator"

### Options

| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Accept per-role keys** | Current state, manual tracking in ticket system | Zero | No IaC-level audit trail |
| **B. Per-user setup keys** | Create key per onboarding request | Low | More keys to manage, cleanup needed |
| **C. One-off keys** | Each key has `usage_limit = 1` | Low | Key is consumed on first use, good for audit |

### Recommendation

**Option C (one-off keys)** provides the best tradeoff:
- Create unique key per onboarding ticket
- Key is consumed on first use and cannot enroll a second device
- Clear audit trail: key name links to ticket number
- Easy to implement:

```hcl
# Example: ticket-based one-off key
resource "netbird_setup_key" "ticket_1234_pilot" {
  name        = "ticket-1234-pilot-ivanov"
  type        = "one-off"
  auto_groups = [netbird_group.pilots.id]
  usage_limit = 1
  ephemeral   = false
}
```

**Workflow:**
1. Ticket ACHILLES-1234: "Onboard pilot Ivanov"
2. Engineer adds setup key `ticket-1234-pilot-ivanov` to Terraform
3. PR merged, key created
4. Engineer sends key to operator (see Pain Point 3)
5. Operator uses key, it's consumed
6. After enrollment, engineer renames peer to `pilot-ivanov`

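
The audit trail in this workflow depends on every key following the same naming pattern. As a rough sketch, the names used in this document (`ticket-1234-pilot-ivanov` for the key, `ticket_1234_pilot` for the Terraform resource label, `pilot-ivanov` for the renamed peer) could be derived from the ticket in one place. The helper names `slug` and `key_names` are hypothetical, and the pattern is inferred from the examples here rather than from any enforced convention.

```python
import re

_NON_SLUG = re.compile(r"[^a-z0-9]+")


def slug(text: str) -> str:
    """Lowercase and replace runs of other characters with a single hyphen."""
    return _NON_SLUG.sub("-", text.lower()).strip("-")


def key_names(ticket: int, role: str, operator: str) -> dict:
    """Derive the names for one one-off key from the onboarding ticket.

    Pattern assumed from this document's examples:
      key name        ticket-1234-pilot-ivanov
      resource label  ticket_1234_pilot
      peer name       pilot-ivanov
    """
    role_s, op_s = slug(role), slug(operator)
    return {
        "key_name": f"ticket-{ticket}-{role_s}-{op_s}",
        # Terraform resource labels conventionally use underscores.
        "resource_label": f"ticket_{ticket}_{role_s}".replace("-", "_"),
        "peer_name": f"{role_s}-{op_s}",
    }
```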
---

## Pain Point 3: Secure Key Distribution

### Problem

After CI/CD creates a setup key, how does it reach the operator?

Setup keys are sensitive:
- Anyone with the key can enroll a device into the network
- Keys may be reusable (depends on configuration)
- Keys should be transmitted securely

### Current State

Setup keys are output by Terraform:
```bash
terraform output -raw gs_setup_key
```

But:
- Requires local Terraform access
- No automated distribution mechanism
- Keys are stored in the state file (committed to git in the POC, which is not ideal)

### Options

| Option | Description | Effort | Tradeoffs |
|--------|-------------|--------|-----------|
| **A. Manual retrieval** | Engineer runs `terraform output` locally | Zero | Requires CLI access, manual process |
| **B. CI output to ticket** | CI posts key to ticket system via API | Medium | Keys in ticket history (audit trail) |
| **C. Secrets manager** | Store keys in Vault/1Password, notify engineer | Medium | Another system to integrate |
| **D. Encrypted email** | CI encrypts key, emails to operator | High | Key management complexity |

### Recommendation

**Option A** for now (consistent with manual rename):
- Engineer retrieves key after CI completes
- Engineer sends key to operator via secure channel (Signal, encrypted email)
- Ticket updated with "key sent" status

**Option B** is worth implementing if:
- Volume increases
- Full automation is wanted
- Ticket system has a secure "hidden fields" feature

---

## Summary: Recommended Workflow

Given the constraints (code freeze, ~100 operators, ticket-based), the pragmatic workflow is:

```
1. Ticket created: "Onboard pilot Ivanov with BlastPilot + GS"

2. Engineer adds to Terraform:
   - ticket-1234-pilot (one-off, 7 days)
   - ticket-1234-gs (one-off, 7 days)

3. Engineer creates PR, gets review, merges

4. CI/CD applies changes, keys created

5. Engineer retrieves keys:
   terraform output -raw ticket_1234_pilot_key

6. Engineer sends keys to operator via secure channel

7. Operator enrolls both devices

8. Engineer renames peers in dashboard:
   DESKTOP-ABC123 -> pilot-ivanov
   raspberrypi -> gs-ivanov

9. Engineer closes ticket
```

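
Step 5 covers two keys per onboarding, so the retrieval could be wrapped in a small script. The following is only a sketch: it assumes outputs named `ticket_<n>_<role>_key` (the `gs` output name is inferred by analogy with `ticket_1234_pilot_key`), assumes it is run from the Terraform directory with access to the state, and uses the hypothetical helper names `output_command` and `fetch_keys`. The key values are returned to the caller and never printed.

```python
import subprocess


def output_command(ticket: int, role: str) -> list:
    """Build the argv for reading one key from Terraform outputs.

    Assumes an output named ticket_<n>_<role>_key exists.
    """
    return ["terraform", "output", "-raw", f"ticket_{ticket}_{role}_key"]


def fetch_keys(ticket: int, roles=("pilot", "gs"), run=subprocess.run) -> dict:
    """Return {role: setup_key} for one onboarding ticket.

    `run` is injectable so the function can be exercised without Terraform.
    check=True raises CalledProcessError if an output is missing.
    """
    keys = {}
    for role in roles:
        result = run(
            output_command(ticket, role),
            capture_output=True,
            text=True,
            check=True,
        )
        keys[role] = result.stdout.strip()
    return keys
```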
- **Total engineer time:** ~10 minutes per onboarding (pair of devices)
- **Automation level:** Groups, policies, key creation automated; naming and distribution manual

---

## Future Improvements (If Needed)

1. **Webhook listener** for peer enrollment events -> auto-rename based on timing correlation
2. **Ticket system integration** for automated key distribution
3. **Custom installer** that prompts for device name before enrollment
4. **Batch onboarding tool** for multiple operators at once

These can be addressed incrementally as the operation scales.