Context
To operate and maintain the homelab professionally, I deployed a complete monitoring stack in Debian 12 LXC containers on Proxmox. The goal: full visibility over network equipment, hypervisors and applications, with centralised log collection — comparable to a production environment.
Monitoring architecture
Network devices (SNMP) ──→ Zabbix 7.0.26 (LXC 105) ──→ Dashboards / alerts
MariaDB (LXC 104) ──────→ Zabbix DB backend
PVE-01/02 hypervisors ──→ node_exporter :9100 ──→ Prometheus 3.5.3 (LXC 106)
└──→ pve_exporter :9221 ──→ ↓
HTTP/TCP/ICMP probes ───→ blackbox_exporter :9115 → Grafana 13.0.1 (LXC 108)
journald logs PVE-01/02 → Grafana Alloy 1.16.1 ──→ Loki 3.6.7 (LXC 107) → Grafana
Dedicated LXC containers (VLAN 10 MGMT)
| LXC | CTID | VLAN | Service | Version |
|---|---|---|---|---|
| lxc-mariadb-01 | 104 | VLAN 10 MGMT | Zabbix database | MariaDB |
| lxc-zabbix-01 | 105 | VLAN 10 MGMT | Zabbix Server + Frontend | 7.0.26 |
| lxc-prometheus-01 | 106 | VLAN 10 MGMT | Prometheus | 3.5.3 LTS |
| lxc-loki-01 | 107 | VLAN 10 MGMT | Loki | 3.6.7 |
| lxc-grafana-01 | 108 | VLAN 10 MGMT | Grafana | 13.0.1 |
What I did
Zabbix 7.0.26 — SNMP network monitoring
- Deployed Zabbix Server + Apache frontend + MariaDB in separate LXC containers (decoupled)
- Fixed a tricky issue: the `zabbix` MariaDB user was bound to the Zabbix LXC IP (not the wildcard `%`), requiring `skip-name-resolve` on MariaDB to avoid reverse-DNS lookup timeouts
- Upgraded to 7.0.26: reinstalled the official apt repo, full SQL schema import
- SNMP configured on Cisco 3560-CX, D-Link DGS-1210-08P and pfSense (restricted community string)
- Applied automatic Zabbix templates per device type
- Tuned triggers to eliminate false positives (LLD interface thresholds, verbose alerts)
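The host-bound account issue above can be illustrated with a small sketch. It renders the SQL for a MariaDB account tied to a single client IP. The IP `10.0.10.105`, the database name `zabbix` and the helper name `render_host_bound_grant` are assumptions for illustration, not values from the actual deployment. With `skip-name-resolve` set on the server, MariaDB matches such accounts on the IP alone and skips the reverse-DNS lookup.

```python
def render_host_bound_grant(user: str, client_ip: str, database: str) -> list[str]:
    """Return SQL statements for an account that may only connect from client_ip.

    The password is deliberately left as a placeholder; all values passed in
    are illustrative, not taken from the real deployment.
    """
    account = f"'{user}'@'{client_ip}'"
    return [
        f"CREATE USER IF NOT EXISTS {account} IDENTIFIED BY '<password>';",
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO {account};",
    ]


# Hypothetical values: the Zabbix LXC address is assumed, not from the source.
statements = render_host_bound_grant("zabbix", "10.0.10.105", "zabbix")
for line in statements:
    print(line)
```

Binding the account to one IP rather than `%` narrows who can authenticate, at the cost of having to update the grant if the Zabbix container is re-addressed.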
Prometheus 3.5.3 — hypervisor and service metrics
- Deployed with a systemd unit and lifecycle API enabled (`--web.enable-lifecycle`)
- Configured 12 job_names in `prometheus.yml` for 37 targets total:
  - `node`: PVE-01 + PVE-02 via `node_exporter` (port 9100)
  - `pve`: PVE-01 + PVE-02 via `pve_exporter` (port 9221, with relabeling)
  - `blackbox_icmp`: 6 targets (gateways + 1.1.1.1)
  - `blackbox_https`: 15 NPM + Cloudflare endpoints
  - `blackbox_dns_*`: 6 DNS jobs (AdGuard, Unbound, Cloudflare)
- Fixed a gotcha: the `--web.enable-lifecycle` flag was missing from the systemd unit, making `systemctl reload` a no-op
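The lifecycle gotcha is easier to see from the client side. The sketch below builds the HTTP request that a reload hook would send to Prometheus' `/-/reload` endpoint, which only works when the server runs with `--web.enable-lifecycle`. The address `http://127.0.0.1:9090/` and both function names are assumptions. The source does not show how the systemd unit actually triggers the reload.

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for Prometheus' /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # Prometheus only accepts POST or PUT on this endpoint.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Assumed address of the Prometheus instance; replace with the real one.
request = build_reload_request("http://127.0.0.1:9090/")
print(request.get_method(), request.full_url)
```

Calling `reload_prometheus(...)` against a server started without the flag raises `urllib.error.HTTPError`, because Prometheus rejects the request. That makes the missing flag visible instead of silently leaving the old configuration in place.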
Grafana 13.0.1 — unified visualisation
- Datasources: Prometheus and Loki configured
- Imported and customised dashboards: Proxmox (pve_exporter), network, endpoint availability
- LogQL queries in Grafana to filter Proxmox logs by systemd unit
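As a rough illustration of the LogQL filtering mentioned above, the helper below assembles a stream selector with an optional line filter. The label name `unit`, the unit `pvedaemon.service` and the function name `logql_selector` are assumptions. The `unit` label only exists if the collector maps the journal's systemd unit field to a Loki label.

```python
from typing import Optional


def logql_selector(labels: dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL stream selector, optionally followed by a line filter."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes for a LogQL string literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    matchers = ", ".join(f"{name}={quote(value)}" for name, value in labels.items())
    query = "{" + matchers + "}"
    if contains is not None:
        query += " |= " + quote(contains)
    return query


# `unit` is an assumed label name, see the note above.
query = logql_selector({"host": "pve-01", "unit": "pvedaemon.service"}, contains="error")
print(query)
```

The resulting string can be pasted into Grafana's Explore view with the Loki datasource selected.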
Loki 3.6.7 + Grafana Alloy 1.16.1 — log centralisation
- Loki deployed in monolithic single-binary mode (schema v13/TSDB/filesystem, 30-day retention)
- Dropped Promtail (EOL March 2026) in favour of Grafana Alloy (official successor, OpenTelemetry-based)
- Mass-purged residual Promtail from 10 LXC + 1 VM + 2 hosts via `pct exec`
- Deployed Grafana Alloy v1.16.1 on PVE-01 and PVE-02: collects the full systemd journal (Proxmox, kernel, VM/CT tasks)
- End-to-end validation: `host=pve-01` / `host=pve-02` labels present in Loki, LogQL queries operational
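The label check in the validation step could be scripted along these lines. The sketch parses a response body shaped like the one Loki returns from `GET /loki/api/v1/label/host/values` and reports which expected hosts are missing. The sample body is hand-written for illustration, and the function name `missing_hosts` is an assumption.

```python
import json


def missing_hosts(label_values_json: str, expected: set[str]) -> set[str]:
    """Return the expected hosts that are absent from a Loki label-values response."""
    payload = json.loads(label_values_json)
    if payload.get("status") != "success":
        raise ValueError("Loki did not report success")
    return expected - set(payload.get("data", []))


# Hand-written sample body, not captured from the real Loki instance.
sample = '{"status": "success", "data": ["pve-01", "pve-02"]}'
missing = missing_hosts(sample, {"pve-01", "pve-02"})
print(missing or "all expected hosts present")
```

An empty result means both hypervisors are shipping logs under the `host` label. A non-empty set points at the host whose Alloy instance needs checking.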
Skills covered
This project covers infrastructure monitoring and operations (B2.4), IT asset management via Zabbix auto-discovery (B1.1), availability and integrity assurance through centralised logs (B3.4) and network anomaly detection via Prometheus blackbox probes (B3.5).