Complete monitoring stack: Zabbix, Prometheus, Grafana and Loki

Context

To operate and maintain the homelab to a professional standard, I deployed a complete monitoring stack in Debian 12 LXC containers on Proxmox. The goal was full visibility over network equipment, hypervisors and applications, with centralised log collection, as in a production environment.

Monitoring architecture

Network devices (SNMP) ──→ Zabbix 7.0.26 (LXC 105) ──→ Dashboards / alerts
MariaDB (LXC 104) ──────→ Zabbix DB backend
PVE-01/02 hypervisors ──→ node_exporter :9100   ──→ Prometheus 3.5.3 (LXC 106)
                     └──→ pve_exporter :9221    ──→        ↓
HTTP/TCP/ICMP probes ───→ blackbox_exporter :9115 → Grafana 13.0.1 (LXC 108)
journald logs PVE-01/02 → Grafana Alloy 1.16.1  ──→ Loki 3.6.7 (LXC 107) → Grafana

Dedicated LXC containers (VLAN 10 MGMT)

LXC	CTID	VLAN	Service	Version
`lxc-mariadb-01`	104	VLAN 10 MGMT	Zabbix database	MariaDB
`lxc-zabbix-01`	105	VLAN 10 MGMT	Zabbix Server + Frontend	7.0.26
`lxc-prometheus-01`	106	VLAN 10 MGMT	Prometheus	3.5.3 LTS
`lxc-loki-01`	107	VLAN 10 MGMT	Loki	3.6.7
`lxc-grafana-01`	108	VLAN 10 MGMT	Grafana	13.0.1

What I did

Zabbix 7.0.26 — SNMP network monitoring

Deployed Zabbix Server + Apache frontend + MariaDB in separate LXC containers (decoupled)
Fixed a tricky issue: the zabbix MariaDB user is granted on the Zabbix LXC IP address (not the % wildcard), so MariaDB needed skip-name-resolve to match connections by IP and avoid reverse-DNS lookup timeouts
Upgraded to 7.0.26 by reinstalling the official apt repository and importing the full SQL schema
SNMP configured on Cisco 3560-CX, D-Link DGS-1210-08P and pfSense (restricted community string)
Applied automatic Zabbix templates per device type
Tuned triggers to eliminate false positives (LLD interface thresholds, verbose alerts)

Prometheus 3.5.3 — hypervisor and service metrics

Deployed with a systemd unit and lifecycle API enabled (--web.enable-lifecycle)
Configured 12 scrape jobs (job_name entries) in prometheus.yml, 37 targets in total, including:
- node: PVE-01 + PVE-02 via node_exporter (port 9100)
- pve: PVE-01 + PVE-02 via pve_exporter (port 9221, with relabeling)
- blackbox_icmp: 6 targets (gateways + 1.1.1.1)
- blackbox_https: 15 NPM + Cloudflare endpoints
- blackbox_dns_*: 6 DNS jobs (AdGuard, Unbound, Cloudflare)
Fixed a gotcha: the --web.enable-lifecycle flag was missing from the systemd unit, so the HTTP reload endpoint (/-/reload) was disabled, making systemctl reload a no-op
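
The relabeling on the pve job can be illustrated with a small simulation. This is a minimal sketch, assuming the job follows the usual proxy-exporter pattern (the node is passed as a URL parameter and the scrape itself goes to the exporter). The exporter address 127.0.0.1:9221, the /pve metrics path and both function names are illustrative assumptions, not values taken from the actual prometheus.yml.

```python
from urllib.parse import urlencode


def relabel_pve_target(target: str, exporter_addr: str) -> dict:
    """Mimic three relabel rules commonly used for proxy-style exporters."""
    labels = {"__address__": target}
    # Rule 1: copy the configured target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the real node name as the instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at the exporter (assumed address).
    labels["__address__"] = exporter_addr
    return labels


def scrape_url(labels: dict, metrics_path: str = "/pve") -> str:
    """Build the URL that would be requested after relabeling."""
    query = urlencode({"target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


if __name__ == "__main__":
    for node in ("pve-01", "pve-02"):
        labels = relabel_pve_target(node, "127.0.0.1:9221")
        print(labels["instance"], "->", scrape_url(labels))
```

The point of the pattern is that dashboards keep seeing instance=pve-01 or instance=pve-02, even though every scrape is physically sent to the exporter on port 9221.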

Grafana 13.0.1 — unified visualisation

Configured Prometheus and Loki as data sources
Imported and customised dashboards: Proxmox (pve_exporter), network, endpoint availability
LogQL queries in Grafana to filter Proxmox logs by systemd unit
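
As an illustration of the per-unit filtering, the sketch below builds LogQL stream selectors programmatically. It assumes the systemd unit is exposed as a label named unit, which depends on how the collector relabels journal fields. The host label appears in this stack, but the unit label name, the pvedaemon.service example and the helper names are hypothetical.

```python
from typing import Optional


def _quote(value: str) -> str:
    """Quote a string for LogQL, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(labels: dict, contains: Optional[str] = None) -> str:
    """Build a stream selector with exact-match labels and an optional line filter."""
    matchers = ", ".join(f"{name}={_quote(value)}" for name, value in labels.items())
    query = "{" + matchers + "}"
    if contains is not None:
        # |= keeps only the log lines that contain the given substring.
        query += " |= " + _quote(contains)
    return query


if __name__ == "__main__":
    # "unit" is an assumed label name; adjust to the labels actually present in Loki.
    print(build_logql({"host": "pve-01", "unit": "pvedaemon.service"}))
    print(build_logql({"host": "pve-02"}, contains="error"))
```

The resulting strings can be pasted into the Explore view in Grafana or used as a panel query.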

Loki 3.6.7 + Grafana Alloy 1.16.1 — log centralisation

Loki deployed in monolithic single-binary mode (schema v13/TSDB/filesystem, 30-day retention)
Dropped Promtail (EOL March 2026) in favour of Grafana Alloy (official successor, OpenTelemetry-based)
Purged residual Promtail installs from 10 LXC containers, 1 VM and 2 hosts (containers handled in bulk via pct exec)
Deployed Grafana Alloy v1.16.1 on PVE-01 and PVE-02: collects the full systemd journal (Proxmox, kernel, VM/CT tasks)
End-to-end validation: host=pve-01 / host=pve-02 labels present in Loki, LogQL queries operational
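
The label check can be automated against Loki's label-values endpoint. The following is a minimal sketch, assuming the response has the usual status/data shape. The sample payload is hand-written for illustration (not captured output), and the function names are hypothetical.

```python
from urllib.parse import quote


def label_values_path(label: str) -> str:
    """Path of Loki's label-values endpoint for one label name."""
    return f"/loki/api/v1/label/{quote(label, safe='')}/values"


def missing_hosts(payload: dict, expected: set) -> set:
    """Return the expected hosts that are absent from a label-values response."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected Loki status: {payload.get('status')!r}")
    # "data" may be missing or null when the label has no values yet.
    return set(expected) - set(payload.get("data") or [])


if __name__ == "__main__":
    # Hand-written sample shaped like a label-values response.
    sample = {"status": "success", "data": ["pve-01", "pve-02"]}
    print(label_values_path("host"))
    print(missing_hosts(sample, {"pve-01", "pve-02"}))
```

An empty result means both hypervisors are shipping logs. Any host name returned points at a collector that is not pushing to Loki.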

Skills covered

This project covers infrastructure monitoring and operations (B2.4), IT asset management via Zabbix auto-discovery (B1.1), availability and integrity assurance through centralised logs (B3.4) and network anomaly detection via Prometheus blackbox probes (B3.5).
