Monitoring Your On-Prem Infrastructure Without the Cloud Price Tag
You moved off the cloud to save money. Then you looked at Datadog pricing and realized you might spend half those savings just watching your servers blink. Per-host fees, per-GB log ingestion charges, per-metric custom billing — cloud monitoring vendors have perfected the art of making observability expensive.
Here is the thing: on-premise server monitoring with open source tools is not just possible, it is better for most Nordic SMBs. I run Prometheus, Grafana, and Loki across a four-server on-prem environment for a client right now. The monitoring stack costs exactly zero in licensing. It handles everything we need — metrics, dashboards, logs, and alerts — without sending a single byte to a third-party SaaS.
Why Prometheus + Grafana + Loki Wins for SMBs
There are dozens of open source monitoring tools. I have tried more than I care to admit. This trio keeps winning because each tool does one thing well and they integrate cleanly.
Prometheus scrapes metrics. It pulls data from your services on a schedule, stores it in a time-series database, and evaluates alerting rules. No agents to install on every host — just expose a `/metrics` endpoint and point Prometheus at it. For .NET applications, the prometheus-net library adds an endpoint in about five lines of code.
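Pointing Prometheus at those endpoints is a short config file. A minimal `prometheus.yml` sketch for a few static targets — hostnames and ports here are placeholders, not from any real setup:

```yaml
# prometheus.yml — minimal static scrape config (hostnames are placeholders)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'                 # Node Exporter on each host
    static_configs:
      - targets: ['host1:9100', 'host2:9100']
  - job_name: 'my-api'               # .NET app exposing /metrics via prometheus-net
    static_configs:
      - targets: ['app-vm:5000']
```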
Grafana visualizes everything. Dashboards, graphs, heatmaps, alert state — all in one place. It connects to Prometheus for metrics and Loki for logs, giving your team a single UI for the whole stack. The community dashboard library has pre-built panels for PostgreSQL, Docker, Nginx, .NET, and basically everything else you are running.
Loki handles logs. Think of it as "Prometheus, but for logs." It indexes metadata (labels) rather than full-text indexing every log line, which means it uses a fraction of the storage that Elasticsearch demands. Pair it with Promtail as the log shipper and you have centralized logging without the operational headache of running an ELK stack.
The total resource footprint for this stack on a typical SMB setup: one Docker Compose file, roughly 2 GB of RAM, and negligible CPU. I run it on a TOOLS VM alongside Harbor, SonarQube, and CI/CD agents with room to spare.
What this costs vs. the alternatives
| Solution | Monthly cost (20 hosts, 50 GB logs) |
|---|---|
| Datadog Pro | €1,200–€1,800 |
| New Relic (Pro, beyond free tier) | €800–€1,400 |
| Elastic Cloud | €400–€700 |
| Prometheus + Grafana + Loki (self-hosted) | €0 |
The "cost" of self-hosting is your time. For the initial setup, budget a day. After that, it runs itself. I spend maybe an hour a month on maintenance — mostly upgrading containers to new versions.
The 4 Golden Signals Applied to On-Prem
Google's SRE book defined the four golden signals: latency, traffic, errors, and saturation. They were written for cloud-scale systems, but they translate perfectly to on-prem SMB infrastructure. Here is what I actually monitor.
Latency
Track request duration at the application level, not just ping times. For .NET APIs, expose histogram metrics via prometheus-net:
```csharp
private static readonly Histogram RequestDuration = Metrics
    .CreateHistogram("http_request_duration_seconds",
        "Request duration in seconds",
        new HistogramConfiguration
        {
            LabelNames = new[] { "method", "endpoint", "status_code" },
            Buckets = Histogram.LinearBuckets(0.05, 0.05, 20)
        });
```
In Grafana, set up a panel showing the 95th percentile response time. If p95 crosses 500ms for your API, something needs attention. I alert at 1 second sustained over 5 minutes — that catches real problems without firing on every brief GC pause.
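With the histogram above in place, the p95 panel is a one-line PromQL query (metric name matches the example; adjust the label grouping to your own labels):

```promql
# 95th percentile request duration over a 5-minute window, per endpoint
histogram_quantile(0.95,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
```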
Traffic
Requests per second by endpoint. This tells you whether your system is under load, whether traffic patterns have changed, and — critically — whether something upstream has started hammering you unexpectedly.
On-prem, this also means monitoring network throughput between VMs. If your API server and database server are on separate machines (they should be), saturating the link between them looks like application slowness but is actually a network problem.
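Node Exporter exposes per-interface network counters, so inter-VM throughput is one query away. A sketch — the interface name is an assumption, check yours with `ip link`:

```promql
# Outbound bytes/s on eth0 — compare against the NIC's line rate
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
```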
Errors
HTTP 5xx rates are the obvious metric. But also track application-level errors — unhandled exceptions, failed database queries, queue processing failures. In .NET, I push these through structured logging (Serilog) into Loki, then create Grafana panels that query Loki for error-level entries.
A Loki query for .NET exceptions in the last hour:
```logql
{app="my-api"} |= "Exception" | logfmt | level="error"
```
Saturation
This is where on-prem monitoring differs most from cloud. In the cloud, you never think about disk space — storage is "infinite." On-prem, a full disk at 3 AM will ruin your weekend.
Monitor these on every host:
- CPU utilization — alert at 85% sustained over 10 minutes
- Memory usage — alert at 90%
- Disk space — alert at 80% (gives you time to react)
- Disk I/O wait — the silent killer of PostgreSQL performance
- Docker container resource usage — one runaway container can starve everything else
Node Exporter (a Prometheus companion) exposes all of these out of the box. Install it on every host and Prometheus scrapes it automatically.
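The thresholds above translate directly into Prometheus alerting rules. A sketch of two of them, assuming standard Node Exporter metric names (tune the `for:` durations to your tolerance):

```yaml
groups:
  - name: host-saturation
    rules:
      - alert: HighCpu
        # 100% minus idle time, averaged per host, sustained 10 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
      - alert: DiskFilling
        # Less than 20% of the filesystem left (i.e. above 80% used)
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
               / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.20
        for: 15m
        labels:
          severity: warning
```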
Alerting That Respects a Small Team
Datadog and PagerDuty are built for companies with dedicated on-call rotations. A five-person development team at a Nordic SMB does not have that luxury. You need alerts that are useful without being exhausting.
My alerting rules
Critical (wake someone up): production is down, database unreachable, disk above 95%, SSL certificate expires in under 7 days.
Warning (check next business day): p95 latency above 1 second, error rate above 2%, disk above 80%, backup job failed.
Info (weekly review): resource trends, slow query counts, container restart counts.
Prometheus Alertmanager handles routing. Here is a real config snippet from my setup:
```yaml
route:
  receiver: 'default-slack'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'sms-oncall'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'default-slack'
      repeat_interval: 12h

receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#ops-alerts'
        send_resolved: true
  - name: 'sms-oncall'
    webhook_configs:
      - url: 'http://tools-vm:9095/sms-relay'
```
For the SMS relay, I use a simple webhook that calls the 46elks API (a Swedish SMS gateway — affordable and no minimum commitment). Total cost: about €5/month for the handful of critical alerts that actually fire.
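The relay itself is small. A hedged sketch of the payload handling only — this is not the actual relay, just the part that turns Alertmanager's standard webhook JSON into an SMS text (the HTTP call to the SMS gateway is omitted):

```python
def format_sms(payload: dict) -> str:
    """Turn an Alertmanager webhook payload into a short SMS text.

    Alertmanager POSTs JSON with a top-level "alerts" list; each alert
    carries "labels" and "annotations" dicts (standard webhook format).
    """
    lines = []
    for alert in payload.get("alerts", []):
        name = alert["labels"].get("alertname", "unknown")
        instance = alert["labels"].get("instance", "")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{alert.get('status', '?').upper()}] {name} {instance} {summary}".strip())
    # SMS is short-form -- keep it to the first few alerts
    return "\n".join(lines[:3])

if __name__ == "__main__":
    sample = {
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "DiskFilling", "instance": "db-vm:9100"},
            "annotations": {"summary": "disk above 95%"},
        }]
    }
    print(format_sms(sample))
```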
The key principle: never alert on something nobody will act on. Every alert that gets ignored trains your team to ignore all alerts. I review alert rules quarterly and delete anything that has not been actionable.
Dashboard Templates for Common Stacks
You do not need to build dashboards from scratch. Grafana's community library has hundreds, and I have adapted several for the exact stack most Nordic SMBs run. Here is what I set up on every project.
PostgreSQL Dashboard
Import Grafana dashboard ID 9628 (PostgreSQL Database) and customize:
- Active connections vs. max connections
- Transactions per second
- Tuple operations (inserts, updates, deletes)
- Cache hit ratio (should be above 99% — if it is not, you need more RAM)
- Slow queries count (log queries over 200 ms via `log_min_duration_statement`)
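Enabling that slow-query log is a single setting, value in milliseconds — a sketch, assuming you can run it as a superuser:

```sql
-- Log every statement that takes longer than 200 ms
ALTER SYSTEM SET log_min_duration_statement = 200;
SELECT pg_reload_conf();
```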
.NET Application Dashboard
Build a custom dashboard using prometheus-net metrics:
- Request rate by endpoint
- p50/p95/p99 latency histograms
- Active HTTP connections
- GC collection counts and duration
- Thread pool queue length (early warning for thread starvation)
- Exception rate by type
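Most of those panels are simple rate queries over the default metrics prometheus-net exports. One example — the metric name below is prometheus-net's default GC counter, worth verifying against the version you run:

```promql
# Gen 2 collections per second — a rising rate hints at memory pressure
rate(dotnet_collection_count_total{generation="2"}[5m])
```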
Docker Host Dashboard
Import dashboard ID 893 (Docker and system monitoring) for:
- Container CPU and memory per service
- Container restart counts (a restarting container is a crashing container)
- Network I/O per container
- Volume disk usage
Nginx Dashboard
Import dashboard ID 12708 (Nginx with VTS module) or scrape the Nginx stub status:
- Requests per second by status code
- Active connections
- Upstream response times
- 4xx and 5xx rates
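If you go the stub status route, enabling it is a four-line location block in your Nginx config — the path and allow-list here are assumptions, adjust for your network:

```nginx
# Expose basic connection/request counters for scraping
location /stub_status {
    stub_status;
    allow 10.0.0.0/8;   # monitoring network only
    deny all;
}
```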
All of these dashboards are part of the Docker Compose setup in my Cloud Exit Starter Kit. Pre-configured, pre-wired to Prometheus, ready to deploy.
Log Aggregation Without Elasticsearch Complexity
I have run ELK stacks. I have debugged Elasticsearch cluster splits at 2 AM. I have watched Logstash eat 8 GB of RAM parsing Nginx access logs. For an SMB team, it is not worth it.
Loki + Promtail is simpler by design. Promtail tails your log files (or reads Docker container logs), attaches labels, and ships them to Loki. Loki stores them efficiently. Grafana queries them. That is the entire architecture.
Here is the Promtail config I use for Docker containers:
```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - labeldrop:
          - filename
```
Mount `/var/lib/docker/containers` into the Promtail container, and it picks up every container's logs automatically. New containers, removed containers — no config changes needed.
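In Docker Compose terms, that mount looks roughly like this — the image tag and volume name are placeholders:

```yaml
services:
  promtail:
    image: grafana/promtail:2.9.4          # pin a version you have tested
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro  # read container logs
      - promtail-positions:/tmp                                   # persist read positions

volumes:
  promtail-positions:
```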
Structured logging matters
Loki works best with structured logs. If your .NET application uses Serilog with a JSON formatter, Loki can parse fields and you can filter by any property:
```logql
{app="order-api"} | json | StatusCode >= 500 | RequestPath=~"/api/orders.*"
```
This query finds all 500-level errors on order endpoints. Try doing that with grep across four servers. Loki makes it trivial.
Retention and storage
Loki compresses logs aggressively. In my setup, 30 days of logs from a .NET API, PostgreSQL, Nginx, and supporting services takes about 15 GB of disk. Set a retention policy so logs do not grow forever:
```yaml
compactor:
  retention_enabled: true
  retention_delete_delay: 2h

limits_config:
  retention_period: 720h  # 30 days
```
For most SMBs, 30 days of searchable logs is plenty. If compliance requires longer retention, ship older logs to cheap object storage (a local MinIO instance) as compressed archives.
When NOT to Self-Host Monitoring
I would skip this approach if:
- You have no one who can SSH into a Linux server. Prometheus and Grafana are low-maintenance, but "low" is not "zero." Someone needs to handle upgrades and occasional disk-space cleanups.
- You are running 200+ microservices. At that scale, you probably need a dedicated observability team, and managed tools start to make financial sense because the operational cost of self-hosting scales with service count.
- You need distributed tracing across dozens of services. Jaeger or Tempo can handle this, but setting up distributed tracing self-hosted is a bigger project. If tracing is critical, consider Grafana Cloud's free tier (50 GB of traces/month) as a hybrid approach.
- Your compliance framework mandates a specific monitoring vendor. Rare, but I have seen it in financial services contracts.
Getting Started: The Minimum Viable Monitoring Stack
You can have this running in an afternoon. Here is the order:
1. Deploy Node Exporter on every host. One binary, no config needed.
2. Deploy Prometheus with a basic `prometheus.yml` that scrapes your Node Exporters.
3. Deploy Grafana and connect it to Prometheus. Import the Docker host dashboard.
4. Deploy Loki + Promtail for centralized logging.
5. Configure Alertmanager with Slack and one emergency SMS channel.
6. Add application metrics (`prometheus-net` for .NET, `prom-client` for Node.js) as the next step.
All six components fit in a single `docker-compose.yml`. I have battle-tested configs for this exact setup in the Cloud Exit Starter Kit — Ansible playbooks to provision the hosts, Docker Compose to run the stack, and Grafana dashboard JSON files ready to import.
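If you want to sketch that Compose file yourself first, a minimal skeleton might look like this — image tags and config paths are placeholders, and Node Exporter typically runs as a plain binary on each host rather than inside the stack's Compose file:

```yaml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes: ['./prometheus.yml:/etc/prometheus/prometheus.yml:ro']
    ports: ['9090:9090']
  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes: ['./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro']
  grafana:
    image: grafana/grafana:10.4.2
    ports: ['3000:3000']
  loki:
    image: grafana/loki:2.9.4
    volumes: ['./loki-config.yml:/etc/loki/local-config.yaml:ro']
  promtail:
    image: grafana/promtail:2.9.4
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
```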
Ready to migrate off the cloud?
I put together a Cloud Exit Starter Kit ($49) — Ansible playbooks, Docker Compose production templates, and the migration checklist I use on real projects. Everything you need to go from Azure/AWS to your own hardware.
Or if you just want to talk it through: book a free 30-minute cloud exit assessment. No sales pitch — just an honest look at whether on-prem makes sense for your situation.