Monitoring Your On-Prem Infrastructure Without the Cloud Price Tag
You moved off the cloud to save money. Then you looked at Datadog pricing and realized you might spend half those savings just watching your servers blink. Per-host fees, per-GB log ingestion charges, per-metric custom billing — cloud monitoring vendors have perfected the art of making observability expensive.
Here is the thing: on-premise server monitoring with open source tools is not just possible, it is better for most Nordic SMBs. I run Prometheus, Grafana, and Loki across a four-server on-prem environment for a client right now. The monitoring stack costs exactly zero in licensing. It handles everything we need — metrics, dashboards, logs, and alerts — without sending a single byte to a third-party SaaS.
Why Prometheus + Grafana + Loki Wins for SMBs
There are dozens of open source monitoring tools. I have tried more than I care to admit. This trio keeps winning because each tool does one thing well and they integrate cleanly.
Prometheus scrapes metrics. It pulls data from your services on a schedule, stores it in a time-series database, and evaluates alerting rules. No agents to install on every host — just expose a `/metrics` endpoint and point Prometheus at it. For .NET applications, the prometheus-net library adds an endpoint in about five lines of code.
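Pointing Prometheus at those endpoints is a short config file. A minimal `prometheus.yml` sketch for a few static targets — hostnames and ports here are placeholders, not from any real setup:

```yaml
# prometheus.yml — minimal static scrape config (hostnames are placeholders)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'                 # Node Exporter on each host
    static_configs:
      - targets: ['host1:9100', 'host2:9100']
  - job_name: 'my-api'               # .NET app exposing /metrics via prometheus-net
    static_configs:
      - targets: ['app-vm:5000']
```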
Grafana visualizes everything. Dashboards, graphs, heatmaps, alert state — all in one place. It connects to Prometheus for metrics and Loki for logs, giving your team a single UI for the whole stack. The community dashboard library has pre-built panels for PostgreSQL, Docker, Nginx, .NET, and basically everything else you are running.
Loki handles logs. Think of it as "Prometheus, but for logs." It indexes metadata (labels) rather than full-text indexing every log line, which means it uses a fraction of the storage that Elasticsearch demands. Pair it with Promtail as the log shipper and you have centralized logging without the operational headache of running an ELK stack.
The total resource footprint for this stack on a typical SMB setup: one Docker Compose file, roughly 2 GB of RAM, and negligible CPU. I run it on a TOOLS VM alongside Harbor, SonarQube, and CI/CD agents with room to spare.
What this costs vs. the alternatives
| Solution | Monthly cost (20 hosts, 50 GB logs) |
|---|---|
| Datadog Pro | €1,200–€1,800 |
| New Relic (Pro, beyond free tier) | €800–€1,400 |
| Elastic Cloud | €400–€700 |
| Prometheus + Grafana + Loki (self-hosted) | €0 |
The "cost" of self-hosting is your time. For the initial setup, budget a day. After that, it runs itself. I spend maybe an hour a month on maintenance — mostly upgrading containers to new versions.
The 4 Golden Signals Applied to On-Prem
Google's SRE book defined the four golden signals: latency, traffic, errors, and saturation. They were written for cloud-scale systems, but they translate perfectly to on-prem SMB infrastructure. Here is what I actually monitor.
Latency
Track request duration at the application level, not just ping times. For .NET APIs, expose histogram metrics via prometheus-net:
```csharp
private static readonly Histogram RequestDuration = Metrics
    .CreateHistogram("http_request_duration_seconds",
        "Request duration in seconds",
        new HistogramConfiguration
        {
            LabelNames = new[] { "method", "endpoint", "status_code" },
            Buckets = Histogram.LinearBuckets(0.05, 0.05, 20)
        });
```
In Grafana, set up a panel showing the 95th percentile response time. If p95 crosses 500ms for your API, something needs attention. I alert at 1 second sustained over 5 minutes — that catches real problems without firing on every brief GC pause.
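With the histogram above in place, the p95 panel is a one-line PromQL query (metric name matches the example; adjust the label grouping to your own labels):

```promql
# 95th percentile request duration over a 5-minute window, per endpoint
histogram_quantile(0.95,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
```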
Traffic
Requests per second by endpoint. This tells you whether your system is under load, whether traffic patterns have changed, and — critically — whether something upstream has started hammering you unexpectedly.
On-prem, this also means monitoring network throughput between VMs. If your API server and database server are on separate machines (they should be), saturating the link between them looks like application slowness but is actually a network problem.
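Node Exporter exposes per-interface network counters, so inter-VM throughput is one query away. A sketch — the interface name is an assumption, check yours with `ip link`:

```promql
# Outbound bytes/s on eth0 — compare against the NIC's line rate
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
```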
Errors
HTTP 5xx rates are the obvious metric. But also track application-level errors — unhandled exceptions, failed database queries, queue processing failures. In .NET, I push these through structured logging (Serilog) into Loki, then create Grafana panels that query Loki for error-level entries.
A Loki query for .NET exceptions in the last hour:
```logql
{app="my-api"} |= "Exception" | logfmt | level="error"
```
Saturation
This is where on-prem monitoring differs most from cloud. In the cloud, you never think about disk space — storage is "infinite." On-prem, a full disk at 3 AM will ruin your weekend.
Monitor these on every host:
- CPU utilization — alert at 85% sustained over 10 minutes
- Memory usage — alert at 90%
- Disk space — alert at 80% (gives you time to react)
- Disk I/O wait — the silent killer of PostgreSQL performance
- Docker container resource usage — one runaway container can starve everything else
Node Exporter (a Prometheus companion) exposes all of these out of the box. Install it on every host and Prometheus scrapes it automatically.
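The thresholds above translate directly into Prometheus alerting rules. A sketch of two of them, assuming standard Node Exporter metric names (tune the `for:` durations to your tolerance):

```yaml
groups:
  - name: host-saturation
    rules:
      - alert: HighCpu
        # 100% minus idle time, averaged per host, sustained 10 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
      - alert: DiskFilling
        # Less than 20% of the filesystem left (i.e. above 80% used)
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
               / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.20
        for: 15m
        labels:
          severity: warning
```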
Alerting That Respects a Small Team
Datadog and PagerDuty are built for companies with dedicated on-call rotations. A five-person development team at a Nordic SMB does not have that luxury. You need alerts that are useful without being exhausting.
My alerting rules
Critical (wake someone up): production is down, database unreachable, disk above 95%, SSL certificate expires in under 7 days.
Warning (check next business day): p95 latency above 1 second, error rate above 2%, disk above 80%, backup job failed.
Info (weekly review): resource trends, slow query counts, container restart counts.
Prometheus Alertmanager handles routing. Here is a real config snippet from my setup:
```yaml
route:
  receiver: 'default-slack'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'sms-oncall'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'default-slack'
      repeat_interval: 12h

receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#ops-alerts'
        send_resolved: true
  - name: 'sms-oncall'
    webhook_configs:
      - url: 'http://tools-vm:9095/sms-relay'
```
For the SMS relay, I use a simple webhook that calls the 46elks API (a Swedish SMS gateway — affordable and no minimum commitment). Total cost: about €5/month for the handful of critical alerts that actually fire.
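The relay itself is small. A hedged sketch of the payload handling only — this is not the actual relay, just the part that turns Alertmanager's standard webhook JSON into an SMS text (the HTTP call to the SMS gateway is omitted):

```python
def format_sms(payload: dict) -> str:
    """Turn an Alertmanager webhook payload into a short SMS text.

    Alertmanager POSTs JSON with a top-level "alerts" list; each alert
    carries "labels" and "annotations" dicts (standard webhook format).
    """
    lines = []
    for alert in payload.get("alerts", []):
        name = alert["labels"].get("alertname", "unknown")
        instance = alert["labels"].get("instance", "")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{alert.get('status', '?').upper()}] {name} {instance} {summary}".strip())
    # SMS is short-form -- keep it to the first few alerts
    return "\n".join(lines[:3])

if __name__ == "__main__":
    sample = {
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "DiskFilling", "instance": "db-vm:9100"},
            "annotations": {"summary": "disk above 95%"},
        }]
    }
    print(format_sms(sample))
```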
The key principle: never alert on something nobody will act on. Every alert that gets ignored trains your team to ignore all alerts. I review alert rules quarterly and delete anything that has not been actionable.
Dashboard Templates for Common Stacks
You do not need to build dashboards from scratch. Grafana's community library has hundreds, and I have adapted several for the exact stack most Nordic SMBs run. Here is what I set up on every project.
PostgreSQL Dashboard
Import Grafana dashboard ID 9628 (PostgreSQL Database) and customize:
- Active connections vs. max connections
- Transactions per second
- Tuple operations (inserts, updates, deletes)
- Cache hit ratio (should be above 99% — if it is not, you need more RAM)
- Slow queries count (log queries over 200 ms via `log_min_duration_statement`)
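Enabling that slow-query log is a single setting, value in milliseconds — a sketch, assuming you can run it as a superuser:

```sql
-- Log every statement that takes longer than 200 ms
ALTER SYSTEM SET log_min_duration_statement = 200;
SELECT pg_reload_conf();
```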
.NET Application Dashboard
Build a custom dashboard using prometheus-net metrics:
- Request rate by endpoint
- p50/p95/p99 latency histograms
- Active HTTP connections
- GC collection counts and duration
- Thread pool queue length (early warning for thread starvation)
- Exception rate by type
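Most of those panels are simple rate queries over the default metrics prometheus-net exports. One example — the metric name below is prometheus-net's default GC counter, worth verifying against the version you run:

```promql
# Gen 2 collections per second — a rising rate hints at memory pressure
rate(dotnet_collection_count_total{generation="2"}[5m])
```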
Docker Host Dashboard
Import dashboard ID 893 (Docker and system monitoring) for:
- Container CPU and memory per service
- Container restart counts (a restarting container is a crashing container)
- Network I/O per container
- Volume disk usage
Nginx Dashboard
Import dashboard ID 12708 (Nginx with VTS module) or scrape the Nginx stub status:
- Requests per second by status code
- Active connections
- Upstream response times
- 4xx and 5xx rates
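If you go the stub status route, enabling it is a four-line location block in your Nginx config — the path and allow-list here are assumptions, adjust for your network:

```nginx
# Expose basic connection/request counters for scraping
location /stub_status {
    stub_status;
    allow 10.0.0.0/8;   # monitoring network only
    deny all;
}
```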
All of these dashboards are part of the Docker Compose setup in my Cloud Exit Starter Kit. Pre-configured, pre-wired to Prometheus, ready to deploy.
Log Aggregation Without Elasticsearch Complexity
I have run ELK stacks. I have debugged Elasticsearch cluster splits at 2 AM. I have watched Logstash eat 8 GB of RAM parsing Nginx access logs. For an SMB team, it is not worth it.
Loki + Promtail is simpler by design. Promtail tails your log files (or reads Docker container logs), attaches labels, and ships them to Loki. Loki stores them efficiently. Grafana queries them. That is the entire architecture.
Here is the Promtail config I use for Docker containers:
```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - labeldrop:
          - filename
```
Mount `/var/lib/docker/containers` into the Promtail container, and it picks up every container's logs automatically. New containers, removed containers — no config changes needed.
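In Docker Compose terms, that mount looks roughly like this — the image tag and volume name are placeholders:

```yaml
services:
  promtail:
    image: grafana/promtail:2.9.4          # pin a version you have tested
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro  # read container logs
      - promtail-positions:/tmp                                   # persist read positions

volumes:
  promtail-positions:
```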
Structured logging matters
Loki works best with structured logs. If your .NET application uses Serilog with a JSON formatter, Loki can parse fields and you can filter by any property:
```logql
{app="order-api"} | json | StatusCode >= 500 | RequestPath=~"/api/orders.*"
```
This query finds all 500-level errors on order endpoints. Try doing that with grep across four servers. Loki makes it trivial.
Retention and storage
Loki compresses logs aggressively. In my setup, 30 days of logs from a .NET API, PostgreSQL, Nginx, and supporting services takes about 15 GB of disk. Set a retention policy so logs do not grow forever:
```yaml
compactor:
  retention_enabled: true
  retention_delete_delay: 2h

limits_config:
  retention_period: 720h  # 30 days
```
For most SMBs, 30 days of searchable logs is plenty. If compliance requires longer retention, ship older logs to cheap object storage (a local MinIO instance) as compressed archives.
When NOT to Self-Host Monitoring
I would skip this approach if:
- You have no one who can SSH into a Linux server. Prometheus and Grafana are low-maintenance, but "low" is not "zero." Someone needs to handle upgrades and occasional disk-space cleanups.
- You are running 200+ microservices. At that scale, you probably need a dedicated observability team, and managed tools start to make financial sense because the operational cost of self-hosting scales with service count.
- You need distributed tracing across dozens of services. Jaeger or Tempo can handle this, but setting up distributed tracing self-hosted is a bigger project. If tracing is critical, consider Grafana Cloud's free tier (50 GB of traces/month) as a hybrid approach.
- Your compliance framework mandates a specific monitoring vendor. Rare, but I have seen it in financial services contracts.
Getting Started: The Minimum Viable Monitoring Stack
You can have this running in an afternoon. Here is the order:
1. Deploy Node Exporter on every host. One binary, no config needed.
2. Deploy Prometheus with a basic `prometheus.yml` that scrapes your Node Exporters.
3. Deploy Grafana and connect it to Prometheus. Import the Docker host dashboard.
4. Deploy Loki + Promtail for centralized logging.
5. Configure Alertmanager with Slack and one emergency SMS channel.
6. Add application metrics (`prometheus-net` for .NET, `prom-client` for Node.js) as the next step.
All six components fit in a single `docker-compose.yml`. I have battle-tested configs for this exact setup in the Cloud Exit Starter Kit — Ansible playbooks to provision the hosts, Docker Compose to run the stack, and Grafana dashboard JSON files ready to import.
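If you want to sketch that Compose file yourself first, a minimal skeleton might look like this — image tags and config paths are placeholders, and Node Exporter typically runs as a plain binary on each host rather than inside the stack's Compose file:

```yaml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes: ['./prometheus.yml:/etc/prometheus/prometheus.yml:ro']
    ports: ['9090:9090']
  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes: ['./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro']
  grafana:
    image: grafana/grafana:10.4.2
    ports: ['3000:3000']
  loki:
    image: grafana/loki:2.9.4
    volumes: ['./loki-config.yml:/etc/loki/local-config.yaml:ro']
  promtail:
    image: grafana/promtail:2.9.4
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
```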
Ready to migrate off the cloud?
I put together a Cloud Exit Starter Kit ($49) — Ansible playbooks, Docker Compose production templates, and the migration checklist I use on real projects. Everything you need to go from Azure/AWS to your own hardware.
Or if you just want to talk it through: book a free 30-minute cloud exit assessment. No sales pitch — just an honest look at whether on-prem makes sense for your situation.