Loki at Scale: Lessons from High-Volume Logs
Updated: 2026-05-03
Loki[1] from Grafana Labs has gained ground as an Elasticsearch alternative for logs. Its pitch of “like Prometheus but for logs” is attractive: only index labels, not content, drastically reducing storage and indexing cost. It works very well for mid-size teams; at large scale, the trade-offs show. This article collects lessons from operating Loki at real volumes (>1 TB/day) and the patterns that prevent production pain.
Key Takeaways
- Loki indexes only labels, not content: each unique combination of label values generates a stream.
- Cardinality explosion is incident number one — review new labels before merging any change.
- Separating read and write paths is essential at serious scale; a heavy query must not saturate ingestion.
- Grafana Alloy replaces Promtail for new deployments — one agent for logs, metrics, and traces.
- Knowing when Loki is not the right tool (deep forensic search, strict compliance) matters as much as knowing how to operate it.
The Loki Model in 30 Seconds
Loki indexes only labels (key-value pairs like {app="api", env="prod"}) and stores the log chunk unindexed in object storage (S3, GCS, MinIO). Queries first filter by labels, then scan resulting chunks with regex or text filters.
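The two phases are visible in every query: the label matcher hits the index, the line filter scans only the matching chunks. A minimal illustration (the filter string is just an example):
# Label matcher uses the index; the |= filter brute-force scans matching chunks
{app="api", env="prod"} |= "timeout"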
Model advantages:
- Cheap storage (S3 plus compression).
- Fast ingestion — no heavy parsing pipeline.
- Prometheus-compatible labels.
Model limits:
- Non-label queries over large volume are slow.
- Label cardinality is the cost — each unique combination generates a stream.
Cardinality: The Silent Killer
The most common mistake is high-cardinality labels. Things that must not be labels:
- user_id (millions of values).
- request_id (unique per request).
- timestamp or any time-based value.
- Full un-normalised url.
Each unique value generates a stream. 10,000 users × 5 envs × 3 services = 150,000 active streams. The index bloats, queries degrade, and object-store cost spikes from many small files.
Golden rule: labels for what you filter by (app, env, cluster, severity, tenant if few); content for what you search for (user_id in the message, queryable with regex).
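A sketch of the rule in practice, assuming the service logs user_id as a JSON field (a hypothetical field name):
# Good: user_id stays in the content, extracted at query time
{app="api", env="prod"} | json | user_id="12345"
# Bad: user_id as a label would mint one stream per user
# {app="api", user_id="12345"}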
Healthy Label Design
Practical pattern for mid-size teams:
- {app, env, cluster, component}: the fixed axis. 50-500 combinations typical.
- {level}: log level (info/warn/error).
- No unique IDs.
- No free-form user values.
With this schema, a typical environment of 50 services × 3 environments × 2 clusters × 5 components × 4 levels = 6,000 streams. Perfectly manageable.
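Loki's per-tenant limits can enforce a schema like this as a guardrail. A minimal sketch of the relevant limits_config keys; the values are illustrative, not recommendations:
# loki.yaml fragment; values illustrative
limits_config:
  max_label_names_per_series: 10      # reject entries carrying too many labels
  max_label_value_length: 256         # catch runaway values such as full URLs
  max_global_streams_per_user: 50000  # hard ceiling on active streams per tenant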
Separate Read and Write Paths
At serious scale, one process cannot manage ingestion and queries without them interfering. Recommended design:
- Distributor + Ingester: write pipeline. Receives logs from Promtail/Alloy, buffers in memory, writes to object store in chunks.
- Querier + Query-frontend: read pipeline. Parallelises queries, caches, sends results.
- Compactor: batch process that periodically compacts chunks.
- Ruler: evaluates log alert rules.
- Index Gateway: serves index lookups to queriers when using boltdb-shipper or TSDB, so queriers don't each have to download and sync the index themselves.
A heavy query must not saturate ingestion — that separation is the guarantee.
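In the simple scalable deployment this separation maps onto Loki's -target flag; a sketch, assuming one shared config file and object store:
# Same binary, separate roles (simple scalable deployment)
loki -config.file=loki.yaml -target=write    # distributor + ingester
loki -config.file=loki.yaml -target=read     # query-frontend + querier
loki -config.file=loki.yaml -target=backend  # compactor, ruler, index gateway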
Promtail → Alloy
Promtail[2] was the traditional shipper. Grafana Alloy[3] (formerly Grafana Agent) replaces it with a single agent shipping logs, metrics, and traces. For new deployments, Alloy is the correct choice.
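A minimal Alloy configuration as a sketch; the paths, labels, and endpoint URL are placeholders:
// Tail local files and attach the label axis at the source
loki.source.file "api" {
  targets    = [{__path__ = "/var/log/api/*.log", app = "api", env = "prod"}]
  forward_to = [loki.write.default.receiver]
}

// Push to Loki's ingest endpoint
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}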
LogQL Queries: Useful Patterns
LogQL is the query language. Queries offering the most operational value:
# Top errors by service in 1h
sum by (app) (count_over_time({env="prod", level="error"}[1h]))
# p95 latency extracted from logs (requires a parse stage)
quantile_over_time(0.95,
  {app="api"}
    | json
    | __error__=""
    | unwrap duration
  [5m]
) by (app)
# Find pattern in a window
{app="api", env="prod"} |= "payment failed" | json | user_id = "12345"
Efficient queries always start with a selective label matcher. Any query without a label matcher operates over all streams and can bring down the cluster.
Log-Based Alerts
Loki supports Prometheus-style rules over metrics extracted from logs. This turns Loki into an alerting tool as well. It doesn’t cover the full range of Elasticsearch/Kibana, but covers 80% of practical cases.
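The ruler consumes rule files in the Prometheus format, with a LogQL expression in expr. A sketch; names and thresholds are illustrative:
# ruler rule file; alert when a prod service logs >10 errors/s for 10m
groups:
  - name: service-errors
    rules:
      - alert: HighErrorRate
        expr: sum by (app) (rate({env="prod", level="error"}[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.app }} error rate above 10/s"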
When Loki is NOT the Tool
Being honest about limits saves frustration:
- Sophisticated full-text search: Elasticsearch wins.
- Deep forensic analysis: Splunk and Elasticsearch have specific tools.
- Strict compliance with integrated auditing: Splunk Enterprise is built for that.
- Massive volume needing fast ad-hoc queries over long periods: Loki starts to hurt.
Loki is excellent for “monitoring-grade logs” — logs ingested and queried routinely with known patterns. For “deep past forensic investigation”, better options exist.
Operational Lessons
A year of operating Loki at serious scale leaves a clear set of rules:
- Cardinality explosion is incident number one. Review new labels before merging any PR.
- Query caps and timeouts. A user with a badly constructed query can bring down the whole cluster (see the limits sketch after this list).
- Object-store backup. Losing the bucket is losing all logs.
- Monitoring Loki with Loki is circular. Use Prometheus plus the metrics Loki exposes.
- Rate limiting on ingest per tenant. A service generating log spam must not affect others.
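A sketch of the per-tenant guardrails behind the query-cap and rate-limit rules above, using limits_config keys with illustrative values (exact key placement varies by Loki version):
# loki.yaml fragment; values illustrative, tune per tenant
limits_config:
  query_timeout: 2m             # kill runaway queries
  max_query_parallelism: 32     # cap per-query fan-out
  ingestion_rate_mb: 10         # per-tenant ingest rate limit
  ingestion_burst_size_mb: 20   # allow short bursts above the rate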
Conclusion
Loki is a solid choice for logs in most cloud-native contexts. Its label-based design offers huge advantages but requires discipline: mismanaged cardinality ruins the experience. At serious scale, separating read/write paths, investing in cache and compaction, and choosing the object store well are key decisions. For teams already on Prometheus + Grafana, adding Loki is one of the best-return observability upgrades available.