12 min read · May 4, 2026

Kubernetes Logging: Fluent Bit and Grafana Loki

Fluent Bit runs as a DaemonSet, collects logs from every node, and ships them to a storage backend. Grafana Loki stores logs with labels (not full-text indexing) — logs from the same pod share labels derived from Kubernetes metadata, making them queryable in Grafana alongside metrics and traces. This article covers the Fluent Bit DaemonSet configuration for EKS, Loki deployment with S3 object storage, and the LogQL queries that diagnose production incidents.

Coding Protocols Team
Platform Engineering

Every Kubernetes workload writes to stdout/stderr, and the container runtime writes those streams to files on the node at /var/log/containers/. A DaemonSet log collector reads those files, enriches the log lines with Kubernetes metadata (pod name, namespace, container name, labels), and ships them to a backend.
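
As a concrete (hypothetical) example on a containerd node: the kubelet symlinks each container's log into /var/log/containers/ with pod, namespace, and container encoded in the filename, and every line carries the CRI prefix of timestamp, stream, and a full/partial marker.

text
# File name pattern: <pod>_<namespace>_<container>-<container-id>.log
/var/log/containers/payments-api-7d9fc6bc5-x2k4j_payments_payments-api-0a1b2c3d4e5f.log

# CRI log format: timestamp, stream, F (full) or P (partial line), then the message
2026-05-04T12:00:00.123456789Z stdout F {"level":"info","msg":"payment authorized"}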

Loki's approach to log storage is the key design decision: unlike Elasticsearch, Loki doesn't full-text index log content. It indexes only the labels (namespace, pod, container, stream). Log content is stored compressed in object storage (S3). This makes Loki significantly cheaper at scale — but queries that need to search log content (|= "error") do linear scanning, which is slower than indexed search for high-cardinality text queries.
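
To make the distinction concrete (labels hypothetical): the first query below is narrowed purely through the label index, while the second additionally decompresses and greps every chunk the selector matches.

logql
# Label matchers only: resolved against the index to pick streams
{namespace="payments", container="payments-api"}

# Line filter added: each selected chunk is decompressed and scanned
{namespace="payments"} |= "connection refused"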


Fluent Bit Installation

bash
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace \
  --version 0.47.7 \
  --values fluent-bit-values.yaml
yaml
# fluent-bit-values.yaml
config:
  service: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        Health_Check  On

  inputs: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri    # Handle both Docker and containerd log formats
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

  filters: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On        # Parse JSON log lines into structured fields
        Merge_Log_Key       log_processed
        Keep_Log            Off       # Don't duplicate the raw log field
        K8S-Logging.Parser  On        # Use pod annotations to set log parsers
        K8S-Logging.Exclude On        # Allow pods to opt out of collection

    [FILTER]
        Name    modify
        Match   kube.*
        Add     cluster production    # Tag all logs with cluster name

  outputs: |
    [OUTPUT]
        Name            loki
        Match           kube.*
        Host            loki-gateway.logging.svc.cluster.local
        Port            80
        Labels          job=fluentbit,namespace=$kubernetes['namespace_name'],pod=$kubernetes['pod_name'],container=$kubernetes['container_name'],node=$kubernetes['host']
        Label_Keys      $kubernetes['labels']['app'],$kubernetes['labels']['version']
        line_format     json
        remove_keys     kubernetes,stream    # Remove redundant fields already in labels
        Retry_Limit     5

tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - operator: Exists    # Run on all tainted nodes (omitting effect matches NoSchedule, PreferNoSchedule, and NoExecute)

resources:
  requests:
    cpu: 50m
    memory: 50Mi
  limits:
    cpu: 200m
    memory: 200Mi

Loki Installation with S3 Backend

bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --namespace logging \
  --version 6.24.0 \
  --values loki-values.yaml
yaml
# loki-values.yaml — Simple Scalable mode (recommended for production)
loki:
  auth_enabled: false    # Set true for multi-tenant; false for single-tenant

  commonConfig:
    replication_factor: 3

  storage:
    type: s3
    s3:
      region: us-east-1
      bucketNames:
        chunks: loki-chunks-production
        ruler: loki-ruler-production
        admin: loki-admin-production    # Used only in Loki Enterprise; OSS Loki ignores this bucket

  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  limits_config:
    retention_period: 30d     # Log retention — requires the compactor component to be enabled and running
    ingestion_rate_mb: 16     # Per-tenant ingestion rate limit
    ingestion_burst_size_mb: 32

# Simple Scalable deployment: separate read and write paths
deploymentMode: SimpleScalable

backend:
  replicas: 3
  persistence:
    storageClass: gp3
    size: 10Gi    # WAL and index cache

write:
  replicas: 3

read:
  replicas: 3

# Minio not needed — using S3
minio:
  enabled: false

The S3 buckets need to exist before installing Loki:

bash
aws s3 mb s3://loki-chunks-production --region us-east-1
aws s3 mb s3://loki-ruler-production --region us-east-1
aws s3 mb s3://loki-admin-production --region us-east-1

The Loki pods need IAM access to S3 via IRSA or EKS Pod Identity. The IAM policy needs s3:PutObject, s3:GetObject, s3:DeleteObject, and s3:ListBucket on the three buckets.
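
A sketch of that policy and its creation via the AWS CLI (the policy name is illustrative; attach the resulting policy to the IAM role your Loki service account assumes):

bash
# Hypothetical policy name — scope the actions to the three Loki buckets
cat > loki-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::loki-chunks-production/*",
        "arn:aws:s3:::loki-ruler-production/*",
        "arn:aws:s3:::loki-admin-production/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::loki-chunks-production",
        "arn:aws:s3:::loki-ruler-production",
        "arn:aws:s3:::loki-admin-production"
      ]
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name loki-s3-access \
  --policy-document file://loki-s3-policy.json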


LogQL: Querying Logs

Loki's query language is LogQL. Queries filter by labels first (fast — label index), then optionally filter or transform log content (slower — content scan):

logql
# All error logs from the payments namespace in the last hour
{namespace="payments"} |= "ERROR"

# Logs from the payments-api container specifically
{namespace="payments", container="payments-api"} |= "ERROR"

# Structured log parsing: extract fields from JSON logs
{namespace="payments"} | json | level="error"

# Count error rate per pod over 5 minutes
sum by (pod) (
  rate({namespace="payments"} |= "ERROR" [5m])
)

# Show slow requests: parse duration from log line
{namespace="payments", container="payments-api"}
| json
| duration > 1s

# Trace a specific request by ID across all services (regex OR across namespaces)
{namespace=~"payments|orders"} |= "req-abc123"
logql
# Aggregate: error count by HTTP status code (requires structured JSON logs)
sum by (status) (
  count_over_time(
    {namespace="payments"}
    | json
    | status =~ "5.."
    [5m]
  )
)

LogQL supports two query types:

  • Log queries (return log lines): {namespace="payments"} |= "ERROR"
  • Metric queries (return time series): rate({namespace="payments"} |= "ERROR" [5m])

Metric queries from log data can drive Prometheus alerts via Loki's ruler — alert when the error rate from log content exceeds a threshold.
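
A sketch of such a ruler rule (group and alert names are hypothetical; the ruler evaluates Prometheus-style rule groups whose expressions are LogQL metric queries):

yaml
groups:
  - name: payments-log-alerts
    rules:
      - alert: PaymentsHighErrorLogRate
        # Fires when ERROR lines across the payments namespace exceed 10/s for 10 minutes
        expr: sum(rate({namespace="payments"} |= "ERROR" [5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Error log rate in payments is above 10 lines/sec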


Grafana Integration

Add Loki as a data source in Grafana:

yaml
# Grafana datasource provisioning (grafana-values.yaml)
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-gateway.logging.svc.cluster.local:80
      access: proxy
      jsonData:
        maxLines: 1000
        derivedFields:
          # Auto-link trace IDs in logs to Tempo traces
          - datasourceUid: tempo
            matcherRegex: '"trace_id":"([a-f0-9]+)"'
            name: TraceID
            url: "$${__value.raw}"

The derivedFields configuration extracts trace IDs from log lines and creates clickable links to the corresponding Tempo traces — enabling the logs→traces correlation flow in Grafana.
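
For example, assuming the application emits JSON logs with a trace_id field (the ID below is made up), the matcherRegex captures the hex string and Grafana renders it as a link into Tempo:

json
{"level":"error","msg":"payment declined","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}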


Excluding Logs from Collection

Use the Fluent Bit Kubernetes filter's annotation-based exclusion to prevent noisy pods from filling the log storage:

yaml
# Exclude this pod's logs from collection
metadata:
  annotations:
    fluentbit.io/exclude: "true"

# Use a custom parser for this pod (e.g., nginx access log format)
metadata:
  annotations:
    fluentbit.io/parser: nginx

Frequently Asked Questions

How much S3 storage does Loki use?

Loki's label-based indexing and Snappy compression typically achieve 5-10x compression on log data, so a cluster generating 1GB/day of raw logs stores roughly 100-200MB/day in S3. At 30 days' retention that's 3-6GB, or roughly $0.07-0.14/month at S3's $0.023/GB. The cost advantage over Elasticsearch (which requires expensive SSD storage for its full-text indexes) is significant at scale.

Fluent Bit vs Fluentd vs the OpenTelemetry Collector for log collection?

Fluent Bit is significantly lighter than Fluentd and the standard choice for node-level log collection. Fluentd runs on a Ruby runtime (~40MB+ RAM) with a richer plugin ecosystem — use it if you need complex transformation that Fluent Bit can't express. The OpenTelemetry Collector can collect logs via its filelog receiver and is the right choice if you're already using OTel for traces and metrics and want a single collector pipeline. For pure log collection on EKS, Fluent Bit is the default recommendation.

Can I send logs directly to CloudWatch instead of Loki?

Yes. Replace the Loki [OUTPUT] block with the CloudWatch output:

ini
[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            us-east-1
    log_group_name    /aws/eks/production/workloads
    log_stream_prefix pod/
    auto_create_group On

CloudWatch Logs Insights can query these logs. CloudWatch is simpler to operate on EKS (no additional infrastructure), but more expensive than Loki+S3 at scale and lacks Grafana's correlation features.
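
As an illustration, a Logs Insights query over that log group could look like this (only @timestamp and @message are guaranteed built-in fields; other field names depend on what Fluent Bit ships in each record):

text
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50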



Multiline Log Parsing

Container workloads often emit multiline log output — Java exception stack traces, Go panic dumps, and multi-line JSON blobs that the CRI runtime splits into separate log records. Fluent Bit handles these with built-in and custom multiline parsers.

Java Stack Traces

ini
[INPUT]
    Name              tail
    Path              /var/log/containers/*_production_*.log
    Tag               kube.*
    multiline.parser  java,cri    # Built-in Java multiline parser — joins indented lines after an exception header

The java parser uses the pattern that Java stack traces begin with an exception class name (e.g., java.lang.NullPointerException) and continuation lines are indented with whitespace. Fluent Bit buffers lines until the pattern breaks, then emits the joined record as a single log event.
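
For instance, a (made-up) trace like the one below arrives as three separate CRI records and leaves Fluent Bit as a single event:

text
java.lang.NullPointerException: charge amount was null
    at com.example.PaymentService.charge(PaymentService.java:42)
    at com.example.Main.main(Main.java:15)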

Go Panic Traces

Go panics emit goroutine dumps that the CRI splits per line. Use a custom MULTILINE_PARSER with a regex:

ini
[MULTILINE_PARSER]
    name          custom-go-panic
    type          regex
    flush_timeout 1000
    rule "start_state" "/(goroutine \d+)/gm" "go_state"
    rule "go_state"    "/^(\s+)/gm" "go_state"

This regex identifies goroutine header lines as the start of a new multiline record and accumulates all subsequent indented lines into the same event.


Retry Limit and Disk Buffering

Switching the output's Retry_Limit from the bounded value shown earlier (5) to False is worth calling out explicitly:

ini
[OUTPUT]
    Name          loki
    Match         kube.*
    ...
    Retry_Limit   False   # Buffer to disk and retry indefinitely — no log loss on Loki downtime

Without it, Fluent Bit drops log records once the default single retry (scheduled roughly 2× the configured flush interval later) fails. With Retry_Limit False, Fluent Bit buffers chunks to disk (when storage.path is set in the SERVICE block and the input uses storage.type filesystem) and retries until Loki accepts them. This means:

  • No log loss during Loki restarts or rolling upgrades — chunks accumulate on the node disk and flush when Loki recovers
  • Backpressure protection — the Mem_Buf_Limit on the INPUT block caps in-memory buffering; overflow spills to disk
  • Trade-off: disk space on each node is consumed during outages — monitor the storage.path mount and set an appropriate storage.max_chunks_up to bound memory consumption (see the sketch after this list)
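
A minimal sketch of those buffering knobs (values illustrative; storage.path in the SERVICE block enables disk buffering globally, and each input opts in with storage.type filesystem):

ini
[SERVICE]
    storage.path              /var/log/flb-storage/    # Chunk spill directory (mount a hostPath here)
    storage.sync              normal
    storage.max_chunks_up     128    # Cap on chunks held in memory at once
    storage.backlog.mem_limit 5M     # Memory budget when replaying backlogged chunks

[INPUT]
    Name          tail
    storage.type  filesystem    # Opt this input into disk-backed buffering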

For production deployments, pair Retry_Limit False with a hostPath volume for storage.path so buffers survive Fluent Bit pod restarts:

yaml
daemonSetVolumes:
  - name: flb-storage
    hostPath:
      path: /var/log/flb-storage    # Survives Fluent Bit pod restarts

daemonSetVolumeMounts:
  - name: flb-storage
    mountPath: /var/log/flb-storage

For the OpenTelemetry Collector approach to log collection that unifies traces, metrics, and logs in one pipeline, see OpenTelemetry on Kubernetes: Collector, Auto-Instrumentation, and the Operator. For Prometheus metrics that complement Loki's log-based metrics, see Prometheus and Grafana on Kubernetes: Production Monitoring Stack.

Setting up centralized logging for an EKS cluster or migrating from CloudWatch Logs to Loki? Talk to us at Coding Protocols — we help platform teams build logging pipelines that scale to terabytes without breaking the on-call budget.

Related Topics

Kubernetes
Logging
Fluent Bit
Grafana Loki
Observability
EKS
Platform Engineering
LogQL
