Kubernetes Logging: Fluent Bit and Grafana Loki
Fluent Bit runs as a DaemonSet, collects logs from every node, and ships them to a storage backend. Grafana Loki indexes logs by labels rather than full-text content — logs from the same pod share labels derived from Kubernetes metadata, making them queryable in Grafana alongside metrics and traces. This guide covers the Fluent Bit DaemonSet configuration for EKS, Loki deployment with S3 object storage, and the LogQL queries that diagnose production incidents.

Every Kubernetes workload writes to stdout/stderr, and the container runtime writes those streams to files on the node at /var/log/containers/. A DaemonSet log collector reads those files, enriches the log lines with Kubernetes metadata (pod name, namespace, container name, labels), and ships them to a backend.
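On the node, those files follow a predictable naming convention that encodes the pod, namespace, and container, which is what makes the metadata enrichment possible (the pod names below are illustrative):

ls /var/log/containers/
# <pod-name>_<namespace>_<container-name>-<container-id>.log
payments-api-7d9f6c5b4-x2kqj_payments_payments-api-3f1a97c0...log
coredns-5d78c9869d-vxq7m_kube-system_coredns-91b24aa8...log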
Loki's approach to log storage is the key design decision: unlike Elasticsearch, Loki doesn't full-text index log content. It indexes only the labels (namespace, pod, container, stream). Log content is stored compressed in object storage (S3). This makes Loki significantly cheaper at scale — but queries that need to search log content (|= "error") do linear scanning, which is slower than indexed search for high-cardinality text queries.
Fluent Bit Installation
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace \
  --version 0.47.7 \
  --values fluent-bit-values.yaml

# fluent-bit-values.yaml
config:
  service: |
    [SERVICE]
        Flush             1
        Log_Level         info
        Daemon            off
        HTTP_Server       On
        HTTP_Listen       0.0.0.0
        HTTP_Port         2020
        Health_Check      On

  inputs: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri    # Handle both Docker and containerd log formats
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

  filters: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On     # Parse JSON log lines into structured fields
        Merge_Log_Key       log_processed
        Keep_Log            Off    # Don't duplicate the raw log field
        K8S-Logging.Parser  On     # Use pod annotations to set log parsers
        K8S-Logging.Exclude On     # Allow pods to opt out of collection

    [FILTER]
        Name   modify
        Match  kube.*
        Add    cluster production    # Tag all logs with cluster name

  outputs: |
    [OUTPUT]
        Name         loki
        Match        kube.*
        Host         loki-gateway.logging.svc.cluster.local
        Port         80
        Labels       job=fluentbit,namespace=$kubernetes['namespace_name'],pod=$kubernetes['pod_name'],container=$kubernetes['container_name'],node=$kubernetes['host']
        Label_Keys   $kubernetes['labels']['app'],$kubernetes['labels']['version']
        line_format  json
        remove_keys  kubernetes,stream    # Remove redundant fields already in labels
        Retry_Limit  5

tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - operator: Exists    # Run on all tainted nodes (omitting effect matches NoSchedule, PreferNoSchedule, and NoExecute)

resources:
  requests:
    cpu: 50m
    memory: 50Mi
  limits:
    cpu: 200m
    memory: 200Mi

Loki Installation with S3 Backend
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --namespace logging \
  --version 6.24.0 \
  --values loki-values.yaml

# loki-values.yaml — Simple Scalable mode (recommended for production)
loki:
  auth_enabled: false    # Set true for multi-tenant; false for single-tenant

  commonConfig:
    replication_factor: 3

  storage:
    type: s3
    s3:
      region: us-east-1
    bucketNames:
      chunks: loki-chunks-production
      ruler: loki-ruler-production
      admin: loki-admin-production    # Used only in Loki Enterprise; OSS Loki ignores this bucket

  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  limits_config:
    retention_period: 30d          # Log retention — requires the compactor component to be enabled and running
    ingestion_rate_mb: 16          # Per-tenant ingestion rate limit
    ingestion_burst_size_mb: 32

# Simple Scalable deployment: separate read and write paths
deploymentMode: SimpleScalable

backend:
  replicas: 3
  persistence:
    storageClass: gp3
    size: 10Gi    # WAL and index cache

write:
  replicas: 3

read:
  replicas: 3

# MinIO not needed — using S3
minio:
  enabled: false

The S3 buckets need to exist before installing Loki:
aws s3 mb s3://loki-chunks-production --region us-east-1
aws s3 mb s3://loki-ruler-production --region us-east-1
aws s3 mb s3://loki-admin-production --region us-east-1

The Loki pods need IRSA or Pod Identity to write to S3. The IAM policy needs s3:PutObject, s3:GetObject, s3:DeleteObject, s3:ListBucket on the three buckets.
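A minimal sketch of that IAM policy, using the bucket names created above (attach it to the Loki service account role via IRSA or Pod Identity):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LokiObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::loki-chunks-production/*",
        "arn:aws:s3:::loki-ruler-production/*",
        "arn:aws:s3:::loki-admin-production/*"
      ]
    },
    {
      "Sid": "LokiListBuckets",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::loki-chunks-production",
        "arn:aws:s3:::loki-ruler-production",
        "arn:aws:s3:::loki-admin-production"
      ]
    }
  ]
}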
LogQL: Querying Logs
Loki's query language is LogQL. Queries filter by labels first (fast — label index), then optionally filter or transform log content (slower — content scan):
# All error logs from the payments namespace in the last hour
{namespace="payments"} |= "ERROR"

# Logs from the payments-api container specifically
{namespace="payments", container="payments-api"} |= "ERROR"

# Structured log parsing: extract fields from JSON logs
{namespace="payments"} | json | level="error"

# Count error rate per pod over 5 minutes
sum by (pod) (
  rate({namespace="payments"} |= "ERROR" [5m])
)

# Show slow requests: parse duration from log line
{namespace="payments", container="payments-api"}
  | json
  | duration > 1s

# Trace a specific request by ID across all services (regex OR across namespaces)
{namespace=~"payments|orders"} |= "req-abc123"

# Aggregate: error count by HTTP status code (requires structured JSON logs)
sum by (status) (
  count_over_time(
    {namespace="payments"}
      | json
      | status =~ "5.."
    [5m]
  )
)

LogQL supports two query types:
- Log queries (return log lines): {namespace="payments"} |= "ERROR"
- Metric queries (return time series): rate({namespace="payments"} |= "ERROR" [5m])
Metric queries from log data can drive Prometheus alerts via Loki's ruler — alert when the error rate from log content exceeds a threshold.
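As a sketch, a ruler alerting rule could look like the following (the group name, alert name, and threshold are illustrative, and the ruler component needs rule storage configured):

# loki-alert-rules.yaml — rules follow the Prometheus rule format with LogQL expressions
groups:
  - name: payments-log-alerts
    rules:
      - alert: PaymentsLogErrorRateHigh
        expr: sum(rate({namespace="payments"} |= "ERROR" [5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "payments is logging more than 5 errors/sec for 10 minutes"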
Grafana Integration
Add Loki as a data source in Grafana:
# Grafana datasource provisioning (grafana-values.yaml)
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-gateway.logging.svc.cluster.local:80
      access: proxy
      jsonData:
        maxLines: 1000
        derivedFields:
          # Auto-link trace IDs in logs to Tempo traces
          - datasourceUid: tempo
            matcherRegex: '"trace_id":"([a-f0-9]+)"'
            name: TraceID
            url: "$${__value.raw}"

The derivedFields configuration extracts trace IDs from log lines and creates clickable links to the corresponding Tempo traces — enabling the logs→traces correlation flow in Grafana.
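For that matcherRegex to fire, the application has to emit the trace ID as a trace_id field in its JSON log output, for example a line like this (field names and values are illustrative):

{"ts":"2024-06-01T12:00:00Z","level":"error","msg":"payment declined","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","order_id":"ord-8841"}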
Excluding Logs from Collection
Use the Fluent Bit Kubernetes filter's annotation-based exclusion to prevent noisy pods from filling the log storage:
# Exclude this pod's logs from collection
metadata:
  annotations:
    fluentbit.io/exclude: "true"

# Use a custom parser for this pod (e.g., nginx access log format)
metadata:
  annotations:
    fluentbit.io/parser: nginx

Frequently Asked Questions
How much S3 storage does Loki use?
Loki's label-based indexing and Snappy compression typically achieve 5-10x compression on log data. A cluster generating 1GB/day of raw logs stores roughly 100-200MB/day in S3. At 30 days of retention that is 3-6GB, which costs roughly $0.07-0.14/month at $0.023/GB-month. The cost advantage over Elasticsearch (which requires expensive SSD storage for its indexes) is significant at scale.
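The arithmetic behind those estimates:

1 GB/day raw logs at 5-10x compression  ->  100-200 MB/day in S3
100-200 MB/day x 30 days retention      ->  3-6 GB stored
3-6 GB x $0.023/GB-month (S3 Standard)  ->  ~$0.07-0.14/month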
Fluent Bit vs Fluentd vs the OpenTelemetry Collector for log collection?
Fluent Bit is significantly lighter than Fluentd and the standard choice for node-level log collection. Fluentd runs on a Ruby runtime (~40MB+ RAM) with a richer plugin ecosystem — use it if you need complex transformation that Fluent Bit can't express. The OpenTelemetry Collector can collect logs via its filelog receiver and is the right choice if you're already using OTel for traces and metrics and want a single collector pipeline. For pure log collection on EKS, Fluent Bit is the default recommendation.
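If you go the Collector route, a minimal sketch of the filelog receiver reading the same node log files (the exporter and endpoint are illustrative; recent Loki versions accept OTLP natively, older ones need a different exporter):

receivers:
  filelog:
    include:
      - /var/log/containers/*.log
    include_file_path: true    # keep the source file path so Kubernetes metadata can be derived from it

exporters:
  otlphttp:
    endpoint: http://loki-gateway.logging.svc.cluster.local:80/otlp

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlphttp]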
Can I send logs directly to CloudWatch instead of Loki?
Yes. Replace the Loki [OUTPUT] block with the CloudWatch output:
[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            us-east-1
    log_group_name    /aws/eks/production/workloads
    log_stream_prefix pod/
    auto_create_group On

CloudWatch Logs Insights can query these logs. CloudWatch is simpler to operate on EKS (no additional infrastructure), but more expensive than Loki+S3 at scale and lacks Grafana's correlation features.
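A sample CloudWatch Logs Insights query against those log groups; the kubernetes.* field names assume the Kubernetes filter metadata is forwarded rather than removed:

fields @timestamp, kubernetes.pod_name, log
| filter kubernetes.namespace_name = "payments" and log like /ERROR/
| sort @timestamp desc
| limit 100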
Multiline Log Parsing
Container workloads often emit multiline log output — Java exception stack traces, Go panic dumps, and multi-line JSON blobs that the CRI runtime splits into separate log records. Fluent Bit handles these with built-in and custom multiline parsers.
Java Stack Traces
[INPUT]
Name tail
Path /var/log/containers/*_production_*.log
Tag kube.*
    multiline.parser java,cri    # Built-in Java multiline parser — joins indented lines after an exception header

The java parser uses the pattern that Java stack traces begin with an exception class name (e.g., java.lang.NullPointerException) and continuation lines are indented with whitespace. Fluent Bit buffers lines until the pattern breaks, then emits the joined record as a single log event.
Go Panic Traces
Go panics emit goroutine dumps that the CRI splits per line. Use a custom MULTILINE_PARSER with a regex:
[MULTILINE_PARSER]
name custom-go-panic
type regex
flush_timeout 1000
rule "start_state" "/(goroutine \d+)/gm" "go_state"
rule "go_state" "/^(\s+)/gm" "go_state"This regex identifies goroutine header lines as the start of a new multiline record and accumulates all subsequent indented lines into the same event.
Retry Limit and Disk Buffering
Setting Retry_Limit to False in the Fluent Bit output block (rather than a bounded count like the 5 used earlier) is worth calling out explicitly:
[OUTPUT]
Name loki
Match kube.*
...
    Retry_Limit False    # Buffer to disk and retry indefinitely — no log loss on Loki downtime

Without this, Fluent Bit drops log records after the default single retry (roughly 2× the configured flush interval). With Retry_Limit False, Fluent Bit buffers chunks to disk (storage.path in the SERVICE block) and retries until Loki accepts them. This means:
- No log loss during Loki restarts or rolling upgrades — chunks accumulate on the node disk and flush when Loki recovers
- Backpressure protection — the Mem_Buf_Limit on the INPUT block caps in-memory buffering; overflow spills to disk
- Trade-off: disk space on each node is consumed during outages — monitor the storage.path mount and set an appropriate storage.max_chunks_up to bound memory consumption (see the sketch after this list)
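A sketch of the corresponding SERVICE and INPUT storage settings (the limits are illustrative):

[SERVICE]
    Flush                     1
    storage.path              /var/log/flb-storage/   # filesystem buffer location (mounted below)
    storage.sync              normal
    storage.checksum          off
    storage.max_chunks_up     128      # cap on chunks held in memory; the rest stay on disk
    storage.backlog.mem_limit 50MB     # memory budget for replaying buffered chunks after a restart

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    storage.type  filesystem             # enable disk buffering for this input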
For production deployments, pair Retry_Limit False with a hostPath volume for storage.path so buffers survive Fluent Bit pod restarts:
daemonSetVolumes:
  - name: flb-storage
    hostPath:
      path: /var/log/flb-storage    # Survives Fluent Bit pod restarts

daemonSetVolumeMounts:
  - name: flb-storage
    mountPath: /var/log/flb-storage

For the OpenTelemetry Collector approach to log collection that unifies traces, metrics, and logs in one pipeline, see OpenTelemetry on Kubernetes: Collector, Auto-Instrumentation, and the Operator. For Prometheus metrics that complement Loki's log-based metrics, see Prometheus and Grafana on Kubernetes: Production Monitoring Stack.
Setting up centralized logging for an EKS cluster or migrating from CloudWatch Logs to Loki? Talk to us at Coding Protocols — we help platform teams build logging pipelines that scale to terabytes without breaking the on-call budget.


