Linux Commands for Advanced Engineers: Debugging, systemd & Kernel Internals
Go deep — strace, lsof, tcpdump, systemd units, cgroup and namespace primitives, kernel parameter tuning, and shell scripting patterns for production-grade Linux engineering.
Before you begin
- Solid intermediate Linux skills (pipes, processes, SSH, networking)
- Basic shell scripting (loops, variables, conditionals)
- A Linux system — not macOS (several tools here are Linux-only)
Intermediate Linux gets you productive. Advanced Linux gets you dangerous — in the good way. This tutorial covers the tools that let you see exactly what a process is doing at the system level, configure Linux as a service runtime, tune kernel parameters for production workloads, and understand the primitives that containers are built on.
This is the knowledge that separates engineers who can fix a hung Kubernetes node at 3am from engineers who can't.
1. Syscall Tracing — strace
Every interaction a process has with the kernel (reading files, opening sockets, allocating memory) is a system call. strace shows you all of them in real time.
strace ls # Trace syscalls made by ls
strace -p 1234 # Attach to running process by PID
strace -e openat,read,write ls # Filter to specific syscalls
strace -c ls # Summary: count and time per syscall
strace -f -p 1234 # Follow forked children too
strace -o /tmp/trace.txt -p 1234 # Write output to file
What to look for
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0...", 832) = 832
Common patterns:
- ENOENT (No such file or directory): missing config file, broken library path
- EACCES (Permission denied): file permission issue
- EAGAIN/EWOULDBLOCK: non-blocking I/O waiting for data
- Lots of futex calls: thread synchronisation (can indicate lock contention)
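Practical use: a process exits immediately with no output, no logs, and exit code 1. strace -e openat,stat <cmd> often shows exactly which file it is trying and failing to read. On recent glibc, stat() calls usually show up as newfstatat or statx, so strace -e trace=file <cmd> is the broader filter when openat,stat shows nothing useful.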
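One way to apply this is to save a trace to a file and filter it for failed calls. The sketch below assumes a log written with strace -o (for example strace -f -o /tmp/trace.txt <cmd>). The function names failed_syscalls and errno_summary are illustrative helpers, not part of strace.

```shell
# Print only the failed syscalls (return value -1 plus an errno name)
# from a saved strace log.
# Usage: failed_syscalls /tmp/trace.txt
failed_syscalls() {
  grep -E ' = -1 E[A-Z0-9]+ ' "$1"
}

# Count failures per errno name, most frequent first.
# Usage: errno_summary /tmp/trace.txt
errno_summary() {
  failed_syscalls "$1" \
    | sed -E 's/.* = -1 (E[A-Z0-9]+) .*/\1/' \
    | sort | uniq -c | sort -rn
}
```

A log dominated by ENOENT on paths under one directory usually points at a missing config or library path. A single EACCES just before exit points at permissions.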
2. Open File Descriptors — lsof
lsof (list open files) shows everything a process has open: files, sockets, pipes, devices.
lsof -p 1234 # All open files for PID 1234
lsof -u ajeet # All files opened by user ajeet
lsof -i :8080 # What process is using port 8080
lsof -i TCP # All TCP connections
lsof -i TCP:8080-9000 # Range of ports
lsof +D /var/log/ # All processes with files open in /var/log/
lsof /var/log/app.log # Which processes have app.log open
fuser — simpler port/file queries
fuser 8080/tcp # PID using TCP port 8080
fuser -k 8080/tcp # Kill the process using port 8080
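fuser -m /mnt/disk # What's preventing unmount (-m covers every file on the mounted filesystem)
fuser -m /mnt/disk is the first thing to run when umount says "device is busy" (newer versions say "target is busy"). Without -m, fuser only reports processes that have the directory itself open.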
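When lsof is not installed, the same information is available under /proc. The sketch below counts a process's open descriptors and prints them next to its soft limit. fd_usage is a hypothetical helper name, and it assumes a Linux /proc with permission to read the target process's entries.

```shell
# Print "<open fds>/<soft limit>" for a PID using /proc (Linux only).
# Usage: fd_usage 1234
fd_usage() {
  local pid=$1 open limit
  if [ ! -r "/proc/$pid/fd" ]; then
    echo "cannot read /proc/$pid/fd (no such process, or insufficient permission)" >&2
    return 1
  fi
  # Each entry in /proc/<pid>/fd is one open descriptor.
  open=$(ls "/proc/$pid/fd" | wc -l)
  # Field 4 of the "Max open files" row is the soft limit.
  limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
  echo "$open/$limit"
}
```

A count creeping towards the limit over time is the usual signature of a descriptor leak.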
3. Network Packet Capture — tcpdump
tcpdump captures raw network packets. Essential for debugging TLS issues, unexpected traffic, and misbehaving services.
tcpdump -i eth0 # Capture all traffic on eth0
tcpdump -i any port 80 # HTTP traffic on any interface
tcpdump -i eth0 host 10.0.0.5 # Traffic to/from a specific host
tcpdump -i eth0 'tcp port 443 and host 10.0.0.5'
tcpdump -i eth0 -w /tmp/capture.pcap # Write to file for Wireshark
tcpdump -i eth0 -c 100 # Capture only 100 packets
tcpdump -i eth0 -nn # Don't resolve IPs or ports to names
-nn skips hostname and service-name lookups, so output shows raw IPs and port numbers and the capture is not slowed by DNS queries. -w captures to a file you can open in Wireshark for deep inspection.
IP routing and interfaces
ip addr # All interfaces and their IPs
ip addr show eth0 # Specific interface
ip route # Routing table
ip route get 8.8.8.8 # Which route would traffic to 8.8.8.8 take?
ip link set eth0 up # Bring interface up
ip neigh # ARP table
4. Disk and Storage
lsblk # Block devices tree
lsblk -f # Include filesystem types and UUIDs
fdisk -l # Partition tables
blkid # UUIDs and filesystem types
Mount and unmount
mount /dev/sdb1 /mnt/data # Mount device
mount -t nfs 10.0.0.5:/exports /mnt/nfs # Mount NFS share
umount /mnt/data # Unmount (fails if in use — use fuser first)
mount | grep sdb # See current mounts
cat /proc/mounts # All current mounts (including virtual)
Inodes and hard/soft links
ls -i file.txt # Show inode number
stat file.txt # Full file metadata including inode
df -i # Inode usage (can fill up before disk space does)
ln source.txt hardlink.txt # Hard link (same inode, same data)
ln -s /abs/path/to/source symlink.txt # Symbolic link (pointer to path)
Hard links are two directory entries pointing to the same inode. Deleting one name doesn't delete the data; the data is freed only when the last hard link is removed. Symlinks point to a path, so if the target moves or is deleted, the symlink breaks.
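The difference is easy to see in a throwaway directory. The sketch below is illustrative only: link_demo is a hypothetical function name, and all file names live in a temporary directory that is removed at the end.

```shell
# Demonstrate hard-link vs symlink behaviour in a temporary directory.
link_demo() {
  local dir
  dir=$(mktemp -d) || return 1
  echo "hello" > "$dir/source.txt"
  ln "$dir/source.txt" "$dir/hardlink.txt"     # second name for the same inode
  ln -s "$dir/source.txt" "$dir/symlink.txt"   # pointer to the path

  ls -1i "$dir/source.txt" "$dir/hardlink.txt" # both lines start with the same inode number

  rm "$dir/source.txt"                         # remove the original name
  cat "$dir/hardlink.txt"                      # data is still reachable via the hard link
  if [ ! -e "$dir/symlink.txt" ]; then         # -e follows the symlink, so a dangling link tests false
    echo "symlink is dangling"
  fi
  rm -rf "$dir"
}
```

Running link_demo prints the two matching inode lines, then hello, then the dangling-symlink message.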
"Inode exhaustion" (df -i shows 100%) is a real production failure mode — many small files (npm node_modules, log files) can exhaust inodes before disk space.
5. User and Group Management
useradd -m -s /bin/bash deploy # Create user with home dir and bash shell
useradd -r -s /usr/sbin/nologin appuser # System user, no login shell
usermod -aG docker ajeet # Add ajeet to docker group
usermod -aG sudo ajeet # Grant sudo access (Debian/Ubuntu; the group is wheel on RHEL/Fedora)
id ajeet # Show UID, GID, groups
passwd ajeet # Set password
userdel -r olduser # Delete user and home directory
groupadd appgroup # Create a group
sudo and sudoers
sudo command # Run command as root
sudo -u postgres psql # Run as a different user
sudo -i # Interactive root shell
sudo !! # Re-run last command with sudo
Edit sudoers safely with visudo (validates syntax before saving):
visudo
Common sudoers entries:
# Allow ajeet to run all commands without password
ajeet ALL=(ALL) NOPASSWD: ALL
# Allow deploy user to restart nginx only
deploy ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nginx
6. File Permissions Deep Dive
The standard permission string rwxr-xr-- breaks down as:
- rwx: owner (read, write, execute)
- r-x: group (read, no write, execute)
- r--: others (read only)
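The same three groups map directly onto the octal digits that chmod accepts: each digit is the sum of read (4), write (2) and execute (1). A minimal sketch of that mapping follows. octal_to_symbolic is a hypothetical helper written for illustration, and it ignores the special bits.

```shell
# Convert a 3-digit octal mode (e.g. 754) to its symbolic form (rwxr-xr--).
octal_to_symbolic() {
  local mode=$1 out="" digit i
  [[ $mode =~ ^[0-7]{3}$ ]] || {
    echo "expected three octal digits, got: $mode" >&2
    return 1
  }
  for (( i = 0; i < 3; i++ )); do
    digit=${mode:i:1}
    # Test each permission bit of the digit in turn.
    if (( digit & 4 )); then out+="r"; else out+="-"; fi
    if (( digit & 2 )); then out+="w"; else out+="-"; fi
    if (( digit & 1 )); then out+="x"; else out+="-"; fi
  done
  echo "$out"
}

octal_to_symbolic 754   # prints rwxr-xr--
```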
Octal notation
chmod 755 script.sh # rwxr-xr-x (owner full, group/others read+exec)
chmod 644 config.yaml # rw-r--r-- (owner read+write, others read)
chmod 600 ~/.ssh/id_rsa # rw------- (private key: owner only)
chmod 700 ~/.ssh # rwx------ (SSH dir: owner only)
Special bits
chmod u+s /usr/bin/passwd # Setuid: runs as file owner regardless of caller
chmod g+s /shared/dir # Setgid: new files inherit group of directory
chmod +t /tmp # Sticky bit: only a file's owner (or the directory owner, or root) can delete it
Check special bits with ls -l: an s in the owner execute position means setuid, an s in the group execute position means setgid, and a t in the others execute position means sticky.
ls -la /usr/bin/passwd
# -rwsr-xr-x 1 root root passwd
# ↑ 's' in owner execute = setuid (allows any user to change their own password)
7. systemd — Managing Services
systemd is the init system and service manager on nearly all modern Linux distributions (Debian, Ubuntu, RHEL, Fedora, Amazon Linux 2+).
Essential commands
systemctl status nginx # Service status
systemctl start nginx # Start
systemctl stop nginx # Stop
systemctl restart nginx # Restart
systemctl reload nginx # Reload config without restart (if supported)
systemctl enable nginx # Start on boot
systemctl disable nginx # Don't start on boot
systemctl list-units --type=service # List loaded, active service units (add --all for inactive ones)
systemctl list-units --failed # Show failed services
Logs with journalctl
journalctl -u nginx # All logs for nginx
journalctl -u nginx -f # Follow live
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx --since "2026-06-01 10:00" --until "2026-06-01 11:00"
journalctl -p err -u nginx # Error-level and above only
journalctl --disk-usage # How much space logs are using
journalctl --vacuum-size=500M # Trim logs to 500MB
Writing a service unit
Service unit files live in /etc/systemd/system/. Here's a minimal one:
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target
Wants=network.target

[Service]
Type=simple
User=deploy
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/server --port=8080
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
Environment=NODE_ENV=production
EnvironmentFile=/opt/myapp/.env

[Install]
WantedBy=multi-user.target

systemctl daemon-reload # Required after editing unit files
systemctl enable --now myapp # Enable and start immediately
journalctl -u myapp -f # Watch logs
Key Restart= values: no (never restart), on-failure (restart after a non-zero exit code, termination by a signal, a timeout, or a watchdog failure), always (restart however the service stopped).
8. Performance Tuning
ulimit — per-process limits
ulimit -n # Current soft open-file limit (commonly 1024; ulimit -Hn shows the hard limit)
ulimit -n 65536 # Raise limit for current shell session (up to the hard limit)
ulimit -a # Show all limits
For services, set the limits in the systemd unit file:
[Service]
LimitNOFILE=65536
LimitNPROC=4096
sysctl — kernel parameters
sysctl -a # All kernel parameters
sysctl net.core.somaxconn # Read a parameter
sysctl -w net.core.somaxconn=65535 # Write (temporary, until reboot)
To persist across reboots, write to /etc/sysctl.conf or /etc/sysctl.d/99-custom.conf:
echo "net.core.somaxconn = 65535" >> /etc/sysctl.d/99-custom.conf
sysctl -p /etc/sysctl.d/99-custom.conf # Apply immediately
Common production tuning parameters
# TCP connection backlog (important for high-traffic servers)
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# File descriptors (affects all processes)
fs.file-max = 2097152
# Time-wait sockets (reduce TIME_WAIT accumulation)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# VM memory overcommit (important for Redis, Java apps)
vm.overcommit_memory = 1
# Reduce swappiness to near-zero (Kubernetes nodes also require swapoff -a to fully disable swap)
vm.swappiness = 1
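Settings like these tend to drift: a file is edited but never applied, or a value is changed with sysctl -w and never persisted. A rough way to spot drift is to compare a sysctl.d-style file against the live values under /proc/sys. The helper names sysctl_path and sysctl_drift are hypothetical, and the sketch assumes simple "key = value" lines.

```shell
# Map a sysctl key to its /proc/sys path (dots become slashes).
sysctl_path() {
  echo "/proc/sys/${1//.//}"
}

# Report every "key = value" line in a file whose live value differs.
# Usage: sysctl_drift /etc/sysctl.d/99-custom.conf
sysctl_drift() {
  local key value path live
  while IFS='=' read -r key value || [ -n "$key" ]; do
    key=${key//[[:space:]]/}                       # strip whitespace around the key
    case $key in ''|'#'*|';'*) continue ;; esac    # skip blank lines and comments
    value=$(awk '{$1=$1; print}' <<< "$value")     # normalise whitespace in the value
    path=$(sysctl_path "$key")
    if [ ! -r "$path" ]; then
      echo "$key: not readable on this kernel"
      continue
    fi
    live=$(awk '{$1=$1; print}' "$path")
    if [ "$live" != "$value" ]; then
      echo "$key: file=$value live=$live"
    fi
  done < "$1"
}
```

No output means every key in the file matches the running kernel.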
perf — CPU profiling
perf stat ls # CPU counter summary for a command
perf top # Live CPU usage by function
perf record -p 1234 -g sleep 30 # 30-second profile of process
perf report # Analyse recorded data
9. Container Primitives — Namespaces and cgroups
Containers are not magic. They're Linux namespaces (isolation) and cgroups (resource limits) combined.
Namespaces
# List all namespaces for a process
ls -la /proc/1234/ns/

# Run a command in an isolated PID namespace (like a container)
unshare --pid --fork --mount-proc bash

# Enter the namespace of a running container/process
nsenter -t 1234 --net --pid bash

# See which namespace a process is in
ls -la /proc/1234/ns/net
nsenter is invaluable for debugging containers: run as root on the node, it puts you inside the network and PID namespace of a running pod without needing docker exec or kubectl exec. Because the mount namespace stays the host's, the host's debugging tools remain available even when the container image ships no shell.
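Each entry under /proc/<pid>/ns is a symlink whose target names the namespace, so two processes share a namespace exactly when those targets match. A small sketch of that comparison follows. same_namespace is a hypothetical helper, and reading another user's process entries generally requires root.

```shell
# Succeed (exit 0) if two PIDs share the given namespace type
# (net, pid, mnt, uts, ipc, user, cgroup); exit 1 if they differ,
# exit 2 if either link cannot be read.
# Usage: same_namespace net 1 1234
same_namespace() {
  local type=$1 a b
  a=$(readlink "/proc/$2/ns/$type") || return 2
  b=$(readlink "/proc/$3/ns/$type") || return 2
  [ "$a" = "$b" ]
}
```

For example, same_namespace net 1 1234 run as root reports whether PID 1234 is on the host network or in its own network namespace.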
cgroup v2
# Check if you're on cgroup v2
mount | grep cgroup2
cat /sys/fs/cgroup/cgroup.controllers

# See what a process belongs to
cat /proc/1234/cgroup

# Memory limit for a cgroup
cat /sys/fs/cgroup/system.slice/myapp.service/memory.max

# CPU quota as "<quota> <period>" in microseconds
# (100000 100000 = 1 CPU, 200000 100000 = 2 CPUs; "max" as quota = no limit)
cat /sys/fs/cgroup/system.slice/myapp.service/cpu.max
When a container is OOMKilled, it's the cgroup memory limit enforced by the kernel. When you set resources.limits.cpu in Kubernetes, it's a cgroup CPU quota. Understanding cgroups means understanding why Kubernetes resource limits work the way they do.
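The quota arithmetic is simple enough to script: the CPU allowance is the quota divided by the period. The sketch below takes the contents of a cpu.max file as a single string. cpu_max_to_cpus is a hypothetical helper name.

```shell
# Convert the contents of cpu.max ("<quota> <period>", both in microseconds)
# to a CPU count. A quota of "max" means no limit.
cpu_max_to_cpus() {
  local quota period
  read -r quota period <<< "$1"
  if [ "$quota" = "max" ]; then
    echo "unlimited"
  else
    awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
  fi
}

cpu_max_to_cpus "200000 100000"   # prints 2.00
cpu_max_to_cpus "50000 100000"    # prints 0.50
cpu_max_to_cpus "max 100000"      # prints unlimited
```

On a live system the argument would come from the file itself, for example cpu_max_to_cpus "$(cat /sys/fs/cgroup/system.slice/myapp.service/cpu.max)".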
10. Shell Scripting for Production
Defensive defaults
Always start scripts with:
#!/usr/bin/env bash
set -euo pipefail
- set -e: exit immediately if a command fails (commands tested by if, &&, or || are exempt)
- set -u: treat unset variables as errors
- set -o pipefail: a pipeline fails if any command in it fails (without this, false | true succeeds)
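The pipefail behaviour is worth seeing once. The sketch below runs the same pipeline in two fresh bash processes, so the option never leaks into the calling shell. The function names are illustrative.

```shell
# Exit status of the pipeline "false | true" without pipefail:
# only the last command's status counts.
status_without_pipefail() {
  bash -c 'false | true; echo $?'
}

# With pipefail, the failing "false" decides the pipeline's status.
status_with_pipefail() {
  bash -c 'set -o pipefail; false | true; echo $?'
}

status_without_pipefail   # prints 0
status_with_pipefail      # prints 1
```

This is why a pipeline such as some_command | tee log.txt can hide a failure in some_command unless pipefail is set.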
Functions and error handling
#!/usr/bin/env bash
set -euo pipefail

log() {
  echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*"
}

die() {
  log "ERROR: $*" >&2
  exit 1
}

cleanup() {
  log "Cleaning up..."
  rm -f /tmp/deploy.lock
}

[[ -f /tmp/deploy.lock ]] && die "Deploy already in progress"
touch /tmp/deploy.lock
# Register the trap only after taking the lock; otherwise a refused run
# would delete the lock that belongs to the deploy already in progress.
trap cleanup EXIT # runs cleanup() on exit, even on error

log "Starting deploy..."
Looping over files
for file in *.yaml; do
  [[ -e "$file" ]] || continue # skip the literal "*.yaml" when nothing matches
  echo "Processing $file"
  kubectl apply -f "$file"
done

# Loop with array
services=("api" "worker" "scheduler")
for svc in "${services[@]}"; do
  kubectl rollout restart deployment/"$svc"
done
Checking exit codes
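if kubectl get pod "$POD" &>/dev/null; then
  echo "Pod exists"
else
  echo "Pod not found"
fi

# Retry pattern: stop at the first success, fail after 5 attempts
healthy=false
for i in {1..5}; do
  curl -sf https://api.internal/health && { healthy=true; break; }
  echo "Attempt $i failed, retrying..."
  sleep 5
done
[[ $healthy == true ]] || { echo "Health check failed after 5 attempts" >&2; exit 1; }
Here documents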
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DB_HOST: "${DB_HOST}"
  APP_ENV: production
EOF
Where to Go From Here
This tutorial completes the Linux Foundations learning path:
- Beginner — Filesystem, basic file ops, grep, find
- Intermediate — Pipes, text processing, processes, SSH, cron
- Advanced (this tutorial) — strace, systemd, cgroups, kernel tuning, production scripting
With these foundations you're ready for Stage 2 of the Platform Engineering Roadmap — containers and Docker — where namespaces and cgroups that you just learned about become the building blocks of everything you'll run in production.