If You Cannot See It, You Cannot Operate It
Module 6 – Observability Foundations
Visibility Before Automation
Infrastructure fails quietly before it fails loudly.
CPU spikes.
Disk fills.
Memory fragments.
Services restart.
Connections time out.
If you only notice failure after users complain, you are not operating — you are reacting.
Observability is the discipline of seeing system behavior in real time.
This module builds foundational operational awareness before introducing advanced monitoring stacks.
1. Observability vs Monitoring
Monitoring asks:
Is the system up?
Observability asks:
Why is the system behaving this way?
Monitoring is binary.
Observability is diagnostic.
You begin with system-native tools.
2. System Resource Visibility
CPU Monitoring
top
or
htop
Observe:
- Load average
- CPU utilization
- Process CPU consumption
Load average rule of thumb:
If the load average consistently exceeds the number of CPU cores → system is stressed.
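The rule of thumb can be checked directly from /proc/loadavg (the same values top and uptime report); a minimal sketch:

```shell
# Sketch: compare the 1-minute load average with the CPU core count.
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
# awk handles the floating-point comparison; shell arithmetic is integer-only
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "STRESSED: load ${load1} exceeds ${cores} cores"
else
  echo "OK: load ${load1} within ${cores} cores"
fi
```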
Memory Monitoring
free -h
Check:
- Used memory
- Available memory
- Swap usage
If swap usage increases consistently:
- Memory allocation is insufficient
- Application behavior needs investigation
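The swap figures that free reports come from /proc/meminfo; a sketch computing swap usage as a percentage:

```shell
# Sketch: compute swap usage as a percentage of total swap from /proc/meminfo.
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
if [ "$swap_total" -gt 0 ]; then
  swap_pct=$(( (swap_total - swap_free) * 100 / swap_total ))
  echo "swap usage: ${swap_pct}%"
else
  echo "no swap configured"
fi
```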
Disk Monitoring
df -h
Monitor:
- Root filesystem usage
- /var usage
- Custom logical volumes
When disk usage exceeds 85%, alerts should trigger.
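The 85% rule can be checked with df alone; a sketch using GNU df's --output option:

```shell
# Sketch: flag any filesystem at or above the 85% threshold (GNU df).
THRESHOLD=85
df --output=pcent,target | tail -n +2 | while read -r pcent target; do
  usage=${pcent%\%}
  # skip pseudo-filesystems that report "-" instead of a percentage
  case "$usage" in *[!0-9]*|'') continue ;; esac
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "ALERT: ${target} at ${usage}%"
  fi
done
```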
3. Log-Based Observability
Logs are your first diagnostic layer.
View system journal:
sudo journalctl
View specific service logs:
sudo journalctl -u nginx
Follow logs in real time:
sudo journalctl -f
Logs answer:
- What happened?
- When did it happen?
- Which process was involved?
Ignoring logs removes your historical insight.
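journalctl can narrow by time and priority as well as by unit (-u, --since, and -p are standard flags); a sketch showing only recent errors, where the unit name nginx is this module's example:

```shell
# Sketch: show only errors from the last hour for one unit.
unit=nginx
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u "$unit" --since "1 hour ago" -p err --no-pager | tail -n 20
else
  echo "journalctl not available on this host"
fi
```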
4. Service Health Inspection
Check service status:
sudo systemctl status nginx
List failed services:
sudo systemctl --failed
Operational awareness means checking health proactively.
Do not wait for service crashes.
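Proactive checking can be scripted; a sketch sweeping a list of services (the service names are lab examples):

```shell
# Sketch: proactive health sweep over a list of services.
checked=0
for svc in nginx sshd; do
  if command -v systemctl >/dev/null 2>&1; then
    # is-active exits nonzero for inactive units; || true keeps the sweep going
    state=$(systemctl is-active "$svc" 2>/dev/null || true)
    echo "${svc}: ${state:-unknown}"
  fi
  checked=$((checked + 1))
done
```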
5. Network Observability
Check listening ports:
sudo ss -tulnp
Inspect active connections:
sudo ss -tn
Network inspection answers:
- Who is connected?
- What services are exposed?
- Is traffic abnormal?
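ss is the primary tool here; as a sketch of the data it reads, you can count established connections straight from /proc/net/tcp (state code 01 = ESTABLISHED):

```shell
# Sketch: count ESTABLISHED TCP connections from /proc/net/tcp.
# Field 4 of each entry is the connection state; 01 means ESTABLISHED.
established=$(awk 'NR > 1 && $4 == "01"' /proc/net/tcp | wc -l)
echo "established TCP connections: ${established}"
```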
6. Process Inspection
List processes sorted by memory usage:
ps aux --sort=-%mem | head
List processes sorted by CPU usage:
ps aux --sort=-%cpu | head
Understanding process behavior is critical during incidents.
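The same question — which process is heaviest — can also be answered from /proc directly; a sketch scanning VmRSS:

```shell
# Sketch: find the process with the largest resident set by scanning /proc/<pid>/status.
best_pid=0
best_rss=0
for status in /proc/[0-9]*/status; do
  rss=$(awk '/^VmRSS:/ {print $2}' "$status" 2>/dev/null)
  [ -n "$rss" ] || continue  # kernel threads have no VmRSS
  if [ "$rss" -gt "$best_rss" ]; then
    best_rss=$rss
    pid=${status#/proc/}
    best_pid=${pid%/status}
  fi
done
echo "largest process: pid ${best_pid}, ${best_rss} kB resident"
```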
7. Basic Alert Mindset
Even without monitoring tools, you should define thresholds.
Examples:
- CPU > 80% sustained for 5 minutes
- Disk > 85%
- Swap > 20%
- Service restart count > 3 in 10 minutes
Observability begins with defining abnormal behavior.
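Even without tooling, the example thresholds can be encoded as a tiny sweep; a sketch checking two of them (load vs cores, swap > 20%):

```shell
# Sketch: check two of the example thresholds above and count alerts raised.
alerts=0
cores=$(nproc)
load1=$(awk '{print int($1 + 0.5)}' /proc/loadavg)   # rounded 1-minute load
if [ "$load1" -gt "$cores" ]; then
  echo "ALERT: load ${load1} exceeds ${cores} cores"
  alerts=$((alerts + 1))
fi
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
if [ "$swap_total" -gt 0 ] && [ $(( (swap_total - swap_free) * 100 / swap_total )) -gt 20 ]; then
  echo "ALERT: swap above 20%"
  alerts=$((alerts + 1))
fi
echo "${alerts} alert(s) raised"
```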
8. Simulating Observability Scenarios
CPU Stress Simulation
Install stress tool:
sudo dnf install stress -y
Run:
stress --cpu 2 --timeout 60
Observe:
top
Watch load increase.
Disk Pressure Simulation
sudo fallocate -l 1G /data/fillfile
Monitor:
df -h
Observe how the system behaves as it approaches the threshold.
Clean up afterward:
sudo rm /data/fillfile
9. Centralized Thinking (Preview of Advanced Observability)
In production:
- Logs are centralized
- Metrics are aggregated
- Alerts are automated
- Dashboards visualize trends
Foundational tools prepare you for:
- Prometheus
- Grafana
- ELK stack
- Cloud monitoring systems
But without an understanding of raw system behavior, dashboards can be misleading.
10. Multi-Node Observability
In your multi-node lab:
- Monitor app-node CPU
- Monitor db-node disk
- Monitor service restarts
Ask:
- Does database failure increase CPU on app node?
- Does network latency affect response times?
- Does log growth correlate with traffic?
Observability connects cause and effect.
11. Snapshot After Baseline Monitoring
Once observability tools and baseline configurations are validated:
Take snapshot:
06-observability-baseline
This snapshot represents a stable, monitored system.
12. Lab Assignment
- Monitor CPU, memory, and disk on all nodes.
- Simulate CPU stress.
- Simulate disk pressure.
- Stop a service intentionally.
- Observe:
- Logs
- Resource metrics
- System behavior
- Document:
- What changed?
- What symptoms appeared first?
- What would an alert detect?
Deliverable:
Write a short operational analysis of one simulated failure.
If you cannot explain system behavior during stress, you cannot operate production systems.
13. Production Reflection
Consider:
- What metrics matter most in production?
- What does “mean time to detect” (MTTD) mean?
- How would you automate alerting?
- How would you prevent alert fatigue?
Observability is not tool-driven.
It is mindset-driven.
Module Completion Criteria
You have completed the DevOps Lab Engineering course when:
- Infrastructure is segmented.
- Systems are hardened.
- Storage is engineered.
- Nodes are distributed.
- System behavior is observable.
- Snapshots are versioned.
You now have:
A controlled, production-style DevOps lab.