If You Cannot See It, You Cannot Operate It


Module 6 – Observability Foundations

Visibility Before Automation

Infrastructure fails quietly before it fails loudly.

CPU spikes.
Disk fills.
Memory fragments.
Services restart.
Connections time out.

If you only notice failure after users complain, you are not operating — you are reacting.

Observability is the discipline of seeing system behavior in real time.

This module builds foundational operational awareness before introducing advanced monitoring stacks.


1. Observability vs Monitoring

Monitoring asks:

Is the system up?

Observability asks:

Why is the system behaving this way?

Monitoring is binary.
Observability is diagnostic.

You begin with system-native tools.


2. System Resource Visibility

CPU Monitoring

top

or

htop

Observe:

  • Load average
  • CPU utilization
  • Process CPU consumption

Load average rule of thumb:

If load > number of CPU cores → system is stressed.
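The rule of thumb above can be scripted. A minimal sketch, assuming a Linux host with `nproc` and `/proc/loadavg` available:

```shell
#!/bin/sh
# Rough load check implementing the rule of thumb above
# (a quick sketch, not a substitute for watching top).
cores=$(nproc)
load1=$(cut -d ' ' -f 1 /proc/loadavg)
# Load averages are decimals, so compare with awk rather than shell arithmetic.
stressed=$(awk -v l="$load1" -v c="$cores" 'BEGIN { print ((l > c) ? "yes" : "no") }')
echo "cores=$cores load1=$load1 stressed=$stressed"
```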


Memory Monitoring

free -h

Check:

  • Used memory
  • Available memory
  • Swap usage

If swap usage increases consistently:

  • Memory is undersized for the workload
  • Application memory behavior needs investigation
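A swap check can be scripted from the same `free` output. A sketch, using the 20% threshold defined later in section 7:

```shell
#!/bin/sh
# Warn when swap usage exceeds 20% of total swap.
swap_total=$(free | awk '/^Swap:/ { print $2 }')
swap_used=$(free | awk '/^Swap:/ { print $3 }')
# Guard against division by zero on hosts with no swap configured.
if [ "$swap_total" -gt 0 ] && [ $((swap_used * 100 / swap_total)) -gt 20 ]; then
    echo "WARN: swap usage above 20%"
else
    echo "OK: swap usage within threshold"
fi
```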

Disk Monitoring

df -h

Monitor:

  • Root filesystem usage
  • /var usage
  • Custom logical volumes

When disk usage exceeds 85%, alerts should trigger.
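The 85% rule can be checked in one line against portable `df` output. A sketch; it prints one line per filesystem over threshold and nothing when all are healthy:

```shell
# Column 5 of `df -P` is the Use% field; strip the % sign and compare.
df -P | awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > 85) print "ALERT:", $6, "at", $5 "%" }'
```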


3. Log-Based Observability

Logs are your first diagnostic layer.

View system journal:

sudo journalctl

View specific service logs:

sudo journalctl -u nginx

Follow logs in real time:

sudo journalctl -f

Logs answer:

  • What happened?
  • When did it happen?
  • Which process was involved?

Ignoring logs removes your historical insight.
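journalctl filters compose. A hedged triage example, assuming nginx is the service under investigation:

```shell
# Errors only, last hour, no pager (standard journalctl flags).
sudo journalctl -u nginx -p err --since "1 hour ago" --no-pager

# A crude error count for the same window; -o cat strips journal metadata
# so only the raw message lines reach grep.
sudo journalctl -u nginx --since "1 hour ago" -o cat | grep -ci error
```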


4. Service Health Inspection

Check service status:

sudo systemctl status nginx

List failed services:

sudo systemctl --failed

Operational awareness means checking health proactively.

Do not wait for service crashes.
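Proactive checking can be scripted. A sketch that fails loudly when any unit is in the failed state:

```shell
#!/bin/sh
# Count failed systemd units; --no-legend suppresses header and footer lines.
failed=$(systemctl --failed --no-legend | wc -l)
if [ "$failed" -gt 0 ]; then
    echo "WARN: $failed failed unit(s)"
else
    echo "OK: no failed units"
fi
```

Run it from cron or a shell profile to surface failures before users do.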


5. Network Observability

Check listening ports:

sudo ss -tulnp

Inspect active connections:

sudo ss -tn

Network inspection answers:

  • Who is connected?
  • What services are exposed?
  • Is traffic abnormal?
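"Who is connected?" can be answered by grouping `ss` output by remote address. A sketch; the `sub()` strips the trailing port so connections aggregate per host (IPv6 peers would need different handling):

```shell
# Count established TCP connections per remote address, busiest first.
sudo ss -tn | awk 'NR > 1 { sub(/:[0-9]+$/, "", $5); count[$5]++ }
                   END { for (a in count) print count[a], a }' | sort -rn
```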

6. Process Inspection

List processes sorted by memory usage:

ps aux --sort=-%mem | head

List processes sorted by CPU usage:

ps aux --sort=-%cpu | head

Understanding process behavior is critical during incidents.
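During an incident, a single suspect process is often worth tracking over time. A sketch; PID 1234 is a placeholder for the process you identified with `ps`:

```shell
# Refresh CPU, memory, and RSS for one process every 2 seconds.
watch -n 2 'ps -o pid,pcpu,pmem,rss,comm -p 1234'
```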


7. Basic Alert Mindset

Even without monitoring tools, you should define thresholds.

Examples:

  • CPU > 80% sustained for 5 minutes
  • Disk > 85%
  • Swap > 20%
  • Service restart count > 3 in 10 minutes

Observability begins with defining abnormal behavior.
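The restart-count threshold above can also be checked by hand. A sketch; the service name and the "Started" journal message pattern are assumptions to adjust for your environment:

```shell
#!/bin/sh
# Count service starts logged in the last 10 minutes.
restarts=$(sudo journalctl -u nginx --since "10 minutes ago" --no-pager | grep -c "Started")
if [ "$restarts" -gt 3 ]; then
    echo "ALERT: $restarts starts in 10 minutes"
fi
```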


8. Simulating Observability Scenarios

CPU Stress Simulation

Install stress tool:

sudo dnf install stress -y

Run:

stress --cpu 2 --timeout 60

Observe:

top

Watch load increase.


Disk Pressure Simulation

sudo fallocate -l 1G /data/fillfile

Monitor:

df -h

Observe how system behaves near threshold.
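When finished, remove the fill file so the node returns to its baseline (path matches the command above):

```shell
# Delete the allocated file and confirm the space is reclaimed.
sudo rm -f /data/fillfile
df -h /data
```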


9. Centralized Thinking (Preview of Advanced Observability)

In production:

  • Logs are centralized
  • Metrics are aggregated
  • Alerts are automated
  • Dashboards visualize trends

Foundational tools prepare you for:

  • Prometheus
  • Grafana
  • ELK stack
  • Cloud monitoring systems

But without understanding raw system behavior, dashboards are misleading.


10. Multi-Node Observability

In your multi-node lab:

  • Monitor app-node CPU
  • Monitor db-node disk
  • Monitor service restarts

Ask:

  • Does database failure increase CPU on app node?
  • Does network latency affect response times?
  • Does log growth correlate with traffic?

Observability connects cause and effect.
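A manual sweep across nodes can answer these questions without any tooling. A sketch; the hostnames match the lab naming above and assume passwordless SSH from your workstation:

```shell
#!/bin/sh
# Collect load, root-disk usage, and memory from each lab node.
for node in app-node db-node; do
    echo "=== $node ==="
    ssh "$node" 'uptime; df -h /; free -h | sed -n 2p'
done
```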


11. Snapshot After Baseline Monitoring

Once observability tools and baseline configurations are validated:

Take snapshot:

06-observability-baseline

This snapshot represents a stable, monitored system.


12. Lab Assignment

  1. Monitor CPU, memory, and disk on all nodes.
  2. Simulate CPU stress.
  3. Simulate disk pressure.
  4. Stop a service intentionally.
  5. Observe:
    • Logs
    • Resource metrics
    • System behavior
  6. Document:
    • What changed?
    • What symptoms appeared first?
    • What would an alert detect?

Deliverable:

Write a short operational analysis of one simulated failure.

If you cannot explain system behavior during stress, you cannot operate production systems.


13. Production Reflection

Consider:

  • What metrics matter most in production?
  • What does “mean time to detect” (MTTD) mean?
  • How would you automate alerting?
  • How would you prevent alert fatigue?

Observability is not tool-driven.

It is mindset-driven.


Module Completion Criteria

You have completed the DevOps Lab Engineering course when:

  • Infrastructure is segmented.
  • Systems are hardened.
  • Storage is engineered.
  • Nodes are distributed.
  • System behavior is observable.
  • Snapshots are versioned.

You now have:

A controlled, production-style DevOps lab.
