
Automate Server Health Checks with This Ultimate Bash Script

In the world of DevOps, Cloud Computing, and SRE (Site Reliability Engineering), the worst kind of problem is the one that surprises you. An unexpected server crash, a maxed-out disk, or a CPU spike can bring your applications down, impacting users and your reputation. The solution is to move from a reactive to a proactive monitoring strategy.

This guide provides a starting point for that strategy: an all-in-one Bash script that runs a quick health check on a Linux server. It is lightweight and customizable, relies on standard tools such as `top`, `free`, `df`, and `bc`, and can serve as a first step towards a more robust observability pipeline.

What Does This Script Do?

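The script gathers key system telemetry into a single, easy-to-read report. It checks the vital signs of your server, including system information (hostname, OS, kernel, uptime), CPU usage and load average, memory usage, disk usage per filesystem, and the primary IP address and process count.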

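Alerts are highlighted in color: when CPU, memory, or disk usage crosses its configured threshold, the script prints a red CRITICAL message. Green and yellow color codes are also defined, so you can extend it to mark healthy and warning states.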
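A three-state scheme (green, yellow, red) can be sketched as a small helper. The function names `classify_usage` and `print_status`, and the default 75/90 cut-offs, are assumptions for illustration only; the script in this guide uses a single threshold per resource.

```shell
#!/bin/bash
# Hypothetical helper: map a usage percentage to a status label.
# The default warn/crit values (75/90) are illustrative assumptions.
classify_usage() {
    local value=$1 warn=${2:-75} crit=${3:-90}
    # awk compares fractional values such as 82.5 without needing bc
    awk -v v="$value" -v w="$warn" -v c="$crit" 'BEGIN {
        if (v + 0 >= c + 0)      print "CRITICAL"
        else if (v + 0 >= w + 0) print "WARNING"
        else                     print "OK"
    }'
}

# Print the label wrapped in the matching ANSI color (red, yellow, or green).
print_status() {
    local label
    label=$(classify_usage "$@")
    case $label in
        CRITICAL) printf '\033[0;31m%s\033[0m\n' "$label" ;;
        WARNING)  printf '\033[1;33m%s\033[0m\n' "$label" ;;
        *)        printf '\033[0;32m%s\033[0m\n' "$label" ;;
    esac
}
```

For example, `print_status 82.5` would print WARNING in yellow with the default cut-offs, while `print_status 82.5 85 95` would print OK in green.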

How to Use This Script:

  1. Save the Script: Copy the script below and save it to a file on your server named `health_check.sh`.
  2. Make It Executable: Open your terminal, navigate to the directory where you saved the file, and give it execute permission:
    chmod +x health_check.sh
  3. Run the Health Check: Execute the script directly from your terminal to get an instant report:
    ./health_check.sh
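The script calls several external utilities (`awk`, `bc`, `df`, `free`, `hostname`, `top`), and on minimal installations some of them, `bc` in particular, may not be present. A small pre-flight check can catch that before the first run. The function name `missing_commands` is hypothetical; this is a sketch, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: print every command from the argument list
# that cannot be found on the current PATH (one name per line).
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example usage (uncomment to run before the health check):
# missing=$(missing_commands awk bc df free hostname top)
# if [ -n "$missing" ]; then
#     printf 'Please install: %s\n' "$missing"
# fi
```

If the function prints nothing, every listed tool is available and the health check should be able to run.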

Taking It to the Next Level: True Automation

Running the script manually is useful, but its real power comes from automation: scheduling it (for example with cron) and recording or acting on the results.
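One common way to automate a script like this is a cron job that appends each report to a log file. The sketch below assumes a hypothetical wrapper function `run_and_log`; the log path, install path, and 15-minute schedule in the comments are placeholders to adapt, not values from the original article.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command and append its output,
# preceded by a timestamp line, to the given log file.
# Usage: run_and_log LOGFILE COMMAND [ARGS...]
run_and_log() {
    local logfile=$1
    shift
    {
        printf '===== %s =====\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@"
    } >> "$logfile" 2>&1
}

# Example call (paths are assumptions):
#   run_and_log /var/log/health_check.log /opt/scripts/health_check.sh
#
# Example crontab entry (added via `crontab -e`) that runs a script
# containing the call above every 15 minutes:
#   */15 * * * * /opt/scripts/health_check_logged.sh
```

Because the report contains ANSI color codes, the log is easiest to read with a pager that interprets them, such as `less -R`.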


#!/bin/bash

# =============================================================================
#           System Health Monitoring Script for Linux Servers (2025)
#
# Description:
#   An all-in-one Bash script to get a quick overview of system health.
#   It checks CPU, memory, disk usage, load average, and uptime.
#   Output is color-coded for at-a-glance readability.
#
# =============================================================================

# --- Configuration: Thresholds for Alerts ---
# These values can be adjusted to fit your server's baseline.
CPU_WARN_THRESHOLD=75
MEM_WARN_THRESHOLD=80
DISK_WARN_THRESHOLD=85

# --- Color Codes for Output ---
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# --- Helper Function for Headers ---
print_header() {
    printf "\n%s\n" "============================================================================="
    printf "    %-60s \n" "$1"
    printf "%s\n" "============================================================================="
}

# --- 1. System Information ---
print_header "System Information"
HOSTNAME=$(hostname)
OS=$(. /etc/os-release && echo "$PRETTY_NAME")
KERNEL=$(uname -r)
UPTIME=$(uptime -p)

printf "Hostname:         %s\n" "$HOSTNAME"
printf "Operating System: %s\n" "$OS"
printf "Kernel Version:   %s\n" "$KERNEL"
printf "System Uptime:    %s\n" "$UPTIME"

# --- 2. CPU Usage & Load Average ---
print_header "CPU Usage & Load Average"
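# LC_ALL=C keeps the decimal point and English labels that the parsing below expects.
CPU_USAGE=$(LC_ALL=C top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
LOAD_AVERAGE=$(LC_ALL=C uptime | awk -F'load average:' '{ print $2 }')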

printf "Current CPU Usage:  %.2f%%\n" "$CPU_USAGE"
printf "Load Average (1m, 5m, 15m):%s\n" "$LOAD_AVERAGE"

if (( $(echo "$CPU_USAGE > $CPU_WARN_THRESHOLD" | bc -l) )); then
    printf "${RED}CRITICAL: CPU usage is above the threshold of ${CPU_WARN_THRESHOLD}%%!${NC}\n"
fi

# --- 3. Memory Usage ---
print_header "Memory (RAM) Usage"
# Using `free -m` (values in MB); row 2 of its output is the "Mem:" line
MEM_INFO=$(free -m | awk 'NR==2{printf "Total: %s MB | Used: %s MB | Free: %s MB", $2, $3, $4}')
MEM_PERCENTAGE=$(free -m | awk 'NR==2{printf "%.2f", $3*100/$2 }')

printf "Memory Stats:     %s\n" "$MEM_INFO"
printf "Memory Usage:     %s%%\n" "$MEM_PERCENTAGE"

if (( $(echo "$MEM_PERCENTAGE > $MEM_WARN_THRESHOLD" | bc -l) )); then
    printf "${RED}CRITICAL: Memory usage is above the threshold of ${MEM_WARN_THRESHOLD}%%!${NC}\n"
fi

# --- 4. Disk Usage ---
print_header "Disk Filesystem Usage"
# df with -h (human-readable); -x excludes tmpfs, devtmpfs, and squashfs, which are not persistent disks.
DISK_USAGE=$(df -h -x tmpfs -x devtmpfs -x squashfs --output=source,pcent,size,used,avail)
printf "%s\n" "$DISK_USAGE"

# Check each filesystem against the threshold.
# -P keeps each filesystem on a single line so the column positions stay stable.
df -P -x tmpfs -x devtmpfs -x squashfs | awk 'NR > 1 { print $5, $1 }' | while read -r usage filesystem
do
  usage=${usage%\%}   # strip the trailing percent sign
  # Skip entries without a numeric usage value (df prints "-" for some filesystems)
  case $usage in ''|*[!0-9]*) continue ;; esac
  if [ "$usage" -ge "$DISK_WARN_THRESHOLD" ]; then
    printf "${RED}CRITICAL: Filesystem '%s' usage is at %s%%, which is above the threshold of %s%%!${NC}\n" "$filesystem" "$usage" "$DISK_WARN_THRESHOLD"
  fi
done

# --- 5. Network & Process Information ---
print_header "Network & Process Information"
IP_ADDRESS=$(hostname -I | awk '{print $1}')
# --no-headers keeps the column header line out of the count
TOTAL_PROCESSES=$(ps -e --no-headers | wc -l)
printf "Primary IP Address: %s\n" "$IP_ADDRESS"
printf "Total Processes:    %d\n" "$TOTAL_PROCESSES"

printf "\n%s\n" "=========================== Health Check Complete ==========================="
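The CPU check in the script parses the text output of `top`, whose layout can differ between versions and locales. As a possible alternative, CPU usage can be derived from two samples of the first line of `/proc/stat`. The function name `cpu_busy_percent` is hypothetical, and counting iowait as idle time is a design choice of this sketch, not something the original script specifies.

```shell
#!/bin/bash
# Hypothetical helper: compute the busy CPU percentage between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
cpu_busy_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal (fields 2-9); guest time is already part of user
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6            # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Example usage on a Linux host (one-second sampling window):
# first=$(head -n 1 /proc/stat); sleep 1; second=$(head -n 1 /proc/stat)
# cpu_busy_percent "$first" "$second"
```

The result is a plain number with two decimals, so it could be assigned to `CPU_USAGE` and compared against `CPU_WARN_THRESHOLD` in the same way as the value obtained from `top`.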