Automate Server Health Checks with This Ultimate Bash Script
In the world of DevOps, Cloud Computing, and SRE (Site Reliability Engineering), the worst kind of problem is the one that surprises you. An unexpected server crash, a maxed-out disk, or a CPU spike can bring your applications down, impacting users and your reputation. The solution is to move from a reactive to a proactive monitoring strategy.
This guide provides the foundational tool for that strategy: a powerful, all-in-one Bash script that performs a comprehensive health check on any Linux-based server. It's lightweight, customizable, and serves as the first step towards building a robust observability pipeline.
What Does This Script Do?
This script is a powerful utility that gathers critical system telemetry in a single, easy-to-read report. It checks the vital signs of your server, including:
- CPU Load: Checks the 1, 5, and 15-minute load averages to identify sustained high processing demand.
- Memory Usage: Reports the total, used, and free RAM, and flags a warning if usage exceeds a critical threshold.
- Disk Space: Scans all mounted filesystems and alerts you if any disk is close to being full.
- Running Processes: Provides a count of total running processes.
- Uptime: Shows how long the server has been running without a reboot.
The output is color-coded, making it instantly clear if a resource is in a healthy (green), warning (yellow), or critical (red) state.
How to Use This Script:
- Save the Script: Click the "Copy Script" button below and save the code into a file on your server named `health_check.sh`.
-
Make It Executable: This is a crucial step. Open your terminal, navigate to where you saved the file, and run this command to give it permission to execute:
chmod +x health_check.sh -
Run the Health Check: Execute the script directly from your terminal to get an instant report:
./health_check.sh
Taking It to the Next Level: True Automation
Running the script manually is useful, but its real power comes from automation. Here's how professionals use it:
- Automate with Cron: Schedule the script to run automatically at regular intervals. Open your crontab editor with `crontab -e` and add a line like this to run the check every hour:
0 * * * * /path/to/your/health_check.sh - Proactive Alerting: You can easily modify the script to send an email, a Slack notification, or a PagerDuty alert if any of the metrics cross the critical threshold. This is the core of a modern alerting strategy.
- CI/CD Integration: Integrate this script into your deployment pipeline. Run a health check on a server *before* deploying new code to ensure the environment is stable and ready.
#!/bin/bash
# =============================================================================
# System Health Monitoring Script for Linux Servers (2025)
#
# Description:
# An all-in-one Bash script to get a quick overview of system health.
# It checks CPU, memory, disk usage, load average, and uptime.
# Output is color-coded for at-a-glance readability.
#
# =============================================================================
# --- Configuration: Thresholds for Alerts ---
# These values can be adjusted to fit your server's baseline.
CPU_WARN_THRESHOLD=75
MEM_WARN_THRESHOLD=80
DISK_WARN_THRESHOLD=85
# --- Color Codes for Output ---
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# --- Helper Function for Headers ---
print_header() {
printf "\n%s\n" "============================================================================="
printf " %-60s \n" "$1"
printf "%s\n" "============================================================================="
}
# --- 1. System Information ---
print_header "System Information"
HOSTNAME=$(hostname)
OS=$(source /etc/os-release && echo $PRETTY_NAME)
KERNEL=$(uname -r)
UPTIME=$(uptime -p)
printf "Hostname: %s\n" "$HOSTNAME"
printf "Operating System: %s\n" "$OS"
printf "Kernel Version: %s\n" "$KERNEL"
printf "System Uptime: %s\n" "$UPTIME"
# --- 2. CPU Usage & Load Average ---
print_header "CPU Usage & Load Average"
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
LOAD_AVERAGE=$(uptime | awk -F'load average:' '{ print $2 }')
printf "Current CPU Usage: %.2f%%\n" "$CPU_USAGE"
printf "Load Average (1m, 5m, 15m):%s\n" "$LOAD_AVERAGE"
if (( $(echo "$CPU_USAGE > $CPU_WARN_THRESHOLD" | bc -l) )); then
printf "${RED}CRITICAL: CPU usage is above the threshold of ${CPU_WARN_THRESHOLD}%%!${NC}\n"
fi
# --- 3. Memory Usage ---
print_header "Memory (RAM) Usage"
# Using `free -m` for megabytes and modern awk for parsing
MEM_INFO=$(free -m | awk 'NR==2{printf "Total: %s MB | Used: %s MB | Free: %s MB", $2, $3, $4}')
MEM_PERCENTAGE=$(free -m | awk 'NR==2{printf "%.2f", $3*100/$2 }')
printf "Memory Stats: %s\n" "$MEM_INFO"
printf "Memory Usage: %s%%\n" "$MEM_PERCENTAGE"
if (( $(echo "$MEM_PERCENTAGE > $MEM_WARN_THRESHOLD" | bc -l) )); then
printf "${RED}CRITICAL: Memory usage is above the threshold of ${MEM_WARN_THRESHOLD}%%!${NC}\n"
fi
# --- 4. Disk Usage ---
print_header "Disk Filesystem Usage"
# df with -h (human-readable) and excluding tmpfs/devfs which are not persistent disks.
DISK_USAGE=$(df -h --output=source,pcent,size,used,avail | grep -vE 'tmpfs|devfs|squashfs')
printf "%s\n" "$DISK_USAGE"
# Check each filesystem against the threshold
df -H | grep -vE '^Filesystem|tmpfs|devtmpfs' | awk '{ print $5 " " $1 }' | while read output;
do
usage=$(echo $output | awk '{ print $1}' | sed 's/%//g')
filesystem=$(echo $output | awk '{ print $2 }')
if [ $usage -ge $DISK_WARN_THRESHOLD ]; then
printf "${RED}CRITICAL: Filesystem '%s' usage is at %s%%, which is above the threshold of %s%%!${NC}\n" "$filesystem" "$usage" "$DISK_WARN_THRESHOLD"
fi
done
# --- 5. Network Information ---
print_header "Network Information"
IP_ADDRESS=$(hostname -I | awk '{print $1}')
TOTAL_PROCESSES=$(ps aux | wc -l)
printf "Primary IP Address: %s\n" "$IP_ADDRESS"
printf "Total Running Processes: %d\n" "$TOTAL_PROCESSES"
printf "\n%s\n" "=========================== Health Check Complete ==========================="