An Introduction to the 3 Pillars of Observability: Logs, Metrics, and Traces

Introduction: Beyond the Black Box

Your application is slow. Users are complaining about intermittent errors. You look at your server logs, and all you see are thousands of cryptic `INFO` messages. This is the reality for too many engineering teams: their complex, distributed systems are a "black box." They know something is wrong, but they have no way to ask "why?"

This is the problem that **Observability** solves. It's an evolution of traditional monitoring. While monitoring tells you *that* something is wrong (e.g., "CPU is at 90%"), observability gives you the tools to ask *why* it's wrong. It's about understanding the internal state of your system from the outside. This understanding is built upon three foundational pillars: **Logs, Metrics, and Traces**.

Pillar 1: Logs - The Detailed Diary

What They Are

Analogy: Logs are like a detailed, timestamped diary written by your application. Every time a significant event happens—a user logs in, a database query fails, an API request is received—the application writes a line in its diary describing what happened.

Logs are immutable, event-level records. They are rich in context but can be overwhelming in volume. A simple `console.log("User logged in")` is a log, but for observability, we need more.

The Modern Standard: Structured Logging

The key to effective logging is structure. Instead of plain text, logs should be written as **JSON**. This makes them machine-readable and far easier to search, filter, and analyze in a logging platform like Splunk, Datadog, or Loki.

Bad Log (Hard to search):

INFO: User 123 authenticated successfully from IP 192.168.1.100 at 2025-10-25T10:00:00Z

Good Log (Easy to query in Splunk):

{
  "timestamp": "2025-10-25T10:00:00Z",
  "level": "info",
  "message": "User authenticated successfully",
  "userId": 123,
  "sourceIp": "192.168.1.100",
  "service": "auth-service"
}

With structured logs, you can easily ask complex questions like, "Show me all login failures for `userId: 456` from the `auth-service` in the last hour."
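To make this concrete, here is a minimal sketch of emitting that kind of structured log from a Node.js/TypeScript service, assuming the open-source `pino` library; the function name and fields are illustrative, not a required schema:

// Minimal structured-logging sketch using the open-source pino library.
// The fields (userId, sourceIp, service) mirror the example above and are
// illustrative rather than a required schema.
import pino from "pino";

const logger = pino({ base: { service: "auth-service" } }); // added to every line

function onLoginSuccess(userId: number, sourceIp: string): void {
  // Emits one JSON line: time, level, msg, plus service, userId, and sourceIp.
  logger.info({ userId, sourceIp }, "User authenticated successfully");
}

onLoginSuccess(123, "192.168.1.100");

Because every key is indexed by the logging platform, queries like the one above become simple filters instead of regex searches over free text.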

Pillar 2: Metrics - The Car's Dashboard

What They Are

Analogy: Metrics are like the dashboard of your car. They don't tell you the story of your whole trip (that's the log), but they give you a high-level, real-time view of the system's health: your speed (request rate), engine temperature (CPU usage), and fuel level (disk space).

Metrics are numeric, aggregatable data points measured over time. They are cheap to store and incredibly efficient for creating dashboards and alerts. The best-known framework for deciding what to measure is Google's "Four Golden Signals," from the Site Reliability Engineering book (a short instrumentation sketch follows the list):

  • Latency: How long do requests take? (e.g., P95 API response time)
  • Traffic: How much demand is on the system? (e.g., requests per second)
  • Errors: How often are things failing? (e.g., the rate of 500-level HTTP errors)
  • Saturation: How "full" is the system? (e.g., CPU utilization, memory usage)
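As a rough sketch of what collecting these signals looks like in code, here is how a Node.js/TypeScript service might track traffic and latency with the open-source `prom-client` library; the metric names, labels, and bucket boundaries are illustrative choices, not part of the Golden Signals themselves:

// Sketch: tracking two Golden Signals (traffic and latency) with prom-client.
import client from "prom-client";

// Traffic: a counter of requests, labeled by route and status code.
const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["route", "status"],
});

// Latency: a histogram of request durations, in seconds.
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["route"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Call this when a request finishes. Errors and saturation are handled the
// same way, typically with another counter and a gauge.
function recordRequest(route: string, status: number, seconds: number): void {
  httpRequests.inc({ route, status: String(status) });
  httpDuration.observe({ route }, seconds);
}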

Pillar 3: Distributed Traces - The Detective's Red String

What They Are

Analogy: If logs are a diary and metrics are a dashboard, a trace is like a detective's corkboard with red string. It follows a single request from the moment it enters your system, tracking its journey as it hops from the frontend to the API gateway, to the auth service, to the database, and back again. It shows you exactly how long each step took.

A trace records a single request's entire lifecycle across a distributed system as a series of timed "spans," one per operation, usually visualized as a waterfall. It is the most powerful tool for debugging latency issues. When a user says, "My request took 3 seconds," a trace can tell you that 2.5 of those seconds were spent waiting for a slow, external API call.

The Modern Standard: OpenTelemetry (OTel)

To achieve this, your services need to be "instrumented." This means adding a library, like the open-source standard **OpenTelemetry**, to your code. OTel automatically propagates trace context, including a unique `traceId`, in request headers (the W3C `traceparent` header) as each request moves between services, allowing a tracing backend (like Jaeger or Honeycomb) to stitch the journey together.
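For illustration, here is roughly what manual instrumentation with the OpenTelemetry API looks like in TypeScript; auto-instrumentation packages can create many of these spans for you, and the span name, attribute, and `callPaymentProvider` helper below are hypothetical:

// Sketch of manual instrumentation with the OpenTelemetry API.
// The span name, attribute, and callPaymentProvider helper are hypothetical.
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("payment-service");

async function callPaymentProvider(orderId: string): Promise<void> {
  // Placeholder for the real outbound HTTP call.
}

async function chargeCustomer(orderId: string): Promise<void> {
  await tracer.startActiveSpan("charge-customer", async (span) => {
    span.setAttribute("order.id", orderId); // searchable in Jaeger/Honeycomb
    try {
      await callPaymentProvider(orderId);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // the span's duration becomes one timed step in the trace
    }
  });
}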

Putting It All Together: A Real-World Example with Akamai & Splunk

Imagine a user request is slow. How do our pillars help?

  1. Metrics Alert You: A dashboard in Grafana or Splunk shows a spike in P99 latency for your `/checkout` API endpoint. You know *what* is slow.
  2. Traces Tell You Where: You find a trace for a slow request. It shows that the API gateway is fast, the auth service is fast, but the call to the `payment-service` is taking 3 seconds. You now know *where* the problem is.
  3. Logs Tell You Why: You go to the logs for the `payment-service` and filter by that request's `traceId`. You find a structured log message: `{"level":"error", "message":"Timeout connecting to upstream payment provider", "provider":"Stripe", "timeoutMs":3000}`. You now know *why* it's slow.
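The glue that makes step 3 possible is attaching the active `traceId` to every log line. Here is a minimal sketch combining the OpenTelemetry API with `pino`; the `traceId` field name and the service/log fields are conventions you would choose yourself:

// Sketch: attach the active trace's ID to a structured log line so that
// logs and traces can be joined on traceId in Splunk (or any log backend).
import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino({ base: { service: "payment-service" } });

function logError(message: string, fields: Record<string, unknown> = {}): void {
  const traceId = trace.getActiveSpan()?.spanContext().traceId; // undefined outside a span
  logger.error({ ...fields, traceId }, message);
}

logError("Timeout connecting to upstream payment provider", {
  provider: "Stripe",
  timeoutMs: 3000,
});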

Expert Implementation: Akamai Cloud Monitor & Splunk

In a large enterprise, your CDN is a critical source of observability data. Akamai Cloud Monitor can be configured to push rich, real-time data from the edge directly into your logging platform.

The How: In Akamai's Property Manager, you add the "Cloud Monitor" behavior. You configure it with the endpoint for your Splunk HTTP Event Collector (HEC). You can choose which data points to include in the JSON payload sent to Splunk, such as request time, cache status, WAF details, and client IP information.
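For context, Splunk's HTTP Event Collector accepts JSON events over HTTPS with a token in the `Authorization` header. The sketch below shows the general shape of such a request; the hostname, token, sourcetype, and event fields are placeholders, and the actual payload Cloud Monitor sends depends on which data sets you enable in the behavior:

// Sketch of a Splunk HTTP Event Collector (HEC) request. The URL, token,
// sourcetype, and event fields are placeholders; Akamai builds and sends
// its own payload based on the Cloud Monitor configuration.
// (Uses the global fetch available in Node 18+.)
const HEC_URL = "https://splunk.example.com:8088/services/collector/event";
const HEC_TOKEN = "<your-hec-token>";

async function sendEdgeEvent(): Promise<void> {
  await fetch(HEC_URL, {
    method: "POST",
    headers: {
      Authorization: `Splunk ${HEC_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      sourcetype: "akamai:cloudmonitor",
      event: {
        reqTimeSec: "0.312",
        cacheStatus: "HIT",
        clientIp: "203.0.113.10",
        traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
      },
    }),
  });
}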

The Payoff: Now, in Splunk, you can correlate your application logs with the edge delivery logs from Akamai using a shared `traceId` (which you can pass via a request header). You can build dashboards that show cache offload rates, WAF trigger counts, and geographic latency, all in one place, giving you a complete end-to-end view of your system's performance and security.

Conclusion: Ask, Don't Guess

Observability is a cultural shift. It's about instrumenting your code to ask questions later, not just to log what happened. By combining the "what" of Metrics, the "where" of Traces, and the "why" of Logs, you move from a reactive state of guessing and firefighting to a proactive state of understanding and engineering. This is the foundation of building and maintaining reliable, high-performance systems in 2025.
