Observability for a fintech platform — logs, metrics, traces and audit

Structured JSON logs, propagated trace ID, OpenTelemetry, P1/P2/P3 alerts with runbooks, append-only audit log: the toolkit to diagnose a fintech incident in 5 minutes.

On a financial platform, "works in dev" isn't enough. Here's how I instrument logs, metrics and traces so a 3 AM incident finds its root cause in 5 minutes.

1. The context

Nexus is a financial investment platform. When a user makes a deposit, a dozen operations chain together: auth, KYC scoring, balance check, payment gateway call, confirmation webhook, account update, notification. If one breaks, the user sees "transaction failed" — but without observability, we don't know where it broke.

Observability isn't a luxury on this kind of platform. It's what turns a 3 AM incident from a 5-hour scramble into a 5-minute diagnosis. This article condenses what I put in place.

2. The three pillars — logs, metrics, traces

Logs

What happened. Structured text, readable by humans and machines. Granularity: one line per notable event.

Metrics

How many, how often. Aggregated time series. Granularity: one point per minute typically.

Distributed traces

The path of a request across services. One trace = one propagated ID. Granularity: one span per step.

The three are complementary. A metric alerts ("error rate spikes"), a log shows the cause ("Stripe returned 503"), a trace shows where in the chain ("between auth-service and payment-service, 8s latency").

3. Pattern 1 — JSON-structured logs

Unstructured logs are useless for correlation. All Nexus logs are JSON with a stable structure.

TypeScript

// src/utils/logger.ts
import winston from "winston";

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL ?? "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json(),
  ),
  defaultMeta: { service: "nexus-api", env: process.env.NODE_ENV },
  transports: [new winston.transports.Console()],
});

A typical line:

JSON

{
  "timestamp": "2026-03-12T14:32:08.412Z",
  "level": "info",
  "service": "nexus-api",
  "env": "production",
  "trace_id": "abc123",
  "user_id": "u_8jhq...",
  "operation": "deposit.initiate",
  "amount_xof": 50000,
  "gateway": "mtn_momo",
  "message": "Deposit initiated"
}

The trace_id is key. All logs of one request share it. Elasticsearch indexes that field, and Loki can filter on it at query time with its JSON parser, so a trace_id="abc123" search reconstructs the whole story.
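As a minimal sketch of that reconstruction, assuming the logs arrive as newline-delimited JSON shaped like the line above, the following filters one request's events and orders them by timestamp. The helper name `eventsForTrace` and the sample lines are illustrative and not part of the Nexus codebase; in production the log backend runs this search.

```typescript
// Hypothetical helper: rebuild one request's story from NDJSON log lines.
interface LogLine {
  timestamp: string;
  trace_id?: string;
  operation?: string;
  [key: string]: unknown;
}

function eventsForTrace(ndjson: string, traceId: string): LogLine[] {
  const matches: LogLine[] = [];
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as LogLine;
    if (entry.trace_id === traceId) matches.push(entry);
  }
  // ISO 8601 UTC timestamps of the same format sort chronologically as strings.
  return matches.sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0,
  );
}

// Illustrative sample data, deliberately out of order.
const sampleLogs = [
  '{"timestamp":"2026-03-12T14:32:09.100Z","trace_id":"abc123","operation":"deposit.gateway_call"}',
  '{"timestamp":"2026-03-12T14:32:08.412Z","trace_id":"abc123","operation":"deposit.initiate"}',
  '{"timestamp":"2026-03-12T14:32:08.900Z","trace_id":"zzz999","operation":"kyc.score"}',
].join("\n");

const story = eventsForTrace(sampleLogs, "abc123").map((e) => e.operation);
// story is ["deposit.initiate", "deposit.gateway_call"]
```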

4. Pattern 2 — Disciplined log levels

Four levels, period. A fifth kills readability.

Level	When	Example
ERROR	Something really broke and a human must know	Invalid webhook signature
WARN	Anomaly that doesn't break but must be watched	Retry after 3 failures
INFO	Notable business event	Payment confirmed, KYC validated
DEBUG	Technical details useful for investigation	Received payload, SQL query

Discipline rule: DEBUG is off in prod by default. It can be switched on temporarily through the LOG_LEVEL environment variable, without redeploying. If your prod logs are 80% DEBUG, you don't have logs, you have noise.
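To make the threshold idea concrete, here is a small sketch of how a level setting gates output. The names `isLevelEnabled` and `thresholdFromEnv` are hypothetical; a logging library such as winston applies an equivalent comparison internally, so this only illustrates the rule and does not replace the logger.

```typescript
// Illustrative ordering of the four levels, most severe first.
const LEVEL_ORDER = ["error", "warn", "info", "debug"] as const;
type Level = (typeof LEVEL_ORDER)[number];

// A line is emitted when its level is at least as severe as the threshold.
function isLevelEnabled(lineLevel: Level, threshold: Level): boolean {
  return LEVEL_ORDER.indexOf(lineLevel) <= LEVEL_ORDER.indexOf(threshold);
}

// Resolve the threshold from an environment value, falling back to "info"
// for missing or unknown values so DEBUG stays off unless explicitly enabled.
function thresholdFromEnv(value: string | undefined): Level {
  const candidate = (value ?? "").toLowerCase();
  const known = (LEVEL_ORDER as readonly string[]).indexOf(candidate) >= 0;
  return known ? (candidate as Level) : "info";
}

const defaultThreshold = thresholdFromEnv(undefined); // "info"
const debugInProdByDefault = isLevelEnabled("debug", defaultThreshold); // false
const debugWhenToggled = isLevelEnabled("debug", thresholdFromEnv("DEBUG")); // true
```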

5. Pattern 3 — Trace ID propagation

Without a propagated trace ID, you can't follow a request across services. On Nexus, the gateway generates a trace ID for each incoming request and propagates it as the HTTP header x-trace-id.

TypeScript

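// src/middlewares/traceContext.ts
import { randomUUID } from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";
import type { NextFunction, Request, Response } from "express";

// Per-request store, so any code on the same async path can read the trace ID.
const storage = new AsyncLocalStorage<{ traceId: string }>();

// Express-style middleware: reuse the caller's x-trace-id or mint a new one,
// echo it on the response, then run the rest of the request inside the store.
export function traceContext(req: Request, res: Response, next: NextFunction) {
  const traceId = req.header("x-trace-id") ?? randomUUID();
  res.setHeader("x-trace-id", traceId);
  storage.run({ traceId }, () => next());
}

export function currentTraceId(): string | undefined {
  return storage.getStore()?.traceId;
}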

The logger automatically includes trace_id in each log:

TypeScript

const log = (level: string, msg: string, meta: Record<string, unknown> = {}) =>
  logger.log({ level, message: msg, trace_id: currentTraceId(), ...meta });

Each downstream HTTP call propagates the header. The trace becomes unbroken.
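One way to make that propagation hard to forget is a small helper that every outbound call goes through. The sketch below is an assumption about how this could be wired, not the Nexus implementation: `makeTraceHeaderInjector` is a hypothetical name, and the getter it receives would be `currentTraceId` in the real service (fixed values are used here so the example runs on its own).

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical factory: returns a function that adds x-trace-id to outbound
// headers whenever the supplied getter yields a trace ID.
function makeTraceHeaderInjector(getTraceId: () => string | undefined) {
  return (headers: HeaderMap = {}): HeaderMap => {
    const traceId = getTraceId();
    return traceId ? { ...headers, "x-trace-id": traceId } : { ...headers };
  };
}

// Stand-ins for currentTraceId inside and outside a request context.
const injectInsideRequest = makeTraceHeaderInjector(() => "abc123");
const injectOutsideRequest = makeTraceHeaderInjector(() => undefined);

const outboundHeaders = injectInsideRequest({ "content-type": "application/json" });
// outboundHeaders carries both content-type and x-trace-id: "abc123"

const backgroundHeaders = injectOutsideRequest();
// backgroundHeaders is empty: there is no trace ID to propagate
```

In a service, the result would be passed as the `headers` option of whatever HTTP client makes the downstream call, so the next hop reads the same ID in its own middleware.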

6. Pattern 4 — Business metrics + technical metrics

Two distinct families, two distinct dashboards.

Technical (red signals)

HTTP error rate per route (4xx, 5xx)
p50, p95, p99 latency per route
Saturation: CPU, memory, DB connections, queue depth
Per-service availability (regular probe)

Business (golden signals)

Deposit / withdrawal volume per minute
KYC completion rate
Time-to-money (from deposit click to user confirmation)
Platform-wide balance

TypeScript

import prom from "prom-client";

export const depositCounter = new prom.Counter({
  name: "nexus_deposit_total",
  help: "Total deposits initiated",
  labelNames: ["gateway", "currency", "status"],
});

export const depositLatency = new prom.Histogram({
  name: "nexus_deposit_seconds",
  help: "Deposit completion time",
  labelNames: ["gateway"],
  buckets: [0.5, 1, 2, 5, 10, 30],
});

A business alert ("deposit volume divided by 3 in the last hour") signals earlier than a technical alert ("payment service responds slower"). Both matter.
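The "divided by 3" rule can be written down as a comparison between the last hour and the hour before it. The sketch below works on raw deposit timestamps to stay self-contained; the function names are hypothetical, and in a Prometheus setup the same comparison would normally live in an alert rule over the deposit counter, not in application code.

```typescript
// Count events whose timestamp falls in [startMs, endMs).
function countInWindow(timestampsMs: number[], startMs: number, endMs: number): number {
  return timestampsMs.filter((t) => t >= startMs && t < endMs).length;
}

// True when the last hour's volume is at most 1/factor of the previous hour's.
function depositVolumeDropped(timestampsMs: number[], nowMs: number, factor = 3): boolean {
  const HOUR_MS = 60 * 60 * 1000;
  const lastHour = countInWindow(timestampsMs, nowMs - HOUR_MS, nowMs);
  const previousHour = countInWindow(timestampsMs, nowMs - 2 * HOUR_MS, nowMs - HOUR_MS);
  if (previousHour === 0) return false; // no baseline to compare against
  return lastHour * factor <= previousHour;
}

// Illustrative data: six deposits in the previous hour, two in the last hour.
const nowMs = 3 * 60 * 60 * 1000;
const previousHourDeposits = [0, 1, 2, 3, 4, 5].map((i) => 3_600_000 + i * 60_000);
const lastHourDeposits = [7_200_000, 7_260_000];

const dropDetected = depositVolumeDropped([...previousHourDeposits, ...lastHourDeposits], nowMs);
// dropDetected is true: 2 * 3 <= 6
```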

7. Pattern 5 — Distributed tracing with OpenTelemetry

For requests crossing several services, OpenTelemetry records each step as a span.

TypeScript

import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

With auto-instrumentation, each HTTP call, SQL query, Redis query becomes a span automatically. For business operations, add manual spans:

TypeScript

import { SpanStatusCode, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("nexus-api");

// DepositInput and doProcess are defined elsewhere in the service.
async function processDeposit(input: DepositInput) {
  return tracer.startActiveSpan("deposit.process", async (span) => {
    span.setAttribute("amount", input.amount);
    span.setAttribute("gateway", input.gateway);
    try {
      const result = await doProcess(input);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Normalize non-Error throwables before recording them on the span.
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Visualization in Tempo or Jaeger: you see the full cascade, how many milliseconds each step took, and where it broke.

8. Pattern 6 — Smart sampling

Tracing 100% of requests in production costs heavily in storage and bandwidth. Sampling:

100% of errors — always traced
100% of financial operations — non-negotiable
10% of read requests — sampled
1% of healthchecks — near zero

TypeScript

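import { AlwaysOnSampler, ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Root spans are sampled at 10%; a trace already sampled upstream is always kept.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
  remoteParentSampled: new AlwaysOnSampler(),
});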

Trace storage for the sampled read traffic drops by roughly a factor of ten, without losing the signals that matter.
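The per-category ratios listed above can be sketched as a plain decision function. This is an illustration under assumptions, not the OpenTelemetry sampler interface: the `RequestKind` classification and `shouldSample` are hypothetical names. It also covers only the decision made at the start of a request; whether a request ends in an error is known only once it finishes, so keeping 100% of errors usually requires tail-based sampling further down the pipeline.

```typescript
type RequestKind = "financial" | "read" | "healthcheck";

// Head-sampling ratios taken from the list above.
const SAMPLE_RATIO: Record<RequestKind, number> = {
  financial: 1.0,
  read: 0.1,
  healthcheck: 0.01,
};

// Deterministic decision: map the first 8 hex characters of the trace ID to
// [0, 1) and compare with the ratio, so every service reaches the same
// verdict for the same trace. A non-hex ID yields NaN and is not sampled.
function shouldSample(kind: RequestKind, traceIdHex: string): boolean {
  const ratio = SAMPLE_RATIO[kind];
  if (ratio >= 1) return true;
  const bucket = parseInt(traceIdHex.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}

const financialKept = shouldSample("financial", "ffffffffaaaabbbb"); // true
const readKept = shouldSample("read", "19000000aaaabbbb"); // true: 25/256 is below 0.1
const readDropped = shouldSample("read", "1a000000aaaabbbb"); // false: 26/256 is above 0.1
```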

9. Pattern 7 — Alerts that wake you vs alerts that inform

Three alert channels, three distinct urgencies:

P1 (wakes at 3 AM): platform down, payment impossible, corrupted balance
P2 (Slack business hours): degraded latency, error rate > 1%, growing queue
P3 (weekly digest): recurring warnings, declining business metrics

YAML

# prometheus/alerts.yml
groups:
  - name: nexus.p1
    rules:
      - alert: PaymentServiceDown
        expr: up{service="payment-service"} == 0
        for: 1m
        labels: { severity: p1 }
        annotations:
          summary: "Payment service is down"
          runbook: "https://wiki.nexus/runbooks/payment-down"

Each P1 alert has a runbook: 5 ordered steps. No "thinking at 3 AM" — execute runbook, verify, escalate if unresolved in 15 min.
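The routing and the 15-minute escalation rule are simple enough to state as code. The sketch below is only an illustration: the channel names are placeholders, and in practice this logic usually lives in the alert router's configuration and not in the application.

```typescript
type Severity = "p1" | "p2" | "p3";

// Channels mirror the three urgencies above; the names are placeholders.
const ALERT_ROUTES: Record<Severity, string> = {
  p1: "pager",
  p2: "slack",
  p3: "weekly-digest",
};

function routeFor(severity: Severity): string {
  return ALERT_ROUTES[severity];
}

// Only P1 escalates: when still unresolved 15 minutes after firing.
function mustEscalate(
  severity: Severity,
  firedAtMs: number,
  nowMs: number,
  resolved: boolean,
): boolean {
  const ESCALATION_DELAY_MS = 15 * 60 * 1000;
  return severity === "p1" && !resolved && nowMs - firedAtMs >= ESCALATION_DELAY_MS;
}

const openP1After16Min = mustEscalate("p1", 0, 16 * 60 * 1000, false); // true
const openP2After16Min = mustEscalate("p2", 0, 16 * 60 * 1000, false); // false
```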

10. Pattern 8 — "Health in 5 seconds" dashboard

A single dashboard open on a big screen in the workspace. Three blocks:

Service status: 6 green/red boxes
Key business metrics: deposits/h, withdrawals/h, KYC validated/h
Errors of last 15 min: top 5 by frequency

If all green, nobody looks. If something turns red, everyone sees at once. No need to wait for a Slack alert.
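The third block, the top 5 errors of the last 15 minutes, comes down to a windowed count grouped by message. A minimal sketch follows, assuming error events are available as timestamped messages; the `LoggedError` shape and `topErrors` are hypothetical, and a real dashboard would typically run the equivalent query against the log store.

```typescript
interface LoggedError {
  atMs: number; // event time, epoch milliseconds
  message: string; // error message, used as the grouping key
}

function topErrors(
  events: LoggedError[],
  nowMs: number,
  windowMs = 15 * 60 * 1000,
  limit = 5,
): Array<{ message: string; count: number }> {
  const counts = new Map<string, number>();
  for (const e of events) {
    // Keep only events inside the window (nowMs - windowMs, nowMs].
    if (e.atMs > nowMs - windowMs && e.atMs <= nowMs) {
      counts.set(e.message, (counts.get(e.message) ?? 0) + 1);
    }
  }
  return Array.from(counts, ([message, count]) => ({ message, count }))
    // Most frequent first; ties broken alphabetically for a stable display.
    .sort((a, b) => b.count - a.count || (a.message < b.message ? -1 : a.message > b.message ? 1 : 0))
    .slice(0, limit);
}

// Illustrative events, evaluated at minute 20.
const dashboardNowMs = 20 * 60 * 1000;
const recentErrors = topErrors(
  [
    { atMs: 1_100_000, message: "Stripe returned 503" },
    { atMs: 1_150_000, message: "Stripe returned 503" },
    { atMs: 1_190_000, message: "KYC timeout" },
    { atMs: 100_000, message: "Stale error" }, // older than 15 minutes, ignored
  ],
  dashboardNowMs,
);
// recentErrors is [{ message: "Stripe returned 503", count: 2 }, { message: "KYC timeout", count: 1 }]
```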

11. Pattern 9 — Separate audit log

Application logs aren't enough for fintech audit. A separate log, append-only, kept 7 years minimum:

SQL

CREATE TABLE audit_log (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  actor_type   VARCHAR(20) NOT NULL,
  actor_id     UUID NULL,
  action       VARCHAR(80) NOT NULL,
  resource     VARCHAR(80) NOT NULL,
  resource_id  UUID NULL,
  before       JSONB NULL,
  after        JSONB NULL,
  trace_id     VARCHAR(64) NOT NULL,
  at           TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Any UPDATE or DELETE raises an error, so rows can only be inserted.
CREATE FUNCTION audit_immutable() RETURNS trigger AS $$
BEGIN RAISE EXCEPTION 'audit_log is append-only'; END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER no_update BEFORE UPDATE ON audit_log
  FOR EACH ROW EXECUTE FUNCTION audit_immutable();
CREATE TRIGGER no_delete BEFORE DELETE ON audit_log
  FOR EACH ROW EXECUTE FUNCTION audit_immutable();

This audit is what a regulator or auditor will read. It must be clean, complete, immutable.
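On the application side, writing to that table can be reduced to one helper that builds a parameterized INSERT. The sketch below is an assumption about how this might look: `AuditEntry` and `buildAuditInsert` are hypothetical names, the UUIDs are made-up sample values, and the `{ text, values }` shape is chosen because it is what the node-postgres client accepts, which presumes that driver is in use.

```typescript
interface AuditEntry {
  actorType: string;
  actorId: string | null;
  action: string;
  resource: string;
  resourceId: string | null;
  before: Record<string, unknown> | null;
  after: Record<string, unknown> | null;
  traceId: string;
}

// Builds a parameterized INSERT for audit_log; id and at use their column defaults.
function buildAuditInsert(entry: AuditEntry): { text: string; values: Array<string | null> } {
  return {
    text:
      "INSERT INTO audit_log " +
      "(actor_type, actor_id, action, resource, resource_id, before, after, trace_id) " +
      "VALUES ($1, $2, $3, $4, $5, $6::jsonb, $7::jsonb, $8)",
    values: [
      entry.actorType,
      entry.actorId,
      entry.action,
      entry.resource,
      entry.resourceId,
      // JSONB columns receive serialized JSON, or NULL when there is no snapshot.
      entry.before === null ? null : JSON.stringify(entry.before),
      entry.after === null ? null : JSON.stringify(entry.after),
      entry.traceId,
    ],
  };
}

// Illustrative entry for a confirmed deposit.
const sampleInsert = buildAuditInsert({
  actorType: "user",
  actorId: "3f2b6c1e-0000-4000-8000-000000000001",
  action: "deposit.confirm",
  resource: "account",
  resourceId: "3f2b6c1e-0000-4000-8000-000000000002",
  before: { balance_xof: 0 },
  after: { balance_xof: 50000 },
  traceId: "abc123",
});
```

Passing values as parameters keeps user-controlled data out of the SQL text, and carrying the trace_id links each audit row back to the application logs of the same request.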

12. Pitfalls to avoid

Pitfall	Symptom	Fix
Plain text logs	Search impossible	JSON structured, indexed
No propagated trace ID	Inter-service request lost	UUID propagated as header everywhere
Too many log levels	Unmanageable noise	4 levels max, DEBUG off in prod
Alerts without runbook	Nightly panic	Every P1 has its runbook
100% sampling	Exorbitant storage	Smart sampling by criticality
No separate audit	Compliance risk	Append-only audit_log table, 7 years
Over-detailed dashboards	Nobody looks	One health dashboard, 5 seconds
Technical-only metrics	Invisible business drift	Business + technical separate

13. Closing

Observability isn't an infra topic. It's an operational resilience topic. A fintech platform without clear observability is a time bomb: the first serious incident lasts 6 hours, loses transactions, and damages user trust.

Four fundamentals to start: structured JSON logs with propagated trace ID, business + technical metrics separate, distributed traces on critical operations, separate append-only audit log.

Cost is real (~2 weeks initial setup, ~20% code overhead). The benefit is priceless the first time a 3 AM incident is diagnosed in 5 minutes.

