


Structured JSON logs, propagated trace ID, OpenTelemetry, P1/P2/P3 alerts with runbooks, append-only audit log: the toolkit to diagnose a fintech incident in 5 minutes.
On a financial platform, "works in dev" isn't enough. Here's how I instrument logs, metrics and traces so a 3 AM incident finds its root cause in 5 minutes.
Nexus is a financial investment platform. When a user makes a deposit, a dozen operations chain together: auth, KYC scoring, balance check, payment gateway call, confirmation webhook, account update, notification. If one breaks, the user sees "transaction failed", but without observability we don't know where it broke.
Observability isn't a luxury on this kind of platform. It's what turns a 3 AM incident from a 5-hour scramble into a 5-minute diagnosis. This article condenses what I put in place.
Logs: what happened. Structured text, readable by humans and machines. Granularity: one line per notable event.
Metrics: how many, how often. Aggregated time series. Granularity: typically one point per minute.
Traces: the path of a request across services. One trace = one propagated ID. Granularity: one span per step.
The three are complementary. A metric alerts ("error rate spikes"), a log shows the cause ("Stripe returned 503"), a trace shows where in the chain ("between auth-service and payment-service, 8s latency").
Unstructured logs are useless for correlation. All Nexus logs are JSON with a stable structure.
```typescript
// src/utils/logger.ts
import winston from "winston";

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL ?? "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json(),
  ),
  defaultMeta: { service: "nexus-api", env: process.env.NODE_ENV },
  transports: [new winston.transports.Console()],
});
```
A typical line:
```json
{
  "timestamp": "2026-03-12T14:32:08.412Z",
  "level": "info",
  "service": "nexus-api",
  "env": "production",
  "trace_id": "abc123",
  "user_id": "u_8jhq...",
  "operation": "deposit.initiate",
  "amount_xof": 50000,
  "gateway": "mtn_momo",
  "message": "Deposit initiated"
}
```
The trace_id is key. All logs of one request share it. In Loki or Elasticsearch, a search on trace_id="abc123" reconstructs the whole story of that request.
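To make the idea concrete without a log backend, here is a minimal sketch of what such a search does: filter newline-delimited JSON logs on one trace_id and order them by timestamp. The `storyOf` helper and the `LogLine` shape are illustrative assumptions, not part of the Nexus codebase.

```typescript
// Hypothetical shape of one log line; only the fields used here are typed.
interface LogLine {
  timestamp: string;
  trace_id?: string;
  operation?: string;
}

// Parse newline-delimited JSON logs, keep one trace, sort chronologically.
function storyOf(traceId: string, ndjson: string): LogLine[] {
  const lines: LogLine[] = [];
  for (const raw of ndjson.split("\n")) {
    if (raw.trim() === "") continue;
    try {
      const parsed = JSON.parse(raw) as LogLine;
      if (parsed.trace_id === traceId) lines.push(parsed);
    } catch {
      // Skip lines that are not valid JSON (for example a stray stack trace).
    }
  }
  return lines.sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}

const sampleLogs = [
  '{"timestamp":"2026-03-12T14:32:09.100Z","trace_id":"abc123","operation":"gateway.call"}',
  '{"timestamp":"2026-03-12T14:32:08.412Z","trace_id":"abc123","operation":"deposit.initiate"}',
  '{"timestamp":"2026-03-12T14:32:08.900Z","trace_id":"zzz999","operation":"kyc.score"}',
].join("\n");

// Only the two "abc123" lines remain, oldest first.
console.log(storyOf("abc123", sampleLogs).map((l) => l.operation));
```

In production the log backend does this filtering at scale; the point is only that a stable trace_id field makes the query trivial.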
Four levels, period. A fifth kills readability.
| Level | When | Example |
|---|---|---|
| ERROR | Something really broke and a human must know | Invalid webhook signature |
| WARN | Anomaly that doesn't break but must be watched | Retry after 3 failures |
| INFO | Notable business event | Payment confirmed, KYC validated |
| DEBUG | Technical details useful for investigation | Received payload, SQL query |
Discipline rule: DEBUG is off in prod by default. It can be switched on temporarily through an environment variable, without redeploying. If your prod logs are 80% DEBUG, you don't have logs, you have noise.
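One possible way to implement a temporary DEBUG window is sketched below. It assumes a hypothetical convention: LOG_LEVEL holds the base level and DEBUG_UNTIL holds an ISO timestamp after which DEBUG closes on its own. Both the DEBUG_UNTIL variable and the helper names are assumptions made for illustration.

```typescript
// The four levels from the table above, most severe first.
const LEVELS = ["error", "warn", "info", "debug"] as const;
type Level = (typeof LEVELS)[number];

// True when `level` is at least as severe as `threshold`.
function isEnabled(level: Level, threshold: Level): boolean {
  return LEVELS.indexOf(level) <= LEVELS.indexOf(threshold);
}

// Resolve the active level from an env-like object.
// DEBUG_UNTIL (hypothetical) opens a DEBUG window that expires by itself.
function effectiveLevel(
  env: Record<string, string | undefined>,
  now: Date = new Date(),
): Level {
  const until = Date.parse(env.DEBUG_UNTIL ?? "");
  if (!Number.isNaN(until) && now.getTime() < until) return "debug";
  const base = (env.LOG_LEVEL ?? "info").toLowerCase();
  // Unknown values fall back to "info" rather than silently enabling DEBUG.
  return (LEVELS as readonly string[]).includes(base) ? (base as Level) : "info";
}

const threshold = effectiveLevel(process.env);
console.log(`debug enabled: ${isEnabled("debug", threshold)}`);
```

The expiry is the design choice worth keeping: a DEBUG switch that has to be turned off by hand tends to stay on.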
Without a propagated trace ID, you can't follow a request across services. On Nexus, the gateway generates a trace ID for each incoming request and propagates it as the HTTP header x-trace-id.
```typescript
// src/middlewares/traceContext.ts
import { randomUUID } from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";
// Types assume an Express app, which req.header() suggests.
import type { NextFunction, Request, Response } from "express";

// Per-request store: code running inside storage.run() can read the trace ID.
const storage = new AsyncLocalStorage<{ traceId: string }>();

export function traceContext(req: Request, res: Response, next: NextFunction) {
  // Reuse the caller's trace ID when present, otherwise start a new trace.
  const traceId = req.header("x-trace-id") ?? randomUUID();
  res.setHeader("x-trace-id", traceId);
  storage.run({ traceId }, () => next());
}

export function currentTraceId(): string | undefined {
  return storage.getStore()?.traceId;
}
```
The logger automatically includes trace_id in each log:
```typescript
const log = (level: string, msg: string, meta: Record<string, unknown> = {}) =>
  logger.log({ level, message: msg, trace_id: currentTraceId(), ...meta });
```
Each downstream HTTP call forwards the header, so the trace stays unbroken from one service to the next.
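As a rough illustration of that propagation, the sketch below merges x-trace-id into the headers of an outgoing call whenever a trace is active. The `withTraceHeader` helper is hypothetical, and the AsyncLocalStorage store is redeclared here only so the example runs on its own.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Same per-request store idea as the middleware; redeclared so the sketch runs alone.
const traceStore = new AsyncLocalStorage<{ traceId: string }>();

// Return a copy of the headers with x-trace-id added when a trace is active.
function withTraceHeader(headers: Record<string, string> = {}): Record<string, string> {
  const traceId = traceStore.getStore()?.traceId;
  return traceId ? { ...headers, "x-trace-id": traceId } : { ...headers };
}

// Simulate a request handled inside a trace context.
const outgoing = traceStore.run({ traceId: "abc123" }, () =>
  withTraceHeader({ accept: "application/json" }),
);
console.log(outgoing["x-trace-id"]);

// With any HTTP client, the helper would be applied at call time, for example:
// await fetch(url, { headers: withTraceHeader({ accept: "application/json" }) });
```

Putting this in one shared HTTP wrapper, rather than at every call site, is what keeps a forgotten header from breaking the chain.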
Two distinct families of metrics, business and technical, each with its own dashboard.
```typescript
import prom from "prom-client";

export const depositCounter = new prom.Counter({
  name: "nexus_deposit_total",
  help: "Total deposits initiated",
  labelNames: ["gateway", "currency", "status"],
});

export const depositLatency = new prom.Histogram({
  name: "nexus_deposit_seconds",
  help: "Deposit completion time",
  labelNames: ["gateway"],
  buckets: [0.5, 1, 2, 5, 10, 30],
});
```
A business alert ("deposit volume divided by 3 in the last hour") signals earlier than a technical alert ("payment service responds slower"). Both matter.
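The "divided by 3" condition can be written down in a few lines. This is only a sketch of the comparison logic: in practice it would more likely live in an alerting rule than in application code, and the choice of baseline (for example the same hour one week earlier) is an assumption.

```typescript
// True when the current count fell to 1/factor (or less) of the baseline.
function volumeDropped(currentHour: number, baselineHour: number, factor = 3): boolean {
  if (baselineHour <= 0) return false; // no baseline, nothing to compare against
  return currentHour * factor <= baselineHour;
}

// 40 deposits this hour against a baseline of 120: exactly a threefold drop.
console.log(volumeDropped(40, 120));
// 100 deposits against 120: normal fluctuation.
console.log(volumeDropped(100, 120));
```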
For requests crossing several services, OpenTelemetry traces each span.
```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
With auto-instrumentation, each HTTP call, SQL query, Redis query becomes a span automatically. For business operations, add manual spans:
```typescript
import { SpanStatusCode, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("nexus-api");

async function processDeposit(input: DepositInput) {
  return tracer.startActiveSpan("deposit.process", async (span) => {
    span.setAttribute("amount", input.amount);
    span.setAttribute("gateway", input.gateway);
    try {
      const result = await doProcess(input);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Attach the exception to the span, mark it failed, then rethrow.
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
Visualization in Tempo or Jaeger: you see the full cascade, how many milliseconds each step took, and where it broke.
Tracing 100% of requests in production is expensive in storage and bandwidth. Sampling keeps only a fraction:
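```typescript
import { AlwaysOnSampler, ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Keep 10% of new traces; always follow an upstream "sampled" decision.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
  remoteParentSampled: new AlwaysOnSampler(),
});
```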
Trace storage cost is divided by roughly 10, and a trace that an upstream service already sampled is always kept whole.
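Why ratio sampling does not fragment traces is easier to see with a toy version. The sketch below is a simplified illustration of a decision keyed on the trace ID. It is not the OpenTelemetry implementation, and `keepTrace` is a hypothetical name. Because the decision depends only on the ID, every service reaches the same verdict for the same trace.

```typescript
// Simplified ratio sampling keyed on the trace ID (illustration only).
function keepTrace(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Read the first 8 hex characters as a number in [0, 2^32).
  const bucket = parseInt(traceId.slice(0, 8), 16);
  if (Number.isNaN(bucket)) return false; // malformed ID: dropped, an arbitrary choice
  return bucket < ratio * 0x100000000;
}

// A low prefix falls inside the 10% window, a high prefix does not.
console.log(keepTrace("00000001aaaaaaaaaaaaaaaaaaaaaaaa", 0.1));
console.log(keepTrace("ffffffffaaaaaaaaaaaaaaaaaaaaaaaa", 0.1));
```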
Three alert channels, three distinct urgencies:
```yaml
# prometheus/alerts.yml
groups:
  - name: nexus.p1
    rules:
      - alert: PaymentServiceDown
        expr: up{service="payment-service"} == 0
        for: 1m
        labels: { severity: p1 }
        annotations:
          summary: "Payment service is down"
          runbook: "https://wiki.nexus/runbooks/payment-down"
```
Each P1 alert has a runbook: 5 ordered steps. No "thinking at 3 AM": execute the runbook, verify, and escalate if unresolved in 15 min.
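The "every P1 has a runbook" rule is easy to enforce mechanically, for instance in CI. The sketch below assumes the rules file has already been parsed into objects; the `AlertRule` shape, the `p1WithoutRunbook` helper and the second and third alert names are illustrative assumptions.

```typescript
// Minimal shape of an alert rule once alerts.yml has been parsed (assumed).
interface AlertRule {
  alert: string;
  labels?: Record<string, string>;
  annotations?: Record<string, string>;
}

// Names of P1 alerts that have no usable runbook URL.
function p1WithoutRunbook(rules: AlertRule[]): string[] {
  return rules
    .filter((r) => r.labels?.severity === "p1")
    .filter((r) => !/^https?:\/\//.test(r.annotations?.runbook ?? ""))
    .map((r) => r.alert);
}

const parsedRules: AlertRule[] = [
  {
    alert: "PaymentServiceDown",
    labels: { severity: "p1" },
    annotations: { runbook: "https://wiki.nexus/runbooks/payment-down" },
  },
  // Hypothetical rules, added only to exercise the check.
  { alert: "LedgerWriteErrors", labels: { severity: "p1" }, annotations: { summary: "Ledger writes failing" } },
  { alert: "SlowDashboard", labels: { severity: "p3" } },
];

// Only the P1 rule without a runbook is reported.
console.log(p1WithoutRunbook(parsedRules));
```

Failing the build when this list is non-empty keeps the rule from eroding as new alerts are added.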
A single dashboard open on a big screen in the workspace. Three blocks:
If everything is green, nobody needs to look. If something turns red, everyone sees it at once, without waiting for a Slack alert.
Application logs aren't enough for fintech audit. A separate log, append-only, kept 7 years minimum:
```sql
CREATE TABLE audit_log (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  actor_type  VARCHAR(20) NOT NULL,
  actor_id    UUID NULL,
  action      VARCHAR(80) NOT NULL,
  resource    VARCHAR(80) NOT NULL,
  resource_id UUID NULL,
  before      JSONB NULL,
  after       JSONB NULL,
  trace_id    VARCHAR(64) NOT NULL,
  at          TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Reject any UPDATE or DELETE so that rows can only be inserted.
CREATE FUNCTION audit_immutable() RETURNS trigger AS $$
BEGIN
  RAISE EXCEPTION 'audit_log is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER no_update BEFORE UPDATE ON audit_log
  FOR EACH ROW EXECUTE FUNCTION audit_immutable();
CREATE TRIGGER no_delete BEFORE DELETE ON audit_log
  FOR EACH ROW EXECUTE FUNCTION audit_immutable();
```
This audit is what a regulator or auditor will read. It must be clean, complete, immutable.
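On the application side, writing to this table could look like the sketch below. It only builds a parameterized INSERT, in the `{ text, values }` shape that a driver such as node-postgres accepts; the `AuditEntry` type, the `auditInsert` helper and the sample values are assumptions made for illustration.

```typescript
// Fields mirror the audit_log columns; id and at are left to column defaults.
interface AuditEntry {
  actorType: string;
  actorId: string | null;
  action: string;
  resource: string;
  resourceId: string | null;
  before: unknown;
  after: unknown;
  traceId: string;
}

// Build a parameterized INSERT; values travel separately from the SQL text.
function auditInsert(e: AuditEntry): { text: string; values: unknown[] } {
  return {
    text:
      "INSERT INTO audit_log " +
      "(actor_type, actor_id, action, resource, resource_id, before, after, trace_id) " +
      "VALUES ($1, $2, $3, $4, $5, $6::jsonb, $7::jsonb, $8)",
    values: [
      e.actorType,
      e.actorId,
      e.action,
      e.resource,
      e.resourceId,
      e.before == null ? null : JSON.stringify(e.before), // JSONB sent as text
      e.after == null ? null : JSON.stringify(e.after),
      e.traceId,
    ],
  };
}

const auditQuery = auditInsert({
  actorType: "user",
  actorId: null,
  action: "deposit.confirm",
  resource: "account",
  resourceId: null,
  before: { balance_xof: 0 },
  after: { balance_xof: 50000 },
  traceId: "abc123",
});
console.log(auditQuery.values.length);
```

Carrying the same trace_id as the application logs is what lets an auditor's question be answered from both sources at once.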
| Pitfall | Symptom | Fix |
|---|---|---|
| Plain text logs | Search impossible | JSON structured, indexed |
| No propagated trace ID | Inter-service request lost | UUID propagated as header everywhere |
| Too many log levels | Unmanageable noise | 4 levels max, DEBUG off in prod |
| Alerts without runbook | Nightly panic | Every P1 has its runbook |
| 100% sampling | Exorbitant storage | Smart sampling by criticality |
| No separate audit | Compliance risk | Append-only audit_log table, 7 years |
| Over-detailed dashboards | Nobody looks | One health dashboard, 5 seconds |
| Technical-only metrics | Invisible business drift | Business + technical separate |
Observability isn't an infra topic. It's an operational resilience topic. A fintech platform without clear observability is a time bomb: the first serious incident lasts 6 hours, loses transactions, and damages user trust.
Four fundamentals to start: structured JSON logs with a propagated trace ID, separate business and technical metrics, distributed traces on critical operations, and a separate append-only audit log.
The cost is real (~2 weeks of initial setup, ~20% code overhead). The benefit is priceless the first time a 3 AM incident is diagnosed in 5 minutes.