Observability for a fintech platform — logs, metrics, traces and audit
Yao David Logan · 9 min read · May 23, 2026


Contents
  1. The context
  2. The three pillars — logs, metrics, traces
  3. Pattern 1 — JSON-structured logs
  4. Pattern 2 — Disciplined log levels
  5. Pattern 3 — Trace ID propagation
  6. Pattern 4 — Business metrics + technical metrics
  7. Pattern 5 — Distributed tracing with OpenTelemetry
  8. Pattern 6 — Smart sampling
  9. Pattern 7 — Alerts that wake you vs alerts that inform
  10. Pattern 8 — "Health in 5 seconds" dashboard
  11. Pattern 9 — Separate audit log
  12. Pitfalls to avoid
  13. Closing

Structured JSON logs, propagated trace ID, OpenTelemetry, P1/P2/P3 alerts with runbooks, append-only audit log: the toolkit to diagnose a fintech incident in 5 minutes.

On a financial platform, "works in dev" isn't enough. Here's how I instrument logs, metrics and traces so the root cause of a 3 AM incident can be found in 5 minutes.

1. The context

Nexus is a financial investment platform. When a user makes a deposit, a dozen operations chain together: auth, KYC scoring, balance check, payment gateway call, confirmation webhook, account update, notification. If one breaks, the user sees "transaction failed" — but without observability, we don't know where it broke.

Observability isn't a luxury on this kind of platform. It's what turns a 3 AM incident from a 5-hour scramble into a 5-minute diagnosis. This article condenses what I put in place.

2. The three pillars — logs, metrics, traces

Logs

What happened. Structured text, readable by humans and machines. Granularity: one line per notable event.

Metrics

How many, how often. Aggregated time series. Granularity: one point per minute typically.

Distributed traces

The path of a request across services. One trace = one propagated ID. Granularity: one span per step.

The three are complementary. A metric alerts ("error rate spikes"), a log shows the cause ("Stripe returned 503"), a trace shows where in the chain ("between auth-service and payment-service, 8s latency").

3. Pattern 1 — JSON-structured logs

Unstructured logs are useless for correlation. All Nexus logs are JSON with stable structure.

```typescript
// src/utils/logger.ts
import winston from "winston";

export const logger = winston.createLogger({
  // Minimum level comes from the environment; "info" when unset.
  level: process.env.LOG_LEVEL ?? "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json(),
  ),
  defaultMeta: { service: "nexus-api", env: process.env.NODE_ENV },
  transports: [new winston.transports.Console()],
});
```

A typical line:

```json
{
  "timestamp": "2026-03-12T14:32:08.412Z",
  "level": "info",
  "service": "nexus-api",
  "env": "production",
  "trace_id": "abc123",
  "user_id": "u_8jhq...",
  "operation": "deposit.initiate",
  "amount_xof": 50000,
  "gateway": "mtn_momo",
  "message": "Deposit initiated"
}
```

The trace_id is key: every log line of a request carries it. Elasticsearch indexes that field and Loki can filter on it at query time, so a search for trace_id="abc123" reconstructs the whole story.
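To make the correlation step concrete, here is a minimal sketch of what a trace_id search does, written as plain TypeScript over newline-delimited JSON rather than against Loki or Elasticsearch. The names LogLine, parseLogLines and storyOf are hypothetical; only timestamp and trace_id are assumed to be present on each line.

```typescript
// Hypothetical shape of one parsed log line; only the two fields used here are typed.
interface LogLine {
  timestamp: string; // ISO 8601 in UTC, so string order equals time order
  trace_id?: string;
  [field: string]: unknown;
}

// Parse newline-delimited JSON, skipping lines that are not valid JSON.
function parseLogLines(raw: string): LogLine[] {
  const lines: LogLine[] = [];
  for (const text of raw.split("\n")) {
    if (text.trim() === "") continue;
    try {
      lines.push(JSON.parse(text) as LogLine);
    } catch {
      // Not JSON (for example output printed by another tool): ignore it.
    }
  }
  return lines;
}

// Return every line of one request, oldest first.
function storyOf(traceId: string, lines: LogLine[]): LogLine[] {
  return lines
    .filter((line) => line.trace_id === traceId)
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}
```

A log backend does the same filtering at scale; the point is that one stable field is enough to rebuild the sequence of events for a request.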

4. Pattern 2 — Disciplined log levels

Four levels, period. A fifth kills readability.

| Level | When | Example |
| --- | --- | --- |
| ERROR | Something really broke and a human must know | Invalid webhook signature |
| WARN | Anomaly that doesn't break but must be watched | Retry after 3 failures |
| INFO | Notable business event | Payment confirmed, KYC validated |
| DEBUG | Technical details useful for investigation | Received payload, SQL query |

Discipline rule: DEBUG is off in prod by default. Toggleable temporarily via env without redeployment. If your prod logs are 80% DEBUG, you don't have logs — you have noise.
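The rule can be sketched as two small functions: one resolves the active threshold from the environment, the other decides whether a message passes it. This is an illustration in plain TypeScript, not winston's internals; LEVELS, resolveLevel and shouldLog are hypothetical names, and LOG_LEVEL is the variable already read by the logger shown earlier.

```typescript
// The four levels, ordered from most to least severe.
const LEVELS = ["error", "warn", "info", "debug"] as const;
type Level = (typeof LEVELS)[number];

// Resolve the threshold from an environment-like map.
// Unknown or missing values fall back to "info", so DEBUG stays off by default.
function resolveLevel(env: Record<string, string | undefined>): Level {
  const requested = env.LOG_LEVEL?.toLowerCase();
  const known = LEVELS.find((level) => level === requested);
  return known ?? "info";
}

// A message is emitted when it is at least as severe as the threshold.
function shouldLog(message: Level, threshold: Level): boolean {
  return LEVELS.indexOf(message) <= LEVELS.indexOf(threshold);
}
```

With this shape, turning DEBUG on during an investigation is a configuration change rather than a code change.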

5. Pattern 3 — Trace ID propagation

Without a propagated trace ID, you can't follow a request across services. On Nexus, the gateway generates a trace ID for each incoming request and propagates it in the x-trace-id HTTP header.

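```typescript
// src/middlewares/traceContext.ts
import { randomUUID } from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";
// Assumes Express, which the (req, res, next) signature suggests.
import type { Request, Response, NextFunction } from "express";

// Per-request store, visible to any code running inside the request's async chain.
const storage = new AsyncLocalStorage<{ traceId: string }>();

export function traceContext(req: Request, res: Response, next: NextFunction) {
  // Reuse the caller's trace ID when present; otherwise start a new one.
  const traceId = req.header("x-trace-id") ?? randomUUID();
  res.setHeader("x-trace-id", traceId);
  storage.run({ traceId }, () => next());
}

export function currentTraceId(): string | undefined {
  return storage.getStore()?.traceId;
}
```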

The logger automatically includes trace_id in each log:

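```typescript
// Attach the current trace ID to every entry; explicit meta fields win on conflict.
const log = (level: string, msg: string, meta: Record<string, unknown> = {}) =>
  logger.log({ level, message: msg, trace_id: currentTraceId(), ...meta });
```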

Each downstream HTTP call propagates the header. The trace becomes unbroken.
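One way to make downstream propagation hard to forget is to wrap the HTTP client once. The sketch below is a minimal illustration, assuming a transport that takes a URL and a header map; HeaderMap, Send, withTraceHeader and traced are hypothetical names, and the trace ID lookup is injected so the wrapper does not depend on any particular HTTP library.

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical transport signature: any HTTP client can be adapted to it.
type Send<T> = (url: string, headers: HeaderMap) => Promise<T>;

// Copy the caller's headers and add x-trace-id when a trace is active.
function withTraceHeader(headers: HeaderMap, traceId: string | undefined): HeaderMap {
  return traceId === undefined ? { ...headers } : { ...headers, "x-trace-id": traceId };
}

// Wrap a transport so every downstream call carries the current trace ID.
function traced<T>(send: Send<T>, getTraceId: () => string | undefined): Send<T> {
  return (url, headers) => send(url, withTraceHeader(headers, getTraceId()));
}
```

In a service using the middleware above, getTraceId would be the currentTraceId function, so the header follows the request without each call site having to pass it.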

6. Pattern 4 — Business metrics + technical metrics

Two distinct families, two distinct dashboards.

Technical (red signals)

  • HTTP error rate per route (4xx, 5xx)
  • p50, p95, p99 latency per route
  • Saturation: CPU, memory, DB connections, queue depth
  • Per-service availability (regular probe)

Business (golden signals)

  • Deposit / withdrawal volume per minute
  • KYC completion rate
  • Time-to-money (from deposit click to user confirmation)
  • Platform-wide balance

```typescript
import prom from "prom-client";

// Counts deposits, split by gateway, currency and outcome.
export const depositCounter = new prom.Counter({
  name: "nexus_deposit_total",
  help: "Total deposits initiated",
  labelNames: ["gateway", "currency", "status"],
});

// Distribution of deposit completion time, in seconds.
export const depositLatency = new prom.Histogram({
  name: "nexus_deposit_seconds",
  help: "Deposit completion time",
  labelNames: ["gateway"],
  buckets: [0.5, 1, 2, 5, 10, 30],
});
```

A business alert ("deposit volume divided by 3 in the last hour") signals earlier than a technical alert ("payment service responds slower"). Both matter.
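The "divided by 3" condition can be written down as a small check. This is a sketch in plain TypeScript to show the logic, not a Prometheus rule; volumeDropped and median are hypothetical helpers, and using the median of previous hours as the baseline is an assumption made for the example.

```typescript
// Baseline as the median of previous hourly counts, which resists one-off spikes.
function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// True when the last hour's volume is at most baseline / factor.
function volumeDropped(baselinePerHour: number, lastHour: number, factor = 3): boolean {
  if (baselinePerHour <= 0) return false; // no baseline yet: nothing to compare against
  return lastHour <= baselinePerHour / factor;
}
```

For instance, with previous hours of 90, 120, 100, 110 and 300 deposits the baseline is 110, so 30 deposits in the last hour would trigger the check while 60 would not.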

7. Pattern 5 — Distributed tracing with OpenTelemetry

For requests crossing several services, OpenTelemetry traces each span.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  // Spans are exported over OTLP/HTTP to the endpoint given in the environment.
  traceExporter: new OTLPTraceExporter({ url: process.env.OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

With auto-instrumentation, each HTTP call, SQL query, Redis query becomes a span automatically. For business operations, add manual spans:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("nexus-api");

// DepositInput and doProcess are defined elsewhere in the codebase.
async function processDeposit(input: DepositInput) {
  return tracer.startActiveSpan("deposit.process", async (span) => {
    span.setAttribute("amount", input.amount);
    span.setAttribute("gateway", input.gateway);
    try {
      const result = await doProcess(input);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Catch variables are `unknown` in strict TypeScript; normalize before use.
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw error;
    } finally {
      // Always close the span, whether the deposit succeeded or failed.
      span.end();
    }
  });
}
```

Visualization in Tempo or Jaeger: you see the full cascade, how many milliseconds each step took, and where it broke.

8. Pattern 6 — Smart sampling

Tracing 100% of requests in production is expensive in storage and bandwidth. The sampling policy:

  • 100% of errors — always traced
  • 100% of financial operations — non-negotiable
  • 10% of read requests — sampled
  • 1% of healthchecks — near zero
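
```typescript
import { AlwaysOnSampler, ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Root spans: keep 10%. Spans whose remote parent was sampled: always keep,
// so a trace started upstream is never cut in half.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
  remoteParentSampled: new AlwaysOnSampler(),
});
```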

Storage cost drops roughly tenfold without losing the signals that matter.
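The sampler configuration shown applies a single 10% ratio to root spans. One way to express the per-criticality rates from the list is a small decision function. This is a sketch in plain TypeScript, not the OpenTelemetry Sampler interface; RequestKind, SAMPLE_RATES, traceFraction and shouldSample are hypothetical names. Keeping 100% of errors is left out on purpose: whether a request failed is only known once it ends, which usually calls for a tail-based decision after the trace completes.

```typescript
type RequestKind = "financial" | "read" | "healthcheck";

// Rates taken from the sampling list in this section.
const SAMPLE_RATES: Record<RequestKind, number> = {
  financial: 1.0,
  read: 0.1,
  healthcheck: 0.01,
};

// Map a trace ID to a stable number in [0, 1), so every service that sees
// the same trace ID reaches the same decision.
function traceFraction(traceId: string): number {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return hash / 2 ** 32;
}

// Keep the trace when its fraction falls under the rate for its kind.
function shouldSample(kind: RequestKind, traceId: string): boolean {
  return traceFraction(traceId) < SAMPLE_RATES[kind];
}
```

Because the decision is derived from the trace ID rather than a random draw, a trace kept at the healthcheck rate is also kept at the read rate, and repeated calls agree with each other.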

9. Pattern 7 — Alerts that wake you vs alerts that inform

Three alert channels, three distinct urgencies:

  • P1 (wakes at 3 AM): platform down, payment impossible, corrupted balance
  • P2 (Slack business hours): degraded latency, error rate > 1%, growing queue
  • P3 (weekly digest): recurring warnings, declining business metrics

```yaml
# prometheus/alerts.yml
groups:
  - name: nexus.p1
    rules:
      - alert: PaymentServiceDown
        expr: up{service="payment-service"} == 0
        for: 1m
        labels: { severity: p1 }
        annotations:
          summary: "Payment service is down"
          runbook: "https://wiki.nexus/runbooks/payment-down"
```

Each P1 alert has a runbook: 5 ordered steps. No "thinking at 3 AM" — execute runbook, verify, escalate if unresolved in 15 min.
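The "every P1 has a runbook" rule is easy to enforce mechanically, for example in CI. The sketch below assumes the alert rules have already been loaded into memory; the AlertRule shape mirrors the YAML fields shown in this section, while CHANNELS, channelFor and p1WithoutRunbook are hypothetical names and the channel labels are placeholders.

```typescript
// Hypothetical in-memory shape of a rule, mirroring the YAML fields of this section.
interface AlertRule {
  alert: string;
  labels: { severity: "p1" | "p2" | "p3" };
  annotations: { summary: string; runbook?: string };
}

// One channel per severity, following the three urgencies described here.
const CHANNELS = { p1: "pager", p2: "slack", p3: "weekly-digest" } as const;

function channelFor(rule: AlertRule): string {
  return CHANNELS[rule.labels.severity];
}

// Names of P1 rules that lack a runbook link; an empty array means the check passes.
function p1WithoutRunbook(rules: AlertRule[]): string[] {
  return rules
    .filter((rule) => rule.labels.severity === "p1" && !rule.annotations.runbook)
    .map((rule) => rule.alert);
}
```

Failing the build when p1WithoutRunbook returns a non-empty list keeps the runbook rule from eroding as new alerts are added.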

10. Pattern 8 — "Health in 5 seconds" dashboard

A single dashboard open on a big screen in the workspace. Three blocks:

  1. Service status: 6 green/red boxes
  2. Key business metrics: deposits/h, withdrawals/h, KYC validated/h
  3. Errors of last 15 min: top 5 by frequency

If all green, nobody looks. If something turns red, everyone sees at once. No need to wait for a Slack alert.
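The third block, "top 5 errors of the last 15 minutes", comes down to a windowed count. Here is a minimal sketch in plain TypeScript, assuming error events are available as timestamped messages; ErrorSample and topErrors are hypothetical names, and a real dashboard would typically run the equivalent query in the log or metrics backend.

```typescript
interface ErrorSample {
  at: number; // epoch milliseconds
  message: string;
}

// Count errors per message inside the window and keep the most frequent ones.
function topErrors(
  samples: ErrorSample[],
  now: number,
  windowMs = 15 * 60 * 1000,
  limit = 5,
): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const sample of samples) {
    if (sample.at > now - windowMs && sample.at <= now) {
      counts.set(sample.message, (counts.get(sample.message) ?? 0) + 1);
    }
  }
  // Highest count first; ties keep their first-seen order.
  return Array.from(counts).sort((a, b) => b[1] - a[1]).slice(0, limit);
}
```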

11. Pattern 9 — Separate audit log

Application logs aren't enough for fintech audit. A separate log, append-only, kept 7 years minimum:

```sql
CREATE TABLE audit_log (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  actor_type  VARCHAR(20) NOT NULL,
  actor_id    UUID NULL,
  action      VARCHAR(80) NOT NULL,
  resource    VARCHAR(80) NOT NULL,
  resource_id UUID NULL,
  before      JSONB NULL,
  after       JSONB NULL,
  trace_id    VARCHAR(64) NOT NULL,
  at          TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Reject any UPDATE or DELETE so rows can only be appended.
CREATE FUNCTION audit_immutable() RETURNS trigger AS $$
BEGIN RAISE EXCEPTION 'audit_log is append-only'; END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER no_update BEFORE UPDATE ON audit_log
  FOR EACH ROW EXECUTE FUNCTION audit_immutable();
CREATE TRIGGER no_delete BEFORE DELETE ON audit_log
  FOR EACH ROW EXECUTE FUNCTION audit_immutable();
```

This audit is what a regulator or auditor will read. It must be clean, complete, immutable.
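On the application side, writing to this table can go through a single helper so that every entry has the same shape. The sketch below only builds a parameterized statement with PostgreSQL-style placeholders and leaves execution to whatever database client is in use; AuditEntry and buildAuditInsert are hypothetical names, and the column list follows the table definition in this section.

```typescript
// Hypothetical application-side shape of one audit entry.
interface AuditEntry {
  actorType: string;
  actorId: string | null;
  action: string;
  resource: string;
  resourceId: string | null;
  before: unknown; // null or undefined when there is no previous state
  after: unknown;  // null or undefined when there is no new state
  traceId: string;
}

// Build a parameterized INSERT; id and at are left to the column defaults.
function buildAuditInsert(entry: AuditEntry): { text: string; values: unknown[] } {
  // Serialize JSON payloads explicitly so arrays and objects are sent as JSON text.
  const toJson = (value: unknown) => (value == null ? null : JSON.stringify(value));
  return {
    text:
      'INSERT INTO audit_log (actor_type, actor_id, action, resource, resource_id, "before", "after", trace_id) ' +
      "VALUES ($1, $2, $3, $4, $5, $6, $7, $8)",
    values: [
      entry.actorType,
      entry.actorId,
      entry.action,
      entry.resource,
      entry.resourceId,
      toJson(entry.before),
      toJson(entry.after),
      entry.traceId,
    ],
  };
}
```

Passing values as parameters rather than interpolating them keeps user-controlled data out of the SQL text, and carrying trace_id links each audit row back to the logs and traces of the request that caused it.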

12. Pitfalls to avoid

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Plain text logs | Search impossible | JSON structured, indexed |
| No propagated trace ID | Inter-service request lost | UUID propagated as header everywhere |
| Too many log levels | Unmanageable noise | 4 levels max, DEBUG off in prod |
| Alerts without runbook | Nightly panic | Every P1 has its runbook |
| 100% sampling | Exorbitant storage | Smart sampling by criticality |
| No separate audit | Compliance risk | Append-only audit_log table, 7 years |
| Over-detailed dashboards | Nobody looks | One health dashboard, 5 seconds |
| Technical-only metrics | Invisible business drift | Business + technical separate |

13. Closing

Observability isn't an infra topic. It's an operational resilience topic. A fintech platform without clear observability is a time bomb: the first serious incident lasts 6 hours, loses transactions, and damages user trust.

Four fundamentals to start: structured JSON logs with propagated trace ID, business + technical metrics separate, distributed traces on critical operations, separate append-only audit log.

The cost is real (~2 weeks of initial setup, ~20% code overhead). The benefit is priceless the first time a 3 AM incident is diagnosed in 5 minutes.
