
Observability Guide

This guide covers metrics, health checks, and logging in Mycel services.

Overview

Mycel provides built-in observability features:

| Feature | Endpoint | Purpose |
| --- | --- | --- |
| Metrics | /metrics | Prometheus metrics |
| Health | /health | Detailed health status |
| Liveness | /health/live | Kubernetes liveness probe |
| Readiness | /health/ready | Kubernetes readiness probe |

Prometheus Metrics

Mycel exposes metrics in Prometheus format at /metrics.

Request Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mycel_requests_total | Counter | method, path, status | Total HTTP requests |
| mycel_request_duration_seconds | Histogram | method, path | Request duration |
| mycel_requests_in_flight | Gauge | method, path | Current in-flight requests |

Flow Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mycel_flow_executions_total | Counter | flow, status | Total flow executions |
| mycel_flow_duration_seconds | Histogram | flow | Flow execution duration |
| mycel_flow_errors_total | Counter | flow, error_type | Flow errors by type |

Connector Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mycel_connector_health | Gauge | connector, type | Health status (1=healthy) |
| mycel_connector_operations_total | Counter | connector, type, operation, status | Operations count |
| mycel_connector_latency_seconds | Histogram | connector, type, operation | Operation latency |

Cache Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mycel_cache_hits_total | Counter | cache | Cache hits |
| mycel_cache_misses_total | Counter | cache | Cache misses |
| mycel_cache_size | Gauge | cache | Current cache size |

Profile Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mycel_connector_profile_active | Gauge | connector, profile | Active profile (1=active) |
| mycel_connector_profile_requests_total | Counter | connector, profile | Requests per profile |
| mycel_connector_profile_errors_total | Counter | connector, profile, error | Errors per profile |
| mycel_connector_profile_fallback_total | Counter | connector, from, to | Fallback events |
| mycel_connector_profile_latency_seconds | Histogram | connector, profile | Latency per profile |

Synchronization Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mycel_lock_acquired_total | Counter | key | Locks acquired |
| mycel_lock_released_total | Counter | key | Locks released |
| mycel_lock_wait_seconds | Histogram | key | Lock wait time |
| mycel_lock_timeout_total | Counter | key | Lock timeouts |
| mycel_lock_held | Gauge | key | Currently held locks |
| mycel_semaphore_acquired_total | Counter | key | Semaphore permits acquired |
| mycel_semaphore_available | Gauge | key | Available permits |
| mycel_coordinate_signal_total | Counter | signal | Signals emitted |
| mycel_coordinate_wait_seconds | Histogram | signal | Wait duration |

Runtime Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mycel_uptime_seconds | Gauge | - | Service uptime |
| mycel_goroutines | Gauge | - | Current goroutines |
| mycel_service_info | Gauge | service, version | Service metadata |
| mycel_scheduled_flows | Gauge | - | Scheduled flows count |

Accessing Metrics

# Get all metrics
curl http://localhost:3000/metrics

# Filter specific metrics
curl http://localhost:3000/metrics | grep mycel_flow

# Get flow durations
curl http://localhost:3000/metrics | grep mycel_flow_duration

Example output:

# HELP mycel_requests_total Total number of HTTP requests processed
# TYPE mycel_requests_total counter
mycel_requests_total{method="GET",path="/users",status="200"} 150
mycel_requests_total{method="POST",path="/users",status="201"} 25

# HELP mycel_flow_duration_seconds Flow execution duration in seconds
# TYPE mycel_flow_duration_seconds histogram
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.005"} 120
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.01"} 145
mycel_flow_duration_seconds_sum{flow="get_users"} 0.45
mycel_flow_duration_seconds_count{flow="get_users"} 150
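
For quick checks outside Prometheus, the text format is simple enough to read with a few lines of code. The following is a minimal Python sketch, not part of Mycel: parse_metrics and total are hypothetical helper names, and the parser only understands basic sample lines like the ones above (trailing timestamps are ignored and escape sequences in label values are not unescaped).

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# Matches sample lines such as: name{label="value",...} 42
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format samples, skipping comments and blank lines."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples: List[Sample], name: str, **match: str) -> float:
    """Sum every sample of `name` whose labels include all of `match`."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(k) == v for k, v in match.items())
    )


EXAMPLE = '''
# TYPE mycel_requests_total counter
mycel_requests_total{method="GET",path="/users",status="200"} 150
mycel_requests_total{method="POST",path="/users",status="201"} 25
'''

samples = parse_metrics(EXAMPLE)
print(total(samples, "mycel_requests_total"))                # 175.0
print(total(samples, "mycel_requests_total", method="GET"))  # 150.0
```

In a real script the text would come from the /metrics endpoint rather than an inline string. For anything beyond ad-hoc checks, a maintained Prometheus client library is the safer choice.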

Health Checks

Detailed Health (/health)

Returns comprehensive status of all components:

curl http://localhost:3000/health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.0.0",
  "uptime": "2h30m15s",
  "components": [
    {
      "name": "postgres",
      "status": "healthy",
      "latency": "5ms"
    },
    {
      "name": "redis",
      "status": "healthy",
      "latency": "1ms"
    }
  ]
}
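
A script or smoke test can use this response to name the failing dependency instead of only reporting "unhealthy". The sketch below is an illustration, not a Mycel API: unhealthy_components is a hypothetical helper, and the payload is a made-up variant of the response above in which redis reports "unhealthy" (the component status value is assumed to mirror the top-level one).

```python
import json
from typing import List


def unhealthy_components(health_body: str) -> List[str]:
    """Return the names of components whose status is not "healthy"."""
    report = json.loads(health_body)
    return [
        component.get("name", "<unnamed>")
        for component in report.get("components", [])
        if component.get("status") != "healthy"
    ]


# Hypothetical /health body: same shape as the documented response,
# with one component marked unhealthy for the sake of the example.
BODY = '''
{
  "status": "unhealthy",
  "components": [
    {"name": "postgres", "status": "healthy", "latency": "5ms"},
    {"name": "redis", "status": "unhealthy", "latency": "1ms"}
  ]
}
'''

print(unhealthy_components(BODY))  # ['redis']
```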

Liveness Probe (/health/live)

A simple check that the process is alive. It returns 200 for as long as the process is running and able to respond.

curl http://localhost:3000/health/live

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.0.0",
  "uptime": "2h30m15s"
}

Readiness Probe (/health/ready)

Checks whether the service is ready to receive traffic, that is, whether all connectors are healthy.

curl http://localhost:3000/health/ready

Response (healthy):

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z"
}

Response (not ready):

{
  "status": "unhealthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "metadata": {
    "reason": "service not ready"
  }
}
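
Outside Kubernetes (CI jobs, deploy scripts, integration tests) the same endpoint can gate a start-up step. The following Python sketch polls until the service reports ready. It rests on an assumption the text above does not state: that /health/ready answers with a non-200 status while not ready, which is what an httpGet probe needs in order to fail. wait_until_ready and http_status are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET on `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def wait_until_ready(
    probe: Callable[[], int],
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # connection refused or timed out: treat as "not ready yet"
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Demonstration with a fake probe so the example runs without a server.
responses = iter([503, 503, 200])
print(wait_until_ready(lambda: next(responses), sleep=lambda _: None))  # True
```

Against a running service the probe would be something like lambda: http_status("http://localhost:3000/health/ready"). Keeping the probe injectable makes the retry logic testable without a network.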

Kubernetes Configuration

Deployment with Probes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: mycel
          image: ghcr.io/matutetandil/mycel:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"

ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: http  # name of the Service port that exposes Mycel's port 3000
      path: /metrics
      interval: 15s

Logging

Log Levels

| Level | Description | Use Case |
| --- | --- | --- |
| debug | Detailed debugging | Development |
| info | Normal operations | Production (default) |
| warn | Warning conditions | Issues that may need attention |
| error | Error conditions | Failures that need investigation |

Configuration

# Via command line
mycel start --log-level debug --log-format json

# Via environment variables
MYCEL_LOG_LEVEL=debug MYCEL_LOG_FORMAT=json mycel start

Log Format

Text format (development):

2024-01-15T10:30:00.000Z INFO  Starting service: my-service
2024-01-15T10:30:00.001Z INFO  Loaded 3 connectors: api, db, cache
2024-01-15T10:30:00.002Z INFO  REST server listening on :3000

JSON format (production):

{"time":"2024-01-15T10:30:00.000Z","level":"INFO","msg":"Starting service","service":"my-service"}
{"time":"2024-01-15T10:30:00.001Z","level":"INFO","msg":"Loaded connectors","count":3}
{"time":"2024-01-15T10:30:00.002Z","level":"INFO","msg":"REST server listening","port":3000}
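
Because every JSON line is a self-contained object, ad-hoc analysis needs no regular expressions. The sketch below is an illustrative Python helper, not part of Mycel (parse_log_lines is a hypothetical name). It counts log records per level and skips any line that is not a JSON object, such as a stray text-format line.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def parse_log_lines(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per JSON log line, skipping anything that is not a JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # for example a text-format line mixed into the stream
        if isinstance(record, dict):
            yield record


LOGS = [
    '{"time":"2024-01-15T10:30:00.000Z","level":"INFO","msg":"Starting service","service":"my-service"}',
    '{"time":"2024-01-15T10:30:00.001Z","level":"INFO","msg":"Loaded connectors","count":3}',
    '2024-01-15T10:30:00.002Z INFO  REST server listening on :3000',
    '{"time":"2024-01-15T10:30:00.002Z","level":"INFO","msg":"REST server listening","port":3000}',
]

levels = Counter(str(record.get("level", "")).upper() for record in parse_log_lines(LOGS))
print(levels["INFO"])  # 3
```

In practice the lines would come from a pipe, for example the output of docker compose logs, rather than an inline list.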

Choosing a format

| Format | Use for | Why |
| --- | --- | --- |
| text | Local tail / development | Pretty, human-friendly, colored when supported. |
| json | Production / log pipelines | Each log line is a queryable object: level, flow, connector, error_type, and so on become first-class fields in Loki / Elastic / Datadog. This is the difference between substring search and structured queries. |

If you ship logs to a backend (next section), always pick json. The text format works too, but every downstream tool either re-parses it (extra cost, fragile) or limits you to substring matching.

Shipping logs to a backend

Mycel logs to stdout / stderr — and that's where you should keep them. The recommended pattern across every Mycel deployment is the same one that's been the cloud-native default for the last decade:

The app logs to stdout. A collector outside the app ships those logs to the backend of your choice.

Why not push from inside Mycel directly?

  • The runtime stays focused on flows; it doesn't carry retry / batching / back-pressure logic for an external log API.
  • A failing log backend cannot stall your request path.
  • You can swap Loki for Datadog for Elastic without redeploying Mycel — just reconfigure the collector.
  • Containers, k8s, and most PaaS already capture stdout for free.

A handful of well-known collectors covers every realistic backend. Pick one — they all consume Mycel's JSON logs cleanly. Run mycel start --log-format json (or MYCEL_LOG_FORMAT=json) and point the collector at the container's stdout.

Promtail → Grafana Loki

The Grafana-stack default. Tails container stdout, ships to Loki, queryable from Grafana.

# promtail.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: mycel
    static_configs:
      - targets: [localhost]
        labels:
          job: mycel
          service: my-consumer
          __path__: /var/log/containers/mycel-*.log
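    pipeline_stages:
      # Files under /var/log/containers are wrapped by the container runtime;
      # unwrap them first (use "docker: {}" instead on a Docker runtime)
      - cri: {}
      - json: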
          expressions:
            level: level
            msg:   msg
            flow:  flow
            error: error
      - labels:
          level:
          flow:
      - timestamp:
          source: time
          format: RFC3339

In Grafana the JSON unlocks LogQL queries that wouldn't work over plain text:

# All error logs from a specific flow
{job="mycel", service="my-consumer"} | json | flow="item_create" | level=~"(?i)error"

# Rate of timeout errors per flow over 5 minutes
sum by (flow) (rate({job="mycel"} | json | error_type="timeout" [5m]))

# Filter by any structured field the runtime emits
{job="mycel"} | json | connector="rabbit" |~ "deadline"

Vector → any backend

Vector is the most flexible option: one config can fan out the same logs to several backends (Loki + S3 + a webhook, for example) with transforms in between. Good when you don't want to commit to one ecosystem.

# vector.toml
[sources.mycel]
type = "docker_logs"
include_containers = ["mycel"]

[transforms.parse]
type   = "remap"
inputs = ["mycel"]
source = '''
  . = parse_json!(.message)
'''

[sinks.loki]
type     = "loki"
inputs   = ["parse"]
endpoint = "http://loki:3100"
labels   = { job = "mycel", level = "{{ level }}", flow = "{{ flow }}" }
encoding = { codec = "json" }

Swap [sinks.loki] for [sinks.elasticsearch], [sinks.datadog_logs], [sinks.http], etc. — Vector supports ~40 sinks out of the box.

Fluent Bit → Kubernetes daemonset

The default in many k8s clusters. Lightweight, written in C, ships as a DaemonSet that tails every pod's stdout.

# fluent-bit.conf
[INPUT]
    Name              tail
    Path              /var/log/containers/mycel-*.log
    Parser            docker
    Tag               mycel.*
    Refresh_Interval  5

[FILTER]
    Name    parser
    Match   mycel.*
    Key_Name log
    Parser  json

[OUTPUT]
    Name        loki
    Match       mycel.*
    Host        loki
    Port        3100
    Labels      job=mycel
    Label_Keys  $level, $flow, $service

Most clusters already have Fluent Bit installed — check before adding another collector.

OpenTelemetry Collector → any OTLP backend

The vendor-neutral standard. One collector config can route to Loki, Datadog, New Relic, Honeycomb, Elastic, Splunk, Grafana Cloud — every major vendor speaks OTLP.

# otel-collector.yaml
receivers:
  filelog:
    include: [/var/log/containers/mycel-*.log]
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.000Z07:00'
        severity:
          parse_from: attributes.level

exporters:
  otlphttp:
    endpoint: ${OTEL_LOGS_ENDPOINT}
    headers:
      "x-api-key": ${OTEL_API_KEY}

service:
  pipelines:
    logs:
      receivers:  [filelog]
      exporters:  [otlphttp]

This is the safest long-term bet — if you change vendor in 2 years, only the exporters: block changes.

Deploying the collector

  • Docker Compose: add the collector as another service alongside Mycel; share the log volume or use the Docker log driver.
  • Kubernetes: run Fluent Bit / Vector / OTel Collector as a DaemonSet (one per node) tailing /var/log/containers/*.log, OR as a sidecar in the Mycel pod. DaemonSet is cheaper at scale; sidecar gives you per-pod isolation.
  • VM / bare metal: run the collector as a systemd unit alongside Mycel; point it at journald or at Mycel's log file if you redirect stdout.

The bundled local monitoring stack at monitoring/ (Prometheus + Grafana) does not include a log collector by default. For local development, running docker compose logs mycel is enough. Add Promtail or Vector to that stack when you want LogQL queries locally.

In-process shipping (future)

Mycel does not ship logs over the network from inside the process. There's a fair case for adding a built-in OTLP sink (logging { sink "otlp" { } } declared in HCL alongside connectors) for deployments without a collector — small VPS, edge — and it fits Mycel's "configuration, not code" ethos. It's not on the roadmap today; open an issue or a discussion if you have a concrete use case.

Distributed Tracing (OpenTelemetry)

Mycel can emit OpenTelemetry traces over OTLP, so a single request can be followed end-to-end across services in Jaeger, Tempo, Grafana, or any OTel-compatible backend.

Tracing is opt-in and a strict no-op when unconfigured — there is no hot-path cost unless you turn it on.

Enabling it

Either set MYCEL_TRACING=true, or simply point Mycel at a collector with the standard OTLP endpoint variable (which turns tracing on by itself):

# Auto-enabled: setting an OTLP endpoint is enough
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 mycel start

# Or explicitly
MYCEL_TRACING=true OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 mycel start

The OTLP/gRPC exporter reads the rest of its configuration from the standard OTEL_* environment variables — endpoint, headers, TLS/insecure, timeout — so Mycel is wired up exactly like any other OpenTelemetry service. Common ones:

| Variable | Purpose |
| --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | Collector endpoint (e.g. http://otel-collector:4317) |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | Traces-only endpoint override |
| OTEL_EXPORTER_OTLP_HEADERS | Headers (e.g. auth: authorization=Bearer ...) |
| OTEL_EXPORTER_OTLP_INSECURE | true to disable TLS (plaintext collector) |

service.name and service.version come from your service config; override with OTEL_SERVICE_NAME if needed.

What gets traced

  • A root span per flow execution, started at the single choke-point every request passes through — so it works for any source connector (queue message, HTTP body, TCP frame, CDC event), in any environment.
  • Inbound context propagation: the flow joins an existing distributed trace when a W3C traceparent is present in the source headers (HTTP or message headers; lookup is case-insensitive).
  • Child spans around connector writes (to {} destinations), tagged with the connector, operation, and target.
  • Depth inside the flow: the transform/steps stage, the to { transaction {} } block, and each each loop within it get their own spans, so the trace shows where a flow's time actually goes — e.g. which each loop in a large transaction is slow — instead of a flat flow → write.
  • Outbound propagation on HTTP client calls and on RabbitMQ / Kafka publishes (the traceparent is written into the message headers), so the downstream service or consumer continues the same trace.

Span attributes include mycel.flow, mycel.source, mycel.connector, and the operation; errored flows and writes are marked on the span.
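
To make the inbound propagation step concrete, here is a minimal Python sketch of reading a W3C traceparent header with a case-insensitive name lookup. It illustrates the header format only and is not Mycel's implementation: extract_traceparent and TraceContext are hypothetical names, and the sketch accepts only the exact four-field layout defined for version 00.

```python
import re
from typing import Mapping, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    parent_id: str  # 16 lowercase hex characters (the caller's span id)
    sampled: bool   # lowest bit of the trace-flags field


# version "-" trace-id "-" parent-id "-" trace-flags
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def extract_traceparent(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Return the trace context carried in `headers`, or None if absent or invalid."""
    for key, value in headers.items():
        if key.lower() != "traceparent":  # header names are case-insensitive
            continue
        match = _TRACEPARENT_RE.match(value.strip())
        if match is None:
            return None
        version, trace_id, parent_id, flags = match.groups()
        # The spec forbids version "ff" and all-zero trace or parent ids.
        if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
            return None
        return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
    return None


ctx = extract_traceparent(
    {"TraceParent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
)
print(ctx.trace_id)  # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx.sampled)   # True
```

When a valid context is found, the new flow span becomes a child of parent_id within trace_id. Otherwise a fresh trace is started.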

Correlating logs with traces

Logs emitted with a context during a traced flow automatically carry trace_id and span_id, so you can pivot from a log line to its trace (and back) in Grafana/Loki/Tempo. This is a no-op when there is no active span, so it adds nothing when tracing is off.

{"time":"...","level":"INFO","msg":"request","flow":"item_update","duration":"812ms","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
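
The same pivot can be done offline when no Grafana is at hand. The Python sketch below groups JSON log lines by their trace_id field. group_by_trace is a hypothetical helper, and the second log line is invented for illustration (its msg and span_id are not real Mycel output).

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_trace(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group JSON log records by trace_id, ignoring records without one."""
    traces: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if isinstance(record, dict) and "trace_id" in record:
            traces[record["trace_id"]].append(record)
    return dict(traces)


LINES = [
    '{"level":"INFO","msg":"request","flow":"item_update","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}',
    # Hypothetical second record from the same trace, for illustration only.
    '{"level":"ERROR","msg":"write failed","flow":"item_update","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"a1b2c3d4e5f60718"}',
    '{"level":"INFO","msg":"untraced startup message"}',
]

grouped = group_by_trace(LINES)
print(len(grouped["4bf92f3577b34da6a3ce929d0e0e4736"]))  # 2
```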

Granular per-step / per-rule transform spans, per-statement transaction spans, and spans around the sync primitives (lock / coordinate / sequence_guard) are planned refinements; today the transform/transaction stages are spanned as a whole and sync wait time shows up as the leading gap inside the flow span.

Header-less brokers: Redis Pub/Sub and MQTT v3 carry no message headers, so trace context cannot cross those hops (Mycel does not embed it in the payload). A trace will show the flow that consumes such a message but cannot be linked from the publishing side over that hop.

This is separate from the debugging tracer (verbose flow logging + the Studio debugger), which is for local development. The two are independent and can be active at the same time. Prometheus /metrics is unaffected by tracing.

OTLP export of metrics/logs is a planned follow-up.

Grafana Dashboard

Import Dashboard

  1. Open Grafana
  2. Go to Dashboards > Import
  3. Use the JSON below or import from file

Example Dashboard JSON

{
  "title": "Mycel Service Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(mycel_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Request Duration (p95)",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "{{path}}"
        }
      ]
    },
    {
      "title": "Flow Errors",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(mycel_flow_errors_total[5m])",
          "legendFormat": "{{flow}} - {{error_type}}"
        }
      ]
    },
    {
      "title": "Cache Hit Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(mycel_cache_hits_total[5m])) / (sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m])))"
        }
      ]
    },
    {
      "title": "Connector Health",
      "type": "table",
      "targets": [
        {
          "expr": "mycel_connector_health",
          "legendFormat": "{{connector}}"
        }
      ]
    }
  ]
}

Common Queries

Request Rate

rate(mycel_requests_total[5m])

Error Rate

sum(rate(mycel_requests_total{status=~"5.."}[5m])) / sum(rate(mycel_requests_total[5m]))
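
The arithmetic behind an error-rate query can be sketched in Python from two scrapes of mycel_requests_total. This is an illustration under assumptions: error_ratio is a hypothetical helper and the counter values are made up. The point it shows is that the 5xx increase has to be divided by the increase summed over all status codes. The scrape interval cancels out in the ratio, so it does not appear.

```python
from typing import Dict, Tuple

# One scrape of mycel_requests_total: {(method, path, status): counter value}
Scrape = Dict[Tuple[str, str, str], float]


def error_ratio(before: Scrape, after: Scrape) -> float:
    """Fraction of requests between two scrapes that ended in a 5xx status."""
    errors = 0.0
    total = 0.0
    for key, value in after.items():
        increase = value - before.get(key, 0.0)
        if increase < 0:
            increase = value  # counter reset (process restart): count from zero
        total += increase
        if key[2].startswith("5"):
            errors += increase
    return errors / total if total else 0.0


# Made-up counter values, for illustration only.
before = {("GET", "/users", "200"): 100.0, ("GET", "/users", "500"): 2.0}
after = {("GET", "/users", "200"): 190.0, ("GET", "/users", "500"): 12.0}

print(error_ratio(before, after))  # 0.1
```

Here 10 of the 100 new requests failed, so the ratio is 0.1, above the 0.05 threshold used for the HighErrorRate alert later in this guide.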

Request Duration (p95)

histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m]))

Slow Flows

histogram_quantile(0.99, rate(mycel_flow_duration_seconds_bucket[5m])) > 1
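
It helps to know how histogram_quantile arrives at its number: it finds the bucket that contains the requested rank and interpolates linearly inside it. The Python sketch below reproduces that estimate for a single set of cumulative buckets. It is a simplified illustration, not the Prometheus implementation: it assumes the lowest bucket starts at 0 and skips some edge cases. The bucket counts come from the example output earlier in this guide, with the +Inf bucket taken from the _count line.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile `q` from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            span = count - lower_count
            if span == 0:
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative buckets for flow="get_users" from the example output.
BUCKETS = [(0.005, 120.0), (0.01, 145.0), (math.inf, 150.0)]

print(histogram_quantile(0.5, BUCKETS))   # 0.003125
print(histogram_quantile(0.99, BUCKETS))  # 0.01
```

The second result shows a practical limit: once the rank lands in the +Inf bucket, the estimate is capped at the largest finite bucket bound, so a query threshold such as "> 1" can only trigger if the histogram has buckets above 1 second.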

Cache Hit Rate

sum(rate(mycel_cache_hits_total[5m])) /
(sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m])))

Unhealthy Connectors

mycel_connector_health == 0

Alerting Rules

Example Prometheus Alerts

groups:
  - name: mycel
    rules:
      - alert: HighErrorRate
        expr: sum by (path) (rate(mycel_requests_total{status=~"5.."}[5m])) / sum by (path) (rate(mycel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.path }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests on {{ $labels.path }}"
          description: "p95 latency is {{ $value }}s"

      - alert: ConnectorUnhealthy
        expr: mycel_connector_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Connector {{ $labels.connector }} is unhealthy"

      - alert: LowCacheHitRate
        expr: |
          sum(rate(mycel_cache_hits_total[5m])) /
          (sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m]))) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      - alert: HighLockContention
        expr: rate(mycel_lock_timeout_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High lock contention on {{ $labels.key }}"

Docker Compose with Monitoring

version: '3.8'

services:
  mycel:
    image: ghcr.io/matutetandil/mycel:latest
    ports:
      - "3000:3000"
    environment:
      - MYCEL_LOG_FORMAT=json

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'mycel'
    static_configs:
      - targets: ['mycel:3000']

Best Practices

  1. Use JSON logging in production for easier parsing by log aggregators
  2. Set appropriate alert thresholds based on your SLOs
  3. Monitor cache hit rates: a persistently low rate often points to misconfigured cache keys
  4. Track connector latency to identify slow dependencies
  5. Use readiness probes to prevent routing traffic to unhealthy instances
  6. Set resource limits based on observed metrics

See Also