Observability Guide¶

This guide covers metrics, health checks, logging, and distributed tracing in Mycel services, along with example dashboards and alerting rules.

Overview¶

Mycel provides built-in observability features:

Feature	Endpoint	Purpose
Metrics	`/metrics`	Prometheus metrics
Health	`/health`	Detailed health status
Liveness	`/health/live`	Kubernetes liveness probe
Readiness	`/health/ready`	Kubernetes readiness probe

Prometheus Metrics¶

Mycel exposes metrics in the Prometheus text format at `/metrics`.

Request Metrics¶

Metric	Type	Labels	Description
`mycel_requests_total`	Counter	method, path, status	Total HTTP requests
`mycel_request_duration_seconds`	Histogram	method, path	Request duration
`mycel_requests_in_flight`	Gauge	method, path	Current in-flight requests

Flow Metrics¶

Metric	Type	Labels	Description
`mycel_flow_executions_total`	Counter	flow, status	Total flow executions
`mycel_flow_duration_seconds`	Histogram	flow	Flow execution duration
`mycel_flow_errors_total`	Counter	flow, error_type	Flow errors by type

Connector Metrics¶

Metric	Type	Labels	Description
`mycel_connector_health`	Gauge	connector, type	Health status (1=healthy)
`mycel_connector_operations_total`	Counter	connector, type, operation, status	Operations count
`mycel_connector_latency_seconds`	Histogram	connector, type, operation	Operation latency

Cache Metrics¶

Metric	Type	Labels	Description
`mycel_cache_hits_total`	Counter	cache	Cache hits
`mycel_cache_misses_total`	Counter	cache	Cache misses
`mycel_cache_size`	Gauge	cache	Current cache size

Profile Metrics¶

Metric	Type	Labels	Description
`mycel_connector_profile_active`	Gauge	connector, profile	Active profile (1=active)
`mycel_connector_profile_requests_total`	Counter	connector, profile	Requests per profile
`mycel_connector_profile_errors_total`	Counter	connector, profile, error	Errors per profile
`mycel_connector_profile_fallback_total`	Counter	connector, from, to	Fallback events
`mycel_connector_profile_latency_seconds`	Histogram	connector, profile	Latency per profile

Synchronization Metrics¶

Metric	Type	Labels	Description
`mycel_lock_acquired_total`	Counter	key	Locks acquired
`mycel_lock_released_total`	Counter	key	Locks released
`mycel_lock_wait_seconds`	Histogram	key	Lock wait time
`mycel_lock_timeout_total`	Counter	key	Lock timeouts
`mycel_lock_held`	Gauge	key	Currently held locks
`mycel_semaphore_acquired_total`	Counter	key	Semaphore permits acquired
`mycel_semaphore_available`	Gauge	key	Available permits
`mycel_coordinate_signal_total`	Counter	signal	Signals emitted
`mycel_coordinate_wait_seconds`	Histogram	signal	Wait duration

Runtime Metrics¶

Metric	Type	Labels	Description
`mycel_uptime_seconds`	Gauge	-	Service uptime
`mycel_goroutines`	Gauge	-	Current goroutines
`mycel_service_info`	Gauge	service, version	Service metadata
`mycel_scheduled_flows`	Gauge	-	Scheduled flows count

Accessing Metrics¶

# Get all metrics
curl http://localhost:3000/metrics

# Filter specific metrics
curl http://localhost:3000/metrics | grep mycel_flow

# Get flow durations
curl http://localhost:3000/metrics | grep mycel_flow_duration

Example output:

# HELP mycel_requests_total Total number of HTTP requests processed
# TYPE mycel_requests_total counter
mycel_requests_total{method="GET",path="/users",status="200"} 150
mycel_requests_total{method="POST",path="/users",status="201"} 25

# HELP mycel_flow_duration_seconds Flow execution duration in seconds
# TYPE mycel_flow_duration_seconds histogram
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.005"} 120
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.01"} 145
mycel_flow_duration_seconds_sum{flow="get_users"} 0.45
mycel_flow_duration_seconds_count{flow="get_users"} 150
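For ad hoc checks from a script, the text format shown above is simple enough to parse by hand. The following is a minimal sketch, assuming label values contain no escaped quotes and samples carry no trailing timestamp; `parse_samples` is a hypothetical helper, not part of Mycel, and a real integration would more likely rely on a Prometheus client library.

```python
import re

# Matches: name{label="value",...} number   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Excerpt of the example output from this guide.
EXAMPLE = '''
# HELP mycel_requests_total Total number of HTTP requests processed
# TYPE mycel_requests_total counter
mycel_requests_total{method="GET",path="/users",status="200"} 150
mycel_requests_total{method="POST",path="/users",status="201"} 25
'''

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```

Lines the regular expression cannot match are skipped rather than raising, which keeps the sketch tolerant of parts of the format it does not cover.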

Health Checks¶

Detailed Health (`/health`)¶

Returns comprehensive status of all components:

curl http://localhost:3000/health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.0.0",
  "uptime": "2h30m15s",
  "components": [
    {
      "name": "postgres",
      "status": "healthy",
      "latency": "5ms"
    },
    {
      "name": "redis",
      "status": "healthy",
      "latency": "1ms"
    }
  ]
}
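A script that consumes this response usually only needs to know which components are failing. The sketch below assumes the field names shown in the response above and treats any status other than `"healthy"` as a failure, which is an assumption rather than documented behavior; `unhealthy_components` and the sample payload are hypothetical.

```python
import json


def unhealthy_components(health_body):
    """Return the names of components whose status is not "healthy"."""
    report = json.loads(health_body)
    return [
        component["name"]
        for component in report.get("components", [])
        if component.get("status") != "healthy"
    ]


# Hypothetical payload modeled on the /health response shown in this guide.
SAMPLE = '''
{
  "status": "unhealthy",
  "components": [
    {"name": "postgres", "status": "healthy", "latency": "5ms"},
    {"name": "redis", "status": "unhealthy", "latency": "1ms"}
  ]
}
'''

print(unhealthy_components(SAMPLE))  # prints ['redis']
```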

Liveness Probe (`/health/live`)¶

A simple check that the process is alive: it returns 200 for as long as the process is running and can answer HTTP requests.

curl http://localhost:3000/health/live

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.0.0",
  "uptime": "2h30m15s"
}

Readiness Probe (`/health/ready`)¶

Checks whether the service is ready to receive traffic (all connectors healthy).

curl http://localhost:3000/health/ready

Response (healthy):

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z"
}

Response (not ready):

{
  "status": "unhealthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "metadata": {
    "reason": "service not ready"
  }
}

Kubernetes Configuration¶

Deployment with Probes¶

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
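spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec: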
      containers:
        - name: mycel
          image: ghcr.io/matutetandil/mycel:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"

ServiceMonitor for Prometheus Operator¶

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Logging¶

Log Levels¶

Level	Description	Use Case
`debug`	Detailed debugging	Development
`info`	Normal operations	Production (default)
`warn`	Warning conditions	Issues that may need attention
`error`	Error conditions	Failures that need investigation

Configuration¶

# Via command line
mycel start --log-level debug --log-format json

# Via environment variables
MYCEL_LOG_LEVEL=debug MYCEL_LOG_FORMAT=json mycel start

Log Format¶

Text format (development):

2024-01-15T10:30:00.000Z INFO  Starting service: my-service
2024-01-15T10:30:00.001Z INFO  Loaded 3 connectors: api, db, cache
2024-01-15T10:30:00.002Z INFO  REST server listening on :3000

JSON format (production):

{"time":"2024-01-15T10:30:00.000Z","level":"INFO","msg":"Starting service","service":"my-service"}
{"time":"2024-01-15T10:30:00.001Z","level":"INFO","msg":"Loaded connectors","count":3}
{"time":"2024-01-15T10:30:00.002Z","level":"INFO","msg":"REST server listening","port":3000}

Choosing a format¶

Format	Use for	Why
`text`	Local `tail` / development	Pretty, human-friendly, colored when supported.
`json`	Production / log pipelines	Each log line is a queryable object: `level`, `flow`, `connector`, `error_type`, etc. become first-class fields in Loki / Elastic / Datadog. That is the difference between substring search and structured queries.

If you ship logs to a backend (next section), always pick json. The text format works too, but every downstream tool either re-parses it (extra cost, fragile) or limits you to substring matching.
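To make the difference between structured queries and substring matching concrete, here is a minimal sketch that filters JSON log lines by field value. The field names `level` and `flow` come from this guide; `filter_logs` and the sample lines are hypothetical and only illustrate the idea.

```python
import json


def filter_logs(lines, **wanted):
    """Yield parsed log records whose fields equal every wanted value."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (for example text-format output)
        if not isinstance(record, dict):
            continue
        if all(record.get(key) == value for key, value in wanted.items()):
            yield record


# Hypothetical log lines: two JSON records and one text-format line.
LINES = [
    '{"time":"2024-01-15T10:30:00.000Z","level":"INFO","msg":"Starting service","service":"my-service"}',
    '{"time":"2024-01-15T10:30:05.000Z","level":"ERROR","msg":"write failed","flow":"item_create"}',
    '2024-01-15T10:30:06.000Z INFO  plain text line',
]

errors = list(filter_logs(LINES, level="ERROR", flow="item_create"))
print([record["msg"] for record in errors])  # prints ['write failed']
```

The text-format line is silently dropped here, which mirrors the point above: without structure, a tool can only fall back to substring matching.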

Shipping logs to a backend¶

Mycel logs to stdout / stderr — and that's where you should keep them. The recommended pattern across every Mycel deployment is the same one that's been the cloud-native default for the last decade:

The app logs to stdout. A collector outside the app ships those logs to the backend of your choice.

Why not push from inside Mycel directly?

The runtime stays focused on flows; it doesn't carry retry / batching / back-pressure logic for an external log API.
A failing log backend cannot stall your request path.
You can swap Loki for Datadog for Elastic without redeploying Mycel — just reconfigure the collector.
Containers, k8s, and most PaaS already capture stdout for free.

A handful of well-known collectors cover most realistic backends. Pick one; they all consume Mycel's JSON logs cleanly. Run `mycel start --log-format json` (or set `MYCEL_LOG_FORMAT=json`) and point the collector at the container's stdout.

Promtail → Grafana Loki¶

The Grafana-stack default. Tails container stdout, ships to Loki, queryable from Grafana.

# promtail.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: mycel
    static_configs:
      - targets: [localhost]
        labels:
          job: mycel
          service: my-consumer
          __path__: /var/log/containers/mycel-*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg:   msg
            flow:  flow
            error: error
      - labels:
          level:
          flow:
      - timestamp:
          source: time
          format: RFC3339

In Grafana the JSON unlocks LogQL queries that wouldn't work over plain text:

# All error logs from a specific flow
{job="mycel", service="my-consumer"} | json | flow="item_create" | level=~"(?i)error"

# Rate of timeout errors per flow over 5 minutes
sum by (flow) (rate({job="mycel"} | json | error_type="timeout" [5m]))

# Filter by any structured field the runtime emits
{job="mycel"} | json | connector="rabbit" |~ "deadline"

Vector → any backend¶

Vector is the most flexible option: one config can fan out the same logs to several backends (Loki + S3 + a webhook, for example) with transforms in between. Good when you don't want to commit to one ecosystem.

# vector.toml
[sources.mycel]
type = "docker_logs"
include_containers = ["mycel"]

[transforms.parse]
type   = "remap"
inputs = ["mycel"]
source = '''
  . = parse_json!(.message)
'''

[sinks.loki]
type     = "loki"
inputs   = ["parse"]
endpoint = "http://loki:3100"
labels   = { job = "mycel", level = "{{ level }}", flow = "{{ flow }}" }
encoding = { codec = "json" }

Swap [sinks.loki] for [sinks.elasticsearch], [sinks.datadog_logs], [sinks.http], etc. — Vector supports ~40 sinks out of the box.

Fluent Bit → Kubernetes daemonset¶

The default in many k8s clusters. Lightweight, written in C, ships as a DaemonSet that tails every pod's stdout.

# fluent-bit.conf
[INPUT]
    Name              tail
    Path              /var/log/containers/mycel-*.log
    Parser            docker
    Tag               mycel.*
    Refresh_Interval  5

[FILTER]
    Name    parser
    Match   mycel.*
    Key_Name log
    Parser  json

[OUTPUT]
    Name        loki
    Match       mycel.*
    Host        loki
    Port        3100
    Labels      job=mycel
    Label_Keys  $level, $flow, $service

Most clusters already have Fluent Bit installed — check before adding another collector.

OpenTelemetry Collector → any OTLP backend¶

The vendor-neutral standard. One collector config can route to Loki, Datadog, New Relic, Honeycomb, Elastic, Splunk, Grafana Cloud — every major vendor speaks OTLP.

# otel-collector.yaml
receivers:
  filelog:
    include: [/var/log/containers/mycel-*.log]
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.000Z07:00'
        severity:
          parse_from: attributes.level

exporters:
  otlphttp:
    endpoint: ${OTEL_LOGS_ENDPOINT}
    headers:
      "x-api-key": ${OTEL_API_KEY}

service:
  pipelines:
    logs:
      receivers:  [filelog]
      exporters:  [otlphttp]

This is the safest long-term bet — if you change vendor in 2 years, only the exporters: block changes.

Deploying the collector¶

Docker Compose: add the collector as another service alongside Mycel; share the log volume or use the Docker log driver.
Kubernetes: run Fluent Bit / Vector / OTel Collector as a DaemonSet (one per node) tailing /var/log/containers/*.log, OR as a sidecar in the Mycel pod. DaemonSet is cheaper at scale; sidecar gives you per-pod isolation.
VM / bare metal: run the collector as a systemd unit alongside Mycel; point it at journald or at Mycel's log file if you redirect stdout.

The bundled local monitoring stack at `monitoring/` (Prometheus + Grafana) does not include a log collector by default. For local development, `docker compose logs mycel` is enough. Add Promtail or Vector to that stack when you want LogQL queries locally.

In-process shipping (future)¶

Mycel does not ship logs over the network from inside the process. There's a fair case for adding a built-in OTLP sink (`logging { sink "otlp" { } }` declared in HCL alongside connectors) for deployments without a collector, such as a small VPS or an edge node, and it fits Mycel's "configuration, not code" ethos. It's not on the roadmap today; open an issue or a discussion if you have a concrete use case.

Distributed Tracing (OpenTelemetry)¶

Mycel can emit OpenTelemetry traces over OTLP, so a single request can be followed end-to-end across services in Jaeger, Tempo, Grafana, or any OTel-compatible backend.

Tracing is opt-in and a strict no-op when unconfigured — there is no hot-path cost unless you turn it on.

Enabling it¶

Either set MYCEL_TRACING=true, or simply point Mycel at a collector with the standard OTLP endpoint variable (which turns tracing on by itself):

# Auto-enabled: setting an OTLP endpoint is enough
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 mycel start

# Or explicitly
MYCEL_TRACING=true OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 mycel start

The OTLP/gRPC exporter reads the rest of its configuration from the standard OTEL_* environment variables — endpoint, headers, TLS/insecure, timeout — so Mycel is wired up exactly like any other OpenTelemetry service. Common ones:

Variable	Purpose
`OTEL_EXPORTER_OTLP_ENDPOINT`	Collector endpoint (e.g. `http://otel-collector:4317`)
`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`	Traces-only endpoint override
`OTEL_EXPORTER_OTLP_HEADERS`	Headers (e.g. auth: `authorization=Bearer ...`)
`OTEL_EXPORTER_OTLP_INSECURE`	`true` to disable TLS (plaintext collector)

service.name and service.version come from your service config; override with OTEL_SERVICE_NAME if needed.

What gets traced¶

A root span per flow execution, started at the single choke-point every request passes through — so it works for any source connector (queue message, HTTP body, TCP frame, CDC event), in any environment.
Inbound context propagation: the flow joins an existing distributed trace when a W3C traceparent is present in the source headers (HTTP or message headers; lookup is case-insensitive).
Child spans around connector writes (`to {}` destinations), tagged with the connector, operation, and target.
Depth inside the flow: the transform/steps stage, the `to { transaction {} }` block, and each `each` loop within it get their own spans, so the trace shows where a flow's time actually goes (for example, which `each` loop in a large transaction is slow) instead of a flat flow → write.
Outbound propagation on HTTP client calls and on RabbitMQ / Kafka publishes (the traceparent is written into the message headers), so the downstream service or consumer continues the same trace.

Span attributes include mycel.flow, mycel.source, mycel.connector, and the operation; errored flows and writes are marked on the span.
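The inbound propagation step described above relies on the W3C `traceparent` header, which has the form `version-traceid-parentid-flags`. The sketch below shows how such a header could be located case-insensitively and validated; it is an illustration of the header format for version `00`, not Mycel's actual implementation, and `extract_trace_context` is a hypothetical name.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def extract_trace_context(headers):
    """Return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    # Header names are compared case-insensitively.
    lowered = {str(key).lower(): value for key, value in headers.items()}
    raw = lowered.get("traceparent")
    if raw is None:
        return None
    match = TRACEPARENT_RE.match(raw.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The specification forbids version ff and all-zero identifiers.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit means "sampled"
    return trace_id, span_id, sampled


incoming = {"TraceParent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(extract_trace_context(incoming))
```

When the function returns `None`, a tracer would typically start a new root trace instead of joining an existing one.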

Correlating logs with traces¶

Logs emitted with a context during a traced flow automatically carry trace_id and span_id, so you can pivot from a log line to its trace (and back) in Grafana/Loki/Tempo. This is a no-op when there is no active span, so it adds nothing when tracing is off.

{"time":"...","level":"INFO","msg":"request","flow":"item_update","duration":"812ms","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
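Outside Grafana, the same pivot can be done with a few lines of scripting. This is a minimal sketch that groups JSON log records by their `trace_id` field; the field names come from the log line above, while `group_by_trace` and the sample lines are hypothetical.

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """Map each trace_id to the list of parsed log records that carry it."""
    traces = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        trace_id = record.get("trace_id")
        if trace_id:  # lines logged outside a traced flow have no trace_id
            traces[trace_id].append(record)
    return dict(traces)


# Hypothetical log lines; only the field names are taken from this guide.
LINES = [
    '{"level":"INFO","msg":"request","flow":"item_update","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}',
    '{"level":"INFO","msg":"Starting service","service":"my-service"}',
    '{"level":"ERROR","msg":"write failed","flow":"item_update","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"b7ad6b7169203331"}',
]

grouped = group_by_trace(LINES)
for trace_id, records in grouped.items():
    print(trace_id, [record["msg"] for record in records])
```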

Granular per-step / per-rule transform spans, per-statement transaction spans, and spans around the sync primitives (lock / coordinate / sequence_guard) are planned refinements; today the transform/transaction stages are spanned as a whole and sync wait time shows up as the leading gap inside the flow span.

Header-less brokers: Redis Pub/Sub and MQTT v3 carry no message headers, so trace context cannot cross those hops (Mycel does not embed it in the payload). A trace will show the flow that consumes such a message but cannot be linked from the publishing side over that hop.

This is separate from the debugging tracer (verbose flow logging + the Studio debugger), which is for local development. The two are independent and can be active at the same time. Prometheus /metrics is unaffected by tracing.

OTLP export of metrics/logs is a planned follow-up.

Grafana Dashboard¶

Import Dashboard¶

Open Grafana
Go to Dashboards > Import
Use the JSON below or import from file

Example Dashboard JSON¶

{
  "title": "Mycel Service Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(mycel_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Request Duration (p95)",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "{{path}}"
        }
      ]
    },
    {
      "title": "Flow Errors",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(mycel_flow_errors_total[5m])",
          "legendFormat": "{{flow}} - {{error_type}}"
        }
      ]
    },
    {
      "title": "Cache Hit Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(mycel_cache_hits_total[5m])) / (sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m])))"
        }
      ]
    },
    {
      "title": "Connector Health",
      "type": "table",
      "targets": [
        {
          "expr": "mycel_connector_health",
          "legendFormat": "{{connector}}"
        }
      ]
    }
  ]
}

Common Queries¶

Request Rate¶

rate(mycel_requests_total[5m])

Error Rate¶

sum(rate(mycel_requests_total{status=~"5.."}[5m])) / sum(rate(mycel_requests_total[5m]))

Request Duration (p95)¶

histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m]))
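To build intuition for what this query returns, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the general idea behind `histogram_quantile`. It is a simplified illustration operating on raw bucket counts rather than rates, and it is not Prometheus's exact implementation. The bucket values reuse the `get_users` example output from earlier in this guide, with the `+Inf` bucket taken from the `_count` sample.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative buckets for flow="get_users" from the example metrics output.
BUCKETS = [(0.005, 120), (0.01, 145), (float("inf"), 150)]

print(histogram_quantile(0.95, BUCKETS))  # approximately 0.0095 seconds
```

Because the estimate is interpolated within a bucket, its precision depends on how closely the bucket boundaries bracket the latencies you care about.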

Slow Flows¶

histogram_quantile(0.99, rate(mycel_flow_duration_seconds_bucket[5m])) > 1

Cache Hit Rate¶

sum(rate(mycel_cache_hits_total[5m])) /
(sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m])))
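The same ratio can be reproduced from two scrapes of the counters, since the per-second division in `rate` cancels out. A minimal sketch, assuming no counter reset happened between the scrapes; `hit_ratio` and the sample values are hypothetical.

```python
def hit_ratio(hits_before, hits_after, misses_before, misses_after):
    """Return hits / (hits + misses) over the interval, or None if no lookups."""
    hits = hits_after - hits_before
    misses = misses_after - misses_before
    lookups = hits + misses
    if lookups == 0:
        return None  # no cache traffic in the window, so the ratio is undefined
    return hits / lookups


# Hypothetical counter values taken from two scrapes of /metrics.
print(hit_ratio(hits_before=1000, hits_after=1090, misses_before=200, misses_after=210))  # prints 0.9
```

Returning `None` for an idle window mirrors the PromQL behavior, where a division by zero yields no usable value rather than a hit rate of zero.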

Unhealthy Connectors¶

mycel_connector_health == 0

Alerting Rules¶

Example Prometheus Alerts¶

groups:
  - name: mycel
    rules:
      - alert: HighErrorRate
        expr: sum by (path) (rate(mycel_requests_total{status=~"5.."}[5m])) / sum by (path) (rate(mycel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.path }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests on {{ $labels.path }}"
          description: "p95 latency is {{ $value }}s"

      - alert: ConnectorUnhealthy
        expr: mycel_connector_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Connector {{ $labels.connector }} is unhealthy"

      - alert: LowCacheHitRate
        expr: |
          sum(rate(mycel_cache_hits_total[5m])) /
          (sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m]))) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      - alert: HighLockContention
        expr: rate(mycel_lock_timeout_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High lock contention on {{ $labels.key }}"

Docker Compose with Monitoring¶

version: '3.8'

services:
  mycel:
    image: ghcr.io/matutetandil/mycel:latest
    ports:
      - "3000:3000"
    environment:
      - MYCEL_LOG_FORMAT=json

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

prometheus.yml¶

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'mycel'
    static_configs:
      - targets: ['mycel:3000']

Best Practices¶

Use JSON logging in production for easier parsing by log aggregators
Set appropriate alert thresholds based on your SLOs
Monitor cache hit rates; a persistently low rate can indicate misconfigured cache keys
Track connector latency to identify slow dependencies
Use readiness probes to prevent routing traffic to unhealthy instances
Set resource limits based on observed metrics