Observability Guide¶
This guide covers metrics, health checks, and logging in Mycel services.
Overview¶
Mycel provides built-in observability features:
| Feature | Endpoint | Purpose |
|---|---|---|
| Metrics | `/metrics` | Prometheus metrics |
| Health | `/health` | Detailed health status |
| Liveness | `/health/live` | Kubernetes liveness probe |
| Readiness | `/health/ready` | Kubernetes readiness probe |
Prometheus Metrics¶
Mycel exposes metrics in Prometheus format at /metrics.
Request Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mycel_requests_total` | Counter | method, path, status | Total HTTP requests |
| `mycel_request_duration_seconds` | Histogram | method, path | Request duration |
| `mycel_requests_in_flight` | Gauge | method, path | Current in-flight requests |
Flow Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mycel_flow_executions_total` | Counter | flow, status | Total flow executions |
| `mycel_flow_duration_seconds` | Histogram | flow | Flow execution duration |
| `mycel_flow_errors_total` | Counter | flow, error_type | Flow errors by type |
Connector Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mycel_connector_health` | Gauge | connector, type | Health status (1=healthy) |
| `mycel_connector_operations_total` | Counter | connector, type, operation, status | Operations count |
| `mycel_connector_latency_seconds` | Histogram | connector, type, operation | Operation latency |
Cache Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mycel_cache_hits_total` | Counter | cache | Cache hits |
| `mycel_cache_misses_total` | Counter | cache | Cache misses |
| `mycel_cache_size` | Gauge | cache | Current cache size |
Profile Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mycel_connector_profile_active` | Gauge | connector, profile | Active profile (1=active) |
| `mycel_connector_profile_requests_total` | Counter | connector, profile | Requests per profile |
| `mycel_connector_profile_errors_total` | Counter | connector, profile, error | Errors per profile |
| `mycel_connector_profile_fallback_total` | Counter | connector, from, to | Fallback events |
| `mycel_connector_profile_latency_seconds` | Histogram | connector, profile | Latency per profile |
Synchronization Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mycel_lock_acquired_total` | Counter | key | Locks acquired |
| `mycel_lock_released_total` | Counter | key | Locks released |
| `mycel_lock_wait_seconds` | Histogram | key | Lock wait time |
| `mycel_lock_timeout_total` | Counter | key | Lock timeouts |
| `mycel_lock_held` | Gauge | key | Currently held locks |
| `mycel_semaphore_acquired_total` | Counter | key | Semaphore permits acquired |
| `mycel_semaphore_available` | Gauge | key | Available permits |
| `mycel_coordinate_signal_total` | Counter | signal | Signals emitted |
| `mycel_coordinate_wait_seconds` | Histogram | signal | Wait duration |
Runtime Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mycel_uptime_seconds` | Gauge | - | Service uptime |
| `mycel_goroutines` | Gauge | - | Current goroutines |
| `mycel_service_info` | Gauge | service, version | Service metadata |
| `mycel_scheduled_flows` | Gauge | - | Scheduled flows count |
Accessing Metrics¶
# Get all metrics
curl http://localhost:3000/metrics
# Filter specific metrics
curl http://localhost:3000/metrics | grep mycel_flow
# Get flow durations
curl http://localhost:3000/metrics | grep mycel_flow_duration
Example output:
# HELP mycel_requests_total Total number of HTTP requests processed
# TYPE mycel_requests_total counter
mycel_requests_total{method="GET",path="/users",status="200"} 150
mycel_requests_total{method="POST",path="/users",status="201"} 25
# HELP mycel_flow_duration_seconds Flow execution duration in seconds
# TYPE mycel_flow_duration_seconds histogram
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.005"} 120
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.01"} 145
mycel_flow_duration_seconds_sum{flow="get_users"} 0.45
mycel_flow_duration_seconds_count{flow="get_users"} 150
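A histogram's `_sum` and `_count` series are enough to derive a mean duration per flow. The following is a minimal Python sketch that does this from the text format shown above. The function names (`parse_samples`, `mean_flow_duration`) are illustrative and not part of Mycel, and the parser is deliberately simplified: it ignores optional sample timestamps and is not a full implementation of the exposition format.

```python
import re

# Sample lines copied from the example output above.
SAMPLE = """\
# HELP mycel_flow_duration_seconds Flow execution duration in seconds
# TYPE mycel_flow_duration_seconds histogram
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.005"} 120
mycel_flow_duration_seconds_bucket{flow="get_users",le="0.01"} 145
mycel_flow_duration_seconds_sum{flow="get_users"} 0.45
mycel_flow_duration_seconds_count{flow="get_users"} 150
"""

# name, optional {labels}, then a single value (no timestamp support).
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # unsupported line shape; skipped in this sketch
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def mean_flow_duration(text):
    """Return {flow: mean duration in seconds}, computed as _sum / _count."""
    sums, counts = {}, {}
    for name, labels, value in parse_samples(text):
        flow = labels.get("flow")
        if name == "mycel_flow_duration_seconds_sum":
            sums[flow] = value
        elif name == "mycel_flow_duration_seconds_count":
            counts[flow] = value
    return {flow: sums[flow] / counts[flow] for flow in sums if counts.get(flow)}


if __name__ == "__main__":
    for flow, mean in mean_flow_duration(SAMPLE).items():
        print(f"{flow}: {mean * 1000:.1f} ms")
```

For the sample above this gives a mean of 0.45 / 150 = 0.003 seconds for `get_users`. In a real setup the same ratio is usually computed in PromQL over a rate window rather than from a single scrape.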
Health Checks¶
Detailed Health (/health)¶
Returns comprehensive status of all components:
Response:
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"version": "1.0.0",
"uptime": "2h30m15s",
"components": [
{
"name": "postgres",
"status": "healthy",
"latency": "5ms"
},
{
"name": "redis",
"status": "healthy",
"latency": "1ms"
}
]
}
Liveness Probe (/health/live)¶
A simple check that the process is alive. It returns 200 as long as the process is running.
Response:
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"version": "1.0.0",
"uptime": "2h30m15s"
}
Readiness Probe (/health/ready)¶
Checks whether the service is ready to receive traffic (all connectors healthy).
Response (healthy):
Response (not ready):
{
"status": "unhealthy",
"timestamp": "2024-01-15T10:30:00Z",
"metadata": {
"reason": "service not ready"
}
}
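Outside Kubernetes (for example in a deploy script that waits for a service to come up), the same endpoint can be polled by hand. The sketch below is one way to do that in Python using only the standard library. The helper names `is_ready` and `fetch_readiness` are hypothetical, the base URL assumes the default port used elsewhere in this guide, and the handling of non-2xx responses is an assumption: this guide does not state which HTTP status accompanies the "not ready" body.

```python
import json
import urllib.error
import urllib.request


def is_ready(body: str) -> bool:
    """Return True when a health response body reports status "healthy"."""
    try:
        return json.loads(body).get("status") == "healthy"
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as not ready.
        return False


def fetch_readiness(base_url: str = "http://localhost:3000", timeout: float = 2.0) -> bool:
    """GET /health/ready and interpret the body; network errors count as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health/ready", timeout=timeout) as resp:
            return is_ready(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Assumption: a not-ready service may answer with a non-2xx status
        # while still sending the JSON body shown above.
        return is_ready(err.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("ready" if fetch_readiness() else "not ready")
```

Keeping the body check (`is_ready`) separate from the HTTP call makes the decision logic easy to test without a running service.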
Kubernetes Configuration¶
Deployment with Probes¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: mycel
          image: ghcr.io/matutetandil/mycel:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
ServiceMonitor for Prometheus Operator¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Logging¶
Log Levels¶
| Level | Description | Use Case |
|---|---|---|
| `debug` | Detailed debugging | Development |
| `info` | Normal operations | Production (default) |
| `warn` | Warning conditions | Issues that may need attention |
| `error` | Error conditions | Failures that need investigation |
Configuration¶
# Via command line
mycel start --log-level debug --log-format json
# Via environment variables
MYCEL_LOG_LEVEL=debug MYCEL_LOG_FORMAT=json mycel start
Log Format¶
Text format (development):
2024-01-15T10:30:00.000Z INFO Starting service: my-service
2024-01-15T10:30:00.001Z INFO Loaded 3 connectors: api, db, cache
2024-01-15T10:30:00.002Z INFO REST server listening on :3000
JSON format (production):
{"time":"2024-01-15T10:30:00.000Z","level":"INFO","msg":"Starting service","service":"my-service"}
{"time":"2024-01-15T10:30:00.001Z","level":"INFO","msg":"Loaded connectors","count":3}
{"time":"2024-01-15T10:30:00.002Z","level":"INFO","msg":"REST server listening","port":3000}
Choosing a format¶
| Format | Use for | Why |
|---|---|---|
| `text` | Local tail / development | Pretty, human-friendly, colored when supported. |
| `json` | Production / log pipelines | Each log line is a queryable object: level, flow, connector, error_type, etc. become first-class fields in Loki / Elastic / Datadog, so you can run structured queries instead of string searches. |
If you ship logs to a backend (next section), always pick json. The text format works too, but every downstream tool either re-parses it (extra cost, fragile) or limits you to substring matching.
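To make the difference concrete, here is a small Python sketch that treats each JSON log line as a record and filters on fields instead of substrings. The sample lines are the ones from the JSON format example above; the helper names `parse_log_lines` and `filter_records` are illustrative and not part of Mycel.

```python
import json

# Sample lines copied from the JSON format example above.
SAMPLE_LOGS = [
    '{"time":"2024-01-15T10:30:00.000Z","level":"INFO","msg":"Starting service","service":"my-service"}',
    '{"time":"2024-01-15T10:30:00.001Z","level":"INFO","msg":"Loaded connectors","count":3}',
    '{"time":"2024-01-15T10:30:00.002Z","level":"INFO","msg":"REST server listening","port":3000}',
]


def parse_log_lines(lines):
    """Parse JSON log lines into dicts, skipping anything that is not a JSON object."""
    records = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray text-format line mixed into the stream
        if isinstance(record, dict):
            records.append(record)
    return records


def filter_records(records, **fields):
    """Keep records whose fields equal every given key=value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in fields.items())]


if __name__ == "__main__":
    records = parse_log_lines(SAMPLE_LOGS)
    for record in filter_records(records, level="INFO", port=3000):
        print(record["time"], record["msg"])
```

With the text format, the equivalent filter would have to rely on the message wording staying stable, which is exactly the fragility that structured fields avoid.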
Shipping logs to a backend¶
Mycel logs to stdout / stderr — and that's where you should keep them. The recommended pattern across every Mycel deployment is the same one that's been the cloud-native default for the last decade:
The app logs to stdout. A collector outside the app ships those logs to the backend of your choice.
Why not push from inside Mycel directly?
- The runtime stays focused on flows; it doesn't carry retry / batching / back-pressure logic for an external log API.
- A failing log backend cannot stall your request path.
- You can swap Loki for Datadog for Elastic without redeploying Mycel — just reconfigure the collector.
- Containers, k8s, and most PaaS already capture stdout for free.
A handful of well-known collectors covers every realistic backend. Pick one — they all consume Mycel's JSON logs cleanly. Run mycel start --log-format json (or MYCEL_LOG_FORMAT=json) and point the collector at the container's stdout.
Promtail → Grafana Loki¶
The Grafana-stack default. Tails container stdout, ships to Loki, queryable from Grafana.
# promtail.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: mycel
    static_configs:
      - targets: [localhost]
        labels:
          job: mycel
          service: my-consumer
          __path__: /var/log/containers/mycel-*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: msg
            flow: flow
            error: error
      - labels:
          level:
          flow:
      - timestamp:
          source: time
          format: RFC3339
In Grafana the JSON unlocks LogQL queries that wouldn't work over plain text:
# All error logs from a specific flow
{job="mycel", service="my-consumer"} | json | flow="item_create" | level="error"
# Rate of timeout errors per flow over 5 minutes
sum by (flow) (rate({job="mycel"} | json | error_type="timeout" [5m]))
# Filter by any structured field the runtime emits
{job="mycel"} | json | connector="rabbit" |~ "deadline"
Vector → any backend¶
Vector is the most flexible option: one config can fan out the same logs to several backends (Loki + S3 + a webhook, for example) with transforms in between. Good when you don't want to commit to one ecosystem.
# vector.toml
[sources.mycel]
type = "docker_logs"
include_containers = ["mycel"]
[transforms.parse]
type = "remap"
inputs = ["mycel"]
source = '''
. = parse_json!(.message)
'''
[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
labels = { job = "mycel", level = "{{ level }}", flow = "{{ flow }}" }
encoding = { codec = "json" }
Swap [sinks.loki] for [sinks.elasticsearch], [sinks.datadog_logs], [sinks.http], etc. — Vector supports ~40 sinks out of the box.
Fluent Bit → Kubernetes daemonset¶
The default in many k8s clusters. Lightweight, written in C, ships as a DaemonSet that tails every pod's stdout.
# fluent-bit.conf
[INPUT]
    Name              tail
    Path              /var/log/containers/mycel-*.log
    Parser            docker
    Tag               mycel.*
    Refresh_Interval  5
[FILTER]
    Name      parser
    Match     mycel.*
    Key_Name  log
    Parser    json
[OUTPUT]
    Name        loki
    Match       mycel.*
    Host        loki
    Port        3100
    Labels      job=mycel
    Label_Keys  $level, $flow, $service
Most clusters already have Fluent Bit installed — check before adding another collector.
OpenTelemetry Collector → any OTLP backend¶
The vendor-neutral standard. One collector config can route to Loki, Datadog, New Relic, Honeycomb, Elastic, Splunk, Grafana Cloud — every major vendor speaks OTLP.
# otel-collector.yaml
receivers:
  filelog:
    include: [/var/log/containers/mycel-*.log]
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.000Z07:00'
        severity:
          parse_from: attributes.level
exporters:
  otlphttp:
    endpoint: ${OTEL_LOGS_ENDPOINT}
    headers:
      "x-api-key": ${OTEL_API_KEY}
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlphttp]
This is the safest long-term bet — if you change vendor in 2 years, only the exporters: block changes.
Deploying the collector¶
- Docker Compose: add the collector as another service alongside Mycel; share the log volume or use the Docker log driver.
- Kubernetes: run Fluent Bit / Vector / OTel Collector as a DaemonSet (one per node) tailing `/var/log/containers/*.log`, or as a sidecar in the Mycel pod. DaemonSet is cheaper at scale; sidecar gives you per-pod isolation.
- VM / bare metal: run the collector as a systemd unit alongside Mycel; point it at journald or at Mycel's log file if you redirect stdout.
The bundled local monitoring stack at monitoring/ (Prometheus + Grafana) does not include a log collector by default — for local dev docker compose logs mycel is enough. Add Promtail/Vector to that stack when you want LogQL queries locally.
In-process shipping (future)¶
Mycel does not ship logs over the network from inside the process. There's a fair case for adding a built-in OTLP sink (logging { sink "otlp" { } } declared in HCL alongside connectors) for deployments without a collector — small VPS, edge — and it fits Mycel's "configuration, not code" ethos. It's not on the roadmap today; open an issue or a discussion if you have a concrete use case.
Distributed Tracing (OpenTelemetry)¶
Mycel can emit OpenTelemetry traces over OTLP, so a single request can be followed end-to-end across services in Jaeger, Tempo, Grafana, or any OTel-compatible backend.
Tracing is opt-in and a strict no-op when unconfigured — there is no hot-path cost unless you turn it on.
Enabling it¶
Either set MYCEL_TRACING=true, or simply point Mycel at a collector with the standard OTLP endpoint variable (which turns tracing on by itself):
# Auto-enabled: setting an OTLP endpoint is enough
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 mycel start
# Or explicitly
MYCEL_TRACING=true OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 mycel start
The OTLP/gRPC exporter reads the rest of its configuration from the standard OTEL_* environment variables — endpoint, headers, TLS/insecure, timeout — so Mycel is wired up exactly like any other OpenTelemetry service. Common ones:
| Variable | Purpose |
|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Collector endpoint (e.g. `http://otel-collector:4317`) |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | Traces-only endpoint override |
| `OTEL_EXPORTER_OTLP_HEADERS` | Headers (e.g. auth: `authorization=Bearer ...`) |
| `OTEL_EXPORTER_OTLP_INSECURE` | `true` to disable TLS (plaintext collector) |
service.name and service.version come from your service config; override with OTEL_SERVICE_NAME if needed.
What gets traced¶
- A root span per flow execution, started at the single choke-point every request passes through, so it works for any source connector (queue message, HTTP body, TCP frame, CDC event), in any environment.
- Inbound context propagation: the flow joins an existing distributed trace when a W3C `traceparent` is present in the source headers (HTTP or message headers; lookup is case-insensitive).
- Child spans around connector writes (`to {}` destinations), tagged with the connector, operation, and target.
- Depth inside the flow: the transform/steps stage, the `to { transaction {} }` block, and each `each` loop within it get their own spans, so the trace shows where a flow's time actually goes (e.g. which `each` loop in a large transaction is slow) instead of a flat flow → write.
- Outbound propagation on HTTP client calls and on RabbitMQ / Kafka publishes (the `traceparent` is written into the message headers), so the downstream service or consumer continues the same trace.
Span attributes include mycel.flow, mycel.source, mycel.connector, and the operation; errored flows and writes are marked on the span.
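For readers who have not worked with the header before, the sketch below shows what reading an inbound W3C `traceparent` amounts to: a case-insensitive header lookup followed by parsing the `version-traceid-parentid-flags` value. It is an illustration in Python, not Mycel's implementation; the function names are hypothetical, and it only accepts the four-field layout defined for version `00`.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), lowercase
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def find_traceparent(headers):
    """Look up the traceparent header regardless of key casing."""
    for key, value in headers.items():
        if key.lower() == "traceparent":
            return value
    return None


def parse_traceparent(value):
    """Return trace context as a dict, or None when the value is invalid."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


if __name__ == "__main__":
    headers = {"TraceParent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    print(parse_traceparent(find_traceparent(headers)))
```

When no valid header is found, a traced service simply starts a new trace, which matches the root-span behaviour described in the list above.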
Correlating logs with traces¶
Logs emitted with a context during a traced flow automatically carry trace_id and span_id, so you can pivot from a log line to its trace (and back) in Grafana/Loki/Tempo. This is a no-op when there is no active span, so it adds nothing when tracing is off.
{"time":"...","level":"INFO","msg":"request","flow":"item_update","duration":"812ms","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
Granular per-step / per-rule transform spans, per-statement transaction spans, and spans around the sync primitives (`lock`/`coordinate`/`sequence_guard`) are planned refinements; today the transform/transaction stages are spanned as a whole and sync wait time shows up as the leading gap inside the flow span.
Header-less brokers: Redis Pub/Sub and MQTT v3 carry no message headers, so trace context cannot cross those hops (Mycel does not embed it in the payload). A trace will show the flow that consumes such a message but cannot be linked from the publishing side over that hop.
This is separate from the debugging tracer (verbose flow logging + the Studio debugger), which is for local development. The two are independent and can be active at the same time. Prometheus `/metrics` is unaffected by tracing.
OTLP export of metrics/logs is a planned follow-up.
Grafana Dashboard¶
Import Dashboard¶
- Open Grafana
- Go to Dashboards > Import
- Use the JSON below or import from file
Example Dashboard JSON¶
{
"title": "Mycel Service Dashboard",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(mycel_requests_total[5m])",
"legendFormat": "{{method}} {{path}}"
}
]
},
{
"title": "Request Duration (p95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m]))",
"legendFormat": "{{path}}"
}
]
},
{
"title": "Flow Errors",
"type": "graph",
"targets": [
{
"expr": "rate(mycel_flow_errors_total[5m])",
"legendFormat": "{{flow}} - {{error_type}}"
}
]
},
{
"title": "Cache Hit Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(mycel_cache_hits_total[5m])) / (sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m])))"
}
]
},
{
"title": "Connector Health",
"type": "table",
"targets": [
{
"expr": "mycel_connector_health",
"legendFormat": "{{connector}}"
}
]
}
]
}
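The p95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The Python sketch below approximates that calculation so the numbers on the panel are easier to interpret. It is a simplified illustration, not Prometheus's actual code, and the `+Inf` bucket value in the usage example is an assumption (the sample output earlier in this guide shows only two buckets and a `_count` of 150).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must include a math.inf upper bound, mirroring le="+Inf".
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank falls beyond the last finite bucket: report its upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # 0.005 and 0.01 come from the sample output; the +Inf count is assumed.
    sample = [(0.005, 120), (0.01, 145), (math.inf, 150)]
    print(histogram_quantile(0.95, sample))
```

Because of the interpolation, the result is only as precise as the bucket boundaries: a p95 that lands in a wide bucket can be off by a large share of that bucket's width.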
Common Queries¶
Request Rate¶
Error Rate¶
Request Duration (p95)¶
Slow Flows¶
Cache Hit Rate¶
sum(rate(mycel_cache_hits_total[5m])) /
(sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m])))
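The same ratio can be reproduced by hand from two scrapes of the hit and miss counters, which is a handy sanity check when a dashboard value looks off. The Python sketch below assumes you have already summed each counter across caches for both scrapes; the function names and the dictionary shape are illustrative.

```python
def counter_delta(previous, current):
    """Increase between two counter readings, treating a drop as a reset to zero."""
    return current - previous if current >= previous else current


def cache_hit_rate(prev, cur):
    """Hit rate over the interval between two scrapes, or None with no traffic.

    prev and cur are dicts with "hits" and "misses" counter totals.
    """
    hits = counter_delta(prev["hits"], cur["hits"])
    misses = counter_delta(prev["misses"], cur["misses"])
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    earlier = {"hits": 100, "misses": 50}
    later = {"hits": 190, "misses": 60}
    print(cache_hit_rate(earlier, later))
```

Returning `None` for an idle interval avoids a division by zero; the PromQL expression yields no usable value in the same situation.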
Unhealthy Connectors¶
Alerting Rules¶
Example Prometheus Alerts¶
groups:
  - name: mycel
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by path so the ratio is 5xx requests / all requests
        expr: |
          sum by (path) (rate(mycel_requests_total{status=~"5.."}[5m]))
            / sum by (path) (rate(mycel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.path }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(mycel_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests on {{ $labels.path }}"
          description: "p95 latency is {{ $value }}s"
      - alert: ConnectorUnhealthy
        expr: mycel_connector_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Connector {{ $labels.connector }} is unhealthy"
      - alert: LowCacheHitRate
        expr: |
          sum(rate(mycel_cache_hits_total[5m])) /
          (sum(rate(mycel_cache_hits_total[5m])) + sum(rate(mycel_cache_misses_total[5m]))) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"
      - alert: HighLockContention
        expr: rate(mycel_lock_timeout_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High lock contention on {{ $labels.key }}"
Docker Compose with Monitoring¶
version: '3.8'
services:
  mycel:
    image: ghcr.io/matutetandil/mycel:latest
    ports:
      - "3000:3000"
    environment:
      - MYCEL_LOG_FORMAT=json
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:
prometheus.yml¶
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'mycel'
    static_configs:
      - targets: ['mycel:3000']
Best Practices¶
- Use JSON logging in production for easier parsing by log aggregators
- Set appropriate alert thresholds based on your SLOs
- Monitor cache hit rates - persistently low rates can indicate misconfigured cache keys
- Track connector latency to identify slow dependencies
- Use readiness probes to prevent routing traffic to unhealthy instances
- Set resource limits based on observed metrics
See Also¶
- Configuration Reference - Full HCL reference
- Troubleshooting Guide - Common issues
- Helm Chart - Kubernetes deployment