Java Profiling Runbook
Enable Continuous Profiling
Add these annotations to the metadata of a Java workload or Pod:
metadata:
  annotations:
    java-profiler.io/profile-mode: continuous
    java-profiler.io/startup-delay: 30s

The collector waits for the startup delay, verifies HotSpot compatibility, checks for async-profiler conflicts, then uploads normalized profiles.
Enable Temporary Profiling
metadata:
  annotations:
    java-profiler.io/profile-disabled: "false"
    java-profiler.io/profile-mode: temporary
    java-profiler.io/profile-duration: 10m
    java-profiler.io/startup-delay: 0s
    java-profiler.io/snapshot-interval: 10s

Temporary profiling stops automatically when the duration expires. High-frequency thread snapshots are only intended for temporary windows.
Temporary windows are evaluated from the target Pod/JVM lifecycle. If you add temporary metadata to a long-running Pod, it may immediately report temporary_expired. For a clean incident window, update the Pod template and roll the workload, or add a run-specific annotation such as java-profiler.io/acceptance-run: "<timestamp>" so Kubernetes creates a fresh Pod.
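One way to get a fresh Pod is to stamp the Pod template with the run-specific annotation via kubectl patch. The sketch below only builds the patch body. It assumes the workload is a Deployment, and the function name build_acceptance_run_patch is illustrative rather than part of the profiler tooling.

```shell
# Build a strategic-merge patch that sets a run-specific annotation on the
# Pod template. Changing the template makes the Deployment roll out new Pods.
# Assumption: the run id contains no characters that need JSON escaping.
build_acceptance_run_patch() {
  local run_id="$1"
  printf '{"spec":{"template":{"metadata":{"annotations":{"java-profiler.io/acceptance-run":"%s"}}}}}' "$run_id"
}

# Hypothetical usage (Deployment name and namespace are placeholders):
#   kubectl patch deployment jdk17-http-demo -n java-profiler-qa \
#     -p "$(build_acceptance_run_patch "$(date +%Y%m%d%H%M%S)")"
```

Keeping the patch construction separate from the kubectl call makes it easy to print and review the body before applying it during an incident.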
Disable Profiling
metadata:
  annotations:
    java-profiler.io/profile-disabled: "true"

Explicit disable wins over continuous and temporary enablement.
When re-enabling a workload that was previously disabled, remove the key or set java-profiler.io/profile-disabled: "false" on the Pod template. A stale truthy profile-disabled annotation keeps the target in disabled_by_metadata even when profile-mode is temporary or continuous.
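Removing the stale key from a Pod template can be done with a JSON patch. The annotation key contains a "/", which must be escaped as "~1" in a JSON Pointer path. The following is a minimal sketch, assuming the workload is a Deployment; the helper names are illustrative.

```shell
# Escape a key for use inside a JSON Pointer (RFC 6901):
# "~" becomes "~0" first, then "/" becomes "~1".
json_pointer_escape() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g'
}

# Build a JSON patch that removes one annotation from the Pod template.
build_remove_annotation_patch() {
  local key
  key="$(json_pointer_escape "$1")"
  printf '[{"op":"remove","path":"/spec/template/metadata/annotations/%s"}]' "$key"
}

# Hypothetical usage (Deployment name and namespace are placeholders):
#   kubectl patch deployment jdk17-http-demo -n java-profiler-qa --type=json \
#     -p "$(build_remove_annotation_patch 'java-profiler.io/profile-disabled')"
```

A JSON patch "remove" operation fails when the path does not exist, so the command errors out if the annotation has already been removed.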
Validate an Existing Workload
For a production-shaped smoke test, keep the window short and save before/after evidence. The real acceptance script can point the profiler Helm release at an existing workload without creating the synthetic BusyApp:
KUBECONFIG=/path/to/kubeconfig \
JAVA_PROFILER_COLLECTOR_INTERVAL=30s \
scripts/real-acceptance.sh \
--configure-profiler \
--namespace java-profiler-qa \
--service jdk17-http-demo \
--artifact-dir /tmp/java-profiler-jdk17-demo-$(date +%Y%m%d-%H%M%S) \
--require-full-profiling

The script records target Pod state before and after the run and fails if the selected workload's restart count increases. With --require-full-profiling, only data created after the acceptance run starts can satisfy the check; historical profile rows from earlier Pods or earlier runs do not count. Use --skip-workload-rollout-check only when --service is a label-level filter rather than a Deployment name.
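The same before/after restart check can be reproduced by hand when reviewing a run. This is a rough sketch and not the logic of scripts/real-acceptance.sh itself; the helper names and the label selector in the usage comment are assumptions.

```shell
# Sum restart counts given one integer per line on stdin.
# Prints 0 when the input is empty.
sum_restarts() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Succeed only if the restart count did not increase during the run.
restarts_unchanged() {
  local before="$1" after="$2"
  [ "$after" -le "$before" ]
}

# Hypothetical usage (the label selector is a placeholder):
#   before=$(kubectl get pods -n java-profiler-qa -l app=jdk17-http-demo \
#     -o jsonpath='{range .items[*].status.containerStatuses[*]}{.restartCount}{"\n"}{end}' \
#     | sum_restarts)
#   ... run the profiling window, then capture "after" the same way ...
#   restarts_unchanged "$before" "$after" || echo "restart count increased"
```

Summing across Pods is a simplification: if Pods were replaced during the window, compare per-Pod counts instead.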
For strict acceptance after collector/backend/web changes, first build images from the current workspace and deploy those exact tags:
export BACKEND_IMAGE=java-profiler-backend:qa-$(date +%Y%m%d%H%M%S)
export COLLECTOR_IMAGE=java-profiler-collector:qa-$(date +%Y%m%d%H%M%S)
export WEB_IMAGE=java-profiler-web:qa-$(date +%Y%m%d%H%M%S)
bash scripts/build-real-acceptance-images.sh
KUBECONFIG=/path/to/kubeconfig \
BACKEND_IMAGE="$BACKEND_IMAGE" \
COLLECTOR_IMAGE="$COLLECTOR_IMAGE" \
WEB_IMAGE="$WEB_IMAGE" \
scripts/real-acceptance.sh \
--service jdk17-http-demo \
--configure-profiler \
--require-full-profiling \
--high-volume \
--artifact-dir /tmp/java-profiler-real-acceptance-$(date +%Y%m%d%H%M%S)

If a previous run left async-profiler loaded in the target JVM and the status becomes profiler_conflict, roll the target Pod before retrying strict acceptance.
Failure Statuses
- disabled_by_metadata: workload has no opt-in metadata or has explicit disable
- unsupported_jvm: process is not HotSpot-compatible
- profiler_conflict: another async-profiler user is detected
- temporary_expired: temporary profile window has expired; roll the Pod or start a fresh temporary window
- attach_failed: collector could not attach to the JVM
- upload_retryable: backend or network failure can be retried
- upload_dropped: collector buffer overflow dropped old batches
- storage_rejected: backend rejected invalid or conflicting data
If status is accepted but profile data stays empty, check backend ingestion health before changing the workload. Rejected profile batches commonly indicate a collector/backend payload contract mismatch or ClickHouse schema drift.
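For on-call use, the statuses can be mapped to a first response in a small helper. The sketch below is illustrative: the function name suggest_action is hypothetical, and the suggested actions only restate guidance from this runbook. Statuses with no remediation described here are reported as such.

```shell
# Print a suggested first response for a profiler target status.
# Returns non-zero for a status this runbook does not list.
suggest_action() {
  case "$1" in
    disabled_by_metadata)
      echo "check opt-in annotations and remove any stale profile-disabled key" ;;
    profiler_conflict)
      echo "roll the target Pod before retrying" ;;
    temporary_expired)
      echo "roll the Pod or start a fresh temporary window" ;;
    upload_retryable)
      echo "retry after checking backend and network health" ;;
    storage_rejected)
      echo "check backend ingestion health and payload or schema compatibility" ;;
    unsupported_jvm|attach_failed|upload_dropped)
      echo "no remediation listed in this runbook" ;;
    *)
      echo "unknown status: $1"
      return 1 ;;
  esac
}
```

For example, `suggest_action temporary_expired` prints the roll-or-restart guidance, while an unrecognized status returns a non-zero exit code so that wrapper scripts can flag it.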
Retention
Profile samples, stacks, thread snapshots, deadlock events, target status, and ingestion health have seven-day TTLs. Optional artifact index rows are retained for 24 hours maximum.
Metrics
Collector and backend expose Prometheus-compatible metrics. Prometheus owns metric storage, dashboards, alerting, and retention. This system does not store Prometheus-style time series in ClickHouse.
Expected metric groups:
- target discovery and status counters
- profiler active, disabled, skipped, and failed counters
- upload success, retry, duplicate, and dropped-batch counters
- backend ingestion success and failure counters
- ClickHouse latency, table size, and TTL lag gauges