The Problem: Data Rich, Metric Poor
Most organizations store an enormous amount of operational and business data in Elasticsearch. Application logs, transaction records, user events, order histories — it all flows into ES indices by the terabyte. Teams build Kibana dashboards to visualize it, and for ad-hoc exploration that works beautifully.
But there is a disconnect. The observability stack — whether built on OpenTelemetry, Grafana, Datadog, or any OTLP-compatible backend — lives in a different world. It speaks metrics: time-series numbers with labels and attributes. If you want to alert when error rates spike or when order volumes drop below a threshold, you need those numbers as structured metrics in your telemetry pipeline. And Elasticsearch does not natively expose query results as OpenTelemetry metric signals.
The traditional workaround is a bespoke microservice per metric: write a tiny app that queries ES, formats the result, and exposes a /metrics endpoint. This works until you have fifty such metrics across ten teams. Now you have fifty microservices to maintain, deploy, and monitor — a second monitoring problem layered on top of the first.
What we needed was not another custom service. We needed a generic, configuration-driven bridge that could turn any Elasticsearch query result into an OpenTelemetry metric — with no code changes required to add a new metric.
Real-World Use Cases
Before diving into the architecture, here are the concrete scenarios that drove the design of the Elasticsearch Query Exporter — each one replacing custom-built services with a few lines of YAML configuration.
Operational Monitoring: Error Rate Alerting
An SRE team needs to count ERROR-level log entries per microservice every 30 seconds. The resulting es_error_count{service_name="checkout"} metric feeds into a Grafana dashboard and a PagerDuty alert rule. What previously required a custom Python script with a cron job is now a 15-line YAML block. When a new service launches, the team adds one more filter — no deployment needed.
Business Analytics: Real-Time Revenue Dashboard
A product team runs an hourly aggregation of order totals by region and payment method. The es_orders_total_value{region="eu-west"} metric drives a Grafana panel that leadership checks every morning. Because it is a standard OpenTelemetry metric, it composes naturally with infrastructure telemetry — you can correlate a revenue dip with a deployment event or a latency spike on the same timeline.
Security: Anomaly Detection Baseline
A security team tracks the cardinality of source IPs hitting the authentication endpoint every five minutes. A sudden spike in es_auth_unique_ips triggers an automated investigation workflow. The exporter's cardinality aggregation makes this a trivial configuration addition.
Capacity Planning: Index Growth Tracking
An infrastructure team uses aggregation queries to track document counts and storage sizes per index pattern. The resulting metrics feed into capacity planning models that predict when the Elasticsearch cluster will need additional nodes — turning reactive firefighting into proactive scaling.
By exposing metrics via an OTLP-compatible endpoint, the exporter integrates with any observability backend — Grafana, Datadog, New Relic, Dynatrace, or a self-hosted OpenTelemetry Collector pipeline. You are not locked into a single vendor or protocol.
Architecture of the Elasticsearch Query Exporter
The Elasticsearch Query Exporter is a standalone Spring Boot application. It reads a YAML configuration, creates Elasticsearch client connections per target, schedules queries at configured intervals, maps aggregation results to OpenTelemetry metrics via Micrometer, and exposes them on a standard /metrics HTTP endpoint that any OpenTelemetry-compatible collector or backend can scrape.
flowchart LR
subgraph config [Configuration]
YAMLConfig["YAML Config\nLoader"]
end
subgraph exporter [ES Query Exporter]
Scheduler["Scheduler\n(per-collector intervals)"]
QueryEngine["Query Engine\n(simplified + raw DSL)"]
ResultMapper["Result\nMapper"]
OTelRegistry["Micrometer\nOTel Registry"]
MetricsEndpoint["/metrics\nendpoint"]
end
subgraph targets [Elasticsearch Clusters]
ESClusterA["Cluster A"]
ESClusterB["Cluster B"]
end
subgraph consumers [Observability Backends]
OTelCollector["OTel Collector"]
Grafana["Grafana"]
Datadog["Datadog / New Relic\n/ Any OTLP Backend"]
end
YAMLConfig --> Scheduler
Scheduler --> QueryEngine
QueryEngine --> ESClusterA
QueryEngine --> ESClusterB
ESClusterA --> ResultMapper
ESClusterB --> ResultMapper
ResultMapper --> OTelRegistry
OTelRegistry --> MetricsEndpoint
MetricsEndpoint --> OTelCollector
MetricsEndpoint --> Grafana
MetricsEndpoint --> Datadog
Technology Stack
| Component | Choice | Rationale |
|---|---|---|
| Runtime | Java 17 + Spring Boot 3.x | Mature ecosystem, Actuator provides /metrics out of the box |
| Build | Maven | Widest CI/CD compatibility |
| Metrics | Micrometer + OpenTelemetry Registry | Spring Boot native, OTLP export + /metrics scrape endpoint |
| ES Client | co.elastic.clients:elasticsearch-java | Official typed client, ES 7.17+ and 8.x support |
| Config | SnakeYAML (bundled) | Familiar YAML syntax, env-var substitution |
| Deploy | Docker + Helm | Kubernetes-native, OpenTelemetry Collector sidecar or ServiceMonitor |
Configuration Design — Two Modes, One File
The exporter supports two query modes in the same YAML configuration file. This was a deliberate decision to serve two distinct audiences with one tool:
- Simplified mode for operations engineers and SREs who need standard monitoring metrics (error counts, latencies, throughput) without learning Elasticsearch Query DSL.
- Raw mode for data engineers and analysts who need full control over complex aggregation pipelines, nested bucket aggregations, and scripted fields.
Both modes share the same outer structure — targets, collectors, metrics, labels, intervals. The entire configuration lives in a single es-exporter.yml file (with optional external collector files via globs).
global:
scrape_timeout: 30s
min_interval: 60s # default cache TTL for all collectors
max_connections_per_target: 5
targets:
- name: production-es
endpoints:
- "https://es-node1:9200"
- "https://es-node2:9200"
auth:
type: basic # basic | api_key | certificate
username: "${ES_USERNAME}"
password: "${ES_PASSWORD}"
tls:
verify: true
ca_cert: /certs/ca.crt
collectors:
- error_rate_collector
- order_metrics_collector
collector_files:
- "collectors/*.collector.yml"Environment variable substitution with the ${VAR} syntax keeps secrets out of config files — critical for GitOps workflows where config is committed to a repository.
Simplified Mode: Monitoring Without JSON
The simplified mode is the exporter's most opinionated feature and the one that makes it accessible to teams who are not Elasticsearch experts. You describe what you want to measure, and the exporter constructs the query for you.
collectors:
- collector_name: error_rate_collector
min_interval: 30s
metrics:
- metric_name: es_error_count
type: gauge
help: "Error log entries in the last 5 minutes"
index: "app-logs-*"
query_mode: simplified
simplified:
time_field: "@timestamp"
time_range: "5m"
filters:
- field: level
value: ERROR
aggregation: count
key_labels:
- service_name
static_labels:
env: production

Under the hood, the QueryBuilder translates this into an Elasticsearch search request:
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "range": { "@timestamp": { "gte": "now-5m" } } },
{ "term": { "level": "ERROR" } }
]
}
},
"aggs": {
"by_service_name": {
"terms": { "field": "service_name.keyword" },
"aggs": {
"metric_value": { "value_count": { "field": "_id" } }
}
}
}
}

The supported simplified aggregations cover the most common monitoring patterns:
| Aggregation | ES Equivalent | Typical Use |
|---|---|---|
| count | value_count | Error counts, request volumes |
| sum | sum | Total revenue, total bytes transferred |
| avg | avg | Average response time, average order value |
| min / max | min / max | Peak values, slowest and fastest response times |
| cardinality | cardinality | Unique users, unique sessions |
When key_labels are specified, the QueryBuilder automatically wraps the aggregation inside a terms aggregation on each key label field, producing one metric data point per unique combination of label values.
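The wrapping is mechanical. A minimal sketch of how a QueryBuilder could assemble the request body shown above with Jackson, for the simplified count case (the class, method, and parameter names are illustrative, not the exporter's actual internals):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.List;

// Illustrative sketch: builds the search body shown above for a simplified "count" metric.
final class SimplifiedQuerySketch {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    static ObjectNode build(String timeField, String timeRange,
                            String filterField, String filterValue,
                            List<String> keyLabels) {
        ObjectNode body = MAPPER.createObjectNode();
        body.put("size", 0);

        // bool/filter: the time window plus one term clause per configured filter.
        ArrayNode filters = body.putObject("query").putObject("bool").putArray("filter");
        filters.addObject().putObject("range").putObject(timeField).put("gte", "now-" + timeRange);
        filters.addObject().putObject("term").put(filterField, filterValue);

        // Innermost aggregation: the metric itself (count maps to value_count).
        ObjectNode current = MAPPER.createObjectNode();
        current.putObject("metric_value").putObject("value_count").put("field", "_id");

        // Wrap the metric in one terms aggregation per key label, innermost label first.
        for (int i = keyLabels.size() - 1; i >= 0; i--) {
            ObjectNode wrapper = MAPPER.createObjectNode();
            ObjectNode byLabel = wrapper.putObject("by_" + keyLabels.get(i));
            byLabel.putObject("terms").put("field", keyLabels.get(i) + ".keyword");
            byLabel.set("aggs", current);
            current = wrapper;
        }
        body.set("aggs", current);
        return body;
    }
}
```

Called with ("@timestamp", "5m", "level", "ERROR", List.of("service_name")), this produces the same structure as the search request shown above.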
Raw Mode: Full Query DSL for Power Users
For complex analytical queries — multi-level nested aggregations, pipeline aggregations like derivatives and moving averages, or scripted fields — the raw mode accepts a full Elasticsearch query body as JSON.
collectors:
- collector_name: order_metrics_collector
min_interval: 120s
metrics:
- metric_name: es_orders_total_value
type: gauge
help: "Total order value from the last hour"
index: "orders-*"
query_mode: raw
raw_query: |
{
"size": 0,
"query": {
"range": {
"created_at": { "gte": "now-1h" }
}
},
"aggs": {
"total_value": {
"sum": { "field": "order_total" }
},
"by_region": {
"terms": { "field": "region.keyword" },
"aggs": {
"region_total": {
"sum": { "field": "order_total" }
}
}
}
}
}
value_mappings:
- agg_path: "total_value.value"
value_name: total
- agg_path: "by_region"
bucket_key_label: region
value_path: "region_total.value"
static_labels:
source: elasticsearch

The value_mappings array is the key innovation here. Each mapping tells the ResultMapper exactly how to navigate the aggregation response tree:
- Single-value aggregations use agg_path to point at a dot-delimited path in the response (e.g., total_value.value). This produces one metric data point.
- Bucket aggregations use agg_path to identify the bucket aggregation, bucket_key_label to map the bucket key to a metric attribute, and value_path to extract the numeric value within each bucket. This produces N metric data points, one per bucket.
Result Mapping — From Aggregation Trees to Flat Metrics
Elasticsearch aggregation responses are deeply nested JSON structures. OpenTelemetry metrics are flat: a name, a numeric value, and a set of key-value attributes. The ResultMapper component is responsible for flattening one into the other.
Consider the aggregation response for the order metrics query above:
{
"aggregations": {
"total_value": { "value": 284750.00 },
"by_region": {
"buckets": [
{ "key": "us-east", "region_total": { "value": 142000.00 } },
{ "key": "eu-west", "region_total": { "value": 98250.00 } },
{ "key": "ap-south", "region_total": { "value": 44500.00 } }
]
}
}
}

The ResultMapper produces the following OpenTelemetry metric data points from the two value_mappings:
Metric: es_orders_total_value (Gauge)
Description: Total order value from the last hour
DataPoint { attributes: {source="elasticsearch", value_name="total"}, value: 284750.0 }
DataPoint { attributes: {source="elasticsearch", region="us-east"}, value: 142000.0 }
DataPoint { attributes: {source="elasticsearch", region="eu-west"}, value: 98250.0 }
DataPoint { attributes: {source="elasticsearch", region="ap-south"}, value: 44500.0 }The mapper walks the JSON tree using the dot-delimited path, detects whether it lands on a buckets array (bucket aggregation) or a value field (single-value aggregation), and acts accordingly. This is implemented with Jackson's JsonNode tree traversal — no reflection, no code generation, just straightforward tree walking.
We chose dot-path navigation over JSONPath or JMESPath because it maps directly to how Elasticsearch names its aggregations. When you write "total_value": { "sum": ... } in your query, the response key is total_value, and you reference it as total_value.value. There is no mental translation required.
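The core walk is compact. A sketch of that traversal with Jackson, using an illustrative ValueMapping shape that mirrors the YAML keys above (static_labels are merged in separately, and error handling is omitted):

```java
import com.fasterxml.jackson.databind.JsonNode;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative dot-path mapper from an "aggregations" subtree to flat data points.
final class ResultMapperSketch {

    // Assumed config shape, mirroring agg_path / bucket_key_label / value_path / value_name.
    record ValueMapping(String aggPath, String bucketKeyLabel, String valuePath, String valueName) {}
    record DataPoint(Map<String, String> attributes, double value) {}

    static List<DataPoint> map(JsonNode aggregations, ValueMapping mapping) {
        List<DataPoint> points = new ArrayList<>();
        JsonNode node = walk(aggregations, mapping.aggPath());

        if (node.has("buckets")) {
            // Bucket aggregation: one data point per bucket, keyed by the bucket key label.
            for (JsonNode bucket : node.get("buckets")) {
                Map<String, String> attrs = new LinkedHashMap<>();
                attrs.put(mapping.bucketKeyLabel(), bucket.get("key").asText());
                double value = walk(bucket, mapping.valuePath()).asDouble();
                points.add(new DataPoint(attrs, value));
            }
        } else {
            // Single-value aggregation: the path already ends on the numeric value.
            Map<String, String> attrs = new LinkedHashMap<>();
            attrs.put("value_name", mapping.valueName());
            points.add(new DataPoint(attrs, node.asDouble()));
        }
        return points;
    }

    // Follow a dot-delimited path such as "total_value.value" through the JSON tree.
    private static JsonNode walk(JsonNode root, String dotPath) {
        JsonNode current = root;
        for (String segment : dotPath.split("\\.")) {
            current = current.path(segment);
        }
        return current;
    }
}
```

Applied to the response above, the first mapping yields the single 284750.0 data point and the second yields one data point per region bucket.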
Scheduling and Caching
A critical design principle is that query execution frequency and collection frequency must be decoupled. Your observability backend may scrape or poll every 15 seconds, but you do not want to hit Elasticsearch with an expensive aggregation every 15 seconds.
sequenceDiagram
participant Scheduler
participant Cache
participant ES as Elasticsearch
participant Endpoint as /metrics
participant OTel as OTel Collector
Note over Scheduler: min_interval timer fires
Scheduler->>ES: Execute configured query
ES-->>Scheduler: Aggregation response
Scheduler->>Cache: Store result + timestamp
Note over OTel: Scrape interval (e.g. 15s)
OTel->>Endpoint: GET /metrics
Endpoint->>Cache: Check cache freshness
Cache-->>Endpoint: Return cached result
Endpoint-->>OTel: Metric data points
Note over OTel: Next scrape (cache still fresh)
OTel->>Endpoint: GET /metrics
Endpoint->>Cache: Check cache freshness
Cache-->>Endpoint: Return same cached result
Endpoint-->>OTel: Metric data points
The CollectorScheduler implements a two-layer caching strategy:
- Scheduled execution: Each collector runs on its own timer (configured via min_interval). The scheduler uses a ScheduledExecutorService thread pool to fire queries independently.
- Result caching: After each query execution, the result (metric values + attributes) is cached with a timestamp. When the observability backend scrapes /metrics, the exporter returns the cached result if the cache is still within the min_interval window.
@Component
public class CollectorScheduler {

    // One shared pool (size illustrative) drives all collectors; each runs on its own fixed-rate timer.
    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(4);
    private final Map<String, CachedResult> cache = new ConcurrentHashMap<>();

    public void scheduleCollector(CollectorConfig config, MetricCollector collector) {
        Duration interval = config.getMinInterval();
        executor.scheduleAtFixedRate(() -> {
            // Run the configured ES query and cache the flattened results with a timestamp.
            List<MetricResult> results = collector.collect();
            cache.put(config.getName(), new CachedResult(results, Instant.now()));
        }, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    public List<MetricResult> getCachedResults(String collectorName) {
        // A cold cache (before the first scheduled run completes) yields an empty list
        // rather than a NullPointerException.
        CachedResult cached = cache.get(collectorName);
        return cached == null ? List.of() : cached.getResults();
    }
}

This pattern means a collector with min_interval: 120s will execute its query exactly once every two minutes, regardless of how often your backend scrapes the endpoint. The global.min_interval serves as the default, and individual collectors can override it — expensive analytical queries run less often, lightweight health checks run more often.
At startup the cache is cold, so every collector fires its first query immediately. If you have many expensive collectors, use the global.warmup_delay setting to stagger their initial execution and avoid thundering-herd pressure on Elasticsearch.
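On the registry side, one common Micrometer wiring for this pattern is a MultiGauge per configured metric whose rows are overwritten after each collector run, so a scrape simply reads the current values. The sketch below assumes that approach and an illustrative MetricResult shape; the exporter's actual classes may differ:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.MultiGauge;
import io.micrometer.core.instrument.Tags;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative bridge from cached collector results to Micrometer gauge rows.
final class MetricPublisherSketch {

    // Assumed shape of one cached data point: flattened attributes plus a numeric value.
    record MetricResult(Map<String, String> attributes, double value) {}

    private final MultiGauge gauge;

    MetricPublisherSketch(MeterRegistry registry, String metricName, String help) {
        this.gauge = MultiGauge.builder(metricName)
                .description(help)
                .register(registry);
    }

    // Called after each collector run; overwrite=true drops label sets that no longer appear.
    void publish(List<MetricResult> results) {
        List<MultiGauge.Row<?>> rows = new ArrayList<>();
        for (MetricResult result : results) {
            Tags tags = Tags.empty();
            for (Map.Entry<String, String> attribute : result.attributes().entrySet()) {
                tags = tags.and(attribute.getKey(), attribute.getValue());
            }
            rows.add(MultiGauge.Row.of(tags, result.value()));
        }
        gauge.register(rows, true);
    }
}
```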
Multi-Cluster Support
Enterprise environments rarely have a single Elasticsearch cluster. Production logs are in one cluster, business transaction data in another, and perhaps a third for security events. The exporter supports multiple targets in the same configuration, each with its own connection settings, authentication, and collector assignments.
targets:
- name: prod-logs
endpoints: ["https://logs-es:9200"]
auth:
type: api_key
api_key: "${LOGS_API_KEY}"
collectors: [error_rate_collector, latency_collector]
- name: prod-transactions
endpoints: ["https://txn-es:9200"]
auth:
type: certificate
ca_cert: /certs/ca.crt
client_cert: /certs/client.crt
client_key: /certs/client.key
collectors: [order_metrics_collector, revenue_collector]

The ElasticsearchClientFactory creates an isolated client per target, each with its own connection pool, authentication handler, and TLS configuration. Metrics from different targets are differentiated by a target attribute automatically added by the exporter, so they coexist cleanly in any backend.
Supported authentication methods:
| Method | Config Key | Use Case |
|---|---|---|
| Basic Auth | username / password | Development, simple deployments |
| API Key | api_key | Cloud-managed ES, fine-grained permissions |
| mTLS / Certificate | ca_cert, client_cert, client_key | Zero-trust environments, service mesh |
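For the basic-auth row of that table, per-target client construction follows the standard elasticsearch-java pattern: a low-level RestClient carrying the credentials, wrapped in a transport and a typed client. A sketch covering only that case (TLS settings and the API-key and certificate paths are omitted; the factory's real structure may differ):

```java
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import java.util.List;

// Illustrative per-target factory covering the basic-auth case only.
final class ElasticsearchClientFactorySketch {

    static ElasticsearchClient create(List<String> endpoints, String username, String password) {
        // One HttpHost per configured endpoint; the client load-balances across them.
        HttpHost[] hosts = endpoints.stream().map(HttpHost::create).toArray(HttpHost[]::new);

        BasicCredentialsProvider credentials = new BasicCredentialsProvider();
        credentials.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(username, password));

        RestClient restClient = RestClient.builder(hosts)
                .setHttpClientConfigCallback(http -> http.setDefaultCredentialsProvider(credentials))
                .build();

        // Each target gets its own transport and connection pool, keeping clusters isolated.
        ElasticsearchTransport transport = new RestClientTransport(restClient, new JacksonJsonpMapper());
        return new ElasticsearchClient(transport);
    }
}
```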
Docker, Helm, and Production Deployment
The exporter ships as a multi-stage Docker image: Maven builds the fat JAR, then a slim JRE 17 base runs it. The resulting image is under 200 MB.
FROM maven:3.9-eclipse-temurin-17 AS build
COPY . /app
WORKDIR /app
RUN mvn clean package -DskipTests
FROM eclipse-temurin:17-jre-alpine
COPY --from=build /app/target/*.jar /app/exporter.jar
COPY es-exporter.yml /etc/es-exporter/es-exporter.yml
EXPOSE 9399
ENTRYPOINT ["java", "-jar", "/app/exporter.jar"]For Kubernetes deployments, the Helm chart provides:
- ConfigMap mounting es-exporter.yml into the pod, editable via values.yaml
- Secret for ES credentials, referenced by env-var substitution in the config
- OpenTelemetry Collector sidecar config and optional ServiceMonitor CRD — deploy the chart and your OTel pipeline starts collecting automatically
- Resource limits, replica count, and liveness/readiness probes as values.yaml overrides
helm install es-exporter ./helm/elasticsearch-query-exporter \
--set elasticsearch.username=es_reader \
--set elasticsearch.password=s3cret \
--set config.global.min_interval=30s

Internal Exporter Metrics
The exporter instruments itself so you can monitor the monitor. These metrics are always exposed on /metrics alongside the user-configured metrics:
| Metric | Type | Description |
|---|---|---|
| es_exporter_up | Gauge | 1 if the target is reachable, 0 otherwise |
| es_exporter_query_duration_seconds | Gauge | Time taken for the last query execution |
| es_exporter_query_errors_total | Counter | Total query failures per collector |
| es_exporter_scrape_duration_seconds | Gauge | Total time to serve a /metrics request |
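Recording these is ordinary Micrometer usage wrapped around each query execution. A sketch of what that instrumentation could look like, with metric names taken from the table above and everything else illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative self-instrumentation around a single collector's query execution.
final class ExporterSelfMetricsSketch {

    private final AtomicLong lastQueryNanos = new AtomicLong();
    private final AtomicLong targetUp = new AtomicLong();
    private final Counter queryErrors;

    ExporterSelfMetricsSketch(MeterRegistry registry, String collectorName, String targetName) {
        Gauge.builder("es_exporter_query_duration_seconds", lastQueryNanos, v -> v.get() / 1e9)
                .tag("collector", collectorName)
                .register(registry);
        Gauge.builder("es_exporter_up", targetUp, AtomicLong::get)
                .tag("target", targetName)
                .register(registry);
        this.queryErrors = Counter.builder("es_exporter_query_errors_total")
                .tag("collector", collectorName)
                .register(registry);
    }

    // Wraps one query execution, recording duration, reachability, and failures.
    <T> T recordQuery(QueryCall<T> call) throws Exception {
        long start = System.nanoTime();
        try {
            T result = call.execute();
            targetUp.set(1);
            return result;
        } catch (Exception e) {
            targetUp.set(0);
            queryErrors.increment();
            throw e;
        } finally {
            lastQueryNanos.set(System.nanoTime() - start);
        }
    }

    interface QueryCall<T> {
        T execute() throws Exception;
    }
}
```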
Conclusion
The Elasticsearch Query Exporter demonstrates that a configuration-driven approach is the right abstraction for bridging data stores and observability pipelines. By separating the what (metric definitions in YAML) from the how (the exporter's Java runtime), organizations can add new business and operational metrics in minutes rather than days.
The dual-mode configuration — simplified for the common case, raw for the complex case — lowers the barrier to entry without sacrificing power. And because the output is standard OpenTelemetry metrics, the exporter fits seamlessly into any observability stack: Grafana, Datadog, New Relic, Dynatrace, or a self-hosted OpenTelemetry Collector pipeline routing to any OTLP-compatible backend.
The source code, Dockerfile, Helm chart, and example configurations are available in the project repository. Contributions — particularly new simplified aggregation types and additional authentication backends — are welcome.