The Problem: Data Rich, Metric Poor
Most organizations store an enormous amount of operational and business data in Elasticsearch. Application logs, transaction records, user events, order histories — it all flows into ES indices by the terabyte. Teams build Kibana dashboards to visualize it, and for ad-hoc exploration that works beautifully.
But there is a disconnect. The observability stack — whether built on OpenTelemetry, Grafana, Datadog, or any OTLP-compatible backend — lives in a different world. It speaks metrics: time-series numbers with labels and attributes. If you want to alert when error rates spike or when order volumes drop below a threshold, you need those numbers as structured metrics in your telemetry pipeline. And Elasticsearch does not natively expose query results as OpenTelemetry metric signals.
The traditional workaround is a bespoke microservice per metric: write a tiny app that queries ES, formats the result, and exposes a /metrics endpoint. This works until you have fifty such metrics across ten teams. Now you have fifty microservices to maintain, deploy, and monitor — a second monitoring problem layered on top of the first.
What we needed was not another custom service. We needed a generic, configuration-driven bridge that could turn any Elasticsearch query result into an OpenTelemetry metric — with no code changes required to add a new metric.
Real-World Use Cases
Before diving into the architecture, here are the concrete scenarios that drove the design of the Elasticsearch Query Exporter — each one replacing custom-built services with a few lines of YAML configuration.
Operational Monitoring: Error Rate Alerting
An SRE team needs to count ERROR-level log entries per microservice every 30 seconds. The resulting es_error_count{service_name="checkout"} metric feeds into a Grafana dashboard and a PagerDuty alert rule. What previously required a custom Python script with a cron job is now a 15-line YAML block. When a new service launches, the team adds one more filter — no deployment needed.
Business Analytics: Real-Time Revenue Dashboard
A product team runs an hourly aggregation of order totals by region and payment method. The es_orders_total_value{region="eu-west"} metric drives a Grafana panel that leadership checks every morning. Because it is a standard OpenTelemetry metric, it composes naturally with infrastructure telemetry — you can correlate a revenue dip with a deployment event or a latency spike on the same timeline.
Security: Anomaly Detection Baseline
A security team tracks the cardinality of source IPs hitting the authentication endpoint every five minutes. A sudden spike in es_auth_unique_ips triggers an automated investigation workflow. The exporter's cardinality aggregation makes this a trivial configuration addition.
Capacity Planning: Index Growth Tracking
An infrastructure team uses aggregation queries to track document counts and storage sizes per index pattern. The resulting metrics feed into capacity planning models that predict when the Elasticsearch cluster will need additional nodes — turning reactive firefighting into proactive scaling.
By exposing metrics via an OTLP-compatible endpoint, the exporter integrates with any observability backend — Grafana, Datadog, New Relic, Dynatrace, or a self-hosted OpenTelemetry Collector pipeline. You are not locked into a single vendor or protocol.
Architecture of the Elasticsearch Query Exporter
The Elasticsearch Query Exporter is a standalone Spring Boot application. It reads a YAML configuration, creates Elasticsearch client connections per target, schedules queries at configured intervals, maps aggregation results to OpenTelemetry metrics via Micrometer, and exposes them on a standard /metrics HTTP endpoint that any OpenTelemetry-compatible collector or backend can scrape.
flowchart LR
subgraph config [Configuration]
YAMLConfig["YAML Config\nLoader"]
end
subgraph exporter [ES Query Exporter]
Scheduler["Scheduler\n(per-collector intervals)"]
QueryEngine["Query Engine\n(simplified + raw DSL)"]
ResultMapper["Result\nMapper"]
OTelRegistry["Micrometer\nOTel Registry"]
MetricsEndpoint["/metrics\nendpoint"]
end
subgraph targets [Elasticsearch Clusters]
ESClusterA["Cluster A"]
ESClusterB["Cluster B"]
end
subgraph consumers [Observability Backends]
OTelCollector["OTel Collector"]
Grafana["Grafana"]
Datadog["Datadog / New Relic\n/ Any OTLP Backend"]
end
YAMLConfig --> Scheduler
Scheduler --> QueryEngine
QueryEngine --> ESClusterA
QueryEngine --> ESClusterB
ESClusterA --> ResultMapper
ESClusterB --> ResultMapper
ResultMapper --> OTelRegistry
OTelRegistry --> MetricsEndpoint
MetricsEndpoint --> OTelCollector
MetricsEndpoint --> Grafana
MetricsEndpoint --> Datadog
Technology Stack
| Component | Choice | Rationale |
|---|---|---|
| Runtime | Java 17 + Spring Boot 3.x | Mature ecosystem, Actuator provides /metrics out of the box |
| Build | Maven | Widest CI/CD compatibility |
| Metrics | Micrometer + OpenTelemetry Registry | Spring Boot native, OTLP export + /metrics scrape endpoint |
| ES Client | co.elastic.clients:elasticsearch-java | Official typed client, ES 7.17+ and 8.x support |
| Config | SnakeYAML (bundled) | Familiar YAML syntax, env-var substitution |
| Deploy | Docker + Helm | Kubernetes-native, OpenTelemetry Collector sidecar or ServiceMonitor |
Configuration Design — Two Modes, One File
The exporter supports two query modes in the same YAML configuration file. This was a deliberate decision to serve two distinct audiences with one tool:
- Simplified mode for operations engineers and SREs who need standard monitoring metrics (error counts, latencies, throughput) without learning Elasticsearch Query DSL.
- Raw mode for data engineers and analysts who need full control over complex aggregation pipelines, nested bucket aggregations, and scripted fields.
Both modes share the same outer structure — targets, collectors, metrics, labels, intervals. The entire configuration lives in a single es-exporter.yml file (with optional external collector files via globs).
global:
scrape_timeout: 30s
min_interval: 60s # default cache TTL for all collectors
max_connections_per_target: 5
targets:
- name: production-es
endpoints:
- "https://es-node1:9200"
- "https://es-node2:9200"
auth:
type: basic # basic | api_key | certificate
username: "${ES_USERNAME}"
password: "${ES_PASSWORD}"
tls:
verify: true
ca_cert: /certs/ca.crt
collectors:
- error_rate_collector
- order_metrics_collector
collector_files:
- "collectors/*.collector.yml"Environment variable substitution with the ${VAR} syntax keeps secrets out of config files — critical for GitOps workflows where config is committed to a repository.
Simplified Mode: Monitoring Without JSON
The simplified mode is the exporter's most opinionated feature and the one that makes it accessible to teams who are not Elasticsearch experts. You describe what you want to measure, and the exporter constructs the query for you.
collectors:
- collector_name: error_rate_collector
min_interval: 30s
metrics:
- metric_name: es_error_count
type: gauge
help: "Error log entries in the last 5 minutes"
index: "app-logs-*"
query_mode: simplified
simplified:
time_field: "@timestamp"
time_range: "5m"
filters:
- field: level
value: ERROR
aggregation: count
key_labels:
- service_name
static_labels:
env: production

Under the hood, the QueryBuilder translates this into an Elasticsearch search request:
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "range": { "@timestamp": { "gte": "now-5m" } } },
{ "term": { "level": "ERROR" } }
]
}
},
"aggs": {
"by_service_name": {
"terms": { "field": "service_name.keyword" },
"aggs": {
"metric_value": { "value_count": { "field": "_id" } }
}
}
}
}

The supported simplified aggregations cover the most common monitoring patterns:
| Aggregation | ES Equivalent | Typical Use |
|---|---|---|
| count | value_count | Error counts, request volumes |
| sum | sum | Total revenue, total bytes transferred |
| avg | avg | Average response time, average order value |
| min / max | min / max | Peak values, slowest and fastest response times |
| cardinality | cardinality | Unique users, unique sessions |
When key_labels are specified, the QueryBuilder automatically wraps the aggregation inside a terms aggregation on each key label field, producing one metric data point per unique combination of label values.
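The wrapping is mechanical. A minimal sketch of how a QueryBuilder could assemble the request body shown above with Jackson, for the simplified count case (the class, method, and parameter names are illustrative, not the exporter's actual internals):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.List;

// Illustrative sketch: builds the search body shown above for a simplified "count" metric.
final class SimplifiedQuerySketch {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    static ObjectNode build(String timeField, String timeRange,
                            String filterField, String filterValue,
                            List<String> keyLabels) {
        ObjectNode body = MAPPER.createObjectNode();
        body.put("size", 0);

        // bool/filter: the time window plus one term clause per configured filter.
        ArrayNode filters = body.putObject("query").putObject("bool").putArray("filter");
        filters.addObject().putObject("range").putObject(timeField).put("gte", "now-" + timeRange);
        filters.addObject().putObject("term").put(filterField, filterValue);

        // Innermost aggregation: the metric itself (count maps to value_count).
        ObjectNode current = MAPPER.createObjectNode();
        current.putObject("metric_value").putObject("value_count").put("field", "_id");

        // Wrap the metric in one terms aggregation per key label, innermost label first.
        for (int i = keyLabels.size() - 1; i >= 0; i--) {
            ObjectNode wrapper = MAPPER.createObjectNode();
            ObjectNode byLabel = wrapper.putObject("by_" + keyLabels.get(i));
            byLabel.putObject("terms").put("field", keyLabels.get(i) + ".keyword");
            byLabel.set("aggs", current);
            current = wrapper;
        }
        body.set("aggs", current);
        return body;
    }
}
```

Called with ("@timestamp", "5m", "level", "ERROR", List.of("service_name")), this produces the same structure as the search request shown above.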
Raw Mode: Full Query DSL for Power Users
For complex analytical queries — multi-level nested aggregations, pipeline aggregations like derivatives and moving averages, or scripted fields — the raw mode accepts a full Elasticsearch query body as JSON.
collectors:
- collector_name: order_metrics_collector
min_interval: 120s
metrics:
- metric_name: es_orders_total_value
type: gauge
help: "Total order value from the last hour"
index: "orders-*"
query_mode: raw
raw_query: |
{
"size": 0,
"query": {
"range": {
"created_at": { "gte": "now-1h" }
}
},
"aggs": {
"total_value": {
"sum": { "field": "order_total" }
},
"by_region": {
"terms": { "field": "region.keyword" },
"aggs": {
"region_total": {
"sum": { "field": "order_total" }
}
}
}
}
}
value_mappings:
- agg_path: "total_value.value"
value_name: total
- agg_path: "by_region"
bucket_key_label: region
value_path: "region_total.value"
static_labels:
source: elasticsearch

The value_mappings array is the key innovation here. Each mapping tells the ResultMapper exactly how to navigate the aggregation response tree:
- Single-value aggregations use agg_path to point at a dot-delimited path in the response (e.g., total_value.value). This produces one metric data point.
- Bucket aggregations use agg_path to identify the bucket aggregation, bucket_key_label to map the bucket key to a metric attribute, and value_path to extract the numeric value within each bucket. This produces N metric data points, one per bucket.
Result Mapping — From Aggregation Trees to Flat Metrics
Elasticsearch aggregation responses are deeply nested JSON structures. OpenTelemetry metrics are flat: a name, a numeric value, and a set of key-value attributes. The ResultMapper component is responsible for flattening one into the other.
Consider the aggregation response for the order metrics query above:
{
"aggregations": {
"total_value": { "value": 284750.00 },
"by_region": {
"buckets": [
{ "key": "us-east", "region_total": { "value": 142000.00 } },
{ "key": "eu-west", "region_total": { "value": 98250.00 } },
{ "key": "ap-south", "region_total": { "value": 44500.00 } }
]
}
}
}

The ResultMapper produces the following OpenTelemetry metric data points from the two value_mappings:
Metric: es_orders_total_value (Gauge)
Description: Total order value from the last hour
DataPoint { attributes: {source="elasticsearch", value_name="total"}, value: 284750.0 }
DataPoint { attributes: {source="elasticsearch", region="us-east"}, value: 142000.0 }
DataPoint { attributes: {source="elasticsearch", region="eu-west"}, value: 98250.0 }
DataPoint { attributes: {source="elasticsearch", region="ap-south"}, value: 44500.0 }The mapper walks the JSON tree using the dot-delimited path, detects whether it lands on a buckets array (bucket aggregation) or a value field (single-value aggregation), and acts accordingly. This is implemented with Jackson's JsonNode tree traversal — no reflection, no code generation, just straightforward tree walking.
We chose dot-path navigation over JSONPath or JMESPath because it maps directly to how Elasticsearch names its aggregations. When you write "total_value": { "sum": ... } in your query, the response key is total_value, and you reference it as total_value.value. There is no mental translation required.
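The core walk is compact. A sketch of that traversal with Jackson, using an illustrative ValueMapping shape that mirrors the YAML keys above (static_labels are merged in separately, and error handling is omitted):

```java
import com.fasterxml.jackson.databind.JsonNode;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative dot-path mapper from an "aggregations" subtree to flat data points.
final class ResultMapperSketch {

    // Assumed config shape, mirroring agg_path / bucket_key_label / value_path / value_name.
    record ValueMapping(String aggPath, String bucketKeyLabel, String valuePath, String valueName) {}
    record DataPoint(Map<String, String> attributes, double value) {}

    static List<DataPoint> map(JsonNode aggregations, ValueMapping mapping) {
        List<DataPoint> points = new ArrayList<>();
        JsonNode node = walk(aggregations, mapping.aggPath());

        if (node.has("buckets")) {
            // Bucket aggregation: one data point per bucket, keyed by the bucket key label.
            for (JsonNode bucket : node.get("buckets")) {
                Map<String, String> attrs = new LinkedHashMap<>();
                attrs.put(mapping.bucketKeyLabel(), bucket.get("key").asText());
                double value = walk(bucket, mapping.valuePath()).asDouble();
                points.add(new DataPoint(attrs, value));
            }
        } else {
            // Single-value aggregation: the path already ends on the numeric value.
            Map<String, String> attrs = new LinkedHashMap<>();
            attrs.put("value_name", mapping.valueName());
            points.add(new DataPoint(attrs, node.asDouble()));
        }
        return points;
    }

    // Follow a dot-delimited path such as "total_value.value" through the JSON tree.
    private static JsonNode walk(JsonNode root, String dotPath) {
        JsonNode current = root;
        for (String segment : dotPath.split("\\.")) {
            current = current.path(segment);
        }
        return current;
    }
}
```

Applied to the response above, the first mapping yields the single 284750.0 data point and the second yields one data point per region bucket.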
Scheduling and Caching
A critical design principle is that query execution frequency and collection frequency must be decoupled. Your observability backend may scrape or poll every 15 seconds, but you do not want to hit Elasticsearch with an expensive aggregation every 15 seconds.
sequenceDiagram
participant Scheduler
participant Cache
participant ES as Elasticsearch
participant Endpoint as /metrics
participant OTel as OTel Collector
Note over Scheduler: min_interval timer fires
Scheduler->>ES: Execute configured query
ES-->>Scheduler: Aggregation response
Scheduler->>Cache: Store result + timestamp
Note over OTel: Scrape interval (e.g. 15s)
OTel->>Endpoint: GET /metrics
Endpoint->>Cache: Check cache freshness
Cache-->>Endpoint: Return cached result
Endpoint-->>OTel: Metric data points
Note over OTel: Next scrape (cache still fresh)
OTel->>Endpoint: GET /metrics
Endpoint->>Cache: Check cache freshness
Cache-->>Endpoint: Return same cached result
Endpoint-->>OTel: Metric data points
The CollectorScheduler implements a two-layer caching strategy:
- Scheduled execution: Each collector runs on its own timer (configured via min_interval). The scheduler uses a ScheduledExecutorService thread pool to fire queries independently.
- Result caching: After each query execution, the result (metric values + attributes) is cached with a timestamp. When the observability backend scrapes /metrics, the exporter returns the cached result if the cache is still within the min_interval window.
@Component
public class CollectorScheduler {

    // One shared pool (size illustrative) drives all collectors; each runs on its own fixed-rate timer.
    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(4);
    private final Map<String, CachedResult> cache = new ConcurrentHashMap<>();

    public void scheduleCollector(CollectorConfig config, MetricCollector collector) {
        Duration interval = config.getMinInterval();
        executor.scheduleAtFixedRate(() -> {
            // Run the configured ES query and cache the flattened results with a timestamp.
            List<MetricResult> results = collector.collect();
            cache.put(config.getName(), new CachedResult(results, Instant.now()));
        }, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    public List<MetricResult> getCachedResults(String collectorName) {
        // A cold cache (before the first scheduled run completes) yields an empty list
        // rather than a NullPointerException.
        CachedResult cached = cache.get(collectorName);
        return cached == null ? List.of() : cached.getResults();
    }
}

This pattern means a collector with min_interval: 120s will execute its query exactly once every two minutes, regardless of how often your backend scrapes the endpoint. The global.min_interval serves as the default, and individual collectors can override it — expensive analytical queries run less often, lightweight health checks run more often.
At startup the cache is cold, so every collector fires its first query immediately. If you have many expensive collectors, use the global.warmup_delay setting to stagger their initial execution and avoid thundering-herd pressure on Elasticsearch.
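On the registry side, one common Micrometer wiring for this pattern is a MultiGauge per configured metric whose rows are overwritten after each collector run, so a scrape simply reads the current values. The sketch below assumes that approach and an illustrative MetricResult shape; the exporter's actual classes may differ:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.MultiGauge;
import io.micrometer.core.instrument.Tags;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative bridge from cached collector results to Micrometer gauge rows.
final class MetricPublisherSketch {

    // Assumed shape of one cached data point: flattened attributes plus a numeric value.
    record MetricResult(Map<String, String> attributes, double value) {}

    private final MultiGauge gauge;

    MetricPublisherSketch(MeterRegistry registry, String metricName, String help) {
        this.gauge = MultiGauge.builder(metricName)
                .description(help)
                .register(registry);
    }

    // Called after each collector run; overwrite=true drops label sets that no longer appear.
    void publish(List<MetricResult> results) {
        List<MultiGauge.Row<?>> rows = new ArrayList<>();
        for (MetricResult result : results) {
            Tags tags = Tags.empty();
            for (Map.Entry<String, String> attribute : result.attributes().entrySet()) {
                tags = tags.and(attribute.getKey(), attribute.getValue());
            }
            rows.add(MultiGauge.Row.of(tags, result.value()));
        }
        gauge.register(rows, true);
    }
}
```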
Multi-Cluster Support
Enterprise environments rarely have a single Elasticsearch cluster. Production logs are in one cluster, business transaction data in another, and perhaps a third for security events. The exporter supports multiple targets in the same configuration, each with its own connection settings, authentication, and collector assignments.
targets:
- name: prod-logs
endpoints: ["https://logs-es:9200"]
auth:
type: api_key
api_key: "${LOGS_API_KEY}"
collectors: [error_rate_collector, latency_collector]
- name: prod-transactions
endpoints: ["https://txn-es:9200"]
auth:
type: certificate
ca_cert: /certs/ca.crt
client_cert: /certs/client.crt
client_key: /certs/client.key
collectors: [order_metrics_collector, revenue_collector]

The ElasticsearchClientFactory creates an isolated client per target, each with its own connection pool, authentication handler, and TLS configuration. Metrics from different targets are differentiated by a target attribute automatically added by the exporter, so they coexist cleanly in any backend.
Supported authentication methods:
| Method | Config Key | Use Case |
|---|---|---|
| Basic Auth | username / password | Development, simple deployments |
| API Key | api_key | Cloud-managed ES, fine-grained permissions |
| mTLS / Certificate | ca_cert, client_cert, client_key | Zero-trust environments, service mesh |
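For the basic-auth row of that table, per-target client construction follows the standard elasticsearch-java pattern: a low-level RestClient carrying the credentials, wrapped in a transport and a typed client. A sketch covering only that case (TLS settings and the API-key and certificate paths are omitted; the factory's real structure may differ):

```java
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import java.util.List;

// Illustrative per-target factory covering the basic-auth case only.
final class ElasticsearchClientFactorySketch {

    static ElasticsearchClient create(List<String> endpoints, String username, String password) {
        // One HttpHost per configured endpoint; the client load-balances across them.
        HttpHost[] hosts = endpoints.stream().map(HttpHost::create).toArray(HttpHost[]::new);

        BasicCredentialsProvider credentials = new BasicCredentialsProvider();
        credentials.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(username, password));

        RestClient restClient = RestClient.builder(hosts)
                .setHttpClientConfigCallback(http -> http.setDefaultCredentialsProvider(credentials))
                .build();

        // Each target gets its own transport and connection pool, keeping clusters isolated.
        ElasticsearchTransport transport = new RestClientTransport(restClient, new JacksonJsonpMapper());
        return new ElasticsearchClient(transport);
    }
}
```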
Docker, Helm, and Production Deployment
The exporter ships as a multi-stage Docker image: Maven builds the fat JAR, then a slim JRE 17 base runs it. The resulting image is under 200 MB.
FROM maven:3.9-eclipse-temurin-17 AS build
COPY . /app
WORKDIR /app
RUN mvn clean package -DskipTests
FROM eclipse-temurin:17-jre-alpine
COPY --from=build /app/target/*.jar /app/exporter.jar
COPY es-exporter.yml /etc/es-exporter/es-exporter.yml
EXPOSE 9399
ENTRYPOINT ["java", "-jar", "/app/exporter.jar"]For Kubernetes deployments, the Helm chart provides:
- ConfigMap mounting es-exporter.yml into the pod, editable via values.yaml
- Secret for ES credentials, referenced by env-var substitution in the config
- OpenTelemetry Collector sidecar config and optional ServiceMonitor CRD — deploy the chart and your OTel pipeline starts collecting automatically
- Resource limits, replica count, and liveness/readiness probes as values.yaml overrides
helm install es-exporter ./helm/elasticsearch-query-exporter \
--set elasticsearch.username=es_reader \
--set elasticsearch.password=s3cret \
--set config.global.min_interval=30s

Internal Exporter Metrics
The exporter instruments itself so you can monitor the monitor. These metrics are always exposed on /metrics alongside the user-configured metrics:
| Metric | Type | Description |
|---|---|---|
| es_exporter_up | Gauge | 1 if the target is reachable, 0 otherwise |
| es_exporter_query_duration_seconds | Gauge | Time taken for the last query execution |
| es_exporter_query_errors_total | Counter | Total query failures per collector |
| es_exporter_scrape_duration_seconds | Gauge | Total time to serve a /metrics request |
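Recording these is ordinary Micrometer usage wrapped around each query execution. A sketch of what that instrumentation could look like, with metric names taken from the table above and everything else illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative self-instrumentation around a single collector's query execution.
final class ExporterSelfMetricsSketch {

    private final AtomicLong lastQueryNanos = new AtomicLong();
    private final AtomicLong targetUp = new AtomicLong();
    private final Counter queryErrors;

    ExporterSelfMetricsSketch(MeterRegistry registry, String collectorName, String targetName) {
        Gauge.builder("es_exporter_query_duration_seconds", lastQueryNanos, v -> v.get() / 1e9)
                .tag("collector", collectorName)
                .register(registry);
        Gauge.builder("es_exporter_up", targetUp, AtomicLong::get)
                .tag("target", targetName)
                .register(registry);
        this.queryErrors = Counter.builder("es_exporter_query_errors_total")
                .tag("collector", collectorName)
                .register(registry);
    }

    // Wraps one query execution, recording duration, reachability, and failures.
    <T> T recordQuery(QueryCall<T> call) throws Exception {
        long start = System.nanoTime();
        try {
            T result = call.execute();
            targetUp.set(1);
            return result;
        } catch (Exception e) {
            targetUp.set(0);
            queryErrors.increment();
            throw e;
        } finally {
            lastQueryNanos.set(System.nanoTime() - start);
        }
    }

    interface QueryCall<T> {
        T execute() throws Exception;
    }
}
```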
Conclusion
The Elasticsearch Query Exporter demonstrates that a configuration-driven approach is the right abstraction for bridging data stores and observability pipelines. By separating the what (metric definitions in YAML) from the how (the exporter's Java runtime), organizations can add new business and operational metrics in minutes rather than days.
The dual-mode configuration — simplified for the common case, raw for the complex case — lowers the barrier to entry without sacrificing power. And because the output is standard OpenTelemetry metrics, the exporter fits seamlessly into any observability stack: Grafana, Datadog, New Relic, Dynatrace, or a self-hosted OpenTelemetry Collector pipeline routing to any OTLP-compatible backend.
The source code, Dockerfile, Helm chart, and example configurations are available in the project repository. Contributions — particularly new simplified aggregation types and additional authentication backends — are welcome.