Introduction: Why the ELK Stack Powers Modern Observability
The ELK Stack (Elasticsearch, Logstash, and Kibana, now part of the Elastic Stack alongside Beats) remains one of the most widely deployed open-source observability platforms, used by organisations from startups to Fortune 500 enterprises to process very large volumes of log data. For distributed systems built with Node.js microservices and Java backend services, centralised logging is not optional: it is the foundation of debugging, performance monitoring, and incident response.
In 2025, the Elastic Stack has evolved with Elastic Agent for unified data collection, cross-cluster search for multi-region deployments, and Elastic Security for SIEM integration. This guide covers Elasticsearch cluster architecture, Logstash pipeline design, Beats data shipping, Kibana dashboard creation, Node.js and Java integration patterns, cluster scaling strategies, alerting configuration, and security hardening for production ELK deployments.
Elasticsearch Cluster Architecture and Index Design
Design Elasticsearch clusters for reliability, performance, and cost efficiency:
- Node Roles: Configure dedicated node roles. master-eligible nodes (3 minimum for quorum, lightweight) manage cluster state, data nodes store and search indices (CPU/memory-intensive), ingest nodes run preprocessing pipelines, and coordinating nodes route requests and aggregate results. Separating roles prevents resource contention: a data node running heavy searches shouldn't destabilise cluster management.
- Index Lifecycle Management (ILM): Configure ILM policies for automatic index management: hot phase (active writes, fast SSD storage), warm phase (read-only, standard storage), cold phase (infrequent access, compressed), and delete phase (TTL-based removal). A typical log retention policy: 7 days hot, 30 days warm, 90 days cold, delete after 365 days.
- Shard Strategy: Each index splits into shards (default 1 primary + 1 replica). Target 20-50GB per shard for optimal performance: too many small shards waste heap and inflate cluster state (every shard carries a fixed overhead however little data it holds), while too few large shards create hot spots and slow recovery. For time-series data, use data streams with rollover on shard size or age to keep shard sizes consistent.
- Index Templates and Mappings: Define index templates with explicit field mappings and avoid relying on dynamic mapping in production (by default it maps every string as text with a keyword multi-field, so each value is indexed twice and storage grows). Use keyword for log levels, hostnames, and IDs; text only for fields requiring full-text search; and date for timestamps with the strict_date_optional_time format.
- Cross-Cluster Search: For multi-region deployments, use cross-cluster search (CCS) to query indices across clusters without data replication. Configure remote cluster connections in elasticsearch.yml; Kibana dashboards can then query all clusters for unified visibility.
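The retention schedule and shard-size target above can be sketched as a small Node.js helper that builds an ILM policy body. This is a minimal sketch, not a definitive policy: `buildLogIlmPolicy` and `estimatePrimaryShards` are hypothetical names, the choice of actions per phase is an assumption, and "7 days hot, 30 days warm" is interpreted here as warm starting at day 7 and cold at day 37, with deletion at day 365.

```javascript
// Hypothetical helper: builds the JSON body for an ILM policy.
// Note: ILM measures min_age from rollover (or index creation if never rolled over).
function buildLogIlmPolicy({ hotDays, warmDays, deleteAfterDays, maxShardSize = "50gb" }) {
  return {
    policy: {
      phases: {
        hot: {
          min_age: "0ms",
          actions: {
            // Roll over on shard size or age, whichever comes first
            rollover: { max_primary_shard_size: maxShardSize, max_age: "1d" },
          },
        },
        warm: {
          min_age: `${hotDays}d`,
          actions: { readonly: {}, forcemerge: { max_num_segments: 1 } },
        },
        cold: {
          min_age: `${hotDays + warmDays}d`,
          actions: { set_priority: { priority: 0 } },
        },
        delete: {
          min_age: `${deleteAfterDays}d`,
          actions: { delete: {} },
        },
      },
    },
  };
}

// Hypothetical helper: rough primary shard count for a given daily volume,
// aiming for the 20-50GB per shard range (40GB assumed as the midpoint target).
function estimatePrimaryShards(dailyVolumeGb, targetShardGb = 40) {
  return Math.max(1, Math.ceil(dailyVolumeGb / targetShardGb));
}

const policy = buildLogIlmPolicy({ hotDays: 7, warmDays: 30, deleteAfterDays: 365 });
console.log(JSON.stringify(policy, null, 2));
console.log(estimatePrimaryShards(300)); // 300GB/day at ~40GB per shard
```

The resulting object is the kind of body that would be sent to the ILM policy API; adjust the phase actions to match the hardware tiers actually available.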
Logstash Pipeline Design: Input, Filter, and Output
Build data processing pipelines that transform raw logs into structured data:
- Input Plugins: Logstash supports 50+ input plugins, including beats (receive from Filebeat/Metricbeat), kafka (consume from Kafka topics for high-throughput buffering), http (receive JSON payloads via HTTP), jdbc (poll databases for change data), and syslog (receive RFC3164 messages). For high-volume deployments, use Kafka as a buffer between Beats and Logstash to handle traffic spikes.
- Filter Plugins: Transform and enrich log data. grok parses unstructured text into structured fields using regex patterns (about 120 built-in patterns covering formats such as Apache, Syslog, and Java stack traces), mutate renames/removes/converts fields, date parses timestamps into @timestamp, geoip adds geolocation data from IP addresses, and dissect provides faster parsing for delimited log formats.
- Output Plugins: Route processed data with elasticsearch (primary output with index naming, pipeline routing, and bulk indexing), stdout (debugging), file (archive to disk), kafka (forward to downstream consumers), and s3 (long-term archive). Use conditional outputs to route different log types to different indices or destinations.
- Pipeline Configuration: Optimise Logstash throughput with pipeline.workers (match CPU cores), pipeline.batch.size (increase for higher throughput, default 125), and pipeline.batch.delay (latency vs throughput tradeoff). Monitor pipeline metrics with the /_node/stats/pipelines API to identify bottlenecks.
- Persistent Queues: Enable persistent queues (queue.type: persisted) for at-least-once delivery: if Logstash crashes, queued events survive the restart. Configure queue.max_bytes to prevent disk exhaustion. Persistent queues can replace external message brokers in moderate-volume deployments.
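To make the filter stage concrete, here is a JavaScript analogue of a grok, date, and mutate chain. It is not Logstash configuration: the log layout, the regex, and the `parseLogLine` name are assumptions chosen for illustration. The failure tags mirror the ones Logstash itself adds when parsing fails.

```javascript
// Assumed log layout: "<ISO timestamp> <LEVEL> [<service>] <message>"
const LINE_PATTERN =
  /^(?<timestamp>\S+) (?<level>[A-Z]+) \[(?<service>[^\]]+)\] (?<message>.*)$/;

// Hypothetical helper: turn one raw line into a structured event.
function parseLogLine(line) {
  const match = LINE_PATTERN.exec(line);
  if (!match) {
    // grok tags events it cannot parse instead of dropping them
    return { message: line, tags: ["_grokparsefailure"] };
  }
  const { timestamp, level, service, message } = match.groups;
  const parsed = new Date(timestamp);
  if (Number.isNaN(parsed.getTime())) {
    return { message: line, tags: ["_dateparsefailure"] };
  }
  return {
    "@timestamp": parsed.toISOString(), // role of the date filter
    "log.level": level.toLowerCase(),   // role of a mutate lowercase step
    "service.name": service,
    message,
  };
}

const event = parseLogLine(
  "2025-01-15T10:30:00Z ERROR [checkout] Payment gateway timeout"
);
console.log(event);
```

Keeping unparseable lines with a failure tag, rather than discarding them, makes it possible to route them to a separate index with a conditional output and fix the pattern later.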
Beats: Lightweight Data Shippers for Every Data Source
Deploy purpose-built data collectors across your infrastructure:
- Filebeat: The most common Beat. It tails log files, handles log rotation, maintains a registry of file positions for at-least-once delivery, and ships to Logstash or Elasticsearch. Configure multiline patterns for Java stack traces (multiline.pattern: '^[[:space:]]') and use modules for pre-built configurations (Nginx, Apache, MySQL, PostgreSQL, System logs).
- Metricbeat: Collects system and service metrics: CPU, memory, disk, network (system module), plus application-specific metrics for Kubernetes, Docker, MongoDB, Redis, PostgreSQL, and JVM/JMX. Ships metrics every 10 seconds (configurable) to Elasticsearch with pre-built Kibana dashboards for immediate visualisation.
- APM Agent: Elastic APM agents (Node.js, Java, Python, .NET, Go) instrument applications automatically, capturing HTTP transactions, database queries, external API calls, and custom spans. APM data correlates with logs via trace IDs for end-to-end distributed tracing. The Node.js agent requires a single require('elastic-apm-node').start() line, placed before any other module is loaded.
- Heartbeat: Monitors service availability with ICMP, TCP, and HTTP checks: verify endpoints are reachable, check TLS certificate expiry, validate HTTP response codes and body content. Configure uptime monitors for critical services and create Kibana uptime dashboards with SLA calculations.
- Elastic Agent: The unified data shipper replacing individual Beats — a single agent managed via Fleet (centralised configuration) that collects logs, metrics, APM data, and security events. Fleet policies push configuration changes to thousands of agents without manual updates. Use Elastic Agent for new deployments; existing Beats installations continue working.
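The effect of the Filebeat multiline pattern for Java stack traces can be illustrated in plain JavaScript. This is a sketch of the grouping behaviour only, assuming the pattern is combined with negate: false and match: after (settings the text does not state); `groupMultiline` is a hypothetical name, and the JavaScript `\s` class is a close but not identical stand-in for POSIX `[[:space:]]`.

```javascript
// Lines that start with whitespace are treated as continuations of the previous event.
const CONTINUATION = /^\s/;

// Hypothetical helper: merge continuation lines into the event that precedes them.
function groupMultiline(lines) {
  const events = [];
  for (const line of lines) {
    if (CONTINUATION.test(line) && events.length > 0) {
      events[events.length - 1] += "\n" + line;
    } else {
      events.push(line);
    }
  }
  return events;
}

const events = groupMultiline([
  "java.lang.IllegalStateException: order not found",
  "    at com.example.OrderService.load(OrderService.java:42)",
  "    at com.example.OrderController.get(OrderController.java:17)",
  "INFO request completed",
]);
console.log(events.length);
```

Without this grouping, each stack frame would arrive in Elasticsearch as a separate document, which makes a single exception hard to search for or count.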
Node.js Application Logging and APM Integration
Instrument Node.js microservices for comprehensive observability:
- Winston with ECS Format: Use the Winston logger with @elastic/ecs-winston-format for Elastic Common Schema (ECS) compliance: structured JSON logs with standardised field names (log.level, message, service.name, trace.id). ECS format ensures consistent field mapping across Node.js, Java, and Python services in the same Elasticsearch cluster.
- Pino for High Performance: For high-throughput services, use the Pino logger (roughly 5× faster than Winston in published benchmarks) with the pino-elasticsearch transport for direct Elasticsearch ingestion. Pino runs transports in a worker thread, which keeps log shipping off the main event loop. Use pino-pretty for development and ECS format for production.
- Correlation IDs: Propagate request correlation IDs (trace IDs) through Express/Fastify middleware: generate a UUID at the API gateway, pass it via the x-correlation-id header, and include it in every log line. This enables filtering all logs for a single request across multiple microservices in Kibana. Elastic APM auto-generates trace IDs when using the agent.
- Error Tracking: Log unhandled exceptions and promise rejections with full stack traces. Configure process.on('uncaughtException') and process.on('unhandledRejection') to capture and ship error details before process exit. Include request context (URL, user ID, request body with sensitive fields redacted) in error logs for faster debugging.
- Structured Log Context: Add request metadata to every log line using cls-hooked (continuation-local storage) or AsyncLocalStorage, so that HTTP method, URL, user ID, response time, and status code are attached automatically without passing logger instances through every function call.
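The correlation ID and structured context patterns above can be sketched with Node.js built-ins alone. This is a minimal illustration, not a replacement for Winston, Pino, or the APM agent: `withCorrelationId`, `formatLog`, and the `orders-api` service name are hypothetical, and mapping the correlation ID onto the ECS trace.id field is an assumption.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomUUID } = require("node:crypto");

const requestContext = new AsyncLocalStorage();

// Hypothetical helper: run a handler with a correlation ID in scope.
// Reuses the incoming x-correlation-id header or generates a new UUID.
function withCorrelationId(headers, handler) {
  const correlationId = headers["x-correlation-id"] || randomUUID();
  return requestContext.run({ correlationId }, handler);
}

// Hypothetical minimal logger that emits ECS-style field names as JSON.
function formatLog(level, message, serviceName = "orders-api") {
  const store = requestContext.getStore();
  return JSON.stringify({
    "@timestamp": new Date().toISOString(),
    "log.level": level,
    message,
    "service.name": serviceName,
    // Omitted from the JSON output when no request context is active
    "trace.id": store ? store.correlationId : undefined,
  });
}

function loadOrder() {
  // No logger or ID is passed in; the context follows the call chain.
  return formatLog("info", "order loaded");
}

const line = withCorrelationId({ "x-correlation-id": "abc-123" }, () => loadOrder());
console.log(line);
```

In an Express or Fastify application the same idea would live in middleware, with the call to continue the request chain made inside `requestContext.run(...)` so that every downstream log call sees the ID.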
Java Application Logging: Logback, SLF4J, and JVM Metrics
Configure Java services for ELK-compatible structured logging:
- Logback with ECS Encoder: Use co.elastic.logging:logback-ecs-encoder for Elastic Common Schema output: JSON-formatted logs with ECS field names, automatic MDC (Mapped Diagnostic Context) inclusion, and stack trace serialisation. Configure it in logback-spring.xml with EcsEncoder replacing PatternLayoutEncoder for production profiles.
- SLF4J MDC for Context: Use SLF4J's MDC (Mapped Diagnostic Context) to attach request-scoped metadata, for example MDC.put("traceId", traceId) in a servlet filter or Spring interceptor. All subsequent log lines in the request thread include the trace ID, user ID, and session ID without explicit passing. Clear the MDC in a finally block to prevent context leaking between requests on pooled threads.
- Spring Boot Actuator Metrics: Export JVM metrics (heap usage, GC pauses, thread counts, connection pool stats) via Micrometer and its Elastic registry, so metrics ship to Elasticsearch for Kibana dashboards alongside logs. Monitor GC pause times (G1GC target: <200ms), heap pressure (>80% triggers investigation), and thread pool saturation.
- Log4j2 Async Logging: For high-throughput Java services, use Log4j2 with the LMAX Disruptor for lock-free asynchronous logging, which can deliver an order-of-magnitude throughput improvement over synchronous logging under multi-threaded load. Configure AsyncLogger with RingBufferSize=262144 and WaitStrategy=busySpin for lowest latency at the cost of CPU usage.
- Exception Grouping: Configure the Elastic APM Java agent for automatic exception grouping, so similar exceptions cluster together rather than creating thousands of individual error entries. Group by exception class, message pattern, and stack trace fingerprint. Set capture_body: all for HTTP request body capture during error investigation, bearing in mind that captured bodies may contain sensitive data.
Kibana Dashboards: Visualisation, Alerting, and SIEM
Build operational dashboards that provide actionable insights:
- Dashboard Architecture: Create layered dashboards: Overview (system health, error rates, request volumes across all services), Service-Level (per-service latency, throughput, error breakdown), Infrastructure (CPU, memory, disk, network per node), and Investigation (log search, trace analysis, error deep-dive). Use dashboard drill-down links to navigate from overview to detail.
- Key Visualisations: Use TSVB (Time Series Visual Builder) for real-time metric trends, Lens for drag-and-drop chart creation, Vega for custom visualisations (heatmaps, Sankey diagrams), and Maps for geolocation data. Create saved searches with KQL (Kibana Query Language) for common investigation patterns.
- Alerting Rules: Configure Kibana alerting rules (a separate feature from the older Watcher, which remains available in Elasticsearch) for operational alerts: error rate exceeding a threshold (>1% 5xx responses), latency degradation (p99 > 2s for 5 minutes), disk space warnings (<20% free), and log volume anomalies (ML-based). Route alerts to Slack, PagerDuty, email, or webhook endpoints with severity-based escalation.
- Machine Learning Anomaly Detection: Elastic ML automatically detects anomalies in time-series data — unusual traffic patterns, error rate spikes, latency outliers, and log volume changes. Create ML jobs from Kibana with no data science expertise — the platform learns normal patterns and alerts on deviations.
- Elastic Security (SIEM): Use Elastic Security for security information and event management — correlate security events across network, endpoint, and application logs. Pre-built detection rules identify common threats (brute force attacks, data exfiltration, privilege escalation). The Timeline investigation tool enables security analysts to piece together attack narratives.
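The error-rate and latency thresholds listed under Alerting Rules reduce to simple arithmetic over a window of requests. The sketch below shows that logic only; it is not how Kibana rules are defined (those are configured in the UI or through its API). `evaluateWindow` and `percentile` are hypothetical helpers, the caller is assumed to supply the requests from one 5-minute window, and the percentile uses the nearest-rank method, which may differ from the approximation Elasticsearch applies.

```javascript
// Nearest-rank percentile over an array of numbers.
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical helper: check one window of requests against the thresholds
// from the text (more than 1% 5xx responses, p99 latency above 2 seconds).
function evaluateWindow(requests, { maxErrorRate = 0.01, maxP99Ms = 2000 } = {}) {
  const total = requests.length;
  const serverErrors = requests.filter((r) => r.status >= 500 && r.status <= 599).length;
  const errorRate = total === 0 ? 0 : serverErrors / total;
  const p99Ms = percentile(requests.map((r) => r.durationMs), 99);

  const breaches = [];
  if (errorRate > maxErrorRate) breaches.push("error_rate");
  if (p99Ms > maxP99Ms) breaches.push("p99_latency");
  return { errorRate, p99Ms, breaches };
}

// 200 requests in the window: 3 return 503, all complete in 120 ms
const sample = Array.from({ length: 200 }, (_, i) => ({
  status: i < 3 ? 503 : 200,
  durationMs: 120,
}));
const result = evaluateWindow(sample);
console.log(result);
```

Working through thresholds on sample data like this is a cheap way to check that a rule would have fired during a past incident before enabling notifications for it.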
Cluster Scaling, Security, and MDS ELK Services
Operate production ELK clusters with enterprise-grade reliability:
- Horizontal Scaling: Add data nodes to increase storage and query capacity — Elasticsearch automatically rebalances shards across new nodes. Use hot-warm-cold architecture with different hardware tiers: NVMe SSDs for hot nodes (recent data, fast queries), HDDs for warm/cold nodes (historical data, lower cost). Scale Logstash horizontally behind a load balancer for ingestion throughput.
- Security Configuration: Enable TLS encryption for all inter-node communication (xpack.security.transport.ssl) and HTTP API access (xpack.security.http.ssl). Configure RBAC (role-based access control): read-only roles for dashboards, write roles for ingestion, and admin roles for cluster management. Integrate with LDAP/Active Directory or SAML for enterprise SSO.
- Backup and Recovery: Configure snapshot repositories (S3, GCS, Azure Blob) for automated cluster backups. Snapshots are incremental at the segment level, so a daily (or more frequent) schedule copies only the data that changed since the previous snapshot; there is no separate full/incremental cycle to manage. Test restore procedures regularly. Use Searchable Snapshots to query archived data directly from object storage without restoring to the cluster.
- Cost Optimisation: Reduce storage costs with ILM-driven tiering, force merge (the _forcemerge API) on read-only indices to reduce segment count, the best_compression codec for cold indices (up to roughly 40% smaller, depending on the data), and field data type optimisation (keyword vs text, scaled_float vs double). Monitor index storage with _cat/indices?s=store.size:desc.
MetaDesign Solutions delivers ELK Stack implementation and managed observability services — from cluster architecture design and Logstash pipeline development through Kibana dashboard creation, Node.js/Java integration, alerting configuration, security hardening, and ongoing cluster management for organisations building comprehensive monitoring across distributed systems.



