
Setting Up a Complete ELK Stack to Monitor Distributed Systems with Node.js and Java

Sukriti Srivastava
Technical Content Lead
February 11, 2025
10 min read
Introduction: Why the ELK Stack Powers Modern Observability

The ELK Stack (Elasticsearch, Logstash, and Kibana, now part of the Elastic Stack together with Beats) remains one of the most widely deployed open-source observability platforms, used by organisations from startups to Fortune 500 enterprises. For distributed systems built with Node.js microservices and Java backend services, centralised logging is not optional: it is the foundation of debugging, performance monitoring, and incident response.

In 2025, the Elastic Stack has evolved with Elastic Agent for unified data collection, cross-cluster search for multi-region deployments, and Elastic Security for SIEM integration. This guide covers Elasticsearch cluster architecture, Logstash pipeline design, Beats data shipping, Kibana dashboard creation, Node.js and Java integration patterns, cluster scaling strategies, alerting configuration, and security hardening for production ELK deployments.

Elasticsearch Cluster Architecture and Index Design

Design Elasticsearch clusters for reliability, performance, and cost efficiency:

  • Node Roles: Configure dedicated node roles — master-eligible nodes (3 minimum for quorum, lightweight) manage cluster state, data nodes store and search indices (CPU/memory-intensive), ingest nodes run preprocessing pipelines, and coordinating nodes route requests and aggregate results. Separating roles prevents resource contention — a data node running heavy searches shouldn't destabilise cluster management.
  • Index Lifecycle Management (ILM): Configure ILM policies for automatic index management — hot phase (active writes, fast SSD storage), warm phase (read-only, standard storage), cold phase (infrequent access, compressed), and delete phase (TTL-based removal). A typical log retention policy: 7 days hot, 30 days warm, 90 days cold, delete after 365 days.
  • Shard Strategy: Each index splits into shards (default 1 primary + 1 replica). Target 20-50GB per shard for optimal performance: too many small shards waste resources (every shard adds heap and cluster-state overhead regardless of its size), while too few large shards create hot spots and slow down recovery. For time-series data, use data streams with ILM rollover on max_primary_shard_size, capped by a max_age such as one day, to keep shard sizes consistent.
  • Index Templates and Mappings: Define index templates with explicit field mappings — avoid dynamic mapping in production (it creates text + keyword multi-fields for every string, doubling storage). Use keyword for log levels, hostnames, and IDs; text only for fields requiring full-text search; and date for timestamps with strict_date_optional_time format.
  • Cross-Cluster Search: For multi-region deployments, use cross-cluster search (CCS) to query indices across clusters without data replication. Configure remote cluster connections in elasticsearch.yml or through the cluster settings API, then point Kibana data views at cluster-qualified patterns such as remote_cluster:logs-* so that dashboards query all clusters for unified visibility.
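The retention example in the ILM bullet can be expressed as a policy body for the PUT _ilm/policy/<name> API. The following is a minimal sketch rather than a recommended policy. It assumes the per-phase durations are cumulative (ILM measures each phase's min_age from rollover), that data stays in the cold phase until the delete threshold, and that the rollover limits (50gb, 1d) are acceptable defaults. The function name buildLogIlmPolicy is hypothetical.

```javascript
// Sketch: build the request body for PUT _ilm/policy/<name>.
// Assumption: phase durations are cumulative, so each phase's min_age is the
// sum of the durations before it, counted from rollover.
function buildLogIlmPolicy({ hotDays, warmDays, coldDays, deleteAfterDays }) {
  const warmMinAge = hotDays;
  const coldMinAge = hotDays + warmDays;
  if (deleteAfterDays < coldMinAge + coldDays) {
    throw new RangeError('deleteAfterDays is shorter than hot + warm + cold');
  }
  return {
    policy: {
      phases: {
        hot: {
          actions: {
            // Roll over on shard size or age, whichever is reached first.
            rollover: { max_primary_shard_size: '50gb', max_age: '1d' },
          },
        },
        warm: {
          min_age: `${warmMinAge}d`,
          actions: { readonly: {}, forcemerge: { max_num_segments: 1 } },
        },
        cold: {
          min_age: `${coldMinAge}d`,
          actions: { set_priority: { priority: 0 } },
        },
        delete: {
          min_age: `${deleteAfterDays}d`,
          actions: { delete: {} },
        },
      },
    },
  };
}

const ilmPolicy = buildLogIlmPolicy({ hotDays: 7, warmDays: 30, coldDays: 90, deleteAfterDays: 365 });
console.log(JSON.stringify(ilmPolicy, null, 2));
```

Generating the body from a function keeps the thresholds in one place, so the same numbers can be reviewed in code and applied to several clusters.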

Logstash Pipeline Design: Input, Filter, and Output

Build data processing pipelines that transform raw logs into structured data:

  • Input Plugins: Logstash supports 50+ input plugins, including beats (receive from Filebeat/Metricbeat), kafka (consume from Kafka topics for high-throughput buffering), http (receive JSON payloads via HTTP), jdbc (poll databases for change data), and syslog (receive RFC3164 messages; RFC5424 messages need additional parsing, for example with grok). For high-volume deployments, use Kafka as a buffer between Beats and Logstash to handle traffic spikes.
  • Filter Plugins: Transform and enrich log data. grok parses unstructured text into structured fields using regex patterns (about 120 built-in patterns covering Apache, Syslog, Java stack traces, and more), mutate renames, removes, and converts fields, date parses timestamps into @timestamp, geoip adds geolocation data from IP addresses, and dissect provides faster parsing for delimited log formats.
  • Output Plugins: Route processed data — elasticsearch (primary output with index naming, pipeline routing, and bulk indexing), stdout (debugging), file (archive to disk), kafka (forward to downstream consumers), and s3 (long-term archive). Use conditional outputs to route different log types to different indices or destinations.
  • Pipeline Configuration: Optimise Logstash throughput with pipeline.workers (match CPU cores), pipeline.batch.size (increase for higher throughput, default 125), and pipeline.batch.delay (latency vs throughput tradeoff). Monitor pipeline metrics with /_node/stats/pipelines API to identify bottlenecks.
  • Persistent Queues: Enable persistent queues (queue.type: persisted) for at-least-once delivery — if Logstash crashes, queued events survive restart. Configure queue.max_bytes to prevent disk exhaustion. Persistent queues replace the need for external message brokers in moderate-volume deployments.
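To make the filter stage concrete, the sketch below imitates in plain JavaScript what a grok + date + mutate chain does to one log line. It is an illustration of the transformation, not Logstash configuration syntax. The log line layout, the field choices, and the function name parseLogLine are assumptions made for the example.

```javascript
// Named capture groups play the role of grok's %{PATTERN:field} syntax.
const LOG_LINE = /^(?<timestamp>\S+)\s+(?<level>[A-Z]+)\s+\[(?<service>[^\]]+)\]\s+(?<message>.*)$/;

function parseLogLine(line) {
  const match = LOG_LINE.exec(line);
  if (!match) {
    // Logstash tags events it cannot parse instead of dropping them.
    return { message: line, tags: ['_grokparsefailure'] };
  }
  const { timestamp, level, service, message } = match.groups;
  const parsedDate = new Date(timestamp);
  if (Number.isNaN(parsedDate.getTime())) {
    return { message: line, tags: ['_dateparsefailure'] };
  }
  return {
    '@timestamp': parsedDate.toISOString(), // what the date filter produces
    'log.level': level.toLowerCase(),       // a mutate-style normalisation
    'service.name': service,
    message,
  };
}

const parsed = parseLogLine('2025-02-11T10:15:30.123Z ERROR [order-service] Payment gateway timeout');
console.log(parsed);
```

Keeping unparseable lines with a failure tag mirrors Logstash behaviour and makes it possible to find and fix pattern gaps from Kibana later.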

Beats: Lightweight Data Shippers for Every Data Source

Deploy purpose-built data collectors across your infrastructure:

  • Filebeat: The most common Beat. It tails log files, handles log rotation, keeps a registry of file positions so that delivery is at-least-once across restarts (duplicates are possible, so exactly-once is not guaranteed), and ships to Logstash or Elasticsearch. Configure multiline patterns for Java stack traces (multiline.pattern: '^[[:space:]]' with multiline.negate: false and multiline.match: after) and use modules for pre-built configurations (Nginx, Apache, MySQL, PostgreSQL, System logs).
  • Metricbeat: Collects system and service metrics — CPU, memory, disk, network (system module), plus application-specific metrics for Kubernetes, Docker, MongoDB, Redis, PostgreSQL, and JVM/JMX. Ships metrics every 10 seconds (configurable) to Elasticsearch with pre-built Kibana dashboards for immediate visualisation.
  • APM Agent: Although they are not Beats, Elastic APM agents (Node.js, Java, Python, .NET, Go) belong in the same collection layer. They instrument applications automatically, capturing HTTP transactions, database queries, external API calls, and custom spans. APM data correlates with logs via trace IDs for end-to-end distributed tracing. The Node.js agent requires a single require('elastic-apm-node').start() line, which must run before any other module is loaded.
  • Heartbeat: Monitors service availability with ICMP, TCP, and HTTP checks — verify endpoints are reachable, check TLS certificate expiry, validate HTTP response codes and body content. Configure uptime monitors for critical services and create Kibana uptime dashboards with SLA calculations.
  • Elastic Agent: The unified data shipper replacing individual Beats — a single agent managed via Fleet (centralised configuration) that collects logs, metrics, APM data, and security events. Fleet policies push configuration changes to thousands of agents without manual updates. Use Elastic Agent for new deployments; existing Beats installations continue working.
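The multiline setting mentioned for Filebeat is easier to reason about with a small model of what it does. The sketch below reimplements the grouping rule in JavaScript for illustration only: any line that starts with whitespace is appended to the previous event. The sample class names are hypothetical, and the function is not part of Filebeat.

```javascript
// Mimics multiline.pattern: '^[[:space:]]' with negate: false and match: after:
// a line that starts with whitespace continues the previous event.
function groupMultiline(lines) {
  const events = [];
  for (const line of lines) {
    const isContinuation = /^\s/.test(line);
    if (isContinuation && events.length > 0) {
      events[events.length - 1] += '\n' + line;
    } else {
      events.push(line);
    }
  }
  return events;
}

const rawLines = [
  'Exception in thread "main" java.lang.IllegalStateException: order not found',
  '    at com.example.OrderService.load(OrderService.java:42)',
  '    at com.example.OrderController.get(OrderController.java:17)',
  '2025-02-11 10:15:31 INFO Health check ok',
];
const events = groupMultiline(rawLines);
console.log(events.length); // 2
```

One limitation is visible from the rule itself: "Caused by:" lines do not start with whitespace, so nested exceptions need a pattern that also matches those lines.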

Node.js Application Logging and APM Integration

Instrument Node.js microservices for comprehensive observability:

  • Winston with ECS Format: Use Winston logger with @elastic/ecs-winston-format for Elastic Common Schema (ECS) compliance — structured JSON logs with standardised field names (log.level, message, service.name, trace.id). ECS format ensures consistent field mapping across Node.js, Java, and Python services in the same Elasticsearch cluster.
  • Pino for High Performance: For high-throughput services, use Pino (benchmarked by its maintainers at roughly 5× faster than alternatives such as Winston) with the pino-elasticsearch transport for direct Elasticsearch ingestion. Pino keeps per-call overhead low and runs transports in a worker thread, so shipping logs does not block the event loop. Use pino-pretty for development and ECS format for production.
  • Correlation IDs: Propagate request correlation IDs (trace IDs) through Express/Fastify middleware — generate a UUID at the API gateway, pass via x-correlation-id header, and include in every log line. This enables filtering all logs for a single request across multiple microservices in Kibana. Elastic APM auto-generates trace IDs when using the agent.
  • Error Tracking: Log unhandled exceptions and promise rejections with full stack traces — configure process.on('uncaughtException') and process.on('unhandledRejection') to capture and ship error details before process exit. Include request context (URL, user ID, request body) in error logs for faster debugging.
  • Structured Log Context: Add request metadata to every log line using cls-hooked (continuation-local storage) or AsyncLocalStorage — HTTP method, URL, user ID, response time, and status code are automatically attached without passing logger instances through every function call.
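The correlation ID and structured context bullets can be combined in a few lines using only Node.js built-ins. This is a minimal sketch under stated assumptions: the helper names withRequestContext and formatLog are hypothetical, the correlation ID is written to the ECS trace.id field, and a real service would pass the line to Winston or Pino rather than return a string.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const requestContext = new AsyncLocalStorage();

// Runs `handler` with a correlation ID bound to the async call chain.
// Reuses an incoming x-correlation-id header when one is present.
function withRequestContext(headers, handler) {
  const correlationId = headers['x-correlation-id'] || randomUUID();
  return requestContext.run({ correlationId }, handler);
}

// Builds one ECS-style JSON log line; the context is attached automatically.
function formatLog(level, message, fields = {}) {
  const store = requestContext.getStore() || {};
  return JSON.stringify({
    '@timestamp': new Date().toISOString(),
    'log.level': level,
    message,
    'trace.id': store.correlationId,
    ...fields,
  });
}

async function chargeCustomer() {
  // No logger or ID is passed in; the context follows the async chain.
  await new Promise((resolve) => setImmediate(resolve));
  return formatLog('info', 'charge completed', { 'service.name': 'billing' });
}

const logLinePromise = withRequestContext({ 'x-correlation-id': 'req-123' }, chargeCustomer);
logLinePromise.then((line) => console.log(line));
```

In an Express or Fastify service the same requestContext.run call would sit in a middleware or hook that wraps the rest of the request handling, so every log line written during that request carries the ID.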

Java Application Logging: Logback, SLF4J, and JVM Metrics

Configure Java services for ELK-compatible structured logging:

  • Logback with ECS Encoder: Use co.elastic.logging:logback-ecs-encoder for Elastic Common Schema output: JSON-formatted logs with ECS field names, automatic MDC (Mapped Diagnostic Context) inclusion, and stack trace serialisation. Configure it in logback-spring.xml, with EcsEncoder replacing PatternLayoutEncoder for production profiles.
  • SLF4J MDC for Context: Use SLF4J's MDC (Mapped Diagnostic Context) to attach request-scoped metadata — MDC.put("traceId", traceId) in a servlet filter or Spring interceptor. All subsequent log lines in the request thread include the trace ID, user ID, and session ID without explicit passing. Clear MDC in a finally block to prevent context leaking between requests.
  • Spring Boot Actuator Metrics: Export JVM metrics (heap usage, GC pauses, thread counts, connection pool stats) via Micrometer + Elastic registry — metrics ship to Elasticsearch for Kibana dashboards alongside logs. Monitor GC pause times (G1GC target: <200ms), heap pressure (>80% triggers investigation), and thread pool saturation.
  • Log4j2 Async Logging: For high-throughput Java services, use Log4j2 with the LMAX Disruptor for lock-free asynchronous logging, which can deliver a substantial throughput improvement over synchronous logging in multi-threaded workloads (measure it on your own service). Configure AsyncLogger with RingBufferSize=262144 and WaitStrategy=busySpin for lowest latency at the cost of CPU usage.
  • Exception Grouping: Configure Elastic APM Java agent for automatic exception grouping — similar exceptions cluster together rather than creating thousands of individual error entries. Group by exception class, message pattern, and stack trace fingerprint. Set capture_body: all for HTTP request body capture during error investigation.

Kibana Dashboards: Visualization, Alerting, and SIEM

Build operational dashboards that provide actionable insights:

  • Dashboard Architecture: Create layered dashboards — Overview (system health, error rates, request volumes across all services), Service-Level (per-service latency, throughput, error breakdown), Infrastructure (CPU, memory, disk, network per node), and Investigation (log search, trace analysis, error deep-dive). Use dashboard drill-down links to navigate from overview to detail.
  • Key Visualisations: Use TSVB (Time Series Visual Builder) for real-time metric trends, Lens for drag-and-drop chart creation, Vega for custom visualisations (heatmaps, Sankey diagrams), and Maps for geolocation data. Create saved searches with KQL (Kibana Query Language) for common investigation patterns.
  • Alerting Rules: Configure Kibana alerting rules (or Watcher, the older Elasticsearch-level alerting feature) for operational alerts: error rate exceeding a threshold (>1% 5xx responses), latency degradation (p99 > 2s for 5 minutes), disk space warnings (<20% free), and log volume anomalies (ML-based). Route alerts to Slack, PagerDuty, email, or webhook endpoints with severity-based escalation.
  • Machine Learning Anomaly Detection: Elastic ML automatically detects anomalies in time-series data — unusual traffic patterns, error rate spikes, latency outliers, and log volume changes. Create ML jobs from Kibana with no data science expertise — the platform learns normal patterns and alerts on deviations.
  • Elastic Security (SIEM): Use Elastic Security for security information and event management — correlate security events across network, endpoint, and application logs. Pre-built detection rules identify common threats (brute force attacks, data exfiltration, privilege escalation). The Timeline investigation tool enables security analysts to piece together attack narratives.
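The alert thresholds above reduce to simple arithmetic over a window of requests. The sketch below shows that logic in JavaScript so the conditions are unambiguous. It is not Kibana rule syntax, it evaluates a single window (the "for 5 minutes" condition would require the check to hold across consecutive windows), and the function names and sample data are made up for the example.

```javascript
// Nearest-rank percentile over a sorted copy of the samples.
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Evaluates one time window of request samples against the thresholds in the text.
function evaluateWindow(requests, { maxErrorRate = 0.01, maxP99Ms = 2000 } = {}) {
  const total = requests.length;
  const serverErrors = requests.filter((r) => r.status >= 500).length;
  const errorRate = total === 0 ? 0 : serverErrors / total;
  const p99Ms = percentile(requests.map((r) => r.durationMs), 99);
  const alerts = [];
  if (errorRate > maxErrorRate) alerts.push('error_rate');
  if (p99Ms > maxP99Ms) alerts.push('latency_p99');
  return { errorRate, p99Ms, alerts };
}

// 200 requests: the first 3 fail with 503 and the last 5 are slow.
const sample = Array.from({ length: 200 }, (_, i) => ({
  status: i < 3 ? 503 : 200,
  durationMs: i >= 195 ? 3500 : 120,
}));
const result = evaluateWindow(sample);
console.log(result.alerts); // [ 'error_rate', 'latency_p99' ]
```

In Kibana the same conditions would normally be expressed as threshold rules over aggregations, with the window length and check interval set on the rule.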

Cluster Scaling, Security, and MDS ELK Services

Operate production ELK clusters with enterprise-grade reliability:

  • Horizontal Scaling: Add data nodes to increase storage and query capacity — Elasticsearch automatically rebalances shards across new nodes. Use hot-warm-cold architecture with different hardware tiers: NVMe SSDs for hot nodes (recent data, fast queries), HDDs for warm/cold nodes (historical data, lower cost). Scale Logstash horizontally behind a load balancer for ingestion throughput.
  • Security Configuration: Enable TLS encryption for all inter-node communication (xpack.security.transport.ssl) and HTTP API access (xpack.security.http.ssl). Configure RBAC (role-based access control) — read-only roles for dashboards, write roles for ingestion, and admin roles for cluster management. Integrate with LDAP/Active Directory or SAML for enterprise SSO.
  • Backup and Recovery: Configure snapshot repositories (S3, GCS, Azure Blob) and schedule daily snapshots with snapshot lifecycle management (SLM). Elasticsearch snapshots are incremental at the segment level: each snapshot copies only data not already in the repository, yet every snapshot can be restored on its own, so there is no separate full/incremental schedule to manage. Test restore procedures regularly. Use Searchable Snapshots to query archived data directly from object storage without restoring to the cluster.
  • Cost Optimisation: Reduce storage costs with ILM-driven tiering, force merge (the _forcemerge API or the ILM forcemerge action) on read-only indices to reduce segment count, the best_compression codec for cold indices (savings vary by data, so measure on your own indices), and field data type optimisation (keyword vs text, scaled_float vs double). Monitor index storage with _cat/indices?s=store.size:desc.
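As an illustration of the RBAC split described in the security bullet, the sketch below builds request bodies for the PUT _security/role/<name> API. The role names, the logs-* index pattern, and the helper functions are hypothetical. Dashboard users also need Kibana feature privileges, which are left out here.

```javascript
// Sketch of request bodies for PUT _security/role/<role name>.
// Role names and the index pattern are illustrative only.
function buildLogRoles(indexPattern) {
  return {
    logs_dashboard_reader: {
      indices: [{ names: [indexPattern], privileges: ['read', 'view_index_metadata'] }],
    },
    logs_ingest_writer: {
      cluster: ['monitor'],
      indices: [{ names: [indexPattern], privileges: ['create_doc', 'auto_configure'] }],
    },
  };
}

// Guard for reviews or CI: a read-only role must not carry write-capable privileges.
const WRITE_PRIVILEGES = new Set(['all', 'write', 'index', 'create', 'create_doc', 'delete', 'manage']);

function isReadOnly(role) {
  return role.indices.every((entry) => entry.privileges.every((p) => !WRITE_PRIVILEGES.has(p)));
}

const roles = buildLogRoles('logs-*');
console.log(isReadOnly(roles.logs_dashboard_reader), isReadOnly(roles.logs_ingest_writer)); // true false
```

Giving the ingestion role create_doc rather than write means a leaked shipper credential can append documents but cannot update or delete existing ones.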

MetaDesign Solutions delivers ELK Stack implementation and managed observability services — from cluster architecture design and Logstash pipeline development through Kibana dashboard creation, Node.js/Java integration, alerting configuration, security hardening, and ongoing cluster management for organisations building comprehensive monitoring across distributed systems.

Frequently Asked Questions

Common questions about this topic, answered by our engineering team.

What is the ELK Stack and what are its components?

The ELK Stack (Elastic Stack) consists of Elasticsearch (distributed search and analytics engine for storing and querying data), Logstash (data processing pipeline for collecting, transforming, and shipping logs with 50+ input/output plugins), Kibana (visualisation platform for dashboards, alerting, and data exploration), and Beats (lightweight data shippers such as Filebeat for logs and Metricbeat for metrics), complemented by APM agents for tracing. Data flows from applications through Beats/Logstash to Elasticsearch, with Kibana providing the query and visualisation layer.

How do I integrate Node.js applications with the ELK Stack?

Use Winston or Pino logger with Elastic Common Schema (ECS) format for structured JSON logging. Deploy Filebeat to ship log files to Logstash/Elasticsearch. Install Elastic APM Node.js agent (single require line) for automatic transaction tracing, database query monitoring, and error tracking. Propagate correlation IDs via middleware for cross-service request tracing. Use AsyncLocalStorage for automatic log context enrichment.

How do I configure Java applications for ELK logging?

Configure Logback with ECS Encoder (co.elastic.logging:logback-ecs-encoder) for structured JSON output. Use SLF4J MDC for request-scoped context (trace IDs, user IDs). Ship the log files with Filebeat, either to Logstash or directly to Elasticsearch. Install Elastic APM Java agent for automatic Spring Boot instrumentation. Export JVM metrics via Micrometer + Elastic registry for heap, GC, and thread pool monitoring.

How do I scale an ELK cluster for production?

Separate node roles (master, data, ingest, coordinating) to prevent resource contention. Implement hot-warm-cold architecture with ILM policies for automatic tiering. Target 20-50GB per shard with ILM-managed rollover. Scale Logstash horizontally behind a load balancer. Use Kafka as an ingestion buffer for traffic spikes. Enable cross-cluster search for multi-region deployments rather than replicating data.

How do I secure an ELK deployment?

Enable TLS for inter-node transport and HTTP API communication. Configure RBAC with role-based access (read-only for dashboards, write for ingestion, admin for management). Integrate with LDAP/AD or SAML for SSO. Use API keys for programmatic access. Configure audit logging for compliance. Implement network-level security with VPC/firewall rules restricting access to cluster ports (9200/9300).
