To improve observability, reliability, and operational efficiency, we
implemented a centralized logging solution using Amazon OpenSearch
Service and Fluent Bit across both staging and production environments.
The solution aggregates logs from all ECS services in real time, so that
developers can analyze logs and troubleshoot issues quickly while security
boundaries stay intact.
Problem Statement and Existing Architecture
The existing CloudWatch logs lacked consistency and meaningful structure, making them hard to interpret during system failures or debugging sessions. The CloudWatch query interface made it difficult to run complex searches or filter logs efficiently across multiple ECS services. In addition, logs were scattered across different ECS tasks and services, with no unified view.
Each developer needed an AWS account to view application logs in CloudWatch. This posed a security risk, because it conflicted with keeping the number of users with AWS console access to a minimum.
Objective
The primary objective was to build a scalable and secure log aggregation
system that:
– Collects logs directly from ECS services.
– Pushes logs to OpenSearch with minimal overhead.
– Enables developers to visualize and query logs via OpenSearch
Dashboards.
– Supports fine-grained access control without exposing AWS infrastructure.
– Automates index lifecycle policies in OpenSearch.
Solution Architecture
We deployed Fluent Bit as an ECS service with the DAEMON scheduling strategy
on EC2 launch type clusters, so that each ECS node runs one Fluent Bit task.
Fluent Bit listens for logs that application containers send through the
`fluentd` log driver and forwards them to an OpenSearch domain hosted in AWS.
Fluent Bit writes the log data to time-based indices using its Logstash-compatible index naming format.
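The two ends of this pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the configuration we deployed: the function names, the `localhost:24224` address (the Docker `fluentd` driver's default), and the `app-logs` prefix are assumptions. The index name follows the `<prefix>-YYYY.MM.DD` pattern that Fluent Bit produces by default when its Logstash format option is enabled.

```python
from datetime import datetime, timezone


def fluentd_log_configuration(tag, address="localhost:24224"):
    """Build the logConfiguration block of an ECS container definition.

    The address is an assumption: it is where the per-node Fluent Bit
    task is expected to listen for forwarded logs.
    """
    return {
        "logDriver": "fluentd",
        "options": {
            "fluentd-address": address,
            "tag": tag,
        },
    }


def logstash_index_name(prefix, when):
    """Return the Logstash-style index name for a timestamp: <prefix>-YYYY.MM.DD."""
    day = when.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-{day}"


if __name__ == "__main__":
    # Hypothetical service tag and index prefix, for illustration only.
    print(fluentd_log_configuration("orders-service"))
    print(logstash_index_name("app-logs", datetime.now(timezone.utc)))
```

Because each day's records land in a separate index, retention can later be handled per index rather than per document.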
Migrating to OpenSearch let us create users and roles inside OpenSearch that do not require access to AWS. These users can view application logs through OpenSearch without access to the AWS console, and without needing an AWS account at all.
We automated index lifecycle policies, which provide a framework for managing indices with minimal manual intervention. These policies manage data throughout an index’s life, which improves efficiency, reduces operational costs, and helps maintain compliance with data governance requirements.
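As an illustration of what such a policy can look like, the sketch below builds a policy document in the shape used by OpenSearch Index State Management (ISM). It assumes ISM is the mechanism behind the lifecycle automation, and the `app-logs-*` pattern, the 30-day retention, and the function name are hypothetical values rather than the ones we used. Only deletion is shown; a shrink step would be an additional state.

```python
import json


def build_ism_policy(index_pattern, retention_days):
    """Build an ISM policy that deletes matching indices after retention_days."""
    return {
        "policy": {
            "description": f"Delete {index_pattern} indices after {retention_days} days",
            "default_state": "hot",
            "states": [
                {
                    # New indices start here and stay until they are old enough.
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": f"{retention_days}d"},
                        }
                    ],
                },
                {
                    # Terminal state: remove the index.
                    "name": "delete",
                    "actions": [{"delete": {}}],
                    "transitions": [],
                },
            ],
            # Attach the policy automatically to newly created matching indices.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ism_policy("app-logs-*", 30), indent=2))
```

The resulting JSON would be sent as the body of a `PUT _plugins/_ism/policies/<policy_id>` request; the policy ID is left to the operator.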
Implementation Steps
1. Created a Fluent Bit Docker image with a custom configuration for the
Amazon OpenSearch Service output.
2. Pushed the image to Amazon ECR for use in ECS task definitions.
3. Defined ECS services using Terraform to run Fluent Bit in DAEMON mode
with proper CPU/memory settings.
4. Configured IAM roles and domain policies to securely allow Fluent Bit to
write to OpenSearch.
5. Deployed OpenSearch domains in both staging and production with fine-grained access control enabled.
6. Set up OpenSearch Dashboards for developer access.
7. Created and documented the `dev-team-role` and admin roles for secured
access.
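The custom configuration from step 1 might look roughly like the output of the sketch below, which renders a classic-mode Fluent Bit configuration with a `forward` input and an `opensearch` output. It is an assumption-laden example: the domain endpoint, region, and prefix are placeholders, and the function name is hypothetical. `AWS_Auth` makes Fluent Bit sign requests with the task's IAM credentials.

```python
def render_fluent_bit_config(host, region, prefix):
    """Render a minimal Fluent Bit config: forward input, OpenSearch output."""
    lines = [
        "[INPUT]",
        # Receives records sent by the Docker fluentd log driver.
        "    Name    forward",
        "    Listen  0.0.0.0",
        "    Port    24224",
        "",
        "[OUTPUT]",
        "    Name                opensearch",
        "    Match               *",
        f"    Host                {host}",
        "    Port                443",
        "    tls                 On",
        # Sign requests with SigV4 using the task's IAM role.
        "    AWS_Auth            On",
        f"    AWS_Region          {region}",
        # Write to <prefix>-YYYY.MM.DD indices.
        "    Logstash_Format     On",
        f"    Logstash_Prefix     {prefix}",
        # Newer OpenSearch versions reject the legacy _type field.
        "    Suppress_Type_Name  On",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder endpoint and region, for illustration only.
    print(render_fluent_bit_config(
        "search-example.us-east-1.es.amazonaws.com", "us-east-1", "app-logs"
    ))
```

Rendering the file from parameters keeps one image usable for both staging and production, with only the endpoint and prefix changing per environment.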
Access and Permissions
Two IAM roles were created:
– `fluentbit-role`: Allowed to push logs to OpenSearch using `es:ESHttp*` permissions.
– `dev-team-role`: Mapped inside OpenSearch to grant users read access to logs and dashboard creation abilities.
Additionally, we created internal users inside OpenSearch for both staging and production environments.
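To make the permission model concrete, the following sketch builds the two documents involved: an IAM policy granting `es:ESHttp*` on a domain, and an OpenSearch security role-mapping body. The ARNs, user names, and function names are hypothetical, and mapping the IAM role as a backend role is one possible arrangement rather than a confirmed detail of our setup.

```python
import json


def fluentbit_iam_policy(domain_arn):
    """IAM policy letting the Fluent Bit task role call the domain's HTTP API."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "es:ESHttp*",
                # All paths under the domain (indices, bulk endpoint, ...).
                "Resource": f"{domain_arn}/*",
            }
        ],
    }


def role_mapping(backend_roles, users):
    """Body for PUT _plugins/_security/api/rolesmapping/<opensearch-role>."""
    return {"backend_roles": list(backend_roles), "users": list(users), "hosts": []}


if __name__ == "__main__":
    # Placeholder account ID, domain name, and user names.
    domain = "arn:aws:es:us-east-1:123456789012:domain/logs-staging"
    print(json.dumps(fluentbit_iam_policy(domain), indent=2))
    print(json.dumps(
        role_mapping(
            ["arn:aws:iam::123456789012:role/dev-team-role"],
            ["alice", "bob"],  # internal OpenSearch users
        ),
        indent=2,
    ))
```

Internal users appear in the `users` list, so a developer without any AWS identity can still be granted the same OpenSearch role.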
Benefits
– Real-time log ingestion from ECS.
– Centralized view of logs across all services.
– Role-based access control for development and operations teams.
– No dependency on CloudWatch Logs.
– Automatic deletion or shrinking of old indices, which reduces search latency.
– Scalable and cost-effective observability solution using native AWS services.