
To improve observability, reliability, and operational efficiency, we
implemented a centralized logging solution using Amazon OpenSearch
Service and Fluent Bit across both staging and production environments.
This solution was aimed at aggregating logs from all ECS services in real
time, enabling developers to analyze logs, troubleshoot issues quickly, and
maintain security boundaries.
Problem Statement and Existing Architecture
The existing CloudWatch logs lacked consistency and meaningful structure, making them hard to interpret during system failures or debugging sessions. The CloudWatch query interface made it difficult to perform complex searches or filter logs efficiently across multiple ECS services. In addition, logs were scattered across different ECS tasks and services, with no unified view.
Each developer needed an AWS account to view application logs in CloudWatch. This posed a security risk, because it conflicted with the goal of keeping the number of users with AWS console access to a minimum.
Objective
The primary objective was to build a scalable and secure log aggregation
system that:
– Collects logs directly from ECS services.
– Pushes logs to OpenSearch with minimal overhead.
– Enables developers to visualize and query logs via OpenSearch
Dashboards.
– Supports fine-grained access control without exposing AWS infrastructure.
– Automates index lifecycle policies in OpenSearch.
Solution Architecture
We deployed Fluent Bit as an ECS service using the DAEMON scheduling
strategy on EC2 launch type clusters, so each ECS node runs one Fluent Bit
task. That task listens for logs sent by application containers through the
`fluentd` log driver and forwards them to an OpenSearch domain hosted in
AWS. Fluent Bit writes the log data to time-series indices using its Logstash-compatible index naming.
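To make the Logstash-compatible naming concrete, here is a minimal sketch of how a daily index name is derived from a record's timestamp. It assumes Fluent Bit's default date format (`%Y.%m.%d`) and UTC dates; the prefix `app-logs` and the function name are hypothetical and not taken from our configuration.

```python
from datetime import datetime, timezone


def logstash_index_name(prefix: str, when: datetime,
                        date_format: str = "%Y.%m.%d") -> str:
    """Return the daily index name for a record timestamped `when`.

    Mirrors the <prefix>-<date> pattern produced by Logstash-style
    index naming. The UTC conversion is an assumption of this sketch.
    """
    utc_time = when.astimezone(timezone.utc)
    return f"{prefix}-{utc_time.strftime(date_format)}"


# Hypothetical prefix; a real deployment would use its own.
example = logstash_index_name(
    "app-logs", datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
)
print(example)  # app-logs-2024.03.05
```

One index per day is what makes age-based retention straightforward: a whole day of logs can be dropped by deleting a single index.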
Migrating to OpenSearch let us create users and roles inside OpenSearch that do not depend on AWS access. These users can view application logs through OpenSearch without access to the AWS console, and without needing an AWS account at all.
We automated index lifecycle policies, which manage indices with minimal manual intervention. The policies govern each index throughout its life, for example by shrinking or deleting it as it ages. This reduces operational cost and helps maintain compliance with data governance requirements.
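As an illustration of what such a policy can look like, the sketch below builds a simple retention policy in the JSON shape used by OpenSearch Index State Management (ISM), the OpenSearch feature for lifecycle policies. The 30-day retention, the `app-logs-*` pattern, and the function name are assumptions for the example, not our actual settings.

```python
import json


def build_retention_policy(index_pattern: str, delete_after_days: int) -> dict:
    """Build an ISM policy body that deletes indices older than N days."""
    return {
        "policy": {
            "description": (
                f"Delete {index_pattern} indices after {delete_after_days} days"
            ),
            "default_state": "hot",
            "states": [
                {
                    # Indices start here and stay until they are old enough.
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {
                                "min_index_age": f"{delete_after_days}d"
                            },
                        }
                    ],
                },
                {
                    # Terminal state: remove the index.
                    "name": "delete",
                    "actions": [{"delete": {}}],
                    "transitions": [],
                },
            ],
            # Attach the policy automatically to newly created matching indices.
            "ism_template": [
                {"index_patterns": [index_pattern], "priority": 100}
            ],
        }
    }


# Hypothetical values for illustration only.
policy = build_retention_policy("app-logs-*", 30)
print(json.dumps(policy, indent=2))
```

A body like this would be sent with `PUT _plugins/_ism/policies/<policy_id>`; further states (for example one that shrinks an index before deletion) can be added between `hot` and `delete`.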
Implementation Steps
1. Created a Fluent Bit Docker image with a custom configuration for the
OpenSearch output.
2. Pushed the image to Amazon ECR for use in ECS task definitions.
3. Defined ECS services using Terraform to run Fluent Bit in DAEMON mode
with proper CPU/memory settings.
4. Configured IAM roles and domain policies to securely allow Fluent Bit to
write to OpenSearch.
5. Deployed OpenSearch domains in both staging and production with fine-grained access control enabled.
6. Set up OpenSearch Dashboards for developer access.
7. Created and documented the dev-team-role and admin roles for secure
access.
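For step 4, an IAM policy that lets Fluent Bit write to the domain could look roughly like the sketch below. The region, account ID, and domain name are placeholders, and the choice of `es:ESHttpPost` and `es:ESHttpPut` is an assumption about what a write-only ingester needs rather than a copy of our policy.

```python
import json


def fluent_bit_write_policy(region: str, account_id: str,
                            domain_name: str) -> dict:
    """Build an IAM policy document allowing HTTP writes to one domain."""
    domain_arn = f"arn:aws:es:{region}:{account_id}:domain/{domain_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Write-style HTTP methods only; no GET or DELETE.
                "Action": ["es:ESHttpPost", "es:ESHttpPut"],
                # The trailing /* covers request paths under the domain.
                "Resource": f"{domain_arn}/*",
            }
        ],
    }


# Placeholder identifiers, not real resources.
write_policy = fluent_bit_write_policy(
    "us-east-1", "123456789012", "example-logs"
)
print(json.dumps(write_policy, indent=2))
```

With fine-grained access control enabled, the IAM role carrying this policy also has to be mapped to an OpenSearch role that permits indexing; the IAM policy alone is not sufficient.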
Access and Permissions
To enforce security and segregation of duties, dedicated roles and users were configured for the logging solution:
Fluent Bit Role: A dedicated fluentbit-role was created to push logs to OpenSearch.
Developer Team Role: A dev-team-role was established and mapped within OpenSearch. This role gives development team members read-only access to the logs, plus the permissions needed to create and manage dashboards, so they can monitor application behavior without altering the underlying data or configuration. Access to OpenSearch follows the principle of least privilege: team members can use only the functionality required for their specific tasks.
Internal Users: For both staging and production environments, internal users were created directly within OpenSearch. These users were assigned roles that provided granular access to specific indices and functionalities, further enforcing the principle of least privilege.
This layered approach to access and permissions ensures that while logs are seamlessly collected and accessible, the security posture of the OpenSearch domain remains strong and compliant with best practices.
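A role along the lines of dev-team-role can be expressed as a request body for the OpenSearch Security plugin. The sketch below is illustrative only: the index pattern, the tenant, and the exact action groups (`read`, `cluster_composite_ops_ro`, `kibana_all_write`) are assumptions about a reasonable read-only setup, not the role as deployed.

```python
def dev_team_role(index_pattern: str) -> dict:
    """Build a Security plugin role body: read-only logs, writable dashboards."""
    return {
        # Read-only cluster-level operations such as multi-search.
        "cluster_permissions": ["cluster_composite_ops_ro"],
        "index_permissions": [
            {
                "index_patterns": [index_pattern],
                # "read" allows searching and fetching, not indexing or deleting.
                "allowed_actions": ["read"],
            }
        ],
        "tenant_permissions": [
            {
                "tenant_patterns": ["global_tenant"],
                # Lets members create and edit saved objects such as dashboards.
                "allowed_actions": ["kibana_all_write"],
            }
        ],
    }


# Hypothetical index pattern for illustration.
role_body = dev_team_role("app-logs-*")
print(role_body)
```

A body like this would be applied with `PUT _plugins/_security/api/roles/dev-team-role`, after which internal users or backend roles are mapped to it.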
Benefits
– Real-time log ingestion from ECS.
– Centralized view of logs across all services.
– Role-based access control for development and operations teams.
– Improved developer productivity when debugging complex production incidents.
– No dependency on CloudWatch Logs.
– Automatic deletion or shrinking of old indices, reducing search latency.
– Scalable and cost-effective observability solution using native AWS services.