Before switching to Loki for our logging needs, we were using AWS CloudWatch Logs. One complaint we in the infrastructure team kept hearing was that CloudWatch logs are hard to parse and navigate. CloudWatch queries are no more complex to write than those of the Loki query language, but when troubleshooting issues, developers are not really looking to write queries; they just want to find the relevant logs as quickly as possible. The Grafana/Loki environment allowed us to do something we could not do in CloudWatch: create logging dashboards with simple drop-down menus that let users quickly get the logs they need.
Another regular complaint about CloudWatch was the price. The CloudWatch pricing model, which charges for data ingestion, storage, and log queries, meant we were incurring significant logging costs every month. The cost of log queries in particular became problematic whenever our software engineers needed to parse the logs.
Grafana Loki only indexes metadata about logs and allows the log entries themselves to be stored compressed in S3, making it a very lightweight solution with a small compute footprint. Moving to self-hosted Loki was therefore the logical choice for us.
Overall, Loki has worked well for us: it has been significantly cheaper in platform costs than CloudWatch, and end users have been much happier with the easy-to-navigate logs dashboards in Grafana. Not everything was perfect, though. Deploying self-hosted Loki is neither simple nor particularly well documented, at least not in a way that matched our needs, and our biggest cost by far was the engineering time required to deploy it. The following tutorial is the explanation I wish had existed when we set out to deploy self-hosted Loki.
Preparing Your Environment
The first step in deploying self-hosted Grafana Loki with a Helm chart is determining whether to install it in monolithic or scalable mode.
Monolithic mode is suitable for smaller environments with a consistent load for both writing and reading logs. Our logging needs are fairly complex, so going with a scalable deployment made sense for us.
The next step in the process is to determine which storage system will be used for storing the logs. Loki can store logs on the local file system or in cloud object storage (such as AWS S3 or Azure Blob Storage). Previous versions also allowed storage in databases such as Cassandra, but as of April 2024 this functionality is deprecated.
Since our infrastructure is hosted in AWS, this decision was once again very easy for us: our logs are stored in an S3 bucket.
The first thing to do is to create a dedicated S3 bucket for Loki.
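Since everything else in our stack is managed with Terraform, the bucket can be created there as well. A minimal sketch (the bucket name loki matches the ARNs in the policy below, but is purely illustrative; real bucket names must be globally unique):

  # Bucket that will hold the Loki chunks and index
  resource "aws_s3_bucket" "loki" {
    bucket = "loki" # hypothetical name; pick something globally unique
  }

  # Keep the stored log data private
  resource "aws_s3_bucket_public_access_block" "loki" {
    bucket                  = aws_s3_bucket.loki.id
    block_public_acls       = true
    block_public_policy     = true
    ignore_public_acls      = true
    restrict_public_buckets = true
  }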
In order to store the logs in your S3 bucket, you need to prepare an IAM role with permissions to read, write, and list objects in the bucket. Loki will use this role to read and write the log chunks in the bucket.
The permission statement for the S3 bucket is very simple:
{
  "Statement": [
    {
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::loki/*",
        "arn:aws:s3:::loki"
      ],
      "Sid": "1"
    }
  ],
  "Version": "2012-10-17"
}
To allow your EKS-deployed Loki service to assume the role, you need to let the cluster's OIDC provider do so. This is done by configuring the role's trust relationship to allow AssumeRoleWithWebIdentity for the EKS OIDC identity:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::{account-id}:oidc-provider/oidc.eks.{region}.amazonaws.com/id/{ID}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "oidc.eks.{region}.amazonaws.com/id/{ID}:sub": "system:serviceaccount:*:loki"
        }
      }
    }
  ]
}
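In a Terraform-managed environment, the role and the two policy documents above can also be expressed as code. A rough sketch, with hypothetical resource names and the same placeholders:

  # IAM role assumed by the Loki service account via the EKS OIDC provider
  resource "aws_iam_role" "loki" {
    name = "loki"
    assume_role_policy = jsonencode({
      Version = "2012-10-17"
      Statement = [{
        Effect = "Allow"
        Principal = {
          Federated = "arn:aws:iam::{account-id}:oidc-provider/oidc.eks.{region}.amazonaws.com/id/{ID}"
        }
        Action = "sts:AssumeRoleWithWebIdentity"
        Condition = {
          StringLike = {
            "oidc.eks.{region}.amazonaws.com/id/{ID}:sub" = "system:serviceaccount:*:loki"
          }
        }
      }]
    })
  }

  # Inline policy granting access to the Loki bucket
  resource "aws_iam_role_policy" "loki_s3" {
    name = "loki-s3"
    role = aws_iam_role.loki.id
    policy = jsonencode({
      Version = "2012-10-17"
      Statement = [{
        Sid      = "1"
        Effect   = "Allow"
        Action   = "s3:*"
        Resource = ["arn:aws:s3:::loki", "arn:aws:s3:::loki/*"]
      }]
    })
  }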
If you have not already created an IAM OIDC provider for your cluster, you can follow the AWS guide.
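Alternatively, if the cluster itself is managed with Terraform, the OIDC provider can be created there too; a sketch, assuming the cluster is defined in a resource named aws_eks_cluster.this:

  # Fetch the thumbprint of the cluster's OIDC issuer certificate
  data "tls_certificate" "eks" {
    url = aws_eks_cluster.this.identity[0].oidc[0].issuer
  }

  # Register the cluster's OIDC issuer as an IAM identity provider
  resource "aws_iam_openid_connect_provider" "eks" {
    url             = aws_eks_cluster.this.identity[0].oidc[0].issuer
    client_id_list  = ["sts.amazonaws.com"]
    thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  }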
Terraforming a Helm chart
Grafana recommends using a Helm chart to deploy Loki on a Kubernetes cluster. You can install the chart directly with Helm, following the guide provided by Grafana, but in an IaC organisation it can be tricky to mix deployment paradigms, so I handled this by deploying the Helm chart via Terraform.
The Terraform Helm provider allows you to create a helm_release resource:

  resource "helm_release" "loki" {
    name             = "loki"
    repository       = "https://grafana.github.io/helm-charts"
    chart            = "loki"
    namespace        = "loki"
    create_namespace = true
    version          = "5.30.0"

    values = [file("./values.yaml")]
  }
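The helm_release resource above assumes the Helm provider is already configured to talk to your EKS cluster. One way to do that is shown below; the data source names and the {cluster-name} placeholder are assumptions:

  data "aws_eks_cluster" "this" {
    name = "{cluster-name}"
  }

  data "aws_eks_cluster_auth" "this" {
    name = "{cluster-name}"
  }

  # Point the Helm provider at the EKS cluster API
  provider "helm" {
    kubernetes {
      host                   = data.aws_eks_cluster.this.endpoint
      cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
      token                  = data.aws_eks_cluster_auth.this.token
    }
  }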
As you can see, this resource refers to a values.yaml file. Here is an example of a basic initial values.yaml:
loki:
  auth_enabled: false
  serviceAccountName: loki
  # Global storage configuration
  storage:
    bucketNames:
      chunks: 'loki'
      ruler: 'loki'
      admin: 'loki'
    type: s3
    s3:
      region: '{region}'
  # Schema configuration, including the index period and the schema version to use
  schemaConfig:
    configs:
      - from: 2023-10-12
        store: tsdb
        object_store: s3
        schema: v12
        index:
          prefix: index_
          period: 24h
  # Storage configuration for the tsdb index shipper
  storageConfig:
    tsdb_shipper:
      shared_store: s3
      active_index_directory: /var/loki/index
      cache_location: /var/loki/cache
      cache_ttl: 24h

# Service account for the Loki pods
serviceAccount:
  create: true
  name: loki
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::{account-id}:role/loki
This configuration is enough to run Loki. With it, you can send logs from the local cluster and use them in a Grafana dashboard. The configuration for the rest of the components is taken from the default values.yaml of the Helm chart version used.

Another thing to note here is that I did not use a secretAccessKey and accessKey for access to S3. As a general security best practice, we avoid using static keys for direct access to any AWS resources; instead, I used the IAM role configured earlier.

Different Helm charts published by Grafana have used different values.yaml formats, so always refer to the default file provided with the Helm chart you are using. The S3 configuration, for example, was a big pain point, as different examples I encountered showed different ways to configure it. For the particular chart version I used, no endpoint configuration was necessary (and, in fact, having an endpoint entry in the values.yaml file resulted in a Loki deployment that could not write logs to the S3 bucket).
Configuration for log ingestion from other Kubernetes clusters
In order to ingest logs from other clusters, you can configure an ingress for the deployment:
ingress:
  enabled: true
  ingressClassName: "alb"
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-1:{account-id}:certificate/{certificate-id}
  hosts:
    - {hostname}
Since all of our EKS clusters are in peered VPCs, I used an internal ALB.
Configuration of autoscaling and resource allocation
The following configuration is not needed in order to have a functioning Loki deployment; it is a set of nice-to-haves that have made managing Loki and our cluster a little easier.
By default, the Loki Helm chart does not include resource requests and limits, so the Loki pods can consume as many or as few resources as they need. However, since both the read and write components can use a significant amount of memory, I recommend specifying requests and limits for at least those components, to make sure the pods are scheduled on nodes with sufficient capacity to run them.
# Autoscaling and resources for the write component
write:
  autoscaling:
    enabled: true
    targetMemoryUtilizationPercentage: 75
    # -- Minimum autoscaling replicas for the write component.
    minReplicas: 2
    # -- Maximum autoscaling replicas for the write component.
    maxReplicas: 6
    behavior:
      scaleUp:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 900
      scaleDown:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 1800
        stabilizationWindowSeconds: 3600
  extraArgs:
    # Instructs the ingester to flush its in-memory data to storage at shutdown
    - '-ingester.flush-on-shutdown=true'
  # Resources for the write pods
  resources:
    requests:
      cpu: 200m
      memory: 4096Mi
    limits:
      cpu: 600m
      memory: 5120Mi

# Autoscaling and resources for the read component
read:
  autoscaling:
    enabled: true
    # -- Minimum autoscaling replicas for the read component.
    minReplicas: 2
    # -- Maximum autoscaling replicas for the read component.
    maxReplicas: 8
  # Resources for the read pods
  resources:
    requests:
      cpu: 50m
      memory: 2048Mi
    limits:
      cpu: 200m
      memory: 3072Mi

# Autoscaling for the backend components (compactor, query scheduler, index gateway, ruler)
backend:
  autoscaling:
    enabled: true
    targetMemoryUtilizationPercentage: 70
    # -- Minimum autoscaling replicas for the backend.
    minReplicas: 2
    # -- Maximum autoscaling replicas for the backend.
    maxReplicas: 6
  # Resources for the backend components
  resources:
    requests:
      cpu: 20m
      memory: 256Mi
    limits:
      cpu: 80m
      memory: 512Mi

# Resources for the gateway
gateway:
  resources:
    requests:
      cpu: 20m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 512Mi
The default Loki autoscaling configuration has a minReplicas value of 2 and a maxReplicas value of 6. Here, I have also opted to increase maxReplicas for the read component to 8, as we sometimes see high read usage.
Configuring Promtail to send logs to Loki
You should now have a Loki deployment running on your cluster with appropriate permissions and resources, but nothing is sending any logs to it yet. Next, you need to install a log shipper that will send the desired logs to Loki. Loki supports several different log shippers, and for our Kubernetes workloads I opted to use Promtail.
Promtail is an agent that runs on the Kubernetes cluster and ships the contents of local logs to a Loki instance.
To deploy Promtail, you can use the same method of deploying the Helm chart via Terraform:

  resource "helm_release" "promtail" {
    name             = "promtail"
    repository       = "https://grafana.github.io/helm-charts"
    chart            = "promtail"
    namespace        = "promtail"
    create_namespace = true
    version          = "6.15.2"

    values = [file("./values.yaml")]
  }
The Promtail Helm chart deploys Promtail as a DaemonSet on every node in the cluster and, using its service discovery mechanism, fetches the required labels from the Kubernetes API.
For Promtail running in the same Kubernetes cluster as Loki, you can send the logs directly to the Loki gateway service:
config:
  enabled: true
  logLevel: info
  serverPort: 3100
  clients:
    # Location of the Loki gateway to which logs are written
    - url: http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push
      tenant_id: 1
      external_labels:
        cluster_name: {cluster_name} # this can be a label extracted from the cluster metadata as well

resources:
  limits:
    cpu: 100m
    memory: 150Mi
  requests:
    cpu: 50m
    memory: 100Mi

serviceMonitor:
  enabled: true
As with the Loki configuration, I have explicitly assigned resource requests and limits for the Promtail pods.
For external clusters, you can send logs to the Loki Ingress:
config:
  enabled: true
  logLevel: info
  serverPort: 3100
  clients:
    # Location of the Loki ingress to which logs are written
    - url: http://{hostname}/loki/api/v1/push
      # tenant_id: 1
      external_labels:
        cluster_name: {cluster_name}
  snippets:
    pipelineStages:
      - cri: {}
      - multiline:
          firstline: '^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\]'
          max_wait_time: 5s
      - regex:
          expression: '^(?P<timestamp>\[\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2},\d{3}\]) (?P<message>(?s:.*))$'

resources:
  limits:
    cpu: 100m
    memory: 150Mi
  requests:
    cpu: 50m
    memory: 100Mi

serviceMonitor:
  enabled: true
In the example above, the external cluster produces some multiline logs. In order for Promtail to ingest each of those as a single entry, as opposed to multiple single-line entries, I used the multiline stage to define the pattern that marks the first line of an entry and the maximum time to wait for continuation lines to appear.
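For illustration, here is a hypothetical pair of log lines of the shape this pipeline handles: the first line matches the firstline pattern and starts a new entry, while the second line has no timestamp and is folded into the same entry:

  [2024-05-06 14:03:21,457] ERROR Unhandled exception while processing request
  ValueError: invalid configuration value for retention_period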
Additional tips for better performance
Tuning Loki for optimal performance:
# Server configuration
server:
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  grpc_server_max_recv_msg_size: 104857600 # 100 MB
  grpc_server_max_send_msg_size: 104857600 # 100 MB

# Limits configuration
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  reject_old_samples_max_age: 168h
  max_query_series: 50000
  max_query_parallelism: 256
  query_timeout: 290s
  split_queries_by_interval: 1h
Through some trial and error, these are the parameters I found to work best for us. The server parameters allow for longer-running and larger read and write requests; since we sometimes parse larger sets of data, this lets us retrieve them without hitting timeouts in Grafana. It is important to also increase the timeout on the Grafana side for this to work.

The limits_config block configures global and per-tenant limits in Loki. To accommodate our rate of incoming log data, I have increased ingestion_rate_mb and ingestion_burst_size_mb. And once again, because we sometimes query big sets of data, I have increased max_query_series and max_query_parallelism.

The configuration items above are added under the loki block of values.yaml.
Querying your logs from Grafana
In order to view Loki logs in Grafana, you need to add Loki as a data source in your Grafana instance.
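If you provision Grafana declaratively, this can be done with a data source provisioning file. A sketch, assuming the in-cluster gateway address used earlier; the values are illustrative and reflect the Grafana-side timeout increase mentioned above:

  apiVersion: 1
  datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki-gateway.loki.svc.cluster.local
      jsonData:
        # Match the longer server and query timeouts configured in Loki above
        timeout: 300
        maxLines: 5000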
Loki uses its own query language, called LogQL, which can be used to query the logs from Grafana. This makes it possible to create dashboards that give access to specific sets of logs based on their labels.
In one such dashboard, I used dashboard variables to allow the logs to be filtered by Environment (cluster), Service (app), and Log Level (Critical, Error, Warning, Info, or Debug), and added a free-form search box that lets anyone look for specific terms in the logs without writing a query themselves.
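Under the hood, the dashboard panel runs a single LogQL query built from those variables; roughly something like the following sketch, where the label and variable names are illustrative and depend on the labels your Promtail configuration attaches:

  {cluster_name=~"$environment", app=~"$service"} |~ "$level" |~ "$search"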
Since the labels match those pulled from Prometheus, this allows us to create integrated dashboards that show Prometheus metrics, Loki logs, and metrics derived from the Loki logs, all in the same place.
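As an example of such a log-derived metric, here is a sketch of the kind of LogQL metric query that can sit next to Prometheus panels (again with illustrative label names), charting the rate of error lines per service:

  sum by (app) (count_over_time({cluster_name=~"$environment", app=~"$service"} |~ "ERROR" [5m]))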
Results
After moving our logs to Loki, our developers have been a lot happier with the log search experience. We have also been able to integrate data from our logs into the Grafana dashboards we were already using.
Lastly, we have made significant savings on our AWS bill every month.