
Bug Resolution Is So Much Better With Log Analysis

by Sulakshana Singh, January 27th, 2025

Too Long; Didn't Read

Google Log Explorer is a tool that helps you troubleshoot problems in the cloud. It is the place where all the logs you need for your troubleshooting walk are stored, and those logs act as street signs pointing toward what went wrong.

In contrast to on-premise environments, the cloud is a scalable world where applications are dispersed and workloads are containerized on demand. Logs become significant when a service fails, performance degrades, or behavior turns anomalous: they let developers trace requests across the many containers in the cloud. Hence, proficient log analysis plays a vital role today and, in the long run, accelerates defect and bug resolution.


Context


As a software engineer, one of your responsibilities is to resolve defects raised during system testing, integration testing, or by end users. One of the most important steps in this resolution process is Root Cause Analysis (RCA), which helps you close the issue effectively and reduces the risk of the defect being reopened later.


In a microservices architecture, a result is produced through the interaction of several microservices. Hence, it is tedious for anyone to quickly detect the root of a problem and offer a solution.


How do you tackle issues in Google Log Explorer?


Before you enter Google Log Explorer, think of it as a whole new city. The logs act as street signs pointing toward what went wrong. But much like getting around an unknown city, it is essential to have the right map and the permissions to explore effectively.


Make sure you have the correct role, logging.viewAccessor; otherwise, you will be refused access to Log Explorer. This permission is project-scoped, meaning you can only access the logs of containers deployed in that specific project.


Imagine your project as the neighborhood where you start exploring. Log in to the Google Cloud Console and select the project where your workloads run. Then go to the "Workloads" menu, your guide to a hallway of busy containers. Search for the container you want to analyze, filter for it, and there you will find the place where all the logs for your troubleshooting walk are stored.


With this checklist done, go ahead: you now know how to locate Google Log Explorer and unearth the clues in the logs to troubleshoot.


reference - My Test Google Cloud account


How I mastered log analysis in Google Cloud


When I was first introduced to Google Log Explorer, I felt like I was walking into an enormous library of data, each log providing a subtle hint toward the questions I needed to answer. As I explored, I came to realize that mastering log analysis requires strategy, perseverance, and a good knowledge of what to look for.


Filter on date and time - Log Explorer can restrict log analysis to a designated time window. The available date and time options are as below:


  1. Quick Snapshots: Whenever I tackled recent problems, I filtered logs down to the last hour or the last 24 hours. These quick snapshots surfaced real-time anomalies and let me identify tricky trends.

  2. Broader Trends: For a problem that kept recurring or had existed for some time, querying logs from the previous week or the last 30 days uncovered trends that were otherwise not evident.

  3. Custom Time Windows: Sometimes, however, I needed to dig even deeper. The custom time filter let me jump to the exact window in which an event occurred. By setting start and end dates, I could surface only the logs that matter.

  4. From Then to Now: For ongoing investigations, the "start date to current" option traced events over time, capturing all the informational nuggets leading right up to the present moment.
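The custom and open-ended windows above boil down to timestamp clauses in the query. As a sketch (the helper name is mine; the timestamps use the RFC 3339 format that Log Explorer accepts), you could generate them like this:

```python
from datetime import datetime, timedelta, timezone

def time_window_filter(start, end=None):
    """Build a timestamp clause limiting results to [start, end].

    When end is None, the clause is open-ended ("from then to now").
    """
    clause = f'timestamp>="{start.isoformat()}"'
    if end is not None:
        clause += f' AND timestamp<="{end.isoformat()}"'
    return clause

# A custom 24-hour window ending at a fixed point in time.
end = datetime(2025, 1, 27, 12, 0, tzinfo=timezone.utc)
print(time_window_filter(end - timedelta(hours=24), end))
```

Dropping the second argument reproduces the "start date to current" behavior: the clause simply has no upper bound.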


reference - My Test Google Cloud account


Filter Logs by Severity - You can filter logs by severity, such as Error, Alert, Critical, Info, Debug, and others. This is useful for narrowing attention to the log types relevant to what is under investigation.


Filter Logs by Query - Use the following sample query to filter logs, adapting it to match your own project and workload.


resource.type="k8s_container"
resource.labels.project_id="e-caldron-448519-s5"
resource.labels.location="us-central1"
resource.labels.cluster_name="singh-deployment-cluster"
resource.labels.namespace_name="default"
labels.k8s-pod/app="singh-deployment"
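A query like the one above is just newline-separated clauses that Log Explorer combines with AND. When reusing the same skeleton across clusters, a small helper can assemble it from a dict (a sketch; the helper is mine, the field names come from the sample query above):

```python
def build_query(fields):
    """Join field/value pairs into a Logging query, one clause per line;
    Log Explorer treats newline-separated clauses as AND-ed together."""
    return "\n".join(f'{field}="{value}"' for field, value in fields.items())

query = build_query({
    "resource.type": "k8s_container",
    "resource.labels.cluster_name": "singh-deployment-cluster",
    "resource.labels.namespace_name": "default",
    "labels.k8s-pod/app": "singh-deployment",
})
print(query)
```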


Filter Logs by Trace - The query below is an example of how a trace ID can be used to analyze a single request across containers.

resource.type="k8s_container"
resource.labels.project_id="e-caldron-448519-s5"
resource.labels.location="us-central1"
resource.labels.cluster_name="singh-deployment-cluster"
resource.labels.namespace_name="default"
labels.k8s-pod/app="singh-deployment"
jsonPayload.trace_id = "1ab527ht67lkhjddg876nhdkd"
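Once every service logs the same trace_id, a request's path can be reconstructed by grouping entries on that field and ordering each group by timestamp. A minimal local sketch (the entry shapes are illustrative, not a Cloud Logging API):

```python
from collections import defaultdict

def group_by_trace(entries):
    """Group entries by jsonPayload.trace_id and sort each group by
    timestamp, reconstructing one request's hops across containers."""
    traces = defaultdict(list)
    for entry in entries:
        trace_id = entry.get("jsonPayload", {}).get("trace_id")
        if trace_id:
            traces[trace_id].append(entry)
    for hops in traces.values():
        hops.sort(key=lambda e: e["timestamp"])
    return dict(traces)

entries = [
    {"timestamp": "2025-01-27T10:00:02Z", "resource": "orders-svc",
     "jsonPayload": {"trace_id": "1ab5", "message": "order rejected"}},
    {"timestamp": "2025-01-27T10:00:01Z", "resource": "gateway",
     "jsonPayload": {"trace_id": "1ab5", "message": "request received"}},
]
for hop in group_by_trace(entries)["1ab5"]:
    print(hop["timestamp"], hop["resource"])
```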


Lessons learned from Google’s Log Explorer

Working with Google Cloud has been a transformative experience, allowing me to master log analysis techniques and gain valuable insights into the most common issues in Kubernetes.


  1. CrashLoopBackOff - One of the most prevalent problems; pods become unavailable, resulting in a poor user experience because of service unavailability.
  2. Image not available - Old images are deleted under organizational policy, and pods that still reference them fail to start, surfacing as a crash-loop error.
  3. Missing Configuration - In rare cases, a developer forgets to provide some configuration to the pods, and the pods crash in a loop.
  4. Vault connection - Personally identifiable information is kept in a cloud vault; when services cannot connect to the vault, they crash-loop.
  5. Code problem - Crash looping occurs when developers push unverified code to Kubernetes and apply it.
  6. Certificate expired - Crash loops occur when Google-managed keys are not renewed in time.
  7. Persistent Pod reboots - Occasionally, application pods keep restarting, often because of an OutOfMemory error caused by resource exhaustion (that is, insufficient resource limits in the Helm chart).
  8. Network-related problems - In a microservices architecture, a service communicates with several other services; one of those services may be down, or another service's firewall may deny the communication.
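Several of the failure modes above surface in the logs as repeated back-off messages. A rough local heuristic for spotting them (a sketch: the "Back-off restarting" text mirrors the kubelet's event message, and the entries are illustrative):

```python
from collections import Counter

def crash_looping_pods(entries, threshold=3):
    """Flag pods whose logs show repeated back-off restart events,
    a rough proxy for CrashLoopBackOff."""
    restarts = Counter(
        e["pod"] for e in entries if "Back-off restarting" in e["message"]
    )
    return sorted(pod for pod, n in restarts.items() if n >= threshold)

entries = [
    {"pod": "singh-deployment-abc", "message": "Back-off restarting failed container"},
    {"pod": "singh-deployment-abc", "message": "Back-off restarting failed container"},
    {"pod": "singh-deployment-abc", "message": "Back-off restarting failed container"},
    {"pod": "singh-deployment-xyz", "message": "request served"},
]
print(crash_looping_pods(entries))  # ['singh-deployment-abc']
```

A heuristic like this only tells you which pods are looping; the root cause (image, configuration, vault, certificate, or code) still comes from reading the surrounding log lines.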


Can we reduce the risk of downtime in the cloud?

Problems cannot always be avoided; however, proactive steps can be taken to reduce the risk and improve the end-user experience. Google Monitoring allows you to construct alerts that inform developers and production-support staff so they can react well, promptly, and proactively.
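An alert policy is ultimately a JSON resource. The sketch below assembles the body for a log-match policy; the field names follow the Cloud Monitoring v3 AlertPolicy resource, but the project and channel IDs are placeholders, so treat it as a template rather than a ready-to-send request:

```python
import json

def log_match_alert_policy(display_name, log_filter, notification_channels):
    """Assemble the JSON body for an alert policy that fires when a log
    entry matches the given filter."""
    policy = {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [{
            "displayName": f"{display_name} condition",
            "conditionMatchedLog": {"filter": log_filter},
        }],
        "notificationChannels": notification_channels,
    }
    return json.dumps(policy, indent=2)

print(log_match_alert_policy(
    "crash-loop alert",
    'resource.type="k8s_container" AND severity>=ERROR',
    ["projects/my-project/notificationChannels/123"],  # hypothetical channel
))
```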


Let's see how to create an alert policy in Google Cloud Monitoring.

reference - My Test Google Cloud account


Which notification channels are available in Google Cloud that can be configured in the above policy?

reference - My Test Google Cloud account


In the diverse and continuously evolving field of cloud computing, it is not knowing the tools that sets you apart, but how you use them for real-world problem-solving. My hands-on experience with Google Log Explorer made me realize that even the most powerful tools require a personal touch to extract their true potential. I learned to spot patterns in recurring failures, isolate even the most subtle anomalies in logs, and derive actionable insights from them to prevent downtime. One of my biggest takeaways is how to convert raw data into actionable steps, be it fine-tuning configurations or proactively resolving resource constraints. It is not merely about solving issues; it is about predicting them in advance.