When one of our marketplaces reported a strange HTTP error on our internal service, The Common Platform, Adevinta’s site reliability engineers swung into action immediately. What followed was a months-long investigation into one of the trickiest (and strangest) errors we had ever seen.
Along the way we learned:
- How CPU throttling and ingress controller configurations can mask deeper issues
- That faulty Fluent Bit agents can cause buffer overflows, and that resetting nodes provides only a temporary improvement
- How building a dashboard let us collect and analyse metrics at a more granular level, which was vital for narrowing down our investigation
- The challenges of configuring internal DNS requests for maximum performance
But we did solve the problem, and our client reported a significant performance improvement as a result. Read the full article on the Adevinta blog.