caddy issues
this sounds a whole lot like something else
# context
Recently, with about 70 containers running, we experienced a lot of 502 timeout errors, and Caddy reported I/O timeout errors in its logs. It started with Whoogle Search and Linkding, which we initially put down to their Python back-ends perhaps being slower than other services, but it would occasionally happen to the rest as well, like Portainer and Flame (a pretty lightweight homepage). A lot of these I/O entries in the logs are justified, like WebSocket errors or just bugs within the services themselves (like the weird issue where Netdata reports missing JS libraries when loaded through a reverse proxy, which seems to be a bandwidth problem in the end).
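For context, the entries in question were ordinary reverse-proxy site blocks along these lines; the hostnames and upstream ports below are placeholders, not our actual config:

```
# illustrative only; real hostnames, ports and the other ~70 entries omitted
search.example.com {
	reverse_proxy whoogle:5000
}

links.example.com {
	reverse_proxy linkding:9090
}
```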
# troubleshooting
We tried the following troubleshooting steps:
- Cleaning the `Caddyfile` and making sure there aren't any inefficient matching rules or empty brackets (see the sketch after this list).
- Removing entries from the `Caddyfile` to see if that was the problem.
- Shutting down 90% of our services to see if that was the problem.
- Checking the iowait CPU percentage, as well as disk I/O usage, to see if there was some weird issue causing the upstreams to not respond in time.
- Increasing the open file ulimit for Caddy, and also the kernel-wide limit.
- Making edits according to [[https://wiki.archlinux.org/title/sysctl#Improving_performance|this Arch wiki page on sysctl]] and rebooting.
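For reference, a minimal sketch of what a cleaned-up entry with explicit upstream timeouts could look like, so that a slow upstream fails fast instead of hanging; the hostname, upstream address, and timeout values here are placeholders, not our actual config:

```
# hypothetical entry; hostname, upstream and timeout values are placeholders
search.example.com {
	reverse_proxy whoogle:5000 {
		transport http {
			# fail fast if the container never accepts the connection
			dial_timeout 5s
			# how long to wait for the upstream to start sending a response
			response_header_timeout 30s
		}
	}
}
```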
# results
In the end, we couldn't really figure out why Caddy's I/O performance was being bottlenecked. We could have tried installing it on bare metal, but I'm not sure that would have contributed much, as this problem appeared all of a sudden after weeks of normal, expected performance.