Case Study: How luroConnect Insight detected network throttling by a cloud provider

Introduction

“The site is slow” is a very common complaint. A system administrator who receives it is often unsure where to start. Typically, the system administrator will

  • run the top command to check system parameters, typically CPU usage and load average
  • check the MySQL slow query log to see if any queries are slowing the site down

The typical solution might involve restarting various services:

  • nginx
  • php-fpm
  • mysql

The problem may go away for a while, only to reappear. On a cloud service, a solution might be to add more cloud app servers.
But when CPU usage and load average are both low, what do you do?

We had one such instance for a luroConnect customer, a funded eCommerce site serving the B2C and B2B markets. They had recently moved to a new service provider with larger servers, including physical servers for the app and db, with some VMs to take additional load.

CPU, memory, load average and slow logs did not report any load at all. In fact, we found that the CPU and load average were well below what we expected for the number of hits. We then set up luroConnect Insight log file analytics to observe the nginx log file. The response vs upstream average plot was the one that alarmed us. The nginx log variables $request_time and $upstream_response_time are logged as part of the standard luroConnect Insight setup. This allows PHP response times to be tracked independently of $request_time, which is the end-to-end response time.
$request_time : request processing time in seconds with millisecond resolution; the time elapsed between reading the first bytes from the client and writing the log entry after the last bytes were sent to the client
$upstream_response_time : time spent receiving the response from the upstream server, in seconds with millisecond resolution

A gap between the two values can be caused by browsers on slow internet connections. However, at the high hit rate the site got, the average covered many different clients, so slow browsers were unlikely to explain it. The most plausible explanation was a network issue with the cloud service provider, possibly throttling of the internet connection.
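The comparison of the two averages can be sketched in a few lines of Python. This is only an illustration, not how luroConnect Insight is implemented. It assumes a hypothetical log_format that ends each line with rt=$request_time urt="$upstream_response_time"; the field labels rt and urt and the function names are made up for this example.

```python
import re
from statistics import mean

# Assumed (hypothetical) tail of each access log line:
#   ... rt=$request_time urt="$upstream_response_time"
LINE_RE = re.compile(r'rt=(?P<rt>[\d.]+) urt="(?P<urt>[^"]*)"')


def parse_upstream(value):
    """Return the total upstream time, or None if no upstream was used.

    nginx logs "-" when no upstream was contacted and separates the
    times of several upstream attempts with commas and colons.
    """
    parts = [p.strip() for p in re.split(r"[,:]", value)]
    times = [float(p) for p in parts if p and p != "-"]
    return sum(times) if times else None


def summarize(lines):
    """Average end-to-end time, upstream time and the gap between them."""
    request_times, upstream_times = [], []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line does not follow the assumed format
        upstream = parse_upstream(match.group("urt"))
        if upstream is None:
            continue  # no upstream involved, e.g. a static file
        request_times.append(float(match.group("rt")))
        upstream_times.append(upstream)
    if not request_times:
        return None
    avg_request = mean(request_times)
    avg_upstream = mean(upstream_times)
    return {
        "requests": len(request_times),
        "avg_request_time": avg_request,
        "avg_upstream_time": avg_upstream,
        "avg_gap": avg_request - avg_upstream,
    }
```

Calling summarize(open("access.log")) on a log in that format returns the two averages and their gap. A gap that stays large across many requests while the upstream average stays small points away from PHP and the database and towards the path between nginx and the client.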

The resolution

We took this evidence to the service provider and walked them through the method used to make the observations, which were clearly different from anything they had seen earlier. We then set up a test by creating a flood of requests from another server on the same subnet and got the same throttling result.
They investigated and found misconfigured network equipment.
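A flood test of the kind described above could look roughly like the following sketch. It is not the script we used; the function names, the request count and the concurrency level are assumptions chosen for illustration, and the URL in the usage comment is a placeholder.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_fetch(url, timeout=10):
    """Download one URL and return the number of body bytes received."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return len(resp.read())


def timed(fetch):
    """Run fetch() once and return (elapsed seconds, bytes received)."""
    start = time.perf_counter()
    size = fetch()
    return time.perf_counter() - start, size


def flood(fetch, total=200, concurrency=20):
    """Call fetch() `total` times from `concurrency` worker threads.

    Any exception raised by fetch() propagates to the caller.
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed(fetch), range(total)))
    wall = time.perf_counter() - started
    total_bytes = sum(size for _, size in results)
    return {
        "requests": len(results),
        "avg_seconds": (
            sum(t for t, _ in results) / len(results) if results else 0.0
        ),
        "bytes_per_second": total_bytes / wall if wall > 0 else 0.0,
    }


# Example (placeholder address, run from a server on the same subnet):
#   stats = flood(lambda: http_fetch("http://10.0.0.5/"))
```

Running the same flood from inside the subnet and comparing avg_seconds and bytes_per_second with the figures seen from outside helps show whether the slowdown follows the network path rather than the application.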

Conclusion

luroConnect was not designed with this use case in mind. But it was designed to give insights into the web server that regular monitoring tools do not give.