Rescue Situation: Sitecore CPU Spikes

Let's begin by saying that 9 times out of 10, performance issues are not caused by Sitecore itself. The cause is typically custom code, downloaded code, or an environment that does not follow best practices. If the issue is indeed Sitecore related, their support team is responsive and quite helpful.

Recently, after a deployment, the CPUs on our CD servers were pegged at 90-100%. So we cranked up the coffee pot and started digging. Here are the things we checked, in order, which ultimately led us to the culprit.

1) What has changed? - Usually something has changed, so this is always a good first step. This situation began after a deployment in which the development environment used a continuous integration tool to deploy the code, instead of specific files being copied over manually. This was the first time the entire code base went out as a whole. Since manual deployments had been the norm since day one, we had to assume everything had changed.

2) Was it traffic related? - On the night of the deployment, our site verification was successful. About 10 people went through each new feature that was expected to be delivered, and all was well. The site was running smoothly (or so it seemed).

The next morning we walked into a critical situation: the site was crashing. We checked our analytics for above-average traffic. Was there a marketing promotion going on that we were unaware of? Were we being hit with a DDoS attack? Did 100k of our closest friends decide to use our site at once? Ultimately, traffic was the trigger, but the underlying problem still had to be resolved, because we want and love site traffic!
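If analytics are slow to catch up, the web server logs can answer the same traffic question. Below is a minimal sketch, assuming the CD servers write IIS logs in the W3C extended format (a "#Fields:" header line followed by space-separated rows that include date and time columns). The function names are hypothetical and are not part of Sitecore or IIS.

```python
from collections import Counter


def requests_per_minute(lines):
    """Count requests per minute from IIS W3C extended log lines.

    Assumes a '#Fields:' directive names the columns and that the
    'date' and 'time' columns are being logged.
    """
    fields = []
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # Column names for the rows that follow this directive.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            # Other directives, or rows seen before any header.
            continue
        row = dict(zip(fields, line.split()))
        if "date" in row and "time" in row:
            # Truncate HH:MM:SS to HH:MM to bucket by minute.
            counts[f"{row['date']} {row['time'][:5]}"] += 1
    return counts


def busiest_minutes(counts, top=5):
    """Return the minutes with the most requests, busiest first."""
    return counts.most_common(top)
```

Feeding it an open log file, for example busiest_minutes(requests_per_minute(open(path))), gives a quick view of whether the request rate actually jumped around the time the CPU did. Keep in mind that IIS logs timestamps in UTC by default.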

3) Check the Sitecore logs - Checking the Sitecore logs, we noticed some bad behavior. Always look at the most recent log file named log.{date*}.txt. What we found was a plethora of errors. At first we thought, "death by a thousand cuts", but after fixing a few of the errors that produced the most log entries, our problem remained.
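Finding the errors that produce the most log entries is easier with a quick tally than by scrolling. The sketch below assumes each Sitecore log entry starts with a thread name or id, then an HH:MM:SS time, then a level such as ERROR, then the message; verify that against your own log files before relying on it. The helper names are hypothetical.

```python
import re
from collections import Counter

# Assumed entry layout: "<thread> <HH:MM:SS> <LEVEL> <message>".
LINE = re.compile(
    r"^(?P<thread>.+?) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)
GUID = re.compile(r"\{?[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\}?")
NUM = re.compile(r"\d+")


def normalize(msg):
    """Replace GUIDs and numbers so near-identical messages group together."""
    msg = GUID.sub("<guid>", msg)
    return NUM.sub("<n>", msg)


def top_errors(lines, levels=("ERROR", "FATAL"), top=10):
    """Return the most frequent messages at the given levels, most common first."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line.rstrip("\n"))
        # Lines that do not match (stack traces, blank lines) are skipped.
        if m and m.group("level") in levels:
            counts[normalize(m.group("msg"))] += 1
    return counts.most_common(top)
```

Running top_errors over the newest log file gives a ranked list of what to fix first. As we found out, clearing the noisiest errors does not guarantee the CPU problem goes away, but it makes the remaining signal easier to see.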

4) Analyze the threads of your application - We then took a memory dump (.dmp) of the w3wp.exe worker process on the CD server itself. We let the site run at 90-100% CPU for about 10 minutes, then took the .dmp file.

TIP: To do this, open Task Manager on the offending server, right-click the w3wp.exe process, and choose "Create dump file".

In DebugDiag, select the .dmp file and choose every option available when running your analysis. This creates an .mht report that opens in Internet Explorer. The trick is to zoom way out, then scroll until you see red text. Then zoom in and see what is happening at the thread level. We finally found our culprit!

5) Code optimization - Ultimately there was custom code in the view pipeline, written by a previous vendor, that was not efficient and was keeping the CPU very high. We optimized the code internally and tried the site under load again. The CPU was still spiking, but not as high! We knew our job wasn't done yet!

6) Code placement - After many heads came together, a bright one on the team recommended that we call the pipeline method just after the ItemResolver (we were 301 redirecting some pages when certain conditions were true). The original offending code sat just before the LanguageResolver, where it was causing race conditions with threading. Once we moved it a little later in the pipeline, just after the Sitecore ItemResolver, we saw blue skies again.

Our Root Cause - Custom code that was not efficient, and the placement of a burdensome method in the pipeline.

We hope this helps on your next journey with a poorly performing site.