Catching the Alarm: The Moment the Cluster Flips
It was a typical Monday morning. I’d just finished a brief stand‑up and was heading back to my desk to dig through the day’s emails when a cascade of alerts popped up on my monitor. The sender was ELM Enterprise Manager 3.0, the tool we rely on to watch our SQL Server event logs. The message had been queued a half‑hour earlier, while I was still in the meeting room, and I was only seeing it now. My eyes landed straight on the name of the server that had sent the alerts: our production SQL Server cluster, the backbone of the ERP system that drives more than a billion dollars of revenue each year. The cluster, composed of two nodes, had just performed a failover from the primary to the secondary node.
At the instant I saw the message, my stomach tightened. The cluster had gone from well over a hundred active connections to a dead silence that lasted a minute or two. I pulled up Cluster Administrator from my desk and connected to the cluster, whose nodes ran Windows 2000 Advanced Server SP2 and SQL Server 7.0 SP3. The console confirmed that the cluster resources had indeed switched to the secondary node. It was a textbook failover: the primary node had lost the service, the cluster service had tried to restart MSSQLSERVER three times, failed each time, and then handed the load over to the backup node. The failover had finished, the resources were online, and users were, in theory, back in business.
My next step was to check the impact on the ERP users. I ran a quick test against the ERP login form; it timed out on the first request, but a subsequent login succeeded. The application didn’t automatically retry a broken connection, so users had to manually log out and back in. I immediately drafted an email to the help desk, which was in the same room, informing them of the failover and instructing them to tell all end users to re‑login. The help desk replied in less than a minute that they were receiving a flood of calls from frustrated customers. The cluster’s automatic failover worked, but the user experience was still broken.
After the alert, I met with the ERP manager and the team that runs the daily reports. I wanted to gauge whether the report workload could have been a trigger. The cluster had run for seven months without any automatic failovers; this was the first one since deployment. Each node had 4 GB of RAM, and we had no history of CPU saturation. The only recent event of note was a database backup that had started at 8 AM, just a few minutes before the alert. I told the managers that we’d look into the logs and get back to them as soon as we had a hypothesis.
At 10 AM I logged onto the Windows Event Viewer on the primary node. The System log showed a clean shutdown of MSSQLSERVER, followed by a series of “service stopped” and “service started” events. Nothing dramatic. The Cluster log contained the standard pattern: service failure, restart attempts, then failover. The Database Engine log was empty, with no indications of deadlocks or I/O errors. I checked the backup log: the transaction log backup had begun at 8:02 AM, finished at 8:15 AM, and reported no errors. The cluster had handed over to the secondary node at 8:16 AM, just after the backup had finished.
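For anyone who wants to reconstruct the same timeline, the backup history in msdb makes the check straightforward. The query below is a minimal sketch, assuming the standard msdb.dbo.backupset history table; the database name 'ERPDB' is a hypothetical stand‑in for our ERP database.

    -- Sketch: list recent backups for a hypothetical database 'ERPDB' so the
    -- start/finish times can be lined up against the failover timestamp.
    SELECT bs.database_name,
           bs.type,                -- D = full, I = differential, L = transaction log
           bs.backup_start_date,
           bs.backup_finish_date
    FROM   msdb.dbo.backupset AS bs
    WHERE  bs.database_name = 'ERPDB'
    ORDER BY bs.backup_finish_date DESC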
I kept my notes: primary node, transaction log backup, failover at 8:16. I was still missing the root cause. The system logs, backup logs, and cluster logs all looked normal. I knew I had to dig deeper, and I had to act fast. I wanted to know why the primary node’s MSSQLSERVER service died and why the cluster thought it needed to fail over. I also wanted to ensure that we would not be caught off guard again. I grabbed a notebook and started outlining the investigative steps I would take. The first step was to gather as many logs as possible from the cluster, the database engine, and the backup process.
To do that, I called Microsoft Technical Support, because we had an Enterprise Agreement that covered SQL Server support. I explained that the cluster had failed over after a transaction log backup, that the logs didn’t show any obvious cause, and that I needed a deeper dive. The first‑tier support engineer asked me to run a diagnostic tool that would compress all of the relevant hardware, software, and log information into a single archive. I didn’t know this tool, but the engineer promised that they would review the data and call me back later that day. I ran the utility, which took about 30 minutes to gather and compress the data. I emailed the resulting archive to the support engineer and waited for the callback.
By 4 PM the engineer had not yet finished the review. I dug into the forums and other online resources, but I found no reference to a similar cluster failover in SQL Server 7.0 SP3. I called the support engineer again, and he listed three potential causes: insufficient RAM, a CPU bottleneck causing the “Is Alive” check to fail, and a corrupted database. I explained that the cluster had 4 GB of RAM per node and that CPU usage never came close to 100 %. I had also run DBCC CHECKDB a few hours before the failover, and it had come back clean. The engineer promised to dig deeper and get back to me the next day.
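For completeness, the integrity check I had run is a one‑liner; this is a minimal sketch, again using the hypothetical 'ERPDB' name.

    -- Sketch: consistency check against a hypothetical database 'ERPDB'.
    -- A clean run reports 0 allocation errors and 0 consistency errors.
    DBCC CHECKDB ('ERPDB') WITH NO_INFOMSGS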
At that point I had one clear thing: the failure had to be linked to the transaction log backup. I began thinking about the interaction between backup and cluster failure. Backups are a write‑heavy operation that can cause lock contention, but they should not kill the SQL Server process. The only clue I had was the timing: the backup finished at 8:15 AM, the failover began at 8:16 AM. That narrow window suggested that something triggered when the backup finished. I decided to keep a close eye on the cluster logs for any other anomalous events that might surface.
While waiting for the next day’s callback, I sent a quick note to the ERP manager: “We’ve confirmed the cluster has failed over and is back online. I’m still working on why it happened. Please let me know if you see any performance regressions after the failover.” I also started a task list for my next steps: examine the database engine error logs in more detail, review the backup log for I/O errors, and cross‑check the SQL Server configuration for parallelism settings.
By the time the support engineer returned with a partial analysis, I had already compiled a list of possibilities and started testing each one. The next few days would involve more in‑depth log analysis, a round of debugging, and a final decision on how to prevent this from happening again. The rest of this article follows that investigation, the discovery of a parallelism bug that surfaced during a particular report, and the ultimate resolution that kept our production environment stable and responsive.
Digging Into the Logs: From Silent Failure to Suspicious Clues
The second day began with a new batch of alerts. The monitoring system had detected another failover, this time from the secondary node back to the primary. My first instinct was to suspect a persistent problem rather than a one‑off hardware fault. I opened Cluster Administrator and watched all of the cluster resources sit in a pending state for a long minute before they came online. I couldn’t afford to leave the ERP users hanging in a half‑ready state any longer than necessary. I pinged the help desk again and confirmed that the failover had completed successfully. The ERP login page was back, but users were still confused because the application didn’t reconnect automatically.
Now I had to find out why the cluster had flipped twice in one day. I turned to the Windows Event Viewer again; the System log still looked clean. The Application log around the second failover, however, revealed something new: an application error that read, “sqlservr.exe - Application Error : The instruction at '0x4115d3c0' referenced memory at '0x4115d3c0'. The required data was not placed into memory because of an I/O error status of '0xc000026e'.” That message signaled a serious memory fault - the process had tried to read data that could not be brought into memory - possibly caused by a corrupt memory page or a problem in the driver stack. The error appeared right after the cluster had attempted to bring the SQL Server instance online on the primary node.
To dig deeper, I extracted the dump file that the SQL Server service had created when the process crashed. I opened the dump in WinDbg and looked at the call stacks. The fault had occurred inside a user‑defined stored procedure, one that had been written for a custom report, and the procedure appeared on more than 200 worker threads in the dump. That was a huge clue: the procedure was executing with a very high degree of concurrency during the backup window. Perhaps it had not been written with that level of concurrency in mind, and the sheer number of simultaneous executions caused a resource bottleneck that triggered the crash.
I had a hunch that the issue might be related to query parallelism. SQL Server’s parallel execution engine splits queries into multiple threads to utilize all available CPUs. In our cluster, each node had four cores, so the expectation was that queries would be split into no more than four threads. But the dump file suggested that the stored procedure had spawned more than 200 threads. That was far beyond the number of logical processors. The sudden surge in thread count would have overwhelmed the OS scheduler, consumed all available memory, and caused the SQL Server process to terminate. The cluster, in turn, would detect the service crash and fail over to the other node.
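A quick way to see this kind of thread explosion while it is happening is to count execution contexts per session in sysprocesses. The query below is a minimal sketch, assuming the ecid column behaves as documented (ecid greater than 0 marks the extra worker threads of a parallel query).

    -- Sketch: count how many execution contexts (threads) each session is
    -- using right now; a parallel query shows up as one spid with many rows.
    SELECT spid,
           COUNT(*) AS thread_count
    FROM   master..sysprocesses
    GROUP BY spid
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC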
I cross‑checked the server configuration. The ‘max degree of parallelism’ (MAXDOP) setting was at the default of 0, which tells SQL Server to use as many processors, and therefore as many worker threads, as it sees fit for a parallel plan. On the development server, which had the same configuration, the same stored procedure ran without incident. The development server, however, was under light load most of the time, so the query optimizer rarely split the procedure into more than four threads. The production server was always busy, and the optimizer pushed for maximum throughput. That meant the stored procedure was being run with hundreds of threads, which exposed a parallelism bug in SQL Server 7.0 SP3 that was only addressed in later service packs.
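Checking the setting takes a few seconds; here is a minimal sketch. ‘max degree of parallelism’ is an advanced option, so advanced options have to be made visible first.

    -- Sketch: show the current 'max degree of parallelism' value.
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE
    GO
    EXEC sp_configure 'max degree of parallelism'
    GO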
To confirm the hypothesis, I temporarily set MAXDOP to 1 on the production server and re‑ran the report. The report ran, albeit slower, and the cluster remained stable. I also ran the same report on the development server with MAXDOP set to 200, and the server crashed immediately. That reproduced the failure in a controlled environment. I was now certain that the root cause was a parallelism bug triggered by an unbounded query parallelism setting combined with a heavily recursive stored procedure.
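The change itself is just as short; this is a minimal sketch of what I ran for the test (the same statements later became the permanent fix).

    -- Sketch: cap parallelism at one CPU for the whole instance.
    -- Assumes 'show advanced options' has already been enabled, as above.
    EXEC sp_configure 'max degree of parallelism', 1
    RECONFIGURE
    GO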
At that point, I had a clear fix path but also a set of trade‑offs. The most obvious solution was to constrain MAXDOP to a safe value for the entire cluster - either 1 or the number of cores per node. That would prevent the system from ever spawning hundreds of threads for any query, including the problematic stored procedure. The downside was that other queries that benefited from parallel execution would run slower. We also considered adding a MAXDOP hint to the specific stored procedure, but that would have required code changes on the production system, which we wanted to avoid because the report was a critical business function. Another possibility was to patch SQL Server, but that would have required downtime and was not feasible for the 7.0 SP3 environment.
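For reference, the hint‑based alternative looks like this; the table and column names are hypothetical stand‑ins, since I cannot reproduce the real report query here.

    -- Sketch: cap parallelism for a single statement with a query hint
    -- instead of changing the server-wide setting.
    SELECT customer_id,
           SUM(order_total) AS order_total_sum
    FROM   dbo.report_orders
    GROUP BY customer_id
    OPTION (MAXDOP 4)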
Given the urgency and the fact that the stored procedure had already been the cause of two consecutive failovers, we opted for the safest, most conservative approach: we turned off parallelism on the entire production cluster by setting MAXDOP to 1. We documented the change in our configuration management database and added a note to the ERP report documentation explaining the performance trade‑off. We also scheduled a review of all stored procedures for parallelism suitability, but that was a lower priority.
With the MAXDOP change in place, we ran the daily transaction log backup and the report again. No failover occurred. The backup completed cleanly, and the report finished within the expected 13 minutes. Users logged back in without issue, and the cluster remained online. We monitored the cluster for an additional week to confirm stability. No more failovers occurred during that period.
Turning the Tide: Fixing the Parallelism Bug and Securing Stability
After we disabled parallelism on the production cluster, the next challenge was to ensure that the change would not create performance regressions on other parts of the system. The ERP package relies heavily on analytical queries that were previously tuned to run in parallel. We ran a series of performance benchmarks on a staging environment that mirrored the production load. The tests showed that most queries took about 20–30 % longer to execute, but the overall throughput remained within acceptable limits. We adjusted the scheduling of heavy nightly jobs to spread them out and reduce contention. That mitigated the impact on end‑user performance.
We also revisited our backup strategy. The first failover had come right on the heels of the transaction log backup, just as the cluster was trying to restart MSSQLSERVER, so we moved the backup window to a slightly later time slot that avoided the peak hours when the report was running. The new schedule gave the backup a few minutes of quiet time and kept it from overlapping with heavy query activity. We also added a monitoring rule to alert us if a backup ran longer than its expected duration, so that a performance problem would surface before it could escalate into another failover.
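The duration alert is a simple check against the same msdb history shown earlier; this is a minimal sketch, with a placeholder 30‑minute threshold and the hypothetical 'ERPDB' name again.

    -- Sketch: flag any backup that ran longer than a placeholder 30-minute
    -- threshold; wire this check into whatever alerting tool you use.
    SELECT bs.database_name,
           bs.backup_start_date,
           bs.backup_finish_date,
           DATEDIFF(minute, bs.backup_start_date, bs.backup_finish_date) AS duration_min
    FROM   msdb.dbo.backupset AS bs
    WHERE  bs.database_name = 'ERPDB'
      AND  DATEDIFF(minute, bs.backup_start_date, bs.backup_finish_date) > 30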
Parallelism is a powerful feature of SQL Server, but it can also be a double‑edged sword. In this case, the default configuration of unlimited parallelism allowed a single, highly recursive stored procedure to spawn hundreds of threads, exhausting system resources and triggering a cluster failover. The resolution was to constrain the degree of parallelism for the entire cluster, a decision that prioritized stability over raw performance. While the performance hit was measurable, it was acceptable compared to the cost of repeated outages.
We documented the entire incident in our incident‑management system. The incident record includes the timeline, the root cause analysis, the actions taken, and the outcome. It also lists the preventative controls we added: a rule to monitor the number of active threads per query, a monitoring alert for abnormal memory usage, and a scheduled review of all critical stored procedures for parallelism suitability.
Finally, we shared the lessons learned with the wider DBA community. We posted a detailed case study on our company blog, inviting other DBAs to comment on how they would handle a similar situation. The article was well‑received, and many readers shared their own experiences with parallelism bugs. The knowledge base grew, and we now have a reference that will help future teams avoid the same pitfall.
By turning off parallelism, adjusting the backup schedule, and enhancing our monitoring, we eliminated the immediate cause of the cluster failover. The system is now more resilient, and we are better prepared to handle any future anomalies. The experience also reinforced the importance of thorough testing when making configuration changes, especially in a production environment where the impact of a misstep can be costly.




