I was recently on the receiving end of that horrible sinking feeling familiar to anyone involved with databases - a mission-critical production SQL Server simply stopped doing its thing. Eventually, I ended up killing the SQLSERVR.EXE process in Task Manager, and the server came back up after some rollbacks and roll forwards.
Delving into the logs, I found a couple of things:
- There were a few recent instances of access violations that matched this description pretty closely: http://support.microsoft.com/kb/821548. The problem was with parallel query execution in the ParallelPageSharedData::AddRef function. In the text dump files in the MSSQL directory, I found the query producing the access violation, and indeed it was reproducible. Interestingly, it still returned correct results. (One possible stopgap for a query like this is sketched just after this list.)
- One of the occurrences of this came about two minutes before the first of several messages along the lines of "The Scheduler 0 appears to be hung. SPID 7, ECID 0, UMS Context 0x045530B8." This is a four-processor hyper-threaded machine, and the eight schedulers dropped off one by one until the server stopped. This looked very much like http://support.microsoft.com/kb/815056.
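While we were still working out what to do about the access violation, one stopgap for a reproducible parallel-query crash like this is to force the offending statement to run serially with a MAXDOP hint. The query below is purely illustrative - the table and column names are invented - and is a sketch of the technique rather than the actual query from our dump files:

```sql
-- Illustrative only: force a single statement to run without parallelism,
-- sidestepping the parallel execution code path that was crashing.
-- Table and column names here are made up for the example.
SELECT o.CustomerID, SUM(o.Amount) AS Total
FROM dbo.Orders AS o
GROUP BY o.CustomerID
OPTION (MAXDOP 1)  -- serial plan for this statement only
```

The heavier-handed alternative is changing the server-wide 'max degree of parallelism' setting via sp_configure, but that affects every query on the box, so the per-statement hint is usually the less disruptive choice.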
Things were back up and running for the time being, but clearly we couldn't let this happen again. The conclusion was to install service pack 4:
http://support.microsoft.com/kb/888799. After a couple of days running it on one of our test servers, and some general testing, we were ready to come in at the weekend and install it on the production server.
The installation itself went well, and things seemed to be running. The first sign of trouble was some slowness when one of the remote offices started work late on Sunday evening. By the time the server was under its normal load on Monday morning, it was clear things weren't going well. Everything was slow, and the disk I/O queues on the server's data partition were looking terrible. Clearly something had changed, and looking at the list of bug fixes subsequent to SP4, it seemed likely that whatever was causing the problem had already been fixed: http://support.microsoft.com/kb/894905. In particular, we were suffering from our 16GB server only using 8GB of memory (http://support.microsoft.com/kb/899761), when it usually uses around 14GB.
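For what it's worth, the memory symptom is straightforward to see from the server itself. The sketch below, assuming the usual AWE configuration for a 16GB box, shows the sort of checks we were making - comparing the configured ceiling with what SQL Server actually reports using:

```sql
-- Check the AWE and memory settings (requires advanced options to be visible).
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'awe enabled'             -- should be 1 on a 16GB box
EXEC sp_configure 'max server memory (MB)'  -- the ceiling SQL Server is allowed

-- How much memory the server is actually using right now.
SELECT cntr_value / 1024 AS total_server_memory_mb
FROM master.dbo.sysperfinfo
WHERE counter_name = 'Total Server Memory (KB)'
```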
This was particularly worrying, because it was clear going into the SP4 installation that rolling it back was not going to be easy. We had, of course, backed up the master, model and msdb databases before the installation, but a rollback still looked painful and time-consuming, with downtime we really wanted to avoid.
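For reference, the pre-installation backups were nothing exotic - just full backups of the three system databases to disk before touching anything. Something along these lines, with the paths made up for illustration:

```sql
-- Illustrative paths only: full backups of the system databases
-- taken immediately before running the service pack installer.
BACKUP DATABASE master TO DISK = 'E:\Backup\master_pre_sp4.bak' WITH INIT
BACKUP DATABASE model  TO DISK = 'E:\Backup\model_pre_sp4.bak'  WITH INIT
BACKUP DATABASE msdb   TO DISK = 'E:\Backup\msdb_pre_sp4.bak'   WITH INIT
```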
I eventually settled on a hotfix rollup to bring SQL Server to build 2187: http://support.microsoft.com/kb/916287. Things were a little beyond the stage of careful, reasoned testing, so after running the rollup on the test server for a couple of hours, we scheduled some downtime and installed it in production. Things improved immediately, and stayed back at the state they had been in before the nightmare that is SP4 was installed.
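If you want to confirm where you have ended up after a rollup like this, the build number is easy to check; build 2187 should show up as product version 8.00.2187 on SQL Server 2000:

```sql
-- Confirm the build after applying the rollup.
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('ProductLevel')   AS ProductLevel
```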
I was more than a little surprised that there didn't seem to be much about these particular problems with SP4 on the web. Perhaps they only affect reasonably big servers - lots of memory and multiple processors - but even so, there wasn't much to warn us off the idea that SP4 would be a stable, well-tested update. Something of a lesson learned - a day of barely-suppressed panic to get back to where we were at the start of the day - although I'm not sure it would have been easy to be any wiser before the event.