Thursday, January 11, 2007

Service Unavailable - Application pool crash

HOWTO: Understand and Diagnose an Application Pool Crash
Problem statements similar to the following questions pop up all the time on various IIS newsgroups, and the user usually claims that they have seen (or not seen) many posts that look like theirs, but never any concrete solutions. I am going to try to explain the whole thought process, why things work the way they do, as well as useful next steps.

Question:
#1

Our production server has recently started experiencing AppPool crashes. These seem to occur sporadically - sometimes three times a day, sometimes not for a couple of days. The error manifests itself to the client as "Service Unavailable". In the System event log we see the following:
A process serving application pool 'DefaultAppPool' suffered a fatal communication error with the World Wide Web Publishing Service With error number: 8007006d

At the same time (but not always) we see this error in the Application Log
"Faulting application w3wp.exe, version 6.0.3790.0, faulting module kernel32.dll, version 5.2.3790.0, fault address 0x000249d3"

The application does usually start working again after anywhere between 2 and 15 minutes (although no worker process restarts or recycles appear in our perf logs).

This problem seems to have started occurring after the latest MS patches were applied to our server. It may just be a coincidence, though.

We are running Windows Server 2003 Standard. We are about to apply SP1 in an attempt to solve this problem, but I wanted to find out if anyone else has had this problem and what the solution was. I have seen several similar posts but nothing concrete as a solution.

I have seen this article (http://support.microsoft.com/Default.aspx?id=885654), but it doesn't quite match our situation, as we are not running as a domain controller. We haven't yet tried running Registry Monitor, as I wanted to find out whether error 8007006d always equates to a registry permission problem before installing 3rd-party freeware onto our production server.

Any help is much appreciated

#2

There is a similar question posted today, but the event ID is different

I see the following error anywhere from 2-3 times a day on a Windows 2003 server.
Event ID 1009 - A process serving application pool 'Q' terminated unexpectedly. The process ID was 'xxxx'. The process exit code was 'xxx'.

So far we have not noticed anything on the application side, but once in a while W3WP.EXE starts using up all the CPU and the server comes to a halt. This is not associated with the application pool terminating, but we are wondering if it is contributing to it.

I have applied SP1 for Windows 2003, that did not make any difference.

Any help is appreciated.

#3

Hello everyone!

I've got a problem with my Windows Server 2003 SP1 Web Edition server.

When I look in the Event Viewer, I always find the following error:

Event Type: Error
Event Source: Application Error
Event Category: (100)
Event ID: 1000
Date: 28.08.2005
Time: 22:00:26
User: N/A
Computer: {removed by the author}
Description:
Faulting application w3wp.exe, version 6.0.3790.1830, faulting module , version , fault address 0x.

Why does this error occur? What is it and what can I do to resolve this problem?

Please help me

Answer:
You are looking at what is commonly referred to as a "crash" or "access violation" on the server. To be clear, this is different from a "hang" on the server, though the web browser may appear to "hang", or its loading icon may spin for some time (the length of time depends on various network connection timeout periods as well as whether the server-side connection stays open or closed), before reporting some seemingly random error about a missing server, DNS failure, or other service error.

Crash vs. Hang
What distinguishes a crash from a hang on the server? Ok, for the astute readers in the back, please bear with this simplification. We want folks to internalize and understand the issue, not regurgitate dry documentation. :-)

Practically speaking, a crash is something that simply wipes out its host process; this stops whatever server-side work and response generation the process was supposed to perform. Now, prior to IIS6, this same process also held the connection open, so as soon as a user-mode crash happened, IIS would go down and the client browsers would see a dropped connection and usually report some sort of service disconnected/not-found error. With IIS6 Worker Process Isolation Mode, HTTP.SYS holds the connections in kernel mode, so even if user code running in w3wp.exe crashes, the connection stays connected and IIS starts up a new process to handle future requests - so browser clients will NOT see a disconnection for a crash. Unfortunately, the request caught by the crash cannot be re-executed (suppose that request was a bank withdrawal...) unless you implemented transaction semantics, so any unsent response is lost.

A hang, on the other hand, will usually keep its host process around, but it prevents any real work from being done. There are many possible ways for this to happen - user code can be waiting for a lock that never gets released, either because it was leaked or because of a logical deadlock or livelock, or the code could be stuck in a clever infinite loop, etc. Once again, any unsent response will never be sent once your code is hung.
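
To make the hang case concrete, here is a tiny illustrative sketch (plain Python, not IIS code; the worker names and timings are made up purely for demonstration) of the classic lock-ordering deadlock that keeps a process alive while preventing any further work:

# Illustrative only (plain Python, not IIS code): a classic lock-ordering
# deadlock of the kind that leaves a process alive but unable to make progress.
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with lock_a:
        time.sleep(0.1)       # give worker_2 time to grab lock_b
        with lock_b:          # never acquired - worker_2 holds it
            pass

def worker_2():
    with lock_b:
        time.sleep(0.1)       # give worker_1 time to grab lock_a
        with lock_a:          # never acquired - worker_1 holds it
            pass

t1 = threading.Thread(target=worker_1, daemon=True)
t2 = threading.Thread(target=worker_2, daemon=True)
t1.start()
t2.start()
t1.join(timeout=2)
t2.join(timeout=2)

# The process is still "alive" and looks healthy from the outside, but both
# workers are stuck forever and no further work will ever be done.
print("still hung:", t1.is_alive() and t2.is_alive())   # prints: still hung: True

Nothing here terminates the process, which is exactly why a hang looks so different from a crash on the server, even though both leave the browser without a complete response.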

As you can see, from the perspective of the client browser, both a crash and a hang on the server can prevent a complete HTTP response from being sent back, so they can LOOK similar. Add to that the fact that browsers may have bugs that cause them to crash or hang themselves, and that sound security practice on the server should limit disclosure of error details to the client... so I never use browser behavior or returned HTML to diagnose server-side issues - I always diagnose based on information from the various server log files.

About Diagnosing AppPool Failures...
Unfortunately, there are multiple event log entries from IIS that indicate a "crash" has happened, so you cannot just key in on any particular event ID for a resolution pattern. For example, the earlier questions illustrate two such events, and there are other related ones.

Now, we did not intentionally try to make crashes harder to diagnose by making them appear as different events. What is happening is that the W3SVC component of IIS and the user-code execution component of IIS run in separate processes which execute independently of each other, yet asynchronously pass messages back and forth to indicate status. Suppose something crashes the process responsible for user-code execution...

Sometimes, IIS first notices it when a ping response fails to arrive; other times, IIS notices that the process handle has gone away... and while IIS understands that these are both catastrophic events that should be reported in the event log, it maintains good system design by reporting them uniquely. It is the responsibility of any analysis layer on top of IIS to abstract away such detailed reporting and present logical "process crashed" information to the user so that they can take action.
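
To see this in practice, here is a rough sketch that pulls the related entries out of the event logs side by side. It assumes Python with the pywin32 package installed on the server, and the event IDs listed are just the ones quoted in the questions above plus the related W3SVC health-monitoring ones - adjust them to match what your server actually reports:

import win32evtlog
import win32evtlogutil

# (log name, event source, event IDs of interest). The IDs are the ones
# quoted in the questions above plus related W3SVC health-monitoring events;
# adjust this list for what your server actually reports.
WATCH = [
    ("System", "W3SVC", {1009, 1010, 1011}),
    ("Application", "Application Error", {1000}),
]

flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ

for log_name, source, ids in WATCH:
    handle = win32evtlog.OpenEventLog("localhost", log_name)
    scanned = 0
    while scanned < 5000:                      # bound how far back we look
        records = win32evtlog.ReadEventLog(handle, flags, 0)
        if not records:
            break
        scanned += len(records)
        for rec in records:
            if rec.SourceName == source and (rec.EventID & 0xFFFF) in ids:
                # For "Application Error" entries you may also want to check
                # that rec.StringInserts mentions w3wp.exe.
                print(rec.TimeGenerated, log_name, source, rec.EventID & 0xFFFF)
                print("   ", win32evtlogutil.SafeFormatMessage(rec, log_name))
    win32evtlog.CloseEventLog(handle)

When you see a W3SVC entry and an Application Error entry with matching timestamps, they are almost certainly two reports of the same crash, not two separate problems.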

Unfortunately, that analysis layer is frequently the user's brain, which may not be able to abstract away the details... and gets confused. But hey, I do not think that IIS should stop giving the details. On the contrary, I think you just have to look at the problem harder. :-) Or complain loudly and get us to provide a better debugging tool (like DebugDiag...).

What is a Crash
You can consider a crash as an unrecoverable event that resulted from some bug in the program that is executing - in the case of IIS, the program simply provides a thin process and support infrastructure for your user code to run in. And a bug is basically a logical flaw that results in unintended behavior given some arbitrary set of inputs. Notice that the behavior is unintended, and the set of causes is arbitrary. This basically means that bugs can cause crashes that happen sporadically or periodically... all depending on the set of causes, which is arbitrary!

Now, since the set of causes of a given bug is arbitrary, I would also caution against trying to "fix" crashes by blindly installing hotfixes or Service Packs, or by making configuration changes. Crashes are caused by bugs, which are logical flaws, and the only way to "fix" the situation is to either:

1. Fix the logical flaw itself, which requires diagnosing the crash to figure out the root of the problem.
2. Change the software configuration to avoid the logic flaw causing the crash, which can also require diagnosing the crash to figure out what is causing the issue.
To make things even more interesting... a variety of logic flaws can cause crashes, all of which look the same from an IIS and event log perspective (to IIS, the crash ended the process; it does not matter how, so it just reports it).

Thus, you may see similar looking events, sometimes with similar looking error codes, but no single concrete solution. The reason should be clear to you now - one flaw causing a crash with a certain error code is NOT the same as another flaw causing a crash with an identical error code. You are talking about two different flaws, and depending on the code path taken by your server configuration choices, you may need to do different things.

So, the take-away here is that the event log entry, the error code, and any other details you may discern are simply good clues to what is going on, but none are independently reliable for diagnosis. Treat them as pieces of a puzzle that you need to put together to correctly diagnose the issue, which is ultimately what you are trying to identify and resolve.

For example, I treat these events as simply crashes that need to be caught and their stack trace logs analyzed to determine further action. I would not immediately pattern match crashes to "solutions" without other information, and I certainly would not change any system configuration in response.

Frankly, I think it is rash for users to attempt to resolve their crashes without diagnosing the cause. However, most users seem to love pattern matching problem symptoms and event log entries against supposed solutions, blindly trying them all and hoping some might work... all the while sinking deeper into some other problem due to their random changes. And the rationale should be clear now - if you do not know the bug causing the fault, how can you determine the configuration change to avoid that bug's code path, or find the right patch to fix the bug?

How to approach the Crash
Ok, now that I have thoroughly trashed most people's usual methods of "dealing" with a crash, let me walk through another troubleshooting pattern on IIS. :-) I realize that you are probably under the gun to take some action to fix a problem, so you are willing to try anything and if it works, great; if it doesn't, then there is always product support or newsgroup/forums support to fall back on. I just want to propose a more actionable way to deal with a crash, so that you may be able to take care of things yourself... doesn't that feel good? :-)

Since you never notice a server crash until you interact with it (usually with a browser), when something unexpected happens and you do not think it is a bug in the application (a "hanging" browser or server-error responses are reasonable clues), it is time to look for any signs of a crash on the server. When IIS6 runs in its default Worker Process Isolation Mode, you will see event log entries similar to the ones given in the questions - either IIS noticed that the w3wp.exe process handle signaled a process exit/crash of some sort, or w3wp.exe failed to respond to a ping, or Windows noticed that a process crashed. In IIS5 Compatibility Mode, you will probably notice other event log entries, either saying that the IISADMIN or W3SVC service crashed and has done so # times, or that dllhost.exe has crashed, etc. Anyway, all of these events talk about something related to IIS crashing; you have no idea whether it is due to your code or not.
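
If you are not sure which mode your server is running in (and therefore which family of event log entries to expect), a quick check of the IIs5IsolationModeEnabled metabase property will tell you. A sketch, again assuming pywin32 and the IIS ADSI provider that ships with IIS6:

import win32com.client

# Read the W3SVC node of the metabase through the IIS ADSI provider.
w3svc = win32com.client.GetObject("IIS://localhost/W3SVC")

# True  -> IIS5 Compatibility Mode (inetinfo.exe / dllhost.exe host user code)
# False -> Worker Process Isolation Mode (w3wp.exe hosts user code)
print("IIS5 compatibility mode:", bool(w3svc.Get("IIs5IsolationModeEnabled")))

Knowing the mode up front tells you which kind of event log entries to look for when the next failure hits.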

Now that you have identified that your issue is a crash, it is time to set up debugging traps so that you can catch the NEXT crash and diagnose it. Yes, you heard me - you cannot do anything about the crash that has already happened - and since you have no idea what caused it, you CANNOT change any server settings to avoid it. So, the only reasonable thing you can do is to set up debugging monitors like IIS State or DebugDiag on the processes running the code that is crashing and then WAIT for the next crash. Hopefully, you can trigger the failure condition easily, to shorten this wait.
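
For example, if you prefer a script over the DebugDiag or IIS State user interfaces, a sketch like the following arms ADPlus (which ships with the Debugging Tools for Windows) in crash mode against every running w3wp.exe. The install and dump folder paths here are assumptions, and you should verify the switches against the ADPlus version you actually have installed:

import subprocess

# Assumed paths - adjust for your installation. ADPlus ships with the
# Debugging Tools for Windows; the dump folder must already exist.
DEBUGGERS_DIR = r"C:\Program Files\Debugging Tools for Windows"
DUMP_DIR = r"C:\dumps"

# Attach in crash mode to every w3wp.exe that is currently running and write
# dumps plus a log when one of them faults.
subprocess.run(
    ["cscript", "adplus.vbs",
     "-crash",             # wait for an unhandled exception / process exit
     "-pn", "w3wp.exe",    # monitor all worker processes by name
     "-o", DUMP_DIR,       # where dump files and logs are written
     "-quiet"],            # suppress interactive dialog boxes
    cwd=DEBUGGERS_DIR,
    check=True,
)

Leave the monitor armed and go about your business; the dumps and stack trace log only get written when the next crash actually happens, which is exactly what you are waiting for.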

On this next crash, these tools will produce a stack trace log as well as memory dumps to allow debugging, and you want to either analyze the stack trace log yourself, pay someone to perform an analysis (for example, Microsoft PSS), or post to a newsgroup like microsoft.public.inetserver.iis to see if anyone will do a free analysis. Only after analysis of the crash can you determine what is truly going wrong.

Hopefully, you only have one crash happening on your server, but even if there are multiple crashes, you simply apply the same technique serially. You catch one crash, resolve it, get a patch/fix, and run again with debugging enabled to catch the next crash, get it resolved with a patch/fix, etc... As developers will say, crashes are the most straightforward problems to diagnose and fix - they are truly the low-hanging fruit of bad behavior on the server, so you should pick them off early...

Now, I advise against tracking down multiple crashes in parallel, because you simply do not know whether the crashes are caused by the same bug or by different bugs. If they are caused by the same bug, you are just wasting time diagnosing the other crashes. If they are caused by different bugs, you have no idea whether the bugs interact with each other in causing the crash. Ideally, you would only track non-interacting bugs in parallel, because interacting bugs may no longer be real bugs after you fix the original issue (so you would again be wasting resources). In short, tracking crashes in parallel can be complicated and the pay-off is not certain... you have been warned!

Yes, this is very similar to how the IIS product team approaches bug fixing during our stress/reliability test runs. We start up all IIS processes under the debugger and monitor them for any problems, and as soon as anything crashes/hangs, the debugger is already there so we can diagnose the first occurrence instead of waiting for a second occurrence. And as soon as we get a fix, we just crank everything up again under debuggers and wait for the next failure. Highly efficient and no wasted effort.

Conclusion
I hope that this helps clarify what is a crash on IIS and how to best deal with it.

Resist your natural urge to pattern match event log messages or failure codes to a solution, and do not be discouraged if you do not find your particular failure code or if others have similar but supposedly unsolved issues. Crashes are usually arbitrary and require stack trace analysis to determine the real cause and the next step. If you do not use tools like IIS State or DebugDiag to catch the crash, you will rarely figure out the real culprit and the correct next step. Pattern matching is nothing more than a random guess, so do not take chances.

Personally, I always attach a debugger to gather a stack trace whenever I suspect a crash. Depending on your debugging skills, this can tell you a whole lot of information, sometimes enough to directly fix the issue. Guessing at solutions based on non-specific symptoms can never do this reliably. It is all up to you. :-)

//David


Thanks to David
