Monday, February 26, 2007

Monsterbugs: Any silver bullets?


A few weeks ago our client began to report errors with our Windows Forms (.NET 2.0) application when she'd retrieve certain records. It was a strange bug that would hang the program for about 60 seconds and then crash without reporting errors. We could reproduce it on the test machine, but in development even while pointing to the same database, everything worked. Perfectly.

I looked in the Event Log and noticed the following error:
"Faulting application epicenter.exe, version, stamp 45df84bf faulting module kernel32.dll, version 5.2.3790.2756, stamp 44c60f39, debug? 0, fault address 0x00015e02"
Really, really helpful stuff. And especially because this wasn't something we could get in development, on any of our boxes, it was confusing. Google searches weren't giving specific answers - the error seemed to general in the realm of .NET 2.0 to make much of -

So how do you pin something like this down?

My approach was as follows:

// code to load data
MessageBox.Show("Loaded data");
// code to display data
MessageBox.Show("Displayed data");

Our application is not small. The form that displays data contains 3300 lines of non-wizard generated code. Data is loaded from a set of methods in a separate library which itself may have upwards of 3000 lines, not to mention that that library references yet more libraries*... my point here is that it's not a small script to throw a dialog up after each line or call.

In the end I spent about 4 hours stepping through each of the major calls via dialog messages. I couldn't debug since our test machine doesn't have Visual Studio or other debugging tools on it (like all the other machines in their offices).

Although I think we've done a good job breaking the logic into pieces (ie. one method for getting data, one for display that is broken into methods for customizations on each object) it was still difficult to discover.

And in the end, of course, it was something small and subtle: calling the AutoColumnsResize method of a DataGridView can sometimes throw exceptions - because of the recursion of the resizing combined with the paint operation of the Windows Forms application. What was probably most annoying was that it wasn't even my code that was responsible for the hanging exception, it was the framework's inability to recover from an internal exception.

When I first started troubleshooting I thought I should have elaborated more on a "tracing" level in our application - we trace some exceptions but not all to the database. But when I finally found the bug, especially because it wasn't in the code we were responsible for writing, I doubt it would have done more than just save time. A stack trace would probably have helped as well, if caught at the point of the exception. But my big question is whether there are any well trodden techniques for finding and dealing with bugs like this, especially if they are in the framework and not in one's own code. How do you deal with stuff like this, or is it always a slog for which there is not silver bullet?

*I think the design of the application is okay (duh, I'm the one responsible) but it may sound more convoluted in that statement than it actually is. We've got a DataHelper library that is covered by unit tests which does all of our data access at the database level. It runs stored procedures, deals with parameters, and so on. One level of abstraction higher we've got a set of business layer objects for dealing with the entities related to our application: loans, disbursements, checks and so on. The Windows Forms application is what we use to let the user display/enter/modify data. We've got a few additional libraries for tracing, configuration, format, data validation, and Crystal Reports. Not so bad I hope...



Mike said...

There are no errors like unreproducable errors. :)

Not sure there is a silver bullet for these situations short of building everything yourself, and that is obviously overkill here considering how productive the MS tools can make you the 99.9% of the time they aren't breaking.

Coming from the web world, there are lots more options in which libraries you use for things, and very often this decision comes down to who documents their libraries the best. With a single-vendor system like Windows Forms it is more complicated since you either deal with it or choose a new line of work.

Did you find out why the development systems didn't break in the same way when looking at the same database? That is curious.

David Seruyange said...

I'm in the hunt for the answer to your question Mike - I'm trying write something that tries to emulate that paint/autosize clash just for curiosity's sake.