A few weeks ago our client began to report errors with our Windows Forms (.NET 2.0) application when she'd retrieve certain records. It was a strange bug that would hang the program for about 60 seconds and then crash without reporting errors. We could reproduce it on the test machine, but in development even while pointing to the same database, everything worked. Perfectly.
I looked in the Event Log and noticed the following error:
"Faulting application epicenter.exe, version 184.108.40.206, stamp 45df84bf faulting module kernel32.dll, version 5.2.3790.2756, stamp 44c60f39, debug? 0, fault address 0x00015e02"
Really, really helpful stuff. And because we couldn't reproduce it in development, on any of our boxes, it was especially confusing. Google searches weren't giving specific answers; the error seemed too general in the realm of .NET 2.0 to make much of.
So how do you pin something like this down?
My approach was as follows:
// code to load data
MessageBox.Show("Data loaded");
// code to display data
MessageBox.Show("Data displayed");
Our application is not small. The form that displays data contains 3300 lines of non-wizard-generated code. Data is loaded by a set of methods in a separate library, which itself may have upwards of 3000 lines, and that library references yet more libraries*. My point is that this isn't a small script where it's easy to throw up a dialog after each line or call.
In the end I spent about 4 hours stepping through each of the major calls via dialog messages. I couldn't debug since our test machine doesn't have Visual Studio or other debugging tools on it (like all the other machines in their offices).
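In hindsight, writing checkpoints to a file might have been less painful than clicking through dialogs on a machine without a debugger. Here is a minimal sketch of that idea, assuming C#; Checkpoint, Mark, LastStep and the log file name are hypothetical names for illustration, not part of our application. Each call appends one line and closes the file, so after a crash the last line shows the last step that was reached.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper, not part of the original application.
public static class Checkpoint
{
    // Assumed location: a file in the user's temp directory.
    public static string LogPath = Path.Combine(Path.GetTempPath(), "checkpoints.log");

    // Appends one timestamped line. File.AppendAllText opens, writes and
    // closes the file on every call, so earlier lines survive a process crash.
    public static void Mark(string step)
    {
        string stamp = DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss.fff", CultureInfo.InvariantCulture);
        File.AppendAllText(LogPath, stamp + " " + step + Environment.NewLine);
    }

    // Returns the last step recorded, or null if nothing has been logged.
    public static string LastStep()
    {
        if (!File.Exists(LogPath)) return null;
        string[] lines = File.ReadAllLines(LogPath);
        if (lines.Length == 0) return null;
        string last = lines[lines.Length - 1];
        // The timestamp is 23 characters, followed by one space.
        return last.Length > 24 ? last.Substring(24) : "";
    }
}
```

Usage would be a Checkpoint.Mark("before load") and Checkpoint.Mark("after load") around each major call. It is the same narrowing-down exercise as the dialogs, but it doesn't need anyone to click OK and it leaves a record behind.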
Although I think we've done a good job breaking the logic into pieces (i.e. one method for getting data, and one for display that is broken into methods for customizations on each object), the bug was still difficult to track down.
And in the end, of course, it was something small and subtle: calling the AutoResizeColumns method of a DataGridView can sometimes throw exceptions, because of the recursion of the resizing combined with the paint operation of the Windows Forms application. What was probably most annoying was that it wasn't even my code that was responsible for the hang and crash; it was the framework's inability to recover from an internal exception.
When I first started troubleshooting I thought I should have built more "tracing" into our application: we trace some exceptions to the database, but not all. But now that I've found the bug, and especially because it wasn't in code we were responsible for writing, I doubt more tracing would have done more than save time. A stack trace would probably have helped as well, if captured at the point of the exception. But my big question is whether there are any well-trodden techniques for finding and dealing with bugs like this, especially if they are in the framework and not in one's own code. How do you deal with stuff like this, or is it always a slog for which there is no silver bullet?
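For what it's worth, the stack trace idea can be sketched with the global exception hooks that Windows Forms and the AppDomain expose. This is a rough sketch, assuming C#; CrashLog, Install, Write and the log file location are hypothetical names, and our real application traces to a database rather than a file. I can't promise it would have caught this particular crash: some failures never reach either handler. A stack overflow, for instance, terminates the process outright on .NET 2.0.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of the original application.
public static class CrashLog
{
    // Assumed location: a file in the user's temp directory.
    public static string LogPath = Path.Combine(Path.GetTempPath(), "crash.log");

    // Call once at the top of Main(), before any form or control is created.
    public static void Install()
    {
        // Send UI-thread exceptions to ThreadException rather than the default dialog.
        Application.SetUnhandledExceptionMode(UnhandledExceptionMode.CatchException);
        Application.ThreadException += new ThreadExceptionEventHandler(OnThreadException);
        // Unhandled exceptions elsewhere (for example on worker threads) are
        // reported here; the process still terminates afterwards.
        AppDomain.CurrentDomain.UnhandledException += new UnhandledExceptionEventHandler(OnUnhandledException);
    }

    static void OnThreadException(object sender, ThreadExceptionEventArgs e)
    {
        Write("UI thread", e.Exception);
    }

    static void OnUnhandledException(object sender, UnhandledExceptionEventArgs e)
    {
        Write("AppDomain", e.ExceptionObject);
    }

    // Exception.ToString() includes the type, message, inner exceptions and stack trace.
    public static void Write(string source, object exception)
    {
        string entry = DateTime.Now.ToString("s") + " [" + source + "]" + Environment.NewLine
            + exception + Environment.NewLine;
        File.AppendAllText(LogPath, entry);
    }
}
```

The point of logging the whole exception rather than just the message is that the stack trace shows whether the failing frames are in our code or in the framework, which is exactly what took me four hours of dialogs to work out.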
*I think the design of the application is okay (duh, I'm the one responsible) but it may sound more convoluted in that statement than it actually is. We've got a DataHelper library that is covered by unit tests which does all of our data access at the database level. It runs stored procedures, deals with parameters, and so on. One level of abstraction higher we've got a set of business layer objects for dealing with the entities related to our application: loans, disbursements, checks and so on. The Windows Forms application is what we use to let the user display/enter/modify data. We've got a few additional libraries for tracing, configuration, format, data validation, and Crystal Reports. Not so bad I hope...