An article published today on the Cincinnati Post Web site provides some interesting details about a failure in Comair's pilot scheduling software:
How it happened
According to the article, Comair is running a 15-year-old scheduling software package from SBS International (www.sbsint.com) on an IBM AIX server. The software has a hard limit of 32,000 schedule changes per month. With all of the bad weather last week, Comair apparently hit this limit and was then unable to assign pilots to planes.
It sounds like the SBS International scheduling software uses 16-bit integers to identify transactions. A signed 16-bit integer tops out at 32,767, which fits the reported limit of roughly 32,000 changes. Given that the software is 15 years old, this design decision was perhaps made to save memory. In retrospect, 16-bit integers were probably not a wise choice.
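As a rough illustration of that guess: if change identifiers really were stored as signed 16-bit integers (an assumption; the article does not confirm the field width), the counter would run out at 32,767. The Python sketch below mimics that storage with `ctypes.c_int16`. The name `next_change_id` is hypothetical and is not taken from the SBS software.

```python
import ctypes

# Largest value a signed 16-bit integer can hold.
INT16_MAX = 2**15 - 1  # 32767


def next_change_id(current: int) -> int:
    """Increment a counter the way a signed 16-bit field would store it.

    ctypes.c_int16 keeps only the low 16 bits and does no overflow
    checking, so it behaves like fixed-width storage.
    """
    return ctypes.c_int16(current + 1).value
```

Calling `next_change_id(32767)` yields -32768 rather than 32768, so a program that expects identifiers to keep growing would suddenly see a negative one. A wider counter, or an explicit check before incrementing, would avoid the silent wrap.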
An anonymous message posted to Slashdot the day after Christmas first described the software failure at Comair:
http://slashdot.org/comments.pl?sid=134005&cid=11185556
Earlier this year, a similar counter overflow problem in a Windows-based FAA server shut down air traffic control over southern California for 3 hours:
Microsoft server crash nearly causes 800-plane pile-up
The FAA server likely ran into a known design flaw in the Windows operating system. This design flaw affects all versions of Windows, including Windows XP and Windows Server 2003. The GetTickCount Win32 API function returns the number of milliseconds since the system was started as a 32-bit DWORD, which overflows after 49.7 days. Here's how Microsoft puts it:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
GetTickCount really should return a 64-bit integer. I am surprised that this one got by the original Win32 API designers. Any application code that calls GetTickCount must be written very carefully so that it does not assume GetTickCount always returns an ever-increasing number. Microsoft provides no warning about this issue as far as I can see.
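The 49.7-day figure and the wraparound hazard can be sketched in a few lines. This is a simulation in Python, not a call to the Win32 API: tick values are plain integers in the DWORD range, and the function names are hypothetical. Python integers never overflow, so the modulus is applied explicitly; in C, subtracting two unsigned 32-bit values gives the same wrapped result automatically.

```python
TICK_MODULUS = 2**32               # a DWORD holds values 0 .. 2**32 - 1
MS_PER_DAY = 24 * 60 * 60 * 1000   # milliseconds in one day


def days_until_wrap() -> float:
    """Days a 32-bit millisecond counter can run before wrapping to zero."""
    return TICK_MODULUS / MS_PER_DAY


def naive_elapsed_ms(start: int, now: int) -> int:
    """Assumes the tick count only ever grows; goes negative after a wrap."""
    return now - start


def elapsed_ms(start: int, now: int) -> int:
    """Wrap-safe difference of two 32-bit tick readings.

    Correct only if the real interval is shorter than one full wrap
    (about 49.7 days).
    """
    return (now - start) % TICK_MODULUS
```

For example, with a start reading of `0xFFFFFF00` taken just before the wrap and a later reading of `0x00000100` taken just after it, `naive_elapsed_ms` returns a large negative number while `elapsed_ms` returns 512. Code that compares raw tick values, for instance to decide whether a timeout has expired, fails in the same way as the naive version.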
"The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days. If you need a higher resolution timer, use a multimedia timer or a high-resolution timer."
When GetTickCount overflows, applications can easily stop working properly. In addition, the design flaw can break the Windows operating system itself, including Windows 2000: