Did a 16-bit counter overflow shut down Comair?



By Richard M. Smith of www.ComputerBytesMan.com
December 28, 2004

On Christmas Day last Saturday, Comair Airlines had to ground all of its flights due to computer problems. Comair blamed the outage on its pilot scheduling software, which was overloaded after bad weather earlier in the week forced many flights to be rescheduled. Comair now hopes to have all of its 1,100 daily flights restored by December 29th.

An article published today on the Cincinnati Post Web site provides some interesting details about the failure of Comair's pilot scheduling software:

How it happened
http://www.cincypost.com/2004/12/28/comp12-28-2004.html
According to the article, Comair is running a 15-year-old scheduling software package from SBS International (www.sbsint.com) on an IBM AIX server. The software has a hard limit of 32,000 schedule changes per month. With all of the bad weather last week, Comair apparently hit this limit and was then unable to assign pilots to planes.

It sounds like 16-bit integers are being used in the SBS International scheduling software to identify transactions. A limit of roughly 32,000 is consistent with a signed 16-bit integer, whose largest value is 32,767. Given that the software is 15 years old, this design decision was perhaps made to save memory. In retrospect, 16-bit integers were probably not a wise choice.
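To make the suspected failure mode concrete, here is a minimal sketch in C of a counter stored in a signed 16-bit integer. Everything in it is an assumption for illustration: the `schedule_log` type and the `record_change` function are hypothetical names, and nothing here is taken from the SBS International software, whose internals are not public. The sketch only shows where such a counter runs out of room and how a guarded increment can report the problem instead of silently wrapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of schedule changes made this month.
   Assumption: the count is kept in a signed 16-bit integer,
   so the largest value it can hold is INT16_MAX (32,767). */
typedef struct {
    int16_t changes_this_month;
} schedule_log;

/* Record one more schedule change.
   Returns false, without touching the counter, once the counter
   has reached INT16_MAX, so the caller can fail loudly instead
   of carrying on with a wrapped or corrupted count. */
bool record_change(schedule_log *s)
{
    assert(s != NULL);
    if (s->changes_this_month == INT16_MAX) {
        return false;   /* counter is full: change 32,768 does not fit */
    }
    s->changes_this_month++;
    return true;
}
```

Starting from zero, the first 32,767 calls to `record_change` succeed and the next one returns false. Widening the field to a 32-bit or 64-bit type would raise the ceiling far beyond anything a month of bad weather could produce.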

An anonymous message posted to Slashdot the day after Christmas first described the software failure at Comair:

http://slashdot.org/comments.pl?sid=134005&cid=11185556

Earlier this year, a similar counter overflow problem in a Windows-based FAA server shut down air traffic control over southern California for 3 hours:
Microsoft server crash nearly causes 800-plane pile-up
http://www.techworld.com/opsys/news/index.cfm?NewsID=2275

The FAA server likely ran into a known design flaw in the Windows operating system, one that affects all versions of Windows including WinXP and Windows Server 2003. The GetTickCount Win32 API function returns the number of milliseconds since the system was started as a 32-bit DWORD, which wraps around to zero after 49.7 days. Here's how Microsoft puts it:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp

"The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days. If you need a higher resolution timer, use a multimedia timer or a high-resolution timer."
GetTickCount really should be returning a 64-bit integer. I am surprised that this one got by the original Win32 API designers. Any application code that calls GetTickCount needs to be written very carefully so that it does not assume GetTickCount will always return an ever-increasing number. Apart from the wrap-around note quoted above, Microsoft provides no warning about this issue as far as I can see.
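As an illustration of what "carefully written" can mean, here is a small sketch in C of one common wrap-safe idiom: subtract the two tick readings as unsigned 32-bit values and compare the difference, rather than computing an absolute deadline. The function names are hypothetical, and the readings are passed in as plain `uint32_t` values standing in for what GetTickCount would return, so the sketch does not depend on Windows. It is not taken from Microsoft's documentation or from the FAA system.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 32-bit millisecond counter wraps after 2^32 ms:
   4,294,967,296 ms / 86,400,000 ms per day = about 49.7 days. */

/* Milliseconds elapsed between two tick readings.
   Unsigned subtraction is performed modulo 2^32, so the result
   is correct across one wrap, provided the real interval is
   shorter than 2^32 ms. */
uint32_t ticks_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Wrap-safe timeout check: compare the elapsed interval. */
bool timeout_expired(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    return ticks_elapsed(start, now) >= timeout_ms;
}

/* Fragile version, shown for contrast: it computes an absolute
   deadline, which can itself wrap past zero and then looks as
   if it has already passed. */
bool timeout_expired_naive(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    uint32_t deadline = start + timeout_ms;   /* may wrap */
    return now >= deadline;
}
```

For example, with a start reading of 0xFFFFFF00 (256 ms before the wrap) and a 1000 ms timeout, a reading taken 16 ms later makes the naive check report an expired timeout, while the subtraction-based check correctly reports that it has not expired.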

When GetTickCount overflows, it can easily make applications stop working properly. In addition, the design flaw can break the Windows operating system itself, including Windows 2000:

http://support.microsoft.com/default.aspx?scid=kb;en-us;823273

Additional links