FAA Software Testing and its New Safety Culture

The most frustrating thing about working for the Federal Aviation Administration (FAA) as an air traffic controller is dealing with the organization itself and its nonsensical attitude and policies.

We just had another team meeting at work.  Team meetings are held occasionally by the supervisor with his working crews theoretically to discuss items of import but they usually tend to digress into rant sessions from frustrated air traffic controllers (whose concerns are routinely ignored).

FAA management is forcing controllers to watch their “Leading Edge” series of videos (which controllers view as wasted propaganda), wherein FAA managers talk about how great the latest “new” FAA programs are and how they’re going to change the organization for the better.

The only problem is that controllers have heard this so many times before, without seeing any real changes, that it has become a joke to hear about new and improved programs within the FAA.

After watching the video one of the other controllers asked our supervisor if he actually believed what they were saying in the video.  The supervisor’s response?  “Some of it.”…

Anyone who has worked for the FAA for long knows the FAA never really changes; even many of the managers (the FAA’s so-called “leaders”) who are ultimately responsible for ensuring that change don’t believe change is possible or likely.  They’re just going through the motions, putting on an appearance of a commitment to safety, and the act is all too obvious to controllers.

The video we watched was yet another about the alleged new safety culture/“just culture” in the FAA, including the ATSAP program.  I’ve written about the ATSAP program before, and have filed numerous reports under it, all of which have been ignored.  One issue was elevated a bit, but the final course of action was to simply brief the controllers (again) on the problem instead of correcting it.

(I’ve used this before here but it’s so accurate I can’t help using it again.  It’s the old joke:  Patient: “Doctor, it hurts when I do this.”  Doctor:  “Then don’t do that.”  Unfortunately that’s the standard FAA “fix” for known problems.)

Remember the ending scene in the movie, “Raiders of the Lost Ark”?  What happens to the Ark is what essentially happens to ATSAP reports.

Here’s the standard ATSAP form message, which basically says, “We’ve looked at your problem, are doing nothing about it, and are filing it away to be ignored”:

Thank you for participating in the ATSAP Program. The Event Review Committee (ATO, NATCA, and AOV) has discussed your report. The information you provided will help us identify the threats and errors that have an impact on the safety of our air traffic system.   The information gained through ATSAP reports might not be discovered by any other means. ATSAP report data leads the way to positive changes in procedures and training.

The ERC unanimously agreed your event can be closed and will be maintained in our database for future analysis.

If you have any questions, you may contact the ATSAP Manager at 1-866-384-0157.

The Event Review Committee

That’s made me give up on the ATSAP program because it’s clear that it’s no different than any of the safety programs in the FAA that preceded it.  It’s another “all talk, no action” program that will do little, if anything, to improve safety within the FAA but looks good on paper and is great for public relations.

One of the biggest indicators that there still is a flawed safety culture within the FAA is that the managers still routinely blame errors on controllers, citing performance shortcomings of the individual.

They’ve changed somewhat because they can no longer punish controllers for their mistakes.  But blaming controllers is simply an easy out when it comes to explaining errors, and it fails to acknowledge deeper problems within the air traffic system.

That’s hardly the “safety culture/just culture” they keep talking about, because in blaming controller performance for errors they fail to own up to the real root cause of many errors.

The root cause of many air traffic errors (especially airspace deviations) is poor, confusing or broken procedures and/or automated systems.

But the FAA continues to use and even create new procedures that are prone to introducing human error into the system, while failing to improve or introduce new methods to reduce human error.

The FAA loves to use the “Swiss cheese” safety model/analogy.  It shows the air traffic system as layers of “Swiss cheese” with holes in each layer.  When a problem goes undetected and uncorrected and makes it through a hole in every layer, the result is an error or accident.

The last defensive layer to be able to prevent a problem or correct a mistake is the air traffic controller.  And the controller has always taken the brunt of blame whenever there is an error, in spite of the fact that the chain of events leading to the error may have started because of some other shortcoming in the organization.

That’s part of why it’s frustrating to be an air traffic controller for the FAA.  Regardless of the poor policies, procedures and equipment they are forced to work with, at the end of the day if something goes wrong the controller will be blamed.

Currently the FAA is developing new computer software called En Route Automation Modernization (ERAM) contracted out to Lockheed Martin.

In a February 19, 2009 press release the FAA stated:

ERAM is on budget and ahead of schedule. The system has been installed by Lockheed Martin at 20 en route centers six months ahead of schedule — meeting a major milestone in the FAA’s Flight Plan.

But contrary to the FAA claims, the deployment schedule for ERAM has actually been delayed significantly by major problems while the FAA continues to “re-baseline” the program and claim it’s on time and on budget.

Good luck even finding an ERAM timetable from the FAA anywhere (including on their own website)!  We used to have a bulletin board with the timetables at our facility (with a nifty LCD countdown clock too) but the dates have changed so much and with such frequency it has since disappeared.

That’s because currently ERAM is so unreliable and bug-ridden it’s unsuitable for use in a live air traffic environment.

So you would think the FAA’s “safety culture” would ensure that ERAM wasn’t used for keeping airplanes full of people apart until it was ready, right?

Wrong.  The FAA continues to plow straight ahead with the ERAM program, in spite of the fact that it has enough serious bugs that it can’t even run for an entire day without significant problems.  (Keep in mind that air traffic is a 24/7 business; the computers controllers use run continuously.)  There also continue to be major problems, like the data tags that controllers use to differentiate between flights not staying on the proper targets.  (In other words, controllers would have no way to be sure which aircraft/flight was which on their displays.)

A recent test of ERAM on live air traffic at Salt Lake Center started at midnight (a time when there is greatly reduced traffic compared to during the day) and was meant to run for 24 hours.  It ran only about 10 hours before they aborted it due to major problems that started occurring in the morning, about the time air traffic started building.

The word is that the contractor, Lockheed Martin, apparently doesn’t think any of the troubles with ERAM are really that significant.  That’s no surprise because they don’t have to try to keep airplanes apart with the system; that’s the air traffic controllers’ problem.

But the bigger problem is the FAA managers who are allowing this program to continue expanding and go into wider testing at more facilities, knowing how bug-ridden it is.

Our facility (as well as others) continues to “test” ERAM too in the wee hours of the morning, forcing controllers to work on our backup computer system while they do testing.  Our backup system has far fewer tools and causes a higher workload for controllers, and thus reduces the margin for error while working live traffic.

But most of the problems that are occurring during the testing are bugs they already know about and have already experienced at other facilities.  No one knows the purpose of replicating these known problems other than to meet deployment/testing timetables.

The FAA insists on deploying and testing ERAM on a wider and wider scale, affecting the workload of controllers working live air traffic who are forced to work on either the buggy ERAM software or the current backup computer system, even though ERAM isn’t even working at the first few facilities it was installed at.

In doing so, the FAA is degrading safety margins in a significant part of the air traffic system for the sake of a widespread software test affecting real aircraft with real people on board.

No one really can explain why the FAA agreed to or chose to test and deploy ERAM the way it has, instead of ensuring that it works at a few select facilities first.  Wouldn’t it make more sense to get the ERAM software stable and working at one facility first before forcing other facilities to start testing it too?

The entire ERAM debacle is a glaring demonstration that the “safety culture” that the FAA claims to have doesn’t exist at all.

But at the end of the day it’s the air traffic controllers who are going to be left holding the bag, just like always.
