More Significant ERAM Problems

Salt Lake Center (ZLC) reverted to the HOST computer system last night because of major problems, after starting an ERAM run last week that was supposed to be permanent.

I’m sure the FAA and the contractor Lockheed Martin will write it off as just another “glitch” (i.e. part of the development cycle), but it’s another glaring demonstration of how unreliable the ERAM software still is, even though the FAA continues to test it on live traffic, expecting air traffic controllers to simply work around its many problems and keep aircraft safely separated nonetheless.

ZLC started running ERAM on what was supposed to be a permanent basis on the morning of Wednesday, February 17.

They had previously completed an eight-day test that ended the first week of February, followed by a two-week delay in which Lockheed Martin was supposed to correct the (known) bugs in the software before ZLC began using the new version permanently.

The latest failure shows that, in spite of the software updates, ERAM obviously still has a long way to go before it’s fit to use on live traffic 24/7.

Notably the event marking the first enroute center to transition to ERAM full time came and went quietly.  Instead of calling in the media and having a press release (and having sheet cake), the FAA barely noted the occasion.

The complete lack of fanfare noting the first enroute center to start running ERAM full time shows that the FAA knows full well how unreliable/unstable the ERAM software still is.  At this point it’s clear they’re making deliberate efforts to not call any attention to the ERAM project.

After lots of boastful press from the FAA over ERAM early last year, including statements of how the program was on budget and ahead of schedule (even though it wasn’t), the FAA abruptly stopped talking about ERAM after significant problems running it at ZLC in a test last fall.

The FAA apparently learned its lesson then and now isn’t going to mention ERAM at all, instead choosing to continue testing and deploying ERAM quietly and keeping its fingers crossed that it won’t cause a news event.

Every time the FAA and Lockheed Martin complete another test without significant problems they seem to convince themselves the project is doing just fine.  After the eight day ZLC test they were convinced the software was ready for permanent use after just a little “tweaking”, even though it’s now clear that was far from the truth.

Last fall one of the problems that resulted in the aborted ZLC test was that datablocks (the tags that display the aircraft call sign and altitude as well as other information) wouldn’t track properly and sometimes ended up tagging up on the wrong target.

Guess what?  That problem still exists many months (and many updates) later.

The datablock/tracking functionality is fundamental to an air traffic display system and is therefore safety-critical.  It’s disturbing that at this stage this basic functionality is still so unreliable in ERAM.

This may not be simply due to software bugs either; there may be some significant problems with the software tracking algorithms within ERAM, which from what I’ve heard are radically different from those used in the HOST computer system.

Here’s a list of some of the latest bigger problems with ERAM (and note that some of them, especially the tracking problems, aren’t new):

Interim Flight Plans – If a controller starts an interim flight plan (datablock only, no beacon code or routing) ERAM aggressively searches for the first target of opportunity to track. It may be a primary, or a beacon belonging to another aircraft.
Track Un-Pairing – Arbitrarily the datablock will disassociate from the beacon target. We are unable to determine what seems to cause it. We looked at RADAR sort boxes and ASR terminal RADAR feeds, and who knows what else. ERAM will not automatically re-pair the datablock and the target like HOST does. We see this happen frequently around SLC where limited datablocks create a bright large yellow spot over the airport. You can’t shut them off and it is easy for the un-paired datablock to disappear into the blob.
Track Swap – We had some instances of departures where ERAM switched datablocks on aircraft on completely different routes and entering different sectors.
Bogus Beacon Codes – Frequently ERAM will flash in the third line a bogus beacon code (like the aircraft is squawking an incorrect code) for one sweep and then it disappears.
Track Pairing – If ERAM associates a full datablock with an incorrect beacon, you have to track the datablock at least 32 miles away from the incorrect beacon for ERAM to accept the disassociation. Approximately 30 seconds has to pass before you can pair it with the correct beacon target.
Bogus Alerts – We see significant numbers of bogus alerts: MSAW, conflict probe in EDST (URET replacement), aircraft working in SUAs.
Inter Facility Handoffs to Vertically Stratified Sectors – If an aircraft changes altitude 30 minutes prior to exiting the facility, and the new altitude causes the aircraft to enter a different sector in the receiving facility, ERAM will hand the aircraft to the incorrect sector if you use the auto addressed handoff option (single alpha character followed by CID). You have to manually address the handoff to the correct sector.
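
The Track Pairing item above amounts to two thresholds, and a short sketch makes the sequence easier to follow. This is only an illustration of the rule as reported, not ERAM code: the class, the function names, the call sign and beacon codes are all hypothetical, and the units ("miles", "approximately 30 seconds") are taken at face value from the report.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds as reported in the Track Pairing item; exact values, units and
# comparison rules inside ERAM are assumptions made for illustration.
MIN_SEPARATION_MILES = 32.0
MIN_WAIT_SECONDS = 30.0


@dataclass
class Datablock:
    callsign: str
    paired_beacon: Optional[str] = None
    seconds_since_unpaired: float = 0.0  # time since the last disassociation


def try_disassociate(block: Datablock, distance_miles: float) -> bool:
    """Break a pairing only if the datablock was moved far enough away."""
    if block.paired_beacon is None or distance_miles < MIN_SEPARATION_MILES:
        return False
    block.paired_beacon = None
    block.seconds_since_unpaired = 0.0
    return True


def tick(block: Datablock, seconds: float) -> None:
    """Advance the wait timer while the datablock is unpaired."""
    if block.paired_beacon is None:
        block.seconds_since_unpaired += seconds


def try_pair(block: Datablock, beacon: str) -> bool:
    """Pair with a beacon only after the waiting period has elapsed."""
    if block.paired_beacon is not None:
        return False
    if block.seconds_since_unpaired < MIN_WAIT_SECONDS:
        return False
    block.paired_beacon = beacon
    return True


# Hypothetical flight paired with the wrong beacon code.
db = Datablock("DAL123", paired_beacon="4521")
print(try_disassociate(db, 10.0))  # too close to the wrong beacon
print(try_disassociate(db, 32.0))  # far enough, pairing is dropped
print(try_pair(db, "1234"))        # still inside the waiting period
tick(db, 30.0)
print(try_pair(db, "1234"))        # waiting period over
```

The sketch shows why the workaround is so costly for a controller: both conditions have to be satisfied in sequence before the datablock can be put back on the right target.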

Apparently the latest software version yet to be put into use isn’t intended to fix many of the aforementioned problems either; instead it addresses other bugs.

It will be interesting to see how the latest episode affects the entire ERAM project.

One way or the other it’s going to result in the project falling further behind schedule.

But I doubt very much that it will convince the FAA to stop testing the software on live traffic.

16 comments

  1. Given the fact that ZLC had run an 8-day test and then started what was to be the permanent transition from Host to ERAM, what issue or issues were the final straw? You provide an excellent list, but the problems you list were there during the 8-day test? Why did FAA management finally relent and fall back to Host?

    You are correct about the difference between Host and ERAM tracking. ERAM uses Lockheed’s proprietary tracker developed for Micro-EARTS. LMCO just used a big hammer to pound it into the Host environment. You are also correct about the algorithms: they are embedded in the tracker itself and will be difficult (not impossible) to change. LMCO totally snowed the FAA program office about the superiority of “their” tracker and that the Host tracker was junk. There were how many mid-airs in the last 25 years using the Host tracker? Anyone, anyone at all?

  2. Thanks for the information about the flight tracking algorithms, George!

    This is part of an email from the ZLC ERAM Rep that explains why they decided to fall back to HOST at ZLC:

    This was brought on by a pairing issue and an FDM PAS/SAS failure.

    There have been several pairing issues, but the one that made our ATM decide to fallback was two SLC departures. A DAL tagged onto a SKW.
    The DAL was an east gate departure and the SKW was a south gate departure.
    LM could not give a reason why this happened.

    While the ATM was waiting for LM to tell her if they could find out why the two departures had the wrong datablock we had the FDM hit.
    When we lost the FDM the flight plan information on the d-side blued out and we had our ACL repopulate.
    The FDM recovered in about 1 minute and 30 seconds.
    The underlying approach controls got a flight plan dump, which caused many strips to be printed.
    S56 had the added problem of having all the new flight plans issued PDRs for the wrong runway.

    With all this and a split beacon issue from a failed radar site earlier in the day the decision to have an orderly fallback was made.

    I don’t know what FDM (Flight data management?) and PAS/SAS are though and/or how they relate to the ERAM system. Obviously both are related to flight plan information processing and display.

    Can someone else elaborate?

  3. ERAM has two different paths or channels that run in one of four modes: active mode, test mode, backup mode, or pending state. The operational ERAM system (HOST active) is the operational Primary Address Space (PAS) or active mode. Since ERAM consists of nearly 90 RISC servers instead of two HOST processors, an address space is simply enough memory carved out of the system (across many servers) to run what you need to. You can think of it as being functionally equivalent to the Operational HOST or NAS processor.

    The ERAM software on the non-operational channel (NAS Standby) would be the Standby Address Space (SAS) for whoever is in backup mode. ERAM allows a change in modes via an SMGT message, which is the ERAM equivalent of the NAS monitor.

    Flight Data Management (FDM) is one of the operational units (OUs)/functional groups (FGs), which you can think of as NAS program elements or Programs like DAM. FDM contains the requirements for processing requests which modify fields in the Flight Object. This is analogous to HOST Flight Data Processing, where controllers at either a “D” or “A” position make modifications to flight plans, for example amending an aircraft’s final altitude. Actually FDM is responsible for breaking apart Flight Plans just as FDP does today, no matter what source the FP originated from: Bulk, Interfacility, FSS, etc.

  4. Awesome stuff. Keep up the good work.

    It’s a shame NATCA has lost its guts over the years. Several years ago, we’d be at airports handing out flyers telling the passengers they are essentially beta testers for new, known-to-not-fucking-work software.

    Today, we’re… well, I’m not sure what we’re doing. Not enough, is what it appears to me.

  5. When NAS was first fielded (think 1970s…) it was commissioned and then pulled back out of the system because of various serious bugs. That put everyone back on shrimp boats and broadband radar. Now it’s the gold standard – go figure. Nobody thought they were fielding a dud system, but sometimes remaining problems that weren’t (or couldn’t be) found during testing only come to light when a system is exposed to the real world. The trick is to have a fallback position if things don’t go as expected, and the ERAM program seems to be managing to maintain that balance. HOST software still has occasional bug fixes, and it’s been out for 20 years – it shouldn’t be a surprise that ERAM still needs work. FAA tested TCAS to death, until finally directed to field version 6 (!) by Congress, and once V6 got fielded, operational experience and exposure led to V7 – which has been pretty stable and by observation seems to be good at preventing midairs. If Congress had let the FAA wait until TCAS was perfect, they’d be testing version 14 at the Tech Center and we’d still be having airline midairs with nothing fielded.

    Sometimes big changes are harder than they look. Eventually you have to stop thinking and testing and start doing – and be prepared to deal with the unexpected when it occurs. It’s a lot easier to just keep testing, until perfection is reached – because that will save you the embarrassment of fielding an imperfect system. It’ll also keep you from making any real progress.

  6. OldTimer,

    There is a vast difference in perspective between someone who’s actually required to use a defective tool to do their jobs safely, and everyone else.

    The problem isn’t that ERAM needs some work. It needs a lot of work.

    At this point it seems pretty clear that ERAM is being driven more by timelines and the associated payout bonuses to the contractor than whether or not it’s ready for use with air traffic.

    And since TCAS was never intended as a primary method of keeping airplanes apart either, you’re talking apples and oranges by bringing it up here. TCAS was intended as a fail-safe separation system.

    By contrast ERAM is a primary separation system. And it’s clearly not even close to being up to the task.

    No one is asking for perfection in ERAM. But basic functionality that is reasonably reliable would be nice before we are forced to start using it 24/7.

    (Anyway, it’s nice to know that TCAS does work now that we’re beta testing ERAM on the flying public…)

    As for the fallback position maintaining balance in the ERAM program, that might be true if we could instantly fall back to HOST.

    But the transition back and forth from ERAM to HOST is far from quick or easy, simply because it wasn’t designed to do that.

    An orderly fallback to HOST takes something on the order of 30 minutes or more, which is why they waited until traffic abated in the evening to perform the fallback at ZLC. They’ve been using USB flash drives at ZMP to import the flight plan information from ERAM back into the HOST when we’ve been transitioning from ERAM back to HOST after testing, and all the flight plan information isn’t necessarily preserved in that process either.

    It’s another kludge operation because ERAM wasn’t intended to run with HOST; ERAM is supposed to be the backup for ERAM.

    Otherwise the real-time backup for ERAM right now is EBUS (or DARC) which completely eliminates all flight plan information. Reverting back to EBUS with any significant amount of traffic would be a nightmare (for controllers anyway).

    Ultimately ERAM doesn’t give controllers any significant functionality that enables them to do their jobs better compared to our current equipment.

    And right now ERAM doesn’t work well at providing that basic functionality.

    Track Un-Pairing – Arbitrarily the datablock will disassociate from the beacon target. We are unable to determine what seems to cause it.

    The HOST system associates datablocks with targets just fine (as does DARC). So where’s the “progress” exactly with ERAM?!

    Regardless, all the people making the decisions to “stop thinking and testing and start doing” aren’t the ones who are required to separate airplanes with that equipment. Nor are they the ones who have to “deal with the unexpected.”

    Those burdens rest solely on the shoulders of the air traffic controllers.

  7. Just a quick point of clarification. George is pretty good at defining ERAM functionality, but PAS/SAS swaps occur between the primary and secondary address space on the same channel. Each channel has a redundant system running inside of it, which is the equivalent of the Host standby processor. If both PAS and SAS fail one can still then bring the Back-up Channel into Active Mode; and it too has multiple PAS/SAS boxes. So there have to be 4 failures before a system is completely toast. I am not fully aware of ZLC’s situation, but I believe all of this happened on Channel A. As mentioned, once an address space fails it has a recovery process, and in this instance it took 00:01:30 to recover. I am not aware if they tried to bring “B” from Back-up to Active or not.

    The data block unpairing is a very significant issue. I agree with you that it is very disheartening to have a repeat of last October this many levels of software down the line.
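
    The "4 failures" arithmetic in this comment can be made concrete with a tiny model. It is a sketch under stated assumptions only: two channels labelled A and B, one PAS and one SAS per channel, and a fixed order in which they are tried. The names are illustrative, and in practice bringing the back-up channel to active is described above as an operator action, not something automatic.

```python
# Order in which address spaces would be tried in this simplified model.
# Channel labels and the strict ordering are assumptions for illustration.
FAILOVER_ORDER = [("A", "PAS"), ("A", "SAS"), ("B", "PAS"), ("B", "SAS")]


def select_active(failed):
    """Return the first address space that has not failed, or None."""
    for unit in FAILOVER_ORDER:
        if unit not in failed:
            return unit
    return None


failed = set()
for unit in FAILOVER_ORDER:
    print("running on", select_active(failed))
    failed.add(unit)  # simulate the currently active unit failing
print("after four failures:", select_active(failed))
```

    In this model the system only runs out of options once all four address spaces are marked failed, which matches the count given in the comment.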

  8. DARC started out as single-site only (no mosaic) showing nothing but limited datablocks, until full datablocks were instituted in something like version E or even later – and sometimes two different radars would put different flight IDs on the same aircraft as it left one coverage area and entered another. Stuff like that happens with new software – BTDT, got the radar T-shirt.

    A big chunk of the HOST tracker software (the part that maintains track identity, among other things) is old IBM legacy Jovial assembler code that nobody wants to touch because nobody understands it – and as long as nobody messes with it, it works. As a long-term strategy, sticking with inscrutable Jovial is not a viable plan, though – somebody has to write some new code that implements at least the same functions, and when you write new code, you get bugs. Visibly “perfect” to the user would simply be a system that does precisely what the old stuff did, without errors or any apparent operational improvement – but just getting there with new code that somebody actually understands and can maintain is a major improvement all by itself.

    I agree that the focus should be on ensuring that operational systems are as stable and correct as possible, but at some point somebody has to decide that it’s safe enough to use even if it’s not perfect… and there’s plenty of room for honest differences of opinion there.

  9. Why the hell is the damn hype loving media so slow to cover this huge issue?

    This makes the Toyota recall look like a hangnail.

    Where is the union whistle blower?

    Where is the member of Congress that wants to be a friggin hero? Where is the transparent FAA director?

    LM getting bonuses despite the product looking like crap would grab headlines.

    60 Minutes are you awake?

    Where is Huffington Post?

    Pathetic.

    I know someone well inside FAA who says he will not fly once ERAM is operational 24/7.

  10. I was a controller at ZAU during the transition to the NAS and I think OldTimer has a point. You reach a juncture with these new systems where it works “pretty well” and you just have to take the plunge. How else can you do it? Yes, you are putting a huge burden on the controller. Yes, you’re not going to sleep well for a year or two. Yes, you’re gonna get the crap scared out of you on a regular basis, but that’s (unfortunately) the nature of these transitions. There were a lot of older guys back then who found the transition too stressful and they left the boards. It’s just part of the job, I guess.

  11. I don’t think anyone is looking for perfection in ERAM right now. The big question is, is ERAM far enough along in development to be used on live traffic?

    The very fact that ZLC had to fall back from ERAM to HOST proves it’s not ready.

    The tracking problems, especially the one where the data block drops off the target (“unpairs”) and the flight plan is deleted, are major bugs that they knew still existed when they decided to run ERAM 24/7 at ZLC. Most of those tracking bugs have been known problems since (at least) last fall and still haven’t been fixed.

    Those bugs alone should preclude it from being used on live traffic until they’re fixed, especially considering they don’t know what’s causing those problems and can’t seem to correct them.

    That’s not even considering the myriad of other bugs that ERAM still has.

    But it’s clear the FAA is willing to gamble (because it has done so in the past) that, in spite of those known significant problems with ERAM, controllers will be able to work around them and keep the airplanes apart.

    So much for the FAA’s claims of, “Safety: The Foundation of Everything We Do”, “safety is our passion” as well as “Integrity is our character. We do the right thing, even when no one is looking.”

    ERAM is an ambitious and very complicated program. Knowing that, it’s reasonable to expect it’s going to take a while before it’s ready for use with live air traffic.

    Right now it’s obvious it’s not ready.

  12. I have to respectfully disagree with “OldTimer.” The arguments put forth are the same tired arguments I have heard from Washington for 30 years. They were wrong 30 years ago and are just as wrong today. Let’s review them in some cogent order.

    1. HOST/NAS “IBM legacy jovial assembler code that nobody wants to touch because nobody understands it.” Boy, where have I heard this before? I understand the HOST/NAS tracker, and I can pick up the phone, today, right now, and call about a dozen people who understand the NAS tracker better than I do. A few of them actually designed and coded the NAS tracker. The NAS tracker was changed, modified, tweaked and improved over the last 30 years; versions went out with each new NAS/HOST system about every six months. Besides the people who understand the tracker, it’s also well documented. If you are truly an old timer you are no doubt familiar with NAS/HOST documentation? NAS MD’s, PDS’s and SDD’s? They are even online at NAS DOCs. Each mathematical formula is broken down, the requirements are all there, and the SDD’s provide each branch through the various modules. The current NAS/HOST tracker isn’t just understood at a conceptual level; it’s understood down to its very sinew, because we have to understand it. Arguing otherwise just shows a lack of knowledge of the system. I’ve lived it, bled over it and understand it.

    2. The second incorrect statement is about Jovial, assembler, etc. The best example I can give: how many people can come in off the street, sit down at an “R” position, and be told “there you go buddy, control traffic, you are now an air traffic controller”? Of course the answer is none; that is what the academy is for, along with OJT, learning, and working with experienced people. The point being, if we need the skill we can teach it. The military does it. Many of us served in the armed forces; how many of us came to boot camp with the knowledge we would need for our career fields? Probably zero. What did we receive? Training. Why? Because the ability to accurately fire an artillery piece is still needed and isn’t available outside of the military. If you need it, train to it. Simple.

    A sub-point to this is that some of the subsystems in NAS/HOST are written in the widely used computer language “C”. It was proposed that the rest of NAS/HOST be rewritten a subsystem at a time in “C”, but that was shot down by Washington because they said we will just train new Jovial programmers. Got to love it: now Washington argues the computer language is the problem and we needed a 3-billion dollar program with no functional improvement because the system is written in Jovial and no one understands it? I hope that isn’t the rationale; really, as a taxpayer I hope there is more than that. Besides, what does Lockheed do? Harvest the AAS ADA code for reuse in ERAM. When you want to talk about dead languages that never saw the light of day in the commercial world, ADA is at the top of the list, but it’s part of ERAM now. Guess we’ll have to have an academy course to teach what? …ADA.

    3. Sorry for the length. In summary: ERAM is here and it’s a reality. We, the FAA, have to find a way to make it work; we have to find a way forward. How is this all going to play out? I have no idea, but in the end I have a feeling it will be the controllers who take it in the neck. God forbid there is a serious incident; you will get blamed, maybe even fired, for the sins of Washington management, who laugh as they cash their performance bonus checks because soulless Lockheed has delivered another worthless document that nobody will ever read while their software remains broken. There were much better ways forward years ago, but we are long past the point of no return. No matter what you read on the web, ERAM must go forward. It’s going to be a hard row to hoe, but press ahead we must.

    There is a lot of second hand information about the current HOST/NAS system which is just that, second hand, worthless.

    FAA management sold our collective souls to Lockheed and now the devil has come to collect.

  13. George – thanks for your in-depth explanations. I’d be interested in discussing ERAM architecture in more detail if you have time. My email is skycopatc at yahoo dot com. Thanks!
