Appendix B – Classical Engineering Issues

We shall now present several classical engineering issues through illustrative examples. These issues are indeed representative of the typical problems addressed by systems architecting. We have classified them into two categories: on the one hand, product problems, referring to purely architectural flaws leading to a bad design of the product, and on the other hand, project problems, referring to organizational issues leading to a bad functioning of the project. An overview of these different problems is presented in Table 13 below. The details of our examples & analyses can be found in the remainder of this appendix.

  • Product problems
    • Product problem 1 – The product system model does not capture reality
      • Typical issue: the system design is based on a model which does not match reality
      • Example: the failure of Calcutta subway
    • Product problem 2 – The product system has undesirable emergent properties
      • Typical issue: a complex integrated system has unexpected and/or undesired emergent properties, coming from a local problem that has global consequences
      • Example: the explosion of Ariane 5 satellite launcher during its first flight
  • Project problems
    • Project problem 1 – The project system has integration issues
      • Typical issue: the engineering of the system is not done in a collaborative way
      • Example: the huge delays of the Airbus A380 project
    • Project problem 2 – The project system diverts the product mission
      • Typical issue: the project forgets the mission of the product
      • Example: the failure of the luggage management system of Denver airport

Table 13 – Examples of typical product and project issues addressed by systems architecting 

B.1 Product problem 1 – The product system model does not capture reality

To illustrate this first product architecture issue, we will consider the Calcutta subway case (this case is not public; we were thus obliged to hide its real location and to simplify its presentation, without however altering its nature or its systems architecting fundamentals), which occurred when a very strong heat wave (45°C in the shade) struck India during summer time. The cockpit touch screens of the subway trains then went completely blank and the subway drivers were therefore no longer able to pilot anything. As a consequence, the subway stopped operating for a few days, which moreover led to huge chaos in the city and to significant financial penalties for the company that built the subway, until the temperature came back to normal and it became possible to operate the subway as usual.
To understand what happened, the subway designers immediately tested the touch screens, but these components worked fine under high temperature conditions. It then took several months to understand the complete chain of events that led to the observed malfunction, which was – quite surprisingly for the engineers who made the analysis – of a systemic nature, as we will now see. The analysis indeed revealed that the starting point of the problem was the bogies, that is to say the mechanical structures that carry the subway wheels. Each subway wagon is supported by two bogies, each of them with four wheels. The important point is that all these bogies were basically made only of metal. This metal expanded under the high heat, leading to an unexpected behavior of the bogies that we are now in a position to explain.

One must indeed also know that a braking system is attached to each bogie. These braking systems are in particular regulated by the central subway computer, in which a control law is embedded. When braking is initiated, the control law obliges each local braking system on each bogie to exert a braking force that shall remain between a lower and an upper safety bound (braking forces can indeed be neither too strong, in order to avoid the wheels destroying the rails, nor too weak, in order to avoid wheel slip, which would result in no braking at all). The role of the central computer is thus to ensure that the two safety bounds are always respected during braking, which is achieved by relaxing or increasing the braking force on a given braking system. The key point was that the underlying control law was not valid at high temperature. This control law was indeed designed – and was quite robust in that case – in a Western environment where strong heat never occurs. Hence nobody knew that it was no longer correct in such a situation.
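To make the mechanism concrete, here is a minimal C sketch of such a bounded regulation loop. All names, bounds and the linear force model are hypothetical (the real control law is not public); the point is only to show how a plant model calibrated at Western temperatures can keep a regulator correcting forever – one network message per cycle – once an unmodeled thermal bias appears:

```c
/* Minimal sketch (hypothetical names and numbers) of the braking
 * regulation loop described above. */
#include <stdio.h>

#define FORCE_MIN 0.30 /* lower safety bound (normalized units) */
#define FORCE_MAX 0.45 /* upper safety bound (normalized units) */

/* Hypothetical plant model: the force a bogie really exerts for a given
 * command. Metal expansion above 40 degrees C adds a bias that the
 * embedded control law knows nothing about. */
static double actual_force(double command, double temp_c) {
    double thermal_bias = (temp_c > 40.0) ? 0.50 : 0.0; /* unmodeled */
    return 0.5 * command + thermal_bias;
}

int main(void) {
    double command = 0.8;
    double temp_c = 45.0; /* try 25.0: the loop converges immediately */

    for (int cycle = 0; cycle < 10; cycle++) {
        double f = actual_force(command, temp_c);
        if (f > FORCE_MAX)      command -= 0.05; /* relax braking    */
        else if (f < FORCE_MIN) command += 0.05; /* increase braking */
        else break;                              /* in bounds: quiet */
        if (command < 0.0) command = 0.0;
        /* each correction is one more message on the shared network */
        printf("cycle %d: force %.2f outside [%.2f, %.2f] -> command %.2f\n",
               cycle, f, FORCE_MIN, FORCE_MAX, command);
    }
    return 0;
}
```

At 45°C the loop never converges: every cycle emits another correction message and, multiplied by the many braking systems along the train, this is exactly the kind of traffic flood described below.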


Figure 62 – The Calcutta subway case 

What happened can now be easily explained. The high temperature provoked the same metal expansion in all the subway bogies. Hence all bogies were continuously working outside their safety bounds during braking. But the computer was not aware of that situation and kept trying to bring all braking systems back inside their safety bounds, applying its fixed control law, which was unfortunately false in this new context. As a consequence, there was a permanent exchange of messages between the central computer and the numerous braking systems along the subway. This resulted in an overload of the network, which was not dimensioned to support such heavy traffic. The observed effect on the touch screens was thus just a side effect of this overload, due to the fact that the touch screens are connected to the central computer by the same network, which explains why nothing wrong was found at the touch screen level.

We can thus see that this case highlights a typical modeling problem – here the fact that the braking control law was false in the Indian high-temperature context – but also an integration issue (all components were indeed working well individually; the problem was a subway system level problem that could not be found in any single component, but rather in the bad integration of all the involved components), which ultimately led to an operational failure through a “domino” effect, where an initial local problem progressively propagated along the subway and resulted in a global breakdown of the system. One shall thus remember that it is a key good practice to permanently check and ensure the consistency between a system model and the reality it models, since reality will always be stronger than any model, as illustrated by the Calcutta subway case.

Another classical example of a domino effect is provided by O. de Weck [28], who described the collapse of the redesign of the US F/A-18 fighter aircraft for the Swiss army. This aircraft was initially designed in 1978 for the US Navy as a carrier-based fighter and attacker, with an expected 3,000 flight hours, missions with an average duration of 90 minutes, a maximal acceleration of 7.5 g and 15 years of useful life. As a consequence of the neutral policy, inland position, small size and mountainous nature of Switzerland, this country wanted a land-based interceptor, with an expected 5,000 flight hours, missions with an average duration of 40 minutes, a maximal acceleration of 9.0 g and 30 years of useful life. Engineers concluded that it was sufficient to change some non-robust fatigue components near the engine, made of aluminum, in order to meet the Swiss requirements. These components were then redesigned in titanium. Unfortunately this shifted the center of gravity of the aircraft, and the fuselage had to be reinforced to solve that issue. This further change led to transversal vibrations that required other reinforcements, additional weight and modifications of the flight control system. Many changes continued to propagate within the aircraft, up to impacting the industrial processes and the organization of the construction factory. A 500-gram change thus finally led to 10 million dollars of modifications that were never expected.

Note finally that this kind of modeling problem is typically addressed by systems architecting, which proposes an answer through “operational architecting” (we refer to both Chapter 3 and Chapter 4 for more details). Such an analysis indeed focuses on understanding the environment of the concerned system. In the Calcutta subway case, a typical operational analysis would have consisted in considering India as a key stakeholder of the subway and then trying to understand what is different in India compared to the Western countries where the subway was initially developed. It is then easy to find that very strong heat waves statistically occur in India every decade, which creates an India-specific “high temperature context” that shall be specifically analyzed. A good operational systems architecting analysis shall then be able to derive the braking system lifecycle, presented in Figure 63, with two states that respectively model the normal and high temperature contexts, and two transitions that model the events creating a change of state and of braking control law (here the fact that the temperature rises above some threshold T0 when the braking system is in the “Normal” context, or falls below some threshold T1 when it is in the “High temperature” context).


Figure 63 – The missing operational analysis in the Calcutta subway case 

Such a diagram is typically an (operational) system model. It looks apparently very simple, but one must understand that introducing the “High temperature” context and the transition that leads to that state would have avoided a stupid operational issue and saved millions of euros… This simplicity is unfortunately an issue for systems architecting: most people will agree that one cannot manipulate partial differential equations without suitable studies in applied mathematics, but the same people will surely think that no specific competency is required to write down simple operational models such as the one provided by Figure 63, which is unfortunately not the case. We indeed believe that only good systems architects can achieve such an apparently simple result, which will always be the consequence of a good combination of training and personal skills.
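Expressed in code, the operational model of Figure 63 is just a two-state machine whose transitions are guarded by the temperature thresholds T0 and T1. The following C sketch (the threshold values and the hysteresis T1 < T0 are our own illustrative assumptions) shows how explicitly modeling the “High temperature” context makes the control-law switch a first-class design element:

```c
/* Two-state operational model of the braking context (cf. Figure 63).
 * Threshold values are illustrative assumptions. */
#include <stdio.h>

typedef enum { CTX_NORMAL, CTX_HIGH_TEMP } braking_context;

#define T0 40.0 /* degrees C: leave "Normal" above this threshold      */
#define T1 35.0 /* degrees C: leave "High temperature" below this one  */

/* Transition function: the two transitions of Figure 63. */
static braking_context next_context(braking_context ctx, double temp_c) {
    if (ctx == CTX_NORMAL    && temp_c > T0) return CTX_HIGH_TEMP;
    if (ctx == CTX_HIGH_TEMP && temp_c < T1) return CTX_NORMAL;
    return ctx; /* no transition fires */
}

int main(void) {
    const double readings[] = { 28.0, 38.0, 45.0, 41.0, 33.0 };
    braking_context ctx = CTX_NORMAL;
    for (int i = 0; i < 5; i++) {
        ctx = next_context(ctx, readings[i]);
        /* each context selects the control law that is valid for it */
        printf("%4.1f C -> %s control law\n", readings[i],
               ctx == CTX_NORMAL ? "normal" : "high-temperature");
    }
    return 0;
}
```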

B.2 Product problem 2 – The product system has undesirable emergent properties

The second product-oriented case study that we will discuss is the explosion of the very first Ariane 5 satellite launcher, which is well known thanks to the remarkable work of the Lions commission, which published a detailed and fully transparent public report on this accident (see [56]). This case was largely discussed in the engineering literature, but its main conclusions were rather focused on how to better master critical real-time software design. We will here present a systems architecting interpretation of that case which, to the best of our knowledge, has never been made until now.
Let us now recall what happened on June 4, 1996 during the first flight of Ariane 5. First of all, the flight of the satellite launcher was perfect from second 0 up to second 36 after take-off. At second 36.7, however, the two inertial systems of the launcher failed simultaneously, which led at second 37 to the activation of the automatic pilot, which misinterpreted the error data transmitted by the inertial systems. The automatic pilot then brutally corrected the trajectory of Ariane 5, leading to the mechanical breakup of the boosters and thus to the initiation of the self-destruction procedure of the launcher, which exploded at second 39.


Figure 64 – The Ariane 5 case 

As one can easily guess, the cost of this accident was tremendously high and probably reached around 1 billion euros. One knows that the direct cost of the lost satellite payload was around 370 million euros. But there was also an induced cost for recovering the most dangerous fragments of the launcher (such as the fuel stock) that crashed in the (quite difficult to access) Guyana swamps, which took one month of work. Moreover there were huge indirect costs due to the delays of the Ariane 5 program: the second flight was only performed one year later and it took three more years to perform the first commercial flight of the launcher, on December 10, 1999.

As already stated, the cause of that tragic accident has fortunately been completely analyzed in the Lions commission report (cf. [56]). The origin of the accident could indeed be traced back to the reuse of the inertial reference system (IRS) of Ariane 4. An inertial reference system is a software-based system that continuously calculates the position, orientation and velocity of a moving object; in the context of Ariane 5, it is thus a critical system, since most of the other systems depend on its calculations. This critical complex software component worked perfectly on Ariane 4 and was thus identically reused on Ariane 5 without being retested in the new environment (the engineer in charge of the IRS proposed to retest it in the new Ariane 5 environment, but it was decided not to follow that proposal in order to save around 120,000 euros of testing costs). Unfortunately Ariane 5 was a much more powerful launcher than Ariane 4, and the numerical values of the Ariane 5 acceleration – which are the inputs of the inertial reference system – were five times bigger than those of Ariane 4. These values were thus coded in double precision in the Ariane 5 context, whereas the inertial reference system was designed to accept only single-precision integers as inputs. As a consequence, since this software system was coded in C – a very permissive language that does not provide automatic type control, so that a program silently converts any input value into the type it manipulates – the inertial reference system automatically converted all double-precision inputs into single-precision integers, according to standard C language rules. This purely syntactic type conversion destroyed the physical meaning of the involved data and led to an overflow during execution. The error codes resulting from that software error were then unfortunately interpreted as flight data by the automatic pilot of Ariane 5, which corrected – one second after receiving these error codes – the trajectory of the launcher by an angle of more than 20°, resulting almost immediately in the mechanical breakup of the launcher boosters and, one second later, in the initiation of the self-destruction procedure.
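The failure mode can be illustrated in a few lines of C. The variable names and numbers below are ours, not the real telemetry of the Lions report [56]; the point is only to show how a value sized for the Ariane 4 flight envelope survives the silent narrowing conversion while an Ariane 5-scale value loses its physical meaning:

```c
/* Illustrative sketch of a silent narrowing conversion; names and
 * numbers are hypothetical, not taken from the real IRS code. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    double value_a4 = 21000.0;     /* fits the 16-bit input assumption    */
    double value_a5 = 5 * 21000.0; /* about five times larger on Ariane 5 */

    /* The compiler accepts both conversions without complaint, but the
     * Ariane 5 value does not fit into 16 bits: the result is (typically
     * wrapped) garbage with no remaining physical meaning. */
    int16_t stored_a4 = (int16_t)(int32_t)value_a4;
    int16_t stored_a5 = (int16_t)(int32_t)value_a5; /* overflow here */

    printf("Ariane 4 scale: %.0f -> %d\n", value_a4, stored_a4);
    printf("Ariane 5 scale: %.0f -> %d\n", value_a5, stored_a5);
    return 0;
}
```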
The Ariane 5 explosion is hence a typical integration issue. All its components worked perfectly individually, but they did not work correctly together once integrated. Hence one shall remember that a component of an integrated system is never correct by itself: it is only correct relatively to the set of its interfaced components. When this set evolves, one must thus check that the target component is still properly integrated with its environment, since the fact that the IRS module fulfilled the Ariane 4 requirements cannot ensure that it fulfils the Ariane 5 requirements. Moreover, there were also a number of software engineering mistakes, which illustrate the fact that the hardware-software integrated nature of the launcher was not really taken into account by the designers:

  • Software specificity misunderstanding: only physical component failures (which are statistical) were considered, while logical component failures (which are systematic) are of a totally different nature. Note that this kind of software failure can only be addressed by formal model checking or by a dissimilar redundancy strategy, which consists in having two different teams develop two different versions of the software component from the same specification, in order to ensure a different distribution of bugs within the two versions.
  • Poor software documentation: the conditions for a correct behavior of the IRS module were not explicitly documented in the source code.
  • Poor software architecture: the raising of a local exception in a software component shall normally never imply its global failure.
This example also shows that a – usually unsought – emergent property of integration can be death. The Ariane 5 system was indeed incorrect by design, since the launcher, as it was integrated, could only explode. In other words, the destruction of Ariane 5 was embedded in its architecture and can be seen as a purely logical consequence of its integration mode. The IRS component of Ariane 5 indeed consisted of two identical software modules (which, as explained above, was of no use, due to the logical nature of the failure, which repeated identically in each module). The 36 seconds that separated take-off from the crash of the IRS can thus be decomposed into two times 18 seconds, the time needed for each module to logically crash. As a consequence, one can state the following architectural “theorem”, which illustrates the “incorrectness” by design of Ariane 5: if the IRS component of Ariane 5 had N identical software modules, then the launcher would be destroyed at second 18 × N + 3; for N = 2, this indeed gives the observed explosion at second 39. This extreme – and fortunately rare – case thus illustrates well the real difficulty of mastering the integration of complex systems!
Note finally that systems architecting provides a number of methodological tools to avoid such integration issues, among which one can typically cite interface and impact analyses. In the Ariane 5 context, a simple interface type check would for instance have revealed that the input types of the inertial reference system were simply not compatible with the expected ones, which would probably have prevented a huge disaster!
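A minimal sketch of such an interface check, under the same hypothetical names as above, would make the implicit 16-bit assumption of the reused component an explicit, verifiable contract at its boundary:

```c
/* Guarded conversion at the component boundary: the interface contract
 * (16-bit representability) is made explicit and checked, instead of
 * being silently violated. Names and error policy are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

static bool fits_irs_interface(double value) {
    /* explicit statement of the assumption the IRS was designed under */
    return value >= INT16_MIN && value <= INT16_MAX;
}

static bool convert_checked(double value, int16_t *out) {
    if (!fits_irs_interface(value))
        return false; /* reject: the caller must handle the mismatch */
    *out = (int16_t)value;
    return true;
}

int main(void) {
    int16_t out;
    bool ok = convert_checked(5 * 21000.0, &out);
    return ok ? 0 : 1; /* here: 1, the out-of-range input was rejected */
}
```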

B.3 Project problem 1 – The project system has integration issues

Our first “project architecture” issue is the initial Airbus A380 delivery delay, since this case is largely public (see [82] for an extensive presentation of that case study). Let us recall that this aircraft is currently the world’s largest passenger airliner. Its origin goes back to mid-1988, when Airbus engineers began to work in secret on an ultra-high-capacity airplane in order to break the dominance that Boeing had held on that market segment since the early 1970s with its 747. It however took a number of years of studies to arrive at the official announcement, in June 1994, of the creation of the A3XX program, which was the first name of the A380 within Airbus. As the aeronautical market darkened at that moment in time, it is interesting to observe that Airbus then decided to refine its design, targeting a 15–20 % reduction in operating costs over the existing Boeing 747 family. After conducting an extensive market analysis with over 200 focus groups, the A3XX design finally converged on a double-decker layout that provided more passenger volume than a traditional single-deck design, perfectly in line with the traditional hub-and-spoke theory, as opposed to the point-to-point theory that was the Boeing paradigm for large airliners.
At the beginning of 2000, the commercial history of the A380 – the new name that was then given to the A3XX – began, and the first orders reached Airbus in 2001. The industrial organization was then put in place between 2002 and 2005: the A380 components are provided by suppliers from all around the world, while the main structural sections of the airliner are built in France, Germany, Spain and the United Kingdom, for a final assembly in Toulouse in a dedicated integration facility. The first fully assembled A380 was unveiled in Toulouse on 18 January 2005, before its first flight on 27 April 2005. On 10 January 2006, it flew to Colombia, accomplishing both the transatlantic test and the testing of engine operation at high-altitude airports. It also arrived in North America on 6 February 2006, landing in Iqaluit, Nunavut, in Canada for cold-weather testing. On 4 September 2006, the first full passenger-carrying test flight took place. Finally, Airbus obtained the first A380 flight certificates from the EASA and the FAA on 12 December 2006.
During all that period, orders continued to arrive from the airline companies, reaching a bit less than 200 cumulated orders in 2007. The first deliveries were initially – in 2003 – planned for the end of 2006, with an objective of producing around 120 aircraft by 2009. Unfortunately many industrial difficulties – which we will discuss below – occurred, and it was thus necessary to sharply re-estimate these figures downward each year (cf. Figure 65). Airbus announced the first delay in June 2005, notifying airlines that deliveries would be delayed by six months; this reduced the total number of planned deliveries by the end of 2009 from about 120 to 90–100. On 13 June 2006, Airbus announced a second delay, with the delivery schedule slipping an additional six to seven months: although the first delivery was still planned before the end of 2006, deliveries in 2007 would drop to only 9 aircraft, and deliveries by the end of 2009 would be cut to 70–80 aircraft. This announcement caused a 26 % drop in the share price of Airbus’ parent company, EADS, and led to the departure of the EADS CEO, the Airbus CEO and the A380 program manager. On 3 October 2006, upon completion of a review of the A380 program, the new Airbus CEO announced a third delay, pushing the first delivery to October 2007, to be followed by 13 deliveries in 2008, 25 in 2009 and a full production rate of 45 aircraft per year in 2010. The very first commercial A380 was finally delivered at the end of 2007 and, instead of 120, only 23 airliners were delivered in 2009.


Figure 65 – The Airbus 380 case 

These delays had strong financial consequences, since they increased the earnings shortfall projected by Airbus through 2010 to € 4.8 billion. It is thus clearly interesting to try to better understand the root causes of such an important failure.
The source of these delays seems to be connected to inconsistencies in the 530 km (330 miles) of electrical wiring, produced in France and Germany. Airbus cited in particular as underlying causes the complexity of the cabin wiring (98,000 wires and 40,000 connectors), its concurrent design and production, the high degree of customization for each airline company, and failures in configuration management and change control. These electrical wiring inconsistencies were indeed only discovered at the final integration stage in Toulouse (there exists a video in which one can see a poor technician in Toulouse who is unable to connect two electrical wires coming from two different sections of the aircraft, due to a missing 20 centimeters of wire), which was of course much too late…
The origin of this problem could be traced back to the fact that the German and Spanish Airbus facilities continued to use version 4 of CATIA®, while the British and French sites had migrated to version 5. This caused overall configuration management problems, at least in part because wire harnesses manufactured with aluminum rather than copper conductors necessitated special design rules, including nonstandard dimensions and bend radii. This specific information was not easily transferred between the versions of the software, which led to inconsistent manufacturing and, at the very end, created the integration issue in Toulouse. On a totally different dimension, the strong customization of the internal equipment also induced a long learning curve for the teams, and thus further delays. Independently of these “official” causes, there are other plausible deep causes stemming from cultural conflicts within the dual-headed French & German management of Airbus and from a lack – or breakdown – of communication between the geographically distributed teams of the European aircraft manufacturer.

As systems architects, we may summarize such problems as typical “project architecture” issues. The issues finally observed at product level are indeed only consequences of a lack of integration within the project, that is to say of project interfaces – to use a system vocabulary – that were not coherent, which simply refers, in more familiar terms, to project teams or project tools that were not working coherently together. It is thus key to have a robust project architecture in the context of complex systems development, since the project system is always at least as complex as the product system it is developing. Unfortunately, it is a matter of fact that much more energy is usually spent on technical issues than on organizational issues, which often ultimately leads to obviously bad project architectures in complex systems contexts, resulting at the very end in bad technical architectures in such contexts.

B.4 Project problem 2 – The project system diverts the product mission

As a last example of a typical project issue, we will now consider the case of the failure of the Denver airport luggage management system, which is fortunately well known due to the fact that it is completely public (see for instance [27] or [72]).
Denver airport is currently the largest airport in the United States in terms of total land area and the 6th airport in the United States (the 18th in the world) in terms of passenger traffic. It was designed to be one of the main hubs of United Airlines and the main hub of two local airlines. The airport construction officially started in September 1989 and the airport was initially scheduled to open on October 29, 1993. Due to the very large distance between the three terminals of the airport and the need for fast aircraft rotations to fulfill its hub mission, the idea of automating the luggage management emerged, in order to provide quick plane inter-connections to travelers. United Airlines was the promoter of such a system, which was already implemented in Atlanta airport, one of its other hubs. Since Denver airport was intended to be much larger, the idea turned into using the opportunity of the construction of Denver’s new airport to improve Atlanta’s system, in order to create the most efficient & innovative luggage management system in the world, that is to say a fully automated luggage transportation system, relying on new hardware & software technologies, able to manage very large volumes of luggage. It was indeed expected to have 27 kilometers of transportation tracks, with 9 kilometers of interchange zones, on which 4,000 remote-controlled wagons would circulate at a constant transportation speed of 38 km/h, for an average transportation delay of 10 minutes, which was completely unique.
The luggage management system project started in January 1992, a bit less than two years before the expected opening of the airport. During one year, the difficulties of this specific project remained hidden, since there were many other problems with more classical systems. However, at the beginning of 1993, it became clear that the luggage management system could not be delivered on schedule and Denver’s mayor was obliged to push back the opening date, first to December 1993, then to March 1994 and finally to May 15, 1994.

Unfortunately the new automated luggage management system continued to have severe problems. In April 1994, the city invited reporters to observe the first operational test of the new automated baggage system; they saw instead disastrous scenes in which the new system was simply destroying luggage, tearing bags open and crushing their contents in front of them. The mayor was then obliged to postpone the opening date of the airport sine die. As one can imagine, no airline – except United – wanted to use the new “fantastic” system, which forced the abandonment of the idea of a global luggage management system for the whole airport. When the airport finally opened on February 28, 1995, only the United Airlines terminal was thus equipped with the new luggage management system, while the other terminal that also opened was simply equipped with totally classical systems, that is to say tugs and carts (it was initially not possible to open the last airport terminal, due to the time required to change the already constructed automatic luggage management system into a standard manual one).
In 1995, the direct additional costs due to this failure were around $600 million, leading at the very end to a cost overrun of more than one billion dollars. Moreover, the new baggage system continued to be a maintenance hassle (its nickname quickly became the “luggage system of hell”!). It was finally terminated by United Airlines in September 2005 and replaced by traditional handlers manually moving cargo and passenger luggage. A TV reporter who covered the full story concluded quite interestingly that “it took ten years, and tons of money, to figure out that big muscle, not computers, can best move baggage”. One must indeed always remember that a system can be formed of people only doing manual operations: in most cases, good systems architectures are hybrid, with both automatic and manual parts, but in Denver’s case the best solution was purely manual, which it took United Airlines ten years to understand.


Figure 66 – The Denver luggage management system case 

When one looks back at that case, it is quite easy to understand why this new automatic luggage system collapsed. There were first far too many innovations (as a matter of fact, it is interesting to know that Marcel Dassault, the famous French aircraft engineer, refused all projects with more than one innovation – a strategy that worked quite well for him). It was indeed at the same time the first global automated system, the first automatic system managing oversize luggage (skis!), the first system in which the carriages did not stop during their service (suitcases were automatically thrown with a catapult, which easily explains the high rate of crashes), the first system supported by a computer network and the first system with a fleet of radio-localized carriages. There was also a huge underlying increase in complexity: compared to the similar existing Atlanta system, the new luggage management system was 10 times faster, had 14 times the maximal known capacity and managed 10 times more destinations. Moreover, the project schedule was totally unrealistic with respect to the state of the art: due to the strong schedule pressure, no physical model and no preliminary mechanical tests were made, whereas the mere balancing of the lines had required two years in Atlanta.
It is thus quite easy to understand that the new luggage management system could only collapse. In some sense, it collapsed because the project team totally lost sight of the mission of that system, which was just to transport luggage quickly within the airport and at the lowest possible cost, with a strong construction constraint due to the fact that there were less than two years to implement the system. A simple systems architecture analysis would probably have concluded that the best solution was not to innovate and to simply use people, tugs and carts, as usual. This case study thus illustrates quite well a very classical project issue, where the project system forgets the mission of the product system and replaces it by a purely project-oriented mission that diverts the project from achieving the product system mission (in Denver’s luggage management system case, “creating the world’s most innovative luggage management system” indeed became the only objective of the project, which is not a product system mission, but a project system mission). Any systems architect must thus always keep this example in mind, in order to avoid the same issue occurring within their own working perimeter.


REFERENCES

[27] de Neufville R., The Baggage System at Denver: Prospects and Lessons, Journal of Air Transport Management, Vol. 1, No. 4, pp. 229-236, December 1994

[28] de Weck O., Strategic Engineering – Designing systems for an uncertain future, MIT, 2006 

[56] Lions J.L., Ariane 5 – Flight 501 Failure – Report by the Inquiry Board, ESA, 1996

[72] Schloh M., The Denver International Airport automated baggage handling system, Cal Poly, Feb 16, 1996