History of U.S. Weapons Proves Value of Realistic Operational Testing

By Michael Gilmore
The purpose of operational testing is to ensure that the military services field weapons that work in combat. This purpose has been codified in both Title 10 of the U.S. Code and the Defense Department 5000-series regulations.

Operational testing is intended to occur under “realistic combat conditions” that include operational scenarios typical of a system’s employment in combat, realistic threat forces and employment of the systems under test by soldiers, sailors, airmen and Marines, rather than by hand-picked or contractor crews.

The need for realistic operational testing prior to committing funds for full-scale production and fielding has been recognized for many years.  As far back as mid-1969, a blue ribbon defense panel was formed, headed by Gilbert W. Fitzhugh, to make a comprehensive examination of the organization and functions of the Defense Department. One of its major subordinate study groups was on the topic of operational testing.

Prior to 1969, the services had experienced several notable and expensive failures in weapons fielding before conducting sound operational tests. Two examples are the fielding of the M-16 rifle to infantry combat units in Vietnam and the fielding of the F-5 aircraft to operational Air Force units in combat. The M-16 failures in combat were egregious, resulting directly in American casualties and pressure being brought to bear on Congress to investigate.  Congress then required the Defense Department to conduct an operational test in 1967 and 1968 in the jungles of Panama. That test revealed serious deficiencies in the weapon that required modifications to the rifle, resulting in the M-16A1.

In the case of the F-5, a 12-aircraft squadron was deployed to Southeast Asia in 1965, where it was intended to be compared to the F-100 and F-4 aircraft that were already deployed. Many issues were found with the F-5 under these operational combat conditions, and were best summarized by an Air Force colonel who reported that, while the aircraft was a pleasure to fly, it could not go very far or carry enough ordnance to do much after it got there. One F-5 was lost to enemy fire during this deployment.

The group that investigated operational testing as a part of the Fitzhugh panel concluded that: operational testing can, and should, contribute significantly to decision-making within the department; its quality as it was conducted then was very uneven across the services; and results of the operational testing that had been done had not been made available to the senior decision makers within the office of the secretary of defense.

It also concluded that: operational testing within the services was best performed by an independent organization that reported directly to the chief of the service; that it was not adequately managed or supervised at the OSD level; and that there was a shortage of experienced and capable personnel directly involved with operational testing.

The Fitzhugh Panel in the summer of 1970 made several related recommendations: that there should be an assistant secretary of defense for test and evaluation to create and implement policy; that separate program elements for operational testing should be created within the services, with OSD responsible to ensure that service testing budgets are adequate; that a separate program category should be created in the defense budget for operational testing; and that a defense test agency should be established to design tests or review test designs, perform or monitor test performance, and evaluate the entire T&E program. 

This last recommendation was not implemented. The position of director of defense test and evaluation was created but placed within the directorate of defense research and engineering rather than being an assistant secretary.

Thirteen years after the Fitzhugh panel recommendations, Congress again became involved with operational testing. Motivated in part by the well-publicized issues surrounding the live-fire testing of the Bradley fighting vehicle, Sen. William Roth, R-Del., chairman of the Senate Committee on Governmental Affairs, held hearings in 1983. Many witnesses testified that the conduct of operational testing within the department continued to be unsatisfactory, and the earlier Fitzhugh panel findings were reinforced. Particular criticisms were that systems were being procured in full-rate production before they were adequately tested; that existing testing merely confirmed engineering requirements and was not conducted under realistic combat conditions or against realistic threats; and that testing was still being supervised and conducted unsatisfactorily.

Examples in the General Accounting Office testimony to the Roth hearings included the lack of realistic testing on the Army’s AN/TPQ-37 counter-fire radar, performance limitations in the Navy’s land-attack Tomahawk cruise missile, unsatisfactory range of the F/A-18 Hornet and operational suitability problems that plagued the Sergeant York air defense gun, which was ultimately terminated.

The hearings resulted in legislative language that led directly to the creation of the office of the director of operational test and evaluation (DOT&E) as it exists today. The organization was established by Public Law 98-94 in 1983. The statutory language establishes a senior-level presidential appointee and spells out in clear terms the scope and extent of the director’s responsibilities in Title 10, Sections 139 and 2399.

In 1986, the Packard Commission reinforced the earlier panel recommendations by calling for prototype development and operational testing prior to full-scale production, and by recommending that “intensive” operational testing be conducted and evaluated before full-scale production decisions are made. In 1994, the statute was modified to give the director oversight of live fire test and evaluation, which had been embedded in the office of the under secretary of defense for acquisition, technology and logistics.

More than 30 years later, the value of operational testing remains apparent. Many significant and substantial shortfalls in weapon system effectiveness and suitability continue to be found for the first time during testing conducted under realistic combat conditions, notwithstanding positive results derived from developmental testing generally conducted under limited and controlled conditions. 

Once made aware of shortfalls, senior decision makers in the Defense Department and Congress frequently choose to provide the resources and time needed to assure the problems revealed are fixed. DOT&E’s annual reports to Congress, as well as the office’s detailed assessments of individual systems, provide factual evidence of the adequacy of the testing conducted, as well as detailed explanations of the evaluations of weapon effectiveness, suitability and survivability. These evaluations are provided to the secretary of defense, Congress, and the operational and acquisition communities of the military services. Those reports and evaluations, as well as other material posted on the DOT&E website, provide many specific cases in which significant problems with weapon performance were discovered only as a result of the testing under realistic combat conditions as mandated by law. 

Notwithstanding the clear historical record, the need for and value of operational testing have periodically been questioned. Recently, there has been criticism that operational testing drives substantial cost increases and schedule slippage in programs and that its scope should be limited. The facts do not support these beliefs.

Six themes illustrate the value of conducting realistic operational testing:

• Testing limited to determining only whether technically focused key performance parameters (KPPs) have been satisfied does not capture improvements in mission accomplishment or achievement of intended combat capabilities in an operational environment.
• Operational testing should be performed against realistic operational threats expected to be encountered by the units employing the system.
• There are many examples in which operational testing has identified critical system performance problems, some only discovered when employed in an operationally realistic environment; such testing thereby provided opportunities for correction before a system entered full-rate production or was fielded.
• Operational testing can identify problems with systems’ combat effectiveness.  This information is then provided to acquisition authorities for decision-making. Decisions by those authorities can affect program schedules if they judge the associated fix as worth any delay.
• Operational testing has been responsive and supportive of the fielding of systems during wartime.
• Rigorous scientific and statistical methods can and should be used to determine the adequacy of operational testing. Sound statistical analysis of operational test results is key to providing credible evaluations of performance to decision-makers.

Specifications Alone Not Sufficient

Some officials have suggested that operational testing of weapons programs should be limited to a strict measurement of technical requirements only, specifically the key performance parameters approved by the Joint Requirements Oversight Council. To do so would be to repeat the mistakes of the past. KPPs are often cast as technical specifications, such as the range of an aircraft measured under limited specific conditions, that are necessary but not nearly sufficient for defining an effective combat capability in a mission context.

For example, a combat aircraft that cannot fly or carry a weapons payload is clearly unacceptable. But an aircraft satisfying only those requirements will not be assured of penetrating enemy air defenses and destroying targets. So key performance parameters do not specify everything that we need a system to do. In particular, a system that does meet all of its KPPs is not assured to provide an operationally effective or suitable combat capability when all of the operational conditions and concept of operations are considered.

Conversely, there are instances in which systems fail to meet all their KPPs but are nonetheless evaluated by DOT&E to be operationally effective.

While a test under realistic combat conditions will often evaluate more than the key performance parameters, such a test does not “test beyond the requirements.” The system’s full set of required combat capabilities comprises everything in the approved requirements documents — not only the KPPs — the services’ concepts of operations, threat assessment reports, and mission summary and operational profiles, as well as the combatant commanders’ war plans. Those sources in concert with the key performance parameters inform the design of an adequate operational test that enables meaningful determinations of effectiveness, suitability and survivability.

There are many examples in which tests limited to verifying only the key performance parameters would have been inadequate to draw conclusions about the performance of a system when deployed in combat against a realistic threat. An example is the Navy P-8A Poseidon, a maritime patrol aircraft used primarily to conduct anti-submarine warfare as well as reconnaissance missions. The key performance parameters alone did not provide any measure of the effectiveness of P-8A’s primary missions. Instead, the KPPs stipulated the ability to communicate with specific radios, and a range and endurance requirement while carrying a specified number of sonobuoys and torpedoes, but no requirement to actually employ them.

Key performance parameters should be evaluated as part of operational testing, and DOT&E always reports on whether or not they are achieved. Nonetheless, while those requirements are necessary, a test designed to evaluate only those KPPs would clearly have been inadequate to provide information to acquisition decision-makers and operators about the ability of a crew flying the P-8A to find and kill enemy submarines or conduct reconnaissance.

Other examples include the early infantry brigade combat team sensors, which satisfied all their key performance parameters but were evaluated by both the Army and DOT&E as providing no useful combat capability, and the Virginia-class submarine, which did not meet some of its requirements but was evaluated by DOT&E as an operationally effective replacement for the Los Angeles-class submarines. The MQ-1C Gray Eagle unmanned aerial vehicle did not meet its original reliability requirements, but was evaluated by DOT&E to be operationally effective and suitable.

Realistic Threats Necessary

An essential part of any system evaluation is the definition of the threat environment in which the system is expected to perform. A limited assessment against a benign or outdated threat does a disservice to the men and women who will employ these systems in a time of war, who must clearly understand the performance limitations that exist against the threats they would actually encounter in combat. This knowledge can mean the difference between life and death, as well as victory and defeat. Therefore, the threats employed during operational testing must be realistic and replicate the threats the units employing the system can be expected to meet in actual combat.

The Paladin Integrated Management (PIM) family of vehicles program is a case in point. The requirements document for the PIM specified that the howitzer had to be capable of operating outside secured forward operating bases, where it would be susceptible to mines and other underbelly improvised explosive devices (IEDs). However, the technical specifications put forth in the PIM documentation were clearly inconsistent with the Army concept of operations and its associated requirements for protection and survivability.

DOT&E advised the Army and Joint Staff to consider the requirements disconnect among the expected threats, the PIM concept of operations, and its system specifications. In response, the Army began to consider the development of an underbelly armor kit to provide increased protection against mines and IEDs. After considering DOT&E’s analysis and concerns, the Defense Department directed the Army to design and develop an add-on underbelly kit, and to test against these threats, in order to properly characterize vehicle performance under realistic conditions. And, contrary to claims that have been made repeatedly, this decision did not delay the previously planned schedule for a PIM full-rate production decision.

Identifying Problems Early

One of the primary purposes of operational testing is to identify critical problems that surface only when examined under the stresses of operational conditions, prior to the full-rate production decision. This early identification permits corrective action to be taken before large quantities of a system are procured and avoids expensive retrofit of system modifications. 

Operational testing of the Navy’s cooperative engagement capability (CEC) on the E-2D Hawkeye aircraft revealed several deficiencies. The CEC created many more dual tracks than the baseline system, exhibited interoperability problems with the E-2D mission computer, and was less able to maintain consistent air tracks than the baseline E-2C version. As a result of these discoveries, the Navy’s acquisition executive decided to delay the full-rate production decision until the root causes of the failures could be found and fixed. The Navy is now implementing fixes to address these problems, and operational testing will be conducted to verify that those fixes corrected them. While an OT is later in the process than desired to discover these system shortfalls, the value of such testing is abundantly clear.

Fixing, Not Testing, Delays Programs

Operational testing frequently reveals deficiencies with a system that require time and/or training to correct. The acquisition executives who are responsible for programmatic decisions then have to weigh whether the problems discovered are of sufficient magnitude to warrant delays to the program while they are fixed. The assertion that testing causes programmatic delays misses the essential point that fixing the deficiencies causes delays, not the testing. Furthermore, delaying a program to correct serious problems is exactly what we desire in a properly functioning acquisition system.

An example here is the Army’s Warfighter Information Network-Tactical (WIN-T) Increment 2 system. The Army planned to hold a full-rate production decision in September 2012, after an operational test the previous May. The OT revealed several flaws in the system, including limited range, poor network stability, and poor reliability of several major components. Fixing these problems led the Defense Department’s acquisition executive and the Army to delay the program’s full-rate production decision by more than two years.

During this time, the program manager made multiple corrections to the system and re-tested it in May 2013, when more problems were revealed, and again in October 2014. The full-rate production decision is now scheduled for May 2015.

Support to Wars

During the last 11 years of war, DOT&E has demonstrated an ability to rapidly conduct rigorous operational testing and evaluations. The best example of this is the mine-resistant ambush-protected (MRAP) vehicle program. Multiple MRAP configurations had to be procured, tested and fielded very quickly. None of these vehicles were in military inventories at the beginning of the wars in Iraq and Afghanistan. When the urgent operational need was identified, the secretary of defense pushed this as his highest priority program.

DOT&E, working with the Army and Marine Corps, was able to swiftly approve and evaluate 10 separate operational tests of these vehicles. DOT&E was able to report that 10 of 18 MRAP configurations were effective and suitable for combat use. In the remaining cases, findings were reported that permitted improvements to be made to the vehicles’ mobility and protection.

Rigorous Statistical Methods

DOT&E employs rigorous statistical methodologies in its assessments of test adequacy and in its operational evaluations in order to provide sound analysis of test results to decision makers. Those methods are well established in both industry and government. The pharmaceutical, automotive, agriculture, and chemical and process industries, where many of these techniques were originally developed, all use the same statistical methods for test design and analysis that DOT&E advocates. Other government agencies such as the Food and Drug Administration, Census Bureau, the National Laboratories and NASA, all rely on the use of methods similar to those DOT&E advocates. 

Several studies, most recently by the National Research Council and the Defense Science Board, have recommended such methodologies be used in operational testing.

These methods encapsulate the need to “do more without more,” especially in light of a constrained fiscal environment. Indeed, the final result of this policy is to ensure we make the absolute best use of scarce resources by neither over-testing nor under-testing weapon systems. One must provide sound rationales for the level of testing prescribed, and gather the data needed to provide war fighters with a basis for confidence in the performance of those weapon systems.

The benefits of these methodologies are seen in a Navy helicopter program, the multi-spectral targeting system (MTS). MTS enables helicopters to target fast small-boat threats and employ Hellfire missiles at distances that ensure safe standoff. The testing thoroughly examined performance under the variety of operational and tactical conditions that a crew might expect to encounter. A simple analysis of the results that combined all of the data into a single average (a limiting but unfortunately common analysis technique) suggested that the system was meeting requirements.

Only when the more complex and rigorous statistical analysis was employed did the testers discover that the system was significantly failing requirements in a subset of the operational conditions. The unique conditions where performance was poor revealed a weakness in the system, which can now be addressed by system developers. In other words, the global average analysis hid the problems in operational performance, and only with a more robust statistical analysis was DOT&E able to find and pinpoint where system improvement was needed.
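The way a pooled average can mask a failing condition can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the four operating conditions, the trial counts, and the 80 percent required success rate are invented and are not MTS test data. The Wilson score interval is used only as a simple stand-in for the more rigorous analysis, which is not specified here.

```python
from math import sqrt

# Hypothetical requirement: at least 80 percent of engagements succeed.
REQUIRED_RATE = 0.80
# z-value for a two-sided 90 percent confidence interval.
Z_90 = 1.645

# Invented results: condition -> (successes, trials). Not real test data.
TRIALS = {
    "day, calm sea": (19, 20),
    "day, rough sea": (18, 20),
    "night, calm sea": (18, 20),
    "night, rough sea": (10, 20),
}


def wilson_interval(successes, trials, z=Z_90):
    """Return the Wilson score interval (low, high) for a success proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def pooled_rate(trials):
    """Success rate with every condition lumped into one global average."""
    successes = sum(s for s, _ in trials.values())
    total = sum(n for _, n in trials.values())
    return successes / total


def failing_conditions(trials, required=REQUIRED_RATE):
    """Conditions whose upper confidence bound falls below the requirement."""
    return [
        name
        for name, (s, n) in trials.items()
        if wilson_interval(s, n)[1] < required
    ]


if __name__ == "__main__":
    print(f"pooled success rate: {pooled_rate(TRIALS):.4f}")
    for name, (s, n) in TRIALS.items():
        low, high = wilson_interval(s, n)
        print(f"{name}: {s}/{n}, 90% interval {low:.2f} to {high:.2f}")
    print("failing conditions:", failing_conditions(TRIALS))
```

With these invented numbers the pooled rate clears the hypothetical requirement, while the per-condition breakdown flags one condition whose entire confidence interval lies below it. A real evaluation would likely fit a statistical model across the test factors rather than compute separate intervals, but the masking effect is the same.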

History demonstrates that comprehensive and objective information on the expected performance of weapons in actual combat is critical to informed decision-making at any time, but particularly under constrained budgets. Nonetheless, there are some who argue that in the constrained fiscal environment created by sequestration, testing should be cut.

We develop and field weapons to satisfy 100 percent of their concepts of operations against 100 percent of the actual threat. What constitutes adequate operational testing under realistic combat conditions is therefore determined not by fiscal constraints, but by our war plans and the threats we face. The enemy always gets a vote. It would be a false economy and a disservice to the men and women we send into combat to make arbitrary budget-driven reductions to either developmental or operational testing.

J. Michael Gilmore has served as director of operational test and evaluation since September 2009.

