Automation Key to Tackling Burdensome Big Data Problems

By Dan Parsons

When two bombs exploded last year near the finish line of the Boston Marathon, law enforcement officials were inundated with an unprecedented deluge of information in their search for suspects.

For the first time in the aftermath of a terrorist attack, investigators had almost immediate access to phone records, surveillance footage, first-responder communications, cell-phone photos and time-stamped high definition video from dozens of angles.

Information technology systems provided a bounty of data regarding the bombing scene but were not yet up to the task of digesting that data into useful intelligence, said Peder Jungck, chief technology officer for BAE Systems Intelligence & Security sector.

“Someone, somewhere, had an ‘Aha’ moment while looking at all of that information,” Jungck told National Defense. “The idea was, could they search the video and photographic data for people who were not running away from the bombs or looked unafraid? It had never been done before, but the technology was available. And it led to the two suspects.”

So-called activities-based intelligence (ABI) allows analysts to synthesize mountains of information and look for anomalies, rather than consuming video, audio or other data in linear, chronological order. BAE was not involved in the search for the bombing suspects, but has developed methods of automating the synthesis of ABI for the military and government agencies.

“Our focus is to help the government get value out of all that data being collected,” Jungck said. “How do you go and cull that information to help the government not be surprised by world events and save lives by taking human intelligence, signals intelligence, geo-intelligence and everything else, get them all in the same time and space and use them to your advantage?”

The Defense Advanced Research Projects Agency in January released a broad agency announcement seeking research proposals to tackle a similar issue in the study of cancer. The program, called “Big Mechanism,” seeks to establish a system in which the wealth of complex and disparate medical research can be automatically analyzed for patterns and conclusions, much as the overlapping and interlocking data sets from the Boston Marathon finish line were.

“Some of the systems that matter most to the DoD are very complicated,” the solicitation said. “Ecosystems, brains, economic and social systems have many parts and processes, but they are studied piecewise, and their literatures and data are fragmented, distributed and inconsistent. It is difficult to build complete, explanatory models of complicated systems, so effects in these systems that are brought about by many interacting factors are poorly understood.”

The 42-month, $45 million project will develop technology to read research abstracts, extract valuable information from them and recognize related, interacting and overlapping information that can be assembled into larger conclusions. The system will first tackle cancer research and then be broadened in hopes of automating scientific study of other complex systems, according to DARPA.

The ultimate goal is “a new kind of science in which research is integrated more or less immediately — automatically or semi-automatically — into causal, explanatory models of unprecedented completeness and consistency,” the DARPA announcement stated.

“The collection of Big Data is increasingly automated, but the creation of Big Mechanisms remains a human endeavor made increasingly difficult by the fragmentation and distribution of knowledge,” the program description said. “To the extent that we can automate the construction of Big Mechanisms, we can change how science is done.”

Collecting ever more information is not necessarily advantageous if the systems to analyze it are not in place, Jungck said. There is simply too much data for human eyes to ingest and make sense of, and efficient exploitation of intelligence requires more sophisticated automation of the analysis process, he said.

“The more data you collect, the more malaise and noise comes into the system,” he said. “Analysts spend 60 to 70 percent of the time trying to collect the data they need, let alone analyzing the data and … trying to find the needle in the stack of needles that you didn’t know about.”

The military and law enforcement agencies are familiar with recognizing patterns of behavior as a way of flushing out potential criminals and terrorists before they strike. After Timothy McVeigh blew up the Alfred P. Murrah building in Oklahoma City in 1995, federal agencies began monitoring fertilizer sources and other chemicals used to make the explosives. Local and federal law enforcement agencies regularly monitor purchases of precursors used to make crystal methamphetamine as a way of identifying manufacturers of the drug. Though successful, those methods are demanding of resources and need to be increasingly automated, Jungck said.

BAE is developing systems that automatically mine metadata, time markers, photographs, video, maps and other information to identify circumstances and behavioral patterns that precede a terrorist attack, then automatically alert authorities if similar situations appear within intelligence data sets.

“After the Oklahoma City bombing, we realized the predecessor would be people buying chemicals from feed stores,” Jungck said. “Now, I don’t know where someone might bomb. I don’t know that they are even going to do it. But I can manage the source of supplies. I have a higher chance of predicting that this is an anomaly.”

Similar systems need to be scaled up to handle the vast amounts of data currently available and recognize the smallest indications of a possible terrorist attack or other preventable event before it happens, he said.

Information overload is not only a problem on the collection and exploitation end of intelligence software systems. Software designers are just as swamped during development by the wealth of code that must be written and verified for the complex modern computing systems that allow such massive amounts of data to be collected, said Daniel Ragsdale, program manager for DARPA’s crowdsourced software verification program.

The agency is attacking data overload from a dozen different angles, one of which uses online video games to rid software of bugs and glitches.

In an effort to keep up with the white-hot speed of technological development, the Defense Department is relying more on integrating commercial hardware and software with its information technology infrastructure. But modern software is riddled with bugs — from one to five per million lines of code, Ragsdale said. A program consisting of 9,000 to 10,000 lines of code, which is rudimentary in modern terms, can take upwards of 22.5 man-years to debug, he added.

“This is a pretty serious problem,” he said. “They introduce a vulnerability that could be exploited, but could also just crash it or cause it to give incorrect answers. … It is extremely costly to verify even small-scale programs. For something that requires that level of effort, we’ve got to approach this in a new way.”

DARPA has several ongoing efforts to automate software development and cybersecurity assurance. The most novel among them uses online puzzle games to aid in the cumbersome task of software verification — searching digital code for bugs that could impair a program’s functionality or be exploited by an adversary.

Formal verification is one of the most difficult aspects of software development, Ragsdale said. Once a program is written, every line of its code must be combed for errors.

The process is so labor intensive it can monopolize the time of highly skilled, well-compensated software engineers. The volume of software being produced makes it nearly impossible to verify without a massive outpouring of time and money, he said.

DARPA’s first endeavor to crowdsource data analysis through video games resulted in Foldit, an online game in which players help understand the process of intercellular protein folding.

“Many of us have played mindless games, and at the end of playing, those are two hours of my life that I can’t get back,” Ragsdale said. “That does not happen here. When a user is done playing, they know they have accomplished something in the real world.”

Five puzzle games, collectively known as Verigames, were created by university-academia teams. Each game takes a different approach to automating the verification process, but all feed players’ puzzle solutions through conversion software that automates the identification of errors in the code being verified. Engineers then must go in and correct the errors, Ragsdale said.

“We didn’t know what was in the realm of the possible. The software part of this is effectively hidden from players, so the programs could have taken any form,” Ragsdale said. “All they have to know is there is a game with a strategy and scoring. Now we’re learning things that you could not learn unless you go beyond a concept and have people try to bring that concept to life.”

Verigames was formally launched in December to a “fair amount of fanfare,” Ragsdale said. Each game has a different theme — one involves piecing together robots in space, another involves plant biology — but accomplishes the same basic function. Participation has grown steadily since the games went live, but it is too early to gauge exactly how many users have accessed the site or how much code they have helped debug, he said. The site peaked at about 5,000 unique visitors on a single day.

If Verigames, or some iteration of the program, proves to be a viable approach to formal software verification and is shown to reduce the burden placed on software engineers, it will be made available to industry, Ragsdale said. Officials across the Defense Department, in other government agencies and in the commercial sector are enthusiastic about the efficiency that automated, crowdsourced verification promises, he said.

“Commercial entities could bring their software and translate it into a form that can be used in games, which would reduce the level of effort required for whoever owns that software,” he said. “In a perfect world it will be a fee-for-service” model. The long-range goal is to dump in software on one end, and out the other end would come a game for verification.

Topics: Infotech, Science and Engineering Technology, DARPA
