Behind the data

Knowledge discovery provides context for world events

Thanks to the digital revolution, the information available on any given topic has morphed from a sporadic stream of text and photos into a nonstop maelstrom of multimedia shrapnel. Traditional news reports and transcripts are now terabytes of pages, posts and tweets. A billion smartphones upload pix and vids to the net 24/7, and GPS data ties people to places, adding a spatial component to the puzzle.

The need to parse, understand and develop predictions from this barrage of information, particularly for national security purposes, is what the field of knowledge discovery (KD) is all about.

Social media posts often contain location information. Here, tweets generated in the Washington, DC, area are clustered around subway stations. Image: Chad Steed and Chris Maness

"Knowledge discovery involves working our way from individual pieces of information toward an understanding of what's going on," says Robert Patton, a computer scientist on ORNL's Intelligent Computing Research team. "We try to build a context around events." That context is provided by analyzing various types of data that provide indirect evidence of other events.

"For example," Patton says, "when the US military found pictures, letters and other documents on Osama bin Laden's hard drives, that data provided part of the context for his life in Pakistan."

Expanding that context may have required finding out: Who wrote the letters? To whom? Who took the photos? What was in the photos? Where were they taken? KD analysts answer questions like these by applying a variety of techniques to understand the relationships among people, places and information and to recreate the story behind the data.

"Usually our projects start when an organization comes to us with a large set of data they want to use to answer certain questions," Patton says. "They're pretty sure the answers are in there, but they don't know how to find them, where to start, or even what parts of the information are relevant."

Plotting patterns

One of the main ways Patton and his ICR colleagues extract context from data is to look for patterns. For example, email traffic among members of a group is used to construct network diagrams. The relationships found in the diagram can help identify key people within the group.

"We might look at the diagram and realize that one guy has sent tons of e-mail, but he never gets any responses," Patton says. "This might suggest he's a leader because no one is questioning his authority."

GPS data is especially helpful in KD because it enables analysts to link information to a location. For example, if there were an explosion at a particular location at a particular time, analysts would want to gather other location-specific data surrounding the event (mobile phone calls, security camera video, eyewitness accounts) and display it on a map to see if it reveals a pattern. The ability to discern consistent patterns in and around events is a step toward being able to calculate the likelihood of similar events, or even to predict when they will occur.
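
Gathering the data "surrounding" an event amounts to filtering records by distance and time. The following is a minimal sketch under stated assumptions: each record is a dictionary with hypothetical `kind`, `lat`, `lon` and `time` fields, distance is the great-circle (haversine) distance, and the one-kilometer radius and 30-minute window are arbitrary illustrative values.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def records_near_event(records, event, radius_km=1.0,
                       window=timedelta(minutes=30)):
    """Keep records within `radius_km` and `window` of the event."""
    return [
        r for r in records
        if abs(r["time"] - event["time"]) <= window
        and haversine_km(r["lat"], r["lon"],
                         event["lat"], event["lon"]) <= radius_km
    ]


# Hypothetical event and records (coordinates are made up for illustration).
event = {"lat": 38.8977, "lon": -77.0365, "time": datetime(2012, 5, 1, 12, 0)}
records = [
    {"kind": "phone call", "lat": 38.8980, "lon": -77.0360,
     "time": datetime(2012, 5, 1, 12, 10)},   # close in space and time
    {"kind": "camera video", "lat": 38.9500, "lon": -77.0300,
     "time": datetime(2012, 5, 1, 12, 5)},    # several kilometers away
    {"kind": "eyewitness account", "lat": 38.8979, "lon": -77.0366,
     "time": datetime(2012, 5, 1, 15, 0)},    # hours later
]
nearby = records_near_event(records, event)
print([r["kind"] for r in nearby])
```

The filtered records are what an analyst would then plot on a map to look for a pattern.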

Accuracy and speed

Two aspects of knowledge discovery that are critical for national security applications are accuracy and speed.

"If I want to know what the weather will be like next week, but all I know is what the weather is like today, I won't be able to predict a whole lot," Patton explains. "But let's say I have data for the whole eastern seaboard. That gives me more data, and I can improve my prediction. Generally speaking, the more data you accumulate, the more accurate your predictions become."

However, accumulating data can be a two-edged sword. The more data you have, the harder it is to process it quickly. Patton points to the role played by social media in the recent "Arab Spring" uprisings across the Middle East.

"Social media outlets were the channels through which masses of people were communicating and coordinating their actions," Patton says. "We need analytical systems that can keep up with this volume of data if we want to understand events and be able to respond in time to protect our national interests—or to have the opportunity to influence events in some way. Applying high-performance computing to KD enables us to meet this need."

Speed isn't always the primary consideration for KD, but it often is. Sometimes ICR's national security customers have a very short turnaround for questions that involve analyzing new tactical data. In these cases meeting the deadline is more important than being extremely accurate, so analysts have to consider as much data as they can within the timeframe and give their best answers. Other customers have strategic concerns that operate on longer timelines, so they have the luxury of considering all available data.

"For example," Patton says, "some of our customers who have concerns about social and political changes in another country want to be able to look at the terabytes of data that document the history of the country or the social and political movements involved, consider how similar situations transpired, and be able to project the likelihood of certain outcomes—as well as evaluating things we could do to influence or respond to these outcomes."

Looking forward

In the course of their work on ORNL's Jaguar supercomputer, Patton and his colleagues are breaking new ground in applying KD tools and techniques to unprecedentedly large data sets. Because most of their customers don't have access to computers like Jaguar, Patton likens the customers' relationships with ICR to that between an auto manufacturer and its racing division. The research his group does for customers' specialized, high-performance KD applications today will be applied to improve the KD software used on desktop computers a few years down the road.

"Working on Jaguar today—and on its successor, Titan, that will be available this fall—provides us with an opportunity to see how KD algorithms and techniques perform at scales that are way beyond what anyone else is doing right now," Patton notes.

"For example, last year we tried pushing a huge number of documents through our document-clustering tool called Piranha. It broke at half a million. Once we looked into the problem, It was easy to understand and easy to fix, but because we had never tried to process that many documents before, we never knew the problem was there. If we learn to fix these glitches now, then when our customers need to process that volume of data, the software will be ready."

Getting ahead of the game

Having access to Titan will provide ICR with an environment that can support the massive uptick in the quantity of online data.

"Many of our current algorithms were not designed for handling information on that scale," Patton says. "So when they try to process petabytes of data, they start breaking down. Titan will help us develop new ways to handle that volume of information."

Patton notes that the volume of online data is sometimes so large that it can't be stored now and analyzed later. As a result, next-generation KD applications will not only have to process huge amounts of information, but also do it "on the fly," making the most of the one look they'll get as the data stream passes.
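
One well-known way to work with a stream that can be seen only once is reservoir sampling, which keeps a fixed-size, uniformly random sample no matter how many items go by. It is offered here as a generic illustration of single-pass processing, not as a technique the ICR team is known to use. The function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream.

    The stream is read exactly once and only k items are held in memory
    (the classic "Algorithm R").
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# A generator stands in for a feed that cannot be stored or replayed.
posts = (f"post-{n}" for n in range(100_000))
print(reservoir_sample(posts, k=5))
```

Because memory use depends only on `k`, the same code handles a thousand items or a billion. The trade-off is that everything outside the sample is gone once the stream has passed.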

"Titan is going to enable us to keep up with the flow of online information and push our current applications to the breaking point and beyond," Patton says. "It's critical that we get ahead of the game by learning how to handle data on this scale now, so we'll have that capability ready for our national security customers when they need it tomorrow." —Jim Pearce