Ravi Kalakota is a Partner at LiquidHub, a next generation IT Services firm. Ravi’s focus is on a new division called LiquidAnalytics, which does analytics and data consulting, solution development and outsourcing. Prior to LiquidHub, Ravi was a Managing Director with Alvarez & Marsal Business Consulting, a premier restructuring and performance improvement firm. Prior to A&M, Dr. Kalakota was the CIO/CTO for Marsh McLennan. Ravi has co-authored 10 books on e-commerce, e-business, mobile, web services, and global outsourcing. Ravi received his Ph.D. from the University of Texas at Austin. Ravi is a DZone MVB, is not an employee of DZone, and has posted 31 posts at DZone.

NSA PRISM – The Mother of all Big Data Projects

06.17.2013

As a data engineer and scientist, I have been following the NSA PRISM raw intelligence mining program with great interest. The engineering complexity, breadth and scale are simply amazing compared to, say, credit card analytics (Fair Isaac) or marketing analytics firms like Acxiom.

Some background… PRISM is a top-secret data-mining “connect-the-dots” program aimed at terrorism detection and other pattern extraction, authorized by federal judges working under the Foreign Intelligence Surveillance Act (FISA). PRISM allows the U.S. intelligence community to look for patterns across multiple gateways and a wide range of digital data sources.

PRISM is an unstructured big data aggregation framework (audio and video chats, phone calls, photographs, e-mails, documents, internet searches, Facebook posts, mobile logs and connection logs) combined with the analytics that let analysts extract patterns. The idea is to save and analyze all of the digital breadcrumbs people don’t even know they are creating.

The whole NSA program raises an interesting debate about “Sed quis custodiet ipsos custodes?” (“But who will watch the watchers?”)

What is the PRISM Program?

The program is called PRISM, after the prisms used to split light, which is used to carry information on fiber-optic cables.  Think of this as a massive aggregate of aggregates.

Each vendor (Facebook, Google, LinkedIn, etc.) collects an incredible amount of data across its portfolio of properties and applications. What the NSA has done is take this to the next level by creating a massive mashup of all the sources to look for end-to-end patterns and relationships.

The challenge the NSA is tackling is forward-looking, real-time intelligence. Can you predict a potential threat in almost real time, for example by intercepting a mobile phone call while someone is on the move towards a target, and mount a rapid response to avert it? This is not a trivial problem to solve (but essential in the world we live in, where soft targets are increasingly being chosen).

Connecting the dots from the information is the essence of PRISM. A slide briefing about the program outlines its effectiveness and features the logos of the companies involved (not sure how this ppt got out to the Post?). These slides, posted by The Washington Post and the Guardian, represent a selection from the overall document, and certain portions are redacted.

Prism1

The program uses two types of data collection: upstream from the switches themselves (raw feeds) and downstream from the various providers (contextual feeds).

Mobile data collection is the new growth area. People are already walking sensor platforms. Every mobile phone generates a significant data exhaust.

Monitoring a target’s communication: this slide shows how the bulk of the world’s electronic communications move through companies based in the United States. Most of the data goes through bulk taps in switches at AT&T and Verizon, making it relatively easy to capture.

Prism 2

Providers and data: the PRISM program collects a wide range of data from the nine companies, although the details vary by provider. One of the NSA’s research projects aims to forecast, on the basis of telephone data and Twitter and Facebook posts, when uprisings, social protests and other events will occur. The agency is also researching new methods of analysis for surveillance videos in the hope of recognizing conspicuous behavior before terrorist attacks are committed.

Prism3

Participating providers: this slide shows when each company joined the program, with Microsoft being the first, on Sept. 11, 2007, and Apple the most recent, in October 2012.

prism4

Apparently the data is extracted, transferred and loaded into servers at the Utah Data Center in Bluffdale (shown below). According to Der Spiegel, there is enough capacity to store a yottabyte of data… large enough to store all the electronic communications of all of humanity for the next 100 years.
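
To get a feel for that number, here is a back-of-the-envelope calculation that takes the quoted yottabyte figure at face value. The world population of roughly 7 billion is my own assumption for the 2013 timeframe and is not from the Der Spiegel report; the result is an order-of-magnitude sanity check, nothing more.

```python
YOTTABYTE = 10**24   # bytes (SI definition)
TERABYTE = 10**12    # bytes (SI definition)

# Assumptions for illustration only, not figures from the article:
WORLD_POPULATION = 7_000_000_000   # roughly the 2013 world population
YEARS = 100                        # the storage horizon quoted above


def bytes_per_person_per_year(capacity=YOTTABYTE,
                              people=WORLD_POPULATION,
                              years=YEARS):
    """Spread the total capacity evenly over every person and every year."""
    return capacity / (people * years)


if __name__ == "__main__":
    tb = bytes_per_person_per_year() / TERABYTE
    print(f"{tb:.2f} TB per person per year")
```

Under those assumptions the budget works out to a bit over one terabyte per person per year.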

Why do you need to store everything? Ira Hunt, CTO for the Central Intelligence Agency, said in a speech at the GigaOM Structure: Data conference that “The value of any piece of information is only known when you can connect it with something else that arrives at a future point in time.”

NSA Phone Records

The Skillset, Toolset and Dataset behind PRISM

I am extrapolating from multiple sources but PRISM has to do several things:

  1. integrate disparate data sources, providing common views of unified data;
  2. conduct relational, temporal, geospatial, statistical, and network analysis in one unified analytical framework (potentially using a federated model, as no tool can do everything);
  3. identify non-obvious relationships or connections in the data and support visualization and exploratory visual analysis;
  4. share investigations and analytic insights/discoveries in a secure broadcast environment to enable situational awareness and collective understanding.
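
As a rough illustration of the first requirement, the sketch below normalizes records from two feeds into one common view. Everything in it is hypothetical: the feed names, the field names and the `UnifiedEvent` type are made up for the example and say nothing about how PRISM is actually built.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UnifiedEvent:
    """Common view shared by every source (hypothetical schema)."""
    source: str
    actor: str
    counterpart: str
    timestamp: datetime
    kind: str


def from_call_log(row: dict) -> UnifiedEvent:
    # Hypothetical phone feed: epoch seconds plus caller/callee fields.
    return UnifiedEvent(
        source="call_log",
        actor=row["caller"],
        counterpart=row["callee"],
        timestamp=datetime.fromtimestamp(row["epoch"], tz=timezone.utc),
        kind="call",
    )


def from_email_log(row: dict) -> UnifiedEvent:
    # Hypothetical e-mail feed: ISO 8601 timestamp plus from/to fields.
    return UnifiedEvent(
        source="email_log",
        actor=row["from"],
        counterpart=row["to"],
        timestamp=datetime.fromisoformat(row["sent_at"]),
        kind="email",
    )


def unify(call_rows, email_rows):
    """Merge both feeds into a single timeline sorted by timestamp."""
    events = [from_call_log(r) for r in call_rows]
    events += [from_email_log(r) for r in email_rows]
    return sorted(events, key=lambda e: e.timestamp)


if __name__ == "__main__":
    calls = [{"caller": "555-0100", "callee": "555-0199", "epoch": 1371427200}]
    emails = [{"from": "a@example.org", "to": "b@example.org",
               "sent_at": "2013-06-16T12:00:00+00:00"}]
    for event in unify(calls, emails):
        print(event.timestamp.isoformat(), event.source,
              event.actor, "->", event.counterpart)
```

Once every source is mapped onto the same record shape, the temporal, relational and network analyses in the list can run over one timeline instead of one per feed.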

The goal is to enable analysts to conduct rich, iterative cross-channel investigations that span many large datasets of different formats originating from various internal or external sources. To enable this you need indexing and hypothesis testing capabilities.

Indexing… Hadoop on steroids… According to InformationWeek, the centerpiece of the NSA’s data-processing capability is Accumulo, a highly distributed, massively parallel key/value store capable of analyzing structured and unstructured data. Accumulo is based on Google’s BigTable data model, but the NSA added a cell-level security feature that makes it possible to set access controls on individual bits of data. Without that capability, valuable information might remain out of reach of intelligence analysts, who would otherwise have to wait for sanitized data sets scrubbed of personally identifiable information.
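
To make cell-level security concrete, here is a toy sketch in plain Python. It is not the Accumulo API: the `Cell` type, the `scan` function and the labels are invented for illustration, and visibility is simplified to "the reader must hold every label on the cell", whereas real Accumulo visibility expressions also allow OR and grouping.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    column: str
    visibility: frozenset  # labels ALL required to read this cell (simplified)
    value: str


def scan(cells, authorizations):
    """Return only the cells whose labels are all held by the caller."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.visibility <= auths]


# Hypothetical table: three cells in the same row, each with its own labels.
TABLE = [
    Cell("person:42", "contact:phone", frozenset({"analyst"}), "555-0100"),
    Cell("person:42", "contact:name", frozenset({"analyst", "pii"}), "J. Doe"),
    Cell("person:42", "stats:call_count", frozenset(), "17"),
]

if __name__ == "__main__":
    for auths in ([], ["analyst"], ["analyst", "pii"]):
        visible = [c.column for c in scan(TABLE, auths)]
        print(auths, "->", visible)
```

The point of filtering per cell rather than per table is that an analyst without the `pii` label can still query the row and see the phone number and call count, instead of waiting for a scrubbed copy of the whole data set.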

Slicing and dicing… hypothesis testing… Once ingested into and/or connected to the “PRISM” framework, data is quickly accessible to analysts in a rich data model that contains metadata as well as temporal, statistical, geospatial, and relational-behavioral information.

According to an NSA presentation at a Carnegie Mellon technical conference, graph search in particular is a powerful tool for investigation. The in-depth presentation describes the 4.4-trillion-node graph database the agency is running on top of Accumulo. Nodes are essentially bits of information (phone numbers, numbers called, locations) and the relationships between those nodes are edges. NSA’s graph uncovered 70.4 trillion edges among those 4.4 trillion nodes. That’s an ocean of information, but just as Facebook’s graph database can help you track down a long-lost high school classmate within seconds, security-oriented graph databases can help spot threats.
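
As a quick aside, 70.4 trillion edges over 4.4 trillion nodes is a ratio of 16 edges per node. To show what graph search means at toy scale, here is a minimal sketch that treats phone numbers as nodes and calls as edges, then uses breadth-first search to find the shortest chain of contacts between two numbers. The call records and function names are made up, and a plain in-memory dictionary stands in for a distributed store; this is not the NSA's implementation.

```python
from collections import defaultdict, deque


def build_graph(call_records):
    """Undirected adjacency list: each number is a node, each call an edge."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of contacts or None."""
    if start == goal:
        return [start]
    previous = {start: None}   # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue       # already visited
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the back-pointers from goal to start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


if __name__ == "__main__":
    # Hypothetical call records (caller, callee).
    calls = [("555-0001", "555-0002"), ("555-0002", "555-0003"),
             ("555-0003", "555-0004"), ("555-0001", "555-0005")]
    graph = build_graph(calls)
    print(shortest_path(graph, "555-0001", "555-0004"))
```

Breadth-first search explores contacts one hop at a time, so the first chain it finds is also the shortest, which is the "how is A connected to B" question an investigator would ask.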

The underlying architecture probably looks something like this… (again, extrapolated from the documentation, available on the Web, of Palantir, an In-Q-Tel funded company. In-Q-Tel is an intelligence agency venture fund).

Palantir

In Summary

A fuller picture of the exact operation of PRISM will emerge in the coming weeks and months. Stay tuned as I explore what PRISM is and, crucially, what it isn’t. I am really curious about the architecture and techniques being used to extract patterns.

Notes and References
  1. PRISM is not the only Big Data analytics program out there. Recently, the Guardian released details of another N.S.A. data-mining program, called Boundless Informant. This data mining tool appears to record and analyze where intelligence comes from; it can show on a map the amount of intelligence the N.S.A. collects from every country in the world.
  2. According to the Guardian, in March 2013, the N.S.A. collected 97 billion pieces of intelligence; over a separate 30 day period ending in March, the agency collected almost 3 billion pieces of intelligence from within the United States.
  3. http://en.wikipedia.org/wiki/PRISM_(surveillance_program)
  4. GigaOM cited a report from Federal Computer Week which said the Central Intelligence Agency has contracted Amazon Web Services to build a private cloud. Neither the CIA nor Amazon has confirmed the deal, which the report said was worth $600 million over 10 years.
  5. NSA has shared Accumulo with the Apache Foundation, and the technology has since been commercialized by Sqrrl, a startup launched by six former NSA employees. Sqrrl has supplemented the Accumulo technology with analytical tools including SQL interfaces, statistical analytics interfaces, text search and graph search engines.
  6. InformationWeek source: http://www.informationweek.com/big-data/news/big-data-analytics/defending-nsa-prisms-big-data-tools/240156388
  7. NSA graph presentation: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Published at DZone with permission of Ravi Kalakota, author and DZone MVB. (source)
