A Peek Inside Cisco’s Hadoop Security Machine
The Internet is the ultimate invention of man, a creation that will forever change how humans work, live, and play. But for all the good it’s capable of, the Internet has also created a comfortable home for cybercriminals, who use increasingly sophisticated techniques to siphon hundreds of billions of dollars from the global economy. One company that’s upping the ante in the battle against cybercriminals is Cisco, which is using a 60-node Hadoop cluster to separate criminal signals from the Internet’s noise.
Cisco started using Apache Hadoop about three years ago to provide a centralized repository for storing and analyzing various pieces of security data. Like all big Internet security firms, Cisco collects data from its customers’ firewalls, IPSs, and security appliances so it can better understand and react to evolving security threats, such as zero-day vulnerabilities, spear phishing attempts, and other cybercriminal techniques. After analyzing this log data (among other sources), Cisco updates its customers’ security devices, as part of one big feedback loop.
Prior to Hadoop, each of Cisco’s security product lines collected and analyzed data individually, says Jisheng Wang, technical leader of the Threat Research, Analysis, and Communications (TRAC) group at Cisco. “Before we had this platform, it was very hard for people to work together to derive intelligence” from the security data, Wang tells Datanami. “We had some intelligence there, but it was mostly isolated. This Hadoop solution is a kernel of the whole security ecosystem within Cisco.”
Today, Cisco’s Global Security Intelligence Operations (SIO) group operates a 60-node, 1,000-core Hadoop cluster based on MapR Technologies’ M7 distribution. Every day, about 20 TB of raw log data lands in Global SIO’s Hadoop cluster in Silicon Valley from local SIOs and data centers around the world. The data includes telemetry collected from Cisco’s IPS, firewall, email, and Web application logs; freely sourced data from the Internet, such as Whois, GeoIP, and botnet/darknet data; and malware sandboxing, file reputation, and end-user logs from SourceFire FireAMP, currently hosted on Amazon Web Services.
All told, Cisco expects to collect up to a million events per second from nearly 100 different channels over tens of thousands of distributed sensors. Making sense of all this structured, semi-structured, and unstructured data is not an easy task, but Hadoop makes it easier, Wang says.
Signal Detection
The company has four primary ways of using Hadoop to manipulate, parse, and analyze the data for security threats:
- Stream processing. The Global SIO uses a combination of stream processing products for complex event processing, including Storm, Truviso, and Spark Streaming. The first priority here is extracting the right piece of data so it can be aggregated in the right way in Cisco’s data warehouse, which is based on HBase. “Stream processing is used to do real-time detection or alerting of the threats,” Wang says. “When the data comes in, if you have known pattern of the threat you are looking for, you can do the real-time classification or detection there.”
- Interactive SQL. Cisco uses interactive SQL in two ways: for end-customer reporting and for threat detection. The Hadoop cluster is a handy place to store customers’ individual log data in the cloud, and Cisco makes it available for SQL-based reporting. Data scientists and analysts in Global SIO also write complicated SQL queries (using Cloudera Impala, Drill, and Splice Machine) to flesh out new security or threat patterns.
- Batch processing. Data scientists and analysts in Global SIO use a variety of batch-oriented, MapReduce-based technologies, including Mahout and 0xdata machine learning algorithms, to build threat detection models, with the aim of identifying emerging security and threat patterns across large amounts of data.
- Graph processing. The company has recently started exploring the possibility of graph processing in Hadoop. It uses graph analytics technology from GraphLab, as well as the open source Titan and Gremlin graph technologies. Graph processing is being used for end-customer reporting, as well as for accelerated detection of cybercriminal signals by the internal Global SIO team.
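The stream-processing approach in the first bullet, matching incoming events against known threat patterns as they arrive, can be illustrated with a minimal sketch in plain Python. It does not use Storm or Spark Streaming, and the event fields, the indicator set, and the function name `classify_stream` are hypothetical, not taken from Cisco’s system.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical indicators; a real deployment would load these from a threat feed.
KNOWN_BAD_DOMAINS: Set[str] = {"bad.example", "evil.example"}


def classify_stream(
    events: Iterable[Dict[str, str]],
    bad_domains: Set[str] = KNOWN_BAD_DOMAINS,
) -> Iterator[Tuple[Dict[str, str], str]]:
    """Label each event as it arrives, without buffering the whole stream."""
    for event in events:
        # Normalize case so "EVIL.example" matches "evil.example".
        domain = event.get("domain", "").lower()
        label = "alert" if domain in bad_domains else "pass"
        yield event, label


# Two made-up events; only the second one hits a known indicator.
events = [
    {"src": "10.0.0.5", "domain": "news.example"},
    {"src": "10.0.0.7", "domain": "EVIL.example"},
]
labels = [label for _, label in classify_stream(events)]
```

Because the function is a generator, each event is classified as soon as it is read, which is the property that makes known-pattern detection suitable for real-time alerting.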
The Hadoop stack at Cisco Global SIO
Wang is particularly excited about the potential for graph databases and graph analytics to assist with Global SIO’s mission of identifying the bad guys lurking on the Internet. Just as social networks like Facebook use graph databases to build connections between users, Cisco uses its graph database to determine connections between known cybercriminals, Internet domains, IP addresses, and other entities on the Web.
Wang says Cisco has solid information on about 25 million to 30 million Internet domains. For these, it knows whether they are run by above-board organizations or people, or by criminals. However, it lacks solid information on the world’s other 180 million domains. With its graph database, it can draw connections among those domains that would otherwise be difficult to make.
“The idea is to find the unknown bad domains,” Wang says. Cisco is able to trace the connections from the known bad domains to determine if there are any similarities or common aspects to the unknown domains. It can even identify passive domains before they have been used for any attacks, Wang says.
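One way to picture tracing connections outward from known bad domains is a breadth-first walk over a graph that links domains to the IP addresses they share. The sketch below is a simple guilt-by-association heuristic written for illustration; it is not Cisco’s algorithm and does not use Titan or Gremlin. The node naming scheme (`domain:` and `ip:` prefixes), the sample graph, and the function name `suspicious_neighbors` are all assumptions.

```python
from collections import deque
from typing import Dict, Set


def suspicious_neighbors(
    graph: Dict[str, Set[str]],
    known_bad: Set[str],
    known_good: Set[str],
    max_hops: int = 2,
) -> Dict[str, int]:
    """Return unclassified domains within max_hops of a known-bad node,
    mapped to their hop distance from the nearest known-bad node."""
    distance: Dict[str, int] = {}
    seen = set(known_bad)
    queue = deque((node, 0) for node in sorted(known_bad))
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            queue.append((neighbor, hops + 1))
            # Report only domains that have no verdict yet.
            if neighbor.startswith("domain:") and neighbor not in known_good:
                distance[neighbor] = hops + 1
    return distance


# Hypothetical graph: an unknown domain shares an IP with a known bad one.
graph = {
    "domain:bad.example": {"ip:203.0.113.9"},
    "ip:203.0.113.9": {"domain:bad.example", "domain:unknown1.example"},
    "domain:unknown1.example": {"ip:203.0.113.9", "ip:198.51.100.4"},
    "ip:198.51.100.4": {"domain:unknown1.example", "domain:shop.example"},
    "domain:shop.example": {"ip:198.51.100.4"},
}
flagged = suspicious_neighbors(
    graph,
    known_bad={"domain:bad.example"},
    known_good={"domain:shop.example"},
)
```

In this toy graph, `unknown1.example` is flagged because it sits two hops from a known bad domain through a shared IP address. A production system would presumably weight edges and combine many relationship types rather than rely on hop count alone.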
“We are absolutely certain we’re the first industry group that use graph analytics in security,” Wang says. “The value for graph is also to speed up or to ease the investigation, either from the data science point of view, or from the end-customer point of view, to make that job easier and to make it possible for us to look at a much larger scale data for correlation.”
Signal Amplification
Cisco chose MapR Technologies’ distribution because of the strong performance of HBase on its Hadoop implementation, Wang says. Because of the high volume and variety of data coming into the Global SIO, the company relies heavily on HBase as an aggregated data store. “The challenge comes when the size of the data grows to some scale, hundreds of PB,” Wang says. “You can ingest the data into HDFS, but you can’t load it out of HDFS efficiently for any of the use cases. The volume of the data is so big, and you can’t organize or aggregate the data in a proper way for use case.”
To deal with this challenge, Cisco is aggregating the data before loading it as a value into HBase. This makes for more efficient downstream processing of the data. “We did some benchmarking test internally between MapR HBase versus Apache HBase. MapR HBase had outperformed Apache HBase in terms of the throughput of averages, the latency, and also more importantly the durability of the concurrent reads and writes,” Wang says.
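The idea of aggregating before loading can be sketched as follows: raw events are rolled up under a composite row key, and only the rolled-up value is written to the key-value store. This is an illustrative sketch, not Cisco’s schema. The `domain#hour` key layout, the event shape, and the function names are assumptions, and a plain dictionary stands in for an HBase table.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


def row_key(domain: str, ts: int) -> str:
    """Build a hypothetical composite key: domain plus the UTC hour bucket."""
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d%H")
    return f"{domain}#{hour}"


def aggregate(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Collapse raw (domain, unix_timestamp) events into one count per row key."""
    counts: Counter = Counter()
    for domain, ts in events:
        counts[row_key(domain, ts)] += 1
    return dict(counts)


# Four raw events become three rows to write instead of four.
raw_events = [
    ("a.example", 0),
    ("a.example", 1800),
    ("a.example", 3600),
    ("b.example", 10),
]
table = aggregate(raw_events)  # stand-in for writes to a key-value store
```

The saving is trivial here, but at a million events per second, writing one aggregated value per key and time bucket greatly reduces the number of rows that downstream jobs have to scan.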
In the final analysis, Cisco’s 200-person Global SIO group is finding cybercriminals, malware, and compromised websites at a much faster rate than the company’s individual teams could before, and much of that improvement can be attributed to Hadoop. With Hadoop, Cisco’s Global SIO has a fighting chance to detect security events in time to help protect its customers.
It’s all about signal amplification, Wang says. “If you have some weak signals across different dimensions, like Web and email, if you look at different individual dimensions, you cannot catch that signal. It’s too weak,” he says. “With this Hadoop-based big data infrastructure, we can integrate massive data from different dimensions together and finally identify this weak signal.”
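Wang’s point about weak signals can be made concrete with a small worked example. The sketch below combines per-dimension suspicion scores with a noisy-OR rule, a standard way of merging independent pieces of evidence. The article does not say how Cisco combines its signals, so the rule, the scores, and the alert threshold are all assumptions chosen for illustration.

```python
from math import prod
from typing import Dict


def noisy_or(scores: Dict[str, float]) -> float:
    """Combine independent suspicion scores in [0, 1] into one score.

    Each score is read as the probability that its dimension alone
    indicates a threat; the result is the probability that at least
    one of them does.
    """
    return 1.0 - prod(1.0 - s for s in scores.values())


ALERT_THRESHOLD = 0.5  # hypothetical cutoff

# Hypothetical scores for one entity seen in three separate data sources.
scores = {"web": 0.3, "email": 0.3, "dns": 0.3}

individually_flagged = [dim for dim, s in scores.items() if s >= ALERT_THRESHOLD]
combined = noisy_or(scores)
combined_flagged = combined >= ALERT_THRESHOLD
```

No single dimension crosses the threshold here, so `individually_flagged` is empty, yet the combined score is 1 - 0.7³ = 0.657, which does. That is the effect of integrating data from several dimensions in one platform rather than analyzing each in isolation.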
Cisco doesn’t pretend that its Hadoop security machine can catch 100 percent of the security threats that exist on the Web. While Hadoop gives Cisco a big advantage, cybercriminals are already trying to find ways to mask their signals to evade detection. “It’s definitely a cat and mouse game,” Wang says. “We’re not going to solve 100 percent of the problems, even with Hadoop. But what we can do is raise the bar to make their jobs harder and harder. At the end of the day, you can block most of the attackers, the normal or above-average attackers, and identify the remaining 0.1 percent to 1 percent of suspicious activity, which will be finally investigated and confirmed by security experts.”