A Deep Learning Approach for Detecting Unknown Malware
All of the major antivirus vendors at this point are moving towards machine learning approaches to keep up with the evolving threat landscape. That’s the good news. However, with upwards of 1 million new pieces of malware released into the wild per day, traditional machine learning approaches may not be up to the task. Now a company called Deep Instinct is hoping to take malware detection to the next level by using deep learning.
In the cat and mouse game that is Internet security, cybercriminals and bad actors constantly try to pull one over on the rest of us. If they can sneak a new piece of malicious code past our endpoint detection systems, they can reap the financial rewards.
But here’s the thing: cybercriminals don’t need new code every time. They can use an old piece of malware and make some slight tweaks to get it past security software. Or they can create a new exploit for an old vulnerability, the technique used in May’s WannaCry attacks, which impacted 350,000 systems across the world.
Tracking vulnerabilities and the exploit code that hackers write is a huge task that falls to researchers in the cybersecurity industry. In the beginning, signature-based approaches that looked for snippets of code dominated the malware detection racket. When cybercriminals caught on to that approach, security companies were forced to adopt more complex rules-based approaches. But the bad guys got smart to that approach, too.
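To make the signature-based idea concrete, here is a minimal Python sketch, assuming a signature is nothing more than a byte snippet to search for inside a file. The signature name and bytes are invented for illustration and do not come from any real antivirus product. It also shows why a slight tweak to old malware can slip past this kind of scanner.

```python
# Hypothetical signature database: name -> byte snippet (invented for illustration).
SIGNATURES = {
    "Example.Dropper.A": b"\xde\xad\xbe\xef\x13\x37",
}


def scan(data):
    """Return the names of all signatures whose byte snippet appears in data."""
    return [name for name, snippet in SIGNATURES.items() if snippet in data]


original = b"header" + b"\xde\xad\xbe\xef\x13\x37" + b"payload"
# A "slight tweak": one byte inside the matched snippet is changed.
tweaked = original.replace(b"\xbe\xef", b"\xbe\xee")

print(scan(original))  # ['Example.Dropper.A']
print(scan(tweaked))   # []
```

A single changed byte is enough to defeat the exact match, which is the weakness that pushed vendors toward rules and, later, machine learning.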
The next evolution in malware detection involved machine learning. Symantec uses its “advanced machine learning” (AML) to learn to identify attributes of malicious software, while McAfee prefers its approach to “human-machine teaming” to boost malware detection. Kaspersky Labs has been using machine learning to bolster malware detection in its software for about 10 years.
However, the number of new pieces of malware being released continues to skyrocket. In 2015, Symantec said it detected 317 million new pieces of malware the previous year, or nearly 1 million per day. In 2016, Kaspersky Labs said that it was detecting about 323,000 new malware files per day, up from about 70,000 in 2011, according to a story in Dark Reading. In its recent McAfee Labs Threats Report for the third quarter of 2017, the vendor said it detected 57.6 million new samples, or about 640,000 per day.
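The per-day figures follow from simple division. The short Python sketch below reproduces them, assuming a 365-day year and a 90-day quarter; the exact period lengths used in the vendors' reports are not given here.

```python
def per_day(total_samples, days):
    """Average number of new samples per day over a reporting period."""
    return total_samples / days


symantec_daily = per_day(317_000_000, 365)  # Symantec's annual figure
mcafee_daily = per_day(57_600_000, 90)      # McAfee's quarterly figure
kaspersky_growth = 323_000 / 70_000         # Kaspersky's daily rate vs. 2011

print(round(symantec_daily))       # 868493
print(round(mcafee_daily))         # 640000
print(round(kaspersky_growth, 1))  # 4.6
```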
The exact number of new malware samples generated each day is not important. What is important to a civil online society is that the good guys have a way to detect malware before the bad guys have a chance to do much damage with it.
Going Deep
Three years ago, a pair of Israeli cybersecurity researchers, including Guy Caspi and Eli David, founded the company Deep Instinct with a daring plan to utilize emerging deep learning techniques to improve malware detection capabilities. The idea was to build a system that could scale at the same staggering rate as new malware is being generated.
The scalability advantages of deep learning compared to traditional machine learning made it a good fit for this work, said Yaniv Shechtman, Deep Instinct’s director of product management.
“If you’re looking at hundreds of millions of files per day, and you need to process this data to get a profound understanding of what that is, and if it needs to be highly accurate, the traditional machine learning frameworks cannot meet that requirement,” he said.
It took more than two years to develop Deep Instinct’s deep learning framework from scratch. “We’re not utilizing TensorFlow, Caffe, or any other third-party deep learning libraries provided by Google, Facebook, or Baidu,” Shechtman told Datanami. “We developed our own libraries from scratch because utilizing the deep learning for cybersecurity is far more sophisticated than using it for voice recognition or image processing, or even for autonomous cars.”
Getting the training data prepped and labeled was the biggest challenge in building the deep learning cybersecurity framework. The training data, which was sourced from public repositories, third-party vendors, and even the Dark Web, had to be hammered into similar sizes for the neural network to correctly process it. That’s a challenge when file sizes are all over the map, from a benign sample that weighs in at 50KB to a malware sample that’s 100MB. (Luckily, the data scientists didn’t need to extract features, since that part is handled automatically by the neural network.)
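One common way to give a neural network inputs of uniform size is to truncate long files and zero-pad short ones. The Python sketch below illustrates that idea only. The article does not say how Deep Instinct actually normalizes its samples, and the function name and the 1,000,000-byte default are assumptions made for the example.

```python
def to_fixed_length(data, length=1_000_000):
    """Truncate or zero-pad raw file bytes to exactly `length` bytes."""
    if len(data) >= length:
        return data[:length]                       # drop everything past the limit
    return data + b"\x00" * (length - len(data))   # pad the tail with zero bytes


# Two stand-in "files" of very different sizes (52 and 502 bytes).
small = to_fixed_length(b"MZ" + b"\x90" * 50, length=64)
large = to_fixed_length(b"MZ" + b"\x90" * 500, length=64)

print(len(small), len(large))  # 64 64
```

Truncation discards whatever sits past the cutoff, so a real system would have to choose the length, or a smarter sampling scheme, with care.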
“This is the challenge that we faced for first two years of the company,” Shechtman said. “But it’s not only developing the framework that was the challenge – it’s how you train it.”
The company found that training their “Deep Brain” (as they call their deep learning engine) with an acceptable number of samples would take up to two months using standard CPU-based servers. So the company got in touch with NVIDIA and built its own GPU cluster. As a result, the company has lowered its training time to 48 hours.
Real World Impact
The company began selling its product about six months ago, and today its software is protecting about 70,000 endpoints for 20-plus customers. These customers get the pointy end of the spear, as it were – a slim piece of Windows software that weighs in at 20MB to 30MB.
This piece of software uses the information gleaned from the deep learning training to conduct inference on new files in the wild. The software takes a 1 percent hit on the PC’s CPU and adds about 20 to 30 milliseconds of latency to file-access requests, which is not enough to really notice.
The company claims that its deep learning approach gives it better performance than competitors that use more traditional machine learning approaches. It says its threat detection accuracy is more than 98%, compared to less than 62.5% for its competitors. It says its false positive rate is less than 0.01% on a data set with 100,000 files; by comparison, it claims the false positive rate for its competitors is 2.5% to 5%.
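To put the quoted rates in absolute terms, the Python sketch below converts a false positive rate into a count of wrongly flagged files. It assumes, purely for illustration, that all 100,000 files in the data set are benign; the article does not describe the data set's makeup.

```python
def false_positive_count(benign_files, fp_rate_percent):
    """Number of benign files wrongly flagged at a given false positive rate."""
    return round(benign_files * fp_rate_percent / 100)


deep_instinct_ceiling = false_positive_count(100_000, 0.01)  # "less than 0.01%"
competitor_low = false_positive_count(100_000, 2.5)
competitor_high = false_positive_count(100_000, 5)

print(deep_instinct_ceiling)            # 10
print(competitor_low, competitor_high)  # 2500 5000
```

Under that assumption, the claimed gap is fewer than 10 false alarms versus 2,500 to 5,000.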
Because Deep Instinct’s framework uses deep learning techniques to identify malware based on a multitude of similarities to billions of previous malware samples, the system is fairly self-contained, and needs to be retrained only once every six to eight months. That means its endpoint protection is nearly always up-to-date, needing an update just once or twice per year, while its machine learning-based competitors must check in daily for updates.
This approach enabled Deep Instinct’s software agent to detect the WannaCry and NotPetya cryptoworms without ever having seen them before, Shechtman said. “They were detected by a Deep Brain that was trained one year earlier than the actual attack,” he said. “Of course, we had newer versions by then. But if you look at the level of accuracy over time, we were far more accurate than the others, even if it was trained a year before that.”
The positive results have not gone unnoticed by NVIDIA, which gave Deep Instinct its “Most Disruptive Startup” award at its 2017 Inception Awards. NVIDIA also participated last year in Deep Instinct’s $32 million Series B funding round, which was led by NCTP.
As the quantity and quality of malware and advanced persistent threats (APTs) continue to change, cybersecurity firms will need new tools to stay on top of them. Traditional machine learning, once looked at as a must-have tool to stay ahead of cybercriminals, may not be enough, especially as the evidence mounts that cybercriminals are using machine learning themselves.
“Hackers are becoming more and more sophisticated and there is a need for a new technology to evolve in order to keep up with the amount of new malware threats that are introduced into the wild,” Shechtman said. “Our core competency is detecting the unknown. Most of the attacks today are unknown attacks. This is their main challenge.”