Malicious IP Address Prediction

Description
IP blacklisting is a popular technique for bolstering an enterprise's security, in which access to and from designated IP addresses is explicitly restricted. The fundamental idea behind blacklists is to continually add IP addresses that reputable entities, such as security researchers, have labeled as malicious. Currently, IP blacklisting is a reactive method: malicious IP addresses are identified only after their engagement in malicious activities (e.g., hosting malware samples or sending spam emails) is detected. This thesis project aims to address this issue by laying the groundwork for a machine learning tool that proactively identifies malicious IP addresses. The ground truth data derives from VirusTotal, a company that synthesizes security knowledge from prominent sources, such as Symantec, Fortinet, and ESET. I passed 307,621 IP addresses found in posts on the D2web (deep and dark web) through VirusTotal. If at least one detected URL is associated with an IP address and VirusTotal deems it positive, I label the IP address as positive (malicious); otherwise, I label it as negative (non-malicious). To give some insight into the ground truth, 6,147 of the original 307,621 IP addresses (about 2%) were identified as positive. Furthermore, to quantify the prediction capabilities of the models, I introduce a metric called lead time. Lead time is the difference between the date an IP address was first seen on the D2web and its earliest date on VirusTotal. For example, if an IP address was mentioned on the D2web on 1/5/2017 and first appeared on VirusTotal on 1/25/2017, then its lead time is 20 days. After feature selection, in which I handpicked features from the data mined from the D2web, I tried various combinations of classifiers and feature sets in order to find the best model.
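The labeling rule and the lead time metric described above can be illustrated with a minimal Python sketch. The function names and the input shape (a list of per-URL positive counts) are assumptions made for illustration; they are not the thesis code and do not reflect the actual VirusTotal response format. Dates are read as month/day/year.

```python
from datetime import date


def label_ip(detected_url_positives):
    """Return True (malicious) if at least one detected URL tied to the IP
    address was flagged positive, and False (non-malicious) otherwise.

    `detected_url_positives` is a hypothetical list of per-URL positive
    counts, used here only to illustrate the rule.
    """
    return any(count > 0 for count in detected_url_positives)


def lead_time_days(first_seen_d2web, first_seen_virustotal):
    """Days between the first D2web mention and the earliest VirusTotal date.

    A positive value means the D2web mention came first.
    """
    return (first_seen_virustotal - first_seen_d2web).days


# Worked example from the text: D2web on 1/5/2017, VirusTotal on 1/25/2017.
example_lead = lead_time_days(date(2017, 1, 5), date(2017, 1, 25))

# Share of positive labels in the ground truth (6,147 of 307,621).
positive_rate = 6147 / 307621
```

Under these assumptions, `example_lead` reproduces the 20-day figure from the example, and `positive_rate` shows that roughly 2% of the IP addresses are labeled positive, which indicates a heavily imbalanced data set.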
The final machine learning models implement temporal cross validation with a Random Forest classifier: I train a model on data from 1/1/2016 up until a testing month in 2017, and test on data from the testing month. The following results are from a model tested on January 2017, which exhibits median performance among the final models. The true positive rate is 0.2558, the false positive rate is 0.3612, and the average lead time (for leading true positives) is 193 days, where the model picks up 33.33% of all leading true positives. Although the model finds a respectable number of true positives, it picks up too many false positives. Thus, my approach in its current state is ineffective at predicting malicious IP addresses, and additional effort will be required to transform the current work into a viable tool.
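The temporal split and the two reported rates can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the record fields (`ip`, `seen`, `label`), the toy data, and the choice to end the training window just before the testing month are all hypothetical, and no classifier is fitted here.

```python
from datetime import date


def temporal_split(records, test_year, test_month, start=date(2016, 1, 1)):
    """Split records into (train, test) by first-seen date.

    train: start <= seen < first day of the testing month (assumed exclusive)
    test:  seen falls inside the testing month
    """
    test_start = date(test_year, test_month, 1)
    if test_month == 12:
        test_end = date(test_year + 1, 1, 1)
    else:
        test_end = date(test_year, test_month + 1, 1)
    train = [r for r in records if start <= r["seen"] < test_start]
    test = [r for r in records if test_start <= r["seen"] < test_end]
    return train, test


def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Toy records with documentation-range IP addresses (hypothetical data).
records = [
    {"ip": "192.0.2.1", "seen": date(2016, 3, 10), "label": 0},
    {"ip": "192.0.2.2", "seen": date(2016, 12, 31), "label": 1},
    {"ip": "192.0.2.3", "seen": date(2017, 1, 1), "label": 1},
    {"ip": "192.0.2.4", "seen": date(2017, 1, 31), "label": 0},
    {"ip": "192.0.2.5", "seen": date(2017, 2, 1), "label": 0},
    {"ip": "192.0.2.6", "seen": date(2015, 12, 31), "label": 0},
]

# Testing month: January 2017, as in the reported model.
train, test = temporal_split(records, 2017, 1)

# Made-up predictions, used only to show how the two rates are computed.
tpr, fpr = rates([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

Splitting by date rather than at random keeps every training record earlier than every test record, so the evaluation mimics predicting on IP addresses the model has not yet seen. In the rate calculation, the false positive rate is taken over all true negatives, so with a heavily imbalanced data set even a moderate rate corresponds to a large absolute number of false alarms.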
Date Created
2018-05
Agent