Welcome! The long-term goal of this research is to construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we have explored techniques that involve classifying URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly to evolving URLs over time.
Our work is ongoing, so please feel free to check back for updates. In the meantime, you can download an anonymized version of our ICML-09 data set as described in the Data Sets section. You can also download sample code for the factor analysis approach we described in our AISTATS-10 paper in the Code section.
Justin Ma, Alex Kulesza, Mark Dredze, Koby Crammer, Lawrence K. Saul, and Fernando Pereira,
Exploiting Feature Covariance in High-Dimensional Online Learning
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 493-500, Sardinia, Italy, May 2010.
Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker,
Identifying Suspicious URLs: An Application of Large-Scale Online Learning
Proceedings of the International Conference on Machine Learning (ICML), pages 681-688, Montreal, Quebec, June 2009.
Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker,
Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs
Proceedings of the ACM SIGKDD Conference, pages 1245-1253, Paris, France, June 2009.
Data Sets
An anonymized 120-day subset of our ICML-09 data set is available from the following links:
The data set consists of about 2.4 million URLs (examples) and 3.2 million features. If you use the data set in published work, please cite the ICML-09 paper in which it was introduced and first described.
Description of Data (Matlab)
The file url.mat contains variables which we describe as follows:
Description of Data (SVM-light)
Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:
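Because the data set has millions of features but each URL activates only a few of them, the SVM-light files store every example sparsely as a label followed by `index:value` pairs. The following is a minimal sketch of reading that format in Python, assuming the standard SVM-light line layout. The function names and the inline sample lines are hypothetical and are not taken from the archive, and optional `qid:` fields are not handled.

```python
from typing import Dict, List, Tuple

Example = Tuple[int, Dict[int, float]]


def parse_svmlight_line(line: str) -> Example:
    """Parse one line of the form '<label> <index>:<value> ... # comment'."""
    # Anything after '#' is a comment in the SVM-light format.
    body = line.split("#", 1)[0].strip()
    if not body:
        raise ValueError("line contains no example")
    fields = body.split()
    # Labels are written as +1 / -1; go through float() so '+1' and '1.0' both work.
    label = int(float(fields[0]))
    features: Dict[int, float] = {}
    for field in fields[1:]:
        index, value = field.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def parse_svmlight_text(text: str) -> List[Example]:
    """Parse several lines, skipping blank and comment-only lines."""
    examples: List[Example] = []
    for line in text.splitlines():
        if line.split("#", 1)[0].strip():
            examples.append(parse_svmlight_line(line))
    return examples


# Hypothetical sample lines, made up for illustration only.
sample = "+1 4:0.5 17:1 3231961:1\n-1 2:1 17:0.25 # a comment\n"
examples = parse_svmlight_text(sample)
```

To read an actual file, pass its contents to `parse_svmlight_text`. If scikit-learn is installed, `sklearn.datasets.load_svmlight_file` reads the same format directly into a sparse matrix and is likely the more practical choice for files of this size.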
Code
Here is our code for approximating the covariance matrix using the factor analysis-based approach we introduced in our AISTATS-10 paper. It contains code to run synthetic experiments over different online algorithms. [Download]
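For readers unfamiliar with factor analysis, the general idea is to model a d-by-d matrix as a diagonal matrix plus a rank-k term, diag(psi) + L L^T, which needs only d(k+1) numbers instead of d^2. The sketch below illustrates that generic structure in plain Python. It is not the downloadable code, the names `psi`, `factors`, and `fa_covariance_times_vector` are hypothetical, and it makes no claim about how the AISTATS-10 paper fits or updates these quantities.

```python
from typing import List


def fa_covariance_times_vector(
    psi: List[float], factors: List[List[float]], x: List[float]
) -> List[float]:
    """Return (diag(psi) + L L^T) x without forming the d-by-d matrix.

    psi     -- d diagonal entries
    factors -- L, given as d rows of k entries each
    x       -- vector of length d
    """
    k = len(factors[0]) if factors else 0
    # z = L^T x, a vector of length k (cost O(d * k)).
    z = [0.0] * k
    for row, xi in zip(factors, x):
        for j in range(k):
            z[j] += row[j] * xi
    # result_i = psi_i * x_i + (L z)_i (again O(d * k)).
    return [
        p * xi + sum(row[j] * z[j] for j in range(k))
        for p, row, xi in zip(psi, factors, x)
    ]


# Tiny illustration with d = 3 and k = 1 (values made up).
psi = [1.0, 2.0, 3.0]
factors = [[1.0], [0.0], [2.0]]
result = fa_covariance_times_vector(psi, factors, [1.0, 1.0, 1.0])
```

With millions of features, a full covariance matrix is far too large to store, which is why a structured form whose cost grows linearly in d is attractive for online learning.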