Detecting Malicious URLs

Welcome! The long-term goal of this research is to construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we have explored techniques that involve classifying URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly as malicious URLs evolve.

Our work is ongoing, so please feel free to check back for updates. In the meantime, you can download an anonymized version of our ICML-09 data set as described in the Data Sets section. You can also download sample code for the factor analysis approach we described in our AISTATS-10 paper in the Code section.

Publications

Justin Ma, Alex Kulesza, Mark Dredze, Koby Crammer, Lawrence K. Saul, and Fernando Pereira,
Exploiting Feature Covariance in High-Dimensional Online Learning
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 493-500, Sardinia, Italy, May 2010.

Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker,
Identifying Suspicious URLs: An Application of Large-Scale Online Learning
Proceedings of the International Conference on Machine Learning (ICML), pages 681-688, Montreal, Quebec, June 2009.

Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker,
Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs
Proceedings of the ACM SIGKDD Conference, pages 1245-1253, Paris, France, June 2009.

Data Sets

An anonymized 120-day subset of our ICML-09 data set is available from the following links:

URL Data Set (Matlab) (470 MB)
URL Data Set (SVM-light) (234 MB)

The data set consists of about 2.4 million URLs (examples) and 3.2 million features. If you use the data set in published work, please cite the ICML-09 paper in which it was introduced and first described.

Description of Data (Matlab)

The file url.mat contains the following variables:

FeatureTypes --- A list of column indices for the data matrices that are real-valued features.
DayX (where X is an integer from 0 to 120) --- A struct containing the data for day X.
- DayX.data --- an N x D data matrix where N is the number of URLs (rows), and D is the number of features (columns).
- DayX.labels --- an N x 1 label vector where 1 indicates a malicious URL and 0 indicates a benign URL.
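
As a rough illustration of the layout above, the following Python sketch enumerates the per-day variable names and partitions feature columns using a FeatureTypes-style list. The function names are hypothetical and not part of the data set. The sketch assumes the column indices are 1-based (the Matlab convention), which the description above does not state. Reading url.mat itself would need a MAT-file reader such as scipy.io.loadmat, assuming the file is in a format that reader supports; that step is left out here.

```python
def day_variable_names(first_day=0, last_day=120):
    """Return the per-day variable names: Day0, Day1, ..., Day<last_day>."""
    return [f"Day{x}" for x in range(first_day, last_day + 1)]


def split_feature_columns(num_features, real_valued_columns):
    """Partition column indices into real-valued columns and all others.

    ASSUMPTION: indices are 1-based, as is usual in Matlab. Adjust the
    range below if the file turns out to use 0-based indices.
    """
    real = sorted(set(real_valued_columns))
    for j in real:
        if not 1 <= j <= num_features:
            raise ValueError(f"column index {j} is outside 1..{num_features}")
    real_set = set(real)
    others = [j for j in range(1, num_features + 1) if j not in real_set]
    return real, others


if __name__ == "__main__":
    names = day_variable_names()
    print(names[0], names[-1])

    # Toy example: 6 feature columns, of which columns 2 and 5 are real-valued.
    real, others = split_feature_columns(6, [2, 5])
    print(real, others)
```

Knowing which columns are real-valued is mainly useful for preprocessing choices, for example scaling only those columns and leaving the rest untouched.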

Description of Data (SVM-light)

Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:

FeatureTypes --- A text file listing the feature indices that correspond to real-valued features.
DayX.svm (where X is an integer from 0 to 120) --- The data for day X in SVM-light format. A label of +1 corresponds to a malicious URL and -1 corresponds to a benign URL.
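
For readers who want to inspect the DayX.svm files without extra tooling, here is a minimal Python sketch of a reader for the usual SVM-light line layout (`label index:value index:value ... # optional comment`). The function names are hypothetical, and the sample lines are made up for illustration rather than taken from the data set. It also maps the +1/-1 labels to the 1/0 convention used in the Matlab version.

```python
def parse_svmlight_line(line):
    """Parse one SVM-light line into (label, {feature_index: value}).

    Returns None for blank or comment-only lines.
    """
    body = line.split("#", 1)[0].strip()  # drop a trailing "# comment"
    if not body:
        return None
    tokens = body.split()
    label = int(float(tokens[0]))  # accepts "+1", "-1", "1.0", ...
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def to_binary_label(svm_label):
    """Map +1 (malicious) to 1 and -1 (benign) to 0."""
    return 1 if svm_label > 0 else 0


def read_day(lines):
    """Parse an iterable of lines (for example an open file) into a list."""
    examples = []
    for line in lines:
        parsed = parse_svmlight_line(line)
        if parsed is not None:
            examples.append(parsed)
    return examples


if __name__ == "__main__":
    # Made-up sample lines, not real rows from the data set.
    sample = [
        "+1 4:0.0788 6:1 11:1",
        "-1 2:1 9:0.5 # an optional trailing comment",
        "",
    ]
    for label, features in read_day(sample):
        print(to_binary_label(label), sorted(features.items()))
```

To read a real file, pass an open file handle instead of the sample list, for example `read_day(open("url_svmlight/Day0.svm"))`. Storing each example as a dictionary keeps only the nonzero features, which matters when the feature space runs into the millions.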

Code

Here is our code for approximating the covariance matrix using the factor analysis-based approach we introduced in our AISTATS-10 paper. It contains code to run synthetic experiments over different online algorithms. [Download]

UCSD Computer Science and Engineering
