Crowdsourcing is thriving as a low-cost outsourcing mechanism. Unfortunately, many Web service abusers use crowdsourcing to mount attacks on popular Web services such as Google, Yahoo and Facebook. Crowdsourcing sites and Web service providers therefore need ways to control the influx of abuse-related jobs. In this project, we use topic modeling to identify abuse-related job postings on Freelancer, a popular site for crowdsourcing.

In our AISec-11 paper, we explored the use of latent Dirichlet allocation (LDA) on our Freelancer data set. Our analysis suggests that LDA can provide an effective and largely automated tool for monitoring abuse jobs.

Going forward, we are applying more sophisticated topic models to the data set. Please feel free to check back for updates. The Freelancer data set is available in the Data Set section.



Data Set

From the job postings on Freelancer, we built a term-document matrix of 27,600 terms and 355,386 documents. For preprocessing procedures, please refer to our AISec-11 paper. If you use the data set in published work, please cite the AISec-11 paper.

Description of the Dictionary File

Each line of the file contains a term and its document frequency. The term and the frequency are separated by a tab character.
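
As a rough illustration of the format described above, the dictionary file could be read along the following lines. This is only a sketch: the function name `parse_dictionary` and the sample terms are made up for the example, and the code assumes exactly one tab per line, as the description states.

```python
def parse_dictionary(lines):
    """Map each term to its document frequency.

    Assumes every non-empty line has the form "<term>\t<frequency>",
    as described above. The function name is hypothetical.
    """
    doc_freq = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        term, freq = line.split("\t")
        doc_freq[term] = int(freq)
    return doc_freq


# Made-up sample lines, not taken from the actual data set.
sample_lines = ["alpha\t120\n", "beta\t45\n"]
sample_dictionary = parse_dictionary(sample_lines)
```

For a real file, the same function can be passed an open file handle, since iterating over a text file yields its lines.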

Description of the Term-document Matrix Files

The first line of the file contains the number of documents, and the second line has the number of terms. Next, each document is represented as:
      # of distinct terms in the document
      index of the first term: term frequency of the first term
      index of the second term: term frequency of the second term
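
The layout above might be parsed roughly as follows. This is a sketch under stated assumptions, not a reference reader: the description does not say whether a document's entries share one line or span several, or whether a space follows the colon, so the code treats colons as whitespace and reads the whole file as a flat stream of integers. The function name `parse_term_document_matrix` and the sample text are hypothetical, and the code makes no assumption about whether term indices start at 0 or 1.

```python
def parse_term_document_matrix(text):
    """Parse the matrix format into a list of {term_index: frequency} dicts.

    Assumed token order: number of documents, number of terms, then for
    each document its count of distinct terms followed by that many
    index/frequency pairs. The function name is hypothetical.
    """
    # Treat ':' as whitespace so "12:3" and "12: 3" tokenize identically.
    tokens = [int(t) for t in text.replace(":", " ").split()]
    n_docs, n_terms = tokens[0], tokens[1]
    docs = []
    pos = 2
    for _ in range(n_docs):
        n_distinct = tokens[pos]
        pos += 1
        doc = {}
        for _ in range(n_distinct):
            term_index, freq = tokens[pos], tokens[pos + 1]
            doc[term_index] = freq
            pos += 2
        docs.append(doc)
    return n_docs, n_terms, docs


# Made-up sample: 2 documents over 5 terms.
sample_text = "2\n5\n2 0:3 4:1\n1 2:7\n"
n_docs, n_terms, docs = parse_term_document_matrix(sample_text)
```

Reading the full matrix into a flat token list keeps the sketch short, but for the complete data set a streaming reader or a sparse-matrix representation would likely use less memory.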


UCSD Computer Science and Engineering