Datasets for recommender systems research

This post lists benchmark datasets for recommender systems research. Note that I only provide links to the datasets. Courtesy of entaroadun.

DataSets

Amazon Product Data:

Mobile Recommendation:

Movies Recommendation:

  • MovieLens - Movie Recommendation Data Sets link
  • Yahoo! - Movie, Music, and Images Ratings Data Sets link
  • Cornell University - Movie-review data for use in sentiment-analysis experiments link
  • Netflix Prize Dataset link
  • MovieTweetings - link

Joke Recommendation:

  • Jester - Joke Ratings Data Sets (Collaborative Filtering Dataset) link

Music Recommendation:

  • Last.fm - Music Recommendation Data Sets link
  • Yahoo! - Movie, Music, and Images Ratings Data Sets link
  • Audioscrobbler - Music Recommendation Data Sets link
  • Amazon - Audio CD recommendations link

Books Recommendation:

  • Institut für Informatik, Universität Freiburg - Book Ratings Data Sets link

Food Recommendation:

  • Chicago Entree - Food Ratings Data Sets link

Merchandise Recommendation:

  • Amazon - Product Recommendation Data Sets link

Healthcare Recommendation:

  • Nursing Home - Provider Ratings Data Set link
  • Hospital Ratings - Survey of Patients Hospital Experiences link

Dating Recommendation:

  • www.libimseti.cz - Dating website recommendation (collaborative filtering) link

Scholarly Paper Recommendation:

  • National University of Singapore - Scholarly Paper Recommendation link

POI recommendation:

  • LBSN - https://github.com/rahmanidashti/LBSNDatasets

Criteo

A Kaggle dataset for Criteo display advertising challenge

Criteo is a personalized retargeting company that works with Internet retailers to serve personalized online display advertisements to consumers. The goal of this Kaggle challenge is to predict click-through rates on display ads. It offers a week’s worth of data from Criteo’s traffic. In the labeled training set, which covers a period of 7 days, each row corresponds to a display ad served by Criteo. The samples are chronologically ordered. Positive (clicked) and negative (non-clicked) samples have both been subsampled, at different rates, to reduce the dataset size. There are 13 count features and 26 categorical features. The semantics of these features are undisclosed. Some features may have missing values. The dataset is currently available for download at AWS.

Note that we use only the labeled part of the data as our benchmarking set. We split the data sequentially, similar to the challenge. Since the timestamps are not shown, we make our test set the same size as the test set in the challenge: we take the last 6,042,135 samples (corresponding to the last day) for testing and the remaining samples (corresponding to the first 6 days) for training and validation.

Data fields:

  • Label - Target variable that indicates if an ad was clicked (1) or not (0).
  • I1-I13 - A total of 13 columns of integer features (mostly count features).
  • C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
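The field layout and the sequential split described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the benchmark: the names `CRITEO_COLUMNS`, `parse_criteo_line`, and `sequential_split` are hypothetical, and the sketch assumes the raw file is tab-separated with one sample per line and empty strings for missing values.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Column names following the field list: Label, I1-I13, C1-C26.
CRITEO_COLUMNS: List[str] = (
    ["Label"]
    + [f"I{i}" for i in range(1, 14)]
    + [f"C{i}" for i in range(1, 27)]
)

# Test-set size quoted in the text (the last 6,042,135 samples).
CRITEO_TEST_SIZE = 6_042_135


def parse_criteo_line(line: str) -> Dict[str, Optional[str]]:
    """Parse one row; assumes tab-separated fields, empty field = missing."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CRITEO_COLUMNS):
        raise ValueError(
            f"expected {len(CRITEO_COLUMNS)} fields, got {len(values)}"
        )
    return {
        name: (value if value != "" else None)
        for name, value in zip(CRITEO_COLUMNS, values)
    }


def sequential_split(rows: Sequence, test_size: int) -> Tuple[Sequence, Sequence]:
    """Split chronologically ordered rows; the last `test_size` rows are the test set."""
    if not 0 <= test_size <= len(rows):
        raise ValueError("test_size must be between 0 and len(rows)")
    cut = len(rows) - test_size
    return rows[:cut], rows[cut:]
```

On the full labeled file, `sequential_split(rows, CRITEO_TEST_SIZE)` would return the training/validation part and the test part. Because the samples are chronologically ordered, no shuffling is done before the cut.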

Avazu

A Kaggle dataset for Avazu CTR prediction challenge

Avazu is one of the leading mobile advertising platforms globally. This Kaggle competition aims to predict whether a mobile ad will be clicked and provides 11 days’ worth of Avazu data to build and test prediction models. The data consists of 10 days of labeled click-through data for training and 1 day of unlabeled ads data for testing.

Note that we use only the 10 days of labeled data as our benchmarking set. We split the data sequentially, similar to the challenge: the first 9 days (20141021~20141029) are used for training and validation, and the last day (20141030) is used for testing.

Data fields:

  • id: ad identifier
  • click: 0/1 for non-click/click
  • hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
  • C1 – anonymized categorical variable
  • banner_pos
  • site_id
  • site_domain
  • site_category
  • app_id
  • app_domain
  • app_category
  • device_id
  • device_ip
  • device_model
  • device_type
  • device_conn_type
  • C14-C21 – anonymized categorical variables
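As a rough illustration of the `hour` format and the day-based split described above, the sketch below parses a YYMMDDHH value and assigns a sample to a partition by its date. The function names `parse_avazu_hour` and `avazu_partition` are hypothetical, and the day boundaries are taken from the dates quoted in the text (20141021 through 20141030).

```python
from datetime import datetime, timezone

# Day boundaries from the text, written in the YYMMDD form used by `hour`.
FIRST_DAY = "141021"
TEST_DAY = "141030"


def parse_avazu_hour(hour) -> datetime:
    """Convert a YYMMDDHH value (string or integer) to a UTC datetime."""
    text = str(hour)
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected 8 digits in YYMMDDHH form, got {hour!r}")
    return datetime.strptime(text, "%y%m%d%H").replace(tzinfo=timezone.utc)


def avazu_partition(hour) -> str:
    """Return 'train_valid' for the first 9 days and 'test' for the last day."""
    parse_avazu_hour(hour)  # validates the value before slicing
    day = str(hour)[:6]
    if day == TEST_DAY:
        return "test"
    if FIRST_DAY <= day < TEST_DAY:
        return "train_valid"
    raise ValueError(f"day {day} is outside the labeled 10-day range")
```

For example, `parse_avazu_hour("14091123")` corresponds to 23:00 on Sept. 11, 2014 UTC, matching the field description. The string comparison on `day` works here only because all ten days fall in the same month and year.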

iPinyou

A real-time bidding algorithm competition dataset from iPinyou

Taobao

An ad display/click dataset from Taobao.com

Alimama

A dataset for sponsored product search in Alibaba

Tencent

A click dataset for KDD Cup 2012 from Tencent

TAAC

A dataset for Tencent’s ad algorithm competition in 2018

TikTok

A short video understanding challenge from Bytedance’s TikTok App

  • Recommendation
    • Taobao
    • Amazon: A dataset of product reviews and metadata from Amazon
    • MovieLens
    • Yelp
    • Frappe: A dataset for context-aware app recommendation in Frappe
  • Online advertising
    • Criteo1TB: A Terabyte display advertising dataset from Criteo
    • Outbrain: A dataset of users’ page views and clicks in Outbrain
  • Web search / sponsored search
    • Yandex: A click dataset for personalized Web search challenge from Yandex
    • Avito: A dataset of contextual search ad clicks from Avito
  • CVR datasets
    • YooChoose: A sequence of click and purchase events in an e-commerce website from YooChoose
    • AliCCP: A click dataset gathered from the recommender system in Taobao
    • JData: A dataset for purchase prediction in JD.com
