Datasets for recommender systems research


Published: August 01, 2019

In this post, I present some benchmark datasets for recommender systems. Please note that I only give links to these datasets. Courtesy of entaroadun.

DataSets

Amazon Product Data:

Amazon product data link
SNAP snap

Mobile Recommendation:

Data Set for Mobile App Retrieval link
frappe link
Ali_Mobile_Rec link1 ; link2
Mobile App User Dataset link

Movies Recommendation:

MovieLens - Movie Recommendation Data Sets link
Yahoo! - Movie, Music, and Images Ratings Data Sets link
Cornell University - Movie-review data for use in sentiment-analysis experiments link
Netflix Prize Dataset link
MovieTweetings - link

Joke Recommendation:

Jester - Joke Ratings Data Sets (Collaborative Filtering Dataset) link

Music Recommendation:

Last.fm - Music Recommendation Data Sets link
Yahoo! - Movie, Music, and Images Ratings Data Sets link
Audioscrobbler - Music Recommendation Data Sets link
Amazon - Audio CD recommendations link

Books Recommendation:

Institut für Informatik, Universität Freiburg - Book Ratings Data Sets link

Food Recommendation:

Chicago Entree - Food Ratings Data Sets link

Merchandise Recommendation:

Amazon - Product Recommendation Data Sets link

Healthcare Recommendation:

Nursing Home - Provider Ratings Data Set link
Hospital Ratings - Survey of Patients Hospital Experiences link

Dating Recommendation:

www.libimseti.cz - Dating website recommendation (collaborative filtering) link

Scholarly Paper Recommendation:

National University of Singapore - Scholarly Paper Recommendation link

POI Recommendation:

LBSN - https://github.com/rahmanidashti/LBSNDatasets

Criteo

A Kaggle dataset for the Criteo display advertising challenge

Criteo is a personalized retargeting company that works with Internet retailers to serve personalized online display advertisements to consumers. The goal of this Kaggle challenge is to predict click-through rates on display ads. It offers a week’s worth of data from Criteo’s traffic. In the labeled training set, which covers a period of 7 days, each row corresponds to a display ad served by Criteo. The samples are chronologically ordered. Positive (clicked) and negative (non-clicked) samples have both been subsampled, at different rates, to reduce the dataset size. There are 13 count features and 26 categorical features. The semantics of these features are undisclosed, and some features may have missing values. The dataset is currently available for download from AWS.

Note that we only use the labeled part of the data as our benchmarking set. We split the data sequentially, similar to the challenge. Since the timestamps are not shown, we use the same number of samples as the test set in the challenge. Therefore, we take the last 6,042,135 samples (corresponding to the last day) for testing and the remaining samples (corresponding to the first 6 days) for training and validation.
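The sequential split described above can be sketched as follows. This is a minimal illustration, assuming the labeled rows are already held in chronological order in a Python sequence. The function name `sequential_split` and the toy data are hypothetical; only the test-set size of 6,042,135 comes from the text.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")

# Test-set size quoted in the text (the last day of the labeled data).
CRITEO_TEST_SIZE = 6_042_135


def sequential_split(
    rows: Sequence[T], test_size: int
) -> Tuple[Sequence[T], Sequence[T]]:
    """Split chronologically ordered rows into (train_val, test).

    The last `test_size` rows become the test set and everything before
    them is kept for training and validation. Nothing is shuffled, so
    the temporal order is preserved.
    """
    if not 0 < test_size < len(rows):
        raise ValueError("test_size must be between 1 and len(rows) - 1")
    cut = len(rows) - test_size
    return rows[:cut], rows[cut:]


# Toy usage: 10 rows with a test set of 3 rows (hypothetical sizes).
toy_rows = list(range(10))
train_val, test = sequential_split(toy_rows, test_size=3)
```

On the real data, `test_size` would be `CRITEO_TEST_SIZE`. The training/validation part can then be split further in the same sequential way.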

Data fields:

Label - Target variable that indicates if an ad was clicked (1) or not (0).
I1-I13 - A total of 13 columns of integer features (mostly count features).
C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
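As a rough sketch of how these fields map onto a parsed record, the snippet below builds the 40 column names and parses one line. It assumes a tab-separated, header-less file in which a missing value appears as an empty field; that layout is an assumption about the download, not something stated in this post. The helper name `parse_criteo_line` and the toy line are hypothetical.

```python
from typing import Dict, List, Optional, Union

INT_COLS: List[str] = [f"I{i}" for i in range(1, 14)]  # I1..I13
CAT_COLS: List[str] = [f"C{i}" for i in range(1, 27)]  # C1..C26
COLUMNS: List[str] = ["Label"] + INT_COLS + CAT_COLS   # 40 columns in total

Value = Optional[Union[int, str]]


def parse_criteo_line(line: str) -> Dict[str, Value]:
    """Parse one tab-separated line into a column -> value mapping.

    Assumption: an empty field marks a missing value, returned as None.
    Integer columns are converted to int; hashed categorical values are
    kept as the strings found in the file.
    """
    # Strip only the newline: trailing tabs may encode missing fields.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row: Dict[str, Value] = {}
    for name, raw in zip(COLUMNS, fields):
        if raw == "":
            row[name] = None
        elif name == "Label" or name in INT_COLS:
            row[name] = int(raw)
        else:
            row[name] = raw
    return row


# Toy line (made-up values): I2 is missing and only C1 is present.
toy_ints = ["5", ""] + ["0"] * 11
toy_cats = ["68fd1e64"] + [""] * 25
toy_line = "\t".join(["1"] + toy_ints + toy_cats) + "\n"
row = parse_criteo_line(toy_line)
```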

Avazu

A Kaggle dataset for the Avazu CTR prediction challenge

Avazu is one of the leading mobile advertising platforms globally. This Kaggle competition aims to predict whether a mobile ad will be clicked and provides 11 days’ worth of Avazu data to build and test prediction models. It consists of 10 days of labeled click-through data for training and 1 day of unlabeled ad data for testing.

Note that we only use the first 10 days of labeled data as our benchmarking set. We split the data sequentially, similar to the challenge: the first 9 days (20141021~20141029) are used for training and validation, and the last day (20141030) is used for testing.
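A day-based split like this one can be sketched in a few lines. The example assumes each record is a dict whose `hour` value is a string in the YYMMDDHH format listed in the data fields, so that 20141030 corresponds to the prefix `141030`. The function name `split_by_day` and the toy records are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# 20141030 written as the YYMMDD prefix of the `hour` field.
TEST_DAY = "141030"

Row = Dict[str, str]


def split_by_day(
    rows: Iterable[Row], test_day: str = TEST_DAY
) -> Tuple[List[Row], List[Row]]:
    """Send rows from `test_day` to the test set, earlier rows to train_val.

    YYMMDD strings from the same century sort lexicographically in date
    order, so a plain string comparison is enough here.
    """
    train_val: List[Row] = []
    test: List[Row] = []
    for row in rows:
        day = row["hour"][:6]
        if day < test_day:
            train_val.append(row)
        elif day == test_day:
            test.append(row)
        else:
            raise ValueError(f"row dated after the test day: {row['hour']}")
    return train_val, test


# Toy records (made-up ids and clicks) spanning the first and last days.
toy_rows = [
    {"id": "a", "click": "0", "hour": "14102100"},
    {"id": "b", "click": "1", "hour": "14102923"},
    {"id": "c", "click": "0", "hour": "14103000"},
]
train_val, test = split_by_day(toy_rows)
```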

Data fields:

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1: anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21: anonymized categorical variables
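The `hour` field is compact, so it usually needs decoding before use. Below is a small sketch, using only the Python standard library, that turns a YYMMDDHH value into a timezone-aware timestamp. The helper name `parse_avazu_hour` is hypothetical, and deriving the hour of day or the weekday from the result is one possible use rather than something prescribed by the dataset.

```python
from datetime import datetime, timezone
from typing import Union


def parse_avazu_hour(hour: Union[str, int]) -> datetime:
    """Convert a YYMMDDHH value into a UTC datetime.

    `%y` reads the two-digit year, `%m` the month, `%d` the day and
    `%H` the hour on a 24-hour clock.
    """
    parsed = datetime.strptime(str(hour), "%y%m%d%H")
    return parsed.replace(tzinfo=timezone.utc)


# The example from the field description: 23:00 on Sept. 11, 2014 UTC.
ts = parse_avazu_hour("14091123")
hour_of_day = ts.hour       # hour within the day, 0-23
day_of_week = ts.weekday()  # Monday is 0, Sunday is 6
```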

More related datasets

Recommendation
- Taobao
- Amazon: A dataset of product reviews and metadata from Amazon
- MovieLens
- Yelp
- Frappe: A dataset for context-aware app recommendation in Frappe
Online advertising
- Criteo1TB: A Terabyte display advertising dataset from Criteo
- Outbrain: A dataset of users’ page views and clicks in Outbrain
Web search / sponsored search
- Yandex: A click dataset for personalized Web search challenge from Yandex
- Avito: A dataset of contextual search ad clicks from Avito
CVR datasets
- YooChoose: A sequence of click and purchase events in an e-commerce website from YooChoose
- AliCCP: A click dataset gathered from the recommender system in Taobao
- JData: A dataset for purchase prediction in JD.com


Shuai Zhang
