Datasets for recommender systems research
In this post, I present some benchmark datasets for recommender systems. Please note that I only give links to these datasets. Courtesy of entaroadun.
DataSets
Amazon Product Data:
Mobile Recommendation:
- Data Set for Mobile App Retrieval link
- frappe link
- Ali_Mobile_Rec link1 ; link2
- Mobile App User Dataset link
Movies Recommendation:
- MovieLens - Movie Recommendation Data Sets link
- Yahoo! - Movie, Music, and Images Ratings Data Sets link
- Cornell University - Movie-review data for use in sentiment-analysis experiments link
- Netflix Prize Dataset link
- MovieTweetings - link
Joke Recommendation:
- Jester - Joke Ratings Data Sets (Collaborative Filtering Dataset) link
Music Recommendation:
- Last.fm - Music Recommendation Data Sets link
- Yahoo! - Movie, Music, and Images Ratings Data Sets link
- Audioscrobbler - Music Recommendation Data Sets link
- Amazon - Audio CD recommendations link
Books Recommendation:
- Institut für Informatik, Universität Freiburg - Book Ratings Data Sets link
Food Recommendation:
- Chicago Entree - Food Ratings Data Sets link
Merchandise Recommendation:
- Amazon - Product Recommendation Data Sets link
Healthcare Recommendation:
- Nursing Home - Provider Ratings Data Set link
- Hospital Ratings - Survey of Patients Hospital Experiences link
Dating Recommendation:
- www.libimseti.cz - Dating website recommendation (collaborative filtering) link
Scholarly Paper Recommendation:
- National University of Singapore - Scholarly Paper Recommendation link
POI Recommendation:
- LBSN - https://github.com/rahmanidashti/LBSNDatasets
Criteo
A Kaggle dataset for Criteo display advertising challenge
Criteo is a personalized retargeting company that works with Internet retailers to serve personalized online display advertisements to consumers. The goal of this Kaggle challenge is to predict click-through rates on display ads. It offers a week’s worth of data from Criteo’s traffic. In the labeled training set, which covers a period of 7 days, each row corresponds to a display ad served by Criteo. The samples are chronologically ordered. Positive (clicked) and negative (non-clicked) samples have both been subsampled, at different rates, to reduce the dataset size. There are 13 count features and 26 categorical features. The semantics of these features are undisclosed, and some features may have missing values. The dataset is currently available for download at AWS.
Note that we use only the labeled part of the data as our benchmarking set. We split the data sequentially, similar to the challenge. Since timestamps are not provided, we use the same number of test samples as the challenge: we take the last 6,042,135 samples (corresponding to the last day) for testing and the remaining samples (corresponding to the first 6 days) for training and validation.
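The sequential split described above can be sketched in a few lines of Python. This is only an illustration, not the exact code behind the benchmark: the function name `sequential_split` and the assumption that the rows are already loaded in their original (chronological) order are mine.

```python
def sequential_split(rows, n_test=6_042_135):
    """Split chronologically ordered rows into (train_val, test).

    The last `n_test` rows become the test set; everything before
    them is kept for training and validation.
    """
    if n_test < 0 or n_test > len(rows):
        raise ValueError("n_test must be between 0 and len(rows)")
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]
```

Because the samples are chronologically ordered, slicing by position is enough here; no shuffling should be applied before the split.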
Data fields:
- Label - Target variable that indicates if an ad was clicked (1) or not (0).
- I1-I13 - A total of 13 columns of integer features (mostly count features).
- C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
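To make the field layout concrete, here is a small Python sketch that names the 40 columns and parses one row. It assumes the training file is tab-separated, with the label first, then I1-I13, then C1-C26, and that a missing value shows up as an empty field; the helper name `parse_row` is hypothetical.

```python
INT_COLS = [f"I{i}" for i in range(1, 14)]   # I1..I13, integer features
CAT_COLS = [f"C{i}" for i in range(1, 27)]   # C1..C26, hashed categorical features
COLUMNS = ["Label"] + INT_COLS + CAT_COLS


def parse_row(line):
    """Parse one tab-separated line into a dict; missing values become None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = {"Label": int(fields[0])}
    for name, value in zip(INT_COLS, fields[1:14]):
        row[name] = int(value) if value else None
    for name, value in zip(CAT_COLS, fields[14:]):
        row[name] = value if value else None
    return row
```

The categorical values are kept as opaque strings, since their meaning is undisclosed anyway.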
Avazu
A Kaggle dataset for Avazu CTR prediction challenge
Avazu is one of the leading mobile advertising platforms globally. This Kaggle competition aims to predict whether a mobile ad will be clicked and provides 11 days' worth of Avazu data to build and test prediction models. It consists of 10 days of labeled click-through data for training and 1 day of unlabeled ad data for testing.
Note that we only use the first 10 days of labeled data as our benchmarking set. We split the data sequentially similar to the challenge. That is, the first 9 days of data for training and validation (20141021~20141029), and the last day of data for testing (20141030).
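A rough Python sketch of this day-based split follows. It relies on the `hour` field (format YYMMDDHH) listed in the data fields, so the first six characters identify the day. The function name `split_by_day` and the dict-per-row representation are assumptions made for illustration.

```python
def split_by_day(rows, test_day="141030"):
    """Split rows into (train_val, test) using the YYMMDD prefix of `hour`.

    Rows dated before `test_day` go to train_val, rows on `test_day`
    go to test, and any later rows are ignored.
    """
    train_val, test = [], []
    for row in rows:
        day = str(row["hour"])[:6]
        if day == test_day:
            test.append(row)
        elif day < test_day:  # YYMMDD strings sort chronologically within a century
            train_val.append(row)
    return train_val, test
```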
Data fields:
- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
- C1: anonymized categorical variable
- banner_pos
- site_id
- site_domain
- site_category
- app_id
- app_domain
- app_category
- device_id
- device_ip
- device_model
- device_type
- device_conn_type
- C14-C21: anonymized categorical variables
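The `hour` field is the only one above that needs decoding before use. A minimal Python sketch, using only the standard library, could look like this; the helper name `parse_hour` is my own.

```python
from datetime import datetime, timezone


def parse_hour(value):
    """Convert a YYMMDDHH value (int or str) into a timezone-aware UTC datetime."""
    return datetime.strptime(str(value), "%y%m%d%H").replace(tzinfo=timezone.utc)
```

For example, `parse_hour(14091123)` gives 23:00 on Sept. 11, 2014 UTC, from which features such as hour of day or day of week can be derived.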
iPinyou
A real-time bidding algorithm competition dataset from iPinyou
Taobao
An ad display/click dataset from Taobao.com
Alimama
A dataset for sponsored product search in Alibaba
Tencent
A click dataset for KDD Cup 2012 from Tencent
TAAC
A dataset for Tencent’s ad algorithm competition in 2018
TikTok
A short video understanding challenge from Bytedance’s TikTok App
More related datasets
- Recommendation
- Online advertising
- Web search / sponsored search
- CVR datasets