Evaluating recommender systems

Evaluating recommender systems involves assessing their performance in providing accurate and relevant recommendations to users.


Types of evaluation metrics

  • Predictive Metrics: These metrics focus on how accurately the recommender predicts user preferences or ratings. Examples include:
    • Accuracy: Measures how close predicted ratings are to the actual ratings (commonly reported as MAE or RMSE).
    • Precision @K: Evaluates the proportion of relevant items in the top K recommendations.
    • Recall @K: Measures the proportion of all relevant items that appear in the top K recommendations.
    • F1 @K: The harmonic mean of Precision @K and Recall @K (see the sketch after this list).
  • Ranking Metrics: These metrics assess the quality of the recommended item ranking. Examples include:
    • NDCG (Normalized Discounted Cumulative Gain): Considers both relevance and position of recommended items.
    • MRR (Mean Reciprocal Rank): Averages, across users, the reciprocal rank of the first relevant item.
    • MAP (Mean Average Precision): Averages precision at the positions of relevant items for each user, then averages across users (a sketch of the ranking metrics follows this list).
  • Behavioral Metrics: These metrics go beyond accuracy and consider user experience. Examples include:
    • Serendipity: Measures how surprising yet still relevant the recommendations are (relevant items the user would not have discovered on their own).
    • Novelty: Assesses how unfamiliar the recommended items are to the user, e.g., items the user has not interacted with before.
    • Diversity: Evaluates how different the recommended items are from one another (see the diversity sketch after this list).
  • Business Metrics: These metrics measure the actual impact of recommendations on business goals. Examples include:
    • Sales: Tracks revenue generated from recommended items.
    • Click-through rates (CTR): Measures user engagement with recommendations.
    • Conversions: Evaluates how many users take desired actions (e.g., making a purchase).
  • Other Methods:
    • Offline evaluation: Offline evaluation assesses recommender system performance using a fixed historical dataset (such as logged user-item interactions) and predefined evaluation metrics, without exposing real users to the system.
    • ROC-AUC (Receiver Operating Characteristic - Area Under the Curve): Evaluates binary classification model performance.
      • ROC plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings.
    • PR-AUC (Area Under the Precision-Recall Curve): Summarizes the precision-recall trade-off and is especially informative when relevant items are rare.
      • The precision-recall curve plots precision (positive predictive value) against recall (true positive rate) at different thresholds, showing how well the model retrieves held-out relevant items within the top ranks (see the scikit-learn sketch after this list).
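
A minimal Python sketch of Precision@K, Recall@K, and F1@K for a single user, as referenced in the predictive metrics above; the helper name and item ids are hypothetical, and a system-level number would average these per-user values over all test users.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Precision@K, Recall@K and F1@K for one user.

    recommended: item ids ordered by predicted score (best first)
    relevant:    set of held-out item ids the user actually interacted with
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: 3 of the top-5 recommendations are relevant.
recommended = ["i1", "i7", "i3", "i9", "i4", "i8"]
relevant = {"i3", "i4", "i7", "i12"}
print(precision_recall_f1_at_k(recommended, relevant, k=5))
# -> Precision@5 = 0.6, Recall@5 = 0.75, F1@5 ≈ 0.667
```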
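
The ranking metrics can be sketched the same way; this version assumes binary relevance and the usual log2 discount for DCG, with hypothetical item ids. NDCG and MRR for a whole test set are the means of these per-user values.

```python
import math

def dcg_at_k(recommended, relevant, k):
    """Discounted cumulative gain with binary relevance and a log2 discount."""
    return sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the discount is log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )

def ndcg_at_k(recommended, relevant, k):
    """NDCG@K: DCG normalized by the DCG of an ideal (perfectly ordered) list."""
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg_at_k(recommended, relevant, k) / ideal if ideal else 0.0

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item (0 if none); MRR averages this over users."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

recommended = ["i1", "i7", "i3", "i9", "i4"]
relevant = {"i3", "i4", "i7"}
print(ndcg_at_k(recommended, relevant, k=5))   # ≈ 0.71
print(reciprocal_rank(recommended, relevant))  # 0.5 (first relevant item at rank 2)
```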
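
For the behavioral metrics, diversity is commonly reported as intra-list diversity: the average pairwise dissimilarity of the items in one recommendation list. The sketch below assumes items are represented by hypothetical embedding vectors and uses cosine similarity as the basis for dissimilarity.

```python
import itertools

import numpy as np

def intra_list_diversity(item_vectors):
    """Average pairwise (1 - cosine similarity) over one recommended list."""
    dissimilarities = []
    for a, b in itertools.combinations(item_vectors, 2):
        cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        dissimilarities.append(1.0 - cosine)
    return float(np.mean(dissimilarities)) if dissimilarities else 0.0

# Hypothetical item embeddings for a 3-item recommendation list.
items = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
print(intra_list_diversity(items))  # higher values = more varied list
```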
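
ROC-AUC and PR-AUC can be computed directly with scikit-learn once each held-out candidate item is labeled as interacted-with (1) or not (0); the y_true and y_score arrays below are hypothetical model outputs.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical held-out labels: 1 = the user interacted with the item, 0 = did not.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
# Model scores for the same items (higher = predicted more relevant).
y_score = np.array([0.9, 0.8, 0.3, 0.75, 0.2, 0.5, 0.1, 0.6, 0.4, 0.05])

print("ROC-AUC:", roc_auc_score(y_true, y_score))
# average_precision_score summarizes the precision-recall curve (PR-AUC).
print("PR-AUC: ", average_precision_score(y_true, y_score))
```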

Considerations and Challenges:

  • Popularity Bias: Popularity bias occurs when a few highly popular items are recommended frequently, overshadowing less popular items (one way to quantify it is sketched after this list). Impact:
    • Users are repeatedly exposed to popular items, leading to a self-reinforcing cycle.
    • Less-known or niche items receive less attention, affecting overall diversity.
    • Minority group preferences may be overlooked.
  • Position Bias: Position bias arises from the order in which items are presented to users (e.g., top-ranked recommendations appear first).
    • Items at the top of the list receive more attention.
    • Lower-ranked items may be unfairly neglected.
    • Algorithms may reinforce existing biases.
  • Degenerate Feedback Loop: A degenerate feedback loop occurs when user reactions to recommendations (clicks, ratings) feed back into the training data and shape future recommendations.
    • Positive feedback reinforces popular items, exacerbating popularity bias.
    • Negative feedback may lead to excluding certain items.
    • Can homogenize user experiences over time.
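
Popularity bias can be made measurable from the recommendation logs themselves, for example via catalog coverage (the share of the catalog that is ever recommended) and the Gini coefficient of item exposure (0 = perfectly even exposure, close to 1 = a few items dominate). The sketch below is a minimal illustration on hypothetical logs.

```python
from collections import Counter

import numpy as np

def coverage_and_gini(recommendation_lists, catalog_size):
    """Catalog coverage and Gini coefficient of item exposure.

    recommendation_lists: one list of recommended item ids per user
    catalog_size:         total number of recommendable items
    """
    exposure = Counter(item for rec in recommendation_lists for item in rec)
    coverage = len(exposure) / catalog_size  # share of the catalog ever recommended

    # Gini over exposure counts of all catalog items (never-recommended items count as 0).
    counts = np.zeros(catalog_size)
    counts[:len(exposure)] = list(exposure.values())
    counts.sort()
    n = catalog_size
    index = np.arange(1, n + 1)
    gini = 2 * np.sum(index * counts) / (n * np.sum(counts)) - (n + 1) / n
    return coverage, gini

# Hypothetical logs where item "i1" dominates the recommendations.
recs = [["i1", "i2"], ["i1", "i3"], ["i1", "i2"], ["i1", "i4"]]
print(coverage_and_gini(recs, catalog_size=10))  # low coverage, high Gini
```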