Predicting Airbnb Listing Popularity

A comprehensive machine learning analysis of NYC Airbnb data, predicting listing popularity using reviews per month as a proxy. From exploratory data analysis to web deployment.

Dataset: 48,895 listings
Features: 16 variables
Year: 2019
Location: NYC

Key Results

  • Final Test R²: 0.6956 (LightGBM Regressor)
  • Best Model: LightGBM (gradient boosting)
  • Training Data: 34,227 listings (70% of dataset)
  • Test Data: 14,668 listings (30% holdout set)

Executive Summary

This project implements an end-to-end supervised machine learning workflow to predict Airbnb listing popularity in New York City using the AB_NYC_2019 dataset. The analysis covers comprehensive exploratory data analysis, feature engineering, model comparison, hyperparameter tuning, and model interpretation.

Key Achievement

The final tuned LightGBM model achieves a Test R² of 0.6956 on the held-out test set, demonstrating strong predictive performance for listing popularity.

Problem Statement

Predict Airbnb listing popularity using reviews per month as a proxy metric. This helps hosts and Airbnb understand what drives listing success and optimize rental strategies accordingly.
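The target setup and 70/30 holdout described above can be sketched as follows. Column names follow the public AB_NYC_2019 Kaggle schema, but the tiny DataFrame here is synthetic stand-in data, not the real listings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in rows; real AB_NYC_2019 has 48,895 listings.
listings = pd.DataFrame({
    "price": [100, 80, 250, 60, 120, 90, 200, 75, 150, 110],
    "minimum_nights": [1, 2, 3, 1, 2, 1, 4, 2, 1, 3],
    "number_of_reviews": [45, 3, 12, 80, 7, 22, 0, 55, 9, 30],
    "reviews_per_month": [2.1, 0.3, 0.8, 4.5, 0.5, 1.2, 0.0, 3.3, 0.6, 1.8],
})

# Popularity proxy: reviews_per_month is the regression target.
X = listings.drop(columns="reviews_per_month")
y = listings["reviews_per_month"]

# 70/30 split, matching the training/test counts reported above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(len(X_train), len(X_test))  # → 7 3
```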

Technical Highlights

  • Scikit-learn pipelines for robust preprocessing
  • Feature engineering and selection techniques
  • Cross-validation and hyperparameter optimization
  • Model interpretation with SHAP values
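A minimal sketch of the preprocessing-plus-model pipeline pattern listed above. Scikit-learn's GradientBoostingRegressor stands in for LightGBM so the example needs only scikit-learn; the feature names are assumptions based on the AB_NYC_2019 schema, and the data is synthetic.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric = ["price", "minimum_nights", "number_of_reviews"]
categorical = ["room_type", "neighbourhood_group"]

# Impute missing numerics; one-hot encode categoricals, ignoring
# categories unseen at training time.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regress", GradientBoostingRegressor(random_state=42)),
])

# Tiny synthetic frame just to show the pipeline fitting end to end.
df = pd.DataFrame({
    "price": [100, 80, 250, 60, 120, 90],
    "minimum_nights": [1, 2, 3, 1, 2, 1],
    "number_of_reviews": [45, 3, 12, 80, 7, 22],
    "room_type": ["Entire home/apt", "Private room"] * 3,
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens"] * 2,
    "reviews_per_month": [2.1, 0.3, 0.8, 4.5, 0.5, 1.2],
})
model.fit(df.drop(columns="reviews_per_month"), df["reviews_per_month"])
preds = model.predict(df.drop(columns="reviews_per_month"))
```

Bundling preprocessing and the estimator in one `Pipeline` keeps the same transformations applied at fit and predict time, which is the "robust preprocessing" benefit the list refers to.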

Methodology Overview

1. Data Exploration: Comprehensive EDA revealing missing value patterns, feature distributions, and correlation analysis across NYC boroughs.
2. Feature Engineering: Created derived features such as minimum payment, recency metrics, and binned categorical variables to improve model performance.
3. Model Selection: Compared multiple algorithms, from linear regression to gradient boosting, with LightGBM emerging as the top performer.
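The derived features in step 2 might look like the sketch below. The exact formulas are assumptions: the report names "minimum payment", recency, and binned variables without defining them, and the snapshot date used for recency is invented here.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [100, 80, 250],
    "minimum_nights": [1, 2, 30],
    "last_review": pd.to_datetime(["2019-06-01", "2018-12-15", "2019-07-01"]),
})

snapshot = pd.Timestamp("2019-07-08")  # assumed dataset snapshot date

# Minimum payment: cheapest possible booking = price * minimum nights.
df["min_payment"] = df["price"] * df["minimum_nights"]

# Recency: days since the last review, relative to the snapshot date.
df["days_since_last_review"] = (snapshot - df["last_review"]).dt.days

# Binned minimum-nights categories (bin edges are illustrative).
df["min_nights_bin"] = pd.cut(
    df["minimum_nights"], bins=[0, 2, 7, 365],
    labels=["short", "week", "long"],
)
print(df[["min_payment", "days_since_last_review", "min_nights_bin"]])
```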

Key Takeaways

Machine Learning Insights

This project demonstrates the iterative nature of ML development, highlighting important tradeoffs between model complexity, performance, and computational efficiency. The bias-variance tradeoff and performance-efficiency considerations guided our model selection process.

Most Important Features

  • Recency: Days since last review
  • Review Count: Total number of reviews
  • Minimum Nights: Booking flexibility
  • Neighborhood: Location impact

Model Performance

  • LightGBM: R² = 0.686 (optimized)
  • Random Forest: R² = 0.669
  • Decision Tree: R² = 0.636 (optimized)
  • Ridge Regression: R² = 0.498
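A comparison like the table above can be reproduced by scoring each candidate on the same cross-validation folds. This sketch uses synthetic regression data and scikit-learn models only (no LightGBM dependency), so the scores will not match the report's numbers.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the engineered feature matrix.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Same 5-fold split for every model keeps the comparison fair.
for name, est in models.items():
    scores = cross_val_score(est, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R² = {scores.mean():.3f}")
```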
