Predicting Airbnb Listing Popularity
A machine learning analysis of NYC Airbnb data that predicts listing popularity, using reviews per month as a proxy. The project runs from exploratory data analysis through web deployment.
Key Results
- Final Test R²: 0.6956 (LightGBM Regressor)
- Best Model: LightGBM (Gradient Boosting)
- Training Data: 34,227 (70% of dataset)
- Test Data: 14,668 (30% holdout set)
Executive Summary
This project implements an end-to-end supervised machine learning workflow to predict Airbnb listing popularity in New York City using the AB_NYC_2019 dataset. The analysis covers comprehensive exploratory data analysis, feature engineering, model comparison, hyperparameter tuning, and model interpretation.
Key Achievement
The final tuned LightGBM model achieves a Test R² of 0.6956 on the held-out test set, meaning it explains roughly 70% of the variance in reviews per month.
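For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below computes it in plain Python; the function name `r_squared` and the toy numbers are illustrative assumptions, not values from the project.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean (the baseline).
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy reviews-per-month values, made up for illustration only.
y_true = [0.5, 1.0, 2.0, 4.0]
y_pred = [0.7, 1.1, 1.8, 3.5]
score = r_squared(y_true, y_pred)
```

A perfect model scores 1.0 and a model that always predicts the mean scores 0.0, so a value near 0.70 sits well above the mean baseline while leaving about 30% of the variance unexplained.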
Problem Statement
Predict Airbnb listing popularity using reviews per month as a proxy metric. This helps hosts and Airbnb understand what drives listing success and optimize rental strategies accordingly.
Technical Highlights
- Scikit-learn pipelines for robust preprocessing
- Feature engineering and selection techniques
- Cross-validation and hyperparameter optimization
- Model interpretation with SHAP values
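As a rough illustration of the preprocessing pipeline idea, the sketch below wires imputation, scaling, and one-hot encoding into a single scikit-learn `Pipeline`. The column names are assumed to follow the public AB_NYC_2019 schema, the chosen columns and the `Ridge` regressor are placeholders, and the toy rows are made up; the project's actual pipeline may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names (public AB_NYC_2019 schema); the subset is illustrative.
numeric_cols = ["price", "minimum_nights", "number_of_reviews"]
categorical_cols = ["neighbourhood_group", "room_type"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing numbers
    ("scale", StandardScaler()),                   # zero mean, unit variance
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # tolerate unseen categories
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])
# Placeholder regressor; any estimator can take its place.
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "price": [100, 250, np.nan, 80, 60, 300],
    "minimum_nights": [1, 3, 2, 30, 1, 2],
    "number_of_reviews": [10, 0, 45, 3, 120, 7],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens",
                            "Brooklyn", "Bronx", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Shared room", "Private room", "Entire home/apt"],
})
target = pd.Series([0.8, 0.0, 2.1, 0.1, 4.0, 0.5], name="reviews_per_month")

model.fit(toy, target)
predictions = model.predict(toy)
```

Keeping preprocessing inside the pipeline means the imputation and scaling statistics are learned only from the training folds during cross-validation, which avoids leaking test information.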
Methodology Overview
Data Exploration
Comprehensive EDA revealing missing value patterns, feature distributions, and correlation analysis across NYC boroughs.
Feature Engineering
Created derived features like minimum payment, recency metrics, and binned categorical variables to improve model performance.
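The derived features mentioned above could look something like the following pandas sketch. The definitions are assumptions: "minimum payment" is taken to be price times minimum nights, recency is measured in days from a hypothetical reference date, and the bin edges and labels are invented for illustration.

```python
import pandas as pd


def add_engineered_features(df, reference_date):
    """Return a copy of df with three illustrative derived columns."""
    out = df.copy()
    # Assumed definition: the smallest amount a guest can pay for a booking.
    out["minimum_payment"] = out["price"] * out["minimum_nights"]
    # Recency: days between the last review and a fixed reference date.
    last_review = pd.to_datetime(out["last_review"])
    out["days_since_last_review"] = (
        pd.Timestamp(reference_date) - last_review
    ).dt.days
    # Coarse stay-length categories; edges and labels are hypothetical.
    out["stay_length_bin"] = pd.cut(
        out["minimum_nights"],
        bins=[0, 3, 7, 30, float("inf")],
        labels=["short", "week", "month", "long"],
    )
    return out


# Made-up rows; a listing with no reviews has no last_review date.
toy = pd.DataFrame({
    "price": [100, 80, 250],
    "minimum_nights": [2, 30, 5],
    "last_review": ["2019-06-30", None, "2019-01-01"],
})
features = add_engineered_features(toy, reference_date="2019-07-08")
```

Listings without a review date end up with a missing recency value here, so a downstream imputation step would still be needed before modelling.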
Model Selection
Compared multiple algorithms from linear regression to gradient boosting, with LightGBM emerging as the top performer.
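A comparison of this kind can be sketched with cross-validated R² scores. The example below runs on synthetic data, not the Airbnb dataset, and uses scikit-learn's `GradientBoostingRegressor` as a stand-in for LightGBM so that it depends on scikit-learn alone; the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression problem (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    # Stand-in for LightGBM, which is a separate package.
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold cross-validated R² for each candidate.
cv_r2 = {
    name: cross_val_score(estimator, X, y, cv=5, scoring="r2").mean()
    for name, estimator in candidates.items()
}
ranking = sorted(cv_r2, key=cv_r2.get, reverse=True)
```

Scoring every candidate with the same folds and the same metric keeps the comparison fair, and the held-out test set stays untouched until a final model is chosen.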
Key Takeaways
Machine Learning Insights
This project demonstrates the iterative nature of ML development and the tradeoffs among model complexity, predictive performance, and computational efficiency. The bias-variance tradeoff and the balance between performance and efficiency guided model selection.
Most Important Features
- Recency: Days since last review
- Review Count: Total number of reviews
- Minimum Nights: Booking flexibility
- Neighborhood: Location impact
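A feature ranking like the one above can be approximated without the SHAP library. The sketch below uses permutation importance from scikit-learn, which is a different technique from SHAP: it shuffles one feature at a time and measures how much the model's score drops. The data is synthetic and the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends strongly on the first feature,
# weakly on the second, and not at all on the third.
rng = np.random.default_rng(0)
feature_names = ["recency", "review_count", "noise"]
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column 5 times and record the average drop in R².
result = permutation_importance(forest, X, y, n_repeats=5, random_state=0)
order = np.argsort(result.importances_mean)[::-1]
ranked = [feature_names[i] for i in order]
```

Permutation importance gives one global score per feature, whereas SHAP values also attribute each individual prediction to its features, so the two can rank features differently.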
Model Performance
- LightGBM: R² = 0.686 (optimized)
- Random Forest: R² = 0.669
- Ridge Regression: R² = 0.498
- Decision Tree: R² = 0.636 (optimized)
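The "(optimized)" entries refer to models after hyperparameter tuning. The page does not say which search method was used, so the sketch below assumes an exhaustive grid search with cross-validation over a decision tree; the data is synthetic and the grid values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=300)

# Hypothetical grid: deeper trees fit more detail, larger leaves smooth it out.
param_grid = {
    "max_depth": [1, 3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_r2 = search.best_score_  # mean cross-validated R² of the best setting
```

The cross-validated score reported by the search is computed on training folds only, so it can differ slightly from the R² later measured on a held-out test set.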