Analysis Walkthrough
A step-by-step journey through the machine learning pipeline
1. Understanding the Problem
We chose to work on the Airbnb NYC regression problem, predicting listing popularity using reviews per month as a proxy metric.
Dataset Overview
The dataset contains 48,895 Airbnb listings from New York City in 2019, with 16 features including location, pricing, room type, and review metrics.
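The snippet below is a minimal sketch of loading the dataset with pandas; the file name `AB_NYC_2019.csv` follows the public Kaggle export and may differ from the path used in the original notebook.

```python
import pandas as pd

# Load the 2019 NYC Airbnb listings.
# File name assumed to match the public Kaggle export; adjust to your local path.
df = pd.read_csv("AB_NYC_2019.csv")

print(df.shape)   # expected: (48895, 16)
print(df.dtypes)  # location, pricing, room type, and review columns
```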
Key Insights from Initial Analysis
- Missing values in the target variable correspond to listings with zero reviews
- Geographic features show clear patterns across NYC boroughs
- Price and room type are strong predictors of listing popularity
2. Data Splitting
We split the data into training (70%) and test (30%) sets with a fixed random state for reproducibility.
- Training set: 34,227 listings (70% of total data)
- Test set: 14,668 listings (30% holdout for final evaluation)
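A minimal sketch of the split with scikit-learn; the seed value 42 is an assumption, since the report only states that a fixed random state was used.

```python
from sklearn.model_selection import train_test_split

# 70/30 split; random_state=42 is illustrative, any fixed seed gives reproducibility.
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42)

print(len(train_df), len(test_df))  # 34,227 and 14,668 rows
```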
3. Exploratory Data Analysis
Our EDA revealed crucial insights about missing values, feature distributions, and relationships between variables.
Missing Value Pattern
Missing values in reviews_per_month are Missing Not At Random (MNAR): they systematically represent listings with zero reviews. We imputed these with 0.
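A sketch of the imputation step, assuming the column names of the public dataset; the assertion simply checks the MNAR pattern described above.

```python
# Sanity check: reviews_per_month should be missing only where number_of_reviews == 0.
missing = train_df["reviews_per_month"].isna()
assert (train_df.loc[missing, "number_of_reviews"] == 0).all()

# Impute with 0 so "no reviews yet" keeps its meaning; apply the same rule to both splits.
train_df = train_df.assign(reviews_per_month=train_df["reviews_per_month"].fillna(0))
test_df = test_df.assign(reviews_per_month=test_df["reviews_per_month"].fillna(0))
```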
Key EDA Findings
Geographic Distribution
- Manhattan: Highest density, premium pricing
- Brooklyn: Second largest market
- Queens: More affordable options
- Bronx & Staten Island: Smaller markets
Price Patterns
- Wide price range: $0 - $10,000
- Median price: ~$106
- Entire homes cost more than private rooms
- Manhattan commands premium prices
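These summaries can be reproduced with simple pandas group-bys; a sketch, assuming the public dataset's column names (`price`, `neighbourhood_group`, `room_type`).

```python
# Median price overall, by borough, and by room type (training split only).
print(train_df["price"].median())                                 # report: ~$106
print(train_df.groupby("neighbourhood_group")["price"].median())  # Manhattan highest
print(train_df.groupby("room_type")["price"].median())            # entire homes > private rooms
```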
4. Feature Engineering
We created several derived features to improve model performance and capture domain-specific insights.
New Features Created
- min_payment: price × minimum_nights
- recency: Days since last review
- price_binned: Categorical price ranges
- min_nights_binned: Booking flexibility categories
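A sketch of how these features might be constructed; the helper name `add_features`, the bin edges, and the snapshot date used for recency are illustrative assumptions, not the notebook's exact values.

```python
import pandas as pd

def add_features(frame: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Derive the engineered features described above (illustrative sketch)."""
    frame = frame.copy()
    # min_payment: smallest possible spend to book the listing.
    frame["min_payment"] = frame["price"] * frame["minimum_nights"]
    # recency: days since the last review (NaN for listings that were never reviewed).
    frame["recency"] = (snapshot - pd.to_datetime(frame["last_review"])).dt.days
    # Coarse categorical bins; edges are illustrative.
    frame["price_binned"] = pd.cut(
        frame["price"], bins=[-1, 50, 100, 200, 500, float("inf")],
        labels=["<=50", "51-100", "101-200", "201-500", ">500"])
    frame["min_nights_binned"] = pd.cut(
        frame["minimum_nights"], bins=[0, 1, 3, 7, 30, float("inf")],
        labels=["1", "2-3", "4-7", "8-30", ">30"])
    return frame

snapshot = pd.Timestamp("2019-07-08")  # assumed data-snapshot date
train_fe = add_features(train_df, snapshot)
test_fe = add_features(test_df, snapshot)
```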
Feature Selection
- Removed redundant ID columns
- Excluded text fields (name, host_name)
- Kept geographic coordinates
- Retained all engineered features
5. Model Comparison & Optimization
We compared multiple algorithms and applied hyperparameter optimization to find the best performing model.
- LightGBM: R² = 0.686 (best performer)
- Random Forest: R² = 0.669 (strong baseline)
- Decision Tree: R² = 0.636 (after optimization)
- Ridge Regression: R² = 0.498 (linear baseline)
Final Model Performance
The optimized LightGBM model achieved R² = 0.6956 on the test set, slightly above its cross-validation score of 0.686, indicating good generalization to unseen data.
Hyperparameter Optimization
We used RandomizedSearchCV to optimize key hyperparameters for the top-performing models.
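A sketch of the search and final evaluation for LightGBM; the feature list, search space, and iteration budget are illustrative assumptions, and the full pipeline also encodes categorical features such as room_type and neighbourhood_group.

```python
from lightgbm import LGBMRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV

# Illustrative numeric/engineered feature subset; the notebook also uses encoded categoricals.
features = ["latitude", "longitude", "price", "minimum_nights", "number_of_reviews",
            "calculated_host_listings_count", "availability_365", "min_payment", "recency"]
target = "reviews_per_month"

X_train, y_train = train_fe[features].fillna(-1), train_fe[target]
X_test, y_test = test_fe[features].fillna(-1), test_fe[target]

# Randomized search over key LightGBM hyperparameters (search space assumed).
param_distributions = {
    "n_estimators": randint(200, 1000),
    "num_leaves": randint(15, 127),
    "learning_rate": uniform(0.01, 0.2),
    "min_child_samples": randint(10, 100),
    "colsample_bytree": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    LGBMRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=30, scoring="r2", cv=5, random_state=42, n_jobs=-1,
)
search.fit(X_train, y_train)

print("CV R²:", search.best_score_)                          # report: ~0.686
print("Test R²:", r2_score(y_test, search.predict(X_test)))  # report: ~0.6956
```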
6. Final Results & Interpretation
- Test R²: 0.6956 (final model performance on the holdout set)
- Cross-validation R²: 0.686 (performance during training)
- Final model: LightGBM (gradient boosting)
Most Important Features
Top Predictors
1. Recency: Days since last review
2. Number of reviews: Total review count
3. Minimum nights: Booking flexibility
4. Neighborhood group: Location impact
Key Insights
- Recent activity drives popularity
- Review history is crucial
- Flexible booking increases appeal
- Location significantly matters
Model Interpretation
SHAP analysis revealed that recency (days since last review) is the most influential feature, with recent activity strongly predicting higher review rates. This aligns with business intuition about listing momentum and visibility.
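A sketch of the SHAP computation for the fitted LightGBM model; `search.best_estimator_` refers to the randomized-search object from the sketch above.

```python
import shap

# TreeExplainer handles LightGBM models natively.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Global importance (mean |SHAP| per feature) and per-sample effect directions.
shap.summary_plot(shap_values, X_test, plot_type="bar")
shap.summary_plot(shap_values, X_test)
```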
Key Takeaways
Project Success
This analysis demonstrates the complete machine learning workflow, from data exploration through feature engineering and model selection to interpretation, achieving strong predictive performance with clear business insights.
Technical Achievements
- Robust preprocessing pipeline
- Effective feature engineering
- Comprehensive model comparison
- Hyperparameter optimization
- Model interpretation with SHAP
Business Insights
- Recent activity drives popularity
- Location significantly impacts success
- Booking flexibility matters
- Review history is crucial
- Price optimization opportunities