Analysis Walkthrough
A step-by-step journey through the machine learning pipeline
1. Understanding the Problem
We chose to work on the Airbnb NYC regression problem, predicting listing popularity using reviews per month as a proxy metric.
Dataset Overview
The dataset contains 48,895 Airbnb listings from New York City in 2019, with 16 features including location, pricing, room type, and review metrics.
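The snippet below is a minimal sketch of loading the dataset with pandas; the file name `AB_NYC_2019.csv` follows the public Kaggle export and may differ from the path used in the original notebook.

```python
import pandas as pd

# Load the 2019 NYC Airbnb listings.
# File name assumed to match the public Kaggle export; adjust to your local path.
df = pd.read_csv("AB_NYC_2019.csv")

print(df.shape)   # expected: (48895, 16)
print(df.dtypes)  # location, pricing, room type, and review columns
```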
Key Insights from Initial Analysis
- Missing values in the target variable correspond to listings with zero reviews
- Geographic features show clear patterns across NYC boroughs
- Price and room type are strong predictors of listing popularity
2. Data Splitting
We split the data into training (70%) and test (30%) sets with a fixed random state for reproducibility.
- Training set: 34,227 listings (70% of total data)
- Test set: 14,668 listings (30% holdout for final evaluation)
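A minimal sketch of the split with scikit-learn; the seed value 42 is an assumption, since the report only states that a fixed random state was used.

```python
from sklearn.model_selection import train_test_split

# 70/30 split; random_state=42 is illustrative, any fixed seed gives reproducibility.
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42)

print(len(train_df), len(test_df))  # 34,227 and 14,668 rows
```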
3. Exploratory Data Analysis
Our EDA revealed crucial insights about missing values, feature distributions, and relationships between variables.
Missing Value Pattern
Missing values in reviews_per_month are Missing Not At Random (MNAR): they systematically represent listings with zero reviews. We imputed these with 0.
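A sketch of the imputation step, assuming the column names of the public dataset; the assertion simply checks the MNAR pattern described above.

```python
# Sanity check: reviews_per_month should be missing only where number_of_reviews == 0.
missing = train_df["reviews_per_month"].isna()
assert (train_df.loc[missing, "number_of_reviews"] == 0).all()

# Impute with 0 so "no reviews yet" keeps its meaning; apply the same rule to both splits.
train_df = train_df.assign(reviews_per_month=train_df["reviews_per_month"].fillna(0))
test_df = test_df.assign(reviews_per_month=test_df["reviews_per_month"].fillna(0))
```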
Key EDA Findings
Geographic Distribution
- Manhattan: Highest density, premium pricing
- Brooklyn: Second largest market
- Queens: More affordable options
- Bronx & Staten Island: Smaller markets
Price Patterns
- Wide price range: $0 - $10,000
- Median price: ~$106
- Entire homes cost more than private rooms
- Manhattan commands premium prices
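These summaries can be reproduced with simple pandas group-bys; a sketch, assuming the public dataset's column names (`price`, `neighbourhood_group`, `room_type`).

```python
# Median price overall, by borough, and by room type (training split only).
print(train_df["price"].median())                                 # report: ~$106
print(train_df.groupby("neighbourhood_group")["price"].median())  # Manhattan highest
print(train_df.groupby("room_type")["price"].median())            # entire homes > private rooms
```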
4. Feature Engineering
We created several derived features to improve model performance and capture domain-specific insights.
New Features Created
- min_payment: price × minimum_nights
- recency: Days since last review
- price_binned: Categorical price ranges
- min_nights_binned: Booking flexibility categories
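A sketch of how these features might be constructed; the helper name `add_features`, the bin edges, and the snapshot date used for recency are illustrative assumptions, not the notebook's exact values.

```python
import pandas as pd

def add_features(frame: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Derive the engineered features described above (illustrative sketch)."""
    frame = frame.copy()
    # min_payment: smallest possible spend to book the listing.
    frame["min_payment"] = frame["price"] * frame["minimum_nights"]
    # recency: days since the last review (NaN for listings that were never reviewed).
    frame["recency"] = (snapshot - pd.to_datetime(frame["last_review"])).dt.days
    # Coarse categorical bins; edges are illustrative.
    frame["price_binned"] = pd.cut(
        frame["price"], bins=[-1, 50, 100, 200, 500, float("inf")],
        labels=["<=50", "51-100", "101-200", "201-500", ">500"])
    frame["min_nights_binned"] = pd.cut(
        frame["minimum_nights"], bins=[0, 1, 3, 7, 30, float("inf")],
        labels=["1", "2-3", "4-7", "8-30", ">30"])
    return frame

snapshot = pd.Timestamp("2019-07-08")  # assumed data-snapshot date
train_fe = add_features(train_df, snapshot)
test_fe = add_features(test_df, snapshot)
```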
Feature Selection
- Removed redundant ID columns
- Excluded text fields (name, host_name)
- Kept geographic coordinates
- Retained all engineered features
5. Model Comparison & Optimization
We compared multiple algorithms and applied hyperparameter optimization to find the best performing model.
- LightGBM: R² = 0.686 (best performer)
- Random Forest: R² = 0.669 (strong baseline)
- Decision Tree: R² = 0.636 (after optimization)
- Ridge Regression: R² = 0.498 (linear baseline)
Final Model Performance
The optimized LightGBM model achieved R² = 0.6956 on the test set, slightly above its cross-validation score of 0.686, indicating good generalization to unseen data.
Hyperparameter Optimization
We used RandomizedSearchCV to optimize key hyperparameters for the top-performing models.
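A sketch of the search and final evaluation for LightGBM; the feature list, search space, and iteration budget are illustrative assumptions, and the full pipeline also encodes categorical features such as room_type and neighbourhood_group.

```python
from lightgbm import LGBMRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV

# Illustrative numeric/engineered feature subset; the notebook also uses encoded categoricals.
features = ["latitude", "longitude", "price", "minimum_nights", "number_of_reviews",
            "calculated_host_listings_count", "availability_365", "min_payment", "recency"]
target = "reviews_per_month"

X_train, y_train = train_fe[features].fillna(-1), train_fe[target]
X_test, y_test = test_fe[features].fillna(-1), test_fe[target]

# Randomized search over key LightGBM hyperparameters (search space assumed).
param_distributions = {
    "n_estimators": randint(200, 1000),
    "num_leaves": randint(15, 127),
    "learning_rate": uniform(0.01, 0.2),
    "min_child_samples": randint(10, 100),
    "colsample_bytree": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    LGBMRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=30, scoring="r2", cv=5, random_state=42, n_jobs=-1,
)
search.fit(X_train, y_train)

print("CV R²:", search.best_score_)                          # report: ~0.686
print("Test R²:", r2_score(y_test, search.predict(X_test)))  # report: ~0.6956
```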
6. Final Results & Interpretation
- Test R²: 0.6956 (final model performance on the holdout set)
- Cross-validation R²: 0.686 (performance during training)
- Final model: LightGBM (gradient boosting)
Most Important Features
Top Predictors
1. Recency: Days since last review
2. Number of reviews: Total review count
3. Minimum nights: Booking flexibility
4. Neighborhood group: Location impact
Key Insights
- Recent activity drives popularity
- Review history is crucial
- Flexible booking increases appeal
- Location significantly matters
Model Interpretation
SHAP analysis revealed that recency (days since last review) is the most influential feature, with recent activity strongly predicting higher review rates. This aligns with business intuition about listing momentum and visibility.
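A sketch of the SHAP computation for the fitted LightGBM model; `search.best_estimator_` refers to the randomized-search object from the sketch above.

```python
import shap

# TreeExplainer handles LightGBM models natively.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Global importance (mean |SHAP| per feature) and per-sample effect directions.
shap.summary_plot(shap_values, X_test, plot_type="bar")
shap.summary_plot(shap_values, X_test)
```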
Key Takeaways
Project Success
This analysis demonstrates the complete machine learning workflow, from data exploration through feature engineering and model selection to interpretation, achieving strong predictive performance with clear business insights.
Technical Achievements
- Robust preprocessing pipeline
- Effective feature engineering
- Comprehensive model comparison
- Hyperparameter optimization
- Model interpretation with SHAP
Business Insights
- Recent activity drives popularity
- Location significantly impacts success
- Booking flexibility matters
- Review history is crucial
- Price optimization opportunities