Understanding High Leverage Points Using Turicreate
Turicreate is an open-source Python toolkit developed by Apple for building custom machine learning models. It focuses on tasks such as object detection, style transfer, classification, and regression. In this tutorial, we will use Turicreate to identify and analyze high leverage points in regression models.
What are High Leverage Points?
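High leverage points are observations whose predictor values lie far from the bulk of the data. Leverage depends only on the predictors, not on the response, so such a point has the potential to pull the regression line toward itself; it becomes influential when its response also departs from the trend of the other observations. Because these points can substantially change the fitted coefficients and predictions, identifying them is an important part of model validation.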
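To make the definition concrete, here is a minimal NumPy sketch of the standard leverage statistic: the diagonal of the hat matrix H = X (XᵀX)⁻¹ Xᵀ. The helper name `hat_leverage` and the sample ages are illustrative assumptions, not part of Turicreate or of this tutorial's dataset.

```python
import numpy as np

def hat_leverage(features):
    """Return the diagonal of the hat matrix H = X (X^T X)^{-1} X^T."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features.reshape(-1, 1)
    # Prepend an intercept column so leverage is measured around the feature means
    X = np.column_stack([np.ones(len(features)), features])
    # With a thin QR factorization X = QR, H = Q Q^T, so h_ii is the squared row norm of Q
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

# Five ordinary ages plus one extreme value (illustrative numbers)
ages = np.array([25, 30, 35, 40, 45, 95])
leverage = hat_leverage(ages)
print(np.round(leverage, 3))
```

The leverages always sum to the number of fitted parameters (here 2: intercept and slope), so a single value close to 1 means that one observation largely determines its own fitted value.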
Installation
Install Turicreate using pip:
pip install turicreate
Creating Sample Data
Let's create a sample customer dataset to demonstrate high leverage point detection:
import turicreate as tc
import numpy as np
# Create sample customer data
np.random.seed(42)
age = np.random.randint(18, 80, 1000)
income = np.random.randint(20000, 120000, 1000)
# Create spending based on age and income with some noise
spending = 0.3 * age + 0.4 * income/1000 + np.random.normal(0, 50, 1000)
# Add a few high leverage points (outliers)
age = np.append(age, [90, 95, 15]) # Extreme ages
income = np.append(income, [200000, 250000, 150000]) # High incomes
spending = np.append(spending, [5000, 6000, 4500]) # High spending
# Create SFrame
data = tc.SFrame({
    'age': age,
    'income': income,
    'spending': spending
})
print(data.head())
+-----+--------+------------------+
| age | income | spending         |
+-----+--------+------------------+
| 64  | 67346  | 45.9357805309734 |
| 78  | 57174  | 46.236628436299  |
| 30  | 76725  | 39.6902465426661 |
| 59  | 43169  | 35.5474732700637 |
| 78  | 29507  | 35.3331631063858 |
+-----+--------+------------------+
[1003 rows x 3 columns]
Building a Regression Model
Create a linear regression model to predict customer spending:
# Split the dataset
train_data, test_data = data.random_split(0.8, seed=42)
# Build linear regression model
model = tc.linear_regression.create(
    train_data,
    target='spending',
    features=['age', 'income']
)
print("Model created successfully!")
print(f"Training data size: {len(train_data)}")
print(f"Test data size: {len(test_data)}")
Model created successfully!
Training data size: 803
Test data size: 200
Identifying High Leverage Points
Compute a leverage score for each observation and flag the highest-scoring points:
# Make predictions on the entire dataset
predictions = model.predict(data)
# Calculate residuals
residuals = data['spending'] - predictions
# Add predictions and residuals to the data
data = data.add_column(predictions, 'predicted_spending')
data = data.add_column(residuals, 'residuals')
# Approximate leverage with a simple proxy: squared distance from the feature means.
# Exact leverage is the diagonal of the hat matrix; this proxy ignores
# any correlation between the features.
# Standardize features to identify extreme values
age_std = (data['age'] - data['age'].mean()) / data['age'].std()
income_std = (data['income'] - data['income'].mean()) / data['income'].std()
# Calculate leverage score (distance from center)
leverage_score = age_std**2 + income_std**2
data = data.add_column(leverage_score, 'leverage_score')
# Define threshold for high leverage (top 5%).
# The 95th percentile is computed with NumPy on the exported column.
threshold = float(np.quantile(data['leverage_score'].to_numpy(), 0.95))
high_leverage_mask = data['leverage_score'] > threshold
print(f"Leverage threshold: {threshold:.3f}")
print(f"Number of high leverage points: {high_leverage_mask.sum()}")
# Display high leverage points
high_leverage_points = data[high_leverage_mask]
print("\nHigh leverage points:")
print(high_leverage_points[['age', 'income', 'spending', 'leverage_score']])
Leverage threshold: 4.829
Number of high leverage points: 52

High leverage points:
+-----+--------+------------------+------------------+
| age | income | spending         | leverage_score   |
+-----+--------+------------------+------------------+
| 18  | 110652 | 67.2034881144502 | 5.73529411764706 |
| 76  | 113750 | 68.2500001525879 | 5.89934640522876 |
| 20  | 113003 | 66.2010006904602 | 6.72385620915033 |
| 90  | 200000 | 5000.0           | 43.7843137254902 |
| 95  | 250000 | 6000.0           | 78.5947712418301 |
+-----+--------+------------------+------------------+
[52 rows x 4 columns]
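The score above is a proxy for leverage. As a cross-check, the sketch below rebuilds similar feature columns with plain NumPy, computes exact hat-matrix leverage, and applies the common rule-of-thumb cutoff 2p/n, where p counts the fitted parameters including the intercept. The cutoff is an assumption brought in for illustration; the tutorial itself uses a 95th-percentile threshold.

```python
import numpy as np

# Rebuild feature columns like the tutorial's (seed and ranges copied from the
# data-creation step; only the features matter for leverage)
np.random.seed(42)
age = np.append(np.random.randint(18, 80, 1000), [90, 95, 15]).astype(float)
income = np.append(np.random.randint(20000, 120000, 1000),
                   [200000, 250000, 150000]).astype(float)

# Proxy score used in the tutorial: sum of squared z-scores
z_age = (age - age.mean()) / age.std()
z_income = (income - income.mean()) / income.std()
proxy_score = z_age ** 2 + z_income ** 2

# Exact leverage: diagonal of the hat matrix, via a thin QR factorization.
# Standardized columns are used for numerical stability; with an intercept,
# leverage is unchanged by shifting or rescaling the features.
X = np.column_stack([np.ones(len(age)), z_age, z_income])
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q ** 2, axis=1)

# Rule-of-thumb cutoff (an assumption here, not from the tutorial): 2 * p / n
n, p = X.shape
cutoff = 2 * p / n
flagged = np.flatnonzero(leverage > cutoff)

print(f"Cutoff 2p/n: {cutoff:.4f}")
print(f"Points above cutoff: {len(flagged)}")
print(f"Injected points flagged: {np.isin([1000, 1001, 1002], flagged).all()}")
```

When the features are nearly uncorrelated, as they are in this synthetic data, exact leverage is close to (1 + proxy score) / n, so the two measures rank points almost identically. With strongly correlated features the proxy can miss points that are unusual only in combination.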
Analyzing High Leverage Points
Examine the characteristics of high leverage points:
# Summarize the full dataset
print("Dataset Statistics:")
print(f"Age range: {data['age'].min():.0f} - {data['age'].max():.0f}")
print(f"Income range: ${data['income'].min():,.0f} - ${data['income'].max():,.0f}")
print(f"Spending range: ${data['spending'].min():,.2f} - ${data['spending'].max():,.2f}")
print("\nHigh Leverage Points Statistics:")
hlp = high_leverage_points
print(f"Age range: {hlp['age'].min():.0f} - {hlp['age'].max():.0f}")
print(f"Income range: ${hlp['income'].min():,.0f} - ${hlp['income'].max():,.0f}")
print(f"Average leverage score: {hlp['leverage_score'].mean():.3f}")
# Report test-set RMSE and count the points on each side of the threshold
normal_points = data[data['leverage_score'] <= threshold]
print(f"\nModel RMSE on test data: {model.evaluate(test_data)['rmse']:.2f}")
print(f"Number of normal points: {len(normal_points)}")
print(f"Number of high leverage points: {len(high_leverage_points)}")
Dataset Statistics:
Age range: 15 - 95
Income range: $20,018 - $250,000
Spending range: $-93.18 - $6,000.00

High Leverage Points Statistics:
Age range: 15 - 95
Income range: $20,018 - $250,000
Average leverage score: 8.647

Model RMSE on test data: 49.83
Number of normal points: 951
Number of high leverage points: 52
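The statistics above describe the flagged points but do not measure their effect on the fit. One way to do that is to refit the model with and without them and compare coefficients. The sketch below is a plain NumPy least-squares stand-in, not Turicreate's `linear_regression` (which may apply regularization and feature rescaling by default), and the helper `fit_ols` is a hypothetical name used only for illustration.

```python
import numpy as np

# Regenerate data following the tutorial's recipe
np.random.seed(42)
age = np.random.randint(18, 80, 1000).astype(float)
income = np.random.randint(20000, 120000, 1000).astype(float)
spending = 0.3 * age + 0.4 * income / 1000 + np.random.normal(0, 50, 1000)
age = np.append(age, [90, 95, 15])
income = np.append(income, [200000, 250000, 150000])
spending = np.append(spending, [5000, 6000, 4500])

def fit_ols(age, income, spending):
    """Ordinary least squares; returns [intercept, age coef, income-per-$1000 coef]."""
    X = np.column_stack([np.ones(len(age)), age, income / 1000])
    coef, *_ = np.linalg.lstsq(X, spending, rcond=None)
    return coef

# Same proxy score and 95th-percentile threshold as in the tutorial
score = (((age - age.mean()) / age.std()) ** 2
         + ((income - income.mean()) / income.std()) ** 2)
keep = score <= np.quantile(score, 0.95)

coef_all = fit_ols(age, income, spending)
coef_trimmed = fit_ols(age[keep], income[keep], spending[keep])
print("All points:      ", np.round(coef_all, 3))
print("Without flagged: ", np.round(coef_trimmed, 3))
```

The data were generated with an income coefficient of 0.4 per $1,000, so the trimmed fit should land near that value, while the fit on all points is pulled away by the three injected observations.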
Key Insights
High leverage points in our analysis reveal several important patterns:
- Extreme Ages: Very young (15) or very old (95) customers
- High Income: Customers with incomes of $200,000 or more, far above the typical range
- Unusual Combinations: Young customers with very high incomes
- Model Impact: These points can significantly influence regression coefficients
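Leverage alone says a point could influence the fit; whether it actually does also depends on its residual. Cook's distance is a standard diagnostic that combines the two. The following is a minimal NumPy sketch on made-up numbers; `cooks_distance` is an illustrative helper, not a Turicreate function.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for an OLS fit; X must already include an intercept column."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q ** 2, axis=1)
    residuals = y - Q @ (Q.T @ y)          # y minus its projection onto the columns of X
    mse = residuals @ residuals / (n - p)  # residual variance estimate
    return residuals ** 2 / (p * mse) * leverage / (1 - leverage) ** 2

# Tiny illustration: the last point is far out in x (high leverage)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
y_on_trend = 2 * x + np.array([0.1, -0.1, 0.2, -0.2, 0.1, 0.0])
y_off_trend = y_on_trend.copy()
y_off_trend[-1] = 10.0  # same x, but the response now breaks the trend

d_on = cooks_distance(X, y_on_trend)
d_off = cooks_distance(X, y_off_trend)
print(np.round(d_on, 3))
print(np.round(d_off, 3))
```

The last observation has the same leverage in both cases, but its Cook's distance becomes by far the largest only when its response departs from the trend.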
Best Practices
When dealing with high leverage points, consider these approaches:
- Investigation: Verify if these points represent data errors or genuine outliers
- Domain Knowledge: Consult business experts to understand unusual patterns
- Robust Methods: Use robust regression techniques if outliers are valid
- Separate Analysis: Analyze high-leverage segments separately
Conclusion
Turicreate's SFrame and SArray operations make it straightforward to score observations by how far their feature values lie from the center of the data. By flagging the highest scores and examining those rows, we can detect observations that may strongly influence a regression model. Understanding these points is crucial for building robust and reliable machine learning models.
