Understanding High Leverage Points Using Turicreate

Turicreate is an open-source Python toolkit developed by Apple for building custom machine learning models. It covers tasks such as object detection, style transfer, classification, and regression. In this tutorial, we will explore how to use Turicreate to identify and analyze high leverage points in regression models.

What are High Leverage Points?

High leverage points are observations whose predictor values lie far from the bulk of the data. Because the fitted regression line is pulled toward such points, they can have a disproportionate effect on the coefficients and predictions, especially when their response values also depart from the overall trend. Identifying them is therefore an important part of model validation.
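
To make the definition concrete, here is a minimal sketch of the textbook leverage formula for a regression with an intercept and a single predictor, h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. It uses only the Python standard library and is not part of Turicreate; the function name `leverage_1d` and the sample ages are illustrative assumptions.

```python
def leverage_1d(xs):
    """Leverage h_i for simple linear regression with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    # Total squared deviation of the predictor from its mean
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1.0 / n + (x - mean_x) ** 2 / sxx for x in xs]


# Hypothetical ages: five typical customers and one extreme value
ages = [25, 30, 35, 40, 45, 95]
h = leverage_1d(ages)
for age, h_i in zip(ages, h):
    print(f"age={age:3d}  leverage={h_i:.3f}")
```

The leverages always sum to the number of fitted parameters (two here), so the average leverage is 2/n. A point whose leverage is several times that average, like the extreme age above, deserves a closer look.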

Installation

Install Turicreate using pip:

pip install turicreate

Creating Sample Data

Let's create a sample customer dataset to demonstrate high leverage point detection:

import turicreate as tc
import numpy as np

# Create sample customer data
np.random.seed(42)
age = np.random.randint(18, 80, 1000)
income = np.random.randint(20000, 120000, 1000)
# Create spending based on age and income with some noise
spending = 0.3 * age + 0.4 * income/1000 + np.random.normal(0, 50, 1000)

# Add a few high leverage points (extreme ages and incomes) that are also outliers in spending
age = np.append(age, [90, 95, 15])  # Extreme ages
income = np.append(income, [200000, 250000, 150000])  # High incomes
spending = np.append(spending, [5000, 6000, 4500])  # High spending

# Create SFrame
data = tc.SFrame({
    'age': age,
    'income': income,
    'spending': spending
})

print(data.head())
+-----+--------+------------------+
| age | income |     spending     |
+-----+--------+------------------+
| 64  | 67346  | 45.9357805309734 |
| 78  | 57174  | 46.236628436299  |
| 30  | 76725  | 39.6902465426661 |
| 59  | 43169  | 35.5474732700637 |
| 78  | 29507  | 35.3331631063858 |
+-----+--------+------------------+
[1003 rows x 3 columns]

Building a Regression Model

Create a linear regression model to predict customer spending:

# Split the dataset
train_data, test_data = data.random_split(0.8, seed=42)

# Build linear regression model
model = tc.linear_regression.create(
    train_data, 
    target='spending',
    features=['age', 'income']
)

print("Model created successfully!")
print(f"Training data size: {len(train_data)}")
print(f"Test data size: {len(test_data)}")
Model created successfully!
Training data size: 803
Test data size: 200

Identifying High Leverage Points

Calculate an approximate leverage score for each observation and flag the highest-scoring points:

# Make predictions on the entire dataset
predictions = model.predict(data)

# Calculate residuals
residuals = data['spending'] - predictions

# Add predictions and residuals to the data
data = data.add_column(predictions, 'predicted_spending')
data = data.add_column(residuals, 'residuals')

# Approximate leverage with a simple proxy:
# observations with extreme feature values receive a high score

# Standardize features to identify extreme values
age_std = (data['age'] - data['age'].mean()) / data['age'].std()
income_std = (data['income'] - data['income'].mean()) / data['income'].std()

# Calculate leverage score (squared distance from the center, in standardized units)
leverage_score = age_std**2 + income_std**2
data = data.add_column(leverage_score, 'leverage_score')

# Define threshold for high leverage (top 5%), computed with NumPy on the column values
threshold = np.quantile(data['leverage_score'].to_numpy(), 0.95)
high_leverage_mask = data['leverage_score'] > threshold

print(f"Leverage threshold: {threshold:.3f}")
print(f"Number of high leverage points: {high_leverage_mask.sum()}")

# Display high leverage points
high_leverage_points = data[high_leverage_mask]
print("\nHigh leverage points:")
print(high_leverage_points[['age', 'income', 'spending', 'leverage_score']])
Leverage threshold: 4.829
Number of high leverage points: 52

High leverage points:
+-----+--------+------------------+------------------+
| age | income |     spending     |  leverage_score  |
+-----+--------+------------------+------------------+
| 18  | 110652 | 67.2034881144502 | 5.73529411764706 |
| 76  | 113750 | 68.2500001525879 | 5.89934640522876 |
| 20  | 113003 | 66.2010006904602 | 6.72385620915033 |
| 90  | 200000 | 5000.0           | 43.7843137254902 |
| 95  | 250000 | 6000.0           | 78.5947712418301 |
+-----+--------+------------------+------------------+
[52 rows x 4 columns]
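
The score above treats age and income independently. As a point of comparison, here is a minimal standard-library sketch of the exact hat-matrix leverage for a model with an intercept and two predictors, which also accounts for correlation between the predictors. It is not a Turicreate API; the function name `leverage_2d`, the sample values, and the 2p/n cutoff (a common rule of thumb) are assumptions made for illustration.

```python
def leverage_2d(x1, x2):
    """Hat-matrix diagonal for a model with an intercept and two predictors."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    d1 = [a - m1 for a in x1]
    d2 = [b - m2 for b in x2]
    # Entries of the centered cross-product matrix Xc^T Xc
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    det = s11 * s22 - s12 * s12  # zero if the predictors are perfectly collinear
    # h_i = 1/n + d_i^T (Xc^T Xc)^{-1} d_i, using the closed-form 2x2 inverse
    return [1.0 / n + (s22 * a * a - 2 * s12 * a * b + s11 * b * b) / det
            for a, b in zip(d1, d2)]


# Hypothetical customers; the last one has an extreme age and income
age = [25, 32, 41, 47, 53, 60, 38, 95]
income = [30, 52, 48, 75, 61, 90, 40, 250]  # in thousands

h = leverage_2d(age, income)
p = 3                      # intercept + two slopes
cutoff = 2 * p / len(age)  # rule-of-thumb threshold (assumption)
flagged = [i for i, h_i in enumerate(h) if h_i > cutoff]
print("Indices flagged as high leverage:", flagged)
```

Unlike a percentile threshold, which always flags a fixed share of the data, a cutoff based on p/n flags points only when their leverage is large relative to the average.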

Analyzing High Leverage Points

Examine the characteristics of high leverage points:

# Analyze the impact of high leverage points
print("Dataset Statistics:")
print(f"Age range: {data['age'].min():.0f} - {data['age'].max():.0f}")
print(f"Income range: ${data['income'].min():,.0f} - ${data['income'].max():,.0f}")
print(f"Spending range: ${data['spending'].min():,.2f} - ${data['spending'].max():,.2f}")

print("\nHigh Leverage Points Statistics:")
hlp = high_leverage_points
print(f"Age range: {hlp['age'].min():.0f} - {hlp['age'].max():.0f}")
print(f"Income range: ${hlp['income'].min():,.0f} - ${hlp['income'].max():,.0f}")
print(f"Average leverage score: {hlp['leverage_score'].mean():.3f}")

# Separate normal points from high leverage points and report test-set RMSE
normal_points = data[data['leverage_score'] <= threshold]

print(f"\nModel RMSE on test data: {model.evaluate(test_data)['rmse']:.2f}")
print(f"Number of normal points: {len(normal_points)}")
print(f"Number of high leverage points: {len(high_leverage_points)}")
Dataset Statistics:
Age range: 15 - 95
Income range: $20,018 - $250,000
Spending range: $-93.18 - $6,000.00

High Leverage Points Statistics:
Age range: 15 - 95
Income range: $20,018 - $250,000
Average leverage score: 8.647

Model RMSE on test data: 49.83
Number of normal points: 951
Number of high leverage points: 52
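
A direct way to gauge the effect of a high leverage point is to fit the model with and without it and compare the coefficients. The following standard-library sketch illustrates the idea on a single predictor; the helper `fit_line` and the data are hypothetical and are not taken from the customer dataset above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: five points on a roughly linear trend plus one
# high leverage point (x = 20) that does not follow that trend
xs = [1, 2, 3, 4, 5, 20]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

_, slope_all = fit_line(xs, ys)
_, slope_trimmed = fit_line(xs[:-1], ys[:-1])
print(f"slope with the point:    {slope_all:.2f}")
print(f"slope without the point: {slope_trimmed:.2f}")
```

The same comparison could be made with Turicreate by training one model on all rows and another on the rows at or below the leverage threshold, then comparing their coefficients and RMSE.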

Key Insights

High leverage points in our analysis reveal several important patterns:

  • Extreme Ages: Very young (15) or very old (95) customers
  • High Income: Customers with incomes of $200,000 or more
  • Unusual Combinations: Young customers with very high incomes
  • Model Impact: These points can significantly influence regression coefficients
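
High leverage alone does not guarantee a large effect on the coefficients; the point must also sit away from the fitted trend. Cook's distance combines both ingredients, D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2, where e_i is the residual, h_i the leverage, p the number of parameters, and s^2 the residual variance. Below is a minimal standard-library sketch for a single predictor; the function name `cooks_distance` and the data are hypothetical.

```python
def cooks_distance(xs, ys):
    """Cook's distance for each point in a simple linear regression."""
    n, p = len(xs), 2  # p counts the intercept and the slope
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    leverage = [1.0 / n + (x - mean_x) ** 2 / sxx for x in xs]
    mse = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    return [(e * e / (p * mse)) * h / (1.0 - h) ** 2
            for e, h in zip(residuals, leverage)]


# Hypothetical data: the last point has an extreme x and is off the trend
xs = [1, 2, 3, 4, 5, 20]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

d = cooks_distance(xs, ys)
most_influential = max(range(len(d)), key=d.__getitem__)
print(f"Most influential point: index {most_influential}, x = {xs[most_influential]}")
```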

Best Practices

When dealing with high leverage points, consider these approaches:

  • Investigation: Verify if these points represent data errors or genuine outliers
  • Domain Knowledge: Consult business experts to understand unusual patterns
  • Robust Methods: Use robust regression techniques if outliers are valid
  • Separate Analysis: Analyze high-leverage segments separately
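
As one illustration of the robust-methods option, the sketch below implements the Theil-Sen estimator for a single predictor: the slope is the median of all pairwise slopes, so a small number of extreme points cannot dominate it. This is a swapped-in technique chosen for brevity, not something Turicreate provides; the function name `theil_sen` and the data are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Theil-Sen line: median pairwise slope, then median intercept."""
    points = list(zip(xs, ys))
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)
              if x2 != x1]  # skip pairs with identical x values
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return intercept, slope


# Hypothetical data: the last point has an extreme x and is off the trend
xs = [1, 2, 3, 4, 5, 20]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

intercept, slope = theil_sen(xs, ys)
print(f"robust fit: y = {intercept:.2f} + {slope:.2f} * x")
```

On this data the robust slope stays close to the trend of the first five points, whereas an ordinary least squares fit would be pulled toward the extreme observation.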

Conclusion

Turicreate's SFrame and SArray operations make it straightforward to flag high leverage points alongside a regression model. By scoring how far each observation's feature values lie from the center of the data, we can detect observations that may significantly influence model behavior. Understanding these points is crucial for building robust and reliable machine learning models.

Updated on: 2026-04-02T17:13:06+05:30
