@kashaziz
Last active December 3, 2023 14:02
Using Logistic Regression to Predict Customer Retention on an E-commerce Site
"""
This Python script demonstrates the usage of logistic regression to predict whether customers will make the next purchase on an e-commerce site.
The code performs the following steps:
1. Load and Preprocess Data:
- Loads an e-commerce dataset with the feature columns 'customer_id', 'time_on_site', 'total_spent', and 'is_returning_customer', plus the target column 'will_make_next_purchase'.
- Splits the data into training and testing sets.
2. Model Training:
- Creates a logistic regression model using scikit-learn.
- Trains the model on the training set, where 'will_make_next_purchase' is the target variable.
3. Model Evaluation:
- Predicts the target variable on the testing set and reports both training and test accuracy.
- Displays the confusion matrix to provide a detailed view of model performance, including true positives, true negatives, false positives, and false negatives.
4. Making Predictions on New Data:
- Demonstrates how to use the trained model to make predictions on new data.
- Creates a new DataFrame ('new_data') with hypothetical customer features: 'customer_id', 'time_on_site', 'total_spent', and 'is_returning_customer'.
- Outputs predictions for whether these new customers will make the next purchase.
Note: The script assumes that 'will_make_next_purchase' is a binary target variable (0 or 1) indicating whether a customer makes the next purchase.
'customer_id' is also passed to the model as a feature. Because it is an arbitrary identifier, it is unlikely to carry real predictive signal, so consider dropping it when adapting this script to real data.
"""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Load the dataset
ecommerce_data = pd.read_csv('data/shopping_data.csv')
# Assume 'will_make_next_purchase' is the target variable, and others are features
X = ecommerce_data[['customer_id', 'time_on_site', 'total_spent', 'is_returning_customer']]
y = ecommerce_data['will_make_next_purchase']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Logistic Regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Compute the confusion matrix (rows are true classes, columns are predicted classes)
conf_matrix = confusion_matrix(y_test, y_pred)
# Display results
print(f'Training Accuracy: {model.score(X_train, y_train):.2f}')
print(f'Test Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
# Now, let's make predictions on new data
# Assuming 'new_data' is a DataFrame with columns 'customer_id', 'time_on_site', 'total_spent', 'is_returning_customer'
# You should replace this with your actual new data
new_data = pd.DataFrame({
    'customer_id': [9, 10, 11],
    'time_on_site': [12, 8, 15],
    'total_spent': [60, 30, 75],
    'is_returning_customer': [1, 0, 1]
})
# Make predictions on the new data
new_data_predictions = model.predict(new_data)
# Display predictions for the new data
new_data_with_predictions = new_data.copy()
new_data_with_predictions['will_make_next_purchase'] = new_data_predictions
print('Predictions for New Data:')
print(new_data_with_predictions[['customer_id', 'will_make_next_purchase']])
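The script above prints only hard 0/1 predictions for the new customers. A minimal sketch of one possible variation follows, which reports the estimated purchase probability instead. It is not the author's method: the StandardScaler pipeline, the name prob_model, and the choice to leave 'customer_id' out of the features are all assumptions made for illustration. The training rows are copied from the sample data in this gist so the sketch runs without the CSV file.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sample rows copied from this gist's data so the sketch runs without the CSV file
ecommerce_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'time_on_site': [10, 15, 8, 20, 5, 12, 18, 7],
    'total_spent': [50, 75, 30, 100, 20, 60, 90, 35],
    'is_returning_customer': [1, 0, 1, 1, 0, 1, 1, 0],
    'will_make_next_purchase': [1, 1, 0, 1, 0, 1, 1, 0],
})

# Assumption: customer_id is an arbitrary identifier, so it is not used as a feature
feature_columns = ['time_on_site', 'total_spent', 'is_returning_customer']
X = ecommerce_data[feature_columns]
y = ecommerce_data['will_make_next_purchase']

# Assumption: standardize the features before fitting (hypothetical pipeline name)
prob_model = make_pipeline(StandardScaler(), LogisticRegression())
prob_model.fit(X, y)

new_data = pd.DataFrame({
    'customer_id': [9, 10, 11],
    'time_on_site': [12, 8, 15],
    'total_spent': [60, 30, 75],
    'is_returning_customer': [1, 0, 1],
})

# predict_proba returns one column per class, ordered as in prob_model.classes_;
# column 1 is the estimated probability of class 1 (will make the next purchase)
purchase_probability = prob_model.predict_proba(new_data[feature_columns])[:, 1]
result = new_data[['customer_id']].assign(
    purchase_probability=purchase_probability.round(3)
)
print(result)
```

Probabilities make it possible to rank customers or pick a decision threshold other than 0.5, which a plain call to predict does not expose.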
customer_id,time_on_site,total_spent,is_returning_customer,will_make_next_purchase
1,10,50,1,1
2,15,75,0,1
3,8,30,1,0
4,20,100,1,1
5,5,20,0,0
6,12,60,1,1
7,18,90,1,1
8,7,35,0,0
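With only the 8 sample rows above, the script's test_size=0.2 split leaves 2 rows for testing, so a single accuracy figure is very noisy. A hedged sketch of one alternative is stratified k-fold cross-validation, shown below. The names csv_text and cv_model, the 3-fold setting, the scaling step, and the omission of 'customer_id' are assumptions for illustration and are not part of the original script.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The sample rows from this gist, written as CSV text (hypothetical variable name)
csv_text = """customer_id,time_on_site,total_spent,is_returning_customer,will_make_next_purchase
1,10,50,1,1
2,15,75,0,1
3,8,30,1,0
4,20,100,1,1
5,5,20,0,0
6,12,60,1,1
7,18,90,1,1
8,7,35,0,0
"""
ecommerce_data = pd.read_csv(io.StringIO(csv_text))

# Assumption: customer_id is an identifier, so it is excluded from the features
X = ecommerce_data[['time_on_site', 'total_spent', 'is_returning_customer']]
y = ecommerce_data['will_make_next_purchase']

# Stratified folds keep both classes in every training split;
# 3 folds is the most this data allows, since class 0 has only 3 rows
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_model = make_pipeline(StandardScaler(), LogisticRegression())

# One accuracy value per fold, each measured on rows the model did not see
fold_accuracy = cross_val_score(cv_model, X, y, cv=cv, scoring='accuracy')
print('Accuracy per fold:', fold_accuracy)
print(f'Mean accuracy: {fold_accuracy.mean():.2f}')
```

Even with cross-validation, 8 rows are far too few to draw conclusions about real customers; the sketch only illustrates the mechanics.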