Sales Prediction Model for State of Connecticut Cannabis Retail Sales
This data set contains preliminary weekly retail sales data for cannabis and cannabis products in both the adult-use cannabis and medical marijuana markets. The data reported is compiled at specific points in time and only captures data current at the time the report is generated. The weekly data set captures retail cannabis sales from Sunday through Saturday of the week. Weeks spanning across two different months only include days within the same month. The first and last week of each month may show lower sales as they may not be made up of a full week (7 days). Data values may be updated and change over time as updates occur. Accordingly, weekly reported data may not exactly match annually reported data.
Source Data : https://catalog.data.gov/dataset/cannabis-retail-sales-by-week-ending
Creating a prediction model using Python and Pandas involves several steps, including data preprocessing, exploratory data analysis, feature engineering, model selection, training, and evaluation.
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import joblib
Step 2: Load the Dataset
Cannibas_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Cannibas Retail Sales\Machine Learning\Source Data\Cannabis_Retail_Sales_by_Week_Ending.csv")
# Display the first few rows of the dataset
print(Cannibas_Sales_df.head())
   Week Ending  Adult-Use Retail Sales  Medical Marijuana Retail Sales  \
0   01/14/2023              1485019.32                      1776700.69
1   01/21/2023              1487815.81                      2702525.61
2   01/28/2023              1553216.30                      2726237.56
3   01/31/2023               578840.62                       863287.86
4   02/04/2023              1047436.20                      1971731.40

   Total Adult-Use and Medical Sales  Adult-Use Products Sold  \
0                         3261720.01                    33610
1                         4190341.42                    33005
2                         4279453.86                    34854
3                         1442128.48                    12990
4                         3019167.60                    24134

   Medical Products Sold  Total Products Sold  \
0                  49312                82922
1                  77461               110466
2                  76450               111304
3                  24023                37013
4                  56666                80800

   Adult-Use Average Product Price  Medical Average Product Price
0                            44.25                          36.23
1                            45.08                          34.89
2                            44.56                          35.65
3                            44.56                          35.93
4                            43.49                          34.84
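The loading step can also check that the expected columns are present and parse the date column straight away. The sketch below is only an illustration: `load_sales`, `EXPECTED_COLUMNS`, and the three-row in-memory CSV (built from values in the preview above) are hypothetical stand-ins, and the date format `%m/%d/%Y` is an assumption based on how the dates appear in that preview.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt using values from the preview above;
# in practice `source` would be the path to the downloaded CSV file.
SAMPLE_CSV = io.StringIO(
    "Week Ending,Adult-Use Retail Sales,Medical Marijuana Retail Sales,"
    "Total Adult-Use and Medical Sales\n"
    "01/14/2023,1485019.32,1776700.69,3261720.01\n"
    "01/21/2023,1487815.81,2702525.61,4190341.42\n"
    "01/28/2023,1553216.30,2726237.56,4279453.86\n"
)

# Columns this sketch expects to find (a subset of the real file's columns)
EXPECTED_COLUMNS = [
    "Week Ending",
    "Adult-Use Retail Sales",
    "Medical Marijuana Retail Sales",
    "Total Adult-Use and Medical Sales",
]


def load_sales(source, expected_columns):
    """Read the CSV, fail early on missing columns, and parse the date column."""
    df = pd.read_csv(source)
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    # Assumed date format: month/day/year, as shown in the preview
    df["Week Ending"] = pd.to_datetime(df["Week Ending"], format="%m/%d/%Y")
    return df


sample_df = load_sales(SAMPLE_CSV, EXPECTED_COLUMNS)
print(sample_df.dtypes)
```

Failing early on a missing column tends to give a clearer error than a `KeyError` raised several steps later.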
Step 3: Data Preprocessing
Before building the model, you need to preprocess the data. This includes handling missing values, converting data types, and encoding categorical variables if necessary.
# Check for missing values
print(Cannibas_Sales_df.isnull().sum())
# Convert 'Week Ending' to datetime format
Cannibas_Sales_df['Week Ending'] = pd.to_datetime(Cannibas_Sales_df['Week Ending'])
# Extract year, month, and day from the date
Cannibas_Sales_df['Year'] = Cannibas_Sales_df['Week Ending'].dt.year
Cannibas_Sales_df['Month'] = Cannibas_Sales_df['Week Ending'].dt.month
Cannibas_Sales_df['Day'] = Cannibas_Sales_df['Week Ending'].dt.day
# Drop the original 'Week Ending' column
Cannibas_Sales_df.drop('Week Ending', axis=1, inplace=True)
# Display the first few rows after preprocessing
print(Cannibas_Sales_df.head())
Week Ending                          0
Adult-Use Retail Sales               0
Medical Marijuana Retail Sales       0
Total Adult-Use and Medical Sales    0
Adult-Use Products Sold              0
Medical Products Sold                0
Total Products Sold                  0
Adult-Use Average Product Price      0
Medical Average Product Price        0
dtype: int64

   Adult-Use Retail Sales  Medical Marijuana Retail Sales  \
0              1485019.32                      1776700.69
1              1487815.81                      2702525.61
2              1553216.30                      2726237.56
3               578840.62                       863287.86
4              1047436.20                      1971731.40

   Total Adult-Use and Medical Sales  Adult-Use Products Sold  \
0                         3261720.01                    33610
1                         4190341.42                    33005
2                         4279453.86                    34854
3                         1442128.48                    12990
4                         3019167.60                    24134

   Medical Products Sold  Total Products Sold  \
0                  49312                82922
1                  77461               110466
2                  76450               111304
3                  24023                37013
4                  56666                80800

   Adult-Use Average Product Price  Medical Average Product Price  Year  \
0                            44.25                          36.23  2023
1                            45.08                          34.89  2023
2                            44.56                          35.65  2023
3                            44.56                          35.93  2023
4                            43.49                          34.84  2023

   Month  Day
0      1   14
1      1   21
2      1   28
3      1   31
4      2    4
Step 4: Exploratory Data Analysis (EDA)
Perform some basic EDA to understand the data distribution and relationships between variables.
# Summary statistics
print(Cannibas_Sales_df.describe())
# Correlation matrix
print(Cannibas_Sales_df.corr())
# Plotting the correlation matrix
import seaborn as sns
sns.heatmap(Cannibas_Sales_df.corr(), annot=True, cmap='coolwarm')
plt.show()
Adult-Use Retail Sales Medical Marijuana Retail Sales \
count 1.290000e+02 1.290000e+02
mean 2.805301e+06 1.777271e+06
std 1.119186e+06 6.973442e+05
min 1.639950e+05 6.283767e+04
25% 2.005884e+06 1.458784e+06
50% 3.154663e+06 1.818867e+06
75% 3.781082e+06 2.365348e+06
max 4.495102e+06 3.085787e+06
Total Adult-Use and Medical Sales Adult-Use Products Sold \
count 1.290000e+02 129.000000
mean 4.582549e+06 71854.674419
std 1.560073e+06 30263.936939
min 2.268327e+05 4188.000000
25% 3.815815e+06 51174.000000
50% 5.385123e+06 81333.000000
75% 5.599181e+06 96544.000000
max 7.290974e+06 120223.000000
Medical Products Sold Total Products Sold \
count 129.000000 129.000000
mean 49059.937984 121017.155039
std 19173.146419 42855.211462
min 1916.000000 6104.000000
25% 41914.000000 96853.000000
50% 51266.000000 140225.000000
75% 62499.000000 148744.000000
max 86307.000000 199162.000000
Adult-Use Average Product Price Medical Average Product Price \
count 129.000000 129.000000
mean 39.163566 35.965271
std 1.661305 1.734351
min 35.550000 32.800000
25% 38.140000 34.750000
50% 39.080000 35.650000
75% 39.970000 36.830000
max 45.080000 41.830000
Year Month Day
count 129.000000 129.000000 129.000000
mean 2023.558140 6.325581 18.294574
std 0.571552 3.531472 9.731074
min 2023.000000 1.000000 1.000000
25% 2023.000000 3.000000 10.000000
50% 2024.000000 6.000000 19.000000
75% 2024.000000 9.000000 28.000000
max 2025.000000 12.000000 31.000000
Adult-Use Retail Sales \
Adult-Use Retail Sales 1.000000
Medical Marijuana Retail Sales 0.445148
Total Adult-Use and Medical Sales 0.916391
Adult-Use Products Sold 0.985865
Medical Products Sold 0.487862
Total Products Sold 0.914167
Adult-Use Average Product Price -0.388460
Medical Average Product Price -0.291913
Year 0.396208
Month 0.279192
Day 0.026906
Medical Marijuana Retail Sales \
Adult-Use Retail Sales 0.445148
Medical Marijuana Retail Sales 1.000000
Total Adult-Use and Medical Sales 0.766368
Adult-Use Products Sold 0.418163
Medical Products Sold 0.987573
Total Products Sold 0.736314
Adult-Use Average Product Price 0.252361
Medical Average Product Price 0.253649
Year -0.423685
Month -0.096965
Day 0.006971
Total Adult-Use and Medical Sales \
Adult-Use Retail Sales 0.916391
Medical Marijuana Retail Sales 0.766368
Total Adult-Use and Medical Sales 1.000000
Adult-Use Products Sold 0.894187
Medical Products Sold 0.791456
Total Products Sold 0.984970
Adult-Use Average Product Price -0.165875
Medical Average Product Price -0.096031
Year 0.094840
Month 0.156928
Day 0.022443
Adult-Use Products Sold \
Adult-Use Retail Sales 0.985865
Medical Marijuana Retail Sales 0.418163
Total Adult-Use and Medical Sales 0.894187
Adult-Use Products Sold 1.000000
Medical Products Sold 0.478523
Total Products Sold 0.920014
Adult-Use Average Product Price -0.438852
Medical Average Product Price -0.303896
Year 0.393841
Month 0.296591
Day 0.020368
Medical Products Sold Total Products Sold \
Adult-Use Retail Sales 0.487862 0.914167
Medical Marijuana Retail Sales 0.987573 0.736314
Total Adult-Use and Medical Sales 0.791456 0.984970
Adult-Use Products Sold 0.478523 0.920014
Medical Products Sold 1.000000 0.784075
Total Products Sold 0.784075 1.000000
Adult-Use Average Product Price 0.223924 -0.209915
Medical Average Product Price 0.159102 -0.143934
Year -0.364514 0.115209
Month -0.100276 0.164167
Day -0.002415 0.016054
Adult-Use Average Product Price \
Adult-Use Retail Sales -0.388460
Medical Marijuana Retail Sales 0.252361
Total Adult-Use and Medical Sales -0.165875
Adult-Use Products Sold -0.438852
Medical Products Sold 0.223924
Total Products Sold -0.209915
Adult-Use Average Product Price 1.000000
Medical Average Product Price 0.347670
Year -0.351466
Month -0.555304
Day -0.111863
Medical Average Product Price Year \
Adult-Use Retail Sales -0.291913 0.396208
Medical Marijuana Retail Sales 0.253649 -0.423685
Total Adult-Use and Medical Sales -0.096031 0.094840
Adult-Use Products Sold -0.303896 0.393841
Medical Products Sold 0.159102 -0.364514
Total Products Sold -0.143934 0.115209
Adult-Use Average Product Price 0.347670 -0.351466
Medical Average Product Price 1.000000 -0.619702
Year -0.619702 1.000000
Month -0.056917 -0.179758
Day -0.158841 -0.003103
Month Day
Adult-Use Retail Sales 0.279192 0.026906
Medical Marijuana Retail Sales -0.096965 0.006971
Total Adult-Use and Medical Sales 0.156928 0.022443
Adult-Use Products Sold 0.296591 0.020368
Medical Products Sold -0.100276 -0.002415
Total Products Sold 0.164167 0.016054
Adult-Use Average Product Price -0.555304 -0.111863
Medical Average Product Price -0.056917 -0.158841
Year -0.179758 -0.003103
Month 1.000000 -0.010315
Day -0.010315 1.000000
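A full correlation matrix is hard to scan by eye, so it can help to list only the strongly correlated pairs. The sketch below is one possible way to do that; the helper `high_correlation_pairs`, the 0.95 threshold, and the toy columns (`units_sold`, `revenue`, `day`) are all hypothetical and are not taken from the data set.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.95):
    """Return column pairs whose absolute correlation is at least `threshold`."""
    corr = df.corr()
    # Keep only the upper triangle so each pair is reported once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)


# Toy data: revenue tracks units_sold closely, day does not
toy_eda = pd.DataFrame({
    "units_sold": [10, 20, 30, 40, 50, 60],
    "revenue": [405, 797, 1202, 1596, 2001, 2400],
    "day": [14, 21, 28, 31, 4, 11],
})

strong_pairs = high_correlation_pairs(toy_eda)
print(strong_pairs)
```

Applied to the matrix above, a helper like this would be expected to surface pairs such as 'Adult-Use Retail Sales' and 'Adult-Use Products Sold' (0.985865).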
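Step 5: Feature Engineering
Feature engineering involves creating new features or transforming existing ones to improve the model's performance. The feature created here sums the adult-use and medical product counts; in the rows printed below it equals the existing 'Total Products Sold' column, so it is unlikely to give the model new information.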
# Create a new feature: Total Products Sold per Week
Cannibas_Sales_df['Total Products Sold per Week'] = Cannibas_Sales_df['Adult-Use Products Sold'] + Cannibas_Sales_df['Medical Products Sold']
# Display the first few rows after feature engineering
print(Cannibas_Sales_df.head())
   Adult-Use Retail Sales  Medical Marijuana Retail Sales  \
0              1485019.32                      1776700.69
1              1487815.81                      2702525.61
2              1553216.30                      2726237.56
3               578840.62                       863287.86
4              1047436.20                      1971731.40

   Total Adult-Use and Medical Sales  Adult-Use Products Sold  \
0                         3261720.01                    33610
1                         4190341.42                    33005
2                         4279453.86                    34854
3                         1442128.48                    12990
4                         3019167.60                    24134

   Medical Products Sold  Total Products Sold  \
0                  49312                82922
1                  77461               110466
2                  76450               111304
3                  24023                37013
4                  56666                80800

   Adult-Use Average Product Price  Medical Average Product Price  Year  \
0                            44.25                          36.23  2023
1                            45.08                          34.89  2023
2                            44.56                          35.65  2023
3                            44.56                          35.93  2023
4                            43.49                          34.84  2023

   Month  Day  Total Products Sold per Week
0      1   14                         82922
1      1   21                        110466
2      1   28                        111304
3      1   31                         37013
4      2    4                         80800
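The data set description notes that weeks spanning two months are split, so some rows cover fewer than 7 days. A feature that counts the days in each reporting period may therefore carry information the existing columns lack. The sketch below is a hedged illustration using the first five 'Week Ending' dates from the preview. It assumes the rows are in chronological order with no missing periods, and it would need to run before 'Week Ending' is dropped in Step 3. The name `days_in_period` is hypothetical.

```python
import pandas as pd

# First five 'Week Ending' dates from the preview above
week_ending = pd.to_datetime(
    pd.Series(["01/14/2023", "01/21/2023", "01/28/2023", "01/31/2023", "02/04/2023"]),
    format="%m/%d/%Y",
)

# Days since the previous reporting date, capped at a full 7-day week
days_in_period = week_ending.diff().dt.days.clip(upper=7)

# The first row has no predecessor: assume a full week unless the month
# started fewer than 7 days before the reporting date
days_in_period.iloc[0] = min(7, week_ending.iloc[0].day)

days_in_period = days_in_period.astype(int)
print(days_in_period.tolist())  # [7, 7, 7, 3, 4]
```

The two short periods (3 and 4 days) line up with the lower sales figures in rows 3 and 4 of the preview.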
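Step 6: Splitting the Data
Split the data into training and testing sets. Note that the features keep 'Adult-Use Retail Sales' and 'Medical Marijuana Retail Sales'; in the rows printed above these two columns sum to the target, 'Total Adult-Use and Medical Sales', so the model can recover the target almost directly from its inputs.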
# Define features (X) and target (y)
X = Cannibas_Sales_df.drop(['Total Adult-Use and Medical Sales'], axis=1)
y = Cannibas_Sales_df['Total Adult-Use and Medical Sales']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
(103, 11) (26, 11)
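Because the rows are weekly observations, a random split mixes later weeks into the training set. An alternative worth considering is a chronological split that holds out the most recent rows. The sketch below is a minimal version of that idea; `chronological_split` and the toy columns are hypothetical, and it assumes the rows are already sorted by date.

```python
import pandas as pd


def chronological_split(X, y, test_fraction=0.2):
    """Hold out the last rows as the test set; assumes rows are sorted by date."""
    n_test = max(1, int(round(len(X) * test_fraction)))
    split_at = len(X) - n_test
    return X.iloc[:split_at], X.iloc[split_at:], y.iloc[:split_at], y.iloc[split_at:]


# Toy data: ten consecutive weeks with a made-up sales figure
toy_weeks = pd.DataFrame({
    "week_index": range(10),
    "total_sales": [100.0 + 5 * i for i in range(10)],
})
X_toy = toy_weeks[["week_index"]]
y_toy = toy_weeks["total_sales"]

X_tr, X_te, y_tr, y_te = chronological_split(X_toy, y_toy, test_fraction=0.2)
print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)
```

With this split, every test row comes after every training row, which is closer to how a forecasting model would be used.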
Step 7: Model Selection and Training
Choose a model and train it on the training data. For simplicity, we'll use a Linear Regression model.
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
LinearRegression()
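After fitting, the learned coefficients show which inputs the model relies on. The sketch below uses synthetic data, not the cannabis data set, to illustrate what happens when the target is the exact sum of two features: the regression assigns those two features coefficients close to 1 and the unrelated feature a coefficient close to 0. All column names and value ranges here are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features: two sales components and an unrelated month column
toy_features = pd.DataFrame({
    "adult_use_sales": rng.uniform(1e6, 4e6, size=50),
    "medical_sales": rng.uniform(1e6, 3e6, size=50),
    "month": rng.integers(1, 13, size=50),
})

# The target is exactly the sum of the two sales components
toy_target = toy_features["adult_use_sales"] + toy_features["medical_sales"]

toy_model = LinearRegression().fit(toy_features, toy_target)

# Pair each coefficient with its column name for easier reading
coefficients = pd.Series(toy_model.coef_, index=toy_features.columns)
print(coefficients.round(3))
```

Running the same inspection on the fitted `model` from this step (with `X_train.columns` as the index) would show whether it has learned a similar sum.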
Step 8: Model Evaluation
Evaluate the model's performance on the test data.
# Make predictions
y_pred = model.predict(X_test)
# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# Plot the actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
Mean Squared Error: 346692.3760245455
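MSE is expressed in squared units, which makes it hard to read on its own. The sketch below shows a few companion metrics on a small made-up example: RMSE and MAE are in the same units as the target, and R² gives a scale-free score. It also converts the MSE reported above to an RMSE. The toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values for illustration
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 310.0, 390.0])

mse_toy = mean_squared_error(y_true_toy, y_pred_toy)
rmse_toy = np.sqrt(mse_toy)  # same units as the target
mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)
r2_toy = r2_score(y_true_toy, y_pred_toy)

print(f"MSE: {mse_toy}, RMSE: {rmse_toy}, MAE: {mae_toy}, R2: {r2_toy:.3f}")

# RMSE implied by the MSE reported above (roughly 589)
reported_rmse = np.sqrt(346692.3760245455)
print(f"Reported RMSE: {reported_rmse:.1f}")
```

An RMSE of roughly 589 against weekly totals in the millions is an unusually small error, which is consistent with the target being recoverable from the input columns.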
Step 9: Making Predictions
You can now use the trained model to make predictions on new data.
# Example: Predict on new data
new_data = pd.DataFrame({
'Adult-Use Retail Sales': [1500000],
'Medical Marijuana Retail Sales': [1800000],
'Adult-Use Products Sold': [30000],
'Medical Products Sold': [50000],
'Total Products Sold': [80000],
'Adult-Use Average Product Price': [40],
'Medical Average Product Price': [35],
'Year': [2024],
'Month': [1],
'Day': [15],
'Total Products Sold per Week': [80000]
})
# Save the trained model to a file
joblib.dump(model, 'cannabis_sales_model.pkl')
# Predict total sales for the new data point
predicted_sales = model.predict(new_data)
print(f'Predicted Total Sales: {predicted_sales[0]}')
Predicted Total Sales: 3300001.5161294458
Summary
Import Libraries: Import necessary libraries like Pandas, NumPy, and Scikit-learn.
Load Data: Load the dataset into a Pandas DataFrame.
Preprocess Data: Handle missing values, convert data types, and create new features.
EDA: Perform exploratory data analysis to understand the data.
Feature Engineering: Create new features or transform existing ones.
Split Data: Split the data into training and testing sets.
Train Model: Choose a model and train it on the training data.
Evaluate Model: Evaluate the model's performance on the test data.
Make Predictions: Use the trained model to make predictions on new data.