I. INTRODUCTION

I.1. Background

Faro, an off-price retailer that specialises in selling branded products at discounted prices. The business wants to understand which product types, brands, and
suppliers move fastest, how pricing affects inventory turnover, and how customer shopping habits vary by region and time.

I.2. Project Objective

The main objectives of this project are:

clean and prepare the data for analysis by identifying possible quality issues.
Perform a business-oriented exploratory data analysis (EDA) to identify key insights on how specific discounts and features influence the company’s sales, business performance, and customer behavior.
Build and evaluate a predictive model to predict whether a given product will be sold on clearance or not.

I. 3. Data Overview:

We load the dataset and assess its structure and summary statistics.

Dataset Description

TransactionID: Unique ID for each transaction
StoreID: Store location identifier
Date: Date of transaction
ProductID: Unique product identifier
Brand: Product brand
Supplier: Supplier of the product
Category: Top-level product category (e.g., Apparel, Footwear, Accessories, Kids)
Subcategory: Second-level product classification (e.g., Tops, Boots, Bags, Boys)
Type: Third-level, most granular product type (e.g., T-shirt, Sneakers, Clutch, Puzzle)
OriginalPrice: Original list price
DiscountedPrice: Actual sale price
Quantity: Number of units sold (can be negative for returns)
CustomerID: Encrypted customer ID
Region: Store region (e.g., GP, WC, KZN)

II. DATA PREPARATION

II.1. Loading the dataset

We load the dataset and display the first 10 transactions.

Code

import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import boxcox
from scipy.stats import skew

from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report


from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from sklearn.tree import DecisionTreeClassifier


from lets_plot import *
from plotnine import *

LetsPlot.setup_html()

df = pd.read_csv('../data/raw/offprice_transactions.csv')
df.head(5)

	TransactionID	StoreID	Date	ProductID	Brand	Supplier	Category	Subcategory	Type	OriginalPrice	DiscountedPrice	Quantity	CustomerID	Region	ClearanceFlag
0	T100000	S008	2024-04-12	P4655	Puma	Asos	Footwear	Sneakers	Lifestyle Sneakers	243.03	194.42	5	C18185	WC	No
1	T100001	S007	2024-04-12	P9119	Nike	Asos	Kids	Baby	Booties	144.83	130.35	5	C23031	KZN	No
2	T100002	S008	2024-04-09	P4186	Puma	SoleSupplier	Kids	Girls	Dress	60.00	36.00	4	C82590	GP	No
3	T100003	S008	2024-02-07	P2999	Ralph Lauren	Bloomingdale's	Footwear	Sandals	Flip Flops	20.22	8.09	2	C57438	GP	Yes
4	T100004	S009	2024-01-22	P2853	Michael Kors	Bloomingdale's	Footwear	Sandals	Flip Flops	21.98	19.78	2	C60203	WC	No

II.2. Dataset Inspection

We check the number of observations and features in the dataset.

Our dataset has 100000 transactions and 15 features.

Note

The dataset contains 100000 transactions (records) and is composed of 15 features.

Features types:

We check the feature types to find out what feature may need conversion.

Code

df.dtypes

TransactionID       object
StoreID             object
Date                object
ProductID           object
Brand               object
Supplier            object
Category            object
Subcategory         object
Type                object
OriginalPrice      float64
DiscountedPrice    float64
Quantity             int64
CustomerID          object
Region              object
ClearanceFlag       object
dtype: object

Features Insights

Date feature has to be converted into a datetime feature, and temporal features (day, week, month) extracted from it.
Create DiscountPercent feature from the price features.
Create Revenue feature based on both Price and Quantity. Refer to Caution 1.

Feature Information:

Code

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 15 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   TransactionID    100000 non-null  object 
 1   StoreID          100000 non-null  object 
 2   Date             100000 non-null  object 
 3   ProductID        100000 non-null  object 
 4   Brand            100000 non-null  object 
 5   Supplier         100000 non-null  object 
 6   Category         100000 non-null  object 
 7   Subcategory      100000 non-null  object 
 8   Type             100000 non-null  object 
 9   OriginalPrice    100000 non-null  float64
 10  DiscountedPrice  100000 non-null  float64
 11  Quantity         100000 non-null  int64  
 12  CustomerID       100000 non-null  object 
 13  Region           100000 non-null  object 
 14  ClearanceFlag    100000 non-null  object 
dtypes: float64(2), int64(1), object(12)
memory usage: 11.4+ MB

Caution 1: Revenue formula

Some products are sold at normal price while others are sold at a discounted price.
Thus, the formula for the Revenue feature is:

\[ \text{Revenue} = \begin{cases} \text{Quantity} \times \text{OriginalPrice}, & \text{if } \texttt{ClearanceFlag} = \texttt{No} \\ \text{Quantity} \times \text{DiscountedPrice}, & \text{if } \texttt{ClearanceFlag} = \texttt{Yes} \end{cases} \]

II.3. Data Quality issues

We now check any data quality issues:

duplicate values
missing values
skewness of the distribution
outliers
incorrect feature formating etc

II.3.1. Duplicate values

Code

df.duplicated().any()

False

Note

There are no duplicate value in the dataset.

II.3.2. Missing value analysis

Below we check the number of missing value for each feature:

Code

df.isnull().any()

TransactionID      False
StoreID            False
Date               False
ProductID          False
Brand              False
Supplier           False
Category           False
Subcategory        False
Type               False
OriginalPrice      False
DiscountedPrice    False
Quantity           False
CustomerID         False
Region             False
ClearanceFlag      False
dtype: bool

Note

There are no missing value in the dataset. There is therefore no need for handling incomplete records.

II.3.3. Distribution analysis

We check the distribution of both the numerical and categorical features. But first we extract numerical and categorical features.

Code

numerical_columns = df.select_dtypes(include="number").columns.to_list()

categorical_columns = df.select_dtypes(exclude="number").columns.to_list()

print(f" **Numerical features in the dataset:** {numerical_columns}")

print(f" **Categorical features in the dataset:** {categorical_columns}")

 **Numerical features in the dataset:** ['OriginalPrice', 'DiscountedPrice', 'Quantity']
 **Categorical features in the dataset:** ['TransactionID', 'StoreID', 'Date', 'ProductID', 'Brand', 'Supplier', 'Category', 'Subcategory', 'Type', 'CustomerID', 'Region', 'ClearanceFlag']

Caution

Remember we have to convert the Date feature later on.

Listing 1: Numerical features description

Code

df.describe()

	OriginalPrice	DiscountedPrice	Quantity
count	100000.000000	100000.000000	100000.000000
mean	160.068899	107.431994	2.770260
std	80.798129	59.216377	1.698319
min	20.010000	8.000000	-2.000000
25%	89.720000	58.140000	2.000000
50%	160.180000	103.060000	3.000000
75%	230.060000	149.980000	4.000000
max	300.000000	270.000000	5.000000

We can also visualize the distribution below:

Code

numerical = ["OriginalPrice", "DiscountedPrice", "Quantity"]

for feat in numerical: 

    # Extract skewness
    print(f"{feat}:")
    skewness = skew(df[feat])

    # Print
    print(f"Skewness is {skewness}")
    
    # Generate plot
    p = (ggplot(df, aes(x=feat)) +
         geom_density(color='darkgreen', alpha=.7) +
         ggtitle(f"Distribution of {feat} (Skewness: {skewness:.2f})") +
         xlab(feat) + ylab("Density") +
         theme_minimal())

    # Show the plot
    display(p)

OriginalPrice:
Skewness is -0.0028062700583212194

DiscountedPrice:
Skewness is 0.35201956858296896

Quantity:
Skewness is -0.576037618829498

Numerical feature Insights

OriginalPrice feature is uniformly distributed meaning products are sold throughout the price range and no particular OriginalPrice is informative.
DiscountedPrice feature is skewed to the right meaning most products are sold at a lower DiscountedPrice, up to R140, compared to higher prices, above R150.
Quantity feature has a minimum of -2 meaning one or two items returned every now and then.

Code

df.describe(exclude="number")

	TransactionID	StoreID	Date	ProductID	Brand	Supplier	Category	Subcategory	Type	CustomerID	Region	ClearanceFlag
count	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000	100000
unique	100000	10	140	9000	10	7	4	16	51	60470	3	2
top	T100000	S002	2024-02-23	P2610	Tommy Hilfiger	Asos	Accessories	Hats	Shorts	C76591	KZN	No
freq	1	10102	776	27	10180	23446	25122	6469	3654	9	33460	79980

We can have a more detailed view of the categorical features below:

Code

categories = ["StoreID", "ProductID", "Brand", "Supplier", "Category", "Subcategory", "Type", "CustomerID", "Region"]

for col in categories:
    print(f"{col} unique values: {df[col].value_counts()}")
    print("--"*20)

StoreID unique values: StoreID
S002    10102
S007    10066
S008    10037
S003    10036
S001    10014
S005     9987
S010     9973
S004     9954
S009     9950
S006     9881
Name: count, dtype: int64
----------------------------------------
ProductID unique values: ProductID
P2610    27
P1976    26
P7930    25
P2591    25
P1640    25
         ..
P9560     2
P6268     2
P6560     2
P1952     2
P6495     2
Name: count, Length: 9000, dtype: int64
----------------------------------------
Brand unique values: Brand
Tommy Hilfiger    10180
Ralph Lauren      10129
Under Armour      10119
Adidas            10002
Zara               9992
Michael Kors       9932
DKNY               9932
Puma               9918
Guess              9913
Nike               9883
Name: count, dtype: int64
----------------------------------------
Supplier unique values: Supplier
Asos              23446
Nordstrom         15197
Bloomingdale's    14981
Macy's            14867
JD Sports         13441
Inditex            9992
SoleSupplier       8076
Name: count, dtype: int64
----------------------------------------
Category unique values: Category
Accessories    25122
Apparel        25022
Kids           25007
Footwear       24849
Name: count, dtype: int64
----------------------------------------
Subcategory unique values: Subcategory
Hats         6469
Boys         6416
Tops         6336
Flats        6311
Belts        6291
Toys         6258
Boots        6256
Jewelry      6248
Dresses      6246
Bottoms      6232
Outerwear    6208
Sneakers     6174
Baby         6173
Girls        6160
Bags         6114
Sandals      6108
Name: count, dtype: int64
----------------------------------------
Type unique values: Type
Shorts                3654
T-shirt               3648
Loafers               2148
Sneakers              2142
Puzzle                2130
Espadrilles           2125
Elastic Belt          2122
Running Shoes         2120
Casual Dress          2114
Booties               2106
Bracelet              2105
Sandals               2105
Knee-high Boots       2102
Canvas Belt           2093
Ankle Boots           2084
Action Figure         2084
Evening Dress         2083
Earrings              2079
Leather Belt          2076
Dress                 2074
Flip Flops            2072
Chelsea Boots         2070
Necklace              2064
Basketball Shoes      2052
Maxi Dress            2049
Slides                2045
Plush Toy             2044
Onesie                2038
Ballet Flats          2038
Rompers               2029
Lifestyle Sneakers    2002
Gladiator Sandals     1991
Leggings              1981
Fedora                1646
Cap                   1635
Skirt                 1623
Tank Top              1619
Bucket Hat            1611
Blouse                1604
Shirt                 1590
Clutch                1585
Beanie                1577
Trousers              1569
Puffer                1561
Jacket                1560
Tote                  1550
Blazer                1544
Coat                  1543
Jeans                 1535
Satchel               1516
Backpack              1463
Name: count, dtype: int64
----------------------------------------
CustomerID unique values: CustomerID
C76591    9
C36989    8
C68230    8
C42539    8
C99455    7
         ..
C23917    1
C19435    1
C39164    1
C63404    1
C90467    1
Name: count, Length: 60470, dtype: int64
----------------------------------------
Region unique values: Region
KZN    33460
GP     33438
WC     33102
Name: count, dtype: int64
----------------------------------------

Categorical features Insights

Some customers have multiple transactions, so CustomerID is not unique.
9000 unique products are sold
10 unique brands and stores
Products are bought in 7 different quantities, may need to investigate further for a correlation between quantity and price.
60470 unique customers, maybe investigate or cluster them based on their purchasing behavior.

Code

df["ClearanceFlag"].value_counts()

ClearanceFlag
No     79980
Yes    20020
Name: count, dtype: int64

Target feature Insight

Out of 100000 items sold:

79980, ~80%, were not on clearance sale
20020, ~20%, were on clearance sale
Stores sell mostly items at the normal rate.
A highly imbalanced dataset (1:4 ratio) for building a predictive model.

II.3.4. Data preparation

We prepare the data for further analysis.

A. Formatting features

We convert the date feature in a datetime type, and we ensure numerical features are properly set.

Code

df['Date'] = pd.to_datetime(df['Date'])

df['OriginalPrice'] = pd.to_numeric(df['OriginalPrice'], errors='coerce')
df['DiscountedPrice'] = pd.to_numeric(df['DiscountedPrice'], errors='coerce')
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce')

B. Feature preparation

At this stage Listing 1, we observed that the lowest of value of the Quantity feature is -2. We must take care of returned products as well create a Revenue Caution 1.

Features

Every returned product was entered as a negative number. We create a new feature to describe if a product is return. e.g IsReturn.
Machine learning behave differently to the presence of negative numbers such as quantity = -1. We must consider both the business logic (A negative quantity means a return and thus an opportunity loss) and the model performance impact.

Code

# 1. Separate quantities and create indicators
df['QuantityAbs'] = df['Quantity'].abs()
# Identify returns
df['IsReturn'] = (df['Quantity'] < 0).astype(int)
df['TransactionType'] = df['Quantity'].apply(lambda x: 'Return' if x < 0 else 'Sale')


# 2. Create revenue Revenue with proper handling
df['RevenueAbs'] = np.where(df['ClearanceFlag'] == 'No', 
                            df['QuantityAbs'] * df['OriginalPrice'],
                            df['QuantityAbs'] * df['DiscountedPrice'])


# 3. Create final Revenue (negative for returns)
df['RevenueFinal'] = np.where(df['IsReturn'] == 1, 
                              -df['RevenueAbs'], 
                              df['RevenueAbs'])

# 4. Drop original Quantity if desired
df = df.drop('Quantity', axis=1)


# 5. create time-based features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['WeekOfYear'] = df['Date'].dt.isocalendar().week
df['DayOfWeek'] = df['Date'].dt.dayofweek

# 6. Calculate discount percentage
df['DiscountPercentage'] = ((df['OriginalPrice'] - df['DiscountedPrice']) / df['OriginalPrice'] * 100).round(2)

df.head(5)

	TransactionID	StoreID	Date	ProductID	Brand	Supplier	Category	Subcategory	Type	OriginalPrice	...	QuantityAbs	TransactionType	RevenueAbs	RevenueFinal	Year	Month	WeekOfYear	DayOfWeek	DiscountPercentage
0	T100000	S008	2024-04-12	P4655	Puma	Asos	Footwear	Sneakers	Lifestyle Sneakers	243.03	...	5	Sale	1215.15	1215.15	2024	4	15	4	20.00
1	T100001	S007	2024-04-12	P9119	Nike	Asos	Kids	Baby	Booties	144.83	...	5	Sale	724.15	724.15	2024	4	15	4	10.00
2	T100002	S008	2024-04-09	P4186	Puma	SoleSupplier	Kids	Girls	Dress	60.00	...	4	Sale	240.00	240.00	2024	4	15	1	40.00
3	T100003	S008	2024-02-07	P2999	Ralph Lauren	Bloomingdale's	Footwear	Sandals	Flip Flops	20.22	...	2	Sale	16.18	16.18	2024	2	6	2	59.99
4	T100004	S009	2024-01-22	P2853	Michael Kors	Bloomingdale's	Footwear	Sandals	Flip Flops	21.98	...	2	Sale	43.96	43.96	2024	1	4	0	10.01

5 rows × 24 columns

C. Outlier Analysis

We observed above that DiscountedPrice is skewed to the right indicating the presence of outliers.

There are many ways to detect outliers:

By calculating the skewness of the distribution and establishing a threshold
By visualizing the distribution of features
And more advanced methods more outlier detection
Using the IQR for the feature.

Above we applied the first two. We can confirm that with the last method:

For DiscountedPrice, we have:

min = 8
25% = 58
50% = 103
75% = 149.98
100% = 270

By finding the difference between each interval, we can deduce from where outliers occur:

25% - min = 50
50% - 25% = 45
75% - 50% = 46
100% - 75%= 120

Outliers Insight

We notice a jump from 75th percentile to the 100th depicting outliers.
75th percentile + 50 = ~200. Anything above 200 in the DiscountedPrice is an outlier.
Though a mild skewness (0.35) is observed, we will attempt to deal with it.

Summary of transformations

Code

transformations = {}

# Original
transformations['Original'] = df['DiscountedPrice']

# Log transformation
transformations['Log'] = np.log1p(df['DiscountedPrice'])

# Square root
transformations['Sqrt'] = np.sqrt(df['DiscountedPrice'])

# Box-Cox
transformations['BoxCox'], _ = boxcox(df['DiscountedPrice'] + 1)

# Yeo-Johnson
pt = PowerTransformer(method='yeo-johnson', standardize=False)
transformations['YeoJohnson'] = pt.fit_transform(df[['DiscountedPrice']]).flatten()

# Compare skewness
results = []
for name, data in transformations.items():
    skew_val = stats.skew(data)
    results.append({'Transformation': name, 'Skewness': skew_val})

comparison_df = pd.DataFrame(results)
print(comparison_df.round(3))

# Find the best transformation (closest to 0)
best_transform = comparison_df.loc[comparison_df['Skewness'].abs().idxmin(), 'Transformation']
print(f"\nBest transformation: {best_transform}")

  Transformation  Skewness
0       Original     0.352
1            Log    -0.781
2           Sqrt    -0.176
3         BoxCox    -0.116
4     YeoJohnson    -0.116

Best transformation: BoxCox

Final implementation

Code

# Apply Box-Cox and store the Lambda value 
df['DiscountedPriceBoxCox'], lambda_val = boxcox(df['DiscountedPrice'] + 1)

# Drop original column
df = df.drop('DiscountedPrice', axis=1)


# Update your revenue calculation with transformed prices
df['RevenueAbs'] = np.where(df['ClearanceFlag'] == 'No', 
                           df['QuantityAbs'] * df['OriginalPrice'],
                           df['QuantityAbs'] * df['DiscountedPriceBoxCox'])

print(f"New skewness: {stats.skew(df['DiscountedPriceBoxCox']):.3f}")
df.head(5)

New skewness: -0.116

	TransactionID	StoreID	Date	ProductID	Brand	Supplier	Category	Subcategory	Type	OriginalPrice	...	TransactionType	RevenueAbs	RevenueFinal	Year	Month	WeekOfYear	DayOfWeek	DiscountPercentage	DiscountedPriceBoxCox
0	T100000	S008	2024-04-12	P4655	Puma	Asos	Footwear	Sneakers	Lifestyle Sneakers	243.03	...	Sale	1215.150000	1215.15	2024	4	15	4	20.00	30.868606
1	T100001	S007	2024-04-12	P9119	Nike	Asos	Kids	Baby	Booties	144.83	...	Sale	724.150000	724.15	2024	4	15	4	10.00	24.486402
2	T100002	S008	2024-04-09	P4186	Puma	SoleSupplier	Kids	Girls	Dress	60.00	...	Sale	240.000000	240.00	2024	4	15	1	40.00	11.337297
3	T100003	S008	2024-02-07	P2999	Ralph Lauren	Bloomingdale's	Footwear	Sandals	Flip Flops	20.22	...	Sale	8.567399	16.18	2024	2	6	2	59.99	4.283699
4	T100004	S009	2024-01-22	P2853	Michael Kors	Bloomingdale's	Footwear	Sandals	Flip Flops	21.98	...	Sale	43.960000	43.96	2024	1	4	0	10.01	7.776294

5 rows × 24 columns

Summary of Data Preparation

We loaded our dataset and check its observations and features
We made sure there were no missing values nor any duplicated values
We created new features to enrich our dataset and make it more consistent for further analysis.
We converted Date and transformed the DiscountedPrice feature to handle the format for the former and outliers for the latter.

III. EXPLORATORY DATA ANALYSIS - EDA

We explore the dataset in depth and answer business questions.

Let us have a look at the updated dataset’s features:

III.1. Units and Revenue Analysis

Sales information

Code

print(f"Sales transactions: {df[df['IsReturn'] == 0].shape[0]:,}")
print(f"Return transactions: {df[df['IsReturn'] == 1].shape[0]:,}")
print(f"Return rate: {df['IsReturn'].mean():.3f}")

Sales transactions: 94,914
Return transactions: 5,086
Return rate: 0.051

Sales Insights

Sales account for 94.5% of all transactions while returns for ~ 5.1%.

Because we focus on units sold and revenue, we filter to only use items that were not returned.

Code

df_sales = df[df['IsReturn'] == 0].copy()
print(f"\nAnalyzing {df_sales.shape[0]:,} sales transactions (excluding returns)")


Analyzing 94,914 sales transactions (excluding returns)

Units Analysis

Code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Summarise
grouped_df = (
    df_sales.groupby(['Region', 'Supplier', 'ClearanceFlag'], as_index=False)
      .agg({'QuantityAbs': 'sum', 'RevenueFinal': 'sum'})
)

# Order Supplier by total quantity across all ClearanceFlags
supplier_order = (
    grouped_df.groupby('Supplier')['QuantityAbs'].sum()
    .sort_values().index.tolist()
)
grouped_df['Supplier'] = pd.Categorical(grouped_df['Supplier'], categories=supplier_order, ordered=True)

# Define colors
palette = {"Yes": "#faca80", "No": "#151931"}

# Prepare plot
fig, axes = plt.subplots(1, 3, figsize=(18, 9), sharey=True)
fig.subplots_adjust(top=0.85, bottom=0.15, left=0.05, right=0.95, wspace=0.25)

# Custom Title and Subtitle
highlight_color = "#faca80"
subtitle_color = "gray"
caption_color = "#666666"
text_color = "#151931"

fig.suptitle("Supplier Contribution to Quantity Sold", fontsize=18, fontweight='bold', y=0.98)
fig.text(0.5, 0.925,
         f"Comparison of Clearance status per Region",
         ha='center', fontsize=13, color=subtitle_color)

regions = grouped_df['Region'].unique()

for i, region in enumerate(regions):
    ax = axes[i]
    region_data = grouped_df[grouped_df['Region'] == region]

    suppliers = region_data['Supplier'].cat.categories
    clearance_yes = []
    clearance_no = []

    for supplier in suppliers:
        supplier_data = region_data[region_data['Supplier'] == supplier]
        yes_val = supplier_data[supplier_data['ClearanceFlag'] == 'Yes']['QuantityAbs'].sum()
        no_val = supplier_data[supplier_data['ClearanceFlag'] == 'No']['QuantityAbs'].sum()
        clearance_yes.append(yes_val)
        clearance_no.append(no_val)

    y_pos = range(len(suppliers))
    bar_height = 0.35

    bars_no = ax.barh([y - bar_height/2 for y in y_pos], clearance_no, 
                      bar_height, label='No', color=palette['No'])
    bars_yes = ax.barh([y + bar_height/2 for y in y_pos], clearance_yes, 
                       bar_height, label='Yes', color=palette['Yes'])

    # Add data labels
    for bar in bars_no:
        width = bar.get_width()
        if width > 0:
            ax.text(width + max(clearance_no) * 0.01, bar.get_y() + bar.get_height()/2,
                    f"{int(width):,}", va='center', ha='left',
                    fontsize=9, color=text_color, fontweight='bold')

    for bar in bars_yes:
        width = bar.get_width()
        if width > 0:
            ax.text(width + max(clearance_yes) * 0.01, bar.get_y() + bar.get_height()/2,
                    f"{int(width):,}", va='center', ha='left',
                    fontsize=9, color=text_color, fontweight='bold')

    ax.set_yticks(y_pos)
    ax.set_yticklabels(suppliers)
    ax.set_xlabel('Total Quantity', fontsize=11)
    ax.set_title(f"{region}", fontsize=13, fontweight='bold', color=text_color, pad=15)
    ax.grid(axis='x', alpha=0.3, linestyle='--')

    ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x):,}'))

    if i == 0:
        ax.legend(title='Clearance Sale', loc='upper right', fontsize=9)

# Add caption
fig.text(0.01, 0.02, 'Source: Dan - Faro', 
         ha='left', va='bottom', fontsize=9, color=caption_color)

plt.show()

Code

# Summarise
grouped_df = (
    df_sales.groupby(['Region', 'Brand', 'ClearanceFlag'], as_index=False)
      .agg({'QuantityAbs': 'sum', 'RevenueFinal': 'sum'})
)

# Order Supplier by total quantity across all ClearanceFlags
brand_order = (
    grouped_df.groupby('Brand')['QuantityAbs'].sum()
    .sort_values().index.tolist()
)
grouped_df['Brand'] = pd.Categorical(grouped_df['Brand'], categories=brand_order, ordered=True)

# Define colors
palette = {"Yes": "#faca80", "No": "#151931"}

# Prepare plot
fig, axes = plt.subplots(1, 3, figsize=(18, 9), sharey=True)
fig.subplots_adjust(top=0.85, bottom=0.15, left=0.05, right=0.95, wspace=0.25)

# Custom Title and Subtitle
highlight_color = "#faca80"
subtitle_color = "gray"
caption_color = "#666666"
text_color = "#151931"

fig.suptitle("Brand Contribution to Quantity Sold", fontsize=18, fontweight='bold', y=0.98)
fig.text(0.5, 0.925,
         f"Comparison of Clearance status per Region",
         ha='center', fontsize=13, color=subtitle_color)

regions = grouped_df['Region'].unique()

for i, region in enumerate(regions):
    ax = axes[i]
    region_data = grouped_df[grouped_df['Region'] == region]

    brands = region_data['Brand'].cat.categories
    clearance_yes = []
    clearance_no = []

    for brand in brands:
        brand_data = region_data[region_data['Brand'] == brand]
        yes_val = brand_data[brand_data['ClearanceFlag'] == 'Yes']['QuantityAbs'].sum()
        no_val = brand_data[brand_data['ClearanceFlag'] == 'No']['QuantityAbs'].sum()
        clearance_yes.append(yes_val)
        clearance_no.append(no_val)

    y_pos = range(len(brands))
    bar_height = 0.35

    bars_no = ax.barh([y - bar_height/2 for y in y_pos], clearance_no, 
                      bar_height, label='No', color=palette['No'])
    bars_yes = ax.barh([y + bar_height/2 for y in y_pos], clearance_yes, 
                       bar_height, label='Yes', color=palette['Yes'])

    # Add data labels
    for bar in bars_no:
        width = bar.get_width()
        if width > 0:
            ax.text(width + max(clearance_no) * 0.01, bar.get_y() + bar.get_height()/2,
                    f"{int(width):,}", va='center', ha='left',
                    fontsize=9, color=text_color, fontweight='bold')

    for bar in bars_yes:
        width = bar.get_width()
        if width > 0:
            ax.text(width + max(clearance_yes) * 0.01, bar.get_y() + bar.get_height()/2,
                    f"{int(width):,}", va='center', ha='left',
                    fontsize=9, color=text_color, fontweight='bold')

    ax.set_yticks(y_pos)
    ax.set_yticklabels(brands)
    ax.set_xlabel('Total Quantity', fontsize=11)
    ax.set_title(f"{region}", fontsize=13, fontweight='bold', color=text_color, pad=15)
    ax.grid(axis='x', alpha=0.3, linestyle='--')

    ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x):,}'))

    if i == 0:
        ax.legend(title='Clearance Sale', loc='upper right', fontsize=9)

# Add caption
fig.text(0.01, 0.02, 'Source: Dan - Faro', 
         ha='left', va='bottom', fontsize=9, color=caption_color)

plt.show()

Code

# Summarise
grouped_df = (
    df_sales.groupby(['Region', 'Category', 'ClearanceFlag'], as_index=False)
      .agg({'QuantityAbs': 'sum', 'RevenueFinal': 'sum'})
)


category_order = (
    grouped_df.groupby('Category')['QuantityAbs'].sum()
    .sort_values().index.tolist()
)
grouped_df['Category'] = pd.Categorical(grouped_df['Category'], categories=category_order, ordered=True)

# Define colors
palette = {"Yes": "#faca80", "No": "#151931"}

# Prepare plot
fig, axes = plt.subplots(1, 3, figsize=(18, 9), sharey=True)
fig.subplots_adjust(top=0.85, bottom=0.15, left=0.05, right=0.95, wspace=0.25)

# Custom Title and Subtitle
highlight_color = "#faca80"
subtitle_color = "gray"
caption_color = "#666666"
text_color = "#151931"

fig.suptitle("Category Contribution to Quantity Sold", fontsize=18, fontweight='bold', y=0.98)
fig.text(0.5, 0.925,
         f"Comparison of Clearance status per Region",
         ha='center', fontsize=13, color=subtitle_color)

regions = grouped_df['Region'].unique()

for i, region in enumerate(regions):
    ax = axes[i]
    region_data = grouped_df[grouped_df['Region'] == region]

    categories = region_data['Category'].cat.categories
    clearance_yes = []
    clearance_no = []

    for category in categories:
        category_data = region_data[region_data['Category'] == category]
        yes_val = category_data[category_data['ClearanceFlag'] == 'Yes']['QuantityAbs'].sum()
        no_val = category_data[category_data['ClearanceFlag'] == 'No']['QuantityAbs'].sum()
        clearance_yes.append(yes_val)
        clearance_no.append(no_val)

    y_pos = range(len(categories))
    bar_height = 0.35

    bars_no = ax.barh([y - bar_height/2 for y in y_pos], clearance_no, 
                      bar_height, label='No', color=palette['No'])
    bars_yes = ax.barh([y + bar_height/2 for y in y_pos], clearance_yes, 
                       bar_height, label='Yes', color=palette['Yes'])

    # Add data labels
    for bar in bars_no:
        width = bar.get_width()
        if width > 0:
            ax.text(width + max(clearance_no) * 0.01, bar.get_y() + bar.get_height()/2,
                    f"{int(width):,}", va='center', ha='left',
                    fontsize=9, color=text_color, fontweight='bold')

    for bar in bars_yes:
        width = bar.get_width()
        if width > 0:
            ax.text(width + max(clearance_yes) * 0.01, bar.get_y() + bar.get_height()/2,
                    f"{int(width):,}", va='center', ha='left',
                    fontsize=9, color=text_color, fontweight='bold')

    ax.set_yticks(y_pos)
    ax.set_yticklabels(categories)
    ax.set_xlabel('Total Quantity', fontsize=11)
    ax.set_title(f"{region}", fontsize=13, fontweight='bold', color=text_color, pad=15)
    ax.grid(axis='x', alpha=0.3, linestyle='--')

    ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x):,}'))

    if i == 0:
        ax.legend(title='Clearance Sale', loc='upper right', fontsize=9)

# Add caption
fig.text(0.01, 0.02, 'Source: Dan - Faro', 
         ha='left', va='bottom', fontsize=9, color=caption_color)

plt.show()

Code

units_by_category = df_sales.groupby('Category')['QuantityAbs'].sum().sort_values(ascending=False)
units_by_brand = df_sales.groupby('Brand')['QuantityAbs'].sum().sort_values(ascending=False).head(15)
units_by_supplier = df_sales.groupby('Supplier')['QuantityAbs'].sum().sort_values(ascending=False).head(15)


category_units_df = units_by_category.reset_index()
category_units_df.columns = ['Category', 'Units']

brand_units_df = units_by_brand.reset_index()
brand_units_df.columns = ['Brand', 'Units']

supplier_units_df = units_by_supplier.reset_index()
supplier_units_df.columns = ['Supplier', 'Units']

print("Top 5 Categories by units sold:")
for i, (cat, units) in enumerate(units_by_category.head().items(), 1):
    print(f"{i}. {cat}: {units:,} units")

print("\nTop 5 Brands by units sold:")
for i, (brand, units) in enumerate(units_by_brand.head().items(), 1):
    print(f"{i}. {brand}: {units:,} units")

print("\nTop 5 Suppliers by units sold:")
for i, (brand, units) in enumerate(units_by_supplier.head().items(), 1):
    print(f"{i}. {brand}: {units:,} units")

Top 5 Categories by units sold:
1. Accessories: 71,445 units
2. Apparel: 71,300 units
3. Kids: 71,154 units
4. Footwear: 70,780 units

Top 5 Brands by units sold:
1. Tommy Hilfiger: 28,915 units
2. Ralph Lauren: 28,840 units
3. Under Armour: 28,739 units
4. Adidas: 28,542 units
5. DKNY: 28,408 units

Top 5 Suppliers by units sold:
1. Asos: 66,762 units
2. Nordstrom: 43,342 units
3. Bloomingdale's: 42,764 units
4. Macy's: 42,366 units
5. JD Sports: 38,301 units

Revenue Analysis

Code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


grouped_df = (
    df_sales
      .groupby(['Region', 'Supplier', 'ClearanceFlag'], as_index=False)
      .agg({'QuantityAbs': 'sum',              
            'RevenueFinal': 'sum'})
)


supplier_order = (
    grouped_df.groupby('Supplier')['RevenueFinal']   
              .sum()
              .sort_values()
              .index
              .tolist()
)
grouped_df['Supplier'] = pd.Categorical(grouped_df['Supplier'],
                                        categories=supplier_order,
                                        ordered=True)


palette        = {"Yes": "#faca80", "No": "#151931"}
highlight_col  = "#faca80"
subtitle_col   = "gray"
caption_col    = "#666666"
text_col       = "#151931"


fig, axes = plt.subplots(1, 3, figsize=(18, 9), sharey=True)
fig.subplots_adjust(top=0.85, bottom=0.15, left=0.05,
                    right=0.95, wspace=0.25)

fig.suptitle("Supplier Contribution to Revenue",
             fontsize=18, fontweight='bold', y=0.98)

fig.text(0.5, 0.925,
         "Clearance vs Regular sales by region",
         ha='center', fontsize=13, color=subtitle_col)

regions = grouped_df['Region'].unique()


for i, region in enumerate(regions):
    ax = axes[i]
    region_data = grouped_df[grouped_df['Region'] == region]

    suppliers       = region_data['Supplier'].cat.categories
    clearance_yes   = []
    clearance_no    = []

    
    for supplier in suppliers:
        data = region_data[region_data['Supplier'] == supplier]
        yes_val = data[data['ClearanceFlag'] == 'Yes']['RevenueFinal'].sum()
        no_val  = data[data['ClearanceFlag'] == 'No']['RevenueFinal'].sum()
        clearance_yes.append(yes_val)
        clearance_no.append(no_val)

    y_pos      = range(len(suppliers))
    bar_height = 0.35

    bars_no = ax.barh([y - bar_height/2 for y in y_pos],
                      clearance_no, bar_height,
                      label='No', color=palette['No'])
    bars_yes = ax.barh([y + bar_height/2 for y in y_pos],
                       clearance_yes, bar_height,
                       label='Yes', color=palette['Yes'])

    
    def _add_labels(bars, offset):
        for bar in bars:
            width = bar.get_width()
            if width > 0:
                ax.text(width + offset, bar.get_y() + bar.get_height()/2,
                        f"{width:,.0f}",      
                        va='center', ha='left',
                        fontsize=9, color=text_col, fontweight='bold')
                

    _add_labels(bars_no, max(clearance_no) * 0.01 if clearance_no else 0)
    _add_labels(bars_yes, max(clearance_yes) * 0.01 if clearance_yes else 0)

    
    ax.set_yticks(y_pos)
    ax.set_yticklabels(suppliers)
    ax.set_xlabel('Total Revenue', fontsize=11)
    ax.set_title(region, fontsize=13, fontweight='bold',
                 color=text_col, pad=15)
    ax.grid(axis='x', alpha=0.3, linestyle='--')

    
    ax.xaxis.set_major_formatter(FuncFormatter(
        lambda x, _: f"R {x:,.0f}"      
        
    ))
    ax.tick_params(axis='x', labelrotation=30)

    if i == 0:
        ax.legend(title='Clearance Sale', loc='upper right', fontsize=9)


fig.text(0.01, 0.02, 'Source: Dan – Faro',
         ha='left', va='bottom', fontsize=9, color=caption_col)

plt.show()

Code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


grouped_df = (
    df_sales
      .groupby(['Region', 'Brand', 'ClearanceFlag'], as_index=False)
      .agg({'QuantityAbs': 'sum',              
            'RevenueFinal': 'sum'})
)


brand_order = (
    grouped_df.groupby('Brand')['RevenueFinal']   
              .sum()
              .sort_values()
              .index
              .tolist()
)
grouped_df['Brand'] = pd.Categorical(grouped_df['Brand'],
                                        categories=brand_order,
                                        ordered=True)


palette        = {"Yes": "#faca80", "No": "#151931"}
highlight_col  = "#faca80"
subtitle_col   = "gray"
caption_col    = "#666666"
text_col       = "#151931"


fig, axes = plt.subplots(1, 3, figsize=(18, 9), sharey=True)
fig.subplots_adjust(top=0.85, bottom=0.15, left=0.05,
                    right=0.95, wspace=0.25)

fig.suptitle("Brand Contribution to Revenue",
             fontsize=18, fontweight='bold', y=0.98)

fig.text(0.5, 0.925,
         "Clearance status revenue by region",
         ha='center', fontsize=13, color=subtitle_col)

regions = grouped_df['Region'].unique()


for i, region in enumerate(regions):
    ax = axes[i]
    region_data = grouped_df[grouped_df['Region'] == region]

    brands       = region_data['Brand'].cat.categories
    clearance_yes   = []
    clearance_no    = []

    
    for brand in brands:
        data = region_data[region_data['Brand'] == brand]
        yes_val = data[data['ClearanceFlag'] == 'Yes']['RevenueFinal'].sum()
        no_val  = data[data['ClearanceFlag'] == 'No']['RevenueFinal'].sum()
        clearance_yes.append(yes_val)
        clearance_no.append(no_val)

    y_pos      = range(len(brands))
    bar_height = 0.35

    bars_no = ax.barh([y - bar_height/2 for y in y_pos],
                      clearance_no, bar_height,
                      label='No', color=palette['No'])
    bars_yes = ax.barh([y + bar_height/2 for y in y_pos],
                       clearance_yes, bar_height,
                       label='Yes', color=palette['Yes'])

    
    def _add_labels(bars, offset):
        for bar in bars:
            width = bar.get_width()
            if width > 0:
                ax.text(width + offset, bar.get_y() + bar.get_height()/2,
                        f"{width:,.0f}",      
                        va='center', ha='left',
                        fontsize=9, color=text_col, fontweight='bold'
                        )

    _add_labels(bars_no, max(clearance_no) * 0.01 if clearance_no else 0)
    _add_labels(bars_yes, max(clearance_yes) * 0.01 if clearance_yes else 0)

    
    ax.set_yticks(y_pos)
    ax.set_yticklabels(brands)
    ax.set_xlabel('Total Revenue', fontsize=11)
    ax.set_title(region, fontsize=13, fontweight='bold',
                 color=text_col, pad=15)
    ax.grid(axis='x', alpha=0.3, linestyle='--')

    
    ax.xaxis.set_major_formatter(FuncFormatter(
        lambda x, _: f"R {x:,.0f}"      
        
    ))

    ax.tick_params(axis='x', labelrotation=30)

    if i == 0:
        ax.legend(title='Clearance Sale', loc='upper right', fontsize=9)


fig.text(0.01, 0.02, 'Source: Dan – Faro',
         ha='left', va='bottom', fontsize=9, color=caption_col)

plt.show()

Code

grouped_df = (
    df_sales
      .groupby(['Region', 'Category', 'ClearanceFlag'], as_index=False)
      .agg({'QuantityAbs': 'sum',              
            'RevenueFinal': 'sum'})
)


category_order = (
    grouped_df.groupby('Category')['RevenueFinal']   
              .sum()
              .sort_values()
              .index
              .tolist()
)
grouped_df['Category'] = pd.Categorical(grouped_df['Category'],
                                        categories=category_order,
                                        ordered=True)


palette        = {"Yes": "#faca80", "No": "#151931"}
highlight_col  = "#faca80"
subtitle_col   = "gray"
caption_col    = "#666666"
text_col       = "#151931"


fig, axes = plt.subplots(1, 3, figsize=(18, 9), sharey=True)
fig.subplots_adjust(top=0.85, bottom=0.15, left=0.05,
                    right=0.95, wspace=0.25)

fig.suptitle("Category Contribution to Revenue",
             fontsize=18, fontweight='bold', y=0.98)

fig.text(0.5, 0.925,
         "Clearance status revenue by region",
         ha='center', fontsize=13, color=subtitle_col)

regions = grouped_df['Region'].unique()


for i, region in enumerate(regions):
    ax = axes[i]
    region_data = grouped_df[grouped_df['Region'] == region]

    categories       = region_data['Category'].cat.categories
    clearance_yes   = []
    clearance_no    = []

    
    for category in categories:
        data = region_data[region_data['Category'] == category]
        yes_val = data[data['ClearanceFlag'] == 'Yes']['RevenueFinal'].sum()
        no_val  = data[data['ClearanceFlag'] == 'No']['RevenueFinal'].sum()
        clearance_yes.append(yes_val)
        clearance_no.append(no_val)

    y_pos      = range(len(categories))
    bar_height = 0.35

    bars_no = ax.barh([y - bar_height/2 for y in y_pos],
                      clearance_no, bar_height,
                      label='No', color=palette['No'])
    bars_yes = ax.barh([y + bar_height/2 for y in y_pos],
                       clearance_yes, bar_height,
                       label='Yes', color=palette['Yes'])

    
    def _add_labels(bars, offset):
        for bar in bars:
            width = bar.get_width()
            if width > 0:
                ax.text(width + offset, bar.get_y() + bar.get_height()/2,
                        f"{width:,.0f}",      
                        va='center', ha='left',
                        fontsize=9, color=text_col, fontweight='bold'
                        )

    _add_labels(bars_no, max(clearance_no) * 0.01 if clearance_no else 0)
    _add_labels(bars_yes, max(clearance_yes) * 0.01 if clearance_yes else 0)

    
    ax.set_yticks(y_pos)
    ax.set_yticklabels(categories)
    ax.set_xlabel('Total Revenue', fontsize=11)
    ax.set_title(region, fontsize=13, fontweight='bold',
                 color=text_col, pad=15)
    ax.grid(axis='x', alpha=0.3, linestyle='--')

    
    ax.xaxis.set_major_formatter(FuncFormatter(
        lambda x, _: f"R {x:,.0f}"      
        
    ))

    ax.tick_params(axis='x', labelrotation=30)

    if i == 0:
        ax.legend(title='Clearance Sale', loc='upper right', fontsize=9)


fig.text(0.01, 0.02, 'Source: Dan – Faro',
         ha='left', va='bottom', fontsize=9, color=caption_col)

plt.show()

Code

revenue_by_category = df_sales.groupby('Category')['RevenueFinal'].sum().sort_values(ascending=False)
revenue_by_brand = df_sales.groupby('Brand')['RevenueFinal'].sum().sort_values(ascending=False).head(15)
revenue_by_supplier = df_sales.groupby('Supplier')['RevenueFinal'].sum().sort_values(ascending=False).head(15)


category_revenue_df = revenue_by_category.reset_index()
category_revenue_df.columns = ['Category', 'Revenue']


brand_revenue_df = revenue_by_brand.reset_index()
brand_revenue_df.columns = ['Brand', 'Revenue']

supplier_revenue_df = revenue_by_supplier.reset_index()
supplier_revenue_df.columns = ['Supplier', 'Revenue']



print("\nTOP 5 Categories by Revenue:")
for i, (cat, rev) in enumerate(revenue_by_category.head().items(), 1):
    print(f"{i}. {cat}: R {rev:,.2f}")


print("\nTOP 5 Brands by Revenue:")
for i, (brand, rev) in enumerate(revenue_by_brand.head().items(), 1):
    print(f"{i}. {brand}: R {rev:,.2f}")


print("\nTOP 5 Suppliers by Revenue:")
for i, (brand, rev) in enumerate(units_by_supplier.head().items(), 1):
    print(f"{i}. {brand}: R {rev:,.2f}")


TOP 5 Categories by Revenue:
1. Kids: R 10,225,549.88
2. Apparel: R 10,210,762.89
3. Accessories: R 10,159,280.09
4. Footwear: R 10,110,418.12

TOP 5 Brands by Revenue:
1. Tommy Hilfiger: R 4,117,751.97
2. DKNY: R 4,116,944.23
3. Under Armour: R 4,111,537.65
4. Ralph Lauren: R 4,078,220.00
5. Adidas: R 4,072,014.76

TOP 5 Suppliers by Revenue:
1. Asos: R 66,762.00
2. Nordstrom: R 43,342.00
3. Bloomingdale's: R 42,764.00
4. Macy's: R 42,366.00
5. JD Sports: R 38,301.00

Insights of Analysis

In all three regions (GP, KZN, WC),Asos is the top performer supplier selling the most units both on sales and at normal price. Thus generating also the most revenue.
Based on the number of units sold:
- In GP, Tommy Hilfiger is the top brand selling the most units at normal prices and the second best on sales products.
- In KZN, Ralph Lauren is the top brand selling the most units both on sales and at normal prices.
- In WC, Adidas is the top brand selling the most units at normal price and Ralph Lauren is the top brand selling the most units on sales.
- Tommy Hilfiger is the top performer brand in units sold. Followed by Ralph Lauren and Under Armour.
- Tommy Hilfiger is the top performer brand in revenue. Followed by DKNY and Under Armour.
Accessories is the top performer Category followed by Apparel and Kids in units sold while Kids, Apparel, and Accessories are the top performers in revenue.

III.2. Discount effect on top brands & suppliers

Supplier Discount % and units sold

Code

# discount data for scatter plot
discount_scatter_df = df_sales[df_sales['DiscountPercentage'] > 0].sample(min(5000, len(df_sales[df_sales['DiscountPercentage'] > 0]))).copy()

# discount distribution
discount_dist_df = df_sales[df_sales['DiscountPercentage'] > 0].copy()

# Group by Supplier
supplier_summary = (
    discount_dist_df.groupby('Supplier')
    .agg(TotalUnitsSold=('QuantityAbs', 'sum'), AvgDiscountPercent=('DiscountPercentage', 'mean'))
    .reset_index()
)

# Group by Brand
brand_summary = (
    discount_dist_df.groupby('Brand')
    .agg(TotalUnitsSold=('QuantityAbs', 'sum'), AvgDiscountPercent=('DiscountPercentage', 'mean'))
    .reset_index()
)

# Select top 10 suppliers
top_suppliers = supplier_summary.nlargest(10, 'TotalUnitsSold')

# Select top 10 brands
top_brands = brand_summary.nlargest(10, 'TotalUnitsSold')

p_discount_scatter = (
    ggplot(top_suppliers, aes(x='AvgDiscountPercent', y='TotalUnitsSold', label='Supplier')) +
    geom_point(color='#faca80', size=4, alpha=0.8) +
    geom_text(nudge_y=5, size=8) +
    labs(
        title='Top 7 Suppliers: Discount % vs. Units Sold',
        x='Average Discount Percent',
        y='Total Units Sold'
    ) +
    theme_minimal() +
    theme(figure_size=(6, 5))
)
p_discount_scatter

Supplier discount % and Units

Asos, top performer supplier in units sold, has an average discount percentage of 32.3%, whereas Nordstrom and Bloomingda have an average discount percentage of 43.3% with a performance of more or less 42,000 units, same as Macy's with an average discount percentage of 25%. Depicting how the discount percentage does not drive sales.

III.3. Average turnover per store

Turnover Statistics

Code

# Calculate weekly sales per store (sales only)
weekly_sales = df_sales.groupby(['StoreID', 'WeekOfYear'])['QuantityAbs'].sum().reset_index()
avg_weekly_turnover = weekly_sales.groupby('StoreID')['QuantityAbs'].mean().sort_values(ascending=False)


print("Stores performance KPI:")
print(f"-   Best performing store: {avg_weekly_turnover.index[0]} ({avg_weekly_turnover.iloc[0]:.1f} units/week)")
print(f"-   Average weekly turnover across all stores: {avg_weekly_turnover.mean():.1f} units/week")
print(f"-   Median weekly turnover: {avg_weekly_turnover.median():.1f} units/week")
print(f"-   Standard deviation: {avg_weekly_turnover.std():.1f} units/week")

print("\nTop 5 Stores by weekly turnover:")
for i, (store, turnover) in enumerate(avg_weekly_turnover.head(5).items(), 1):
    print(f"{i}. Store {store}: {turnover:.1f} units/week")

Stores performance KPI:
-   Best performing store: S002 (1438.0 units/week)
-   Average weekly turnover across all stores: 1423.4 units/week
-   Median weekly turnover: 1423.8 units/week
-   Standard deviation: 8.5 units/week

Top 5 Stores by weekly turnover:
1. Store S002: 1438.0 units/week
2. Store S010: 1429.9 units/week
3. Store S003: 1428.4 units/week
4. Store S001: 1428.2 units/week
5. Store S009: 1426.5 units/week

Visualizations

Code

turnover_df = avg_weekly_turnover.reset_index()
turnover_df.columns = ['StoreID', 'AvgWeeklyTurnover']

stores_df = turnover_df.head(10)


# Stores Performance
p_avg_turnover = (ggplot(stores_df, 
        aes(x='reorder(StoreID, AvgWeeklyTurnover)', y='AvgWeeklyTurnover')
    ) +
      geom_col(fill='#151931', alpha=0.8) +
      coord_flip() +
      labs(title='Average Weekly Turnover by Store', 
           x='Store ID', 
           y='Average Weekly Units Sold') +
      theme_minimal() +
      theme(figure_size=(6, 5)))

p_avg_turnover

Weekly trend

Code

weekly_trend = df_sales.groupby('WeekOfYear')['QuantityAbs'].sum().reset_index()
weekly_trend['WeekOfYear'] = weekly_trend['WeekOfYear'].astype(int)


p_trend = (ggplot(weekly_trend, aes(x='WeekOfYear', y='QuantityAbs')) +
       geom_line(color='#faca80', size=1) +
       geom_point(color='#151931', size=2) +
       labs(title='Weekly Sales Trend Across All Stores', 
            x='Week of Year', 
            y='Total Units Sold') +
       theme_minimal() +
       theme(figure_size=(10, 5)))

p_trend

Stores Insights

Stores sell on average 1424 units per week and only varies by 9 units indicating they sell more or less the same quantity.
S002 is the best performing store with 1438 units per week compared to the average of 1424 units per week.

III.4. Clearance Impact

Sales only

Code

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

PALETTE = {"No": "#faca80", "Yes": "#151931"}
currency = FuncFormatter(lambda x, _: f"${x:,.0f}")

def quartile_table(df, group_col, value_col):
    q = (df.groupby(group_col)[value_col]
            .quantile([.25, .5, .75]).unstack())
    q.columns = ['Q1', 'Median', 'Q3']
    return q.reset_index()

def annotate_quartiles(ax, stats, x_offset=0.06):
    for xpos, (_, row) in enumerate(stats.iterrows()):
        ax.scatter(xpos, row.Median, s=80,
                   facecolor='white', edgecolor='#151931',
                   linewidth=1.4, zorder=3)
        ax.text(xpos + x_offset, row.Median,
                f"Median: R {row.Median:,.0f}", va='center',
                ha='left', fontsize=9, fontweight='bold',
                color='#151931')
        ax.text(xpos - x_offset, row.Q1,
                f"Q1: R {row.Q1:,.0f}", va='center', ha='right',
                fontsize=8, color='#151931')
        ax.text(xpos - x_offset, row.Q3,
                f"Q3: R {row.Q3:,.0f}", va='center', ha='right',
                fontsize=8, color='#151931')

def clearance_plot(df, title):
    """Half-eye (left-half violin) + boxplot + annotated quartiles."""
    df = df.copy()
    df["_half"] = True  # dummy hue level for split violin

    fig, ax = plt.subplots(figsize=(6, 6))

    # Use dummy palette for _half
    dummy_palette = {True: "#b0b0b8"}

    # half violin (left side)
    sns.violinplot(
        data=df, x='ClearanceFlag', y='RevenueFinal',
        hue='_half',
        split=True,
        bw_adjust=.6,
        cut=0,
        palette=dummy_palette,  
        legend=False,
        inner=None, linewidth=0,
        ax=ax
    )

    # 2oxplot on top, with correct PALETTE
    sns.boxplot(
        data=df, x='ClearanceFlag', y='RevenueFinal',
        hue='ClearanceFlag', palette=PALETTE,
        legend=False, dodge=False,
        width=.18, fliersize=0,
        boxprops={'alpha': .9}, ax=ax
    )

    # annotate quartiles
    stats = quartile_table(df, 'ClearanceFlag', 'RevenueFinal')
    annotate_quartiles(ax, stats)

    # Final formatting
    ax.set_xlabel(None)
    ax.set_ylabel('Revenue (per line)', fontsize=11)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(['Regular Price', 'Clearance'],
                       fontsize=11, fontweight='bold', color='#151931')
    ax.set_title(title, fontsize=14, weight='bold')
    ax.yaxis.set_major_formatter(currency)
    sns.despine(left=True)
    fig.tight_layout()
    return fig

# Build plots

sales_df = df[df['IsReturn'] == 0].copy()
returns_df = df[df['IsReturn'] == 1].copy()   

fig_sales = clearance_plot(
    sales_df,
    "Sales Revenue Distribution by Clearance Status"
)

if not returns_df.empty:
    fig_returns = clearance_plot(
        returns_df,
        "Return Revenue Distribution by Clearance Status"
    )

plt.show()

Code

# Sales only
clearance_sales_analysis = df_sales.groupby('ClearanceFlag').agg({
    'QuantityAbs': ['sum', 'mean'],
    'RevenueFinal': ['sum', 'mean'],
}).round(2)




print("Clearance vs Non-Clearance (Sales Only):")
print(clearance_sales_analysis)

Clearance vs Non-Clearance (Sales Only):
              QuantityAbs       RevenueFinal        
                      sum  mean          sum    mean
ClearanceFlag                                       
No                 227842  3.00  36373670.00  479.34
Yes                 56837  2.99   4332340.98  227.65

Returns and Sales

Code

# Sales and returns transactions
clearance_return_analysis = df.groupby('ClearanceFlag').agg({
    'IsReturn': ['sum', 'mean'],
    'RevenueFinal': ['sum', 'mean'],
    'QuantityAbs': 'sum'
}).round(3)

print("Sales and returns")
print(clearance_return_analysis)

print("\nRETURN ANALYSIS (All Transactions):")
print("Non-Clearance:")
print(f"  - Return rate: {clearance_return_analysis.loc['No', ('IsReturn', 'mean')]:.3f}")
print(f"  - Total return transactions: {clearance_return_analysis.loc['No', ('IsReturn', 'sum')]:,.0f}")

print("Clearance:")
print(f"  - Return rate: {clearance_return_analysis.loc['Yes', ('IsReturn', 'mean')]:.3f}")
print(f"  - Total return transactions: {clearance_return_analysis.loc['Yes', ('IsReturn', 'sum')]:,.0f}")

Sales and returns
              IsReturn        RevenueFinal          QuantityAbs
                   sum   mean          sum     mean         sum
ClearanceFlag                                                  
No                4097  0.051  35380209.05  442.363      234025
Yes                989  0.049   4217442.73  210.661       58307

RETURN ANALYSIS (All Transactions):
Non-Clearance:
  - Return rate: 0.051
  - Total return transactions: 4,097
Clearance:
  - Return rate: 0.049
  - Total return transactions: 989

Clearance Impact Insights

Faro has sold more items at normal price (227842 units for R36,373,670.00) than at discount price ( 56837 units for R4,332,340.98).
The return rates for items by their clearance status is roughly the same, 5.1% when sold at not price and 4.9% when sold at discounted price indicating that clearance does not impact the return.

III.5. Customer behavior

Code

regional_sales_analysis = df_sales.groupby('Region').agg({
    'CustomerID': 'nunique',
    'QuantityAbs': ['sum', 'mean'],
    'RevenueFinal': ['sum', 'mean'],
    'DiscountPercentage': 'mean',
    'TransactionID': 'count'
}).round(2)

regional_return_analysis = df.groupby('Region')['IsReturn'].mean()

print("Regional Sales KPI:")
print(regional_sales_analysis)

print("\nRegional Returns KPI:")
for region, rate in regional_return_analysis.items():
    print(f"{region}: {rate:.3f}")

Regional Sales KPI:
       CustomerID QuantityAbs       RevenueFinal         DiscountPercentage  \
          nunique         sum  mean          sum    mean               mean   
Region                                                                        
GP          26751       95479  3.01  13650498.60  430.05              33.00   
KZN         26712       95226  3.00  13700240.18  431.31              32.74   
WC          26571       93974  2.99  13355272.20  425.22              32.88   

       TransactionID  
               count  
Region                
GP             31742  
KZN            31764  
WC             31408  

Regional Returns KPI:
GP: 0.051
KZN: 0.051
WC: 0.051

Regional metrics

Code

regional_summary = df_sales.groupby('Region').agg({
    'CustomerID': 'nunique',
    'TransactionID': 'count',
    'RevenueFinal': 'sum',
    'QuantityAbs': 'sum'
}).reset_index()

regional_summary['TransactionsPerCustomer'] = regional_summary['TransactionID'] / regional_summary['CustomerID']
regional_summary['RevenuePerCustomer'] = regional_summary['RevenueFinal'] / regional_summary['CustomerID']
regional_summary['UnitsPerCustomer'] = regional_summary['QuantityAbs'] / regional_summary['CustomerID']

# Add return rates
regional_summary = regional_summary.merge(
    regional_return_analysis.reset_index().rename(columns={'IsReturn': 'ReturnRate'}),
    on='Region'
)

print("\nCustomer-based Regional KPI:")
print(regional_summary[['Region', 'CustomerID', 'TransactionsPerCustomer', 'RevenuePerCustomer', 'UnitsPerCustomer', 'ReturnRate']])


Customer-based Regional KPI:
  Region  CustomerID  TransactionsPerCustomer  RevenuePerCustomer  \
0     GP       26751                 1.186572          510.279937   
1    KZN       26712                 1.189128          512.887099   
2     WC       26571                 1.182041          502.625878   

   UnitsPerCustomer  ReturnRate  
0          3.569175    0.050721  
1          3.564915    0.050687  
2          3.536713    0.051175

Customer behavior Insights

Regions have roughly the same number of unique customers and generating the same revenue.
Regions have the same number of returns items indicating that no particular region is underperforming.

IV. PREDICTIVE MODELLING

State of the dataset

Code

print("\nData types:")
df.info()


Data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   TransactionID          100000 non-null  object        
 1   StoreID                100000 non-null  object        
 2   Date                   100000 non-null  datetime64[ns]
 3   ProductID              100000 non-null  object        
 4   Brand                  100000 non-null  object        
 5   Supplier               100000 non-null  object        
 6   Category               100000 non-null  object        
 7   Subcategory            100000 non-null  object        
 8   Type                   100000 non-null  object        
 9   OriginalPrice          100000 non-null  float64       
 10  CustomerID             100000 non-null  object        
 11  Region                 100000 non-null  object        
 12  ClearanceFlag          100000 non-null  object        
 13  QuantityAbs            100000 non-null  int64         
 14  IsReturn               100000 non-null  int32         
 15  TransactionType        100000 non-null  object        
 16  RevenueAbs             100000 non-null  float64       
 17  RevenueFinal           100000 non-null  float64       
 18  Year                   100000 non-null  int32         
 19  Month                  100000 non-null  int32         
 20  WeekOfYear             100000 non-null  UInt32        
 21  DayOfWeek              100000 non-null  int32         
 22  DiscountPercentage     100000 non-null  float64       
 23  DiscountedPriceBoxCox  100000 non-null  float64       
dtypes: UInt32(1), datetime64[ns](1), float64(5), int32(4), int64(1), object(12)
memory usage: 16.5+ MB

Note

TransactionID is not informative for the model because of high cardinality.
ProductID is not informative for the model because of high cardinality.

Code

df = df.drop(['TransactionID', 'ProductID'], axis=1)

IV.1. Target Feature

ClearanceFlag is the target feature of interest. We attempt to predict whether a product will be sold at clearance or at normal price. This will guide the marketing team to know when, how, and for which products to promote clearance sales to maximize revenue. It is, however, a tricky task as any product WILL definitely be sold at clearance.

Thus the goal is to penalize prediction of Yes when the product would have been sold at normal price.

Code

df["ClearanceFlag"] = (df["ClearanceFlag"] == 'Yes').astype(int)

df["ClearanceFlag"].value_counts()

ClearanceFlag
0    79980
1    20020
Name: count, dtype: int64

Note

A 1:4 ratio between Yes and No for clearance status displaying imbalance.

IV.2. Feature Selection

Feature selection is a very important step in machine learning, just like the saying garbage in, garbage out, feeding the wrong features may negatively impact the model performance.

There are numerous way to select what features, recursive selection, mutual information, to more advanced methods making use of evolutionary algorithms such as particle swarm optimization or grey wolf.

In this project, we will follow a simple mutual information(MI), a concept from information theory telling us how much we can learn from a feature if we know the value of of another.

Goal: I first want to establish a baseline model without performing any advanced selection or engineering of features.

Numerical and Categorical features

Code

numerical_cols = ['OriginalPrice', 'QuantityAbs','RevenueAbs', 'RevenueFinal', 'Month', 'WeekOfYear', 'DayOfWeek', 'IsReturn','DiscountPercentage','DiscountedPriceBoxCox']


df[numerical_cols].corrwith(df.ClearanceFlag).to_frame('correlation')

	correlation
OriginalPrice	0.002843
QuantityAbs	-0.003838
RevenueAbs	-0.474774
RevenueFinal	-0.258259
Month	-0.002561
WeekOfYear	-0.002215
DayOfWeek	-0.003218
IsReturn	-0.003323
DiscountPercentage	0.737020
DiscountedPriceBoxCox	-0.253410

Note

DiscountPercentage has a 0.74 correlation with ClearanceFlag indicating that higher discount percentage likely to sell the product at clearance which is obvious. Maybe binning to find out what range of percentage is likely to sell at clearance.
RevenueAbs (-0.47) and RevenueFinal (-0.25) indicate that when revenue increases, the item is unlikely to have been sold at clearance.
A high DiscountedPrice will reduce the likelihood of the item being sold at clearance. Maybe binning the feature for better interpretability and possibly improve model performance.

Code

categorical_cols = ['StoreID', 'Brand', 'Supplier', 'Category', 'Subcategory', 'Type', 'Region','TransactionType']



### ----- Define the function to calculate the M.I on the training set ONLY
def calculate_mi(series):
    return mutual_info_score(series, df["ClearanceFlag"])

#### ---- Calculate MI between 'y' and the categorical variables of the training set ONLY 
df_mi = df[categorical_cols].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')

print('Below are the variable with highest M.I score:')
display(df_mi.head(15))

Below are the variable with highest M.I score:

	MI
Type	0.000251
Subcategory	0.000068
StoreID	0.000067
Region	0.000062
Brand	0.000043
Supplier	0.000020
Category	0.000010
TransactionType	0.000006

Note

Knowing the Type of the product provides more information on whether or not it will be sold at clearance.
TransactionType values are not very informative.
Brand, StoreID, Region, and Subcategory values also provide valuable information on the target feature status. Maybe creating interaction between these categorical features might be helpful for a better generalization.

Base model

1. Select features

Code

# Based on correlation
selected_num = ['OriginalPrice', 'QuantityAbs','RevenueAbs', 'RevenueFinal', 'DiscountPercentage','DiscountedPriceBoxCox']

# Based on MI
selected_cat = ['StoreID', 'Brand', 'Supplier', 'Category', 'Subcategory', 'Type', 'Region']

2. Split

Code

# Split
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=42)



## ---- Extract and create "ClearanceFlag" for the different splits
y_train = df_train["ClearanceFlag"].values
y_val = df_val["ClearanceFlag"].values
y_test = df_test["ClearanceFlag"].values

### ------ Delete "ClearanceFlag"from the splits
del df_train['ClearanceFlag']
del df_val['ClearanceFlag']
del df_test['ClearanceFlag']

print(f"Training set: {len(df_train)}, Validation set: {len(df_val)} , Test set: {len(df_test)}")

Training set: 60000, Validation set: 20000 , Test set: 20000

3. Train base model

Code

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

## ----- Initialize the encoder: ------
dv = DictVectorizer(sparse=False)

## ---- Apply the transformation on the training set

train_dict = df_train[selected_num + selected_cat].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

## ---- Apply the transformation on the validation set (for evaluation)
val_dict = df_val[selected_num + selected_cat].to_dict(orient='records')
X_val = dv.transform(val_dict)

#### Training on Logistic regression

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, 
class_weight='balanced',random_state = 42)
model.fit(X_train, y_train)


# Predict on validation set
y_pred = model.predict(X_val)

# Compute metrics
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred, average='weighted')  
recall = recall_score(y_val, y_pred, average='weighted')

# Print results
print("Base logistic regression")
print(f"Validation Accuracy : {accuracy:.4f}")
print("Classification report on baseline LR model:")
print(classification_report(y_val, y_pred))

Base logistic regression
Validation Accuracy : 1.0000
Classification report on baseline LR model:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     16020
           1       1.00      1.00      1.00      3980

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000

Overfitting

We noticed earlier a very high correlation between certain features to the clearance status. This results in a model learning for example that whenever DiscountPercentage is high, predict ClearanceFlag=Yes.
A feature like Revenue is a target-leaking feature as it was engineered based on the ClearanceFlag status. Solution: Drop any feature that is engineered from the target feature and observe the result.

Code

filtered_numerical_cols = ['OriginalPrice', 'QuantityAbs', 'Month', 'WeekOfYear', 'DayOfWeek']
selected_cat = ['StoreID', 'Brand', 'Supplier', 'Category', 'Subcategory', 'Type', 'Region', 'IsReturn', 'TransactionType']

# DictVectorizer encoding
dv = DictVectorizer(sparse=False)
train_dict = df_train[filtered_numerical_cols + selected_cat].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# DictVectorizer encoding (val)
val_dict = df_val[filtered_numerical_cols + selected_cat].to_dict(orient='records')
X_val = dv.transform(val_dict)



# Train logistic regression
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000,
class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Predict on validation
y_pred = model.predict(X_val)

# Evaluate performance
accuracy = accuracy_score(y_val, y_pred)


print("Logistic Regression (Leakage-Free Features)")
print(f"  Accuracy : {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_val, y_pred))

Logistic Regression (Leakage-Free Features)
  Accuracy : 0.5130

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.52      0.63     16020
           1       0.20      0.49      0.28      3980

    accuracy                           0.51     20000
   macro avg       0.50      0.50      0.46     20000
weighted avg       0.68      0.51      0.56     20000

Note

No overfitting, that is a good sign. But underfitting.
The base model never predicts class 1 (Clearance) as the accuracy is 80% which is also the percentage of observations that were sold at normal price.
Recall for class 1 = 0 indicating all actual clearance items are missed.
Precision for class 1 = 0 indicating when it predicts class 1 (which it doesn’t), it’s always wrong.
Macro avg: Treats both classes equally which exposes failure on class 1.
Weighted avg: Dominated by class 0 which looks better but hides the issue.

Possible solutions: - Introduce feature engineering and observe the results. - Try a sampling technique such oversampling or undersampling or SMOTE. - Use a tree-based models like decision tree.

IV.3. Feature Engineering

To enrich the features, we create new features to help the model learn better. The features will be created in the following groups:

Temporal features:
- IsWeekend, IsMonthEnd, IsMonthStart, IsPromotionSeason
Price features:
- DiscountCategory, DiscountedPriceCategory, PriceEfficiency
Interaction features:
- Region_Type, Region_Subcategory, Type_Subcategory, StoreID_Type, Supplier_Type, etc
- Supplier_return_rate, Brand_return_rate

After feature engineering, we will attempt to make the model more robust by trying out the following:

Hyperparameter tuning
Smote
Oversampling
Undersampling