Association Rules

Association rule mining is an unsupervised learning technique. It is commonly used to analyse customers' purchasing habits (market basket analysis).

Itemset

An itemset is a set of one or more items that occur together in a transaction, for example {Paper, Binders}.

Support

The support of an itemset is its relative frequency: the fraction of transactions that contain every item in the itemset.
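
As a minimal sketch of this definition, the snippet below computes support on a handful of toy transactions. The data and the helper name `support` are illustrative assumptions and are not part of the dataset used later.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"Paper", "Binders", "Storage"},
    {"Paper", "Binders"},
    {"Binders", "Phones"},
    {"Paper", "Art"},
]


def support(itemset, transactions):
    """Return the fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)


print(support({"Paper"}, transactions))             # 3 of 4 transactions -> 0.75
print(support({"Paper", "Binders"}, transactions))  # 2 of 4 transactions -> 0.5
```

An itemset is "frequent" when its support reaches a chosen threshold, which is the role played by `min_support` in the notebook code.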

Optimization method

Stopping condition

Association Rules

Packages

# !pip install -q pandas mlxtend
import re

import pandas as pd

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth

Paths & Config

DATA_PATH = "/kaggle/input/sales-forecasting/train.csv"

SUPPORT = 0.2

Get Data

sales_data = pd.read_csv(DATA_PATH)
sales_data.sample(10)
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City State Postal Code Region Product ID Category Sub-Category Product Name Sales
2624 2625 CA-2018-127180 22/10/2018 24/10/2018 First Class TA-21385 Tom Ashbrook Home Office United States New York City New York 10024.0 East TEC-PH-10001494 Technology Phones Polycom CX600 IP Phone VoIP phone 2399.600
4891 4892 CA-2017-135776 23/12/2017 30/12/2017 Standard Class EH-13765 Edward Hooks Corporate United States Seattle Washington 98103.0 West OFF-PA-10001295 Office Supplies Paper Computer Printout Paper with Letter-Trim Perfo... 37.940
3462 3463 CA-2016-152611 20/02/2016 23/02/2016 Second Class KA-16525 Kelly Andreada Consumer United States Perth Amboy New Jersey 8861.0 East OFF-AR-10003903 Office Supplies Art Sanford 52201 APSCO Electric Pencil Sharpener 286.790
2295 2296 CA-2016-113145 01/11/2016 05/11/2016 Standard Class VD-21670 Valerie Dominguez Consumer United States New York City New York 10011.0 East OFF-PA-10002659 Office Supplies Paper Avoid Verbal Orders Carbonless Minifold Book 13.520
223 224 CA-2016-169397 24/12/2016 27/12/2016 First Class JB-15925 Joni Blumstein Consumer United States Dublin Ohio 43017.0 East TEC-MA-10001148 Technology Machines Swingline SM12-08 MicroCut Jam Free Shredder 479.988
3559 3560 CA-2018-152737 07/11/2018 12/11/2018 Standard Class TS-21505 Tony Sayre Consumer United States San Francisco California 94122.0 West TEC-AC-10004975 Technology Accessories Plantronics Audio 995 Wireless Stereo Headset 439.800
9642 9643 CA-2015-104563 07/03/2015 12/03/2015 Standard Class CM-12715 Craig Molinari Corporate United States Seattle Washington 98103.0 West FUR-CH-10002780 Furniture Chairs Office Star - Task Chair with Contemporary Loo... 436.704
3761 3762 CA-2018-104577 12/05/2018 17/05/2018 Standard Class CK-12205 Chloris Kastensmidt Consumer United States Everett Massachusetts 2149.0 East OFF-PA-10000659 Office Supplies Paper TOPS Carbonless Receipt Book, Four 2-3/4 x 7-1... 87.600
3551 3552 CA-2017-152555 29/03/2017 02/04/2017 Second Class ME-17320 Maria Etezadi Home Office United States Chicago Illinois 60653.0 Central FUR-CH-10002965 Furniture Chairs Global Leather Highback Executive Chair with P... 844.116
7010 7011 US-2015-135881 23/05/2015 27/05/2015 Standard Class GT-14710 Greg Tran Consumer United States New York City New York 10035.0 East OFF-BI-10000829 Office Supplies Binders Avery Non-Stick Binders 17.960

Explore Data

sales_data.shape
(9800, 18)
sales_data.describe()
            Row ID   Postal Code         Sales
count  9800.000000   9789.000000   9800.000000
mean   4900.500000  55273.322403    230.769059
std    2829.160653  32041.223413    626.651875
min       1.000000   1040.000000      0.444000
25%    2450.750000  23223.000000     17.248000
50%    4900.500000  58103.000000     54.490000
75%    7350.250000  90008.000000    210.605000
max    9800.000000  99301.000000  22638.480000
sales_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float64(2), int64(1), object(15)
memory usage: 1.3+ MB

Clean Data

# Replace spaces and hyphens in column names with underscores (e.g. "Sub-Category" -> "Sub_Category").
new_columns = [re.sub(r"[ -]", "_", column.strip()) for column in sales_data.columns]
sales_data.columns = new_columns
sales_data.head()
Row_ID Order_ID Order_Date Ship_Date Ship_Mode Customer_ID Customer_Name Segment Country City State Postal_Code Region Product_ID Category Sub_Category Product_Name Sales
0 1 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420.0 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 261.9600
1 2 CA-2017-152156 08/11/2017 11/11/2017 Second Class CG-12520 Claire Gute Consumer United States Henderson Kentucky 42420.0 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.9400
2 3 CA-2017-138688 12/06/2017 16/06/2017 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles California 90036.0 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.6200
3 4 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311.0 South FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 957.5775
4 5 US-2016-108966 11/10/2016 18/10/2016 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale Florida 33311.0 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22.3680
sales_data.columns
Index(['Row_ID', 'Order_ID', 'Order_Date', 'Ship_Date', 'Ship_Mode',
       'Customer_ID', 'Customer_Name', 'Segment', 'Country', 'City', 'State',
       'Postal_Code', 'Region', 'Product_ID', 'Category', 'Sub_Category',
       'Product_Name', 'Sales'],
      dtype='object')

Transform Data

# Build one cart per customer: the list of sub-categories that customer bought.
df = sales_data.groupby("Customer_ID")["Sub_Category"].agg(list).reset_index().rename(columns={"Sub_Category": "cart"})
# Alternative: build carts from product names instead of sub-categories.
# df = sales_data.groupby("Customer_ID")["Product_Name"].agg(list).reset_index().rename(columns={"Product_Name": "cart"})
df.head()
  Customer_ID                                               cart
0    AA-10315  [Appliances, Binders, Storage, Binders, Applia...
1    AA-10375  [Storage, Furnishings, Accessories, Binders, A...
2    AA-10480  [Paper, Furnishings, Paper, Storage, Paper, Pa...
3    AA-10645  [Chairs, Phones, Chairs, Furnishings, Envelope...
4    AB-10015  [Chairs, Art, Storage, Storage, Phones, Bookca...
# Number of purchased items per customer (repeated sub-categories are counted each time).
df['cart_size'] = df['cart'].apply(len)
df.sample(15)
    Customer_ID                                               cart  cart_size
378    JK-15625  [Chairs, Chairs, Bookcases, Art, Paper, Chairs...         12
363    JF-15565  [Fasteners, Paper, Appliances, Paper, Binders,...         16
422    KD-16270  [Paper, Tables, Paper, Storage, Machines, Art,...         16
505    MG-17875  [Appliances, Storage, Tables, Paper, Furnishin...          7
306    GM-14680  [Furnishings, Art, Fasteners, Phones, Tables, ...         11
415    KB-16315  [Paper, Furnishings, Binders, Art, Labels, Env...         22
594    PJ-18835  [Paper, Accessories, Accessories, Paper, Envel...         13
542    MT-17815  [Envelopes, Binders, Storage, Paper, Tables, P...         10
769    TT-21070  [Paper, Furnishings, Paper, Fasteners, Machine...         14
435    KM-16225  [Furnishings, Furnishings, Phones, Storage, Fa...         19
696    SJ-20500  [Appliances, Accessories, Binders, Chairs, Bin...          7
317    HA-14920  [Storage, Accessories, Furnishings, Binders, A...         18
384    JL-15175  [Paper, Furnishings, Chairs, Appliances, Binde...          7
251    EB-13975  [Binders, Binders, Copiers, Supplies, Binders,...          6
493    MC-18100  [Phones, Chairs, Paper, Furnishings, Storage, ...         19
X = df['cart'].values
te = TransactionEncoder()
te_ary = te.fit(X).transform(X)
df_x = pd.DataFrame(te_ary, columns=te.columns_)
df_x.head()
   Accessories  Appliances    Art  Binders  Bookcases  Chairs  Copiers  Envelopes  Fasteners  Furnishings  Labels  Machines  Paper  Phones  Storage  Supplies  Tables
0         True        True  False     True      False   False    False      False       True         True   False     False   True    True     True      True   False
1         True       False   True     True      False   False    False      False      False         True   False     False   True    True     True     False   False
2         True       False   True    False      False   False    False      False      False         True   False     False   True    True     True     False    True
3        False       False   True     True       True    True    False       True      False         True   False     False   True    True     True     False   False
4        False       False   True    False       True    True    False      False      False        False   False     False  False    True     True     False   False
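
To make the encoding step concrete, here is a plain-Python sketch of the transformation: one boolean column per distinct item, one row per cart. It mimics the shape of the result only; it is not mlxtend's implementation, and the toy carts are hypothetical.

```python
# Toy carts (hypothetical); note the repeated "Binders" in the first cart.
carts = [
    ["Binders", "Paper", "Binders"],
    ["Paper", "Storage"],
    ["Art"],
]

# One column per distinct item, sorted alphabetically.
columns = sorted({item for cart in carts for item in cart})

# One row per cart: True where the item occurs at least once.
rows = [[item in set(cart) for item in columns] for cart in carts]

print(columns)
for row in rows:
    print(row)
```

Because each cell only records presence, buying the same sub-category several times does not change a customer's row.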
frequent_itemsets = fpgrowth(df_x, min_support=SUPPORT, use_colnames=True)
# Alternatively, on the same one-hot encoded frame (fpmax returns only maximal itemsets):
# frequent_itemsets = apriori(df_x, min_support=SUPPORT, use_colnames=True)
# frequent_itemsets = fpmax(df_x, min_support=SUPPORT, use_colnames=True)

frequent_itemsets.sort_values(by='support', ascending=False).head(10)
     support                itemsets
0   0.815889               (Binders)
1   0.757881                 (Paper)
2   0.658260           (Furnishings)
3   0.641866               (Storage)
4   0.641866                (Phones)
14  0.640605        (Paper, Binders)
8   0.617907                   (Art)
5   0.590164           (Accessories)
15  0.544767  (Binders, Furnishings)
18  0.537201      (Binders, Storage)
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7).head(10)
              antecedents  consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction  zhangs_metric
0                 (Paper)    (Binders)            0.757881            0.815889  0.640605    0.845258  1.035996  0.022258    1.189792       0.143506
1               (Binders)      (Paper)            0.815889            0.757881  0.640605    0.785162  1.035996  0.022258    1.126983       0.188720
2           (Furnishings)    (Binders)            0.658260            0.815889  0.544767    0.827586  1.014337  0.007700    1.067844       0.041359
3           (Furnishings)      (Paper)            0.658260            0.757881  0.519546    0.789272  1.041419  0.020663    1.148963       0.116379
4    (Paper, Furnishings)    (Binders)            0.519546            0.815889  0.438840    0.844660  1.035264  0.014948    1.185214       0.070896
5  (Binders, Furnishings)      (Paper)            0.544767            0.757881  0.438840    0.805556  1.062904  0.025971    1.245181       0.130003
6               (Storage)    (Binders)            0.641866            0.815889  0.537201    0.836935  1.025795  0.013509    1.129066       0.070216
7               (Storage)      (Paper)            0.641866            0.757881  0.527112    0.821218  1.083571  0.040654    1.354267       0.215353
8        (Paper, Binders)    (Storage)            0.640605            0.641866  0.451450    0.704724  1.097930  0.040267    1.212879       0.248182
9        (Paper, Storage)    (Binders)            0.527112            0.815889  0.451450    0.856459  1.049725  0.021385    1.282640       0.100171
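
The metric columns can be derived from three supports alone. The sketch below applies the standard definitions of confidence, lift, leverage and conviction to the rule (Paper) -> (Binders), using the supports printed in the tables above. The helper name `rule_metrics` is an assumption made for illustration; it is not an mlxtend function.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Metrics for a rule A -> C from the supports of A, of C, and of A and C together."""
    confidence = support_ac / support_a            # estimate of P(C | A)
    lift = confidence / support_c                  # 1.0 means A and C look independent
    leverage = support_ac - support_a * support_c  # observed minus expected co-occurrence
    conviction = (1 - support_c) / (1 - confidence)  # assumes confidence < 1
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports of (Paper), (Binders) and (Paper, Binders) taken from the output above.
metrics = rule_metrics(0.757881, 0.815889, 0.640605)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Up to rounding of the inputs, this reproduces row 0 of the rules table. Every lift in that table is close to 1, which indicates that these sub-categories appear together only slightly more often than they would if purchases were independent.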

Packages

import mlxtend
print(mlxtend.__version__)
0.22.0

import pandas as pd

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

Get data

DATASET_PATH = "Data/Sample - Superstore.xls"
data = pd.read_excel(DATASET_PATH, sheet_name="Orders")
print(data.head().T)
                                               0  ...                               4
Row ID                                         1  ...                               5
Order ID                          CA-2016-152156  ...                  US-2015-108966
Order Date                   2016-11-08 00:00:00  ...             2015-10-11 00:00:00
Ship Date                    2016-11-11 00:00:00  ...             2015-10-18 00:00:00
Ship Mode                           Second Class  ...                  Standard Class
Customer ID                             CG-12520  ...                        SO-20335
Customer Name                        Claire Gute  ...                  Sean O'Donnell
Segment                                 Consumer  ...                        Consumer
Country                            United States  ...                   United States
City                                   Henderson  ...                 Fort Lauderdale
State                                   Kentucky  ...                         Florida
Postal Code                                42420  ...                           33311
Region                                     South  ...                           South
Product ID                       FUR-BO-10001798  ...                 OFF-ST-10000760
Category                               Furniture  ...                 Office Supplies
Sub-Category                           Bookcases  ...                         Storage
Product Name   Bush Somerset Collection Bookcase  ...  Eldon Fold 'N Roll Cart System
Sales                                     261.96  ...                          22.368
Quantity                                       2  ...                               2
Discount                                     0.0  ...                             0.2
Profit                                   41.9136  ...                          2.5164

[21 rows x 5 columns]

print(data.shape)
(9994, 21)

Data Exploration

print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Row ID         9994 non-null   int64         
 1   Order ID       9994 non-null   object        
 2   Order Date     9994 non-null   datetime64[ns]
 3   Ship Date      9994 non-null   datetime64[ns]
 4   Ship Mode      9994 non-null   object        
 5   Customer ID    9994 non-null   object        
 6   Customer Name  9994 non-null   object        
 7   Segment        9994 non-null   object        
 8   Country        9994 non-null   object        
 9   City           9994 non-null   object        
 10  State          9994 non-null   object        
 11  Postal Code    9994 non-null   int64         
 12  Region         9994 non-null   object        
 13  Product ID     9994 non-null   object        
 14  Category       9994 non-null   object        
 15  Sub-Category   9994 non-null   object        
 16  Product Name   9994 non-null   object        
 17  Sales          9994 non-null   float64       
 18  Quantity       9994 non-null   int64         
 19  Discount       9994 non-null   float64       
 20  Profit         9994 non-null   float64       
dtypes: datetime64[ns](2), float64(3), int64(3), object(13)
memory usage: 1.6+ MB
None

data = data[["Customer ID", "Sub-Category"]]
print(data.sample(10))
     Customer ID Sub-Category
6496    KH-16510  Accessories
484     MT-18070       Labels
855     BK-11260        Paper
227     DS-13180      Storage
6546    DP-13000        Paper
822     TR-21325  Accessories
6610    AG-10495      Binders
2434    PJ-18835    Envelopes
6121    JR-16210    Bookcases
5197    AI-10855       Phones

print(data.describe())
       Customer ID Sub-Category
count         9994         9994
unique         793           17
top       WB-21850      Binders
freq            37         1523

Preprocessing

# Collect each customer's sub-categories into a single list (the customer's cart).
carts = data.groupby("Customer ID")["Sub-Category"].agg(list).reset_index().rename(columns={"Sub-Category": "Chart"})
print(carts.head(10))
  Customer ID                                              Chart
0    AA-10315  [Appliances, Binders, Storage, Binders, Applia...
1    AA-10375  [Storage, Furnishings, Accessories, Binders, A...
2    AA-10480  [Paper, Furnishings, Paper, Storage, Paper, Pa...
3    AA-10645  [Chairs, Phones, Chairs, Furnishings, Envelope...
4    AB-10015  [Chairs, Art, Storage, Storage, Phones, Bookca...
5    AB-10060  [Accessories, Paper, Binders, Paper, Binders, ...
6    AB-10105  [Tables, Furnishings, Binders, Phones, Labels,...
7    AB-10150  [Accessories, Paper, Supplies, Art, Furnishing...
8    AB-10165  [Accessories, Paper, Art, Art, Paper, Binders,...
9    AB-10255  [Phones, Storage, Accessories, Supplies, Paper...

encoder = TransactionEncoder()
encoder.fit(carts["Chart"])
chart_array = encoder.transform(carts["Chart"])
df_chart = pd.DataFrame(chart_array, columns=encoder.columns_, dtype=int)
print(df_chart.head(10))
   Accessories  Appliances  Art  Binders  Bookcases  Chairs  ...  Machines  Paper  Phones  Storage  Supplies  Tables
0            1           1    0        1          0       0  ...         0      1       1        1         1       0
1            1           0    1        1          0       0  ...         0      1       1        1         0       0
2            1           0    1        0          0       0  ...         0      1       1        1         0       1
3            0           0    1        1          1       1  ...         0      1       1        1         0       0
4            0           0    1        0          1       1  ...         0      0       1        1         0       0
5            1           1    0        1          0       1  ...         0      1       0        0         1       1
6            1           0    1        1          0       0  ...         1      0       1        1         0       1
7            1           0    1        1          0       0  ...         0      1       0        0         1       0
8            1           0    1        1          0       1  ...         0      1       0        1         0       0
9            1           0    1        1          0       0  ...         0      1       1        1         1       0

[10 rows x 17 columns]

Itemsets

frequent_itemsets = apriori(df_chart, min_support=0.45, use_colnames=True)
print(frequent_itemsets)
     support                       itemsets
0   0.597730                  (Accessories)
1   0.622951                          (Art)
2   0.819672                      (Binders)
3   0.513241                       (Chairs)
4   0.665826                  (Furnishings)
5   0.770492                        (Paper)
6   0.644388                       (Phones)
7   0.648172                      (Storage)
8   0.508197         (Accessories, Binders)
9   0.491803           (Paper, Accessories)
10  0.523329                 (Art, Binders)
11  0.493064                   (Paper, Art)
12  0.553594         (Furnishings, Binders)
13  0.650694               (Paper, Binders)
14  0.532156              (Binders, Phones)
15  0.546028             (Storage, Binders)
16  0.535939           (Paper, Furnishings)
17  0.450189          (Furnishings, Phones)
18  0.520807                (Paper, Phones)
19  0.535939               (Paper, Storage)
20  0.451450  (Paper, Furnishings, Binders)
21  0.460277      (Paper, Storage, Binders)
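
The `apriori` call is used here as a black box. The following simplified sketch shows the level-wise idea behind Apriori: an itemset can only be frequent if all of its subsets are frequent, so candidates of size k+1 are built only from frequent itemsets of size k. This is not mlxtend's implementation; the function name `find_frequent_itemsets` and the toy transactions are hypothetical.

```python
from itertools import combinations


def find_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    result = {}
    while current:  # stop once a level yields no frequent itemset
        for itemset in current:
            result[itemset] = support(itemset)
        size = len(next(iter(current))) + 1
        # Join step: unions of two frequent k-itemsets that have exactly k+1 items.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every k-subset must be frequent, then check the support itself.
        current = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
            and support(c) >= min_support
        }
    return result


# Toy transactions (hypothetical data).
toy = [
    {"Paper", "Binders", "Storage"},
    {"Paper", "Binders"},
    {"Binders", "Phones"},
    {"Paper", "Art"},
]
for itemset, value in find_frequent_itemsets(toy, min_support=0.5).items():
    print(sorted(itemset), value)
```

Raising `min_support` shrinks the first level, which in turn limits how many larger candidates are ever generated.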

Association Rules

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
rules = rules[["antecedents", "consequents", "support", "confidence", "lift", "leverage"]]
# rules
print(rules)
               antecedents       consequents   support  confidence      lift  leverage
0            (Accessories)         (Binders)  0.508197    0.850211  1.037257  0.018254
1            (Accessories)           (Paper)  0.491803    0.822785  1.067870  0.031257
2                    (Art)         (Binders)  0.523329    0.840081  1.024899  0.012714
3                    (Art)           (Paper)  0.493064    0.791498  1.027263  0.013086
4            (Furnishings)         (Binders)  0.553594    0.831439  1.014356  0.007835
5                  (Paper)         (Binders)  0.650694    0.844517  1.030311  0.019143
6                (Binders)           (Paper)  0.650694    0.793846  1.030311  0.019143
7                 (Phones)         (Binders)  0.532156    0.825832  1.007515  0.003969
8                (Storage)         (Binders)  0.546028    0.842412  1.027743  0.014740
9            (Furnishings)           (Paper)  0.535939    0.804924  1.044689  0.022926
10                (Phones)           (Paper)  0.520807    0.808219  1.048965  0.024311
11               (Storage)           (Paper)  0.535939    0.826848  1.073143  0.036529
12    (Paper, Furnishings)         (Binders)  0.451450    0.842353  1.027671  0.012156
13  (Furnishings, Binders)           (Paper)  0.451450    0.815490  1.058402  0.024911
14        (Paper, Storage)         (Binders)  0.460277    0.858824  1.047765  0.020983
15        (Paper, Binders)         (Storage)  0.460277    0.707364  1.091323  0.038516
16      (Storage, Binders)           (Paper)  0.460277    0.842956  1.094049  0.039568
17               (Storage)  (Paper, Binders)  0.460277    0.710117  1.091323  0.038516