Association Rules#
Les règles d'association#
Les règles d'association sont un algorithme d'apprentissage non supervisé. Il est utilisé pour analyser les habitudes d'achats des clients.
Itemset#
U itemset est un utiliser l'algorithme du KMeans pour former un modèle, il faut des données numériques.
Support#
Le support d'un itemset est la fréquence d'un itemset.
Méthode d'optimisation#
Condition d'arrêt#
Association Rules#
Packages#
import re
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
Paths & Config#
Get Data#
Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2624 | 2625 | CA-2018-127180 | 22/10/2018 | 24/10/2018 | First Class | TA-21385 | Tom Ashbrook | Home Office | United States | New York City | New York | 10024.0 | East | TEC-PH-10001494 | Technology | Phones | Polycom CX600 IP Phone VoIP phone | 2399.600 |
4891 | 4892 | CA-2017-135776 | 23/12/2017 | 30/12/2017 | Standard Class | EH-13765 | Edward Hooks | Corporate | United States | Seattle | Washington | 98103.0 | West | OFF-PA-10001295 | Office Supplies | Paper | Computer Printout Paper with Letter-Trim Perfo... | 37.940 |
3462 | 3463 | CA-2016-152611 | 20/02/2016 | 23/02/2016 | Second Class | KA-16525 | Kelly Andreada | Consumer | United States | Perth Amboy | New Jersey | 8861.0 | East | OFF-AR-10003903 | Office Supplies | Art | Sanford 52201 APSCO Electric Pencil Sharpener | 286.790 |
2295 | 2296 | CA-2016-113145 | 01/11/2016 | 05/11/2016 | Standard Class | VD-21670 | Valerie Dominguez | Consumer | United States | New York City | New York | 10011.0 | East | OFF-PA-10002659 | Office Supplies | Paper | Avoid Verbal Orders Carbonless Minifold Book | 13.520 |
223 | 224 | CA-2016-169397 | 24/12/2016 | 27/12/2016 | First Class | JB-15925 | Joni Blumstein | Consumer | United States | Dublin | Ohio | 43017.0 | East | TEC-MA-10001148 | Technology | Machines | Swingline SM12-08 MicroCut Jam Free Shredder | 479.988 |
3559 | 3560 | CA-2018-152737 | 07/11/2018 | 12/11/2018 | Standard Class | TS-21505 | Tony Sayre | Consumer | United States | San Francisco | California | 94122.0 | West | TEC-AC-10004975 | Technology | Accessories | Plantronics Audio 995 Wireless Stereo Headset | 439.800 |
9642 | 9643 | CA-2015-104563 | 07/03/2015 | 12/03/2015 | Standard Class | CM-12715 | Craig Molinari | Corporate | United States | Seattle | Washington | 98103.0 | West | FUR-CH-10002780 | Furniture | Chairs | Office Star - Task Chair with Contemporary Loo... | 436.704 |
3761 | 3762 | CA-2018-104577 | 12/05/2018 | 17/05/2018 | Standard Class | CK-12205 | Chloris Kastensmidt | Consumer | United States | Everett | Massachusetts | 2149.0 | East | OFF-PA-10000659 | Office Supplies | Paper | TOPS Carbonless Receipt Book, Four 2-3/4 x 7-1... | 87.600 |
3551 | 3552 | CA-2017-152555 | 29/03/2017 | 02/04/2017 | Second Class | ME-17320 | Maria Etezadi | Home Office | United States | Chicago | Illinois | 60653.0 | Central | FUR-CH-10002965 | Furniture | Chairs | Global Leather Highback Executive Chair with P... | 844.116 |
7010 | 7011 | US-2015-135881 | 23/05/2015 | 27/05/2015 | Standard Class | GT-14710 | Greg Tran | Consumer | United States | New York City | New York | 10035.0 | East | OFF-BI-10000829 | Office Supplies | Binders | Avery Non-Stick Binders | 17.960 |
Explore Data#
(9800, 18)
Row ID | Postal Code | Sales | |
---|---|---|---|
count | 9800.000000 | 9789.000000 | 9800.000000 |
mean | 4900.500000 | 55273.322403 | 230.769059 |
std | 2829.160653 | 32041.223413 | 626.651875 |
min | 1.000000 | 1040.000000 | 0.444000 |
25% | 2450.750000 | 23223.000000 | 17.248000 |
50% | 4900.500000 | 58103.000000 | 54.490000 |
75% | 7350.250000 | 90008.000000 | 210.605000 |
max | 9800.000000 | 99301.000000 | 22638.480000 |
<class 'pandas.core.frame.dataframe'="">
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9800 non-null int64
1 Order ID 9800 non-null object
2 Order Date 9800 non-null object
3 Ship Date 9800 non-null object
4 Ship Mode 9800 non-null object
5 Customer ID 9800 non-null object
6 Customer Name 9800 non-null object
7 Segment 9800 non-null object
8 Country 9800 non-null object
9 City 9800 non-null object
10 State 9800 non-null object
11 Postal Code 9789 non-null float64
12 Region 9800 non-null object
13 Product ID 9800 non-null object
14 Category 9800 non-null object
15 Sub-Category 9800 non-null object
16 Product Name 9800 non-null object
17 Sales 9800 non-null float64
dtypes: float64(2), int64(1), object(15)
memory usage: 1.3+ MB
Clean Data#
new_columns = [re.sub(' |-', '_', column).strip() for column in list(sales_data.columns)]
sales_data.columns = new_columns
sales_data.head()
Row_ID | Order_ID | Order_Date | Ship_Date | Ship_Mode | Customer_ID | Customer_Name | Segment | Country | City | State | Postal_Code | Region | Product_ID | Category | Sub_Category | Product_Name | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 |
1 | 2 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 |
2 | 3 | CA-2017-138688 | 12/06/2017 | 16/06/2017 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | California | 90036.0 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.6200 |
3 | 4 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 |
4 | 5 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.3680 |
Index(['Row_ID', 'Order_ID', 'Order_Date', 'Ship_Date', 'Ship_Mode',
'Customer_ID', 'Customer_Name', 'Segment', 'Country', 'City', 'State',
'Postal_Code', 'Region', 'Product_ID', 'Category', 'Sub_Category',
'Product_Name', 'Sales'],
dtype='object')
Transform Data#
df = sales_data[["Customer_ID", "Sub_Category"]].groupby(by="Customer_ID").agg({"Sub_Category": create_cart}).reset_index().rename(columns={"Sub_Category": "cart"})
# df = sales_data[["Customer_ID", "Product_Name"]].groupby(by="Customer_ID").agg({"Product_Name": create_cart}).reset_index().rename(columns={"Product_Name": "cart"})
df.head()
Customer_ID | cart | |
---|---|---|
0 | AA-10315 | [Appliances, Binders, Storage, Binders, Applia... |
1 | AA-10375 | [Storage, Furnishings, Accessories, Binders, A... |
2 | AA-10480 | [Paper, Furnishings, Paper, Storage, Paper, Pa... |
3 | AA-10645 | [Chairs, Phones, Chairs, Furnishings, Envelope... |
4 | AB-10015 | [Chairs, Art, Storage, Storage, Phones, Bookca... |
Customer_ID | cart | cart_size | |
---|---|---|---|
378 | JK-15625 | [Chairs, Chairs, Bookcases, Art, Paper, Chairs... | 12 |
363 | JF-15565 | [Fasteners, Paper, Appliances, Paper, Binders,... | 16 |
422 | KD-16270 | [Paper, Tables, Paper, Storage, Machines, Art,... | 16 |
505 | MG-17875 | [Appliances, Storage, Tables, Paper, Furnishin... | 7 |
306 | GM-14680 | [Furnishings, Art, Fasteners, Phones, Tables, ... | 11 |
415 | KB-16315 | [Paper, Furnishings, Binders, Art, Labels, Env... | 22 |
594 | PJ-18835 | [Paper, Accessories, Accessories, Paper, Envel... | 13 |
542 | MT-17815 | [Envelopes, Binders, Storage, Paper, Tables, P... | 10 |
769 | TT-21070 | [Paper, Furnishings, Paper, Fasteners, Machine... | 14 |
435 | KM-16225 | [Furnishings, Furnishings, Phones, Storage, Fa... | 19 |
696 | SJ-20500 | [Appliances, Accessories, Binders, Chairs, Bin... | 7 |
317 | HA-14920 | [Storage, Accessories, Furnishings, Binders, A... | 18 |
384 | JL-15175 | [Paper, Furnishings, Chairs, Appliances, Binde... | 7 |
251 | EB-13975 | [Binders, Binders, Copiers, Supplies, Binders,... | 6 |
493 | MC-18100 | [Phones, Chairs, Paper, Furnishings, Storage, ... | 19 |
te = TransactionEncoder()
te_ary = te.fit(X).transform(X)
df_x = pd.DataFrame(te_ary, columns=te.columns_)
df_x.head()
Accessories | Appliances | Art | Binders | Bookcases | Chairs | Copiers | Envelopes | Fasteners | Furnishings | Labels | Machines | Paper | Phones | Storage | Supplies | Tables | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | False | True | False | False | False | False | True | True | False | False | True | True | True | True | False |
1 | True | False | True | True | False | False | False | False | False | True | False | False | True | True | True | False | False |
2 | True | False | True | False | False | False | False | False | False | True | False | False | True | True | True | False | True |
3 | False | False | True | True | True | True | False | True | False | True | False | False | True | True | True | False | False |
4 | False | False | True | False | True | True | False | False | False | False | False | False | False | True | True | False | False |
frequent_itemsets = fpgrowth(df_x, min_support=SUPPORT, use_colnames=True)
#### alternatively:
#frequent_itemsets = apriori(df, min_support=SUPPORT, use_colnames=True)
# frequent_itemsets = fpmax(df, min_support=SUPPORT, use_colnames=True)
frequent_itemsets.sort_values(by='support', ascending=False).head(10)
support | itemsets | |
---|---|---|
0 | 0.815889 | (Binders) |
1 | 0.757881 | (Paper) |
2 | 0.658260 | (Furnishings) |
3 | 0.641866 | (Storage) |
4 | 0.641866 | (Phones) |
14 | 0.640605 | (Paper, Binders) |
8 | 0.617907 | (Art) |
5 | 0.590164 | (Accessories) |
15 | 0.544767 | (Binders, Furnishings) |
18 | 0.537201 | (Binders, Storage) |
from mlxtend.frequent_patterns import association_rules
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7).head(10)
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | |
---|---|---|---|---|---|---|---|---|---|---|
0 | (Paper) | (Binders) | 0.757881 | 0.815889 | 0.640605 | 0.845258 | 1.035996 | 0.022258 | 1.189792 | 0.143506 |
1 | (Binders) | (Paper) | 0.815889 | 0.757881 | 0.640605 | 0.785162 | 1.035996 | 0.022258 | 1.126983 | 0.188720 |
2 | (Furnishings) | (Binders) | 0.658260 | 0.815889 | 0.544767 | 0.827586 | 1.014337 | 0.007700 | 1.067844 | 0.041359 |
3 | (Furnishings) | (Paper) | 0.658260 | 0.757881 | 0.519546 | 0.789272 | 1.041419 | 0.020663 | 1.148963 | 0.116379 |
4 | (Paper, Furnishings) | (Binders) | 0.519546 | 0.815889 | 0.438840 | 0.844660 | 1.035264 | 0.014948 | 1.185214 | 0.070896 |
5 | (Binders, Furnishings) | (Paper) | 0.544767 | 0.757881 | 0.438840 | 0.805556 | 1.062904 | 0.025971 | 1.245181 | 0.130003 |
6 | (Storage) | (Binders) | 0.641866 | 0.815889 | 0.537201 | 0.836935 | 1.025795 | 0.013509 | 1.129066 | 0.070216 |
7 | (Storage) | (Paper) | 0.641866 | 0.757881 | 0.527112 | 0.821218 | 1.083571 | 0.040654 | 1.354267 | 0.215353 |
8 | (Paper, Binders) | (Storage) | 0.640605 | 0.641866 | 0.451450 | 0.704724 | 1.097930 | 0.040267 | 1.212879 | 0.248182 |
9 | (Paper, Storage) | (Binders) | 0.527112 | 0.815889 | 0.451450 | 0.856459 | 1.049725 | 0.021385 | 1.282640 | 0.100171 |