Association Rules
Les règles d'association
Les règles d'association sont un algorithme d'apprentissage non supervisé. Il est utilisé pour analyser les habitudes d'achats des clients.
Itemset
U itemset est un utiliser l'algorithme du KMeans pour former un modèle, il faut des données numériques.
Support
Le support d'un itemset est la fréquence d'un itemset.
Méthode d'optimisation
Condition d'arrêt
Association Rules
Packages
import re
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
Paths & Config
Get Data
Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2624 | 2625 | CA-2018-127180 | 22/10/2018 | 24/10/2018 | First Class | TA-21385 | Tom Ashbrook | Home Office | United States | New York City | New York | 10024.0 | East | TEC-PH-10001494 | Technology | Phones | Polycom CX600 IP Phone VoIP phone | 2399.600 |
4891 | 4892 | CA-2017-135776 | 23/12/2017 | 30/12/2017 | Standard Class | EH-13765 | Edward Hooks | Corporate | United States | Seattle | Washington | 98103.0 | West | OFF-PA-10001295 | Office Supplies | Paper | Computer Printout Paper with Letter-Trim Perfo... | 37.940 |
3462 | 3463 | CA-2016-152611 | 20/02/2016 | 23/02/2016 | Second Class | KA-16525 | Kelly Andreada | Consumer | United States | Perth Amboy | New Jersey | 8861.0 | East | OFF-AR-10003903 | Office Supplies | Art | Sanford 52201 APSCO Electric Pencil Sharpener | 286.790 |
2295 | 2296 | CA-2016-113145 | 01/11/2016 | 05/11/2016 | Standard Class | VD-21670 | Valerie Dominguez | Consumer | United States | New York City | New York | 10011.0 | East | OFF-PA-10002659 | Office Supplies | Paper | Avoid Verbal Orders Carbonless Minifold Book | 13.520 |
223 | 224 | CA-2016-169397 | 24/12/2016 | 27/12/2016 | First Class | JB-15925 | Joni Blumstein | Consumer | United States | Dublin | Ohio | 43017.0 | East | TEC-MA-10001148 | Technology | Machines | Swingline SM12-08 MicroCut Jam Free Shredder | 479.988 |
3559 | 3560 | CA-2018-152737 | 07/11/2018 | 12/11/2018 | Standard Class | TS-21505 | Tony Sayre | Consumer | United States | San Francisco | California | 94122.0 | West | TEC-AC-10004975 | Technology | Accessories | Plantronics Audio 995 Wireless Stereo Headset | 439.800 |
9642 | 9643 | CA-2015-104563 | 07/03/2015 | 12/03/2015 | Standard Class | CM-12715 | Craig Molinari | Corporate | United States | Seattle | Washington | 98103.0 | West | FUR-CH-10002780 | Furniture | Chairs | Office Star - Task Chair with Contemporary Loo... | 436.704 |
3761 | 3762 | CA-2018-104577 | 12/05/2018 | 17/05/2018 | Standard Class | CK-12205 | Chloris Kastensmidt | Consumer | United States | Everett | Massachusetts | 2149.0 | East | OFF-PA-10000659 | Office Supplies | Paper | TOPS Carbonless Receipt Book, Four 2-3/4 x 7-1... | 87.600 |
3551 | 3552 | CA-2017-152555 | 29/03/2017 | 02/04/2017 | Second Class | ME-17320 | Maria Etezadi | Home Office | United States | Chicago | Illinois | 60653.0 | Central | FUR-CH-10002965 | Furniture | Chairs | Global Leather Highback Executive Chair with P... | 844.116 |
7010 | 7011 | US-2015-135881 | 23/05/2015 | 27/05/2015 | Standard Class | GT-14710 | Greg Tran | Consumer | United States | New York City | New York | 10035.0 | East | OFF-BI-10000829 | Office Supplies | Binders | Avery Non-Stick Binders | 17.960 |
Explore Data
(9800, 18)
Row ID | Postal Code | Sales | |
---|---|---|---|
count | 9800.000000 | 9789.000000 | 9800.000000 |
mean | 4900.500000 | 55273.322403 | 230.769059 |
std | 2829.160653 | 32041.223413 | 626.651875 |
min | 1.000000 | 1040.000000 | 0.444000 |
25% | 2450.750000 | 23223.000000 | 17.248000 |
50% | 4900.500000 | 58103.000000 | 54.490000 |
75% | 7350.250000 | 90008.000000 | 210.605000 |
max | 9800.000000 | 99301.000000 | 22638.480000 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9800 non-null int64
1 Order ID 9800 non-null object
2 Order Date 9800 non-null object
3 Ship Date 9800 non-null object
4 Ship Mode 9800 non-null object
5 Customer ID 9800 non-null object
6 Customer Name 9800 non-null object
7 Segment 9800 non-null object
8 Country 9800 non-null object
9 City 9800 non-null object
10 State 9800 non-null object
11 Postal Code 9789 non-null float64
12 Region 9800 non-null object
13 Product ID 9800 non-null object
14 Category 9800 non-null object
15 Sub-Category 9800 non-null object
16 Product Name 9800 non-null object
17 Sales 9800 non-null float64
dtypes: float64(2), int64(1), object(15)
memory usage: 1.3+ MB
Clean Data
new_columns = [re.sub(' |-', '_', column).strip() for column in list(sales_data.columns)]
sales_data.columns = new_columns
sales_data.head()
Row_ID | Order_ID | Order_Date | Ship_Date | Ship_Mode | Customer_ID | Customer_Name | Segment | Country | City | State | Postal_Code | Region | Product_ID | Category | Sub_Category | Product_Name | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 |
1 | 2 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 |
2 | 3 | CA-2017-138688 | 12/06/2017 | 16/06/2017 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | California | 90036.0 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.6200 |
3 | 4 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 |
4 | 5 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.3680 |
Index(['Row_ID', 'Order_ID', 'Order_Date', 'Ship_Date', 'Ship_Mode',
'Customer_ID', 'Customer_Name', 'Segment', 'Country', 'City', 'State',
'Postal_Code', 'Region', 'Product_ID', 'Category', 'Sub_Category',
'Product_Name', 'Sales'],
dtype='object')
Transform Data
df = sales_data[["Customer_ID", "Sub_Category"]].groupby(by="Customer_ID").agg({"Sub_Category": create_cart}).reset_index().rename(columns={"Sub_Category": "cart"})
# df = sales_data[["Customer_ID", "Product_Name"]].groupby(by="Customer_ID").agg({"Product_Name": create_cart}).reset_index().rename(columns={"Product_Name": "cart"})
df.head()
Customer_ID | cart | |
---|---|---|
0 | AA-10315 | [Appliances, Binders, Storage, Binders, Applia... |
1 | AA-10375 | [Storage, Furnishings, Accessories, Binders, A... |
2 | AA-10480 | [Paper, Furnishings, Paper, Storage, Paper, Pa... |
3 | AA-10645 | [Chairs, Phones, Chairs, Furnishings, Envelope... |
4 | AB-10015 | [Chairs, Art, Storage, Storage, Phones, Bookca... |
Customer_ID | cart | cart_size | |
---|---|---|---|
378 | JK-15625 | [Chairs, Chairs, Bookcases, Art, Paper, Chairs... | 12 |
363 | JF-15565 | [Fasteners, Paper, Appliances, Paper, Binders,... | 16 |
422 | KD-16270 | [Paper, Tables, Paper, Storage, Machines, Art,... | 16 |
505 | MG-17875 | [Appliances, Storage, Tables, Paper, Furnishin... | 7 |
306 | GM-14680 | [Furnishings, Art, Fasteners, Phones, Tables, ... | 11 |
415 | KB-16315 | [Paper, Furnishings, Binders, Art, Labels, Env... | 22 |
594 | PJ-18835 | [Paper, Accessories, Accessories, Paper, Envel... | 13 |
542 | MT-17815 | [Envelopes, Binders, Storage, Paper, Tables, P... | 10 |
769 | TT-21070 | [Paper, Furnishings, Paper, Fasteners, Machine... | 14 |
435 | KM-16225 | [Furnishings, Furnishings, Phones, Storage, Fa... | 19 |
696 | SJ-20500 | [Appliances, Accessories, Binders, Chairs, Bin... | 7 |
317 | HA-14920 | [Storage, Accessories, Furnishings, Binders, A... | 18 |
384 | JL-15175 | [Paper, Furnishings, Chairs, Appliances, Binde... | 7 |
251 | EB-13975 | [Binders, Binders, Copiers, Supplies, Binders,... | 6 |
493 | MC-18100 | [Phones, Chairs, Paper, Furnishings, Storage, ... | 19 |
te = TransactionEncoder()
te_ary = te.fit(X).transform(X)
df_x = pd.DataFrame(te_ary, columns=te.columns_)
df_x.head()
Accessories | Appliances | Art | Binders | Bookcases | Chairs | Copiers | Envelopes | Fasteners | Furnishings | Labels | Machines | Paper | Phones | Storage | Supplies | Tables | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | False | True | False | False | False | False | True | True | False | False | True | True | True | True | False |
1 | True | False | True | True | False | False | False | False | False | True | False | False | True | True | True | False | False |
2 | True | False | True | False | False | False | False | False | False | True | False | False | True | True | True | False | True |
3 | False | False | True | True | True | True | False | True | False | True | False | False | True | True | True | False | False |
4 | False | False | True | False | True | True | False | False | False | False | False | False | False | True | True | False | False |
frequent_itemsets = fpgrowth(df_x, min_support=SUPPORT, use_colnames=True)
#### alternatively:
#frequent_itemsets = apriori(df, min_support=SUPPORT, use_colnames=True)
# frequent_itemsets = fpmax(df, min_support=SUPPORT, use_colnames=True)
frequent_itemsets.sort_values(by='support', ascending=False).head(10)
support | itemsets | |
---|---|---|
0 | 0.815889 | (Binders) |
1 | 0.757881 | (Paper) |
2 | 0.658260 | (Furnishings) |
3 | 0.641866 | (Storage) |
4 | 0.641866 | (Phones) |
14 | 0.640605 | (Paper, Binders) |
8 | 0.617907 | (Art) |
5 | 0.590164 | (Accessories) |
15 | 0.544767 | (Binders, Furnishings) |
18 | 0.537201 | (Binders, Storage) |
from mlxtend.frequent_patterns import association_rules
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7).head(10)
antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | |
---|---|---|---|---|---|---|---|---|---|---|
0 | (Paper) | (Binders) | 0.757881 | 0.815889 | 0.640605 | 0.845258 | 1.035996 | 0.022258 | 1.189792 | 0.143506 |
1 | (Binders) | (Paper) | 0.815889 | 0.757881 | 0.640605 | 0.785162 | 1.035996 | 0.022258 | 1.126983 | 0.188720 |
2 | (Furnishings) | (Binders) | 0.658260 | 0.815889 | 0.544767 | 0.827586 | 1.014337 | 0.007700 | 1.067844 | 0.041359 |
3 | (Furnishings) | (Paper) | 0.658260 | 0.757881 | 0.519546 | 0.789272 | 1.041419 | 0.020663 | 1.148963 | 0.116379 |
4 | (Paper, Furnishings) | (Binders) | 0.519546 | 0.815889 | 0.438840 | 0.844660 | 1.035264 | 0.014948 | 1.185214 | 0.070896 |
5 | (Binders, Furnishings) | (Paper) | 0.544767 | 0.757881 | 0.438840 | 0.805556 | 1.062904 | 0.025971 | 1.245181 | 0.130003 |
6 | (Storage) | (Binders) | 0.641866 | 0.815889 | 0.537201 | 0.836935 | 1.025795 | 0.013509 | 1.129066 | 0.070216 |
7 | (Storage) | (Paper) | 0.641866 | 0.757881 | 0.527112 | 0.821218 | 1.083571 | 0.040654 | 1.354267 | 0.215353 |
8 | (Paper, Binders) | (Storage) | 0.640605 | 0.641866 | 0.451450 | 0.704724 | 1.097930 | 0.040267 | 1.212879 | 0.248182 |
9 | (Paper, Storage) | (Binders) | 0.527112 | 0.815889 | 0.451450 | 0.856459 | 1.049725 | 0.021385 | 1.282640 | 0.100171 |