• CS 349: Market Basket Data Mining
  All about beer and diapers.
• Overview
  - What is Data Mining?
  - Market Baskets
  - How fast does it run?
  - What does it do?
• What is Data Mining?
  - Statistics
  - Data Analysis
  - Machine Learning
  - Databases
• Types of Data that Can Be Mined
  - market basket
  - classification
  - time series
  - text
• Applications of Market Basket Mining
  - supermarkets
  - data with boolean attributes
  - census data: single vs. married
  - word occurrence
• Some Measures of the Data
  - number of baskets: N
  - number of items: M
  - average number of items per basket: W (width)
• Aspects of Market Basket Mining
  - What is interesting?
  - How do you make it run fast?
• What is Interesting? (first try)
  - Itemset I = a set of items
  - Association rule: A -> B
  - support(I) = fraction of baskets that contain I
  - confidence(A -> B) = probability that a basket contains B given that it contains A
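Both measures are simple ratios of basket counts. A minimal sketch in Python (the basket data is invented for illustration):

```python
# Compute support and confidence over a toy set of market baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
]
N = len(baskets)  # number of baskets

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / N

def confidence(A, B):
    """Estimated P(B | A): support of A union B divided by support of A."""
    return support(A | B) / support(A)

print(support({"beer", "diapers"}))       # 2 of 4 baskets -> 0.5
print(confidence({"beer"}, {"diapers"}))  # 2 of the 3 beer baskets -> 0.666...
```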
• How Do You Find Itemsets with High Support? Apriori algorithm, Agrawal et al. (1993): find all itemsets with support > s.
  - 1-itemset = itemset with 1 item
  - …
  - k-itemset = itemset with k items
  - large itemset = itemset with support > s
  - candidate itemset = itemset that may have support > s
• Apriori Algorithm
  - Start with all 1-itemsets.
  - Go through the data, count their support, and find all "large" 1-itemsets.
  - Combine them to form "candidate" 2-itemsets.
  - Go through the data, count their support, and find all "large" 2-itemsets.
  - Combine them to form "candidate" 3-itemsets.
  - …
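The level-wise loop above can be sketched in Python. The threshold s and the transactions are made up for illustration, and candidates are generated here by simply unioning pairs of large (k-1)-itemsets, then pruning any candidate with a non-large subset (the Apriori property):

```python
from itertools import combinations

def apriori(baskets, s):
    """Return all itemsets with support > s, level by level."""
    N = len(baskets)
    # Pass 1: find the large 1-itemsets.
    items = {i for b in baskets for i in b}
    large = {frozenset([i]) for i in items
             if sum(1 for b in baskets if i in b) / N > s}
    result = set(large)
    k = 2
    while large:
        # Candidate k-itemsets: unions of two large (k-1)-itemsets.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # Prune candidates with any non-large (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in large
                             for sub in combinations(c, k - 1))}
        # One pass over the data to count support at this level.
        large = {c for c in candidates
                 if sum(1 for b in baskets if c <= b) / N > s}
        result |= large
        k += 1
    return result

baskets = [{"beer", "diapers", "chips"}, {"beer", "diapers"},
           {"beer", "chips"}, {"diapers", "milk"}]
print(apriori(baskets, 0.4))
```

With these baskets and s = 0.4, the loop stops after k = 2: the only 3-candidate, {beer, diapers, chips}, is pruned because its subset {diapers, chips} is not large.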
• Run Time
  - k passes over the data, where k is the size of the largest candidate itemset.
  - Memory chunking algorithm ==> 2 passes over the data on disk, but multiple passes in memory.
  - Toivonen (1996) gives a statistical technique: 1 + e passes (but more memory).
  - Brin et al. (1997), Dynamic Itemset Counting: 1 + e passes (less memory).
• But What Is Really Interesting? For a rule A -> B:
  - Support = P(AB)
  - Confidence = P(B|A)
  - Interest = P(AB) / (P(A)P(B))
  - Implication Strength = P(A)P(~B) / P(A~B)
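Each of these is again a ratio of basket counts, so all four can be estimated from the same toy data (invented for illustration; a real implementation would guard against the P(A~B) denominator being zero):

```python
# Estimate the four rule measures for A -> B from toy basket data.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
]
N = len(baskets)

def p(itemset):
    """Estimated probability that a basket contains all of `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / N

A, B = {"beer"}, {"diapers"}
p_a_not_b = p(A) - p(A | B)  # P(A, ~B): baskets with A but not B

support = p(A | B)                                        # P(AB)
confidence = p(A | B) / p(A)                              # P(B|A)
interest = p(A | B) / (p(A) * p(B))                       # P(AB)/P(A)P(B)
implication_strength = p(A) * (1 - p(B)) / p_a_not_b      # P(A)P(~B)/P(A~B)

print(support, confidence, interest, implication_strength)
```

Here interest is 8/9, just below 1, so in this toy data beer and diapers co-occur slightly less often than independence would predict; an interest well above 1 is what flags a rule as noteworthy.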
• But What Is Really, Really Interesting?
  - Causality
  - Surprise
• Summary
  - What is Data Mining?
  - Market Baskets
  - Finding itemsets with high support
  - Finding interesting rules