Data Mining Beyond Adventure Works (Redmond WA 10/3/2009)
2. Approach of this Presentation
• Emphasize
– Conceptual value of data mining
– Relationship of data mining to the real
world
• Reserve
– Specific procedures and mechanics
– Specific mathematics
– Production implementation
© 2009 Mark Tabladillo Ph.D. 2
5. Data Mining Definitions
• Data mining is the automatic or
semi-automatic process of exploring data for
meaningful or useful patterns.
• Data mining algorithms typically use
estimation or optimization to achieve
results (as opposed to direct calculation alone).
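To make the contrast between calculation and estimation concrete, here is a minimal Python sketch that is not part of the original presentation. The mean of a sample can be found by direct calculation. The same number can also be estimated by repeatedly adjusting a guess to reduce squared error, which is closer to how many data mining algorithms work. The data values, learning rate, and step count are illustrative assumptions.

```python
# Hypothetical sample; the values are illustrative only.
data = [2.0, 4.0, 4.0, 5.0, 10.0]

# Direct calculation: a closed-form formula gives the answer in one step.
direct_mean = sum(data) / len(data)

def estimate_center(values, learning_rate=0.1, steps=200):
    """Estimate the value that minimizes mean squared error by gradient descent."""
    guess = 0.0
    for _ in range(steps):
        # Gradient of mean((guess - v)^2) with respect to guess.
        gradient = 2 * sum(guess - v for v in values) / len(values)
        guess -= learning_rate * gradient
    return guess

estimated = estimate_center(data)
print(direct_mean, round(estimated, 4))
```

For a simple mean the two approaches agree, so the optimization route is unnecessary here. It matters when no closed-form answer exists, as with clustering or neural networks.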
6. Microsoft Data Mining
• Microsoft Data Mining refers to
Microsoft’s specific implementation of
certain common data mining algorithms for
the DMX (Data Mining Extensions)
language.
• Also called SQL Server Data Mining, the
technology is integrated into SQL Server
rather than presented as an independent
application.
7. Data Mining Tasks
• Supervised
– Answer known, what is correlated?
• Unsupervised
– Answer unknown (unspecified), what are the
groups?
• Forecasting
– Given a trend, what is next?
8. List the Data Mining Algorithms
• Ten Answers
• Each one is a field of academic focus
9. The Data Mining Algorithms
• Microsoft Naive Bayes
• Microsoft Linear Regression
• Microsoft Decision Trees
• Microsoft Time Series
• Microsoft Clustering
• Microsoft Sequence Clustering
• Microsoft Association Rules
• Microsoft Neural Network
• Microsoft Logistic Regression
• Text Mining
10. The Analyze Tab
Menu Option                      Data Mining Algorithm
Analyze Key Influencers          Naïve Bayes
Detect Categories                Clustering
Fill from Example                Logistic Regression
Forecast                         Time Series
Highlight Exceptions             Clustering
Scenario Analysis (Goal Seek)    Logistic Regression
Scenario Analysis (What If)      Logistic Regression
Prediction Calculator            Logistic Regression
Shopping Basket Analysis         Association Rules
11. Demo One:
National League Baseball
• Directions:
You are on the management team for the
Atlanta Braves. To better serve the team,
you have been instructed by the owner to
group the players by considering both their
position and their salary.
12. Demo One:
National League Baseball
• The following rules apply:
– You must make more than one group
– Each group must have at least two players
– Players of different positions may be in the
same group
13. Demo One:
National League Baseball
• Individual attributes can be used to make
groups
• Historical statistics can be used to group
new players
• Both supervised and unsupervised
algorithms can be applied to the same
data
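The grouping exercise in Demo One can be sketched in Python as an unsupervised clustering problem. This is a minimal illustration and not the method used in the demo: it uses plain k-means with a fixed starting point, whereas the Microsoft Clustering algorithm has its own implementation. The roster, the 0.25 weight on position, and the choice of two groups are all assumptions made for the example.

```python
import math

# Hypothetical roster: names, positions, and salaries (in $ millions) are made up.
players = [
    ("Player A", "Pitcher", 12.0),
    ("Player B", "Pitcher", 10.5),
    ("Player C", "Catcher", 1.2),
    ("Player D", "Infield", 0.9),
    ("Player E", "Outfield", 11.0),
    ("Player F", "Infield", 1.5),
    ("Player G", "Outfield", 0.6),
    ("Player H", "Pitcher", 0.8),
]

positions = sorted({pos for _, pos, _ in players})
max_salary = max(salary for _, _, salary in players)

def to_features(position, salary):
    """One-hot encode position (down-weighted to 0.25) and scale salary to 0..1."""
    one_hot = [0.25 if position == p else 0.0 for p in positions]
    return one_hot + [salary / max_salary]

def kmeans(points, k, max_iter=100):
    """Plain k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

features = [to_features(pos, sal) for _, pos, sal in players]
labels = kmeans(features, k=2)

groups = {}
for (name, _, _), label in zip(players, labels):
    groups.setdefault(label, []).append(name)

# Check the demo's rules: more than one group, at least two players per group.
rules_ok = len(groups) > 1 and all(len(members) >= 2 for members in groups.values())
print(groups, rules_ok)
```

With position down-weighted, salary dominates the grouping, so players of different positions can land in the same group, as the rules allow. Raising the position weight would shift the balance toward position.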
14. Demo Two:
Government Forecasting
• Directions:
The President is asking your opinion on
how the following numbers will increase
over the next few months. Because this
project is sensitive, you do not know what
these numbers measure. However, based
on the available history, make your best
projection for the next six periods.
15. Demo Two:
Government Forecasting
[Chart: monthly values, Jan 2007 through Dec 2008; vertical axis from 0 to 8]
16. Demo Two:
Government Forecasting
[Chart: monthly values, Sep 2007 through Aug 2009; vertical axis from 0 to 12]
17. Demo Two:
Government Forecasting
• Rapid response is as useful as prediction
• Seek intelligent correlations among related
metrics
• Projections depend on time frame –
modeling is continual
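A six-period projection like the one requested in Demo Two can be sketched in Python with a least-squares trend line. This is a deliberately simple stand-in and not the Microsoft Time Series algorithm. The history values are hypothetical, since the demo's actual numbers are not reproduced in this transcript.

```python
# Hypothetical monthly history; the demo's real numbers are not reproduced here.
history = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

def fit_line(values):
    """Ordinary least-squares fit of value = intercept + slope * period_index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def project(values, periods=6):
    """Extend the fitted line for the requested number of future periods."""
    intercept, slope = fit_line(values)
    start = len(values)
    return [intercept + slope * (start + i) for i in range(periods)]

forecast = project(history)
print([round(v, 2) for v in forecast])
```

Because projections depend on the time frame, calling project again each time a new period arrives is one way to treat modeling as continual rather than as a one-time fit.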
19. Supervised Algorithms
• Microsoft Naive Bayes
• Microsoft Linear Regression
• Microsoft Decision Trees
• Microsoft Neural Network
• Microsoft Logistic Regression
20. Unsupervised Algorithms
• Microsoft Clustering
• Microsoft Sequence Clustering
• Microsoft Association Rules
• Text Mining
21. Resources
• MarkTab.NET
Links, video resources and information for data mining
• Data Mining with Microsoft SQL Server 2008
by Jamie MacLennan, ZhaoHui Tang, and Bogdan Crivat
• Smart Business Intelligence Solutions with Microsoft® SQL Server® 2008
(PRO-Developer)
by Lynn Langit and Matthew Roche
24. Bonus:
Sequence Clustering Ideas
• Trading players in professional sports
• Assigning players to certain positions
• Moving from city to city
• Store path at the mall
• Cancer treatment path
• Taking up a musical instrument
• Taking up sports
• Blogging
• Viral news
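Each idea above involves an ordered path through a set of states. As a minimal Python sketch of the first step in analyzing such data, the example below estimates how often one state follows another, using the "store path at the mall" idea. The store names and paths are made up, and the sketch stops at transition probabilities; a full sequence clustering model would also group similar paths together.

```python
from collections import Counter, defaultdict

# Hypothetical store paths at a mall; store names and visits are made up.
paths = [
    ["Entrance", "Coffee", "Books", "Exit"],
    ["Entrance", "Coffee", "Shoes", "Exit"],
    ["Entrance", "Books", "Coffee", "Exit"],
]

def transition_probabilities(sequences):
    """Count each consecutive pair of states and normalize counts per starting state."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }

probs = transition_probabilities(paths)
print(probs["Entrance"])
```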