Copyright ©2014 Visible Technologies, Inc. All rights reserved.
Data Mining and Engineering
Presented by Lucas Parker
Senior Software Development Engineer, Research & Development
About Visible
Our Mission: Customer Value
Global Authoritative Content
• Most comprehensive global content sourcing model
• Clean & accurate data
Powerful Search & Discovery
• Only the most pertinent results, based on your criteria
• Pivot and drill to identify discussion drivers
Engagement/Social CRM
• Social media workflow and engagement for individual or team
• Integrate with CRM for continuous relationship management
Sophisticated Social Analytics
• Measure, compare, and contrast program & communication results
• Segment results by product attributes, reputation drivers, etc.
Actionable Insights
• Discovery & analytics to uncover insights in real time
• Holistic consumer insights, integrated with other market research
What We Do
• Domain is “social media”
• Twitter, Facebook, forums, blogs, etc.
• Huge data sets, lots of noise.
• Enrichment, aggregation, reporting.
Visible – Target Business Groups
• Customer Servicing
• Interactions between business and customer.
• Marketing
• Brand effort, campaigns, periodic messaging.
• Corporate Communications
• PR, reputation of company and stakeholders
• Research
• Audience definition, demographics, psychographics
Data Mining Meets Engineering
Articulating the Problem
“Marketing analysts need to understand the impact of their campaigns and we can provide them an avenue to do so.”
- Surf and turf
“We should totally Hadoop something!”
- Knuckle sandwich
Feature Engineering
• Your concise data features are easy to grasp, but do they provide for an adequate model?
• Your 600-dimension model is totally awesome, but does it scale?
• How much is “good enough”?
Proposing Solutions to the Business
• Understand scale issues.
• Provide alternatives:
• There is no such thing as a perfect system.
• Communicate clearly about real and opportunity costs.
The Hazards of Third Party Data
• Data might not be available forever.
• Vendors might change terms.
• Entrenchment can impede growth and change as quality degrades over time (data sources can decay, and vendors may slack on maintenance).
Bonini’s Paradox
Productionalizing Prototypes
• Isn’t that a fancy word?
• Strike balance between awesome and simple.
• This is almost impossible to get right.
• Even if you get it right once, it won’t last.
• Better for everybody if you give me as simple a mechanism as possible.
Expanding and Maintaining 1
• Data drift
• How does data change organically over time?
• Bit rot
• Does anybody even remember how to refit the model?
• Split maintenance
• Keeping the research model up to date with the production model never happens.
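One rough way to put a number on data drift is to compare the word distribution of the data a model was fitted on against a recent window. The sketch below is an illustration only, not something from the talk: the function names, the toy documents, and the choice of total variation distance are all assumptions.

```python
from collections import Counter


def term_distribution(docs):
    """Relative frequency of each whitespace-separated, lowercased term."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two distributions: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


# Hypothetical documents standing in for "fit time" and "now".
training_window = ["great flight today", "lost my bag again"]
recent_window = ["great flight today", "needle in my sandwich"]

drift = total_variation(
    term_distribution(training_window),
    term_distribution(recent_window),
)
print(f"drift score: {drift:.3f}")
```

A score that creeps upward between refits is a hint that the model's training data no longer resembles production traffic; the threshold for acting on it would have to be chosen empirically.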
Expanding and Maintaining 2
• Horizontal expansion exposes original scope assumptions.
• “We have it in English. What do you mean we can’t get it in Swahili?”
• Value trumps veracity. Sacrifices of purity cause degradation.
• Business needs result in accretion of surrounding goo.
Document Tone: “NLP” versus “Statistical”
• NLP/Probabilistic Grammars:
• Effective.
• Slow.
• Costly reference grammars. Consider a vendor.
• Vector space modeling (term vectors/n-grams)
• Very fast at runtime.
• Work best with lots of training data.
• You can fit it yourself, so long as you can afford to maintain it.
Engineering Case Study: Language Detection
Language-Detection
Language-Detection: Features
• Supports 53 languages.
• Fitted on Wikipedia corpora.
• Classic “one-versus-all” classification.
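To make "one-versus-all" concrete, here is a minimal sketch in which each language gets its own score: how much more likely the text's character trigrams are under that language than under all other languages pooled together. This is an assumed illustration, not the library's actual implementation; the function names, the add-one smoothing, and the two tiny corpora are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def fit(labeled_docs):
    """Map each label to trigram counts over its documents."""
    return {
        label: Counter(g for doc in docs for g in trigrams(doc))
        for label, docs in labeled_docs.items()
    }


def score(counts, label, text, alpha=1.0):
    """Smoothed log-likelihood ratio of `label` versus all other labels pooled."""
    own = counts[label]
    rest = Counter()
    for other, other_counts in counts.items():
        if other != label:
            rest.update(other_counts)
    vocab = len(set(own) | set(rest)) or 1
    own_total, rest_total = sum(own.values()), sum(rest.values())
    total = 0.0
    for g in trigrams(text):
        p_own = (own[g] + alpha) / (own_total + alpha * vocab)
        p_rest = (rest[g] + alpha) / (rest_total + alpha * vocab)
        total += math.log(p_own / p_rest)
    return total


def detect(counts, text):
    """Pick the label whose one-versus-rest score is highest."""
    return max(counts, key=lambda label: score(counts, label, text))


# Toy corpora for illustration only; a real model needs far more data.
counts = fit({
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is the thing that they think"],
    "es": ["el rapido zorro marron salta sobre el perro perezoso",
           "esta es la cosa que ellos piensan"],
})
print(detect(counts, "the dog thinks"))
print(detect(counts, "el perro"))
```

Adding a language means fitting one more set of counts, at the cost of scoring time that grows with the number of labels.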
Language-Detection: Mechanism
• Determines the frequency with which n-grams of 1-3 characters appear inside of a labeled corpus.
“To what extent does each 1-3 character n-gram participate in a label?”
"tho":134583,"thr":87801,"the":3415279,"thi":110969,"tha":240340
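The counting step described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the library's code: `char_ngrams`, `ngram_profile`, and the two sample sentences are hypothetical, and a real profile would be built from a large labeled corpus such as a Wikipedia dump.

```python
from collections import Counter


def char_ngrams(text, max_n=3):
    """Yield every character n-gram of length 1..max_n, in order of length."""
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def ngram_profile(corpus, max_n=3):
    """Count n-gram occurrences across all documents sharing one label."""
    counts = Counter()
    for doc in corpus:
        counts.update(char_ngrams(doc, max_n))
    return counts


# Tiny stand-in for an English-labeled corpus.
english_sample = ["the thing that they think", "this is the other one"]
profile = ngram_profile(english_sample)
print({g: profile[g] for g in ("the", "thi", "tha", "tho", "thr")})
```

The printed dictionary has the same shape as the count table shown on the slide: one entry per n-gram, valued by how often it occurs under the label.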
Language-Detection: Practicalities
• Downsides?
• Twitter and Facebook!
• Letter casing (“I love you” versus “i love you”).
• Mixed-language documents (e.g. Chinese documents with English words).
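The letter-casing problem is easy to see at the n-gram level: a single capital letter changes several n-grams at once. The sketch below is illustrative only; the helper names and the use of Jaccard overlap are assumptions. It compares the n-gram sets of the two example strings before and after lowercasing.

```python
def char_ngram_set(text, max_n=3):
    """Set of all character n-grams of length 1..max_n in text."""
    return {
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    }


def overlap(a, b):
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngram_set(a), char_ngram_set(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


raw = overlap("I love you", "i love you")
folded = overlap("I love you".lower(), "i love you".lower())
print(f"raw overlap: {raw:.2f}, case-folded overlap: {folded:.2f}")
```

Case folding makes the two strings identical to the model, but it also discards casing as a signal, so whether it helps on a given data set is an empirical question.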
PR Case Study: Delta Airlines and “Needle Sandwiches”
Overview
• Airline passengers found sewing needles in sandwiches.
• Airline attempted to redirect the conversation and measure the results.
• Visible tracked this event in social media.
Delta Airlines: Needle Sandwiches
[Figure: prominent terms at week, month, and three-month views, annotated with three events: purchased a refinery to reduce fuel costs; passengers found needles in their in-flight sandwiches; free tickets given away as a promotion.]
Delta Volumes Over Time
[Figure: volume over time, annotated with three events: purchased a refinery to reduce fuel costs; needles found in in-flight turkey sandwiches; free tickets given away as a promotion. Further charts show the month view and the three-month view.]
PR Case Study: Conclusion
• Contest didn’t pay off in the long term.
• Attempts to redirect the conversation may be ham-fisted.
• Thoughts? Conjecture?
Conclusion
Questions?
Thank You
www.visibletechnologies.com
info@visibletechnologies.com
Twitter: @Visible
Phone: (888) 852-0320
