Top 10 challenges of making big data real
– and tips to overcome them

   Rich Dill
   Solutions Engineer, SnapLogic
   rdill@snaplogic.com
A play on Dave Letterman’s top 10

• 1. A miracle occurs here
    - Of course we can connect to it…
• 2. There is always more data than you expected
    -   Unless there is not enough data to be meaningful
• 3. Never mistake a memo for reality
    - Did you hear what I said or what I meant?
• 4. It is logically impossible to schedule for the unknown
    - Or the relationship between developers and weathermen
• 5. There is life beyond American English
    - Eventually you will have to deal with other languages



A play on Dave Letterman’s top 10

• 6. Of course the data is accurate, clean and ready
    - Data quality issues can kill project schedules
• 7. Dealing with unstructured data is fun
    - Somewhere buried inside is your delimiter where you least
      expect it
• 8. The data and process are subject to…
    - Pick your acronym: PCI, FIX, HIPAA, SOX
• 9. The requirements once defined are set in stone
    - Requirements almost always evolve
• 10. The most critical data will be on the most difficult
  platform to access
    - “a good deal of our case data is on Notes running on AS400”
A miracle occurs here

• Of course we can connect to it…




And we know the image resonates, v2…




SnapLogic Solution

[Diagram labels: Users, ESB, RDBMS, Data Center, Mobile, Enterprise, Amazon Redshift, Cloud, Big Data]

There is always more data than you expected



• Unless there is not enough data to be
  meaningful
    - It’s feast or famine
    - Distributed systems replicate data
      • At the site level and at the network level
         - 3x at the data center in Houston and 3x in Chicago
         - Replicated data can increase the cost of hardware,
           network and software
    - We are far from normal
      • Data is organized for performance and reliability
        not space efficiency
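
The replication point above is simple arithmetic, sketched here in Python. The 100 TB logical volume is a hypothetical figure chosen for illustration; only the "3x at each of two data centers" pattern comes from the slide.

```python
def raw_storage_tb(logical_tb: float, copies_per_site: int, sites: int) -> float:
    """Raw capacity needed when every site keeps its own replicated copies."""
    return logical_tb * copies_per_site * sites


# Hypothetical volume: 100 TB of logical data, 3 copies at each of 2 sites
# (the Houston and Chicago pattern from the slide).
logical_tb = 100.0
total_tb = raw_storage_tb(logical_tb, copies_per_site=3, sites=2)
print(f"{logical_tb:.0f} TB logical -> {total_tb:.0f} TB raw "
      f"({total_tb / logical_tb:.0f}x)")
```

Under these assumptions the raw footprint is six times the logical data, before any network or software licensing costs are counted.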
It is logically impossible to schedule for the unknown


• Or my theory of the relationship between developers
  and weathermen




• The accuracy of an estimate is a function of the
  number of variables and the length of the project
Never mistake a memo for reality

• Did you hear what I said or what I meant?
•   Are you a literal listener?
     -   Psycholinguistics should be required reading for project managers
• Waterfall process
     - Allows you to build something the user wants today that you deliver in
       9 months or two years
• Iterative process
     - We’ll figure it out as we go along
     - Not really suited for deep architectural designs
•   Process
     -   Listen
     -   Process
     -   Repeat back “this is what I heard you say”
•   Nothing beats showing a functioning prototype, demo or wireframe


There is life beyond American English

• Eventually you will have to deal with other languages
     - German will test your user interface spacing
     - Cyrillic will add to the character set
• Middle Eastern languages
     - Read right to left
     - Some languages don’t have consistent spelling
• Far Eastern languages
     - There is no such thing as Chinese
        •   Mandarin is the “Speech of Officials”
        •   Cantonese is used in Hong Kong
     - Korean is written in Hangul
     - Japanese
        •   Kanji are adopted Chinese characters
        •   Kana comprises Hiragana and Katakana
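
A minimal Python sketch of why these languages matter to a data pipeline: the same label takes a different number of characters, a different number of bytes, and sometimes a different reading direction. The sample translations of "Settings" and the helper name `describe` are illustrative assumptions, not part of the original deck.

```python
import unicodedata

# Illustrative translations of one UI label (assumed, not from the deck).
samples = {
    "English": "Settings",
    "German": "Einstellungen",
    "Russian": "Настройки",
    "Arabic": "إعدادات",
    "Japanese": "設定",
}


def describe(text):
    """Return (characters, UTF-8 bytes, is_right_to_left) for a string."""
    # Bidirectional classes "R" and "AL" mark right-to-left characters.
    rtl = any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
    return len(text), len(text.encode("utf-8")), rtl


for language, label in samples.items():
    chars, size, rtl = describe(label)
    print(f"{language:9s} chars={chars:2d} utf8_bytes={size:2d} rtl={rtl}")
```

Column widths sized for ASCII text, or layouts sized for English labels, are the first things such a check tends to expose.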

Of course the data is accurate, clean and ready


• How good is the data?
     -   Profiling the data is key to accurate project estimates
     -   What percentage of the data is null, blank, invalid?

• Data lifecycle includes
     -   Acquisition or creation
     -   Validation
          •   Business rules
          •   Which may result in…

• Data cleansing
     -   Zip code tables, barcodes, D & B credit ratings
     -   Public data resources: www.data.gov
• Storage in an accessible format/location
• Archiving
     -   Industry or legal rules for archiving
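
The profiling question on this slide (what percentage of the data is null, blank, invalid?) can be sketched in a few lines of Python. The ZIP code pattern stands in for a business rule and the sample values are made up; both are assumptions for illustration only.

```python
import re
from collections import Counter

# Assumed business rule: a US ZIP code, optionally in ZIP+4 form.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def classify(value):
    """Label one field value as 'null', 'blank', 'invalid' or 'ok'."""
    if value is None:
        return "null"
    if value.strip() == "":
        return "blank"
    if not ZIP_RE.fullmatch(value):
        return "invalid"
    return "ok"


def profile(values):
    """Return the percentage of values that fall into each category."""
    if not values:
        raise ValueError("cannot profile an empty column")
    counts = Counter(classify(v) for v in values)
    return {label: 100.0 * counts[label] / len(values)
            for label in ("null", "blank", "invalid", "ok")}


# Hypothetical sample column.
sample = ["77002", None, "  ", "6060", "60601-1234", "ABCDE", "77002", None]
report = profile(sample)
for label, pct in report.items():
    print(f"{label:8s}{pct:5.1f}%")
```

Running a check like this on a sample before estimating tells you how much cleansing work the schedule has to absorb.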


Dealing with unstructured data is fun

• Somewhere buried inside is your delimiter where you
  least expect it
• Email is one of the most complex to handle
• Hierarchical data structures must be mapped or
  navigated
• XML is not the be-all and end-all of structured data
  formatting
     -   JSON
     -   BSON
     -   SomethingImissedSON
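
The "delimiter where you least expect it" problem is easy to reproduce. This Python sketch compares a naive split with the standard library's csv module on a made-up record whose fields contain commas and quotes; the sample row is an assumption for illustration.

```python
import csv
import io

# Made-up record: the name and the comment both contain the delimiter.
raw = 'id,name,comment\n1,"Dill, Rich","said ""hello"", then left"\n'

# Naive approach: split every line on the delimiter.
naive = [line.split(",") for line in raw.splitlines()]

# csv.reader honors quoting, so embedded commas and doubled quotes survive.
parsed = list(csv.reader(io.StringIO(raw)))

print(len(naive[1]), "fields from the naive split")
print(len(parsed[1]), "fields from csv.reader:", parsed[1])
```

The naive split breaks the row into too many fields, while the quote-aware parser returns the three fields the header promises.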




Big Data Reference Architecture

[Diagram: three stages, 1 Collect, 2 Translate & Enrich, 3 Distribute; labels: Structured Data, Unstructured Data, DB, DB, Data View]

The data and process are subject to…

• Pick your acronym: PCI, FIX, HIPAA, SOX
• Almost every industry has some form of data handling
  protocol that must be addressed
• These protocols are a combination of
     -   Data creation
     -   Data access
     -   Technology and workflow
     -   It is not just encryption and access
• Know your customer’s requirements!




The requirements once defined are set in stone


 • What your users know today is not what they will know
   tomorrow…
 • Requirements evolve
 • Why do you think they call them users?
      - If you are successful they will want more
 • Things change
      -   Economy
      -   Budgets
      -   Timeframe
      -   Management
 • Feature creep is not a bad thing if budgets and
   timelines also creep
The most critical data will be on the most difficult
platform to access

• “A good deal of our case data is on Notes running on AS400”
• Discover where the data is first
• When can you access it?
     - 24x7, after hours, on demand
• Throughput is key
     - Either during business hours or afterwards
• What conditions?
     -   One time download
     -   Scheduled
     -   Event based
     -   Stream
• What about security requirements?
     - There is a performance impact of encryption during transmission
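
Whether a load fits its access window is a quick calculation worth doing early. The Python sketch below is a rough estimate only: the 2,000 GB volume, the 1,000 Mbps link, the 8-hour window and the 0.7 efficiency factor (a stand-in for protocol and encryption overhead) are all hypothetical values.

```python
def transfer_hours(volume_gb: float, link_mbps: float,
                   efficiency: float = 0.7) -> float:
    """Hours needed to move volume_gb over a link rated at link_mbps.

    efficiency is an assumed fraction of the link that remains usable
    after protocol and encryption overhead.
    """
    megabits = volume_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical after-hours load: 2,000 GB over a 1,000 Mbps link.
window_hours = 8
hours = transfer_hours(volume_gb=2000, link_mbps=1000)
print(f"{hours:.1f} h needed; fits the {window_hours} h window: "
      f"{hours <= window_hours}")
```

Lowering the efficiency factor shows how quickly encryption or a shared link can push a scheduled job past its window.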

Containerization with Snaps

• BUY
     - SnapStore
     - Certified and supported by SnapLogic
• BUILD
     - SDK + API
     - Java, Python
     - Customer, Partner or SnapLogic

The eleventh rule

• Free software sometimes is worth the cost
     - Or the money you save on licenses is multiplied by
       the cost of training and consultants
     - In most cases labor is one of the biggest costs of
       most software projects
• Open source is NOT the same as free!
     - Subscription vs. perpetual licenses
     - Does the customer need to
        • Expense or capitalize software licenses?



Thank you
For more information
www.snaplogic.com
BDaaS - BigData as a Service

Editor's notes

  1. 1990s
     • Valuable data was being generated but lived in siloed environments. The term MDM was not even coined until 2003
     • As long as you could connect different systems together via a nightly, or sometimes even a weekly, feed, that was pretty darn awesome!
     • Technologies like ESBs, EAIs, ETLs… flourished
     • Data was mostly structured, sitting in RDBMS systems
     2000s
     • Network speeds increased
     • Costs went down
     • Players like Salesforce and NetSuite started getting traction from the SMB market
     • Immense value on cost and agility
     • Flexibility to subscribe vs. perpetual licenses
     2005: Consumer / social data
     • FB, Twitter, LinkedIn, amazon.com consumer reviews…
     • Humans generating massive amounts of preference data, likes and dislikes
     • Data was different: non-relational, unstructured, real-time
     • Huge volumes: petabytes
     • Providing immense value to the business on their customers
     2010: Machine
     • RFID tags, various other sensors, weblogs. ArcSight got bought out for $1.5B by HP
     • Massive amounts of data: exabytes
     • Splunk had a successful IPO last month
     SnapLogic
     • These 4 sources create an impedance mismatch!
     • Good luck doing all of this with an ESB
     • Structured vs. unstructured
     • Streaming vs. batch
     • Petabytes and exabytes vs. gigabytes
     • Pull vs. push
     • Hub and spoke
     • Unprecedented opportunity & desire to use data
     • Data silos (data fragmentation) unavoidable
     • Legacy apps, cloud apps, and Hadoop are driving this
     • Different locations, protocols, formats, and architectures
     • Data is more distributed & less accessible (less useful)
     • Compounding due to volume & variety of apps & data
     • ESB is just another connection
     • Enterprises must share data between their apps
     • Collect, combine, process data into valuable information
     • Competitive advantage will become necessity for survival
     • SnapLogic = data sharing platform
  2. Containerization with Snaps
     • Apple-like model: we offer an API and about 200 Snaps
     • Build or buy
     • Easy to build with Java or Python: an intern out of school built Snaps in 4 days
     • Build or buy: containerization of access
     • Abstraction of the end point, so you do not need to know everything