This document discusses how Syncsort helps companies access and integrate data from many sources to power analytics. It gives examples of how Syncsort has helped companies in insurance, media, and hospitality onboard and integrate both historical and streaming data from sources such as mainframes, databases, and IoT devices, delivering faster insights, increased productivity, cost savings, and future-proofed applications.
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: Today's ETL Does it All!
1. Powering the Connected Data Platform With ETL Onboarding
@Scott_Gnau
CTO, Hortonworks
@TenduYogurtcu
Big Data GM, Syncsort
2. Global Leader in Big Iron to Big Data Solutions
• Provider of enterprise software and leader in Big Iron to Big Data solutions in more than 85 countries around the world
• Global presence in 87% of enterprise Fortune 500 companies
• High-performance & scalable software harnessing valuable data assets to power business and operational analytics, while dramatically reducing the cost of mainframe and legacy systems
• Unique focus on customer value through cost-effective solutions and unparalleled support; trusted leader for nearly 50 years
Offices: Woodcliff Lake, NJ · Japan · Singapore
Global customer base of leaders and emerging businesses across all major industries
Strategic partnerships in Big Iron and Big Data ecosystems
3. Meet Today’s Presenters
Scott Gnau
CTO, Hortonworks
Tendu Yogurtcu, PhD
GM, Big Data, Syncsort
10. Our Strategy: Simplify Big Data Integration
• Deploy on premise or in the cloud
• Choose among multiple execution frameworks – Hadoop, Spark, Linux, Unix, Windows
• Integrate streaming and batch data with a single data pipeline for innovative applications, like IoT (see the sketch after this list)
• Future-proof applications to avoid re-writing jobs in order to take advantage of innovations in new execution frameworks
• Access and integrate ALL enterprise data sources – including mainframe – for advanced analytics
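The deck itself contains no code, so here is a minimal sketch of the single-pipeline idea, assuming PySpark as the engine: one transformation function serves both a historical batch load and a live Kafka feed. The broker address, topic name, and paths are hypothetical placeholders; DMX-h itself is a visual tool, not hand-written Spark.

# A minimal sketch of one pipeline serving batch and streaming (PySpark).
# "events", the broker address, and all paths are hypothetical placeholders.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-pipeline").getOrCreate()

def transform(df: DataFrame) -> DataFrame:
    # The same business logic, applied to data at rest and data in motion.
    return (df.withColumn("value", F.col("value").cast("string"))
              .filter(F.col("value").isNotNull()))

# Batch: historical records at rest.
history = transform(spark.read.text("/data/history"))
history.write.mode("overwrite").parquet("/data/curated")

# Streaming: the identical transform over a live Kafka topic.
live = transform(
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         .load())
(live.writeStream.format("parquet")
     .option("path", "/data/curated-live")
     .option("checkpointLocation", "/chk/unified")
     .start())

Because both paths call the same transform(), a job written once does not have to be redeveloped when the execution framework underneath it changes – which is the future-proofing point above.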
11. Three Commitments Underpin Our Big Data Integration Strategy
1. Ongoing Contributions to the Open Source Community
JIRA: MAPREDUCE-2454, MAPREDUCE-4807, MAPREDUCE-4049, MAPREDUCE-5455, HIVE-8347, SQOOP-1272, PARQUET-134, Spark-packages, and more!
2. Leverage Syncsort Technology Innovations & Mainframe Heritage
Light footprint; self-tuning engine; single install with no 3rd-party dependencies; world-class data processing and mainframe expertise
3. Strong Partnerships with Strategic Big Data & Hadoop Players
13. Insurance: Easy Access to ALL Data for Better Analytics
• Challenge: Needed hard-to-access operational data for advanced analytics
• Solution:
• Quickly load ~1,000 database tables into HDP with the click of a button (a rough sketch of the underlying pattern follows this slide)
• Access & integrate complex mainframe VSAM files, plus data from DB2/z, Oracle & SQL Server
• Track changes & keep data up to date
• Benefits:
• Insight: Better and faster analytics
• Agility: Reclaim development time; single tool to ingest, detect changes and populate the data lake
• Compliance: Build audit trails, keep the EDW current
• Productivity: No need for deep understanding of Hadoop
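DMX-h does this onboarding through its point-and-click interface; purely as an illustration of the underlying pattern, here is a hypothetical metadata-driven PySpark loop that pulls a list of source tables over JDBC and lands each one in the data lake. The connection URL, credentials, and table names are invented placeholders, not the customer's actual setup.

# Sketch of metadata-driven table onboarding over JDBC (PySpark).
# URL, credentials, and the table list are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-onboarding").getOrCreate()

jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
props = {"user": "etl_user", "password": "***", "fetchsize": "10000"}

# In practice the list would come from the source catalog (~1,000 tables).
tables = ["CLAIMS", "POLICIES", "CUSTOMERS"]

for table in tables:
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=props)
    # Land each table in the lake, ready to be exposed through Hive.
    df.write.mode("overwrite").parquet(f"/datalake/raw/{table.lower()}")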
14. Leading Media Company: Accelerate New Business Initiatives
• Challenge: Build a scalable platform to support new business initiatives & scale for double-digit data growth, while reducing escalating EDW & ELT costs
• Solution:
• Shift data storage & processing out of the EDW into Hadoop
• Migrate 500+ SQL ELT workloads to DMX-h on HDP (see the sketch after this slide)
• Benefits:
• Agility: Scalable architecture to deploy new business initiatives – analyze more set-top box data, blend website user activity data, etc.
• Cost: Millions of dollars in savings from the EDW, including SQL tuning & maintenance costs
• Productivity: ETL developers can stop coding & tuning, and get up & running on Hadoop quickly
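The deck does not show what a migrated workload looks like; as a hedged sketch, this is the kind of EDW-style SQL ELT statement that can be re-pointed at Hive tables and run on the cluster through Spark SQL. The database, table, and column names are hypothetical.

# Sketch: an EDW-style ELT aggregation re-expressed on Hive tables (Spark SQL).
# All table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("elt-offload")
         .enableHiveSupport().getOrCreate())

# The same SQL that ran in the warehouse now runs on the cluster,
# reading and writing Hive tables instead of EDW tables.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.daily_viewing
    SELECT device_id,
           to_date(event_ts)  AS view_date,
           count(*)           AS events,
           sum(watch_seconds) AS total_watch_seconds
    FROM   raw.set_top_box_events
    GROUP BY device_id, to_date(event_ts)
""")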
15. Hotel Chain: Ease of Use, Timely & Up-to-Date Reporting
• Challenge: More timely collection & reporting on room availability, event bookings, inventory and other hotel data from 4,000+ properties globally
• Solution:
• Near real-time reporting (a sketch of this cadence follows this slide)
• DMX-h consumes property updates from Kafka every 10 seconds
• DMX-h processes data on HDP, loading to Teradata every 30 minutes
• Deployed on Google Cloud Platform
• Benefits:
• Time to Value: DMX-h ease of use drastically cut development time
• Agility: Reports updated every 30 minutes vs. every 24 hours
• Productivity: Leveraging the existing ETL team for Hadoop (Spark), with a visual understanding of the data pipeline
• Insight: Up-to-date data = better business decisions = happier customers
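As a rough illustration of the cadence described above – consume continuously from Kafka, land ORC for reporting every 30 minutes – here is a minimal Spark Structured Streaming sketch. The broker, topic, and paths are hypothetical, and the actual project used DMX-h rather than hand-written Spark.

# Sketch: continuous Kafka consumption with 30-minute ORC micro-batches.
# Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("property-updates").getOrCreate()

updates = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "property-updates")
           .load()
           .selectExpr("CAST(value AS STRING) AS json", "timestamp"))

# Trigger every 30 minutes, matching the reporting interval above.
(updates.writeStream
        .format("orc")
        .option("path", "/warehouse/staging/property_updates")
        .option("checkpointLocation", "/chk/property_updates")
        .trigger(processingTime="30 minutes")
        .start())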
16. Syncsort DMX-h: Benefits to Business
• Faster Time to Value:
• Faster & better insights with readily accessible data
• Compliance:
• Secure data access, ability to build audit trails
• Increased Productivity:
• Reclaim development time by automating, optimizing and future-proofing development
• Across platforms, on premise and in the cloud
• Cost:
• Lower archival costs
• Reduced development time
• Reduced Total Cost of Ownership, higher ROI
17. See For Yourself!
***
Take a 30-day Free Trial @
www.syncsort.com/try
Editor's notes
TALK TRACK
Actionable intelligence means that you can capture perishable insights in real-time by analyzing data in motion.
It means drilling into terabytes or petabytes of data at rest for historical insights.
And, in turn, those historical insights help you tune your streaming analytics and data flows.
Modern data applications live and breathe at the intersection of those Connected Data Platforms and the data they manage.
Those are the innovative killer applications that deliver actionable intelligence for data discovery, a single view of the data or predictive analytics.
[NEXT SLIDE]
The Important Shift is FROM Converged TO Connected – this is the opposite of traditional systems.
The Process Was: Find the data sources, pull them all together with ETL, centralize and normalize the data, and then run analytics.
BUT… new data sources are too large and variable to make this process effective. IN ADDITION, your business can't wait for this to happen; the customer may already be gone!
In the NEW WORLD, data is everywhere: at the edge, in multiple clouds, and on premise. CONNECTED DATA PLATFORMS enable analytics to move to the data, portably and in real time across this decentralized footprint, even to the edge.
Syncsort's data integration has always delivered the ability to process large data volumes, in less time, with fewer resources. However, performance and efficiency are just our starting points. It became apparent in speaking with our customers a few years ago – particularly when "Big Data" and Hadoop took off – that they were facing new challenges with a common theme of complexity.
The rapid evolution of Big Data technologies presents several challenges on its own:
New technologies require new specialized skills that continue to be in short supply, and are very expensive if you can find them.
Because execution frameworks continue to improve, customers don't want to feel locked in. They don't want to have to redevelop all their jobs to take advantage of innovative new frameworks. A great example is MapReduce v1 to MapReduce v2 to Spark.
New sources and types of data add to the complexity as well – with streaming sources as a recent example – bringing both connectivity and skills challenges.
Many of our customers are large enterprises that still rely significantly on the mainframe, and these companies found it very difficult to bring the mainframe into the rest of their Big Data integration strategy.
Organizations were building data lakes but then struggling to fill them. We heard from customers and partners alike that ingesting data from all their enterprise sources – mainframe, data warehouse, etc. – into Hadoop was a big problem to solve.
So, our product strategy has focused not only on delivering data integration products with exceptional performance, efficiency and lower TCO, but also on simplifying the data integration process for all enterprise data sources and across all platforms – Linux, Unix, Windows, Hadoop, Spark – on premise or in the cloud.
Summarize our Strategy and Focus
We are a big part of the open source community. We were rated the 7th-largest contributor to open source projects based on the volume of code.
We have been able to leverage our DMX technology and integrate natively into Hadoop. We have a very light footprint on the cluster; our engine is self-tuning and doesn't consume resources unless a DMX job is running.
And we have very strong partnerships with the Hadoop players. At Cloudera, for example, we have everything from regular executive-level meetings with Mike Olson, to bi-weekly meetings between our product managers, to engineering partnerships for early development. And we test each other's software in our development cycles so we can make sure our integration is not going to suffer when a new feature is introduced.
Access – Get best-in-class data ingestion capabilities for Hadoop: mainframes, RDBMS, MPP, JSON, ORC, Parquet, Avro, NoSQL, Kafka and more.
Integrate – Single interface for streaming and batch processes. Single data pipeline for all enterprise data, batch or streaming.
Comply – Secure data access, data governance and lineage. Seamless integration with Kerberos, Apache Ranger, Apache Ambari, Apache Sentry, and more.
Simplify – Design once, deploy anywhere & insulate your organization from a rapidly changing ecosystem. Future-proof your applications for new compute frameworks, on premise or in the cloud.
IHG currently has the Holidex system, which gets information from about 4,000 IHG properties globally – their property policies, check-in/check-out info, discounts, blackout dates, penalties, etc., and how the inventory of rooms changes each day. Every IHG property currently sends this information, which is fed to the Teradata warehouse. With Kafka and cloud processing, they want to get changes in property policies more quickly, and they are certainly interested in real-time changes in inventory. The new system is Amadeus. They later plan to focus on booking data, which would entail customer information, booking status, booking history, membership details, etc.
The new system will send a JSON message about any inventory or policy change to Kafka, which is set up as a separate cluster so that multiple clusters can consume from it. They are envisioning a couple of data layers – the first being a data lake in which the hierarchies and structures of the JSON are left untouched and stored as Parquet. This layer will provide the data in its most raw format to their data scientists. They don't have any specific use case for it right now. FYI… since DMX has challenges with JSON arrays and limited Parquet support, this layer will be implemented outside of DMX for now.
But DMX will still provide the end-to-end implementation of their EZ tools project – policies and inventory. So we take their hierarchical JSON data and normalize it into multiple records (see the sketch below), and this read-from-Kafka-and-normalize step runs as a continuous process.
They have ORC-backed Hive tables which mirror tables in the Teradata warehouse. The continuous process just described provides data for a 30-minute batch process that updates the Hive tables. So the messages are captured in real time but made available every 30 minutes. All of this data and the Hive tables are stored in Google Storage, which is set up as an HDFS alternate storage system. DMX provides a mechanism to hit this layer via HDFS or directly, depending on how the customer wants to access it.
Finally, we generate load-ready files for the Teradata warehouse. The idea is to use the cloud cluster for any processing and push the final dataset to Teradata. Currently, reports still come out of Teradata, and that's where these final files are needed.
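As a hedged sketch of that normalization step – flattening one hierarchical JSON message into one record per nested element – here is a minimal PySpark version. The schema and field names are invented; in the project this step is performed by DMX-h, not hand-written Spark.

# Sketch: flatten a hierarchical inventory message into one row per room type.
# Field names ("property_id", "room_types", ...) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-normalize").getOrCreate()

raw = spark.read.json("/staging/kafka/inventory")  # one JSON message per line

flat = (raw.select("property_id", "event_ts",
                   F.explode("room_types").alias("room"))
           .select("property_id", "event_ts",
                   F.col("room.code").alias("room_code"),
                   F.col("room.available").alias("rooms_available")))

flat.write.mode("append").parquet("/datalake/inventory")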
FROM FERNANDA:
Old process:
Once a day, every property sends a feed. It's loaded into Teradata using Informatica. A daily report is produced.
New process they are building with DMX-h:
Every property sends a Kafka message.
House policy (check-in/out, gym, room availability) will also go into Kafka.
DMX-h on a single node reads the Kafka messages (JSON with lots of structures and repeated elements), normalizes them, and writes text to HDFS – at 10-second intervals for now, which is easily customizable. It can also be scaled to more nodes as the message load increases.
Every 30 min:
DMX-h does ETL in the cluster: sort by timestamp, join, CDC, and write to Hive ORC (takes about 5 minutes) in a Google bucket. (A rough sketch of this step follows these notes.)
The same ETL process also produces compressed text on HDFS to be loaded into Teradata.
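As a hedged sketch of that 30-minute batch – sort by timestamp, keep the latest change per property (the CDC step), write Hive ORC, and emit compressed text for the Teradata load – here is a minimal PySpark version. Paths and table names are hypothetical placeholders for what DMX-h actually does in the cluster.

# Sketch of the 30-minute ETL: CDC, Hive ORC refresh, Teradata-ready export.
# Paths and table names are hypothetical placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = (SparkSession.builder.appName("thirty-min-batch")
         .enableHiveSupport().getOrCreate())

msgs = spark.read.parquet("/datalake/inventory")

# CDC: order each property's messages by timestamp and keep the newest one.
latest = (msgs.withColumn(
              "rn", F.row_number().over(
                  Window.partitionBy("property_id")
                        .orderBy(F.col("event_ts").desc())))
              .filter("rn = 1").drop("rn"))

# Refresh the ORC-backed Hive table that mirrors the Teradata schema.
latest.write.mode("overwrite").format("orc").saveAsTable("reporting.inventory")

# Load-ready, compressed text files for the Teradata warehouse.
(latest.write.mode("overwrite")
       .option("compression", "gzip")
       .csv("/exports/teradata/inventory"))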