Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016

Optimizing Big Data to run in the Public Cloud

Qubole

Atlanta MLConf

Qubole

Swimming Across the Data Lake, Lessons learned and keys to success

Today's organizations contend with more diverse applications, data, and systems than ever before – silos that are often fragmented and difficult to leverage together. iWay Big Data Integrator (BDI) simplifies the creation, management, and use of Hadoop-based data lakes. It provides a modern, native approach to Hadoop-based data integration and management that ensures high levels of capability, compatibility, and flexibility to help your organization. Join us to learn how you can simplify adoption of Apache Hadoop using iWay Big Data Integrator. Learn about our ability to streamline the deployment of ingestion, transformation, and extraction tasks. See the pre-recorded webcast online at: http://www.informationbuilders.com/webevents/online/24427#sthash.J0cRy1PG.dpuf

Summer Shorts: Big Data Integration

ibi

Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.

Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...

NoSQLmatters

Analysis of Major Trends in Big Data Analytics

Hadoop Reporting and Analysis - Jaspersoft

Beyond TCO

Big Data Architecture and Deployment

Cisco Canada

Harnessing the Power of Apache Hadoop

Cloudera, Inc.

High Performance Spatial-Temporal Trajectory Analysis with Spark

Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises; however, this promise also comes with challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal, such as notebooks like Jupyter and Apache Zeppelin and IDEs such as RStudio, with languages like R, Python, and Scala and frameworks like Apache Spark. Given all the choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production? In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of DSX with HDP with the focus on integration, security, and model deployment and management. Speakers: Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM; Vikram Murali, Program Director, Data Science and Machine Learning, IBM

Impala use case @ Zoosk

Cloudera, Inc.

Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...

Data Con LA

Scaling Data Science on Big Data

More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop. Topics include the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
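The late binding mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names, and the `read_with_schema` helper are hypothetical and do not come from the presentation. The idea shown is that raw data is landed unchanged, and each consumer applies its own schema at read time.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Raw events are stored exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "u1", "amount": "12.50", "ts": "2016-07-01"}',
    '{"user": "u2", "amount": "3", "ts": "2016-07-02", "coupon": "X"}',
]

# A schema maps a field name to the function that casts its raw value.
Schema = Dict[str, Callable[[object], object]]


def read_with_schema(raw_lines: Iterable[str], schema: Schema) -> Iterator[dict]:
    """Apply a schema while reading; fields absent from a record are skipped."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field])
            for field, cast in schema.items()
            if field in record
        }


# Two consumers bind different schemas to the same raw data.
billing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
marketing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "coupon": str}))

print(billing_view)
print(marketing_view)
```

Because nothing is cast or dropped at load time, a new consumer can later bind a different schema to the same raw records without reloading them.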

A Reference Architecture for ETL 2.0

Hadoop adoption is a journey. Depending on the business, the process can take weeks, months, or even years. Hadoop is a transformative technology, so the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. There are challenges for companies that have lived with an application-driven business for the last two decades to suddenly become data driven. Companies need to begin thinking less in terms of single, siloed servers and more about “the cluster”. The concept of the cluster becomes the center of data gravity, drawing all the applications to it. Companies, especially the IT organizations, embark on a process of understanding how to maintain and operationalize this environment and provide the data lake as a service to the businesses. They must empower the business by providing the resources for the use cases which drive both renovation and innovation. IT needs to adopt new technologies and new methodologies which enable the solutions. This is not technology for technology's sake. Hadoop is a data platform servicing and enabling all facets of an organization. Building out and expanding this platform is the ongoing journey as word gets out to businesses that they can have any data they want at any time. Success is what drives the journey. The length of the journey varies from company to company. Sometimes the challenges are based on the size of the company, but many times the challenges are based on the difficulty of unseating established IT processes companies have adopted without forethought for the past two decades. Companies must navigate through the noise. Sifting through the noise to find those solutions which bring real value takes time. As the platform matures and becomes mainstream, more and more companies are finding it easier to adopt Hadoop. Hundreds of companies have already taken many steps; hundreds more have already taken the first step. As the wave of successful Hadoop adoption continues, more and more companies will see the value in starting the journey and paving the way for others.

The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016

This presentation will cover the design principles and techniques used to build data pipelines, taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussions will be based on the context of managing a pipeline with multi-petabyte data sets and a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs. We'll talk about how to make sure that data pipelines have the following features: 1) Assurance that the input data is ready at each step. 2) Workflows are easy to maintain. 3) Data quality and validation come built into the architecture. Part of the presentation will be dedicated to showing how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) Raw Input (logs, messages, etc.), 2) Logical Input (scrubbed data), 3) Foundational Warehouse Data (most relevant joins), 4) Departmental/Project Data Sets, and 5) Report Data Sets (used by traditional report engines). The final part will discuss the design of a rule-based system to perform validation and trend reporting.
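As a rough illustration of what a rule-based validation step could look like, here is a minimal Python sketch. It is not the system described in the presentation: the `Rule` class, the rule names, and the sample rows are all hypothetical. Each rule is a named predicate over a row, and the report counts violations per rule so the counts can be tracked from run to run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """A named predicate that returns True when a row is valid (hypothetical)."""
    name: str
    check: Callable[[Row], bool]


def validate(rows: List[Row], rules: List[Rule]) -> Dict[str, int]:
    """Count how many rows violate each rule."""
    failures = {rule.name: 0 for rule in rules}
    for row in rows:
        for rule in rules:
            if not rule.check(row):
                failures[rule.name] += 1
    return failures


# Illustrative rules for a scrubbed "Logical Input" data set.
RULES = [
    Rule("user_id_present", lambda r: bool(r.get("user_id"))),
    Rule(
        "bytes_non_negative",
        lambda r: isinstance(r.get("bytes"), int) and r["bytes"] >= 0,
    ),
]

sample = [
    {"user_id": "u1", "bytes": 120},
    {"user_id": "", "bytes": 40},     # violates user_id_present
    {"user_id": "u3", "bytes": -5},   # violates bytes_non_negative
]

report = validate(sample, RULES)
print(report)  # {'user_id_present': 1, 'bytes_non_negative': 1}
```

Keeping rules as data rather than hard-coded checks means a new check is one more entry in the list, and per-rule counts stored after each run give a simple basis for trend reports.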

Accelerating Big Data Analytics

Attunity

Qubole presentation for the Cleveland Big Data and Hadoop Meetup

Qubole

"Integration of Hadoop in Business landscape", Michal Alexa, IT and Innovatio...

Dataconomy Media

Trending (20)

Filling the Data Lake

Featured

Designing Data Pipelines Using Hadoop

Building a Data Pipeline from Scratch - Joe Crobak

Hakka Labs

Big data analysis using map/reduce

RenuSuren

Building a Self-Service Big Data Pipeline

At the StampedeCon 2015 Big Data Conference: Mercy has built a system using batch and streaming technology to allow batch and near real-time updates to flow from its Epic EHR (Electronic Health Records) system into our Hadoop cluster. Mercy is using this system to provide reporting and analytics capabilities to its researchers, business owners, and physicians. The system uses Sqoop, Flume, Pig, Hive, Oozie, and Storm. As part of this system, Mercy developed a generic database replication utility that makes it a one-line configuration change to add new tables and relational data sources. This session will describe the batch and streaming engines that Mercy uses to integrate the patient data. Adam will talk about how the system came together and where it is headed in the future.

Batch and Real-time EHR updates into Hadoop - StampedeCon 2015

Data ingest is a deceptively hard problem. In the world of big data processing, it becomes exponentially more difficult. It's not sufficient to simply land data on a system; that data must be ready for processing and analysis. The Kite SDK is a data API designed for solving the issues related to data ingest and preparation. In this talk you'll see how Kite can be used for everything from simple tasks to production-ready data pipelines in minutes.

Building data pipelines with kite

Joey Echeverria

Managing data workflows with Luigi

Teemu Kurppa

Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...

As Hadoop becomes a mainstream data platform across organizations, securing a vast and growing volume of critical information, especially financial and healthcare data, is more essential than ever. In this presentation, Derek will elaborate on how to leverage Big Data technologies without sacrificing security and compliance, and will focus especially on how comprehensive security mechanisms should be put in place to secure a production-ready Hadoop environment. The presentation will also highlight technologies such as encryption in motion and at rest for Hadoop services, as well as the complicated compliance processes needed to meet the strictest regulatory requirements and standards.

Hadoop Security and Compliance - StampedeCon 2016

Visualizing Big Data – The Fundamentals

Companies today are all focused on finding new consumption models to better utilize the data they produce. This presentation will provide insights and best practices for creating the organization and sponsorship necessary to set the foundation for success. For this session, Dan will provide an overview of the process and methodologies he employs to establish and sustain a Data Driven Culture. Key topics will include: Data Driven Culture; Executive Sponsorship; Organizational Structure – Collaboration Hubs and Bi-Modal Analytics; Role of Hadoop and Big Data as Part of Data Driven Culture.

Creating a Data Driven Organization - StampedeCon 2016

The Internet of (Human) Things is just beginning to take shape. The human body is an inexhaustible source of data about personal health, and the healthcare industry is just beginning to scratch the surface of the potential insights and value that will come from that data. While much of healthcare traditionally focuses on the episodic delivery of services, the Affordable Care Act is pushing healthcare providers, payers, and self-funded employer groups to look at ways to proactively encourage healthy behaviors. Providing personal health devices as a way to promote individual health is one way that healthcare is beginning to take advantage of IoT technologies. This session provides insight into how IoT is being leveraged in population health management through a solution jointly delivered by Amitech Solutions and Big Cloud Analytics. Attendees will learn how Hadoop is being used to gather personal device data from various vendors, integrate and analyze that information, differentiate trends across regional and cultural diversity, and provide personal recommendations and insights into health risks. This session presents one important way the healthcare industry is leveraging IoT.

Using The Internet of Things for Population Health Management - StampedeCon 2016

In this talk, we provide an introduction to Python Luigi via real-life case studies, showing you how you can break large, multi-step data processing tasks into a graph of smaller sub-tasks that are aware of the state of their interdependencies. Growth Intelligence tracks the performance and activity of all the companies in the UK economy using their data ‘footprint’. This involves tracking numerous unstructured data points from multiple sources in a variety of formats and transforming them into a standardised feature set we can use for building predictive models for our clients. In the past, this data was collected in a somewhat haphazard fashion, combining manual effort, ad hoc scripting, and processing that was difficult to maintain. In order to streamline the data flows, we’re using an open-source Python framework from Spotify called Luigi. Luigi was created for managing task dependencies, monitoring the progress of the data pipeline, and providing frameworks for common batch processing tasks.
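The idea of a graph of sub-tasks that know the state of their dependencies can be sketched with the standard library alone. The code below is not Luigi's API (Luigi expresses the same idea through `requires()`, `output()`, and `run()` methods on task classes); the `Task` class, the `build` function, and the task names are hypothetical and only illustrate how completed work is skipped while missing upstream work is run first.

```python
from typing import Callable, List, Optional, Set


class Task:
    """A unit of work that knows its upstream dependencies (hypothetical)."""

    def __init__(
        self,
        name: str,
        requires: Optional[List["Task"]] = None,
        action: Optional[Callable[[], None]] = None,
    ) -> None:
        self.name = name
        self.requires = requires or []
        self.action = action or (lambda: None)


def build(task: Task, done: Set[str], log: List[str]) -> None:
    """Run upstream tasks first, skipping any task that is already complete."""
    if task.name in done:
        return
    for upstream in task.requires:
        build(upstream, done, log)
    task.action()
    done.add(task.name)
    log.append(task.name)


fetch = Task("fetch")
clean = Task("clean", requires=[fetch])
features = Task("features", requires=[clean])
report = Task("report", requires=[clean, features])

# Pretend an earlier run already produced the output of "fetch".
completed: Set[str] = {"fetch"}
run_order: List[str] = []
build(report, completed, run_order)
print(run_order)  # ['clean', 'features', 'report']
```

Asking for the final task is enough: the scheduler walks the graph, and a rerun after a failure resumes from whatever is still missing instead of starting over.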

Building Scalable Big Data Pipelines

Christian Gügi

A Beginner's Guide to Building Data Pipelines with Luigi

Growth Intelligence

Building a unified data pipeline in Apache Spark

Apache Hadoop YARN: Understanding the Data Operating System of Hadoop

RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

Big Data Spain

Big Data Landscape 2016

Matt Turck

SQL to Hive Cheat Sheet

Similar to Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016

Transform You Business with Big Data and Hortonworks

To transform your organization and unlock the value of your data, you need a way to ingest, store and analyze every type of data in your organization. This presentation covers the Data Access Layer of the Hadoop Ecosystem which enables you to achieve this. We will use the HDP (Hortonworks Data Platform) reference architecture to walk through the Hadoop core and its ecosystem with focus on the data access layer. We will cover some of the prominent tools of the ecosystem such as Pig, Hive, Sqoop, Flume and Oozie and how they are used for ingesting data into Hadoop from structured, unstructured and streaming sources. Talk to us at +91 80 6567 9700 or send an email to training@springpeople.com for more information.

Transform Your Business with Big Data and Hortonworks

Pactera_US

presentation_Hadoop_File_System

Brett Keim

Hadoop data access layer v4.0

SpringPeople

Bn1028 demo hadoop administration and development

conline training

How is Big Data moved around? How are you planning to move it? This session will focus on familiar and not-so-familiar tools you can use today for moving and integrating Big Data. We will also outline the technologies and platform (introduction to Big Data, Hadoop, HDInsight, and tools). We will compare and outline options, discuss how they can work with your existing Hadoop and Windows Azure environment, and provide some guidance on when and how to use each of these tools.

On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)

Stéphane Fréchette

February 7, 2015, Samedi SQL. Topic: Session 3 - Introduction to Azure HDInsight (Stéphane Fréchette - Ukübu). *** Session in "Frenchglish". Apache Hadoop is a platform that has emerged to help extract insight from all that data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required.

Stéphane Fréchette - Samedi SQL - Introduction to HDInsight

MSDEVMTL

Apache Hadoop is a platform that has emerged to help extract insight from all that data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required. Presented @ MSDEVMTL on Saturday, February 7, 2015

Introduction to Azure HDInsight

Stéphane Fréchette

Hadoop Hadoop & Spark meetup - Altiscale

Mark Kerzner

Hadoop as a service presented by Ajay Jha at Houston Hadoop Meetup

Mark Kerzner

201305 hadoop jpl-v3

Eric Baldeschwieler

Part five in a five-part series, this webcast will be a demonstration of the integration of Apache Zeppelin and Pivotal HDB. Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. This webinar will demonstrate the configuration of the psql interpreter and the basic operations of Apache Zeppelin when used in conjunction with Hortonworks HDB.

How to Use Apache Zeppelin with HWX HDB

The Briefing Room with Robin Bloor and HP Vertica. Live Webcast on August 26, 2014. Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=3dd6d1b068fe395f665c75adb682ac41 Hadoop has long passed the point of being a nascent technology, but many users have found that when left to its own devices, Hadoop can be a one-trick pony. To get the most out of Hadoop, organizations need a flexible platform that empowers analysts and data managers with a complete set of information lifecycle management and analytics tools without a performance tradeoff. Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he outlines Hadoop’s role in a big data architecture. He’ll be briefed by Walt Maguire of HP Vertica, who will showcase his company’s big data solutions, including HAVEn and the HP Big Data Platform. He will demonstrate how HP Vertica acts as a complement to Hadoop, and how the combination of the two provides a versatile and highly performant solution. Visit InsideAnalysis.com for more information.

Introduction to Data Analyst Training

Cloudera, Inc.

Tools and techniques for data science

Ajay Ohri

BIG DATA ANALYTICS WITH HADOOP

Imviplav

Level Up – How to Achieve Hadoop Acceleration

Inside Analysis

Big Data Integration Webinar: Getting Started With Hadoop Big Data

Pentaho

Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective. In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.

Hadoop or Spark: is it an either-or proposition? By Slim Baltagi

Slim Baltagi

Twitter word frequency count using hadoop components 150331221753

pradip patel

More from StampedeCon

Despite widespread adoption and success, most machine learning models remain black boxes. Many times users and practitioners are asked to implicitly trust the results. However, understanding the reasons behind predictions is critical in assessing trust, which is fundamental if one is asked to take action based on such models, or even to compare two similar models. In this talk I will (1) formulate the notion of interpretability of models, (2) provide a review of various attempts and research initiatives to solve this very important problem, and (3) demonstrate real industry use cases and results, focusing primarily on Deep Neural Networks.

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...

Words are no longer sufficient in delivering the search results users are looking for, particularly in relation to image search. Text and languages pose many challenges in describing visual details and providing the necessary context for optimal results. Machine Learning technology opens a new world of search innovation that has yet to be applied by businesses. In this session, Mike Ranzinger of Shutterstock will share a technical presentation detailing his research on composition aware search. He will also demonstrate how the research led to the launch of AI technology allowing users to more precisely find the image they need within Shutterstock’s collection of more than 150 million images. While the company released a number of AI search enabled tools in 2016, this new technology allows users to search for items in an image and specify where they should be located within the image. The research identifies the networks that localize and describe regions of an image as well as the relationships between things. The goal of this research was to improve the future of search using visual data, contextual search functions, and AI. A combination of multiple machine learning technologies led to this breakthrough.

The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017

In many modern applications, data are collected in unusual forms. Connectome or brain imaging data are graphs. Wearable devices measuring activity produce functions over time. In many cases these objects are collected for each individual or transaction, leaving the statistician with the challenge of analyzing populations of data that are not in classical numeric and categorical formats in big spreadsheets. In this talk I introduce object oriented data analysis with an application we recently developed for regression analysis. This talk is aimed at the general data scientist and will emphasize concepts rather than mathematical detail. The take-home message is how we can use covariates (i.e., metadata) to predict what the structure of a brain image graph will be.

Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017

This talk aims to dive into the technical details of machine learning model development and implementation and the value it brings to Monsanto's breeding pipeline. We genotype over 100 million seeds a year in order to save field resources and product development cycle time. Automation and high-throughput production from the lab become key to R&D success. In-house predictive model development incorporated a random forest ensemble-based approach with additional features derived from a Gaussian mixture model. The results show over 95% accuracy with less than 1% false positives/negatives. The model is highly generalizable, with over 10 million data points used for training and testing. The model also offers a probabilistic approach to present genotypes in a more meaningful way and help enhance downstream genomics analyses. The talk targets audiences in breeding, genetics, and molecular biology, as well as data scientists who are interested in practical applications.

Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...

How to Talk about AI to Non-analysts - StampedeCon AI Summit 2017

This technical session provides a hands-on introduction to TensorFlow using Keras in the Python programming language. TensorFlow is Google’s scalable, distributed, GPU-powered compute graph engine that machine learning practitioners use for deep learning. Keras provides a Python-based API that makes it easy to create well-known types of neural networks in TensorFlow. Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain.

Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017

This presentation will cover all aspects of modeling, from preparing data to training and evaluating the results. There will be descriptions of the mainline ML methods, including neural nets, SVM, boosting, bagging, trees, forests, and deep learning. Common problems of overfitting and dimensionality will be covered, with discussion of modeling best practices. Other topics will include field standardization, encoding categorical variables, and feature creation and selection. It will be a soup-to-nuts overview of all the necessary procedures for building state-of-the-art predictive models.

Foundations of Machine Learning - StampedeCon AI Summit 2017

In this session, we’ll discuss approaches for applying convolutional neural networks to novel computer vision problems, even without having millions of images of your own. Pretrained models and generic image data sets from Google, Kaggle, universities, and other places can be leveraged and adapted to solve industry and business specific problems. We’ll discuss the approaches of transfer learning and fine tuning to help anyone get started on using deep learning to get cutting edge results on their computer vision problems.

Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...

Like the story of the six blind men trying to explain the nature of an elephant, current research efforts in cognitive computational systems attempt to identify the nature of an illness, human behavior, or socio-economic phenomenon, each from its own perspective. At present, there is no agreed-upon definition for cognitive systems. One large communication corporation defines cognitive systems as a category of technology that uses artificial intelligence, machine learning, and reasoning to enable people and machines to interact more naturally. It also extends and magnifies human expertise and cognition to enable accurate decisions on time. Two of the most famous risk and financial advisory firms agree with that interpretation. A different large corporation, however, considers “cognitive systems” as merely marketing jargon. If cognitive systems are going to help us solve challenging problems in medicine, economics, or other fields, three aspects must be considered in order to reveal the “true nature of the elephant”. 1) All facets of the problem must be addressed, just as the main parts of the elephant had to be touched by the men. 2) These facets must be properly assembled, just as the men needed to join hands around the elephant in order to understand what it was. 3) This assembly must be completed within sufficient time to anticipate future decisions, just as the men needed to know what an elephant is before the next one charges them. This talk will explain how agnostic (unsupervised, blinded) machine learning findings can be assembled by multiobjective and multimodal optimization techniques to uncover a multifaceted view of the “elephant”, in this case the human being (e.g., genomic variants, personality traits, brain images). It will also give real-world examples of how this knowledge will “extend the human capabilities” by achieving an integrative assessment of the whole person in relation to their risk, which will allow professionals to generate accurate person-centered policies, from personalized diagnoses to business opportunities to the prevention of outbreaks.

Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...

This talk will walk through the important building blocks of Automated AI. Rajiv will highlight the current gaps in analytics organizations and how to close those gaps using automated AI. Some of the issues discussed around automated AI are the accuracy of models, tradeoffs around control when using automation, interpretability of models, and integration with other tools. These issues will be highlighted with examples of automated analytics in different industries. The talk will end with some examples of how automated AI in the hands of data scientists and business analysts is transforming analytic teams and organizations.

Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017

Artificial Intelligence has entered a renaissance thanks to rapid progress in domains as diverse as self-driving cars, intelligent assistants, and game play. Underlying this progress is Deep Learning, driven by significant improvements in Graphics Processing Units and computational models inspired by the human brain that excel at capturing structures hidden in massive complex datasets. These techniques have been pioneered at research universities and digital giants, but mainstream enterprises are starting to apply them as open source tools and improved hardware become available. Learn how AI is impacting analytics today and in the future. Learn how AI is affecting the enterprise, including applications like fraud detection, mobile personalization, predicting failures for IoT, and text analysis to improve call center interactions. We look at practical examples of assessing the opportunity for AI, phased adoption, and lessons learned going from research, to prototype, to scaled production deployment.

AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017

A Different Data Science Approach - StampedeCon AI Summit 2017

Enterprises typically have many data silos of partial customer data, and a common theme in big data projects is to use big data tools and pipelines to unify all siloed customer data into a single, queryable platform for improving all future customer interactions. This data often comes from billing, website traffic, logistics, and marketing, all in different formats with different properties. Graph provides a way to unify all of the data into a single place for use in tracking the flow of a user through the various silos. Graph can also be used for visualizations and analytics that are difficult in other systems. In this talk we will explore the ways in which Graph can be leveraged in a customer 360 use case: what it can add to a more conventional system, and what the approach to developing a graph-based Customer 360 system should be.

Graph in Customer 360 - StampedeCon Big Data Conference 2017

This talk will go over how to build an end-to-end data processing system in Python, from data ingest, to data analytics, to machine learning, to user presentation. Developments in old and new tools have made this especially practical today. In particular, the talk will cover Airflow for process workflows, PySpark for data processing, Python data science libraries for machine learning and advanced analytics, and building agile microservices in Python. System architects, software engineers, data scientists, and business leaders can all benefit from attending the talk. They should learn how to build more agile data processing systems and take away some ideas on how their data systems could be simpler and more powerful.

End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017

Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017

Using big data isn’t about doing the same things we’ve always done just with different technologies. The technology advances that we’ve chosen to label as big data create the opportunity for wholly new kinds of solutions. Two of the key advances that are enabling new business capabilities are cloud-based data management platforms and streaming data processing and analytics. In this session, Paul Boal will drill into the cloud-based streaming data architecture that has made possible EVŌ, a new breakthrough health and wellness platform. EVŌ uses a game-changing approach that leverages over 60 billion data points and a predictive analytics engine to intervene BEFORE someone becomes critically ill. All of this is made possible by leveraging data from smartphones and wearable fitness devices along with advanced analytics, which then help users develop and sustain positive behaviors. Attendees will learn how to create a cloud-based architecture that can receive data, apply multiple layers of dynamic business rules, and drive alerts and decisions through real-time stream processing using technologies including web services, Amazon DynamoDB and Kinesis, Drools, and Apache Spark.

Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...

The collection and use of Big Data has become an important part of modern business practice. The Internet of Things (IoT) movement promises to provide new opportunities for businesses interested in the intersection of people and technology. It is also fraught with pitfalls for practitioners and researchers who struggle to make sense of an increasing cacophony of signals. How should they poll and collect data from millions of signals in a way that is manageable, scalable, and statistically valid? How should they analyze and predict using these data? This presentation will discuss these challenges with applied examples from monitoring and managing one of the world’s largest computers.

Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...

Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any other similar analysis, you will need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and then store the results in Apache Cassandra.

Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016

Resource Management in Impala - StampedeCon 2016

This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include: 1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each. 2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools. 3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each. 4) Security & Access Controls. Solving these challenges is key for adoption in regulation-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack, and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices. 5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating Analytics workflows and tools on converged infrastructure with shared data, and building “As A Service” architectures oriented towards self-service data exploration and Analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this. This will be a moderately technical session, with the above topics being illustrated by real-world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.

Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016