Jubatus is an open-source software framework for distributed online machine learning on big data. It focuses on real-time, deep analysis: online machine learning algorithms run in a distributed manner by updating models locally and periodically mixing them together. This enables fast, scalable, and memory-efficient analysis of large streaming datasets without storing raw data or sharing it across nodes.
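The update-locally, mix-periodically scheme can be sketched in a few lines. This is an illustrative toy, not the actual Jubatus API; the function names and the simple weight-averaging "mix" are assumptions for the sake of the example.

```python
# Illustrative sketch (not the Jubatus API): each node updates a linear
# model locally, and a periodic "mix" step averages the weight vectors,
# so only model parameters are exchanged -- never the raw data.

def local_update(weights, features, label, lr=0.1):
    """One perceptron-style online update on a single node."""
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    if label * score <= 0:  # misclassified (or on the boundary): update
        for f, v in features.items():
            weights[f] = weights.get(f, 0.0) + lr * label * v
    return weights

def mix(models):
    """Average the weight vectors of all nodes (the 'mix' step)."""
    keys = {k for m in models for k in m}
    n = len(models)
    return {k: sum(m.get(k, 0.0) for m in models) / n for k in keys}

# Two nodes see disjoint streams, then share only their models.
node_a = local_update({}, {"x": 1.0}, +1)
node_b = local_update({}, {"y": 1.0}, -1)
shared = mix([node_a, node_b])  # each node continues from this average
```

After a mix, every node resumes updating from the shared average, which is what keeps the models consistent without any cross-node data transfer.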
Demystifying Systems for Interactive and Real-time Analytics (DataWorks Summit)
A number of systems have been released recently for use in interactive and real-time analytics. Examples include Drill, Druid, Impala, Muppet, Shark/Spark, Storm, and Tez. It can be confusing for a practitioner to pick the best system for her specific needs. Statements like “this system is 10x better than Hive” can be misleading without understanding factors like: (i) the workload and environment where the improvement can be repeatably obtained, (ii) whether proper system tuning can change the result, and (iii) whether the results can be different under other workloads. Duke and two other research institutions are jointly conducting a large-scale experimental study with multiple systems and workloads in order to answer these questions of broad interest. The workloads used in the study represent new-generation analytics needs that cover a diverse spectrum including SQL-like queries, machine-learning analysis, graph and matrix processing, and queries running continuously over rapid data streams. The talk will use the results from this study to present the strengths and weaknesses of each system, and rigorously characterize the scenarios where each system is the right choice. Opportunities to improve the systems with new features or by cross pollination of features from multiple systems will also be presented.
Jubatus: Realtime deep analytics for Big Data @ Rakuten Technology Conference 2012 (Preferred Networks)
Currently, we face new challenges in real-time analytics of Big Data, such as social monitoring, M2M sensors, online advertising optimization, smart energy management, and security monitoring. Scalable machine learning technologies are essential to analyze these data. Jubatus is an open-source platform for online distributed machine learning on Big Data streams. We explain the technologies inside Jubatus and show how it achieves real-time analytics for a variety of problems.
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
The document discusses Hadoop and its uses for large-scale data processing and analysis. It provides examples of how Hadoop is used by Yahoo and in other enterprise settings for tasks like ETL processing, fraud detection, and cluster analysis. The document also introduces Greenplum HD, an enterprise-ready Hadoop platform that is faster and more reliable than Apache Hadoop.
Introduction to Distributed Computing Engines for Data Processing - Simone Ro... (Data Science Milan)
This document provides an introduction to distributed computing engines for data processing. It discusses what distributed computing systems are and how they address the problem of data and tasks being too large for a single machine. It then covers key distributed computing systems like Hadoop, Spark and Flink. For each system, it summarizes what it is, when and where it originated, why it was created, and how it works at a high level. It also provides brief examples of common use cases for each system today.
This document discusses big data and Hadoop. It defines big data as very large data measured in petabytes. It explains that Hadoop is an open source framework used to store, process, and analyze huge amounts of unstructured data across clusters of computers. The key components of Hadoop are HDFS for storage, YARN for job scheduling, and MapReduce for parallel processing. Hadoop provides advantages like speed, scalability, low cost, and fault tolerance.
The document provides an introduction to Hadoop and distributed computing, describing Hadoop's core components like MapReduce, HDFS, HBase and Hive. It explains how Hadoop uses a map-reduce programming model to process large datasets in a distributed manner across commodity hardware, and how its distributed file system HDFS stores and manages large amounts of data reliably. Functional programming concepts like immutability and avoiding state changes are important to Hadoop's ability to process data in parallel across clusters.
This document introduces BioCloud, a tool for using cloud computing platforms like Hadoop to process large biological datasets in parallel. It discusses how biology applications are becoming more resource-intensive and how cloud platforms can provide scalable computing resources at a lower cost than local hardware. It provides an overview of Hadoop and MapReduce as a framework for processing vast amounts of data across clusters of machines. Examples of companies using Hadoop include Google, Yahoo, and Facebook for applications involving terabytes of data.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems posed by large and complex datasets that cannot be processed by traditional systems. Hadoop uses HDFS for storage and MapReduce for distributed processing of data in parallel. Hadoop clusters can scale to thousands of nodes and petabytes of data, providing low-cost and fault-tolerant solutions for big data problems faced by internet companies and other large organizations.
This document discusses Hadoop and big data. It begins with definitions of big data and how Hadoop can help with large, complex datasets. It then discusses how Hadoop works with other tools like Pig and Hive. The document outlines different scenarios for big data and whether Hadoop is suitable. It also discusses how big data frameworks have evolved from Google papers. Finally, it provides examples of big data use cases and how education is being democratized with big data tools.
Yahoo is the largest corporate contributor, tester, and user of Hadoop. They have 4000+ node clusters and contribute all their Hadoop development work back to Apache as open source. They use Hadoop for large-scale data processing and analytics across petabytes of data to power services like search and ads optimization. Some challenges of using Hadoop at Yahoo's scale include unpredictable user behavior, distributed systems issues, and the difficulties of collaboration in open source projects.
Hadoop is a fast-growing, innovative data analytics technology that can effectively handle Big Data problems while maintaining data security. It is an open-source, widely adopted technology covering data collection, data processing, and data analytics using HDFS (Hadoop Distributed File System) and MapReduce algorithms.
Analyzing Big data in R and Scala using Apache Spark 17-7-19 (Ahmed Elsayed)
We can mine historical data, especially Big Data, with machine learning algorithms to make predictions about future data, using two clusters: one, Hadoop, manages the Big Data file system; the other, Apache Spark, performs fast analysis of Big Data. To achieve this, we will use R (via RStudio) or Scala (via Zeppelin).
This document provides an introduction and overview of Apache Hadoop. It discusses how Hadoop provides the ability to store and analyze large datasets in the petabyte range across clusters of commodity hardware. It compares Hadoop to other systems like relational databases and HPC and describes how Hadoop uses MapReduce to process data in parallel. The document outlines how companies are using Hadoop for applications like log analysis, machine learning, and powering new data-driven business features and products.
IRJET - Survey Paper on Map Reduce Processing using HADOOP (IRJET Journal)
This document summarizes a survey paper on MapReduce processing using Hadoop. It discusses how big data is growing rapidly due to factors like the internet and social media. Traditional databases cannot handle big data. Hadoop uses MapReduce and HDFS to store and process extremely large datasets across commodity servers in a distributed manner. HDFS stores data in a distributed file system, while MapReduce allows parallel processing of that data. The paper describes the MapReduce process and its core functions like map, shuffle, reduce. It explains how Hadoop provides advantages like scalability, cost effectiveness, flexibility and parallel processing for big data.
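The map, shuffle, and reduce phases the survey describes can be demonstrated with a word-count example in plain Python. This is a minimal sketch of the data flow, not Hadoop's actual implementation; the function names are illustrative.

```python
# Minimal word-count sketch of the map -> shuffle -> reduce phases,
# in plain Python (no Hadoop; the phase functions are illustrative).
from collections import defaultdict

def map_phase(lines):
    """map: emit a (word, 1) pair for every word in every input line."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big ideas"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {"big": 3, "data": 1, "clusters": 1, "ideas": 1}
```

In a real cluster, the map and reduce phases run in parallel on many machines and the shuffle moves data between them over the network; the logical flow is the same.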
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData... (Mahantesh Angadi)
This document provides an introduction to big data and the installation of a single-node Apache Hadoop cluster. It defines key terms like big data, Hadoop, and MapReduce. It discusses traditional approaches to handling big data like storage area networks and their limitations. It then introduces Hadoop as an open-source framework for storing and processing vast amounts of data in a distributed fashion using the Hadoop Distributed File System (HDFS) and MapReduce programming model. The document outlines Hadoop's architecture and components, provides an example of how MapReduce works, and discusses advantages and limitations of the Hadoop framework.
Pivotal: Virtualize Big Data to Make the Elephant Dance (EMC)
Big Data and virtualization are two of the hottest trends in the industry today, yet the full potential of bringing the two together has not been realized. In this session, learn how virtualization brings greater elasticity, stronger isolation for multi-tenancy, and one-click HA protection to Hadoop, while maintaining performance comparable to Hadoop on physical machines.
After this session you will be able to:
Objective 1: Understand the benefits of virtualizing Hadoop.
Objective 2: Understand how to get started with Pivotal HD Hadoop.
Objective 3: Understand where to find more information.
Big Data Analytics (ML, DL, AI) hands-on (Dony Riyanto)
These are supplementary slides to the introductory Big Data Analytics material (in the next file), which walk us through hands-on work with several topics related to Machine/Deep Learning, Big Data (batch/streaming), and AI using TensorFlow.
1) The document provides an overview of a guest lecture on data-intensive processing with Hadoop MapReduce.
2) It discusses why "Big Data" is important in science, engineering, and commerce due to the increasing amounts of data being generated.
3) The lecture then explains how MapReduce and distributed file systems like HDFS enable parallel processing of large datasets across clusters of computers.
Lecture slides for the University of Tokyo graduate course "融合情報学特別講義Ⅲ" (Special Lecture on Integrated Informatics III), given by PFN's Keisuke Fukuda on October 19, 2022.
・Introduction to Preferred Networks
・Our developments to date
・Our research & platform
・Simulation ✕ AI
Jubatus Invited Talk at XLDB Asia
1. Distributed Online Machine Learning
Framework for Big Data
Shohei Hido
Preferred Infrastructure, Inc. Japan.
XLDB Asia, June 22nd, 2012
2. Preferred Infrastructure (PFI): to bring
cutting-edge research advances to products
- Founded: March 2006, located in Tokyo, Japan
- Employees: 28
  - Top university graduates, including ICPC world finalists
  - Mid-career engineers from Sony, IBM, Yahoo!, and Sun
- Focus areas: Information retrieval, Distributed computing,
  Natural language processing, Machine learning
4. Overview:
Big Data analytics will go real-time and deeper
1. Bigger data
2. More in real-time
3. Deep analysis
- No storage
- No data sharing
- Only mix models
5. Jubatus: OSS platform for Big Data analytics
- Joint development with NTT laboratories in Japan
- Project started in April 2011
- Released as open source software
  - Version 0.3.0 just released
- You can download it from
  http://github.com/jubatus/
- We welcome your contributions and collaboration
6. Agenda
- What’s missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
7. Increasing demand in Big Data applications:
Real-time deeper analysis
- Current focus: aggregation and rule processing on bigger data
  - CEP (Complex Event Processing) for real-time processing
  - Hadoop/MapReduce for distributed computation
- Future: deeper analysis for rapid decisions and actions
  - Ex. 1: Defect detection on the NY power grid [Rudin+, TPAMI 2012]
  - Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]
[Figure: systems plotted by data size vs. depth of analysis; CEP covers real-time but shallow processing, Hadoop covers big but shallow processing, and deep real-time analysis of big data is the open quadrant ("What will come?")]
References:
http://web.mit.edu/rudin/www/TPAMIPreprint.pdf
http://www.computerworlduk.com/news/networking/3302464/
8. Key technology: Machine learning
- Examples need rapid decisions under uncertainty
  - Anomaly detection from M2M sensor data
  - Energy demand forecast / smart grid optimization
  - Security monitoring on raw Internet traffic
- What is missing for fast & deep analytics on Big Data?
  - An online/real-time machine learning platform
  - + A scale-out distributed machine learning platform
1. Bigger data
2. More in real-time
3. Deep analysis
9. Online machine learning in Jubatus
- Batch learning
  - Scans all data before building a model
  - Data must be stored in memory or storage
- Online learning
  - The model is updated with each data sample
  - Sometimes backed by theory that the online model
    converges to the batch model
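The batch/online contrast above can be sketched with a toy running-mean "model" (purely illustrative, not Jubatus code): the batch learner needs the whole dataset up front, while the online learner folds in one sample at a time, discards it, and still converges to the batch result.

```python
# Toy contrast between batch and online learning, using a running mean as
# the "model" (illustrative only, not Jubatus code).

def batch_train(samples):
    # Batch: the whole dataset must be available (stored) before training.
    return sum(samples) / len(samples)

def online_update(model, count, x):
    # Online: fold in one sample and discard it; the incremental mean
    # provably converges to the batch mean without storing any data.
    count += 1
    model += (x - model) / count
    return model, count

data = [2.0, 4.0, 6.0, 8.0]
batch_model = batch_train(data)

online_model, n = 0.0, 0
for x in data:
    online_model, n = online_update(online_model, n, x)

print(batch_model, online_model)  # 5.0 5.0 -- online matches batch
```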
10. Jubatus focuses on the latest online algorithms
- Advantages: fast and not memory-intensive
  - Low latency & high throughput
  - No need to store large datasets
- E.g. linear classification algorithms (note the very recent progress):
  - Perceptron (1958)
  - Passive Aggressive (PA) (2003)
  - Confidence Weighted Learning (CW) (2008)
  - AROW (2009)
  - Normal HERD (NHERD) (2010)
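To make the flavor of these algorithms concrete, here is a minimal Passive-Aggressive (PA-I) update step in Python; the toy data and the choice of C are invented for illustration, not taken from Jubatus.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for binary classification.
    y is +1 or -1; the model moves only when the hinge loss is non-zero,
    and just far enough to satisfy the current sample (hence the name)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss > 0.0:
        tau = min(C, loss / np.dot(x, x))  # PA-I step size
        w = w + tau * y * x
    return w

# Toy stream of two labeled samples.
w = np.zeros(2)
for x, y in [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]:
    w = pa_update(w, x, y)

print(w)  # [ 1. -1.]: both samples are now classified correctly
```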
11. Online learning or distributed learning:
No unified solution has been available
- Jubatus combines them into a unified computation framework
[Figure: tools placed on two axes, batch vs. real-time/online and small-scale/stand-alone vs. large-scale/distributed/parallel. SPSS (1988-) and WEKA (1993-) are batch and stand-alone; Mahout (2006-) is batch but distributed; online ML algorithms such as PA [2003] and CW [2008] are online but stand-alone; Jubatus (2011-) is both online and distributed.]
12. What Jubatus currently supports
- Classification (multi-class)
  - Perceptron / PA / CW / AROW
- Regression
  - PA-based regression
- Nearest neighbor
  - LSH / MinHash / Euclid LSH
- Recommendation
  - Based on nearest neighbor
- Anomaly detection*
  - LOF based on nearest neighbor
- Graph analysis*
  - Shortest path / Centrality (PageRank)
- Some simple statistics
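To give a feel for the MinHash option behind nearest neighbor, here is a toy signature sketch in Python (illustrative only; Jubatus's internals differ): two sets are compressed into short signatures, and the fraction of matching slots estimates their Jaccard similarity, which is what approximate nearest-neighbor search needs.

```python
import random
import zlib

def minhash_signature(items, num_hashes=64, seed=0):
    # For each of num_hashes (seeded) hash functions, keep the minimum
    # hash value over the set; two sets share a minimum with probability
    # equal to their Jaccard similarity.
    rng = random.Random(seed)
    masks = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(zlib.crc32(it.encode()) ^ m for it in items) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"buy", "cheap", "deal", "now"}
b = {"buy", "cheap", "deal", "today"}
sa, sb = minhash_signature(a), minhash_signature(b)
print(estimated_jaccard(sa, sb))  # roughly the true Jaccard 3/5 = 0.6
```

Comparing 64-slot signatures is far cheaper than comparing full feature sets, which is why sketches like this suit streaming workloads.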
13. Agenda
- What’s missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
14. Hadoop and Mahout: not good for online learning
- Hadoop
  - Advantages
    - Many extensions for a variety of applications
    - Good for distributed data storage and aggregation
  - Disadvantage
    - No direct support for machine learning or online processing
- Mahout
  - Advantage
    - Popular machine learning algorithms are implemented
  - Disadvantages
    - Some implementations are less mature
    - Still not capable of online machine learning
15. Jubatus vs. Hadoop, RDB-based tools, and Storm:
Advantage in online AND distributed ML
- Only Jubatus satisfies both of them at the same time

                        Jubatus       Hadoop        RDB            Storm
  Storing Big Data      ✓ (ext. DB)   ✓✓ (HDFS)     ✓              ✓ (ext. DB)
  Batch learning        ✓             ✓✓ (Mahout)   ✓ (SPSS etc.)  ✕
  Stream processing     ✓             ✕             ✕              ✓✓
  Distributed learning  ✓             ✓✓ (Mahout)   ✕              ✕
  Online learning       ✓✓            ✕             ✕              ✕
  (high importance)
16. Agenda
- What’s missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
17. How to make online algorithms distributed?
=> Not trivial!
[Figure: in batch learning, the "learn" steps between model updates are easy to parallelize; in online learning, the model is updated after every sample, so the frequent updates are hard to parallelize]
- Online learning requires frequent model updates
- A naïve distributed architecture leads to too many
  synchronization operations
- This causes performance problems in terms of network
  communication and accuracy
18. Solution: Loose model sharing
- Jubatus shares only the local models, and only loosely
  - Model size << data size
  - Jubatus DOES NOT share datasets
  - A unique approach compared to existing frameworks
- Local models can differ across servers
  - The differing models are gradually merged
[Figure: each server holds its own local model; the models are periodically combined into a shared mixed model]
19. Three fundamental operations in Jubatus:
UPDATE, ANALYZE, and MIX
1. UPDATE
   - Receive a sample, learn from it, and update the local model
2. ANALYZE
   - Receive a sample, apply the local model, and return the result
3. MIX (called automatically in the background)
   - Exchange and merge the local models between servers
   - Cf. the Map-Shuffle-Reduce operations in Hadoop
- Algorithms can be implemented independently of
  - Distribution logic
  - Data sharing
  - Failover
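The three operations can be sketched for a toy linear classifier as follows; the class and function names are hypothetical, not the actual Jubatus API, and the learning rule is a bare additive update rather than the algorithms listed earlier.

```python
# Hypothetical sketch of UPDATE / ANALYZE / MIX on one cluster
# (illustrative names, not the real Jubatus API).

class Server:
    def __init__(self):
        self.model = {}  # feature name -> weight (toy linear model)

    def update(self, features, label):
        # UPDATE: learn from one sample against the local model only.
        for f, v in features.items():
            self.model[f] = self.model.get(f, 0.0) + label * v

    def analyze(self, features):
        # ANALYZE: apply the local model; no communication needed.
        score = sum(self.model.get(f, 0.0) * v for f, v in features.items())
        return 1 if score >= 0 else -1

def mix(servers):
    # MIX: average the local models and redistribute the result.
    keys = {k for s in servers for k in s.model}
    mixed = {k: sum(s.model.get(k, 0.0) for s in servers) / len(servers)
             for k in keys}
    for s in servers:
        s.model = dict(mixed)

servers = [Server(), Server()]
servers[0].update({"spam": 1.0}, -1)   # server 1 sees a negative sample
servers[1].update({"ham": 1.0}, +1)    # server 2 sees a positive sample
mix(servers)
print(servers[0].analyze({"ham": 1.0}))  # 1: server 2's knowledge mixed in
```

Note that only models cross the network in mix(); the samples never leave the server that received them, which is the point of the design.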
20. UPDATE
- Each server starts from an initial model
- Each data sample is sent to one (or two) servers,
  distributed randomly or consistently
- The local model is updated based on the sample
- Data samples are NEVER shared
[Figure: incoming samples are routed to servers 1 and 2; each server grows its own local model from its initial model]
21. MIX
- Each server sends its model diff
  (local model - initial model = diff)
- The model diffs are merged and redistributed
  (initial model + merged diff = mixed model)
- Only model diffs are transmitted
[Figure: servers 1 and 2 each compute a diff against the initial model; the diffs are merged, and every server receives the same mixed model]
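The diff arithmetic above can be sketched in a few lines (an illustrative sketch, not the actual Jubatus protocol): each server sends diff = local - base, the diffs are averaged, and base + merged diff becomes the mixed model on every server, so the traffic scales with the model, never with the data.

```python
# Sketch of diff-based MIX (illustrative only).

def mix_by_diff(base, local_models):
    # diff_i = local_i - base, computed per weight.
    diffs = [{k: m.get(k, 0.0) - base.get(k, 0.0) for k in set(m) | set(base)}
             for m in local_models]
    # merged = average of the diffs.
    keys = {k for d in diffs for k in d}
    merged = {k: sum(d.get(k, 0.0) for d in diffs) / len(diffs) for k in keys}
    # mixed = base + merged diff, redistributed to every server.
    return {k: base.get(k, 0.0) + merged.get(k, 0.0)
            for k in set(base) | set(merged)}

base = {"w1": 1.0}
local_models = [{"w1": 2.0}, {"w1": 1.0, "w2": 4.0}]
mixed = mix_by_diff(base, local_models)
print(mixed)  # w1 -> 1.5, w2 -> 2.0: base plus the averaged diffs
```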
22. UPDATE (iteration)
- The locally updated models are discarded after MIX
- Each server resumes updating from the mixed model
- The mixed model improves gradually thanks to all of the servers
[Figure: as before, samples are distributed randomly or consistently; each server now updates its local model starting from the mixed model]
23. ANALYZE
- For prediction, each sample goes to a randomly chosen server
- The server applies its current mixed model to the sample
- The prediction is returned to the client
[Figure: queries are distributed randomly; each server answers from its mixed model]
24. Why can Jubatus work in real-time?
- It focuses on online machine learning
  - and makes online machine learning algorithms distributed
- Update locally
  - Online training without communication with other servers
- Mix only the models globally
  - Small communication cost, low latency, good performance
  - An advantage over the costly Shuffle step in MapReduce
- Analyze locally
  - Each server holds the mixed model
  - Low latency for making predictions
- Everything in memory
  - Data are processed on the fly
25. Agenda
- What’s missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
26. Demo: Twitter analysis using natural language
processing and machine learning
Jubatus classifies each tweet from the Twitter data stream into pre-defined
categories. A single Jubatus server is enough to classify over 5,000 queries
per second, which is close to the rate of the raw Twitter stream. We provide
a browser-based GUI.
27. Experiment: Estimation of power consumption
Jubatus learns the relationship between the power usage and the network
data flow patterns of certain servers. The power consumption of individual
servers can then be estimated in real-time by monitoring and analyzing
packets, without having to install power measurement modules on all servers.
[Figure: packets captured by a network TAP in a data center/office feed the
estimator; a scatter plot of predicted vs. actual power (W) shows close
agreement. Power consumption differs for different types of packets, and
servers without a power meter can still be estimated.]
28. Agenda
- What’s missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
29. Summary
- Jubatus is the first OSS platform for online
  distributed machine learning on Big Data streams
- Download it from http://github.com/jubatus/
- We welcome your contribution and collaboration
1. Bigger data
2. More in real-time
3. Deep analysis
- No storage
- No data sharing
- Only mix models