Slides for the tutorial on Apache Giraph for the Data Mining class.
Sapienza University of Rome,
Master of Science in Engineering in Computer Science.
Prof. A. Anagnostopoulos, I. Chatzigiannakis, A. Gionis
Fall 2016
2. Introduction
Apache Giraph is a project for performing computations on graphs.
Apache Giraph runs on Hadoop's MapReduce service.
Apache Giraph was inspired by Google's Pregel framework.
Giraph is used by Facebook and PayPal.
3. Why do we need it?
Giraph avoids the costly memory and network overhead that the same operations incur when expressed as MapReduce jobs.
On plain Hadoop, each iteration of a computation means executing a full MapReduce job; an algorithm that needs, say, 30 iterations would launch 30 jobs, each rereading the graph from disk.
4. The Web
The World Wide Web can be structured as an enormous graph:
pages are vertices connected by arcs that represent hyperlinks.
This web graph has billions of vertices and arcs.
The success of the largest companies, such as Google, comes down to their ability to run large computations on these graphs.
5. Google PageRank
The success of the Google search engine comes from its ability to rank results.
The ranking is based on the PageRank algorithm, a fundamentally graph-oriented algorithm:
important pages have many links pointing to them from other pages.
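In the standard PageRank recurrence, with damping factor d (typically 0.85), N the total number of pages, B(v) the set of pages that link to v, and L(u) the number of outgoing links of u:

    PR(v) = \frac{1-d}{N} + d \sum_{u \in B(v)} \frac{PR(u)}{L(u)}

A page ranks highly when many pages link to it, or when a few highly ranked pages do.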
6. Social Networks
On Facebook, Twitter, and LinkedIn, users and their interactions form a social graph:
users are vertices connected by arcs that represent an interaction, such as a friendship or a follow.
7. Google Pregel
A distributed system developed for large-scale graph processing: "Think like a vertex."
It uses the Bulk Synchronous Parallel model for execution.
Highly fault tolerant thanks to checkpointing.
8. Could I use Pregel?
Pregel is proprietary to Google; as an alternative there is Apache Giraph, an open-source implementation of Pregel.
It runs on a Hadoop infrastructure.
Computations are executed in memory.
9. Bulk Synchronous Parallel
Supersteps.
Communication:
processes exchange data with one another in order to operate on remotely stored data.
Barrier synchronization:
when a process reaches the barrier, it must wait until all the other processes have reached the same barrier.
The computation finishes when all components finish.
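The barrier has a simple cost model (Valiant's standard BSP model): with w_s the largest local computation of any process in superstep s, h_s the maximum number of messages a process sends or receives, g the cost per message, and \ell the barrier latency, the total running time is

    T = \sum_{s} \left( w_s + g\,h_s + \ell \right)

so every superstep is as slow as its slowest process, which is why balanced partitions matter.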
10. Why not implement Giraph with MapReduce jobs?
Too much disk, not enough memory: each superstep would become a separate job.
11. Who does what?
Master: responsible for coordination;
assigns partitions to the workers;
coordinates synchronization.
Worker: responsible for vertices;
invokes the active vertices;
sends, receives, and assigns messages.
12. Supersteps
Each vertex has an id, a value, and a list of adjacent neighbors with their corresponding arcs.
Each vertex is invoked in every superstep, where it can recompute its value and send messages to other nodes.
Combiners, aggregators, and topology mutations are also supported.
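To make the vertex-centric model concrete, here is a minimal sketch of a single-source shortest-paths computation in the style of Giraph's bundled SimpleShortestPathsComputation (the example run later in this tutorial), written against the Giraph 1.x BasicComputation API. The class name is illustrative; the source vertex is fixed to id 1, consistent with the results shown on slide 25, and the types match the JsonLongDoubleFloatDouble input format used below.

import java.io.IOException;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each vertex keeps the best distance seen so far and propagates
// improvements to its neighbors.
public class ShortestPathsSketch extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1; // illustrative fixed source vertex

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    // Superstep 0: initialize every vertex to "infinity".
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Best distance offered by any incoming message (0 at the source).
    double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // If we improved, adopt the new distance and offer it to all neighbors.
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    // A vertex with no pending improvement goes to sleep; it wakes up only
    // if a new message arrives. The job ends when all vertices have halted.
    vertex.voteToHalt();
  }
}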
13. Master-worker architecture
The vertices are partitioned and assigned to workers.
The master assigns and coordinates, while the workers execute the vertices and communicate among themselves.
15. Installing Giraph on Ubuntu
Type the following commands:
sudo apt-get install git
sudo apt-get install maven
cd /usr/local
sudo git clone https://github.com/apache/giraph.git
sudo chown -R user:hadoop giraph
Then edit the .bashrc file:
gedit $HOME/.bashrc
and add the following line:
export GIRAPH_HOME=/usr/local/giraph
16. Installing Giraph on Ubuntu
source $HOME/.bashrc
cd $GIRAPH_HOME
mvn package -DskipTests
-DskipTests skips the test phase; the build will still take a while.
18. Installing Giraph on Cloudera
Edit the .bashrc file:
export GIRAPH_HOME=/usr/local/giraph
export GV=1.0.0
export PRO=hadoop_2.0.0
Then load the variables:
source ~/.bashrc
21. Preparing the data
The first step is to put our graph on HDFS under the name tiny_graph.txt:
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
Each line is [vertexId, vertexValue, [[targetVertexId, edgeWeight], ...]]; for example, the first line describes vertex 0, with initial value 0, and edges to vertex 1 (weight 1) and vertex 3 (weight 3).
22. Create the input and output directories on HDFS:
hdfs dfs -mkdir ginput
hdfs dfs -mkdir goutput
Then copy tiny_graph.txt to HDFS:
hdfs dfs -put tiny_graph.txt ginput
Next, download an example input graph (shortestPathsInputGraph):
sudo yum install wget
wget http://ece.northwestern.edu/~aching/shortestPathsInputGraph.tar.gz
tar zxvf shortestPathsInputGraph.tar.gz
hdfs dfs -put shortestPathsInputGraph ginput
24. org.apache.giraph.GiraphRunner
The class used to launch the examples.
org.apache.giraph.examples.SimpleShortestPathsComputation
In this case, it will compute the shortest paths.
-ca mapred.job.tracker=headnodehost:9010
The cluster's head node.
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
Input format to use for the input data.
-vip ginput/tiny_graph.txt
Input data file.
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
Output format.
-op /example/output/shortestpaths
Output location.
-w 2
The number of workers.
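Putting the pieces together, the job is launched as a single hadoop jar command; the examples jar path below is an assumption (the actual file name depends on the Giraph version and Hadoop profile you built):

hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-*-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -ca mapred.job.tracker=headnodehost:9010 \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip ginput/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /example/output/shortestpaths \
  -w 2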
25. Once the job is finished, view the output files:
hdfs dfs -text /example/output/shortestpaths/
We get the following result, where each line is a vertex id followed by its shortest distance from the source vertex:
0 1.0
4 5.0
2 2.0
1 0.0
3 1.0
26. You will be able to see all of these services in action in more detail in our next presentation, which will cover using HDInsight on Azure.