Hadoop Summit, Cascading
Paco Nathan
Hadoop Summit 2009, Cascading dev talk
Technology
More from Paco Nathan
Strata CA 2018-03-08 https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223 Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human in the loop: a design pattern for managing teams working with ML
Paco Nathan
Strata Singapore 2017 session talk 2017-12-06 https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611

Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models. This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects, including available open source projects, as well as management perspectives for how to apply HITL:

* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?

In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-the-loop: a design pattern for managing teams that leverage ML
Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML. Big Data Spain, 2017-11-16 https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml

Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models. This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:

* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?

In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML
Paco Nathan
JupyterCon NY 2017-08-24 https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies. The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc. Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases. This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
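The routing step described in the JupyterCon abstract above (run automated until an edge case appears, then refer the judgement to a human expert) might be sketched as follows. This is a minimal illustration under stated assumptions, not the `nbtransom` API: `HitlRouter`, `toy_model`, and the 0.8 confidence threshold are hypothetical names and values introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class HitlRouter:
    """Hypothetical helper: accept confident predictions, queue the rest for review."""

    predict: Callable[[str], Dict[str, float]]  # item -> label probabilities
    threshold: float = 0.8  # assumed cut-off, chosen only for illustration
    review_queue: List[str] = field(default_factory=list)
    labeled: List[Tuple[str, str]] = field(default_factory=list)

    def process(self, item: str) -> Optional[str]:
        """Return a label when the model is confident, else queue the item."""
        probs = self.predict(item)
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= self.threshold:
            return label
        self.review_queue.append(item)  # edge case: refer to a human expert
        return None

    def record_judgement(self, item: str, label: str) -> None:
        """Store the expert's decision as a new training example."""
        self.review_queue.remove(item)
        self.labeled.append((item, label))


def toy_model(item: str) -> Dict[str, float]:
    """Stand-in classifier: confident only when the text mentions 'code'."""
    if "code" in item.lower():
        return {"programming": 0.95, "zoology": 0.05}
    return {"programming": 0.55, "zoology": 0.45}


router = HitlRouter(predict=toy_model)
router.process("python code sample")   # confident, labeled automatically
router.process("python in the wild")   # ambiguous, lands in review_queue
router.record_judgement("python in the wild", "zoology")
```

The examples accumulated in `labeled` are what a later training iteration would consume, which is how experts can shape the pipeline through examples alone.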
Nike Tech Talk, Portland, 2017-08-10 https://niketechtalks-aug2017.splashthat.com/ O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner. This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon. Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do. In particular, we'll show two open source projects in Python from O'Reilly's AI team:

• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Humans in the loop: AI in open source and industry
Paco Nathan
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx. https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859 https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Computable Content
Paco Nathan
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
Computable Content: Lessons Learned
Paco Nathan
See 2020 update: https://derwen.ai/s/h88s SF Python Meetup, 2017-02-08 https://www.meetup.com/sfpython/events/237153246/ PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than the results of simple keyword extraction. PyTextRank integrates use of `TextBlob` and `spaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
SF Python Meetup: TextRank in Python
Paco Nathan
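To illustrate the graph-ranking idea behind the TextRank abstract above, here is a minimal pure-Python sketch: it links words that co-occur within a small window and scores them with a PageRank-style iteration. It is not the PyTextRank API and skips the parsing, part-of-speech filtering, and phrase assembly a real implementation would do; the function name `textrank_keywords` and its default parameters are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def textrank_keywords(
    tokens: List[str],
    window: int = 2,
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Score words by PageRank over an undirected co-occurrence graph."""
    # Link each word to the distinct words that follow it within the window.
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return {}

    n = len(neighbors)
    ranks = {word: 1.0 / n for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            # Each neighbor shares its rank equally among its own links.
            incoming = sum(ranks[v] / len(neighbors[v]) for v in neighbors[word])
            updated[word] = (1.0 - damping) / n + damping * incoming
        ranks = updated
    return ranks


text = "graph algorithm ranks keyphrases from texts using a graph of words"
scores = textrank_keywords(text.split())
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
```

Words with many distinct co-occurring neighbors accumulate rank, which is why well-connected terms rise to the top of `top_words`.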
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Use of standards and related issues in predictive analytics
Paco Nathan
A keynote presentation for Big Data Spain 2015 in Madrid, 2015-10-15 http://www.bigdataspain.org/program/
Data Science in 2016: Moving Up
Paco Nathan
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose. http://meetup.com/SF-Bay-ACM/events/221693508/ Project Jupyter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as fundamental a change in software practice as the introduction of spreadsheets. O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe that provides a kind of "media player" for embedding the containerized notebooks into web pages.
Data Science Reinvents Learning?
Paco Nathan
PyData Seattle 2015 sponsored talk about O'Reilly Learning
Jupyter for Education: Beyond Gutenberg and Erasmus
Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/ Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
GalvanizeU Seattle: Eleven Almost-Truisms About Data
Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579 In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities. Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to find out insights such as:

* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?

This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
Microservices, containers, and machine learning
Paco Nathan
GraphX: Graph analytics for insights about developer communities
Paco Nathan
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472 Big Brains meetup hosted by BloomReach, 2015-06-04 Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
Graph Analytics in Spark
Paco Nathan
Keynote presentation at Universidade da Coruña on 2015-05-27 for the Apache Spark tutorial
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26 http://qconsp.com/presentation/real-time-analytics-spark-streaming This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale. The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale. We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan
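As one illustration of the sketch algorithms and probabilistic data structures mentioned in the QCon abstract above, here is a small Count-Min Sketch: it estimates item frequencies in a stream using a fixed-size table, so memory stays constant and estimates can only overshoot the true count. The abstract does not say which sketches the talk covered, so this choice is an assumption, and the class name, `width`, and `depth` defaults are hypothetical values picked for the example.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Fixed-memory frequency estimator; estimates never undercount."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        """Increment one counter per row for this item."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        """Take the minimum across rows to limit the effect of collisions."""
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for event in ["spark", "storm", "spark", "spark", "storm", "spark", "spark"]:
    sketch.add(event)
spark_estimate = sketch.estimate("spark")  # at least the true count of 5
```

Widening the table lowers the expected overestimate at the cost of memory, which is the kind of accuracy-for-footprint trade-off the abstract describes.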
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
Data Day Texas 2015 keynote talk http://datadaytexas.com/
A New Year in Data Science: ML Unpaused
Paco Nathan
Contenu connexe
Plus de Paco Nathan
Strata CA 2018-03-08 https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223 Although it has long been used for has been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing teams working with ML
Paco Nathan
Strata Singapore 2017 session talk 2017-12-06 https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611 Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models. This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL: * When is HITL indicated vs. when isn’t it applicable? * How do HITL approaches compare/contrast with more “typical” use of Big Data? * What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning? * Experiences training and managing a team which uses HITL at scale * Caveats to know ahead of time: * In what ways do the humans involved learn from the machines? * In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/ for implementation).
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML Big Data Spain, 2017-11-16 https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models. This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL: * When is HITL indicated vs. when isn't it applicable? * How do HITL approaches compare/contrast with more "typical" use of Big Data? * What's the relationship between use of HITL and preparing an organization to leverage Deep Learning? * Experiences training and managing a team which uses HITL at scale * Caveats to know ahead of time * In what ways do the humans involved learn from the machines? In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/ for implementation).
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
Paco Nathan
JupyterCon NY 2017-08-24 https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies. The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts training the ML pipelines purely through examples, not feature engineering, model parameters, etc. Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases. This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
Nike Tech Talk, Portland, 2017-08-10 https://niketechtalks-aug2017.splashthat.com/ O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner. This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon. Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do. In particular, we'll show two open source projects in Python from O'Reilly's AI team: • pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics • nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
Paco Nathan
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx. https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859 https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Computable Content
Computable Content
Paco Nathan
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
Computable Content: Lessons Learned
Computable Content: Lessons Learned
Paco Nathan
See 2020 update: https://derwen.ai/s/h88s SF Python Meetup, 2017-02-08 https://www.meetup.com/sfpython/events/237153246/ PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
Paco Nathan
A keynote presentation for Big Data Spain 2015 in Madrid, 2015-10-15 http://www.bigdataspain.org/program/
Data Science in 2016: Moving Up
Data Science in 2016: Moving Up
Paco Nathan
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose. http://meetup.com/SF-Bay-ACM/events/221693508/ Project Jupiter https://jupyter.org/ evolved from IPython notebooks, and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools used in Data Science, providing convenient packages for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as large of a fundamental change in software practice as the introduction of spreadsheets. O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe that provides a kind of "media player" for embedding the containerized notebooks into web pages
Data Science Reinvents Learning?
Data Science Reinvents Learning?
Paco Nathan
PyData Seattle 2015 sponsored talk about O'Reilly Learning
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/ Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, through probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study about the technologies, the processes, and the people involved.
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579 In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities. Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.), gets containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to find out insights such as: * What are the trending topic summaries? * Who are the leaders in the community for various topics? * Who discusses most frequently with whom? This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank The talk also illustrates best practices for leveraging functional programming for big data.
Microservices, containers, and machine learning
Paco Nathan
GraphX: Graph analytics for insights about developer communities
Paco Nathan
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472 Big Brains meetup hosted by BloomReach, 2015-06-04. Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities, based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results, such as leaderboards, are used as feedback for the respective developer communities. As an example, we will examine analysis of the Spark developer community itself.
Graph Analytics in Spark
Paco Nathan
Keynote presentation at Universidade da Coruña on 2015-05-27 for the Apache Spark tutorial
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26 http://qconsp.com/presentation/real-time-analytics-spark-streaming This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale. The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale. We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
Data Day Texas 2015 keynote talk http://datadaytexas.com/
A New Year in Data Science: ML Unpaused
Paco Nathan
More from Paco Nathan (20)
Human in the loop: a design pattern for managing teams working with ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in the loop: AI in open source and industry
Computable Content
Computable Content: Lessons Learned
SF Python Meetup: TextRank in Python
Use of standards and related issues in predictive analytics
Data Science in 2016: Moving Up
Data Science Reinvents Learning?
Jupyter for Education: Beyond Gutenberg and Erasmus
GalvanizeU Seattle: Eleven Almost-Truisms About Data
Microservices, containers, and machine learning
GraphX: Graph analytics for insights about developer communities
Graph Analytics in Spark
Apache Spark and the Emerging Technology Landscape for Big Data
QCon São Paulo: Real-Time Analytics with Spark Streaming
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
A New Year in Data Science: ML Unpaused