BigScience
A one-year research workshop on large multilingual
datasets and large language models
— original slides by Suzana Ilić from HuggingFace @suzatweet —
Gérard DUPONT
Research scientist/engineer
working on NLP, IR, ML, RL and
large scale data processing
@ggdupont
Many recent developments in NLP stem from
training larger language models on larger
datasets with compute resources typically
only available in industry.
Brown et al. (2020): Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
https://hellofuture.orange.com/en/the-gpt-3-language-model-revolution-or-evolution/
https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
https://lair.lighton.ai/akronomicon/
Issues and questions
● Research:
○ Models are not designed as general research tools (no access to training data, private
models, research questions asked only after the model is trained, Anglo-centric models)
○ Academic researchers find it difficult to get involved
○ Lack of disciplinary diversity in the research teams building them (teams are small)
● Environmental:
○ Training parallel models in private settings => duplication of energy requirements
○ Carbon footprint neither documented nor taken into account
● Ethical and societal:
○ Shortcomings in the text corpora used to train these models, ranging from
non-representativeness of populations to a predominance of potentially harmful
stereotypes or the inclusion of personally-identifying information
○ Ethical/bias/usage questions are usually asked a posteriori
The BigScience approach
The Large Hadron Collider is a particle physics research tool that
- has involved 10,000 researchers
- from 100 countries
- led to the discovery of 59 hadrons
- led to the publication of more than 2,800 papers (😱)
In many scientific fields (epidemiology, space, fusion…), large-scale and worldwide research
collaborations create tools useful for the entire research community, like the LHC, ITER,
ISS…
Isn’t it time to build similar large, diverse, open research collaborations in AI/NLP as well?
Large scale public compute infrastructure exists
Jean Zay supercomputer at IDRIS (South of Paris, France)
● Cumulative peak performance of 28 Pflop/s with a total of 2696 Nvidia V100 GPUs
● Omni-Path interconnection network at 100 Gb/s: 4 links per converged node
● Parallel storage with a capacity of 2.2 PB on SSDs (GridScaler GS18K SSD)
Short history
- 🐣 Early 2021: Discussions between Thomas Wolf (HuggingFace), Stéphane
Requena (GENCI) and Pierre-François Lavallée (IDRIS)
- Very quickly: HF + the French academic and industrial AI and NLP
research communities joined the discussion
- 📝 February 2021: Grant application for 5 million GPU hours
- 🌐 Following the grant submission
- open/extend to international research community
- organization of the project with the structure of a research workshop
- 🚀 19/04 & 28/04: Grant accepted - Kickoff event - officially started
Concept
- Gather a large research community:
- consider in advance the research questions that would be interesting to answer
- ask questions ‘a priori’ rather than ‘a posteriori’ as much as possible
- reflect on and prepare the tools needed to answer these questions
- Create and share research artifacts with the scientific community:
- a very large multilingual corpus constituted in a way that is responsible, diverse, and mindful of
ethical and legal issues
- a very large multilingual language model exhibiting non-trivial zero-shot behaviors, released in a way
that makes it accessible to researchers
- code tools associated with these artifacts for simple use
- Find and share processes, documents and infrastructures favoring the
replication of such scientific collaborative efforts in the future
Where are we now?
The largest AI research collaboration to date. More than 800 researchers from 60
different countries and more than 250 institutions have joined BigScience.
A mega-collaboration
Building and investigating the
model from all angles: bias,
social impact, ethics,
capabilities, limitations and
potential improvements,
specific domain performances,
carbon impact, general
AI/cognitive research
landscape
🌕🚀 Data Working Group
A Large Multilingual Dataset for a Large Multilingual Model
● Data Governance and Archival Strategies
○ Defining a management and ownership structure for the dataset
○ Scoping out legal concerns and societal impact of data choices
● Privacy
● Ethical and Legal Scholarship
● Data Sourcing and Representativeness
○ Defining a set of languages and text sources, as well as
frameworks for representativeness / diversity
○ Exploring different modes of data collection from web crawling to
participatory methods and collaboration with existing data orgs
● Data Tooling
○ Developing tools to gather and process text from the identified
sources to be both easy to use at training time and respectful of
the data subjects’ rights
🛠 Data Tooling
Icons made by Smashicons, Kiranshastry, Pixel perfect, Freepik from Flaticon
Data sources
● Books: Gutenberg
● Web-crawled data: OSCAR
● Spoken text: OpenSubtitles, Europarl, INA collection
Pipeline
● 📥 Ingest
○ We will train the final model on many distinct data sources. For the
proof of concept we include spoken text, books and web-crawled data.
● 🔖 Index, Interconnect & Persist
○ We build a specific connector between Hugging Face datasets and
Elasticsearch, SQL and memmap backends to simplify indexing and usage.
● ⚙ Augment & Transform (Document Classifier)
○ We need the ability to dynamically run classifiers on the corpus and add
features/columns. This information may be used when exporting a subset
for training. It permits deduplication, language detection, PII masking
and metadata detection.
● 📤 Filter & Export
○ We export a dataset subsample in the corresponding jsonl format, with
one file per document. This output is used for final training on the
Jean Zay supercomputer.
● Visualize & Explore (Dashboards)
○ Allows the exploration of a dataset and the augmented features to better
understand the samples and biases in the data.
● Govern (OAuth & logs)
○ Allows us to fulfill ethical and legal duties.
https://github.com/bigscience-workshop/data-tooling
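The augment, filter and export stages described on this slide can be illustrated with a small standard-library sketch. It is not the code from the data-tooling repository: the function names (`mask_pii`, `augment`, `deduplicate`, `export_jsonl`), the e-mail regex, the `<EMAIL>` placeholder and the 20-character threshold are all assumptions made for illustration, and language detection is left out.

```python
import hashlib
import json
import re

# Assumption: only e-mail addresses are treated as PII in this toy example.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_pii(text):
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def augment(record):
    """Return a copy of the record with masked text and extra feature columns."""
    text = mask_pii(record["text"])
    normalized = text.strip().lower()
    return {
        **record,
        "text": text,
        "n_chars": len(text),
        "text_hash": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
    }


def deduplicate(records):
    """Yield only the first record seen for each normalized-text hash."""
    seen = set()
    for record in records:
        if record["text_hash"] not in seen:
            seen.add(record["text_hash"])
            yield record


def export_jsonl(records, min_chars=20):
    """Serialize records that pass the length filter, one JSON object per line."""
    kept = (r for r in records if r["n_chars"] >= min_chars)
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in kept)


# Made-up documents standing in for the three proof-of-concept source types.
corpus = [
    {"source": "books", "text": "Call me Ishmael. Some years ago, never mind how long."},
    {"source": "web", "text": "Contact me at jane.doe@example.org for the full report."},
    {"source": "web", "text": "contact me at jane.doe@example.org for the full report. "},
    {"source": "subtitles", "text": "Hi."},
]

augmented = [augment(r) for r in corpus]
unique = list(deduplicate(augmented))
jsonl = export_jsonl(unique)
print(jsonl)
```

Keeping the derived features as extra columns next to the text, rather than filtering immediately, is what lets the export step pick a different subset later without re-running the classifiers.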
🌕🚀 Working Group on “Engineering/scaling”
A working group discussing
● the technical challenges of training at scale on several hundred GPUs, and
● how to make the best use of the (very large) compute budget we have
The compute budget is given in hours of GPU usage (5 million GPU hours).
Depending on the scaling efficiency (how much idle time each GPU has), (1) the overall duration of the
training and (2) the actual FLOPS achieved can vary considerably.
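A rough back-of-the-envelope sketch of this dependency follows. Only the 5 million GPU-hour budget comes from the slides; the number of GPUs reserved, the per-GPU peak throughput and the utilization levels are assumed values chosen for illustration.

```python
# Only the budget is from the slides; everything marked "assumed" is illustrative.
GPU_HOURS_BUDGET = 5_000_000


def wall_clock_days(gpu_hours, n_gpus):
    """Days of wall-clock time if n_gpus are reserved until the budget is spent."""
    return gpu_hours / n_gpus / 24


def useful_flop(gpu_hours, peak_flops_per_gpu, utilization):
    """Floating-point operations actually spent on training.

    utilization is the fraction of peak throughput achieved; idle GPUs
    still consume GPU hours but contribute no useful work.
    """
    return gpu_hours * 3600 * peak_flops_per_gpu * utilization


ASSUMED_N_GPUS = 1024            # assumed size of the reservation
ASSUMED_PEAK_FLOPS = 125e12      # assumed mixed-precision peak per GPU, in FLOP/s

days = wall_clock_days(GPU_HOURS_BUDGET, ASSUMED_N_GPUS)
print(f"{ASSUMED_N_GPUS} GPUs -> about {days:.0f} days of wall-clock time")

for utilization in (0.2, 0.4):   # assumed low and high scaling efficiency
    total = useful_flop(GPU_HOURS_BUDGET, ASSUMED_PEAK_FLOPS, utilization)
    print(f"utilization {utilization:.0%}: {total:.2e} useful FLOP")
```

Under these assumptions the same budget buys twice as much training compute at 40% utilization as at 20%, which is why scaling efficiency matters as much as the raw number of GPU hours.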
This Working Group will collaborate with the modeling team on one hand and with the scaling teams
from NVIDIA/Microsoft/Facebook to ensure that the model is implemented in the most efficient way.
Note that participating in this working group does not imply that you will have direct access to the supercomputer since
there are additional (quite strong) national restrictions on the access to this machine (see some details in the section
on access to compute here). It does mean however that you will participate in the discussion on these aspects.
🌕🚀 First Modeling paper
Multitask Prompted Training Enables Zero-Shot Task
Generalization by Sanh et al. (2021)
T0 shows zero-shot task generalization on English natural
language prompts, outperforming GPT-3 on many tasks, while
being 16x smaller!
To create T0, we fine-tuned T5 on a multi-task mixture of
prompted datasets from Promptsource. When evaluated on
zero-shot tasks, we found that it matched or exceeded GPT-3's
performance on 9 of 11 datasets.
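The idea of a multi-task mixture of prompted datasets can be sketched in a few lines. The templates below are hypothetical and are not taken from Promptsource; `TEMPLATES`, `apply_template` and the two example records are made up to show how differently structured tasks are turned into plain (input text, target text) pairs for a text-to-text model.

```python
# Hypothetical prompt templates, written for illustration only.
TEMPLATES = {
    "nli": {
        "input": 'Suppose "{premise}" Can we infer that "{hypothesis}"? Yes, no, or maybe?',
        "target": "{label}",
    },
    "sentiment": {
        "input": "Review: {text}\nIs this review positive or negative?",
        "target": "{label}",
    },
}


def apply_template(task, example):
    """Render one raw example as a natural-language (input, target) pair."""
    template = TEMPLATES[task]
    return template["input"].format(**example), template["target"].format(**example)


# Made-up raw examples from two different tasks.
raw_examples = [
    ("nli", {"premise": "A man is sleeping.",
             "hypothesis": "A person is asleep.",
             "label": "yes"}),
    ("sentiment", {"text": "Great film, would watch again.",
                   "label": "positive"}),
]

# The mixture is a single list of text pairs, regardless of the source task.
mixture = [apply_template(task, example) for task, example in raw_examples]

for prompt, target in mixture:
    print(prompt)
    print("->", target)
```

Because every task ends up in the same text-to-text form, a held-out task can be evaluated zero-shot simply by writing a prompt for it.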
Model: https://huggingface.co/bigscience/T0pp
Repo: https://github.com/bigscience-workshop/promptsource
Paper: https://arxiv.org/abs/2110.08207
What’s coming up next?
● Finished training the first test model: a 13B English
decoder-only model trained to investigate instabilities at
large scale; currently training a second model and planning
the first large-scale multilingual model
● Several papers submitted
● Several hackathons (ongoing and upcoming)
● Working towards the main model training
To learn more about the effort and join or follow:
● Website: bigscience.huggingface.co
● Twitter: @BigScienceW
● YouTube: BigScienceResearchWorkshop
Thank you!

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Tds — big science dec 2021

  • 1. BigScience A one-year research workshop on large multilingual datasets and large language models — original slides by Suzana Ilić from HuggingFace @suzatweet —
  • 2. BigScience A one-year research workshop on large multilingual datasets and large language models — original slides by Suzana Ilić from HuggingFace @suzatweet — Gérard DUPONT Research scientist/engineer working on NLP, IR, ML, RL and large scale data processing @ggdupont
  • 3. Many recent developments in NLP stem from training larger language models on larger datasets with compute resources typically only available in industry. Brown et al. (2020): Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165 https://hellofuture.orange.com/en/the-gpt-3-language-model-revolution-or-evolution/
  • 5. Issues and questions ● Research: ○ Models not designed as general research tools (lack of access to training data, private models, research questions asked after the model is trained, anglo-centric models) ○ Academic researchers are difficult to involve ○ Lack of disciplinary diversity in the research teams building them (limited size of the teams) ● Environmental: ○ Training parallel models in private settings => duplication of energy requirements ○ Carbon footprint not documented/taken into account ● Ethical and societal: ○ Shortcomings in the text corpora used to train these models, ranging from non-representativeness of populations to a predominance of potentially harmful stereotypes or the inclusion of personally identifying information ○ Ethical/bias/usage questions are usually asked a posteriori
  • 6. The BigScience approach The Large Hadron Collider is a particle physics research tool that - has involved 10,000 researchers - from 100 countries - led to the discovery of 59 hadrons - and to the publication of more than 2,800 papers (😱) In many scientific fields (epidemiology, space, fusion…), large-scale and worldwide research collaborations create tools useful for the entire research community, like the LHC, ITER, ISS… Isn’t it time to build similar large, diverse, open research collaborations in AI/NLP as well?
  • 7. Large-scale public compute infrastructure exists Jean Zay supercomputer at IDRIS (south of Paris, France) ● Cumulative peak performance of 28 Pflop/s with a total of 2696 Nvidia V100 GPUs ● Omni-Path interconnection network at 100 Gb/s, with 4 links per converged node ● Parallel storage device with a capacity of 2.2 PB on SSD disks (GridScaler GS18K SSD)
  • 8. Short history - 🐣 Early 2021: Discussions between Thomas Wolf (HuggingFace), Stéphane Requena (GENCI) and Pierre-François Lavallée (IDRIS) - Very quickly: HuggingFace and the French academic and industrial AI and NLP research communities joined the discussion - 📝 February 2021: Grant application for 5 million GPU hours - 🌐 Following the grant submission: the project was opened and extended to the international research community and organized with the structure of a research workshop - 🚀 19/04 & 28/04: Grant accepted - Kickoff event - officially started
  • 9. Concept - Gather a large research community: - consider in advance the research questions that would be interesting to answer - ask questions ‘a priori’ rather than ‘a posteriori’ as much as possible - reflect on and prepare the tools needed to answer these questions - Create and share research artifacts with the scientific community: - a very large multilingual corpus constituted in a way that is responsible, diverse, and mindful of ethical and legal issues - a very large multilingual language model exhibiting non-trivial zero-shot behaviors, released in a way that makes it accessible to researchers - code tools associated with these artifacts for simple use - Find and share processes, documents and infrastructures favoring the replication of such scientific collaborative efforts in the future
  • 10. Where are we now? The largest AI research collaboration to date. More than 800 researchers from 60 different countries and more than 250 institutions have joined BigScience.
  • 11. A mega-collaboration Building and investigating the model from all angles: bias, social impact, ethics, capabilities, limitations and potential improvements, specific domain performances, carbon impact, general AI/cognitive research landscape
  • 12. 🌕🚀 Data Working Group A Large Multilingual Dataset for a Large Multilingual Model ● Data Governance and Archival Strategies ● Defining a management and ownership structure for the dataset ● Scoping out legal concerns and societal impact of data choices ● Privacy ● Ethical and Legal Scholarship ● Data Sourcing and Representativeness ● Defining a set of languages and text sources, as well as frameworks for representativeness / diversity ● Exploring different modes of data collection from web crawling to participatory methods and collaboration with existing data orgs ● Data Tooling ● Developing tools to gather and process text from the identified sources to be both easy to use at training time and respectful of the data subjects’ rights
  • 13. 🛠 Data Tooling https://github.com/bigscience-workshop/data-tooling ● 📥 Ingest: We will train the final model on many distinct data sources. For the proof of concept we include spoken text (OpenSubtitles, Europarl, INA collection), books (Gutenberg) and web-crawled data (Oscar). ● 🔖 Index, Interconnect & Persist: We build a specific connector between Hugging Face datasets and Elasticsearch, SQL and memmap backends to simplify indexing and usage. ● ⚙ Augment & Transform (document classifier): We need the ability to dynamically run classifiers on the corpus and add features/columns. This information may be used when exporting a subset for training. Permits deduplication, language detection, PII masking and metadata detection. ● 📤 Filter & Export: We export a dataset subsample in the corresponding jsonl format, with one file per document. This output is used for final training on the Jean Zay supercomputer. ● Visualize & Explore (dashboards): Allows the exploration of a dataset and the augmented features to better understand the samples and biases in the data. ● Govern (OAuth & logs): Allows us to fulfill ethical and legal duties. Icons made by Smashicons, Kiranshastry, Pixel perfect, Freepik from Flaticon
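The augment, filter and export steps on slide 13 can be illustrated with a minimal sketch. This is not the API of the bigscience-workshop/data-tooling repository: the function names (`mask_pii`, `dedupe`, `to_jsonl`), the e-mail pattern and the sample documents are all hypothetical, and only exact duplicates and e-mail addresses are handled. The slide mentions one file per document; for brevity the sketch builds a single JSON Lines string instead.

```python
import hashlib
import json
import re

# Hypothetical, deliberately simple e-mail pattern used to illustrate PII masking.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_pii(text):
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def dedupe(docs):
    """Drop exact duplicates, comparing lowercased, whitespace-normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def to_jsonl(docs):
    """Serialize documents as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in docs)


# Made-up sample documents, one per source type named on the slide.
docs = [
    {"source": "books", "text": "Contact me at jane@example.org for the manuscript."},
    {"source": "web", "text": "contact me at  jane@example.org for the manuscript."},
    {"source": "spoken", "text": "Bonjour à tous."},
]

cleaned = [{**doc, "text": mask_pii(doc["text"])} for doc in dedupe(docs)]
jsonl = to_jsonl(cleaned)
```

Deduplicating before masking keeps the comparison on the original text; a real pipeline would likely also need near-duplicate detection and a much broader set of PII patterns.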
  • 14. 🌕🚀 Working Group on “Engineering/scaling” A working group discussing ● the technical challenges of training at scale on several hundred GPUs, and ● how to make the best use of the (very large) compute budget we have The compute budget is given in hours of GPU usage (5 million GPU hours). Depending on the scaling efficiency (how much idle time each GPU has), (1) the overall duration of the training and (2) the FLOPS actually achieved can vary very significantly. This Working Group will collaborate with the modeling team on the one hand and with the scaling teams from NVIDIA/Microsoft/Facebook on the other to ensure that the model is implemented in the most efficient way. Note that participating in this working group does not imply that you will have direct access to the supercomputer, since there are additional (quite strong) national restrictions on access to this machine (see the section on access to compute for some details). It does mean, however, that you will participate in the discussion on these aspects.
  • 15. 🌕🚀 Working Group on “Engineering/scaling”
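The compute-budget reasoning on slide 14 can be made concrete with a rough sketch. Only the 5 million GPU hours figure comes from the slides. The GPU count (512), the utilization levels (30% and 50%) and the per-GPU peak (125 TFLOP/s, the nominal mixed-precision tensor-core peak of a V100) are illustrative assumptions, and the function names are hypothetical.

```python
def wall_clock_days(gpu_hours, n_gpus):
    """Days needed to spend the budget when n_gpus are allocated continuously."""
    return gpu_hours / n_gpus / 24


def useful_flop(gpu_hours, peak_flops_per_gpu, utilization):
    """Total floating-point operations performed at a given utilization (0 to 1)."""
    return gpu_hours * 3600 * peak_flops_per_gpu * utilization


def gpu_hours_needed(target_flop, peak_flops_per_gpu, utilization):
    """GPU hours required to reach a fixed amount of compute at a given utilization."""
    return target_flop / (peak_flops_per_gpu * utilization) / 3600


BUDGET_GPU_HOURS = 5_000_000   # from the slides
N_GPUS = 512                   # assumption: "several hundred GPUs"
PEAK_FLOPS = 125e12            # assumption: nominal V100 tensor-core peak, FLOP/s

days = wall_clock_days(BUDGET_GPU_HOURS, N_GPUS)
flop_at_30 = useful_flop(BUDGET_GPU_HOURS, PEAK_FLOPS, 0.30)
flop_at_50 = useful_flop(BUDGET_GPU_HOURS, PEAK_FLOPS, 0.50)

# The same training run costs fewer GPU hours when utilization is higher.
hours_at_50 = gpu_hours_needed(flop_at_30, PEAK_FLOPS, 0.50)
```

Under these assumptions, moving from 30% to 50% utilization yields two thirds more useful compute for the same budget, or equivalently lets the same run finish in 60% of the GPU hours.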
  • 16. 🌕🚀 First Modeling paper Multitask Prompted Training Enables Zero-Shot Task Generalization by Sanh et al. (2021) T0 shows zero-shot task generalization on English natural language prompts, outperforming GPT-3 on many tasks, while being 16x smaller! To create T0, we fine-tuned T5 on a multi-task mixture of prompted datasets from Promptsource. When evaluated on zero-shot tasks, we found that it matched or exceeded GPT-3's performance on 9 of 11 datasets. Model: https://huggingface.co/bigscience/T0pp Repo: https://github.com/bigscience-workshop/promptsource Paper: https://arxiv.org/abs/2110.08207
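To give an idea of what a "multi-task mixture of prompted datasets" looks like, here is a minimal sketch that turns labeled examples from two tasks into natural-language (input, target) text pairs and shuffles them together. It does not use the Promptsource API: the templates, label conventions, function names and examples are all hypothetical and chosen only for illustration.

```python
import random

# Assumed label conventions for the two illustrative tasks.
NLI_CHOICES = ["Yes", "Maybe", "No"]            # label 0 / 1 / 2
SENTIMENT_CHOICES = ["negative", "positive"]    # label 0 / 1


def nli_prompt(ex):
    """Render a natural language inference example as an (input, target) pair."""
    source = (
        f'{ex["premise"]}\n'
        f'Question: does this imply that "{ex["hypothesis"]}"? Yes, no, or maybe?'
    )
    return source, NLI_CHOICES[ex["label"]]


def sentiment_prompt(ex):
    """Render a sentiment example as an (input, target) pair."""
    source = f'Review: {ex["text"]}\nIs this review positive or negative?'
    return source, SENTIMENT_CHOICES[ex["label"]]


def build_mixture(tasks, seed=0):
    """Apply each task's template to its examples and shuffle everything together."""
    pairs = [template(ex) for examples, template in tasks for ex in examples]
    random.Random(seed).shuffle(pairs)
    return pairs


# Made-up examples for illustration.
nli_examples = [
    {"premise": "A dog is running in the park.",
     "hypothesis": "An animal is outside.", "label": 0},
]
sentiment_examples = [
    {"text": "A wonderful, moving film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
]

mixture = build_mixture([(nli_examples, nli_prompt),
                         (sentiment_examples, sentiment_prompt)])
```

Because every task is reduced to plain text in and plain text out, a single text-to-text model can be fine-tuned on the whole mixture and then queried with a prompt for a task it has not seen.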
  • 17.
  • 18. What’s coming up next? ● Finished training the first test model, a 13B English decoder-only model trained to investigate instabilities at large scale; currently training a second model and planning the first large-scale multilingual model ● Several papers submitted ● Several hackathons (ongoing and upcoming) ● Working towards the main model training
  • 19. To learn more about the effort and join or follow: ● Website: bigscience.huggingface.co ● Twitter: @BigScienceW ● YouTube: BigScienceResearchWorkshop Thank you!