SlideShare une entreprise Scribd logo
1  sur  41
Télécharger pour lire hors ligne
WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Mirko Bernardoni & Michael Seddon
Clifford Chance
Revolutionizing the Legal
Industry with Spark, NLP
and Azure Databricks
#UnifiedDataAnalytics #SparkAISummit
3
Mirko’s main role is to build and lead the data science lab.
He strongly believes that the work that Clifford Chance is doing in the
research department offers a unique opportunity to shape and change the
legal sector which is why he also work closely with universities for research.
About Mirko Bernardoni
MIRKO BERNARDONI
Head of Data Science
Clifford Chance
https://www.linkedin.com/i/mirko-bernardoni
4
Michael’s focus is on solving Clifford Chance’s data engineering challenges.
As a core member of the team, he has been involved in the design and development of the
data pipelines crucial to the lab’s success, as well as delivering numerous machine learning
projects helping to reshape Clifford Chance’s operational insights and execution.
About Michael Seddon
MICHAEL SEDDON
Senior Machine Learning Engineer
Clifford Chance
https://www.linkedin.com/in/mrmichaelseddon/
We are one of the world’s
pre-eminent law firms with
significant depth and range of
resources across five continents
Clifford Chance
As a global law firm we are able to support
clients at both a local and international level
across Europe, Asia Pacific, the Americas,
the Middle East and Africa.
Our global, cross-discipline teams advise on a
full range of legal solutions. We have a global
view, and through our sector approach, a
detailed understanding of our clients’ business,
its drivers and competitive landscapes.
Our Responsible Business strategy is integral
to our firm strategy. It guides how we conduct
our core business, how we develop and
support our people, and how we foster closer
collaboration with our clients.
• Doing business – we promote market-
shaping practices in relation to ethics,
professional standards and
risk management
• People – we realise the potential of our
people by creating a safe, healthy and
inclusive workplace, and broadening our
skills and experience
• Community – we partner to support our
community by widening access to
justice, finance and education
• Environment – we manage our
environmental footprint and contribute to
developing a more sustainable world.
At Clifford Chance, we are committed to
delivering a world-class service – providing
the highest quality advice and support
efficiently and effectively, every time.
Our clients, who include corporate
companies across all commercial and
industrial sectors, governments, regulators,
trade bodies and not-for-profit organisations
are at the heart of how we work.
Understanding what our clients value and
aligning with their needs underpins our
approach. We invest heavily to ensure that
clients benefit from our formidable knowledge
and market insights, that they have access to
the best team for the job, and that we bring
the right processes and advanced
technologies to bear on each matter.
Trusted legal advice
for the world’s
Leading businesses of
today and tomorrow
As a single, fully integrated, global partnership, we pride ourselves on our
approachable, collegiate and team-based way of working.
We are a single profit pool, lockstep
partnership. Our ambition is to work
collaboratively across geographies, practices,
product areas and sectors to deliver the best
advice and support to our clients.
Our structure
Our international network Responsible BusinessDelivering value to our clients
Agenda
Data Science in Law
Architecture
First success: Graph Analysis
NLP deep learning in Spark & Databricks
6
Data science in Law
Why would a law firm get involved in creating a data science lab?
7
Digital is changing how business gets done
Current
Law firm +
Digital Technologies
=
Future
Law Firm
• Agile platforms and solutions designed change and adapt
• Human augmentation (AI)
• Draw better insight out of data convert intelligent into action
• Simplify & digital work execution
Digital Trends
• Re-envision existing and enable new business
models
• Embrace different way of bringing people together
• Develop new capabilities that help organizations
transform themselves into digital organizations
New regulations, markets
• Staying ahead by anticipating what’s next
• Industry competitiveness (panels, benchmarking)
What we do in the Data Science Lab
360
information
products
We seek to
better understand our
data, diagnose, predict
business outcomes
and recommend what
to do to achieve
our goals.
Operational
insights,
predictions &
prescriptions
Turn our
data into new
Client
Products &
Services
Provide self-service 360 views of critical data, available to all;
to research further particular topics and enhance the existing
business intelligence reporting.
Provide cross-cutting operational insights, Anticipate what will
happen and recommend what to do to achieve goals.
For example:
• what’s the patterns to our profitability and what actions can we
take?
• can we better predict our fees?
Seek to create new products and services based on our data &
insights we can derive from it.
• Up to your imagination!
10
ELIVER
AI Adoption MATURITY CURVE
Address Cloud
challenges
Innovation pipeline
What are you
trying to do?
AI capability
AImaturity
From research projects to Client Solutions
BI & Apps
• Data driven business
analytics & reporting
Azure Data Services, SQL Server
Power Platform (Power BI, Power Apps,
Power Flow)
SaaS with AI
• Immediate actionable insights
with AI
Already productised: for example for due diligence, risk
management, contract automation,
Also with Office 365, Workplace Analytics)
Ad hoc AI
• AI Research
• Big data insight
• AI idea validations
Azure cloud services
Infrastructure as code
New opportunities
Speed to market
AI “Accelerators”
• Solution specific AI services
& patterns
Azure Cognitive Services
Azure Bot Services
Azure Search
Industry accelerators
AI Products
• Operationalise AI
• Product realisation
• Data Science & Deep AI
capabilities
Open Frameworks
Azure Databricks
Azure Machine Learning
Azure AI Infrastructure
Legal specific
DS-Lab end to end capability
11
Technology pioneers
Evangelization
DevOps / SecOps / Operationalization
Idea & business process management
Data Science
• Research
• Academic collaborations
• Modelling (e.g. ML and DL)
Data pipeline
• Data extraction / collection
• Transformation
• Compliance
• Confidentiality
Production
• Minimum Viable Product
• AI productionisation
• Product development
Architecture
The Legal Industry deals with confidential matters…
is the data science lab system architecture appropriate?
12
Confidentiality
13
User Access
Time validity
Contract limitations
Client Contracts
Audits!
Line of BusinessData Science
Dataset 1
Dataset 3
Dataset 2
Batch
Stream
ML+AI Community and
topics detection
ML+AI Relationship analysis
ML+AI Decision Automation
Data Engineering
Data Processing
Customers
Volume
Variety
Velocity
Many challenges in legal
Blob storage
Data Lake
On-premise
Architecture
15
First success: Graph analysis
What is the data science lab spark?
16
17
Understanding client relationships: a work in progress
Understanding customer
relationships is crucial to our
business and revenue
generation
We have a constant
battle getting people to
use our CRM
(customer relationship
management); it is
manually intensive, and
people's time is
pressured
Use our existing data to
uncover insights into our
customer and client
relationships
Client relationship analysis
18
Client relationship analysis
Key
performance
indicator
Define
datasets Academic
literature
& research
Demo
realization
Production
• Email system
• HR system
• CRM system
• Others..
• Knowledge
graph
• Graph
algorithms
• Dataset analysis
• Research
implementation
• User visualisation
• Minimum
viable product
• Productionise
the research
• Agile Project
• Confidential
19
Client relationship analysis
NLP deep learning in Spark &
Databricks
Let’s get serious
20
21
Document classification
LIBOR, Brexit and similar
exercise deals with huge
number of documents.
Document analysis requires
the identification of the
document type
The ability of
recognizing the
document types is a
requirement for many
matters
We often have urgent use
cases from client tenders
22
Document classification
Key
performance
indicator
Define
datasets Academic
literature
& research
Demo
realization
Production
• EDGAR • Text
classification
• Document
classification
• Deep learning
for NLP
• Work in progress • Minimum
viable product
• Productionise
the research
• Agile Project
• Confidential
Document example
• Document size 300 pages or more (more than 150.000 words)
23
Open EDGAR
Open EDGAR
24
Data download and cleaning
U.S.A
SEC Open EDGAR
Document Analysis
Document Cleaning
Blob
Storage
Text Extraction
15M raw documents
7M cleaned documents
250K state of the art
Metadata
Models development
• Cross validation with 10 datasets of 1000
documents
• Considered the following models/architectures:
– CNN
– LSTM
– Doc2Vec
– SVM
– BERT
25
Hyperparameter optimisation
• How can we choose the optimal
hyperparameters?
– Grid search
– Bayesian optimization
26
Grid search on single core model
27
Hyperparameters
RDD
.map()
Single
core
model
Single
core
model
Single
core
model
Single
core
model
Executor
Single
core
model Dataframe
28
Mlflow – single core
29
How does mlflow manage multithreading
in Spark?
30
• Created a class instead of using global variables
• Implemented __enter__, __exit__, __del__
methods for using the same notation
How does mlflow manage multithreading in
Spark?
class MtMlFlow:
def __init__(self, run_id=None,
experiment_name=None, experiment_id=None, mlflow_run:
Run = None):
@staticmethod
def get_experiment_id(experiment_name):
def __enter__(self):
def __exit__(self, exc_type, exc_val, exc_tb):
def __del__(self):
@property
def run(self):
@property
def run_id(self):
def log_param(self, key, value):
def set_tag(self, key, value):
def log_metric(self, key, value, step=None):
def log_metrics(self, metrics, step=None):
def log_params(self, params):
def set_tags(self, tags):
def log_artifact(self, local_path, …):
def log_artifacts(self, local_dir, …):
@classmethod
def create_experiment(cls, name, …):
def get_artifact_uri(self, artifact_path=None):
31
with MtMlFlow(experiment_name='/Shared/my_exp_results') as experiment:
experiment.log_param("batch", hyperparameters[0])
experiment.log_param("vector_size", hyperparameters[1])
experiment.log_param("window_size", hyperparameters[2])
experiment.log_param("cost", hyperparameters[3])
experiment.log_param("gamma", hyperparameters[4])
...
experiment.log_artifact(file_path)
Bayesian optimization for hyperparameter optimisation
32
Executor
Deep Learning Model
Executor
Deep Learning Model
Notebooks
Deep Learning Model
Hyperopt
Hyperparameters definition
Optimal
hyperparameter
combination
> 500 runsSparkTrials
Hyperopt example
33
Hyperparameters results
34
Hyperparameter optimisation
35
Number of
documents
Number of
executors
Number of CPU
per Executor
Time per
run
Elapsed
Time
Total
CPU time
Number
of runs
Grid search 10000 10 32 75m 8h 100 days 1920
Bayesian
optimisation (TPE*) 10000 10 2 GPU 25m 21h 9 days 500
TPE = Tree of Parzen Estimators
What results?
• The best model achieve F1 score of 98.5 %
36
Q&A
Thank you!
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
40
Run_parallel implementation
41

Contenu connexe

Tendances

Sql server 2019 new features
Sql server 2019 new featuresSql server 2019 new features
Sql server 2019 new featuresGeorge Walters
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019Amit Banerjee
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure DatabricksSascha Dittmann
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
Snowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data WarehousingSnowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data WarehousingAmazon Web Services
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
What is Informatica Powercenter
What is Informatica PowercenterWhat is Informatica Powercenter
What is Informatica PowercenterBigClasses Com
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data LakeMetroStar
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureDATAVERSITY
 
Cognitive Search: Announcing the smartest enterprise search engine, now with ...
Cognitive Search: Announcing the smartest enterprise search engine, now with ...Cognitive Search: Announcing the smartest enterprise search engine, now with ...
Cognitive Search: Announcing the smartest enterprise search engine, now with ...Microsoft Tech Community
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureDATAVERSITY
 
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...Cathrine Wilhelmsen
 

Tendances (20)

Sql server 2019 new features
Sql server 2019 new featuresSql server 2019 new features
Sql server 2019 new features
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019
 
Microsoft Azure Databricks
Microsoft Azure DatabricksMicrosoft Azure Databricks
Microsoft Azure Databricks
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Snowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data WarehousingSnowflake Best Practices for Elastic Data Warehousing
Snowflake Best Practices for Elastic Data Warehousing
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
What is Informatica Powercenter
What is Informatica PowercenterWhat is Informatica Powercenter
What is Informatica Powercenter
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Azure purview
Azure purviewAzure purview
Azure purview
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
 
Cognitive Search: Announcing the smartest enterprise search engine, now with ...
Cognitive Search: Announcing the smartest enterprise search engine, now with ...Cognitive Search: Announcing the smartest enterprise search engine, now with ...
Cognitive Search: Announcing the smartest enterprise search engine, now with ...
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Improving Data Literacy Around Data Architecture
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data Architecture
 
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
Choosing Between Microsoft Fabric, Azure Synapse Analytics and Azure Data Fac...
 

Similaire à Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Clifford Chance

Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseDatabricks
 
Entry Points – How to Get Rolling with Big Data Analytics
Entry Points – How to Get Rolling with Big Data AnalyticsEntry Points – How to Get Rolling with Big Data Analytics
Entry Points – How to Get Rolling with Big Data AnalyticsInside Analysis
 
The Agile Analyst: Solving the Data Problem with Virtualization
The Agile Analyst: Solving the Data Problem with VirtualizationThe Agile Analyst: Solving the Data Problem with Virtualization
The Agile Analyst: Solving the Data Problem with VirtualizationInside Analysis
 
Acctiva: expertise in Business Intelligence, Data Warehousing, Data Governance
Acctiva: expertise in Business Intelligence, Data Warehousing, Data GovernanceAcctiva: expertise in Business Intelligence, Data Warehousing, Data Governance
Acctiva: expertise in Business Intelligence, Data Warehousing, Data GovernanceAcctiva Ltd.
 
Age of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide DiscoveryAge of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide DiscoveryInside Analysis
 
Valuing the data asset
Valuing the data assetValuing the data asset
Valuing the data assetBala Iyer
 
Enterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data LifecycleEnterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data LifecycleNIXUnited
 
Enterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data LifecycleEnterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data LifecycleErinDempsey17
 
Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...
Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...
Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...Innovation Enterprise
 
CI or FS Poly Cleared Job Fair Handbook | May 18
CI or FS Poly Cleared Job Fair Handbook | May 18 CI or FS Poly Cleared Job Fair Handbook | May 18
CI or FS Poly Cleared Job Fair Handbook | May 18 ClearedJobs.Net
 
Graphs in Life Sciences
Graphs in Life SciencesGraphs in Life Sciences
Graphs in Life SciencesNeo4j
 
reStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer Directory
reStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer DirectoryreStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer Directory
reStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer DirectoryKen Fuller
 
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...Matt Stubbs
 
IDC Executive Programs Brochure
IDC Executive Programs Brochure IDC Executive Programs Brochure
IDC Executive Programs Brochure IDC Research Inc.
 
Cloudera Fast Forward Labs: Accelerate machine learning
Cloudera Fast Forward Labs: Accelerate machine learningCloudera Fast Forward Labs: Accelerate machine learning
Cloudera Fast Forward Labs: Accelerate machine learningCloudera, Inc.
 
iiiQbets company profile
iiiQbets company profileiiiQbets company profile
iiiQbets company profileBhagyaV2
 
ZIGRAM Introduction September 2020
ZIGRAM Introduction September 2020ZIGRAM Introduction September 2020
ZIGRAM Introduction September 2020ZIGRAM
 
Cubodrom profile
Cubodrom profileCubodrom profile
Cubodrom profilecubodrom
 

Similaire à Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Clifford Chance (20)

Building the Artificially Intelligent Enterprise
Building the Artificially Intelligent EnterpriseBuilding the Artificially Intelligent Enterprise
Building the Artificially Intelligent Enterprise
 
Entry Points – How to Get Rolling with Big Data Analytics
Entry Points – How to Get Rolling with Big Data AnalyticsEntry Points – How to Get Rolling with Big Data Analytics
Entry Points – How to Get Rolling with Big Data Analytics
 
The Agile Analyst: Solving the Data Problem with Virtualization
The Agile Analyst: Solving the Data Problem with VirtualizationThe Agile Analyst: Solving the Data Problem with Virtualization
The Agile Analyst: Solving the Data Problem with Virtualization
 
Acctiva: expertise in Business Intelligence, Data Warehousing, Data Governance
Acctiva: expertise in Business Intelligence, Data Warehousing, Data GovernanceAcctiva: expertise in Business Intelligence, Data Warehousing, Data Governance
Acctiva: expertise in Business Intelligence, Data Warehousing, Data Governance
 
Age of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide DiscoveryAge of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide Discovery
 
Future ready
Future readyFuture ready
Future ready
 
Valuing the data asset
Valuing the data assetValuing the data asset
Valuing the data asset
 
Enterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data LifecycleEnterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data Lifecycle
 
Enterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data LifecycleEnterprise Data Management: Managing your Business’s Entire Data Lifecycle
Enterprise Data Management: Managing your Business’s Entire Data Lifecycle
 
Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...
Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...
Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...
 
CI or FS Poly Cleared Job Fair Handbook | May 18
CI or FS Poly Cleared Job Fair Handbook | May 18 CI or FS Poly Cleared Job Fair Handbook | May 18
CI or FS Poly Cleared Job Fair Handbook | May 18
 
Web focus overview presentation 2015
Web focus overview presentation 2015Web focus overview presentation 2015
Web focus overview presentation 2015
 
Graphs in Life Sciences
Graphs in Life SciencesGraphs in Life Sciences
Graphs in Life Sciences
 
reStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer Directory
reStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer DirectoryreStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer Directory
reStartEvents April 22nd DC metro Cleared Virtual Career Fair Employer Directory
 
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
 
IDC Executive Programs Brochure
IDC Executive Programs Brochure IDC Executive Programs Brochure
IDC Executive Programs Brochure
 
Cloudera Fast Forward Labs: Accelerate machine learning
Cloudera Fast Forward Labs: Accelerate machine learningCloudera Fast Forward Labs: Accelerate machine learning
Cloudera Fast Forward Labs: Accelerate machine learning
 
iiiQbets company profile
iiiQbets company profileiiiQbets company profile
iiiQbets company profile
 
ZIGRAM Introduction September 2020
ZIGRAM Introduction September 2020ZIGRAM Introduction September 2020
ZIGRAM Introduction September 2020
 
Cubodrom profile
Cubodrom profileCubodrom profile
Cubodrom profile
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Dernier

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 

Dernier (20)

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Clifford Chance

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Mirko Bernardoni & Michael Seddon Clifford Chance Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks #UnifiedDataAnalytics #SparkAISummit
  • 3. 3 Mirko’s main role is to build and lead the data science lab. He strongly believes that the work that Clifford Chance is doing in the research department offers a unique opportunity to shape and change the legal sector which is why he also work closely with universities for research. About Mirko Bernardoni MIRKO BERNARDONI Head of Data Science Clifford Chance https://www.linkedin.com/i/mirko-bernardoni
  • 4. 4 Michael’s focus is on solving Clifford Chance’s data engineering challenges. As a core member of the team, he has been involved in the design and development of the data pipelines crucial to the lab’s success, as well as delivering numerous machine learning projects helping to reshape Clifford Chance’s operational insights and execution. About Michael Seddon MICHAEL SEDDON Senior Machine Learning Engineer Clifford Chance https://www.linkedin.com/in/mrmichaelseddon/
  • 5. We are one of the world’s pre-eminent law firms with significant depth and range of resources across five continents Clifford Chance As a global law firm we are able to support clients at both a local and international level across Europe, Asia Pacific, the Americas, the Middle East and Africa. Our global, cross-discipline teams advise on a full range of legal solutions. We have a global view, and through our sector approach, a detailed understanding of our clients’ business, its drivers and competitive landscapes. Our Responsible Business strategy is integral to our firm strategy. It guides how we conduct our core business, how we develop and support our people, and how we foster closer collaboration with our clients. • Doing business – we promote market- shaping practices in relation to ethics, professional standards and risk management • People – we realise the potential of our people by creating a safe, healthy and inclusive workplace, and broadening our skills and experience • Community – we partner to support our community by widening access to justice, finance and education • Environment – we manage our environmental footprint and contribute to developing a more sustainable world. At Clifford Chance, we are committed to delivering a world-class service – providing the highest quality advice and support efficiently and effectively, every time. Our clients, who include corporate companies across all commercial and industrial sectors, governments, regulators, trade bodies and not-for-profit organisations are at the heart of how we work. Understanding what our clients value and aligning with their needs underpins our approach. We invest heavily to ensure that clients benefit from our formidable knowledge and market insights, that they have access to the best team for the job, and that we bring the right processes and advanced technologies to bear on each matter. Trusted legal advice for the world’s Leading businesses of today and tomorrow As a single, fully integrated, global partnership, we pride ourselves on our approachable, collegiate and team-based way of working. We are a single profit pool, lockstep partnership. Our ambition is to work collaboratively across geographies, practices, product areas and sectors to deliver the best advice and support to our clients. Our structure Our international network Responsible BusinessDelivering value to our clients
  • 6. Agenda Data Science in Law Architecture First success: Graph Analysis NLP deep learning in Spark & Databricks 6
  • 7. Data science in Law Why would a law firm get involved in creating a data science lab? 7
  • 8. Digital is changing how business gets done Current Law firm + Digital Technologies = Future Law Firm • Agile platforms and solutions designed change and adapt • Human augmentation (AI) • Draw better insight out of data convert intelligent into action • Simplify & digital work execution Digital Trends • Re-envision existing and enable new business models • Embrace different way of bringing people together • Develop new capabilities that help organizations transform themselves into digital organizations New regulations, markets • Staying ahead by anticipating what’s next • Industry competitiveness (panels, benchmarking)
  • 9. What we do in the Data Science Lab 360 information products We seek to better understand our data, diagnose, predict business outcomes and recommend what to do to achieve our goals. Operational insights, predictions & prescriptions Turn our data into new Client Products & Services Provide self-service 360 views of critical data, available to all; to research further particular topics and enhance the existing business intelligence reporting. Provide cross-cutting operational insights, Anticipate what will happen and recommend what to do to achieve goals. For example: • what’s the patterns to our profitability and what actions can we take? • can we better predict our fees? Seek to create new products and services based on our data & insights we can derive from it. • Up to your imagination!
  • 10. 10 ELIVER AI Adoption MATURITY CURVE Address Cloud challenges Innovation pipeline What are you trying to do? AI capability AImaturity From research projects to Client Solutions BI & Apps • Data driven business analytics & reporting Azure Data Services, SQL Server Power Platform (Power BI, Power Apps, Power Flow) SaaS with AI • Immediate actionable insights with AI Already productised: for example for due diligence, risk management, contract automation, Also with Office 365, Workplace Analytics) Ad hoc AI • AI Research • Big data insight • AI idea validations Azure cloud services Infrastructure as code New opportunities Speed to market AI “Accelerators” • Solution specific AI services & patterns Azure Cognitive Services Azure Bot Services Azure Search Industry accelerators AI Products • Operationalise AI • Product realisation • Data Science & Deep AI capabilities Open Frameworks Azure Databricks Azure Machine Learning Azure AI Infrastructure Legal specific
  • 11. DS-Lab end to end capability 11 Technology pioneers Evangelization DevOps / SecOps / Operationalization Idea & business process management Data Science • Research • Academic collaborations • Modelling (e.g. ML and DL) Data pipeline • Data extraction / collection • Transformation • Compliance • Confidentiality Production • Minimum Viable Product • AI productionisation • Product development
  • 12. Architecture The Legal Industry deals with confidential matters… is the data science lab system architecture appropriate? 12
  • 13. Confidentiality 13 User Access Time validity Contract limitations Client Contracts Audits!
  • 14. Line of BusinessData Science Dataset 1 Dataset 3 Dataset 2 Batch Stream ML+AI Community and topics detection ML+AI Relationship analysis ML+AI Decision Automation Data Engineering Data Processing Customers Volume Variety Velocity Many challenges in legal Blob storage Data Lake On-premise
  • 16. First success: Graph analysis What is the data science lab spark? 16
  • 17. 17 Understanding client relationships: a work in progress Understanding customer relationships is crucial to our business and revenue generation We have a constant battle getting people to use our CRM (customer relationship management); it is manually intensive, and people's time is pressured Use our existing data to uncover insights into our customer and client relationships Client relationship analysis
  • 18. 18 Client relationship analysis Key performance indicator Define datasets Academic literature & research Demo realization Production • Email system • HR system • CRM system • Others.. • Knowledge graph • Graph algorithms • Dataset analysis • Research implementation • User visualisation • Minimum viable product • Productionise the research • Agile Project • Confidential
  • 20. NLP deep learning in Spark & Databricks Let’s get serious 20
  • 21. 21 Document classification LIBOR, Brexit and similar exercise deals with huge number of documents. Document analysis requires the identification of the document type The ability of recognizing the document types is a requirement for many matters We often have urgent use cases from client tenders
  • 22. 22 Document classification Key performance indicator Define datasets Academic literature & research Demo realization Production • EDGAR • Text classification • Document classification • Deep learning for NLP • Work in progress • Minimum viable product • Productionise the research • Agile Project • Confidential
  • 23. Document example • Document size 300 pages or more (more than 150.000 words) 23
  • 24. Open EDGAR Open EDGAR 24 Data download and cleaning U.S.A SEC Open EDGAR Document Analysis Document Cleaning Blob Storage Text Extraction 15M raw documents 7M cleaned documents 250K state of the art Metadata
  • 25. Models development • Cross validation with 10 datasets of 1000 documents • Considered the following models/architectures: – CNN – LSTM – Doc2Vec – SVM – BERT 25
  • 26. Hyperparameter optimisation • How can we choose the optimal hyperparameters? – Grid search – Bayesian optimization 26
  • 27. Grid search on single core model 27 Hyperparameters RDD .map() Single core model Single core model Single core model Single core model Executor Single core model Dataframe
  • 29. 29
  • 30. How does mlflow manage multithreading in Spark? 30 • Created a class instead of using global variables • Implemented __enter__, __exit__, __del__ methods for using the same notation
  • 31. How does mlflow manage multithreading in Spark? class MtMlFlow: def __init__(self, run_id=None, experiment_name=None, experiment_id=None, mlflow_run: Run = None): @staticmethod def get_experiment_id(experiment_name): def __enter__(self): def __exit__(self, exc_type, exc_val, exc_tb): def __del__(self): @property def run(self): @property def run_id(self): def log_param(self, key, value): def set_tag(self, key, value): def log_metric(self, key, value, step=None): def log_metrics(self, metrics, step=None): def log_params(self, params): def set_tags(self, tags): def log_artifact(self, local_path, …): def log_artifacts(self, local_dir, …): @classmethod def create_experiment(cls, name, …): def get_artifact_uri(self, artifact_path=None): 31 with MtMlFlow(experiment_name='/Shared/my_exp_results') as experiment: experiment.log_param("batch", hyperparameters[0]) experiment.log_param("vector_size", hyperparameters[1]) experiment.log_param("window_size", hyperparameters[2]) experiment.log_param("cost", hyperparameters[3]) experiment.log_param("gamma", hyperparameters[4]) ... experiment.log_artifact(file_path)
  • 32. Bayesian optimization for hyperparameter optimisation 32 Executor Deep Learning Model Executor Deep Learning Model Notebooks Deep Learning Model Hyperopt Hyperparameters definition Optimal hyperparameter combination > 500 runsSparkTrials
  • 35. Hyperparameter optimisation 35 Number of documents Number of executors Number of CPU per Executor Time per run Elapsed Time Total CPU time Number of runs Grid search 10000 10 32 75m 8h 100 days 1920 Bayesian optimisation (TPE*) 10000 10 2 GPU 25m 21h 9 days 500 TPE = Tree of Parzen Estimators
  • 36. What results? • The best model achieve F1 score of 98.5 % 36
  • 37. Q&A
  • 39. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
  • 40. 40