Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks

From zero to data science in a legal firm: how one of the world’s largest law firms is reshaping operations with advanced analytics. Clifford Chance LLP is one of the ten largest law firms in the world. With thousands of global clients, its teams handle millions of legal documents every year.

The data science team will share their approach to building an agile data science lab from zero on top of Apache Spark, Azure Databricks and MLflow. They will take a deep dive into how they used deep learning for natural language processing to classify large documents, with MLflow and Hyperopt for model comparison and hyperparameter optimization.


2.Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks Mirko Bernardoni & Michael Seddon Clifford Chance #UnifiedDataAnalytics #SparkAISummit

3.About Mirko Bernardoni. Head of Data Science, Clifford Chance. Mirko’s main role is to build and lead the data science lab. He strongly believes that the work Clifford Chance is doing in the research department offers a unique opportunity to shape and change the legal sector, which is why he also works closely with universities on research.

4.About Michael Seddon. Senior Machine Learning Engineer, Clifford Chance. Michael’s focus is on solving Clifford Chance’s data engineering challenges. As a core member of the team, he has been involved in the design and development of the data pipelines crucial to the lab’s success, as well as delivering numerous machine learning projects helping to reshape Clifford Chance’s operational insights and execution.

5.Clifford Chance
Trusted legal advice for the world’s leading businesses of today and tomorrow
We are one of the world’s pre-eminent law firms with significant depth and range of resources across five continents. As a single, fully integrated, global partnership, we pride ourselves on our approachable, collegiate and team-based way of working.
Our international network
As a global law firm we are able to support clients at both a local and international level across Europe, Asia Pacific, the Americas, the Middle East and Africa. Our global, cross-discipline teams advise on a full range of legal solutions. We have a global view, and through our sector approach, a detailed understanding of our clients’ business, its drivers and competitive landscapes.
Our structure
We are a single profit pool, lockstep partnership. Our ambition is to work collaboratively across geographies, practices, product areas and sectors to deliver the best advice and support to our clients.
Delivering value to our clients
At Clifford Chance, we are committed to delivering a world-class service, providing the highest quality advice and support efficiently and effectively, every time. Our clients, who include corporate companies across all commercial and industrial sectors, governments, regulators, trade bodies and not-for-profit organisations, are at the heart of how we work. Understanding what our clients value and aligning with their needs underpins our approach. We invest heavily to ensure that clients benefit from our formidable knowledge and market insights, that they have access to the best team for the job, and that we bring the right processes and advanced technologies to bear on each matter.
Responsible Business
Our Responsible Business strategy is integral to our firm strategy. It guides how we conduct our core business, how we develop and support our people, and how we foster closer collaboration with our clients.
• Doing business: we promote market-shaping practices in relation to ethics, professional standards and risk management
• People: we realise the potential of our people by creating a safe, healthy and inclusive workplace, and broadening our skills and experience
• Community: we partner to support our community by widening access to justice, finance and education
• Environment: we manage our environmental footprint and contribute to developing a more sustainable world

6.Agenda
• Data Science in Law
• Architecture
• First success: Graph Analysis
• NLP deep learning in Spark & Databricks

7.Why would a law firm get involved in creating a data science lab? Data science in Law

8.Digital is changing how business gets done
Current law firm + Digital Technologies + Digital Trends + New regulations, markets = Future law firm
Digital Technologies
• Agile platforms and solutions designed to change and adapt
• Human augmentation (AI)
• Draw better insight out of data and convert intelligence into action
• Simplify and digitise work execution
Digital Trends
• Re-envision existing and enable new business models
• Embrace different ways of bringing people together
• Develop new capabilities that help organizations transform themselves into digital organizations
New regulations, markets
• Staying ahead by anticipating what’s next
• Industry competitiveness (panels, benchmarking)

9.What we do in the Data Science Lab
360 information products
• Provide self-service 360 views of critical data, available to all
• Research particular topics further and enhance the existing business intelligence reporting
Operational insights, predictions & prescriptions
• We seek to better understand our data, diagnose, predict business outcomes and recommend what to do to achieve our goals
• Provide cross-cutting operational insights, anticipate what will happen and recommend what to do to achieve goals. For example: what are the patterns in our profitability and what actions can we take? Can we better predict our fees?
Client Products & Services
• Turn our data into new client products and services
• Seek to create new products and services based on our data and the insights we can derive from it
• Up to your imagination!

10.AI Adoption: maturity curve
Axes: "What are you trying to do?" against AI capability / AI maturity. Tiers from lowest to highest maturity:
• BI & Apps: Azure Data Services, SQL Server, Power Platform (Power BI, Power Apps, Power Flow). Data driven business analytics & reporting
• SaaS with AI: already productised, for example for due diligence, risk management, contract automation (also with Office 365, Workplace Analytics). Immediate actionable insights with AI
• Ad hoc AI: Azure cloud services, Infrastructure as code, New opportunities, Speed to market, Address Cloud challenges. AI Research, Big data insight, AI idea validations
• AI “Accelerators”: Azure Cognitive Services, Azure Bot Services, Azure Search, Industry accelerators, Innovation pipeline. Solution specific AI services & patterns
• AI Products: Open Frameworks, Azure Databricks, Azure Machine Learning, Azure AI Infrastructure, Legal specific capabilities, From research projects to Client Solutions. Operationalise AI, Product realisation, Data Science & Deep AI

11.DS-Lab end to end capability
Data pipeline
• Data extraction / collection
• Transformation
• Compliance
• Confidentiality
Data Science
• Research
• Academic collaborations
• Modelling (e.g. ML and DL)
Production
• Minimum Viable Product
• AI productionisation
• Product development
Across all stages: Idea & business process management, Technology pioneers, DevOps / SecOps / Operationalization, Evangelization

12.The Legal Industry deals with confidential matters… is the data science lab system architecture appropriate? Architecture

13.Confidentiality
• Audits!
• User Access
• Time validity
• Contract limitations
• Client Contracts

14.Many challenges in legal
Diagram labels:
• Data Processing: Volume, Variety, Velocity; Blob storage, Data Lake, On-premise; Dataset 1, Dataset 2, Dataset 3; Batch, Stream
• Data Engineering
• Data Science: ML+AI Community and topics detection; ML+AI Relationship analysis; ML+AI Decision Automation
• Line of Business: Customers

15.Architecture

16.What is the data science lab spark? First success: Graph analysis

17.Client relationship analysis
Understanding client relationships: a work in progress
• Understanding customer relationships is crucial to our business and revenue generation
• We have a constant battle getting people to use our CRM (customer relationship management); it is manually intensive, and people's time is pressured
• Use our existing data to uncover insights into our customer and client relationships

18.Client relationship analysis
Define key performance indicator
• Confidential
Datasets
• Email system
• HR system
• CRM system
• Others..
Academic literature & research
• Knowledge graph
• Research algorithms
Demo realization
• Dataset analysis
• Graph implementation
• User visualisation
• Agile Project
Production
• Minimum viable product
• Productionise the research
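One way to picture the graph implementation step is a minimal sketch that assumes email metadata has already been reduced to (employee, client contact) pairs. The sample records, names and the `strongest_contact` helper are hypothetical, and the slides do not describe the actual algorithm; the sketch only shows message counts becoming weighted edges.

```python
from collections import Counter, defaultdict

# Hypothetical email metadata: one (employee, client contact) pair per message.
emails = [
    ("alice", "client_a"),
    ("alice", "client_a"),
    ("bob", "client_a"),
    ("alice", "client_b"),
    ("bob", "client_b"),
    ("bob", "client_b"),
    ("bob", "client_b"),
]

def build_graph(pairs):
    """Return {employee: {client: message_count}}, a weighted bipartite graph."""
    weights = Counter(pairs)
    graph = defaultdict(dict)
    for (employee, client), count in weights.items():
        graph[employee][client] = count
    return dict(graph)

def strongest_contact(graph, client):
    """Return the employee with the most messages to the client, or None."""
    candidates = {e: edges[client] for e, edges in graph.items() if client in edges}
    return max(candidates, key=candidates.get) if candidates else None

graph = build_graph(emails)
```

A real pipeline would join further sources such as the HR and CRM systems listed above; message count is only one possible edge weight.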

19.Client relationship analysis

20.Let’s get serious NLP deep learning in Spark & Databricks

21.Document classification
• LIBOR, Brexit and similar exercises deal with a huge number of documents. We often have urgent use cases from client tenders
• The ability to recognize document types is a requirement for many matters
• Document analysis requires the identification of the document type

22.Document classification
Define key performance indicator
• Confidential
Datasets
• EDGAR
Academic literature & research
• Text classification
• Document classification
• Deep learning for NLP
Demo realization
• Work in progress
• Agile Project
Production
• Minimum viable product
• Productionise the research

23.Document example • Document size: 300 pages or more (more than 150,000 words)

24.Data download and cleaning
Diagram labels: U.S.A SEC EDGAR, Open EDGAR, Blob Storage, Document Cleaning (Text Extraction, Metadata), Document Analysis
Figures on the diagram: 15M raw documents, 7M cleaned documents, 250K state of the art

25.Models development
• Cross validation with 10 datasets of 1000 documents
• Considered the following models/architectures: CNN, LSTM, Doc2Vec, SVM, BERT
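A minimal sketch of one reading of this evaluation setup: draw 10 disjoint datasets of 1000 documents each and score a model on every one, reporting the mean and spread. The corpus of document ids, the `dummy_score` placeholder and the seed are assumptions for illustration; the slides do not say how the datasets were drawn.

```python
import random
import statistics

def draw_datasets(doc_ids, n_datasets=10, size=1000, seed=0):
    """Split a random sample of doc_ids into disjoint datasets of equal size."""
    needed = n_datasets * size
    if len(doc_ids) < needed:
        raise ValueError(f"need at least {needed} documents, got {len(doc_ids)}")
    sample = random.Random(seed).sample(doc_ids, needed)
    return [sample[i * size:(i + 1) * size] for i in range(n_datasets)]

def evaluate(score_model, datasets):
    """Score the model on each dataset; return (mean, standard deviation)."""
    scores = [score_model(ds) for ds in datasets]
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical corpus of document ids.
corpus = list(range(50_000))
datasets = draw_datasets(corpus)

def dummy_score(dataset):
    # Placeholder: a real scorer would train and test a classifier here.
    return sum(1 for doc_id in dataset if doc_id % 2 == 0) / len(dataset)

mean_score, spread = evaluate(dummy_score, datasets)
```

Running every candidate architecture through the same `datasets` list keeps the comparison between models like for like.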

26.Hyperparameter optimisation
• How can we choose the optimal hyperparameters?
• Grid search
• Bayesian optimization
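To make the trade-off concrete, the sketch below enumerates a grid and counts the training runs it implies. The hyperparameter names and values are hypothetical, not taken from the talk. The cost of grid search is the product of the number of values per hyperparameter, which is why a sequential method such as Bayesian optimization, which chooses each new trial based on earlier results, becomes attractive as the grid grows.

```python
import itertools
import math

# Hypothetical search space for a text classifier.
search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_units": [64, 128, 256],
    "dropout": [0.1, 0.3, 0.5],
    "batch_size": [16, 32],
}

def expand_grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))

grid = list(expand_grid(search_space))
n_combinations = math.prod(len(values) for values in search_space.values())

# Evaluating each combination on the 10 validation datasets from the
# "Models development" slide multiplies the number of training runs again.
n_training_runs = n_combinations * 10
```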

27.Grid search on single core model
Diagram labels: Hyperparameters RDD, .map(), Executor, Dataframe, single core models (several per executor)
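The pattern on this slide can be sketched as a plain map over hyperparameter combinations, where each call trains one single-core model. The sketch uses Python's built-in `map` so that it runs anywhere; with PySpark the same function could be passed to `sc.parallelize(grid).map(train_one).collect()`, assuming an existing SparkContext named `sc` and training data that each executor can reach. `train_one` is a stand-in with a toy loss, not the team's model.

```python
import itertools

def make_grid(space):
    """Return a list with one dict per combination of hyperparameter values."""
    names = sorted(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(space[n] for n in names))]

def train_one(params):
    """Stand-in for training one single-core model; returns (params, loss)."""
    # Toy loss with its minimum at learning_rate=0.01, dropout=0.3.
    loss = (params["learning_rate"] - 0.01) ** 2 + (params["dropout"] - 0.3) ** 2
    return params, loss

def grid_search(grid, map_fn=map):
    """Apply train_one to every combination and return the best (params, loss)."""
    results = list(map_fn(train_one, grid))
    return min(results, key=lambda item: item[1])

space = {"learning_rate": [0.001, 0.01, 0.1], "dropout": [0.1, 0.3, 0.5]}
grid = make_grid(space)
best_params, best_loss = grid_search(grid)

# Spark equivalent (assumes an existing SparkContext `sc`):
# results = sc.parallelize(grid).map(train_one).collect()
```

Because each combination is independent, the work parallelises without coordination; the price is that no trial learns from any other.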

28.MLflow: single core