
Graph Realities


Graph applications were once considered “exotic” and expensive. Until recently, few software engineers had much experience putting graphs to work. However, the use cases are now becoming more commonplace.

This talk explores a practical use case, one which addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology.

Consider: some academic disciplines such as astronomy enjoy a wealth of data — mostly open data. Popular machine learning algorithms, open source Python libraries, and distributed systems all owe much to those disciplines and their history of big data.

Other disciplines require strong guarantees for privacy and security. Datasets used in social science research involve confidential details about human subjects: medical histories, wages, home addresses for family members, police records, etc.

Those cannot be shared openly, which impedes researchers from learning about related work by others. Reproducibility of research and the pace of science in general are limited. Nonetheless, social science research is vital for civil governance, especially for evidence-based policymaking (US federal law since 2018).

Even when data may be too sensitive to share openly, often the metadata can be shared. Constructing knowledge graphs of metadata about datasets, along with metadata about authors, their published research, methods used, data providers, data stewards, and so on, provides an effective means to tackle hard problems in data governance.

Knowledge graph work supports use cases such as entity linking, discovery and recommendations, inference over compliance axioms, and so on. This talk reviews the Rich Context AI competition and the related ADRF framework, now in use by more than 15 federal agencies in the US.

We’ll explore knowledge graph use cases, use of open standards and open source, and how this enhances reproducible research. Social science research for the public sector has much in common with data use in industry.

Issues of privacy, security, and compliance overlap, pointing toward what will be required of banks, media channels, etc., and what technologies apply. We’ll look at comparable work emerging in other parts of industry: open source projects, open standards emerging, and in particular a new set of features in Project Jupyter that support knowledge graphs about data governance.



  1. Graph Realities, Paco Nathan @pacoid
  2. Personal Background • applied math, machine learning, distributed systems • R&D for neural networks, incl. hardware (1986-1997) • “guinea pig” for early cloud (2005-ff) • led data teams in industry • assisted popular open source projects: Spark, Jupyter, etc. • development focus on natural language plus adjacent knowledge graph use cases • since 2018, increasingly working at the intersection of public sector + enterprise + open source
  3. Motivations: Not all that long ago, graph applications were considered exotic and expensive. Until recently, few software engineers had much experience putting graphs to work; however, those use cases have now become much more commonplace. This talk explores a practical use case, one that addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology. First, some perspectives and industry analysis…
  4. part 1: graph perspectives
  5. Perspectives • the ubiquity of linked data • the tyranny of “thinking relational” • the primacy of working with graphs (and their math analog, tensors) • nouns vs. verbs vs. adjectives (extreme nominalization) • evolution of hardware, cloud, and cluster topologies • the power of graph embeddings
  6. Historical Context • “Data Science: Past and Future”, Rev 2 (2019-05-24) slides • “What is Data Science?”, IBM Data Science Community (2019-03-04) • Just Enough Math, O’Reilly Media (2014) • John Tukey: data analytics as an intrinsically empirical and interdisciplinary field (1962) • most popular data frameworks leveraged some graph processing, albeit obscured, ad-hoc, clumsy… • they did well, given the hardware available at the time
  7. Beauty in sparsity… SuiteSparse Matrix Collection: a widely used set of sparse matrix benchmarks collected from a wide range of applications, sparse.tamu.edu/ …for when you really, really need some interesting graph data
  8. Theme 1: Stuffing graphs into matrices. Algebraic graph theory allowed reuse of linear algebra implementations • e.g., transform a graph to an adjacency matrix; for the 4-node graph {u, v, w, x}: u: [0 1 0 1], v: [1 0 1 1], w: [0 1 0 1], x: [1 1 1 0] • most will be relatively sparse • use LINPACK, BLAS, or libraries built atop • much to leverage: SVD, power method, QR decomp, etc.
  9. Theme 1: Stuffing graphs into matrices. For many real-world problems, the data are essentially graphs: 1. real-world data 2. graph theory for representation 3. convert to sparse matrix for production 4. cost-effective parallel processing at scale. Ergo, leverage low dimensional structure in high dimensional data
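The two slides above can be made concrete with a minimal sketch (illustrative, not from the talk): the 4-node graph {u, v, w, x} from slide 8 as a SciPy sparse adjacency matrix, with degrees and the leading eigenpair falling out of standard linear algebra routines.

```python
# Sketch of "stuffing a graph into a matrix": the 4-node graph
# {u, v, w, x} from slide 8 as a sparse adjacency matrix.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

nodes = ["u", "v", "w", "x"]
A = csr_matrix(np.array([
    [0, 1, 0, 1],   # u -- v, x
    [1, 0, 1, 1],   # v -- u, w, x
    [0, 1, 0, 1],   # w -- v, x
    [1, 1, 1, 0],   # x -- u, v, w
]))

# node degrees fall out of a row sum over the sparse structure
degrees = A.sum(axis=1).A1

# leading eigenpair, i.e. the power-method family the slide mentions
vals, vecs = eigsh(A.asfptype(), k=1, which="LA")
```

At scale, the same pattern applies: the matrix stays sparse, and SVD, QR, or power iteration run over it without ever materializing the dense form.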
  10. N Dims good, 2 Dims baa-d. However, complex graphs cannot be represented as 2D matrices without serious information loss. Ideally, tensors would be a better representation to use for linear algebra libraries. While tensor decomposition is a hard problem, the general class of problems became much more interesting after 2012…
  11. N Dims good, 2 Dims baa-d (cont.) “The real problem is that programmers have spent far too much time worrying about efficiency in the wrong place and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” – Don Knuth
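The tensor point can be sketched with a toy example (all names invented, not from the talk): an edge-labeled graph stored as a 3-way adjacency tensor, one n × n slice per relation type. Collapsing the slices back into a single 2D matrix is exactly the information loss the slide warns about.

```python
# Illustrative sketch: an edge-labeled graph as a 3-way adjacency tensor,
# one (n x n) slice per relation type; node/relation names are invented.
import numpy as np

nodes = ["alice", "bob", "carol"]
relations = ["knows", "cites"]

T = np.zeros((len(relations), len(nodes), len(nodes)), dtype=np.int8)

def add_edge(rel: str, src: str, dst: str) -> None:
    T[relations.index(rel), nodes.index(src), nodes.index(dst)] = 1

add_edge("knows", "alice", "bob")
add_edge("knows", "bob", "carol")
add_edge("cites", "carol", "alice")

# flattening to 2D merges the relation types: which edges were
# "knows" vs. "cites" can no longer be recovered from `flat`
flat = T.sum(axis=0)
```

Tensor decomposition methods (the hard problem the slide mentions) factor T directly, keeping the per-relation structure that the flattened matrix discards.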
  12. Theme 2: Nouns, Verbs, Adjectives. Tracing back to the origins of relational databases, Edgar Codd was furious about how badly SQL and RDBMS had misinterpreted his mathematical modeling of relations. Years of EDW reinforced a sense of extreme nominalization, with so much of the data representation being reduced into dimensions, facts, indexes
  13. Theme 2: Nouns, Verbs, Adjectives. A carry-over of extreme nominalization into graph DBs also over-emphasizes the role of nodes and centrality for adjusting the granularity of graph representations: • discounts the importance of relations • “mostly nouns, a few verbs, some adjectives” • serious information loss. IMO: graph DB frameworks tend to err in this aspect, both in terms of representation and algorithm support.
  14. Part of a long-term narrative arc in IT… • arguably, circa 2001 was the heyday of DW+BI – later acting as an “embedded institution” w.r.t. data science • Agile Manifesto became another “embedded institution” • a generation of developers equated “database” with “relational”, with a belief that legibility of systems == legibility of the data • even so, first-movers collectively made a sudden turn toward NoSQL, partly in reaction to RDBMS pricing • see also: “Statistical Modeling: The Two Cultures”, Leo Breiman, UC Berkeley (2001)
  15. Adjusting data resolution in graphs. In contrast, consider: “Extracting the multiscale backbone of complex weighted networks”, M. Ángeles Serrano, Marián Boguñá, Alessandro Vespignani, PNAS (2009-04-21). Filtering large noisy graphs based on both nodes and edges can be useful for automated approaches in knowledge graph construction; see: github.com/DerwenAI/disparity_filter
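As a simplified sketch of the idea behind that paper (the fuller implementation lives in the DerwenAI/disparity_filter repo linked above; this re-derivation assumes NetworkX): each edge gets a significance α = (1 − w/s)^(k−1) computed at both endpoints, and the backbone keeps only edges significant at a chosen cut-off.

```python
# Simplified sketch of the Serrano et al. disparity filter;
# see github.com/DerwenAI/disparity_filter for the full version.
import networkx as nx

def disparity_alpha(g: nx.Graph) -> None:
    """Annotate each edge with its significance alpha, taken at the
    more significant (smaller alpha) of its two endpoints."""
    for u, v, data in g.edges(data=True):
        alphas = []
        for node in (u, v):
            k = g.degree(node)
            if k > 1:
                strength = sum(d["weight"] for _, _, d in g.edges(node, data=True))
                p = data["weight"] / strength
                alphas.append((1.0 - p) ** (k - 1))
        data["alpha"] = min(alphas) if alphas else 1.0

def backbone(g: nx.Graph, alpha_cut: float = 0.05) -> nx.Graph:
    """Multiscale backbone: keep edges significant at the alpha_cut level."""
    disparity_alpha(g)
    kept = [(u, v) for u, v, d in g.edges(data=True) if d["alpha"] < alpha_cut]
    return g.edge_subgraph(kept)
```

For a hub with one heavy edge among many light ones, only the heavy edge survives the cut; unlike a global weight threshold, the filter is local to each node, which is what lets it preserve structure across scales.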
  16. Theme 3: Hardware in perspective. An emerging trend disrupts the past 15-20 years of software engineering practice: hardware > software > process. Hardware is now evolving more rapidly than software, which is evolving more rapidly than effective process. Moore’s Law is all but dead, although ironically many inefficiencies had been based on it. See also: Pete Warden (2018) regarding TensorFlow.js on low-power devices
  17. Theme 3: Evolution of cloud patterns (2009). UC Berkeley published a 2009 report about early use cases for cloud computing, which foresaw the shape of industry deployments over much of the next decade, and led directly to Apache Mesos and Apache Spark. It’s fascinating to study the contrasts between that 2009 report and its 2019 follow-up. (minor footnote: vimeo.com/3616394)
  18. Theme 3: Evolution of cloud patterns (2019). Early cloud was intentionally “dumbed down” to resemble popular virtualization software – recognizable by IT staff – to support migration. That approach is no longer needed. Also, the physics + economics of cloud use tend to imply fewer “framework” layers. More contemporary patterns will force a restructuring – for efficiency and security – i.e., decoupling computation and storage.
  19. Theme 3: Cluster topologies, by generation (1990s, mid-2000s, current). Opinion: one problem with the software/hardware interface for distributed systems is that it’s taken decades to prioritize the need for handling graphs/tensors directly within popular, accessible open source libraries, without having some commercial database vendor intermediate.
  20. Theme 3: Cluster topologies, by generation (1990s, mid-2000s, current; NB: graph). See also: Jeff Dean (2013) youtu.be/S9twUcX1Zp0
  21. part 2: industry analysis
  22. “Two Cultures” for AI. A more useful distinction: • ML is about the tools and technologies • AI is about use case impact on social systems
  23. Industry surveys for AI and Cloud adoption • “Three surveys of AI adoption reveal key advice from more mature practices”, Ben Lorica, Paco Nathan, O’Reilly Media (2019-02-20) • Episode 7, Domino: surveying “ABC” adoption in enterprise (2019-03-03)
  24. Trends: Knowledge Graphs. We found that close to one quarter of respondents were using knowledge graphs.
  25. Trends: Knowledge Graphs. Healthcare is adopting knowledge graphs more than Finance.
  26. Trends: Knowledge Graphs. Mature practices show more interest in knowledge graphs than firms that are still evaluating ML use cases.
  27. Trends: an accelerating gap in AI funding. Note: firms with an early advantage are investing more, moving still further away from the pack.
  28. Overview of Data Governance, derwen.ai/s/6fqt [architecture diagram: multiple clouds, security and compliance layers, edge inference, streaming data, durable store, DW, business analytics, and data science workflows, with data governance touching each component] We noted a resurgence in data governance; this report examines key themes, vendors, issues, etc.
  29. Unpacking AutoML, derwen.ai/s/yvkg [diagram: meta-learning, feature selection, hyperparameter optimization, model selection, auto scaling, and feature engineering across the data prep / train / evaluate / integrate-deploy workflow on a data platform] We noted an uptick in adoption for a third aspect, co-evolving along with DG and MLOps.
  30. Data Gov dovetails with MLOps and AutoML [same workflow diagram, annotated: data gov trends augment AutoML; data gov practices follow MLOps]
  31. Emerging category: watch the “AI Natives”. Projects (mostly OSS) that leverage knowledge graphs of metadata about datasets and their usage: • Amundsen @ Lyft: data discovery and metadata • Databook @ Uber: manage metadata about datasets (pending OSS) • Marquez @ WeWork, Stitch Fix: collect, aggregate, visualize metadata • Data Hub @ LinkedIn: data discovery and lineage • Metacat @ Netflix: data discovery, metadata service • Dataportal @ Airbnb: integrated data-space (not OSS)
  32. part 3: case study – rich context
  33. Administrative Data Research Facility. Coleridge Initiative, Julia Lane, et al., NYU Wagner • FedRAMP-compliant ADRF framework on AWS GovCloud: “public agency capacity to accelerate the effective use of new datasets” • for research projects using cross-agency sensitive data, in US and EU (and UK) – now in use by 15+ agencies • cited as the first federal example of Secure Access to Confidential Data in the final report of the Commission on Evidence-Based Policymaking • augments Data Stewardship practices; collaboration with Project Jupyter on the related data gov features • funding by Schmidt Futures, Sloan, Overdeck
  34. ADRF and Rich Context. Coleridge Initiative, Julia Lane, et al., NYU Wagner • Rich Context: knowledge graph of metadata about datasets, used for entity linking, link prediction, recommendations, etc. • benefits: agencies, researchers, publishers, data stewards, data providers – see white paper • ongoing ML competition for linking research publications with dataset attribution (first comp. won by Allen AI) • see “Human-in-the-loop AI for scholarly infrastructure” • upcoming book: Rich Search and Discovery for Research Datasets: Building the next generation of scholarly infrastructure
  35. AI for Scholarly Infrastructure. Rich Context overall scope [diagram: (1) leaderboard competition, (2) publisher use cases, (3) HITL via RePEc, etc.; models infer links over a corpus of research pubs, authors accept/reject links, leaderboard evals yield results as inferred linked data] • collaboration with SAGE Pub, Digital Science, RePEc, etc.; partnering with Bundesbank (EU) • knowledge graph vocabulary integrates W3C metadata standards: DCAT, PAV, DCMI, CITO, FaBiO, FOAF, etc. • data as a strategic asset: knowledge graph produces an open corpus for the leaderboard competition • human-in-the-loop AI used to infer metadata, then confirm with authors via RePEc, etc. • adjacent work: graph embedding, meta-learning, persistent identifiers, reproducible research
  36. Related work at Project Jupyter. Make datasets and projects top-level constructs, support metadata exchange and privacy-preserving telemetry from notebook usage (due Oct 2019): • JupyterLab Commenting and real-time collab, similar to Google Docs • JupyterLab Data Explorer: register datasets within research projects • JupyterLab Metadata Explorer: browse metadata descriptions, get recommendations through knowledge graph inference (via extension) • Data Registry (original proposal) • Telemetry (privacy-preserving, reports usage)
  37. Related work at Project Jupyter
  38. Active Learning as a data strategy, derwen.ai/s/d8b7 [diagram: Organizational Learning loop – Customers request Sales, Marketing, Service, Training; Experts decide about edge cases, providing examples; Experts learn through customer interactions and gain insights via model explanations; ML models focus Experts (e.g., weak supervision), act on decisions when possible, and explore uncertainty when needed] Teams of people + machines, leveraging the complementary strengths of both.
  39. Parting thought. In many ways, we’re at a point in the industry with graph data – particularly for use of knowledge graphs of metadata about dataset usage – which resembles conditions immediately before “Web 2.0” became big news. The emerging category of “AI natives” projects mentioned earlier could be parlayed into data utilities more flexible than the AI services which the current hyperscalers are fielding. Watch this space.
  40. Just Enough Math • Rich Context • Hylbert-Speys • Themes + Confs per Pacoid – publications, interviews, conference summaries… https://derwen.ai/paco @pacoid
