Latent Semantic Analysis of Wikipedia with Spark

These are the slides from my talk for Text by the Bay in April 2015.

1© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with Spark
Sandy Ryza | Senior Data Scientist

Me
• Data scientist at Cloudera
• Recently led Cloudera’s Apache Spark development
• Author of Advanced Analytics with Spark

Latent Semantic Analysis
• Fancy name for applying a matrix decomposition (SVD) to text data

Parse Raw Data → Clean → Term-Document Matrix → SVD → Interpret Results

Wikipedia Content Data Set
• http://dumps.wikimedia.org/enwiki/latest/
• XML-formatted
• 46 GB uncompressed

<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>584215651</id>
<parentid>584213644</parentid>
<timestamp>2013-12-02T15:14:01Z</timestamp>
<contributor>
<username>AnomieBOT</username>
<id>7611264</id>
</contributor>
<comment>Rescuing orphaned refs ("autogenerated1" from rev
584155010; "bbc" from rev 584155010)</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|
Anarchist (comics)}}
{{Redirect|Anarchists}}
{{pp-move-indef}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] that advocates [[stateless society|
stateless societies]] often defined as [[self-governance|self-governed]] voluntary
institutions,<ref>"ANARCHISM, a social philosophy that rejects
authoritarian government and maintains that voluntary institutions are best suited
to express man's natural social tendencies." George Woodcock.
"Anarchism" at The Encyclopedia of Philosophy</ref><ref>
"In a society developed on these lines, the voluntary associations which
already now begin to cover all the fields of human activity would take a still
greater extension so as to substitute
...

import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._

val path = "hdfs:///user/ds/wikidump.xml"
val conf = new Configuration()
// Treat everything between <page> and </page> as one record
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
// sc is the SparkContext
val kvs = sc.newAPIHadoopFile(path,
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
// Keep only the XML text of each page
val rawXmls = kvs.map(p => p._2.toString)

Lemmatization
“the boy’s cars are different colors”
“the boy car be different color”

CoreNLP
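import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP

def createNLPPipeline(): StanfordCoreNLP = {
  val props = new Properties()
  // Tokenize, split sentences, tag parts of speech, then lemmatize
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  new StanfordCoreNLP(props)
}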

Stop Words
“the boy car be different color”
“boy car different color”
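
A minimal sketch of the stop-word step, using plain Scala collections rather than Spark. The stop-word list and the names `StopWordFilter` and `removeStopWords` are made up for illustration; a real pipeline would load a much longer list.

```scala
object StopWordFilter {
  // Hypothetical, deliberately tiny stop-word list (assumption, not from the talk)
  val stopWords: Set[String] = Set("the", "be", "a", "an", "of", "and", "to", "in")

  // Drop every lemma that appears in the stop-word set
  def removeStopWords(lemmas: Seq[String]): Seq[String] =
    lemmas.filterNot(stopWords.contains)

  def main(args: Array[String]): Unit = {
    // The lemmatized sentence from the slide
    val lemmas = "the boy car be different color".split(" ").toSeq
    println(removeStopWords(lemmas).mkString(" "))
    // prints: boy car different color
  }
}
```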

Term-Document Matrix
Terms (columns): Tail, Monkey, Algorithm, Scala
Document 1: 1.5, 1.8
Document 2: 2.0, 4.3
Document 3: 1.4, 6.7
Document 4: 1.6
Document 5: 1.2

tf-idf
• (Term Frequency) * (Inverse Document Frequency)
• tf(document, word) = # times word appears in document
• idf(word) = 1 / (# documents that contain word)
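
The two definitions above can be sketched on in-memory collections as follows. This is an illustration only, not the Spark code from the talk: the object and function names are made up, and it uses the simplified idf exactly as written on the slide (many implementations take a logarithm of the document-count ratio instead).

```scala
object TfIdfSketch {
  // tf(document, word) = number of times word appears in document
  def termFrequencies(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

  // idf(word) = 1 / (number of documents that contain word)
  def inverseDocFrequencies(docs: Seq[Seq[String]]): Map[String, Double] =
    docs.flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, occurrences) => term -> 1.0 / occurrences.size }

  // One sparse row (term -> weight) per document
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val idf = inverseDocFrequencies(docs)
    docs.map(doc => termFrequencies(doc).map { case (term, tf) => term -> tf * idf(term) })
  }

  def main(args: Array[String]): Unit = {
    // Two toy documents, already lemmatized and stripped of stop words
    val docs = Seq(
      Seq("boy", "car", "different", "color"),
      Seq("car", "car", "engine"))
    // "car" is in both documents, so idf = 0.5: weight 0.5 in the first, 2 * 0.5 = 1.0 in the second
    tfIdf(docs).foreach(println)
  }
}
```

Each resulting map corresponds to one row of the term-document matrix; terms that occur in fewer documents get larger weights.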

val rowVectors: RDD[Vector] = ...

Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• U is m x n
• S is n x n
• V is n x n

Low Rank Approximation
• Account for synonymy by condensing related terms.
• Account for polysemy by placing less weight on terms that have multiple meanings.
• Throw out noise.
SVD can find the rank-k approximation that has the lowest Frobenius distance from the
original matrix.
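
In symbols, the statement about the Frobenius distance is the Eckart-Young theorem. The notation below is an assumption, not from the slides: A is the m x n tf-idf matrix, sigma_i are its singular values in decreasing order, and u_i, v_i are the matching singular vectors.

```latex
A_k \;=\; \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top},
\qquad
A_k \;=\; \operatorname*{arg\,min}_{\operatorname{rank}(B) \le k} \; \lVert A - B \rVert_F ,
\qquad
\lVert A - A_k \rVert_F^2 \;=\; \sum_{i > k} \sigma_i^2 .
```

Keeping only the k largest singular values therefore discards exactly the directions that contribute least to the matrix.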

Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• k = # concepts
• U is m x k
• S is k x k
• V is k x n (Spark's svd.V stores the transpose, n x k)

[Diagram: the matrices U, S, and V, annotated with "Docs" and "Terms"]

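import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Cache the rows: computeSVD makes multiple passes over the data
rowVectors.cache()
val mat = new RowMatrix(rowVectors)
val k = 1000  // number of concepts to keep
val svd = mat.computeSVD(k, computeU = true)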

What are the top “concepts”?
I.e. what dimensions in term-space and document-space explain most of the variance of
the data?

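def topTermsInConcept(concept: Int, numTerms: Int): Seq[(Double, Int)] = {
  // svd.V is an n x k local matrix (terms x concepts), stored column-major
  val v = svd.V
  val offset = concept * v.numRows
  // Weights of all terms in this concept, paired with their term index
  val termWeights = v.toArray.slice(offset, offset + v.numRows).zipWithIndex
  val sorted = termWeights.sortBy(-_._1)
  // Returns (weight, term index) pairs, highest weight first
  sorted.take(numTerms)
}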

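def topDocsInConcept(concept: Int, numDocs: Int): Seq[(Double, Long)] = {
  // svd.U is a distributed m x k matrix (docs x concepts)
  val u = svd.U
  // Weight of every document in this concept, paired with a unique row id
  val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId()
  // Returns (weight, doc id) pairs, highest weight first
  docWeights.top(numDocs)
}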

Concept 1
Terms: department, commune, communes, insee, france, see, also,
southwestern, oise, marne, moselle, manche, eure, aisne, isère
Docs: Communes in France; Saint-Mard, Meurthe-et-Moselle; Saint-Firmin, Meurthe-et-Moselle; Saint-Clément, Meurthe-et-Moselle; Saint-Sardos, Lot-et-Garonne; Saint-Urcisse, Lot-et-Garonne; Saint-Sernin, Lot-et-Garonne; Saint-Robert, Lot-et-Garonne; Saint-Léon, Lot-et-Garonne; Saint-Astier, Lot-et-Garonne

Concept 2
Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum,
snail, database, natural, find, geometridae, reference, museum, noctuidae
Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini,
Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus),
Alpina (moth), Arycanda (moth)

Querying
• Given a set of terms, find the closest documents in the latent space
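
One common way to define "closest" in the latent space is cosine similarity. The sketch below is an illustration on plain Scala arrays, not the Spark code from the talk: the function names, the two-dimensional concept vectors, and the document titles are all made up.

```scala
object LatentQuerySketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Cosine of the angle between two concept-space vectors
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Rank documents by cosine similarity to the query vector, best first
  def closestDocs(
      query: Array[Double],
      docs: Map[String, Array[Double]],
      n: Int): Seq[(String, Double)] =
    docs.toSeq
      .map { case (title, vec) => (title, cosine(query, vec)) }
      .sortBy(-_._2)
      .take(n)

  def main(args: Array[String]): Unit = {
    // Hypothetical document vectors in a 2-concept latent space
    val docs = Map(
      "Fir" -> Array(0.9, 0.1),
      "Radiohead" -> Array(0.1, 0.9),
      "Pine" -> Array(0.8, 0.3))
    // Hypothetical query that lies entirely along the first concept
    val query = Array(1.0, 0.0)
    closestDocs(query, docs, 2).foreach(println)
  }
}
```

With these toy vectors the query ranks "Fir" first and "Pine" second, because both point mostly along the first concept.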

Reconstructed Matrix (U * S * V)
[Diagram: matrix with a Doc axis and a Term axis]

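import breeze.linalg.{DenseMatrix => BDenseMatrix, DenseVector => BDenseVector}

// row, multiplyByDiagonalMatrix, and rowsNormalized are helper functions not shown on the slide
def topTermsForTerm(
    normalizedVS: BDenseMatrix[Double],
    termId: Int): Seq[(Double, Int)] = {
  // Row of VS for the query term
  val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray)
  // Rows are normalized, so each dot product is a cosine similarity
  val termScores = (normalizedVS * rowVec).toArray.zipWithIndex
  termScores.sortBy(-_._1).take(10)
}

val VS = multiplyByDiagonalMatrix(svd.V, svd.s)
val normalizedVS = rowsNormalized(VS)
topTermsForTerm(normalizedVS, id)  // id: index of the query term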

Term Similarity
printRelevantTerms("radiohead")
radiohead 0.9999999999999993
lyrically 0.8837403315233519
catchy 0.8780717902060333
riff 0.861326571452104
lyricsthe 0.8460798060853993
lyric 0.8434937575368959
upbeat 0.8410212279939793

Term Similarity
printRelevantTerms("algorithm")
algorithm 1.000000000000002
heuristic 0.8773199836391916
compute 0.8561015487853708
constraint 0.8370707630657652
optimization 0.8331940333186296
complexity 0.823738607119692
algorithmic 0.8227315888559854

(algorithm,1.000000000000002), (heuristic,0.8773199836391916),
(compute,0.8561015487853708), (constraint,0.8370707630657652),
(optimization,0.8331940333186296), (complexity,0.823738607119692),
(algorithmic,0.8227315888559854), (iterative,0.822364922633442),
(recursive,0.8176921180556759), (minimization,0.8160188481409465)

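import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// row is a helper function not shown on the slide
def topDocsForTerm(US: RowMatrix, V: Matrix, termId: Int): Seq[(Double, Long)] = {
  // Row of V for this term: its weights across the k concepts
  val termRowArr = row(V, termId).toArray
  val termRowVec = Matrices.dense(termRowArr.length, 1, termRowArr)
  // Score every document against the term's concept vector
  val docScores = US.multiply(termRowVec)
  val allDocWeights = docScores.rows.map(_.toArray(0)).zipWithUniqueId()
  // Returns (score, doc id) pairs, highest score first
  allDocWeights.top(10)
}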

Document Similarity
printRelevantDocs("fir")
Silver tree 0.006292909647173194
See the forest for the trees 0.004785047583508223
Eucalyptus tree 0.004592837783089319
Sequoia tree 0.004497446632469554
Willow tree 0.004429936059594164
Coniferous tree 0.004381572286629475
Tulip Tree 0.004374705020233878

More detail?
• https://github.com/sryza/aas/tree/master/ch06-lsa
• https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html

Thank you
@sandysifting

Contenu connexe

Tendances

ML-Ops how to bring your data science to production

ML-Ops how to bring your data science to production

ML-Ops how to bring your data science to production

Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and not such a challenge anymore. But if you want to be able to react fast, with minimal latency, you can not afford to first store the data and doing the analysis/analytics later. You have to be able to include part of your analytics right after you consume the event streams. Products for doing event processing, such as Oracle Event Processing or Esper, are avaialble for quite a long time and also used to be called Complex Event Processing (CEP). In the last 3 years, another family of products appeared, mostly out of the Big Data Technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming, Apache Samza as well as supporting infrastructures such as Apache Kafka. In this talk I will present the theoretical foundations for Event and Stream Processing and present what differences you might find between the more traditional CEP and the more modern Stream Processing solutions and show that a combination of both will bring the most value.

Introduction to Streaming Analytics

Introduction to Streaming Analytics

Introduction to Streaming Analytics

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. This session presents the main concepts of Kafka and Storm and then shows how a simple stream processing application is implemented using these two technologies.

Kafka and Storm - event processing in realtime

Kafka and Storm - event processing in realtime

Kafka and Storm - event processing in realtime

eBay Architecture

eBay Architecture

eBay Architecture

Search is everywhere, and therefore so is Apache Lucene. While providing amazing out-of-the-box defaults, there’s enough projects weird enough to require custom search scoring and ranking. In this talk, I’ll walk through how to use Lucene to implement your custom scoring and search ranking. We’ll see how you can achieve both amazing power (and responsibility) over your search results. We’ll see the flexibility of Lucene’s data structures and explore the pros/cons of custom Lucene scoring vs other methods of improving search relevancy.

Hacking Lucene for Custom Search Results

Hacking Lucene for Custom Search Results

Hacking Lucene for Custom Search Results

OpenSource Connections

The World of StackOverflow

The World of StackOverflow

The World of StackOverflow

Deep Dive into Azure Data Factory v2

Deep Dive into Azure Data Factory v2

Deep Dive into Azure Data Factory v2

Real-time Data Processing Using AWS Lambda

Real-time Data Processing Using AWS Lambda

Real-time Data Processing Using AWS Lambda

Amazon Web Services

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf

Stavros Kontopoulos

Mass Migrations to AWS

Mass Migrations to AWS

Mass Migrations to AWS

Amazon Web Services

Dapr.io is an open source product, originated from Microsoft and embraced by a broad coalition of cloud suppliers (part of CNFC) and open source projects. Dapr is a runtime framework that can support any application and that especially shines with distributed applications - for example microservices - that run in containers, spread over clouds and / or edge devices. With Dapr you give an application a "sidecar" - a kind of personal assistant that takes care of all kinds of common responsibilities. Capturing and retrieving state, publishing and consuming messages or events. Reading secrets and configuration data. Shielding and load balancing over service endpoints. Calling and subscribing to all kinds of SaaS and PaaS facilities. Logging traces across all kinds of application components and logically routing calls between microservices and other application components. Dapr provides generic APIs to the application (HTTP and gRPC) for calling all these generic services – and provides implementations of these APIs for all public clouds and dozens of technology components. This means that your application can easily make use of a wide range of relevant features - with a strict separation between the language the application uses for this (generic, simple) and the configuration of the specific technology (e.g. Redis, MySQL, CosmosDB, Cassandra, PostgreSQL, Oracle Database, MongoDB, Azure SQL etc) that the Dapr sidecar uses. Changing technology does not affect the application, but affects the configuration of the Sidecar. Dapr can be used from applications in any technology - from Java and C#/.NET to Go, Python, Node, Rust and PHP. Or whatever can talk HTTP (or gRPC). In this Code Café I will introduce you to Dapr.io. I will show you what Dapr can do for you (application) and how you can Dapr-izen an application. 
I'll show you how an asynchronously collaborative system of microservices - implemented in different technologies - can be easily connected to Dapr, first to Redis as a Pub/Sub mechanism and then also to Apache Kafka without modifications. Then we do - with the interested parties - also a hands-on in which you will apply Dapr yourself . In a short time you get a good feel for how you can use Dapr for different aspects of your applications. And if nothing else, Dapr is a very easy way to get your code with Kafka, S3, Redis, Azure EventGrid, HashiCorp Consul, Twillio, Pulsar, RabbitMQ, HashiCorp Vault, AWS Secret Manager, Azure KeyVault, Cron, SMTP, Twitter, AWS SQS & SNS, GCP Pub/Sub and dozens of other technology components talk.

Introducing Dapr.io - the open source personal assistant to microservices and...

Introducing Dapr.io - the open source personal assistant to microservices and...

Introducing Dapr.io - the open source personal assistant to microservices and...

Apache Kafka in the Transportation and Logistics

Apache Kafka in the Transportation and Logistics

Apache Kafka in the Transportation and Logistics

Seldon: Deploying Models at Scale

Seldon: Deploying Models at Scale

Seldon: Deploying Models at Scale

Neo4j y GenAI

We will present the design and evolution of Nvidia's 100% Self-Service Streaming Big-Data Platform (ETL, Analytics, AI Training & Inferencing) powered by Spark and Nvidia GPUs. We will discuss the architecture, major challenges that we faced, and lessons learned along the way. Nvidia's data platform processes 10's of billions of events per day, supporting several Nvidia products like GPU Cloud, GeForce NOW Cloud Gaming, AI Smart Cities, DriveSim for Self Driving cars etc. In this talk, we are going to deep dive on Nvidia's next generation data platform with new custom built frameworks, automation tools, and a monitoring system on top of Spark. Thus empowering our developers to build new Spark-powered applications at the speed of light (SOL) with full self-service unified data flows. We will showcase these new tools : a) Zero-engineering dashboards, b) Out-of-the box Spark Streaming applications with automated schema management, c) Custom Spark Streaming to Elastic search connector with enhanced security, d) GDPR compliant SQL access control and auditing with a new custom token management framework, e) Migration from logstash clusters to Spark Streaming for log parsing, etc. We will discuss how decoupling Data-Platform and Applications helped us achieve the next level of scale, self-service, and, security. Finally, we will demo our Platform's App-Store, where developers can shop for new Apps and deploy them with ease - with automated dashboards, streaming ETL, analytics, monitoring, AI training and inferencing. Extended Description: With structured telemetry events and unstructured logs growing at 1000% rate year-over-year, it is extremely important to handle this scale with strict SLAs and high reliability while maintaining extremely low latency. We will discuss how we handled these scaling & security concerns to solve business requirements. Additionally, we will be open-sourcing some of our custom spark frameworks during the talk. 
Speakers: Satish Dandu, Rohit Kulkarni

A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...

A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...

A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...

Apache Kafka is the de facto standard for data streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This session explores the DOs and DONTs. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka. No matter if you think about open source Apache Kafka, a cloud service like Confluent Cloud, or another technology using the Kafka protocol like Redpanda or Pulsar, check out this slide deck. A detailed article about this topic: https://www.kai-waehner.de/blog/2022/01/04/when-not-to-use-apache-kafka/

When NOT to use Apache Kafka?

When NOT to use Apache Kafka?

When NOT to use Apache Kafka?

The Rise of Data in Motion in the Healthcare Industry - Use Cases, Architectures and Examples powered by Apache Kafka. Use Cases for Data in Motion in the Healthcare Industry: - Know Your Patient (= “Customer 360”) - Operations (Healthcare 4.0 including Drug R&D, Patient Care, etc.) - IT Perspective (Cybersecurity, Mainframe Offload, Hybrid Cloud, Streaming ETL, etc) Real-world examples include Covid-19 Electronic Lab Reporting, Cerner, Optum, Centene, Humana, Invitae, Bayer, Celmatix, Care.com.

Apache Kafka in the Healthcare Industry

Apache Kafka in the Healthcare Industry

Apache Kafka in the Healthcare Industry

In this workshop we covered an introduction to Generative AI and Large Language Models (LLMs), an explanation of AWS Foundation Models and their role in providing pre-trained LLMs, the benefits of leveraging LLMs in enterprises, deploying LLMs on AWS Infrastructure including infrastructure requirements and available AWS services and tools, and a demo showcasing Text-to-Image and Text Summarization using Foundation Models, as well as utilising Retrieval Augmented Generation and LangChain with AWS tools for Enterprise use cases. Connect with me for interesting session in future @https://www.linkedin.com/in/jayyanar/

AWS_Meetup_BLR_July_22_Social.pdf

AWS_Meetup_BLR_July_22_Social.pdf

AWS_Meetup_BLR_July_22_Social.pdf

Ayyanar Jeyakrishnan

Introduce AWS Lambda for newbie and Non-IT อธิบาย ความเป็นมาของ Serverless และ AWS Lambda คืออะไร ดีอย่างไร เพื่อให้คนไม่รู้จักและคนที่ไม่ใช่ IT ได้เข้าใจง่ายๆ Index - What's Serverless - What's AWS Lambda - Working with AWS Lambda - AWS Lambda Life-Cycle - AWS Lambda Anatomy - Beware Cold Start - How to debug - Do and Don't to implement - Pricing structure and example - Advantage/Disadvantage Presentation is English Version Blog is Thai Version : https://myifew.com/5166/understand-serverless-with-aws-lambda-for-newbie/

Introduce AWS Lambda for newbie and Non-IT

Introduce AWS Lambda for newbie and Non-IT

Introduce AWS Lambda for newbie and Non-IT

Chitpong Wuttanan

Azure App Service Deep Dive

Azure App Service Deep Dive

Azure App Service Deep Dive

Azure Riyadh User Group

Tendances (20)

ML-Ops how to bring your data science to production

ML-Ops how to bring your data science to production

ML-Ops how to bring your data science to production

Introduction to Streaming Analytics

Introduction to Streaming Analytics

Introduction to Streaming Analytics

Kafka and Storm - event processing in realtime

Kafka and Storm - event processing in realtime

Kafka and Storm - event processing in realtime

eBay Architecture

eBay Architecture

eBay Architecture

Hacking Lucene for Custom Search Results

Hacking Lucene for Custom Search Results

Hacking Lucene for Custom Search Results

The World of StackOverflow

The World of StackOverflow

The World of StackOverflow

Deep Dive into Azure Data Factory v2

Deep Dive into Azure Data Factory v2

Deep Dive into Azure Data Factory v2

Real-time Data Processing Using AWS Lambda

Real-time Data Processing Using AWS Lambda

Real-time Data Processing Using AWS Lambda

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf

Mass Migrations to AWS

Mass Migrations to AWS

Mass Migrations to AWS

Introducing Dapr.io - the open source personal assistant to microservices and...

Introducing Dapr.io - the open source personal assistant to microservices and...

Introducing Dapr.io - the open source personal assistant to microservices and...

Apache Kafka in the Transportation and Logistics

Apache Kafka in the Transportation and Logistics

Apache Kafka in the Transportation and Logistics

Seldon: Deploying Models at Scale

Seldon: Deploying Models at Scale

Seldon: Deploying Models at Scale

Neo4j y GenAI

A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...

A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...

A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...

When NOT to use Apache Kafka?

When NOT to use Apache Kafka?

When NOT to use Apache Kafka?

Apache Kafka in the Healthcare Industry

Apache Kafka in the Healthcare Industry

Apache Kafka in the Healthcare Industry

AWS_Meetup_BLR_July_22_Social.pdf

AWS_Meetup_BLR_July_22_Social.pdf

AWS_Meetup_BLR_July_22_Social.pdf

Introduce AWS Lambda for newbie and Non-IT

Introduce AWS Lambda for newbie and Non-IT

Introduce AWS Lambda for newbie and Non-IT

Azure App Service Deep Dive

Azure App Service Deep Dive

Azure App Service Deep Dive

En vedette

Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLLib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD). Bio: Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache Spark

Unsupervised Learning with Apache Spark

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro Tripathy

Latent Semanctic Analysis Auro Tripathy

Latent Semantic Indexing and Analysis

Latent Semantic Indexing and Analysis

Latent Semantic Indexing and Analysis

Mercy Livingstone

Big data app meetup 2016-06-15

Big data app meetup 2016-06-15

Big data app meetup 2016-06-15

Illia Polosukhin

Driving Healthcare Operations with Data Science

Driving Healthcare Operations with Data Science

Driving Healthcare Operations with Data Science

Latent Semantic Analysis (LSA) is an advanced form of statistical language modeling that cuts through the noise to expose contextual meaning, and is a key feature of Oracle Social Relationship Management platform. Come and learn how LSA delivers more accurate,contextualized, precise and relevant insights. Oracle SRM listening capabilities go well beyond keyword and Boolean searches to reveal insights like consumer intent, political intent, product likes/dislikes, and customer service issues.

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi

Social Media Camp

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annot...

Introduction to Probabilistic Latent Semantic Analysis

Introduction to Probabilistic Latent Semantic Analysis

Introduction to Probabilistic Latent Semantic Analysis

NYC Predictive Analytics

Algorithm ethics

Algorithm ethics

Algorithm ethics

Joerg Blumtritt

Empreendedorismo Agil

Empreendedorismo Agil

Empreendedorismo Agil

Ili structuredauthoring

Ili structuredauthoring

Ili structuredauthoring

Wanna know more about machine learning and predictive analytics? We’re thrilled to welcome Rob Craft - Group Product Manager at Google Cloud Platform during lunch break! Coming from San-Francisco, Rob Craft is the lead Product Manager for the Cloud Intelligence team in Google Cloud Platform. He is responsible for Cloud Machine Learning, Cloud Search, Internet of Things, and Cloud Pub/Sub. Rob will discuss how you can leverage the power of ML whether you have a machine learning team of your own or if you just want to use ML as a service. Google has invested deeply in machine learning for many years and is using it across its highly successful consumer businesses. Google Cloud Platform is using the power of that work, adding in powerful data management tools, support for collaborative experiments, and predictions at Google scale. Rob has spent many years focused on building great products on the leading edge of what's next. This time spans early days working on OS kernels and .Net, High Performance Computing, and transforming the cloud with machine learning.

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

Diving into Machine Learning with Rob Craft, Group Product Manager at Google!

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight Presented at SemTech Berlin 2012 Wikipedia is one of the most important repositories of human knowledge, containing millions of interlinked articles. The DBpedia project extracts and combines Wikipedia information into a large multilingual knowledge base that enables semantic processing in a wide range of applications. We have built DBpedia Spotlight, a tool that recognizes ambiguous terms in text and automatically assigns unambiguous definitions to those terms by connecting them to DBpedia. Such interconnection enriches information by providing explicit semantic relationships, enabling semantic indexing, faceted exploration, among other data processing enhancements. In this talk we will describe how DBpedia Spotlight can be applied to establish a virtuous cycle of semantic enhancement. On the one hand, it can enhance knowledge interconnectivity in document collections. On the other hand, it learns how to better annotate from user feedback. Such a positive feedback loop can be applied on Wikipedia itself, or in enterprises to alleviate the cold start problem and knowledge management costs.

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...

A Virtuous Cycle of Semantic Enhancement with DBpedia Spotlight - SemTech Ber...


Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Dernier (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf

Exploring the Future Potential of AI-Enabled Smartphone Processors

Exploring the Future Potential of AI-Enabled Smartphone Processors

Exploring the Future Potential of AI-Enabled Smartphone Processors

Tata AIG General Insurance Company - Insurer Innovation Award 2024

Tata AIG General Insurance Company - Insurer Innovation Award 2024

Tata AIG General Insurance Company - Insurer Innovation Award 2024

Artificial Intelligence: Facts and Myths

Artificial Intelligence: Facts and Myths

Artificial Intelligence: Facts and Myths

The 7 Things I Know About Cyber Security After 25 Years | April 2024

The 7 Things I Know About Cyber Security After 25 Years | April 2024

The 7 Things I Know About Cyber Security After 25 Years | April 2024

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivity

Boost PC performance: How more available memory can improve productivity

Scaling API-first – The story of a global engineering organization

Scaling API-first – The story of a global engineering organization

Scaling API-first – The story of a global engineering organization

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

A Call to Action for Generative AI in 2024

A Call to Action for Generative AI in 2024

A Call to Action for Generative AI in 2024

Automating Google Workspace (GWS) & more with Apps Script

Automating Google Workspace (GWS) & more with Apps Script

Automating Google Workspace (GWS) & more with Apps Script

GenCyber Cyber Security Day Presentation

GenCyber Cyber Security Day Presentation

GenCyber Cyber Security Day Presentation

Finology Group – Insurtech Innovation Award 2024

Finology Group – Insurtech Innovation Award 2024

Finology Group – Insurtech Innovation Award 2024

Advantages of Hiring UIUX Design Service Providers for Your Business

Advantages of Hiring UIUX Design Service Providers for Your Business

Advantages of Hiring UIUX Design Service Providers for Your Business

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx

2024: Domino Containers - The Next Step. News from the Domino Container commu...

2024: Domino Containers - The Next Step. News from the Domino Container commu...

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

The Codex of Business Writing Software for Real-World Solutions 2.pptx

The Codex of Business Writing Software for Real-World Solutions 2.pptx

The Codex of Business Writing Software for Real-World Solutions 2.pptx

🐬 The future of MySQL is Postgres 🐘

🐬 The future of MySQL is Postgres 🐘

🐬 The future of MySQL is Postgres 🐘

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Latent Semantic Analysis of Wikipedia with Spark

1. LSA-ing Wikipedia with Spark. Sandy Ryza | Senior Data Scientist. © Cloudera, Inc. All rights reserved.

2. Me
• Data scientist at Cloudera
• Recently led Cloudera’s Apache Spark development
• Author of Advanced Analytics with Spark

4. Latent Semantic Analysis
• Fancy name for applying a matrix decomposition (SVD) to text data


7. Pipeline: Raw Data → Parse → Clean → Term-Document Matrix → SVD → Interpret Results

9. Wikipedia Content Data Set
• http://dumps.wikimedia.org/enwiki/latest/
• XML-formatted
• 46 GB uncompressed

10.
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>584215651</id>
<parentid>584213644</parentid>
<timestamp>2013-12-02T15:14:01Z</timestamp>
<contributor>
<username>AnomieBOT</username>
<id>7611264</id>
</contributor>
<comment>Rescuing orphaned refs ("autogenerated1" from rev 584155010; "bbc" from rev 584155010)</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Redirect|Anarchists}}
{{pp-move-indef}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] that advocates [[stateless society|stateless societies]] often defined as [[self-governance|self-governed]] voluntary institutions,<ref>"ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies." George Woodcock. "Anarchism" at The Encyclopedia of Philosophy</ref><ref> "In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute ...


12.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io._
    import org.apache.mahout.text.wikipedia.XmlInputFormat

    val path = "hdfs:///user/ds/wikidump.xml"
    // Have XmlInputFormat emit one record per <page>...</page> element
    val conf = new Configuration()
    conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
    conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
    val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    // Keep only the record value: the raw XML of each page
    val rawXmls = kvs.map(p => p._2.toString)


14. Lemmatization: “the boy’s cars are different colors” → “the boy car be different color”

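15. CoreNLP
    import java.util.Properties
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    // Build a pipeline that tokenizes, splits sentences, tags parts of speech, and lemmatizes
    def createNLPPipeline(): StanfordCoreNLP = {
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      new StanfordCoreNLP(props)
    }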

16. Stop Words: “the boy car be different color” → “boy car different color”
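A minimal sketch of this filtering step in plain Scala. The stop-word set is a tiny hypothetical list chosen for illustration, not the list used in the talk, and `removeStopWords` is an assumed helper name.

```scala
// Hypothetical, deliberately tiny stop-word list; real lists hold hundreds of words.
val stopWords: Set[String] = Set("the", "be", "a", "an", "of", "and", "to", "in")

// Keep only the lemmas that are not stop words, preserving their order.
def removeStopWords(lemmas: Seq[String]): Seq[String] =
  lemmas.filterNot(stopWords.contains)

// The lemmatized sentence from the slide, split on spaces.
val lemmas: Seq[String] = "the boy car be different color".split(" ").toSeq
val kept: Seq[String] = removeStopWords(lemmas)
```

Here `kept` holds the four remaining lemmas from the slide: boy, car, different, color.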


18. Term-Document Matrix (term columns: Tail, Monkey, Algorithm, Scala; empty cells omitted)
Document 1: 1.5, 1.8
Document 2: 2.0, 4.3
Document 3: 1.4, 6.7
Document 4: 1.6
Document 5: 1.2

19. tf-idf
• (Term Frequency) * (Inverse Document Frequency)
• tf(document, word) = # times word appears in document
• idf(word) = 1 / (# documents that contain word)
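The definitions above can be tried on a toy corpus. This is a sketch in plain Scala (no Spark) that uses the slide's idf, 1 / (# documents that contain word); a log-scaled variant such as log(N / df) is also common. The corpus and the names `tf`, `idf`, and `tfIdf` are illustrative assumptions.

```scala
// Toy corpus: each document is assumed to be lemmatized and stop-word filtered already.
val docs: Seq[Seq[String]] = Seq(
  Seq("boy", "car", "color", "car"),
  Seq("boy", "tree"),
  Seq("tree", "color", "tree", "tree")
)

// tf(document, word) = number of times the word appears in the document.
def tf(doc: Seq[String], word: String): Int = doc.count(_ == word)

// idf(word) = 1 / (number of documents that contain the word), as on the slide.
// Words that appear nowhere get 0.0 to avoid dividing by zero.
def idf(corpus: Seq[Seq[String]], word: String): Double = {
  val df = corpus.count(_.contains(word))
  if (df == 0) 0.0 else 1.0 / df
}

def tfIdf(corpus: Seq[Seq[String]], doc: Seq[String], word: String): Double =
  tf(doc, word) * idf(corpus, word)

val carScore: Double = tfIdf(docs, docs(0), "car") // 2 * (1 / 1) = 2.0
val boyScore: Double = tfIdf(docs, docs(0), "boy") // 1 * (1 / 2) = 0.5
```

"car" scores higher than "boy" in the first document because it occurs twice there and in no other document.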

20. val rowVectors: RDD[Vector] = ...


22. Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• U is m x n
• S is n x n
• V is n x n

23. Low Rank Approximation
• Account for synonymy by condensing related terms.
• Account for polysemy by placing less weight on terms that have multiple meanings.
• Throw out noise.
SVD can find the rank-k approximation that has the lowest Frobenius distance from the original matrix.

24. Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• k = # concepts
• U is m x k
• S is k x k
• V is k x n
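A quick way to check these shapes is to multiply small factors together. The sketch below uses plain Scala arrays and made-up values (the factors are not a real decomposition and are not orthonormal); it only illustrates that an m x k, a k x k, and a k x n matrix multiply to an m x n matrix. `multiply` is an assumed helper name.

```scala
// Naive dense matrix product; fails fast if the inner dimensions disagree.
def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
  require(a.head.length == b.length, "inner dimensions must match")
  Array.tabulate(a.length, b.head.length) { (i, j) =>
    b.indices.map(x => a(i)(x) * b(x)(j)).sum
  }
}

// Made-up factors with m = 3 documents, n = 4 terms, k = 2 concepts.
val u: Array[Array[Double]] =
  Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0)) // m x k
val s: Array[Array[Double]] =
  Array(Array(2.0, 0.0), Array(0.0, 0.5)) // k x k, diagonal
val v: Array[Array[Double]] =
  Array(Array(1.0, 0.0, 1.0, 0.0), Array(0.0, 1.0, 0.0, 1.0)) // k x n

// (m x k) * (k x k) * (k x n) gives an m x n matrix: 3 rows, 4 columns.
val approx: Array[Array[Double]] = multiply(multiply(u, s), v)
```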

25. Diagram: U S V (labels: Docs, Terms)

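27.
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Cache the rows: computeSVD makes multiple passes over them
    rowVectors.cache()
    val mat = new RowMatrix(rowVectors)
    val k = 1000 // number of concepts to keep
    val svd = mat.computeSVD(k, computeU=true)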


29. What are the top “concepts”? I.e., what dimensions in term-space and document-space explain most of the variance of the data?

30. Diagram: U S V (labels: Docs, Terms)

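33.
    def topTermsInConcept(concept: Int, numTerms: Int): Seq[(String, Double)] = {
      // toArray is column-major, so the weights for one concept are a contiguous slice
      val v = svd.V
      val offset = concept * v.numRows
      val termWeights = v.toArray.slice(offset, offset + v.numRows).zipWithIndex
      val sorted = termWeights.sortBy(-_._1)
      // termIds: assumed Map[Int, String] from term index to term
      sorted.take(numTerms).map { case (weight, id) => (termIds(id), weight) }
    }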
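MLlib's local matrices return their values in column-major order from `toArray`, so the weights for one concept sit in one contiguous slice. The sketch below mimics that lookup in plain Scala without Spark; the values, the `termIds` map, and the name `topTerms` are made up for illustration.

```scala
// Column-major values of a hypothetical V with 3 terms (rows) and 2 concepts (columns).
val numTermRows = 3
val vValues: Array[Double] = Array(
  0.9, 0.1, 0.4, // concept 0: weights for terms 0, 1, 2
  0.2, 0.8, 0.5  // concept 1: weights for terms 0, 1, 2
)
// Hypothetical mapping from term index to term string.
val termIds: Map[Int, String] = Map(0 -> "commune", 1 -> "genus", 2 -> "species")

// Take the slice for one concept, sort by weight descending, and attach term strings.
def topTerms(concept: Int, n: Int): Seq[(String, Double)] = {
  val offset = concept * numTermRows
  vValues.slice(offset, offset + numTermRows).toSeq.zipWithIndex
    .sortBy { case (weight, _) => -weight }
    .take(n)
    .map { case (weight, id) => (termIds(id), weight) }
}
```

For example, `topTerms(1, 2)` reads the second slice and ranks "genus" ahead of "species".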

34.
    def topDocsInConcept(concept: Int, numDocs: Int): Seq[(Double, Long)] = {
      val u = svd.U
      // Pair each document's weight for this concept with a unique row id
      val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId()
      // Highest-weighted documents first
      docWeights.top(numDocs)
    }

35. Concept 1
Terms: department, commune, communes, insee, france, see, also, southwestern, oise, marne, moselle, manche, eure, aisne, isère
Docs: Communes in France; Saint-Mard, Meurthe-et-Moselle; Saint-Firmin, Meurthe-et-Moselle; Saint-Clément, Meurthe-et-Moselle; Saint-Sardos, Lot-et-Garonne; Saint-Urcisse, Lot-et-Garonne; Saint-Sernin, Lot-et-Garonne; Saint-Robert, Lot-et-Garonne; Saint-Léon, Lot-et-Garonne; Saint-Astier, Lot-et-Garonne

36. Concept 2
Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum, snail, database, natural, find, geometridae, reference, museum, noctuidae
Docs: Chelonia (genus); Palea (genus); Argiope (genus); Sphingini; Cribrilinidae; Tahla (genus); Gigartinales; Parapodia (genus); Alpina (moth); Arycanda (moth)

37. Querying
• Given a set of terms, find the closest documents in the latent space

38. Diagram: Reconstructed Matrix (U * S * V), with Doc and Term axes

39.
    import breeze.linalg.{DenseMatrix => BDenseMatrix, DenseVector => BDenseVector}

    // row, multiplyByDiagonalMatrix, and rowsNormalized are helpers not shown on the slide
    def topTermsForTerm(normalizedVS: BDenseMatrix[Double], termId: Int): Seq[(Double, Int)] = {
      val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray)
      // Rows are unit length, so each entry of the product is a cosine similarity
      val termScores = (normalizedVS * rowVec).toArray.zipWithIndex
      termScores.sortBy(-_._1).take(10)
    }

    val VS = multiplyByDiagonalMatrix(svd.V, svd.s)
    val normalizedVS = rowsNormalized(VS)
    topTermsForTerm(normalizedVS, id)
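The ranking above rests on one fact: once every row is scaled to unit length, the dot product of two rows is their cosine similarity. A minimal sketch of that idea in plain Scala, without Breeze or Spark; the 2-concept term vectors and the names `normalize`, `dot`, and `topTermsFor` are hypothetical stand-ins for rows of V * S.

```scala
// Scale a vector to unit length; the zero vector is returned unchanged.
def normalize(v: Array[Double]): Array[Double] = {
  val norm = math.sqrt(v.map(x => x * x).sum)
  if (norm == 0.0) v else v.map(_ / norm)
}

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Hypothetical term vectors in a 2-concept latent space.
val termVectors: Map[String, Array[Double]] = Map(
  "algorithm" -> Array(3.0, 4.0),
  "heuristic" -> Array(6.0, 8.0),
  "moth"      -> Array(-4.0, 3.0)
)
val normalized: Map[String, Array[Double]] =
  termVectors.map { case (term, vec) => term -> normalize(vec) }

// Rank every term by cosine similarity to the query term, best first.
def topTermsFor(term: String, n: Int): Seq[(String, Double)] = {
  val query = normalized(term)
  normalized.toSeq
    .map { case (other, vec) => (other, dot(query, vec)) }
    .sortBy(-_._2)
    .take(n)
}
```

In this toy data "heuristic" points in the same direction as "algorithm" and scores about 1, while "moth" is orthogonal and scores about 0. Floating-point rounding is why a term's similarity to itself can print as a value slightly off 1.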

40. Term Similarity: printRelevantTerms("radiohead")
radiohead 0.9999999999999993
lyrically 0.8837403315233519
catchy 0.8780717902060333
riff 0.861326571452104
lyricsthe 0.8460798060853993
lyric 0.8434937575368959
upbeat 0.8410212279939793

41. Term Similarity: printRelevantTerms("algorithm")
algorithm 1.000000000000002
heuristic 0.8773199836391916
compute 0.8561015487853708
constraint 0.8370707630657652
optimization 0.8331940333186296
complexity 0.823738607119692
algorithmic 0.8227315888559854

42.
(algorithm,1.000000000000002),
(heuristic,0.8773199836391916),
(compute,0.8561015487853708),
(constraint,0.8370707630657652),
(optimization,0.8331940333186296),
(complexity,0.823738607119692),
(algorithmic,0.8227315888559854),
(iterative,0.822364922633442),
(recursive,0.8176921180556759),
(minimization,0.8160188481409465)

43.
    def topDocsForTerm(US: RowMatrix, V: Matrix, termId: Int): Seq[(Double, Long)] = {
      // Row of V for the query term, as a single-column matrix (row is a helper not shown on the slide)
      val termRowArr = row(V, termId).toArray
      val termRowVec = Matrices.dense(termRowArr.length, 1, termRowArr)
      // One score per document: the matching row of U*S times the term vector
      val docScores = US.multiply(termRowVec)
      val allDocWeights = docScores.rows.map(_.toArray(0)).zipWithUniqueId()
      allDocWeights.top(10)
    }

44. Document Similarity: printRelevantDocs("fir")
Silver tree 0.006292909647173194
See the forest for the trees 0.004785047583508223
Eucalyptus tree 0.004592837783089319
Sequoia tree 0.004497446632469554
Willow tree 0.004429936059594164
Coniferous tree 0.004381572286629475
Tulip Tree 0.004374705020233878

45. More detail?
• https://github.com/sryza/aas/tree/master/ch06-lsa
• https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html

46. Thank you. @sandysifting