About
1.  CEO of DevRain Solutions – software development
(specialization: Windows Phone and Windows 8).
2.  Microsoft Regional Director.
3.  Microsoft Windows Phone Most Valuable Professional.
4.  Telerik Most Valuable Professional.
5.  Best Professional in Software Architecture (Ukrainian IT
Award).
6.  Ph.D.
7.  Speaker and IT blogger.
#1: A lot of information
1.  The “no information”
problem has turned into
the “too much
information” problem.
2.  The amount of information
grows geometrically
every year.
3.  Big data.
#2: Duplicates
1.  Pages differ in chrome,
not in content.
2.  Copywriting and
plagiarism.
3.  Partially solved for news.
#3: Information waste
1.  Level 1: noisy information such as
advertisement, copyright, decoration, etc.
2.  Level 2: useful information, but not very
relevant to the topic of the page, such as
navigation, directory, etc.
3.  Level 3: relevant information to the theme
of the page, but not with prominent
importance, such as related topics, topic
index, etc.
4.  Level 4: the most prominent part of the
page, such as headlines, main content,
etc.
#4: Searching time
Every second user views
5-10 pages to find the
information they need.
My record: 8 hours of
uninterrupted search. Found on
the 23rd page on MSN.
#5: Domain
“Snow Leopard”
Can be a “cat” or an “operating
system” from Apple.
Solutions?
Data Mining – intelligent analysis of large amounts of data
•  clustering, association rules, GA, ant colony optimization, visualization,
decision trees, neural networks.
R&D – new algorithms, methods
•  Microsoft Research, Yahoo! Research, Google Labs, Arc90 Lab and
others.
Let’s mix!
#01: A lot of information
1.  Filtering not ranking
2.  Clustering and categorization
3.  Semantic web
#02: Duplicates. NLP
1.  Readability score
2.  NER (DBpedia Spotlight,
Reuters OpenCalais)
3.  WordNet
4.  Shingles
Shingles
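The shingles idea can be sketched as follows: each text is split into overlapping runs of w words (shingles), and two texts are compared by the Jaccard similarity of their shingle sets. This is a minimal sketch, not the speaker's implementation; the function names, the default w=3 and the tokenization rule are assumptions made for illustration.

```python
import re


def shingles(text, w=3):
    """Return the set of w-word shingles (overlapping word n-grams) of text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard similarity of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)
```

A pair of pages whose resemblance exceeds a chosen threshold can be treated as near-duplicates. Production systems usually hash the shingles and compare only a sample of the hashes to keep the comparison cheap.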
#03: Information waste
Readability
An Arc90 Lab
Readability turns any web page
into a clean view for reading now or
later on your computer,
smartphone, or tablet.
https://www.readability.com
Vision-based Page Segmentation Algorithm
Presents an automatic top-down,
tag-tree independent approach to
detect web content structure. It
simulates how a user understands
web layout structure based on his
visual perception.
Based on DOM structure analysis
and subjective rules.
http://research.microsoft.com/apps/pubs/default.aspx?id=70027
Vision-based Page Segmentation Algorithm
Different pages have different
visual margins, so the quality of
segmentation depends on the
particular web page.
If the comments are longer than
the main content (e.g. habrahabr),
the result will not be very
precise.
Learning Important Models
1.  Spatial Features
{BlockCenterX, BlockCenterY, BlockRectWidth,
BlockRectHeight}
2.  Content features
{FontSize, FontWeight, InnerTextLength,
InnerHtmlLength, ImgNum, ImgSize, LinkNum,
LinkTextLength, InteractionNum,
InteractionSize, FormNum, FormSize,
OptionNum, OptionTextLength, TableNum,
ParaNum}
http://www.sigkdd.org/sites/default/files/issues/6-2-2004-12/2-song.pdf
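A few of the content features listed above can be counted directly from a block's HTML. The sketch below uses Python's standard html.parser and covers only a subset of the features; the class and function names are hypothetical, and the spatial features are left out because they require a rendered layout.

```python
from html.parser import HTMLParser


class BlockFeatureCounter(HTMLParser):
    """Count a subset of the content features for one HTML block."""

    # Map tag names to the feature each one increments.
    _TAG_TO_FEATURE = {"a": "LinkNum", "img": "ImgNum", "form": "FormNum",
                       "option": "OptionNum", "table": "TableNum", "p": "ParaNum"}

    def __init__(self):
        super().__init__()
        self.features = {
            "InnerTextLength": 0, "LinkNum": 0, "LinkTextLength": 0,
            "ImgNum": 0, "FormNum": 0, "OptionNum": 0,
            "TableNum": 0, "ParaNum": 0,
        }
        self._open_links = 0  # number of currently open <a> elements

    def handle_starttag(self, tag, attrs):
        feature = self._TAG_TO_FEATURE.get(tag)
        if feature:
            self.features[feature] += 1
        if tag == "a":
            self._open_links += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._open_links:
            self._open_links -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.features["InnerTextLength"] += n
        if self._open_links:
            self.features["LinkTextLength"] += n  # text inside a link


def content_features(block_html):
    """Return a dict of feature counts for the given HTML fragment."""
    parser = BlockFeatureCounter()
    parser.feed(block_html)
    parser.close()
    return parser.features
```

A block with a high LinkTextLength relative to InnerTextLength is typically navigation rather than main content, which is the kind of signal an importance model can learn from.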
Semantic and SEO
1.  Semantic tags (article,
aside, footer, header etc.)
2.  SEO (meta description,
keywords)
3.  Microformats (RSS,
hCalendar, hCard, etc.)
4.  CMS, common engines and
social networks.
SeoRank
1.  Title 2 text.
2.  Meta keywords 2 text.
3.  Headers 2 text.
4.  Meta description 2 text.
5.  WordsIndex, SentencesIndex,
WordsInSentencesIndex,
LinksIndex, WordsAsLinksIndex,
ImgsIndex, ImgsAsLinksIndex etc.
Regression model
1.  Detect valuable properties.
2.  Select model type (linear).
3.  Regression analysis then
yields the content importance
model:
$$y = 0.324\,x_{3} - 0.249\,x_{5} - 0.008\,x_{6} + 0.056\,x_{7} - 0.594\,x_{12} - 0.267\,x_{14} + 0.002\,x_{16} + 0.305\,x_{17}$$
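A linear model of this kind is applied by multiplying each feature value by its coefficient and summing the results; the block with the highest score is taken as the main content. The sketch below shows only that scoring step. The weights, feature names and block values are hypothetical and are not the coefficients from the slide.

```python
def importance(features, weights, intercept=0.0):
    """Linear model: intercept + sum of weight * feature value."""
    return intercept + sum(w * features.get(name, 0.0)
                           for name, w in weights.items())


def main_block(blocks, weights):
    """Return the id of the block with the highest predicted importance."""
    return max(blocks, key=lambda block_id: importance(blocks[block_id], weights))


# Hypothetical weights and feature values, for illustration only.
WEIGHTS = {"InnerTextLength": 0.01, "LinkNum": -0.5, "ParaNum": 0.3}
BLOCKS = {
    "nav": {"InnerTextLength": 120, "LinkNum": 15, "ParaNum": 0},
    "article": {"InnerTextLength": 2400, "LinkNum": 4, "ParaNum": 9},
    "footer": {"InnerTextLength": 80, "LinkNum": 6, "ParaNum": 1},
}

if __name__ == "__main__":
    print(main_block(BLOCKS, WEIGHTS))
```

In practice the weights would come from fitting the regression on pages whose main block has been labeled by hand.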
SmartBrowser
Software for
determining the most
relevant content of
the HTML pages.
http://smartbrowser.codeplex.com/
Search optimal path
1.  Graph analysis (similar
pages, clustering and
categorization).
2.  Ant simulations (search
optimal path using complex
criterion).
http://touchgraph.com/TGGoogleBrowser.html
http://walk2web.com
Ant algorithm
The ant colony algorithm is an algorithm
for finding optimal paths that is based on
the behavior of ants searching for food.
Because the ant-colony works on a very
dynamic system, the ant colony algorithm
works very well in graphs with changing
topologies. Examples of such systems
include computer networks, and artificial
intelligence simulations of workers.
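The behavior described above can be sketched on a small weighted graph: each ant walks from the source toward the target, choosing edges in proportion to their pheromone level and inverse length; pheromone evaporates every iteration and is reinforced along completed paths. This is a simplified illustration, not the speaker's code; the function name, the parameter defaults and the pheromone floor are assumptions.

```python
import random


def ant_colony_shortest_path(graph, source, target, n_ants=20, n_iters=30,
                             evaporation=0.5, alpha=1.0, beta=2.0, seed=0):
    """Search for a short source->target path in a weighted graph.

    graph maps each node to a dict {neighbor: positive edge length}.
    Returns (path, length), or (None, inf) if no ant reached the target.
    """
    rng = random.Random(seed)
    pheromone = {(u, v): 1.0 for u in graph for v in graph[u]}
    best_path, best_len = None, float("inf")

    for _ in range(n_iters):
        completed = []
        for _ in range(n_ants):
            path, visited, length = [source], {source}, 0.0
            while path[-1] != target:
                node = path[-1]
                edges = graph.get(node, {})
                choices = [v for v in edges if v not in visited]
                if not choices:
                    break  # dead end: this ant gives up
                # Favor edges with more pheromone and shorter length.
                weights = [pheromone[(node, v)] ** alpha
                           * (1.0 / edges[v]) ** beta for v in choices]
                nxt = rng.choices(choices, weights=weights)[0]
                length += edges[nxt]
                path.append(nxt)
                visited.add(nxt)
            if path[-1] == target:
                completed.append((path, length))
                if length < best_len:
                    best_path, best_len = path, length

        # Evaporate old pheromone (with a small floor), then reinforce
        # the edges of every completed path; shorter paths deposit more.
        for edge in pheromone:
            pheromone[edge] = max(pheromone[edge] * (1.0 - evaporation), 1e-9)
        for path, length in completed:
            for u, v in zip(path, path[1:]):
                pheromone[(u, v)] += 1.0 / length

    return best_path, best_len
```

Because the search is randomized, the result is a good path rather than a guaranteed optimum; more ants or iterations make finding the shortest path more likely.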
Search optimal path algorithm
1.  User makes a search.
2.  Clustering (removing pages from
irrelevant clusters).
3.  Main content determination and
duplicate removal.
4.  Graph structure optimization.
5.  Analyzing content importance and
completeness (sorting from most
to least important).
6.  Show the shortest path for viewing
the search results.
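Steps 2, 3 and 5 of the list above can be sketched as a small filtering pipeline. The record fields (url, cluster, text, importance), the word-set Jaccard comparison used as a stand-in for duplicate detection, and the 0.8 threshold are all assumptions made for illustration; graph optimization and path display (steps 4 and 6) are not covered.

```python
def word_set(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def refine_results(results, relevant_clusters, duplicate_threshold=0.8):
    """Filter, deduplicate and order search results; return their URLs.

    results is a list of dicts with 'url', 'cluster', 'text', 'importance'.
    """
    # Step 2: drop pages that fall into clusters judged not relevant.
    kept = [r for r in results if r["cluster"] in relevant_clusters]
    # Step 5, done early so the most important copy of a duplicate survives.
    kept.sort(key=lambda r: r["importance"], reverse=True)
    # Step 3: skip pages whose text nearly duplicates an accepted page.
    unique = []
    for r in kept:
        words = word_set(r["text"])
        if all(jaccard(words, word_set(u["text"])) < duplicate_threshold
               for u in unique):
            unique.append(r)
    return [r["url"] for r in unique]
```

Sorting before deduplication is a design choice: when two pages carry the same content, the one ranked as more important is the one the user is shown.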
Trends
1.  Social Search (Facebook, Twitter)
and real-time search.
2.  Visual search (Bing).
3.  Expert systems (Wolfram Alpha,
Siri and Cortana).
4.  Resolving copyright issues.
References
1.  Data Mining SDK http://datamining.codeplex.com/
2.  Microsoft Research Asia http://research.microsoft.com/en-us/labs/asia/
3.  Information search lectures by Yandex http://company.yandex.ru/public/seminars/schedule
4.  How Google Works Videos http://bit.ly/bRfUav
5.  How Bing Works http://neotracks.blogspot.com/2009/06/ranknethow-bing-works.html
6.  Data Mining hub http://habrahabr.ru/hub/data_mining/
7.  http://cstheory.stackexchange.com/ and http://math.stackexchange.com/
8.  Comparative analysis of near-duplicate detection methods for Web documents.
Zelenkov Yu.G., Segalovich I.V. 2007. http://rcdl2007.pereslavl.ru/papers/paper_65_v1.pdf
9.  Shingles approach http://www.codeisart.ru/part-1-shingles-algorithm-for-web-documents/
Q&A
alex.krakovetskiy@devrain.com
@msugvnua

➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

"Data Mining and Information Retrieval: Problems, Algorithms, Solutions", Alexander Krakovetskiy

  • 1.
  • 2. About 1.  CEO of DevRain Solutions – software development (specialization: Windows Phone and Windows 8). 2.  Microsoft Regional Director. 3.  Microsoft Windows Phone Most Valuable Professional. 4.  Telerik Most Valuable Professional. 5.  Best Professional in Software Architecture (Ukrainian IT Award). 6.  Ph.D. 7.  Speaker and IT blogger.
  • 3. #1: A lot of information 1.  “No information” problem is transformed to the “a lot of information” problem. 2.  Amount of information increases every year in geometric progression. 3.  Big data.
  • 4. #2: Duplicates 1.  The chrome differs, not the content. 2.  Copyrighting and plagiarism. 3.  Partially solved for news.
  • 5. #3: Information waste 1.  Level 1: noisy information such as advertisement, copyright, decoration, etc. 2.  Level 2: useful information, but not very relevant to the topic of the page, such as navigation, directory, etc. 3.  Level 3: relevant information to the theme of the page, but not with prominent importance, such as related topics, topic index, etc. 4.  Level 4: the most prominent part of the page, such as headlines, main content, etc.
  • 6. #4: Searching time Every second user views 5-10 pages to find the needed information. My record: 8 hours of uninterrupted search. Found on the 23rd page on MSN.
  • 7. #5: Domain “Snow Leopard” Can be a “cat” or an “operating system” from Apple.
  • 8. Solutions? Data Mining – intelligent analysis of large amounts of data •  clustering, association rules, genetic algorithms (GA), ant colony optimization, visualization, decision trees, neural networks. R&D – new algorithms, methods •  Microsoft Research, Yahoo! Research, Google Labs, Arc90 Lab and others. Let’s mix!
  • 9. #01: A lot of information 1.  Filtering not ranking 2.  Clustering and categorization 3.  Semantic web
  • 10. #02: Duplicates. NLP 1.  Readability score 2.  NER Dbpedia Spotlight, Reuters OpenCalais 3.  WordNet 4.  Shingles
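The shingles approach named on this slide can be sketched as follows. This is a minimal illustration, assuming word-level shingles of length `k` and Jaccard similarity as the resemblance measure; the function names and the default `k=3` are hypothetical choices, not taken from the slides.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def resemblance(text_a, text_b, k=3):
    """Estimate how close two texts are to being duplicates (0.0 to 1.0)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

A resemblance close to 1.0 suggests a near-duplicate; the threshold at which two pages count as duplicates is a tuning decision that the slides do not specify.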
  • 12. #3: Information waste Readability An Arc90 Lab Readability turns any web page into a clean view for reading now or later on your computer, smartphone, or tablet. https://www.readability.com
  • 13. Vision-based Page Segmentation Algorithm Presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on visual perception. Based on DOM structure analysis and subjective rules. http://research.microsoft.com/apps/pubs/default.aspx?id=70027
  • 14. Vision-based Page Segmentation Algorithm Different pages have different visual margins, so the quality of the segmentation depends on the particular web page. If the comments are larger than the main content (e.g. on habrahabr), the result will not be very precise.
  • 15. Learning Important Models 1.  Spatial features {BlockCenterX, BlockCenterY, BlockRectWidth, BlockRectHeight} 2.  Content features {FontSize, FontWeight, InnerTextLength, InnerHtmlLength, ImgNum, ImgSize, LinkNum, LinkTextLength, InteractionNum, InteractionSize, FormNum, FormSize, OptionNum, OptionTextLength, TableNum, ParaNum} http://www.sigkdd.org/sites/default/files/issues/6-2-2004-12/2-song.pdf
  • 16. Semantic and SEO 1.  Semantic tags (article, aside, footer, header, etc.) 2.  SEO (meta description, keywords) 3.  Microformats (RSS, hCalendar, hCard, etc.) 4.  CMS, common engines and social networks.
  • 17. SeoRank 1.  Title 2 text. 2.  Meta keywords 2 text. 3.  Headers 2 text. 4.  Meta description 2 text. 5.  WordsIndex, SentencesIndex, WordsInSentencesIndex, LinksIndex, WordsAsLinksIndex, ImgsIndex, ImgsAsLinksIndex etc.
  • 18. Regression model 1.  Detect valuable properties. 2.  Select model type (linear). 3.  After regression analysis we get the content importance model: $y = 0.324 \cdot x_{3} - 0.249 \cdot x_{5} - 0.008 \cdot x_{6} + 0.056 \cdot x_{7} - 0.594 \cdot x_{12} - 0.267 \cdot x_{14} + 0.002 \cdot x_{16} + 0.305 \cdot x_{17}$.
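A linear model of this kind is simply a weighted sum of page features. The sketch below assumes the coefficients on this slide read as 0.324, -0.249, -0.008, 0.056, -0.594, -0.267, 0.002 and 0.305 for features x3, x5, x6, x7, x12, x14, x16 and x17; the slide does not say which page property each index denotes, so features are passed by index, and the function name is hypothetical.

```python
# Coefficients as read from the slide, keyed by feature index (an assumption:
# the mapping from index to a concrete page property is not given).
COEFFICIENTS = {
    3: 0.324,
    5: -0.249,
    6: -0.008,
    7: 0.056,
    12: -0.594,
    14: -0.267,
    16: 0.002,
    17: 0.305,
}


def content_importance(features):
    """Evaluate the linear model y = sum(c_i * x_i).

    `features` maps a feature index to its value; missing features count as 0.
    """
    return sum(c * features.get(i, 0.0) for i, c in COEFFICIENTS.items())
```

Blocks of a page could then be ranked by this score, with the highest-scoring block treated as the main content candidate.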
  • 19. SmartBrowser Software for determining the most relevant content of HTML pages. http://smartbrowser.codeplex.com/
  • 20. Search optimal path 1.  Graph analysis (similar pages, clustering and categorization). 2.  Ant simulations (search optimal path using complex criterion). http://touchgraph.com/TGGoogleBrowser.html http://walk2web.com
  • 21. Ant algorithm The ant colony algorithm is an algorithm for finding optimal paths that is based on the behavior of ants searching for food. Because the ant-colony works on a very dynamic system, the ant colony algorithm works very well in graphs with changing topologies. Examples of such systems include computer networks, and artificial intelligence simulations of workers.
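The idea can be illustrated with a small ant colony search for a cheap path in a weighted graph. This is a generic textbook-style sketch, not the implementation behind the slides: the graph format (a dict of neighbor-to-cost dicts), the parameter names and their default values are all assumptions.

```python
import random


def ant_colony_shortest_path(graph, start, goal, n_ants=20, n_iters=30,
                             evaporation=0.5, alpha=1.0, beta=2.0, seed=0):
    """Search for a low-cost path from start to goal with a simple ant colony.

    graph[u][v] is the positive cost of the edge u -> v.
    Returns (best_path, best_cost), or (None, inf) if no ant reached the goal.
    """
    rng = random.Random(seed)
    pheromone = {(u, v): 1.0 for u in graph for v in graph[u]}
    best_path, best_cost = None, float("inf")

    for _ in range(n_iters):
        walks = []
        for _ in range(n_ants):
            path, cost = [start], 0.0
            while path[-1] != goal:
                u = path[-1]
                options = [v for v in graph[u] if v not in path]
                if not options:
                    break  # dead end: this ant gives up
                # Prefer edges with more pheromone and lower cost.
                weights = [pheromone[(u, v)] ** alpha * (1.0 / graph[u][v]) ** beta
                           for v in options]
                v = rng.choices(options, weights=weights)[0]
                cost += graph[u][v]
                path.append(v)
            if path[-1] == goal:
                walks.append((path, cost))
                if cost < best_cost:
                    best_path, best_cost = path, cost

        # Evaporate old pheromone, then let successful ants deposit new pheromone.
        for edge in pheromone:
            pheromone[edge] *= 1.0 - evaporation
        for path, cost in walks:
            for u, v in zip(path, path[1:]):
                pheromone[(u, v)] += 1.0 / cost

    return best_path, best_cost
```

Because pheromone evaporates on every iteration, trails that stop being useful fade, which is what lets this family of algorithms adapt when the graph changes.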
  • 22. Search optimal path algorithm 1.  The user makes a search. 2.  Clustering (removing pages from irrelevant clusters). 3.  Main content determination and duplicate removal. 4.  Graph structure optimization. 5.  Analyzing content importance and completeness (sorting from most to least important). 6.  Show the shortest path for viewing the search results.
  • 23. Trends 1.  Social search (Facebook, Twitter) and real-time search. 2.  Visual search (Bing). 3.  Expert systems (Wolfram Alpha, Siri and Cortana). 4.  Resolving copyright issues.
  • 24. References 1.  Data Mining SDK http://datamining.codeplex.com/ 2.  Microsoft Research Asia http://research.microsoft.com/en-us/labs/asia/ 3.  Information search lectures by Yandex http://company.yandex.ru/public/seminars/schedule 4.  How Google Works Videos http://bit.ly/bRfUav 5.  How Bing Works http://neotracks.blogspot.com/2009/06/ranknethow-bing-works.html 6.  Data Mining hub http://habrahabr.ru/hub/data_mining/ 7.  http://cstheory.stackexchange.com/ and http://math.stackexchange.com/ 8.  Comparative analysis of near-duplicate detection methods for Web documents. Zelenkov Yu.G., Segalovich I.V. 2007. http://rcdl2007.pereslavl.ru/papers/paper_65_v1.pdf 9.  Shingles approach http://www.codeisart.ru/part-1-shingles-algorithm-for-web-documents/