Palakorn Nakphong
Founder: Nextzy Technologies Co.,ltd.
[“Java Programmer”, Fullstack Web Developer, Ruby On Rails Developer];
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
Jsoup
Java HTML Parser
Jsoup is an open-source Java library for working with real-world HTML. It provides a very convenient API.
Complex DOM Element
Old Web Scraping
How do you get the data inside a tag?
Regular expressions are painful
String expr = "<td><span\\s+class=\"flagicon\"[^>]*>"
    + ".*?</span><a href=\""
    + "([^\"]+)"        // first piece of data goes up to the quote
    + "\"[^>]*>"        // end quote, then skip to the end of the tag
    + "([^<]+)"         // name is the data up to the next tag
    + "</a>.*?</td>";   // end of the a tag, then skip to the td close tag
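To see what the regular-expression approach involves, here is a minimal sketch that runs the pattern above with `java.util.regex`. The class name `RegexScrape`, the helper `firstLink`, and the sample table cell are illustrative assumptions, not part of the original slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {
    // The pattern from the slide, compiled once.
    static final Pattern ROW = Pattern.compile(
        "<td><span\\s+class=\"flagicon\"[^>]*>"
        + ".*?</span><a href=\""
        + "([^\"]+)"      // group 1: link target, up to the closing quote
        + "\"[^>]*>"      // closing quote, then skip to the end of the tag
        + "([^<]+)"       // group 2: link text, up to the next tag
        + "</a>.*?</td>");

    /** Returns {href, name} for the first match, or null if nothing matches. */
    static String[] firstLink(String html) {
        Matcher m = ROW.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup shaped the way the pattern expects.
        String html = "<td><span class=\"flagicon\"><img src=\"th.png\"></span>"
                    + "<a href=\"/wiki/Thailand\">Thailand</a></td>";
        String[] link = firstLink(html);
        System.out.println(link[0] + " -> " + link[1]); // prints "/wiki/Thailand -> Thailand"
    }
}
```

The pattern is brittle: if the `span` carries another attribute before `class`, nothing matches at all, which is the kind of problem an HTML parser avoids.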
New Web Scraping
Using Jsoup
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.1</version>
</dependency>
What is the Jsoup Library?
• Jsoup can scrape and parse HTML from a URL, file, or string
• Jsoup can find and extract data, using DOM traversal or CSS selectors
• Jsoup allows you to manipulate the HTML elements, attributes, and text
• Jsoup can clean user-submitted content against a safe whitelist to prevent XSS attacks
• Jsoup also outputs tidy HTML
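As a rough illustration of the parse, manipulate, and output steps listed above, the sketch below parses an HTML string, sets an attribute on every link, and serializes the result. The class name `JsoupManipulate`, the helper `nofollowLinks`, and the sample fragment are assumptions made for this example; it needs the jsoup dependency on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupManipulate {
    /** Parses the HTML, marks every link rel="nofollow", and returns the body HTML. */
    static String nofollowLinks(String html) {
        Document doc = Jsoup.parse(html);               // parse from a string
        doc.select("a[href]").attr("rel", "nofollow");  // set an attribute on all matches
        return doc.body().html();                       // serialize the cleaned-up tree
    }

    public static void main(String[] args) {
        // Hypothetical input with an unclosed <p>; jsoup repairs it while parsing.
        System.out.println(nofollowLinks("<p>See <a href=\"http://jsoup.org/\">jsoup</a>"));
    }
}
```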
Example DOM Element
Document doc = Jsoup.connect("http://www.nextzy.com/").get();
String title = doc.title();
<html>
<head>
<title>My title</title>
</head>
<body>
<h1>My header</h1>
<a href="test.html">My link</a>
</body>
</html>
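The sample document above can also be parsed from a string, which avoids a network call while experimenting. This is a small sketch; the class name `SampleDocument` and the `parse` helper are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SampleDocument {
    // The sample page from the slide, as a single string.
    static final String HTML =
          "<html><head><title>My title</title></head>"
        + "<body><h1>My header</h1>"
        + "<a href=\"test.html\">My link</a></body></html>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element link = doc.select("a").first();
        System.out.println(doc.title());              // prints "My title"
        System.out.println(doc.select("h1").text());  // prints "My header"
        System.out.println(link.attr("href") + " | " + link.text()); // prints "test.html | My link"
    }
}
```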
File input = new File("/file/nextzy.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://nextzy.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
Get Element By …
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultLinks = doc.select("h3.active > a");
Like CSS Selector …
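To make the four selectors above concrete, the sketch below runs them against a small made-up page. The class name `SelectorDemo` and the sample markup are assumptions chosen so that each selector matches a known number of elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Hypothetical page: two images, two links with href, one anchor without.
    static final String HTML =
          "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"photo.jpg\"></div>"
        + "<h3 class=\"active\"><a href=\"/first\">First</a></h3>"
        + "<h3><a href=\"/second\">Second</a></h3>"
        + "<a name=\"top\">anchor without href</a>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();
        // a[href]: only <a> elements that carry an href attribute
        System.out.println(doc.select("a[href]").size());         // prints 2
        // img[src$=.png]: images whose src ends with ".png"
        System.out.println(doc.select("img[src$=.png]").size());  // prints 1
        // first() returns null when nothing matches, so check before use
        Element masthead = doc.select("div.masthead").first();
        System.out.println(masthead != null);                     // prints true
        // h3.active > a: links that are direct children of <h3 class="active">
        System.out.println(doc.select("h3.active > a").text());   // prints "First"
    }
}
```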
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
Work with URL …
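The same relative-versus-absolute behaviour can be tried offline by passing a base URI to `Jsoup.parse`. This is a sketch under assumptions: the class name `AbsoluteUrls`, the helper `hrefs`, and the `/download` link are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteUrls {
    /** Returns the raw href, the abs:href attribute, and absUrl("href") of the first link. */
    static String[] hrefs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);  // the base URI is used to resolve relative links
        Element link = doc.select("a").first();
        return new String[] {
            link.attr("href"),      // as written in the markup
            link.attr("abs:href"),  // resolved against the base URI
            link.absUrl("href")     // equivalent convenience method
        };
    }

    public static void main(String[] args) {
        String[] r = hrefs("<a href=\"/download\">Download</a>", "http://jsoup.org/");
        System.out.println(r[0]); // prints "/download"
        System.out.println(r[1]); // prints "http://jsoup.org/download"
        System.out.println(r[2]); // prints "http://jsoup.org/download"
    }
}
```

When a document is parsed without a base URI, `abs:href` has nothing to resolve a relative link against and yields an empty string.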
Come be a pirate with us...
https://www.blognone.com/node/64996
Thank You
Nextzy Technologies Co.,ltd. Jsoup