Nextzy Technologies Co.,ltd. Jsoup
1. Palakorn Nakphong
Founder: Nextzy Technologies Co.,ltd.
[“Java Programmer”, Fullstack Web Developer, Ruby On Rails Developer];
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
2. Jsoup
Java HTML Parser
Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using DOM traversal and CSS selectors.
6. Regular expression is F*uk
String expr = "<td><span\\s+class=\"flagicon\"[^>]*>"
    + ".*?</span><a href=\""
    + "([^\"]+)"      // first piece of data goes up to quote
    + "\"[^>]*>"      // end quote, then skip to end of tag
    + "([^<]+)"       // name is data up to next tag
    + "</a>.*?</td>"; // end a tag, then skip to the td close tag
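To see why this approach is fragile, here is a minimal sketch that runs the same pattern with java.util.regex. The class name RegexRowDemo and the sample table cell are made up for illustration; they are not from the original slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexRowDemo {
    // Same pattern as on the slide, with the string escapes written out.
    static final Pattern ROW = Pattern.compile(
        "<td><span\\s+class=\"flagicon\"[^>]*>"
        + ".*?</span><a href=\""
        + "([^\"]+)"       // group 1: the href value
        + "\"[^>]*>"
        + "([^<]+)"        // group 2: the link text
        + "</a>.*?</td>");

    // Returns {href, name} for the first matching cell, or null if nothing matches.
    static String[] firstLink(String html) {
        Matcher m = ROW.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample cell, shaped like the markup the pattern expects.
        String cell = "<td><span class=\"flagicon\"><img src=\"th.png\"></span>"
            + "<a href=\"/wiki/Thailand\" title=\"Thailand\">Thailand</a></td>";
        String[] hit = firstLink(cell);
        System.out.println(hit[0] + " -> " + hit[1]);

        // Switching the attribute to single quotes is still valid HTML,
        // but the pattern no longer matches and firstLink returns null.
        String variant = cell.replace("class=\"flagicon\"", "class='flagicon'");
        System.out.println(firstLink(variant));
    }
}
```

The pattern works only while the markup keeps exactly the shape it was written for, which is the motivation for using a real HTML parser.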
10. • Jsoup can scrape and parse HTML from a URL, file, or string
• Jsoup can find and extract data, using DOM traversal or CSS selectors
• Jsoup allows you to manipulate the HTML elements, attributes, and text
• Jsoup can clean user-submitted content against a safe whitelist, to
prevent XSS attacks
• Jsoup also outputs tidy HTML
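As a rough illustration of the cleaning feature, the sketch below passes an untrusted snippet through Jsoup.clean. It assumes jsoup 1.14.2 or later, where the whitelist class is named Safelist (older releases call it Whitelist). The class name CleanDemo and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

final class CleanDemo {
    // Keeps only the tags and attributes allowed by the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user input with a script tag and an inline event handler.
        String dirty = "<p>Hello <script>alert('xss')</script>"
            + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";
        // The script element and the onclick attribute are dropped;
        // the paragraph and the link itself are kept.
        System.out.println(sanitize(dirty));
    }
}
```

Safelist.basic() is only one of the presets; which preset fits depends on how much formatting the application wants to allow.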
11. Example DOM Element
Document doc = Jsoup.connect("http://www.nextzy.com/").get();
String title = doc.title();
<html>
<head>
<title>My title</title>
</head>
<body>
<h1>My header</h1>
<a href="test.html">My link</a>
</body>
</html>
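The sample page above can also be parsed from a string, which avoids the network call. This is a small sketch under that assumption; the class name DomSampleDemo and its helper methods are illustrative and not part of the original slides.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class DomSampleDemo {
    // The sample page from the slide, held in a string.
    static final String SAMPLE =
        "<html><head><title>My title</title></head>"
        + "<body><h1>My header</h1><a href=\"test.html\">My link</a></body></html>";

    static Document parseSample() {
        return Jsoup.parse(SAMPLE);
    }

    // Text of the first element matching the CSS query.
    static String firstText(Document doc, String cssQuery) {
        Element el = doc.select(cssQuery).first();
        return el.text();
    }

    public static void main(String[] args) {
        Document doc = parseSample();
        System.out.println(doc.title());                           // My title
        System.out.println(firstText(doc, "h1"));                  // My header
        System.out.println(doc.select("a").first().attr("href"));  // test.html
    }
}
```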
12. Get Element By …
File input = new File("/file/nextzy.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://nextzy.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
13. Like CSS Selector …
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultLinks = doc.select("h3.active > a");
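To make the four selectors concrete, the sketch below runs them against a small hand-written fragment. The fragment, the class name SelectorDemo, and the helper count are assumptions chosen so each selector has something to match; they do not come from the slides.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class SelectorDemo {
    // Hypothetical fragment: two images, two links with href, one anchor without.
    static final String PAGE =
        "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"photo.jpg\"></div>"
        + "<h3 class=\"active\"><a href=\"/first\">First</a></h3>"
        + "<h3><a href=\"/second\">Second</a></h3>"
        + "<a name=\"top\">anchor without href</a>";

    // Number of elements in PAGE that match the CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(PAGE).select(cssQuery).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(PAGE);
        Elements links = doc.select("a[href]");                 // <a> elements that have an href
        Elements pngs = doc.select("img[src$=.png]");           // images whose src ends in .png
        Element masthead = doc.select("div.masthead").first();  // first div with class masthead
        Elements resultLinks = doc.select("h3.active > a");     // direct <a> children of h3.active

        System.out.println(links.size());               // 2
        System.out.println(pngs.size());                // 1
        System.out.println(masthead.children().size()); // 2
        System.out.println(resultLinks.first().text()); // First
    }
}
```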
14. Work with URL …
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
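The same relative-to-absolute resolution can be tried offline by giving Jsoup.parse a base URI. The following is a minimal sketch; the class name AbsUrlDemo and the /download link are made-up examples.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class AbsUrlDemo {
    // Parses a fragment against a base URI and returns its first link.
    static Element firstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").first();
    }

    public static void main(String[] args) {
        // Hypothetical relative link, resolved against the base URI passed to parse().
        Element link = firstLink("<a href=\"/download\">Download</a>", "http://jsoup.org/");
        System.out.println(link.attr("href"));      // /download
        System.out.println(link.attr("abs:href"));  // http://jsoup.org/download
        System.out.println(link.absUrl("href"));    // same result via the dedicated method
    }
}
```

When a document is fetched with Jsoup.connect, the base URI is set from the request URL, so no explicit argument is needed there.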