MK Study Journal: Jsoup

Monday, 14 August 2017

Jsoup: HTML Parser

The jsoup library provides features for parsing HTML pages.

Please refer to https://jsoup.org/ for API details.

Jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

1. scrape and parse HTML from a URL, file, or string

2. find and extract data, using DOM traversal or CSS selectors

3. manipulate the HTML elements, attributes, and text

4. clean user-submitted content against a safe white-list, to prevent XSS attacks

5. output tidy HTML
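The features listed above can be sketched in a few lines. This is a minimal illustration, not an official example: the class name JsoupFeatures and its helper methods are hypothetical names made up for this note, and it assumes jsoup 1.14.2 or later, where the cleaning rules live in org.jsoup.safety.Safelist (older releases called the same class Whitelist). It parses from a string so it runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, used only to illustrate the feature list.
public class JsoupFeatures {

    // Feature 1: parse HTML from a string into a DOM.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    // Feature 2: extract data with a CSS selector (href of the first link).
    static String firstLinkHref(Document doc) {
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.attr("href");
    }

    // Feature 3: manipulate attributes on every matched element.
    static void markLinksNoFollow(Document doc) {
        doc.select("a[href]").attr("rel", "nofollow");
    }

    // Feature 4: clean untrusted HTML against a safelist of allowed tags.
    // Assumes jsoup 1.14.2+; older versions use Whitelist.basic() instead.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        Document doc = parse("<p><a href=\"https://jsoup.org/\">jsoup</a></p>");
        System.out.println(firstLinkHref(doc));

        markLinksNoFollow(doc);
        // Feature 5: output the (now modified) body as tidy HTML.
        System.out.println(doc.body().html());

        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Safelist.basic() permits only simple text formatting tags and links, so the script element in the last call is dropped from the output.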

Example:

package com.test.main;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Test1 {

    public static void main(String[] args) throws IOException {
        // Fetch the page over HTTP and parse it into a DOM.
        Document doc = Jsoup.connect("https://jsoup.org/").get();

        // Select every <div> element with a CSS selector.
        Elements div = doc.select("div");

        // Print the combined inner HTML of all matched elements.
        System.out.println(div.html());
    }
}
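Calling html() on an Elements collection returns the inner HTML of every match joined into one string. When each match needs separate handling, the collection can be iterated instead. The sketch below is only an illustration: the class DivTexts and the method divTexts are hypothetical names, and it parses an inline string rather than a live page so that it runs offline.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only to illustrate per-element handling.
public class DivTexts {

    // Returns the visible text of each <div>, one list entry per element.
    static List<String> divTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element div : doc.select("div")) {
            // text() strips the markup and keeps only the text content.
            texts.add(div.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div>First</div><div>Second <b>block</b></div>";
        System.out.println(divTexts(html)); // prints [First, Second block]
    }
}
```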
