Web Scraping Made Easy: A Step-by-Step Guide to Extracting Data from Websites with Java and Jsoup

There is an abundance of information available on the web. Extracting relevant data from websites can be a valuable skill for various purposes, such as market research, data analysis, and automation. In this guide, we will explore how to extract data from websites using Java and the Jsoup library. Specifically, we will be crawling the Worldometers website to retrieve world population data.

What is Jsoup?

Jsoup is a popular open-source Java library that provides a convenient API for extracting and manipulating data from HTML and XML documents. It simplifies the process of web scraping by providing powerful methods to navigate, query, and extract data from HTML structures.
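
Before we build the crawler, here is a minimal, self-contained sketch of the API in action (the HTML snippet and the JsoupQuickTour class name are just illustrative, not part of the project we build below): it parses an in-memory HTML string and queries it with a CSS selector.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupQuickTour {

    public static void main(String[] args) {

        // Parse an in-memory HTML snippet into a Document
        Document document = Jsoup.parse("<html><body><p class=\"greeting\">Hello, Jsoup!</p></body></html>");

        // Query the parsed document with a CSS selector and read the element text
        String greeting = document.select("p.greeting").text();

        System.out.println(greeting); // prints "Hello, Jsoup!"
    }
}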

Setting Up Jsoup

To get started, you’ll need to include the Jsoup library in your Java project. You can download the Jsoup JAR file from the official Jsoup website and add it to your project’s dependencies, or, if you use a build tool such as Maven or Gradle, pull it in from Maven Central under the org.jsoup:jsoup coordinates.

Creating the Crawler Class

To begin our data extraction journey, let’s implement a crawler class in Java. The Crawler class uses Jsoup to fetch and parse a web page, and its nested TablePrinter class prints an HTML table to the console. Below is the code for our Crawler class:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Crawler {

    private final Connection connection;

    public Crawler(String url) {
        // Prepare a connection to the given URL; no request is sent yet
        this.connection = Jsoup.connect(url);
    }

    public Document getDocument() throws IOException {
        // Execute a GET request and parse the response body into a Document
        return this.connection.get();
    }

    static class TablePrinter {

        private final Element table;

        public TablePrinter(Element table) {
            this.table = table;
        }

        public void print() {

            Elements rows = this.table.getElementsByTag("tr");

            rows.forEach(row -> {

                Elements headerElements = row.getElementsByTag("th");
                Elements dataElements = row.getElementsByTag("td");

                // Print headers
                printElements(headerElements);

                // Print table data
                printElements(dataElements);

            });

        }

        private void printElements(Elements elements) {

            if (!elements.isEmpty()) {

                // Print each element with a fixed width of 30
                elements.eachText().forEach(elementText -> System.out.printf("%30s\t", elementText));

                // Move to a new line after printing a row
                System.out.println();

            }
        }

    }
}
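
Depending on the site you target, a plain connection may be rejected or time out. Jsoup’s Connection lets you set a user agent and a timeout before the request is sent. Below is a small sketch of how the constructor could be extended; the ConfiguredCrawler name, the user-agent string, and the 10-second timeout are arbitrary example values, not something the original Crawler requires:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ConfiguredCrawler {

    private final Connection connection;

    public ConfiguredCrawler(String url) {
        // Example values: send a browser-like user agent and wait at most 10 seconds
        this.connection = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; JsoupCrawler/1.0)")
                .timeout(10_000);
    }

    public Document getDocument() throws IOException {
        // Execute the GET request and parse the response into a Document
        return this.connection.get();
    }
}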

Now that we have implemented the Crawler class, let’s explore how to utilize it to extract data from websites. The code snippet below demonstrates how to query the Worldometers website for population data:


import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;

public class Main {

    public static void main(String[] args) {

        String url = "https://www.worldometers.info/world-population/population-by-country/";

        // Create a new crawler instance
        Crawler crawler = new Crawler(url);

        try {

            // Get the document from the specified URL
            Document document = crawler.getDocument();

            // Find the population table by its HTML id (discovered by inspecting the page source)
            Element table = document.getElementById("example2");

            if(table != null) {

                // Create a table printer instance
                Crawler.TablePrinter tablePrinter = new Crawler.TablePrinter(table);

                // Print the table
                tablePrinter.print();

            }

        } catch (IOException exception) {

            // Report any I/O errors, such as network failures
            System.err.println(exception.getMessage());

        }

    }

}

I highly recommend running the code to see the output for yourself. When executed, it prints the population data in a neatly aligned tabular format, as depicted in the image below:

[Image: tabular output produced by the crawler]
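
If you only need a specific column rather than the whole table, you can pair the same Crawler with Jsoup’s CSS selectors. The sketch below prints just the country names; the CountryNames class is my own example, and it assumes the name sits in the second cell of each row, which you should verify by inspecting the page, since the site’s layout can change:

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class CountryNames {

    public static void main(String[] args) throws IOException {

        Crawler crawler = new Crawler("https://www.worldometers.info/world-population/population-by-country/");
        Document document = crawler.getDocument();

        Element table = document.getElementById("example2");
        if (table == null) {
            return;
        }

        // Select only the data rows inside the table body, skipping the header row
        Elements rows = table.select("tbody tr");

        for (Element row : rows) {
            Elements cells = row.select("td");

            // Assumption: the country name is the second cell (index 1)
            if (cells.size() > 1) {
                System.out.println(cells.get(1).text());
            }
        }
    }
}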

That was all I had to share with you guys. If you found this code informative and would love to see more, don’t forget to subscribe to our newsletter! 😊
