
May 13, 2025
How to Build a Multi-Threaded Web Crawler in Java
Ever wondered how search engines return results from across the web in a fraction of a second? The secret is web crawlers: automated programs that gather and index data from websites. Building a web crawler in Java is easier than it seems, but single-threaded crawlers break down as the amount of data grows. Multi-threading saves the day with speed and scalability.
This article walks through building a multi-threaded web crawler in Java. Before you build one, make sure you understand the fundamentals. Let's get into it!
Understanding the Basics of Web Crawling
Crawling a website involves finding pages, gathering data, and following links to other pages. A crawler traverses the web of links much like a spider traverses its web, collecting URLs as it goes. It sounds easy, but circular links, duplicate content, and respecting robots.txt make it harder than it looks.
Single-threaded crawlers work fine for small jobs but struggle with large-scale crawling. Multi-threading speeds up processing and makes better use of resources by distributing the work across multiple threads.
Setting Up Your Java Project
Start by creating a Java project in your IDE. Add JSoup to simplify HTML parsing and fetching web pages. Maven users can add the following dependency to pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.4</version>
</dependency>
With your project set up, you're ready to write some code!
Building a Single-Threaded Crawler
Begin with a single-threaded crawler. Create a program that gets a web page and finds all the links on it.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

public class SingleThreadedCrawler {
    public static void main(String[] args) {
        String url = "https://example.com";
        try {
            // Fetch the page and parse it into a DOM
            Document doc = Jsoup.connect(url).get();
            // Select every anchor tag that has an href attribute
            Elements links = doc.select("a[href]");
            for (var link : links) {
                // abs:href resolves relative links against the page's base URL
                System.out.println(link.attr("abs:href"));
            }
        } catch (IOException e) {
            System.err.println("Error fetching the URL: " + e.getMessage());
        }
    }
}
This crawler extracts and prints the links from a single URL at a time. That's fine for tiny websites, but it won't scale to large crawls.
Introducing Multi-Threading
Scaling requires multi-threading. ExecutorService simplifies thread management by maintaining a pool of threads that run tasks concurrently. We can build a multi-threaded crawler on top of it:
import java.util.*;
import java.util.concurrent.*;

public class MultiThreadedCrawler {
    private static final int THREAD_COUNT = 10;
    // Thread-safe set of URLs already scheduled for crawling
    private static final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();
    // Thread-safe queue of URLs waiting to be crawled
    private static final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        urlQueue.add("https://example.com");

        // Poll with a timeout so the loop waits for worker threads to enqueue
        // new links instead of exiting the moment the queue is briefly empty.
        String url;
        while ((url = urlQueue.poll(5, TimeUnit.SECONDS)) != null) {
            if (visitedUrls.add(url)) {
                final String target = url;
                executor.execute(() -> crawl(target));
            }
        }
        executor.shutdown();
    }

    private static void crawl(String url) {
        try {
            System.out.println("Crawling: " + url);
            // Fetch and parse the page, extract links, and add them to urlQueue
        } catch (Exception e) {
            System.err.println("Error crawling " + url + ": " + e.getMessage());
        }
    }
}
In this approach, a BlockingQueue holds the URLs waiting to be crawled and a concurrent set (backed by ConcurrentHashMap) prevents duplicate visits. The thread pool crawls multiple URLs in parallel, which speeds things up considerably.
Optimizing and Enhancing the Crawler
You can improve your crawler with a number of upgrades. Handle slow or failing URLs gracefully with timeouts and retries. Short Thread.sleep delays between requests keep you from overloading servers. To crawl ethically, set a descriptive user-agent header and obey robots.txt.
Here's an example of an optimized crawl method:
private static void crawl(String url) {
    try {
        Thread.sleep(1000); // Throttle requests to avoid hammering the server
        // Requires org.jsoup.Jsoup and org.jsoup.nodes.Document imports
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(5000) // Give up on slow responses after 5 seconds
                .get();
        // Queue every absolute link on the page; visitedUrls filters out duplicates
        doc.select("a[href]").forEach(link -> urlQueue.add(link.attr("abs:href")));
    } catch (Exception e) {
        System.err.println("Error crawling " + url + ": " + e.getMessage());
    }
}
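The optimized method shows timeouts and throttling but not retries. Here's a minimal sketch of a retry helper you could call from crawl; the method name, attempt count, and backoff delay are illustrative choices, not part of the original example, and it assumes the jsoup and java.io.IOException imports from earlier:

// Hypothetical helper: fetch a page, retrying a few times before giving up.
private static Document fetchWithRetries(String url, int maxAttempts)
        throws IOException, InterruptedException {
    IOException lastError = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .timeout(5000)
                    .get();
        } catch (IOException e) {
            lastError = e;
            Thread.sleep(1000L * attempt); // simple linear backoff between attempts
        }
    }
    throw lastError; // all attempts failed
}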
Testing and Deploying Your Crawler
Test your crawler on a basic website before deploying it to confirm it works. Verify site rules and log errors. Do not crawl huge or restricted sites without authorization.
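One simple way to keep a test crawl contained is to restrict the crawler to a single host before enqueueing links. This is a sketch only; enqueueIfAllowed and allowedHost are hypothetical names you would adapt to your own code:

// Sketch: only enqueue links that stay on the allowed host (e.g. your test site).
private static void enqueueIfAllowed(String link, String allowedHost) {
    try {
        String host = new java.net.URI(link).getHost();
        if (host != null && host.endsWith(allowedHost)) {
            urlQueue.add(link);
        }
    } catch (java.net.URISyntaxException e) {
        // Skip malformed URLs rather than letting one bad link stop the crawl
    }
}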
Conclusion
You've built a multi-threaded web crawler in Java! Threading lets your crawler process URLs faster and handle much larger datasets. As you refine your code, consider adding distributed crawling or storing results in a database. And with great crawling power comes great responsibility: crawl ethically and respect the websites you visit.