How to Build a Web Scraping API Using Java, Spring Boot, and Jsoup

Overview
At 3i Data Scraping, we will create an API that scrapes data from a couple of vehicle-selling sites and extracts the ads for the vehicle model we pass to the API. This kind of API could be consumed from a UI to show ads from several websites in one place.
To follow along you will need:
IntelliJ as the IDE of choice
Maven 3.0+ as the build tool
JDK 1.8+
Getting Started
First, we need to initialize the project using Spring Initializr.
This can be done by visiting http://start.spring.io/
Make sure to select the following dependencies as well:
Lombok: a Java library that keeps the code clean by generating boilerplate code for you.
Spring Web: the Spring starter for building web applications, including RESTful services, using Spring MVC.
After generating the project, we will use two third-party libraries, Jsoup and Apache Commons Lang. The dependencies can be added to the pom.xml file:
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    ...
    ...
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.11</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>
Analyze HTML to Extract Data
Before starting the implementation of the API, we need to visit https://riyasewana.com/ and https://ikman.lk/ to locate the data we want to extract from these sites.
We can do that by opening the sites in a browser and inspecting the HTML with the developer tools.
If you are using Chrome, simply right-click on the page and choose Inspect.
The result will look something like this:
[Screenshots: Dev Tools view of the two websites]
After opening each website, we navigate through the HTML to identify the DOM element under which the ad list sits. These elements will then be used in the Spring Boot project to pull out the relevant data.
Navigating through the ikman.lk HTML, it is easy to see that the list of ads sits under an element with the class name list--3NxGO.
[Screenshot: ikman.lk ad list element in Dev Tools]
Next, we do the same with riyasewana.com, where the ad data sits under a div with the id content.
[Screenshot: riyasewana.com content div in Dev Tools]
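These selectors can be sanity-checked with a small standalone Jsoup snippet before wiring anything into Spring. This is only a sketch: the query URLs are assembled the same way the configuration shown later does, the class name list--3NxGO is whatever the site currently renders and may change, and the SelectorCheck class is made up for illustration.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorCheck {
    public static void main(String[] args) throws Exception {
        // ikman.lk: the ad list sits inside the element with class "list--3NxGO"
        Document ikman = Jsoup.connect(
                "https://ikman.lk/en/ads/sri-lanka/vehicles?sort=relevance&buy_now=0&urgent=0&query=axio").get();
        ikman.select(".list--3NxGO a[href]")
                .forEach(a -> System.out.println(a.attr("title") + " -> " + a.absUrl("href")));

        // riyasewana.com: the ad data sits inside the div with id "content"
        Document riyasewana = Jsoup.connect("https://riyasewana.com/search/axio").get();
        riyasewana.select("#content a[title]")
                .forEach(a -> System.out.println(a.attr("title") + " -> " + a.attr("href")));
    }
}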
Now that we know where the data lives, let's build the API to scrape it.
Implementation
First, we define the website URLs in the application.yml (or application.properties) file:
website:
  urls: https://ikman.lk/en/ads/sri-lanka/vehicles?sort=relevance&buy_now=0&urgent=0&query=,https://riyasewana.com/search/
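If you prefer application.properties over YAML, the same configuration is a single key with comma-separated values:

website.urls=https://ikman.lk/en/ads/sri-lanka/vehicles?sort=relevance&buy_now=0&urgent=0&query=,https://riyasewana.com/search/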
Next, create a simple model class to map the data extracted from the HTML.
package com.scraper.api.model;

import lombok.Data;

@Data
public class ResponseDTO {
    String title;
    String url;
}
In the code above, the @Data annotation generates getters and setters for the attributes; it also generates equals and hashCode, which is what lets the HashSet used later de-duplicate results.
Next, it is time to create the service layer that scrapes the data from these websites.
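The implementation that follows implements a ScraperService interface which the article never shows; a minimal sketch consistent with how it is used would be:

package com.scraper.api.service;

import com.scraper.api.model.ResponseDTO;

import java.util.Set;

public interface ScraperService {
    // Returns the unique ads found for the given vehicle model across all configured sites
    Set<ResponseDTO> getVehicleByModel(String vehicleModel);
}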
package com.scraper.api.service;

import com.scraper.api.model.ResponseDTO;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

@Service
public class ScraperServiceImpl implements ScraperService {

    // Reading the comma-separated URLs from the property file into a list
    @Value("#{'${website.urls}'.split(',')}")
    List<String> urls;

    @Override
    public Set<ResponseDTO> getVehicleByModel(String vehicleModel) {
        // Using a Set here to store only unique elements
        Set<ResponseDTO> responseDTOS = new HashSet<>();
        // Traversing through the URLs
        for (String url : urls) {
            if (url.contains("ikman")) {
                // Method to extract data from ikman.lk
                extractDataFromIkman(responseDTOS, url + vehicleModel);
            } else if (url.contains("riyasewana")) {
                // Method to extract data from riyasewana.com
                extractDataFromRiyasewana(responseDTOS, url + vehicleModel);
            }
        }
        return responseDTOS;
    }

    private void extractDataFromRiyasewana(Set<ResponseDTO> responseDTOS, String url) {
        try {
            // Loading the HTML into a Document object
            Document document = Jsoup.connect(url).get();
            // Selecting the element which contains the ad list
            Element element = document.getElementById("content");
            // Getting all the <a> tag elements inside the content div
            Elements elements = element.getElementsByTag("a");
            // Traversing through the elements
            for (Element ads : elements) {
                ResponseDTO responseDTO = new ResponseDTO();
                if (!StringUtils.isEmpty(ads.attr("title"))) {
                    // Mapping data to the model class
                    responseDTO.setTitle(ads.attr("title"));
                    responseDTO.setUrl(ads.attr("href"));
                }
                if (responseDTO.getUrl() != null) responseDTOS.add(responseDTO);
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private void extractDataFromIkman(Set<ResponseDTO> responseDTOS, String url) {
        try {
            // Loading the HTML into a Document object
            Document document = Jsoup.connect(url).get();
            // Selecting the element which contains the ad list
            Element element = document.getElementsByClass("list--3NxGO").first();
            // Getting all the <a> tag elements inside the list--3NxGO class
            Elements elements = element.getElementsByTag("a");
            for (Element ads : elements) {
                ResponseDTO responseDTO = new ResponseDTO();
                if (StringUtils.isNotEmpty(ads.attr("href"))) {
                    // Mapping data to our model class
                    responseDTO.setTitle(ads.attr("title"));
                    responseDTO.setUrl("https://ikman.lk" + ads.attr("href"));
                }
                if (responseDTO.getUrl() != null) responseDTOS.add(responseDTO);
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
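One practical note: some sites reject Jsoup's default request or respond slowly. If that happens, the Jsoup.connect(url).get() calls above can be routed through a small helper that sets a user agent and a timeout. This is only a sketch using standard Jsoup options; the article itself does not configure any of this, and the PageFetcher name is made up.

package com.scraper.api.service;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

final class PageFetcher {
    // Hypothetical helper: identify as a browser and fail fast on slow responses
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(10_000) // milliseconds
                .get();
    }
}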
After writing the scraping logic in the service layer, we can implement the RestController that exposes the data from these websites.
package com.scraper.api.controller;

import com.scraper.api.model.ResponseDTO;
import com.scraper.api.service.ScraperService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.Set;

@RestController
@RequestMapping(path = "/")
public class ScraperController {

    @Autowired
    ScraperService scraperService;

    @GetMapping(path = "/{vehicleModel}")
    public Set<ResponseDTO> getVehicleByModel(@PathVariable String vehicleModel) {
        return scraperService.getVehicleByModel(vehicleModel);
    }
}
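For completeness, Spring Initializr also generates the main application class that boots everything. It typically looks like the following; the exact class name depends on what you entered in the initializer, so this one is an assumption:

package com.scraper.api;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Assumed entry point generated by Spring Initializr; the class name is illustrative
@SpringBootApplication
public class ScraperApiApplication {
    public static void main(String[] args) {
        SpringApplication.run(ScraperApiApplication.class, args);
    }
}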
Once everything is complete, run the project and test the API!
Open a REST client and call the API, passing a vehicle model in the path.
For example: http://localhost:8080/axio
[Screenshot: API response in the REST client]
Here you can see all the ad URLs and titles related to the given vehicle model, collected from both websites.
Conclusion
In this blog, you have learned how to parse HTML documents with Jsoup and Spring Boot to extract data from these two websites. The next steps would be:
Improving this API to support pagination on these websites.
Implementing a UI for consuming the API.
For more information on building a web scraping API with Java, Spring Boot, and Jsoup, you can contact 3i Data Scraping or ask for a free quote!
3i Data Scraping is an experienced web scraping services company in the USA. We provide a complete range of web scraping, mobile app scraping, data extraction, data mining, and real-time data scraping (API) services. We have 11+ years of experience in providing website data scraping solutions to hundreds of customers worldwide.