How BeautifulSoup Is Used to Web Scrape a Movie Database

Suppose you want to use machine learning to forecast the next popular film. You try to locate clean data to train a model, but you can't find any, so you decide to create your own. However, you may hesitate to gather the data yourself if you are not familiar with HTML or web scraping.
Beautiful Soup is a Python library that makes it simple to scrape HTML and XML files. The library's documentation can be found here: Documentation
If you already know how to use Python, this lesson will give you a solid understanding of how to produce your own data.
Steps for Web Scraping:
Determine what data you want to extract from the website.
Examine the page.
Start scraping with Beautiful Soup.
Target
Scrape Studio Ghibli's movie database and look for characteristics that make a Ghibli movie better.
On the first page:
Title
URL: for future web scraping
Image
Rank
Rating
Examining the page:
Right-click on the page and then click on Inspect.
Then click on the element-picker icon to select any element on the website to inspect it.
For anyone unfamiliar with HTML, the text can look confusing at first. But don't worry; after working with it a little, you'll quickly grasp the structure. After clicking the picker icon described above, you can click on any piece of information on the page that interests you. Want to start by scraping the rating? Simply click on the rating, and the exact location of the rating in the HTML code will be highlighted for you.
Starting with BeautifulSoup
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen

html = urlopen('https://www.themoviedb.org/list/4309')
bsObj = BeautifulSoup(html, 'html.parser')
bsObj.prettify()
The result is a tangled HTML document. However, notice that the movie list is wrapped in an unordered list element with the attribute id="list_page_1". Use find(tag, attributes) to extract that element, then use find_all within it to get a list of all the li tags.
movies = bsObj.find('ul', id = "list_page_1")
movies = movies.find_all('li')
len(movies)
for movie in movies:
    print(movie, len(movie), "\n\n")
The find_all() function returns a set of objects matching the tag given in the brackets. Printing the length of each item shows that every real movie item has a length of 3 (its three child tags), while the filler items have a length of 0. As a result, keeping only the items with a length greater than 0 leaves us with the movie entries.
movies = [movie for movie in movies if len(movie) > 0]
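To see why the length filter works, note that calling len() on a Beautiful Soup Tag counts its direct children. A minimal sketch with hypothetical markup mimicking the list structure (the class names and contents here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one real movie item and one empty filler item
html_doc = "<ul><li><img/><a>Title</a><div>Rank</div></li><li></li></ul>"
soup = BeautifulSoup(html_doc, 'html.parser')
items = soup.find('ul').find_all('li')

# len() of a Tag counts its direct children, so the empty <li> has length 0
print([len(li) for li in items])  # [3, 0]

kept = [li for li in items if len(li) > 0]
print(len(kept))  # 1
```

The same comprehension as above then drops the empty items and keeps only the real movie entries.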
The data we really want to scrape on the first page is the title, URL, image, rank, and rating.
Each film is wrapped in an identical tag structure. So, to check that we get what we want, we can start by extracting information from the first movie. Then, using the same scripts, we can loop over all of the movies to retrieve all of the data we care about.
Begin on the first page. The image's URL and the title are enclosed within the img tag, as we can see. li is our current tag, so we need a way to reach the img tag and its attributes.
movie_1 = movies[0]
movie_1.img.attrs
Let's look at this in more detail. .img returns the first img tag wrapped inside the current tag, and attrs gives access to that tag's attributes.
{'class': ['poster', 'lazyload', 'fade'],
'data-sizes': 'auto',
'data-src': 'https://image.tmdb.org/t/p/w185_and_h278_bestv2/hnYowHwLq0iUWriAHtiiCWsI2dP.jpg',
'data-srcset': 'https://image.tmdb.org/t/p/w185_and_h278_bestv2/hnYowHwLq0iUWriAHtiiCWsI2dP.jpg 1x, https://image.tmdb.org/t/p/w370_and_h556_bestv2/hnYowHwLq0iUWriAHtiiCWsI2dP.jpg 2x',
'alt': 'Nausicaä of the Valley of the Wind'}
Searching for the Title
movie_1.img.attrs['alt']
Result
'Nausicaä of the Valley of the Wind'
To check that the link to the movie's image works, find the URL and use IPython's Image to display it.
from IPython.display import Image
image_url = movie_1.find('img').attrs['data-src']
Image(url= image_url)
Extract the URL by accessing the a tag and then its href attribute.
a = movie_1.a.attrs
for value in a.values():
    url = value  # href is the only attribute here, so the loop leaves its value in url
print(url)
Outcome
/movie/81
The full URL for the film is https://www.themoviedb.org/movie/81, so prepend the site's domain:
full_url = "https://www.themoviedb.org" + url
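String concatenation works here, but the standard library's urljoin is a more robust way to resolve a relative href against a base URL; it handles trailing slashes and already-absolute links correctly. A quick sketch:

```python
from urllib.parse import urljoin

# urljoin resolves a relative href against the site's base URL
full_url = urljoin('https://www.themoviedb.org', '/movie/81')
print(full_url)  # https://www.themoviedb.org/movie/81

# An already-absolute href passes through unchanged
print(urljoin('https://www.themoviedb.org', 'https://example.com/x'))
```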
Access the div tag with class='number' to find the rank, then access the span tag inside it. The text attribute returns the text inside the tag.
int(movie_1.find('div', {'class':'number'}).span.text)
Repeat the previous steps to extract the rating:
float(movie_1.find_all('span',{'class':'rating'})[1].text)
Now loop through all the movies with the following script:
html = urlopen('https://www.themoviedb.org/list/4309')
bsObj = BeautifulSoup(html, 'html.parser')

# Create five lists to hold each movie's URL, name, rank, rating, and image
urls = []
names = []
ranks = []
ratings = []
images = []

for movie in movies:
    for value in movie.a.attrs.values():
        url = value
    urls.append("https://www.themoviedb.org" + url)
    names.append(movie.img.attrs['alt'])
    ranks.append(int(movie.find('div', {'class':'number'}).span.text))
    ratings.append(float(movie.find_all('span',{'class':'rating'})[1].text))
    images.append(movie.find('img').attrs['data-src'])
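One practical note: firing requests back-to-back can get your scraper throttled or blocked. A short pause between fetches (the time module was imported at the top) keeps things polite. A minimal sketch with a stand-in fetch function, since the real one would hit the network:

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch() on each URL, pausing between requests to stay polite."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)
    return results

# Stand-in fetch function for illustration; in this article's code it
# would be something like: lambda u: BeautifulSoup(urlopen(u), 'html.parser')
pages = fetch_all(['/movie/81', '/movie/128'], fetch=lambda u: u.upper(), delay=0.1)
print(pages)  # ['/MOVIE/81', '/MOVIE/128']
```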
Find the Next Tag and Scrape Multiple Pages
In the movie list, select one of the films. Using the URL you found earlier, extract data about the film, such as the summary, director, language, runtime, budget, revenue, and genre.
If you want to reach a tag that contains the information you're interested in, but the tag has no distinctive attribute to search by, you can use find_next_siblings() to reach the p tags that sit at the same level of the hierarchy as the ul tag above them.
Now we are ready to scrape the detail pages:
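To make the sibling navigation concrete, here is a small sketch on hypothetical markup that imitates the detail page's layout (the tag contents are invented; only the structure matters):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the detail page's markup
html_doc = """
<div>
  <ul class="releases"><li>Release dates</li></ul>
  <p>Original Language: Japanese</p>
  <p>Runtime: 1h 57m</p>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# find_next_siblings() returns every later tag at the same level as the <ul>
sibs = soup.find('ul', {'class': 'releases'}).find_next_siblings()
print([p.text for p in sibs])  # ['Original Language: Japanese', 'Runtime: 1h 57m']
```

The anonymous p tags have no class of their own, but their position right after the ul makes them reachable.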
html = urlopen("https://www.themoviedb.org/movie/81")
soup = BeautifulSoup(html, 'html.parser')

# Find the summary
soup.find('div', {'class':'overview'}).p.get_text()

# Find the director
soup.find('li', {'class':'profile'}).a.get_text()

# Find language, runtime, budget, and revenue
inf = soup.find('ul', {'class':'releases'}).find_next_siblings()
language = inf[0].text
runtime = inf[1].text
budget = inf[2].text
rev = inf[3].text

# Find the genres
section = soup.find('section',{'class':'genres right_column'})
[li.text for li in section.find_all('li')]
# Scrape every page
summaries = []
languages = []
runtimes = []
budgets = []
revenues = []
genres = []
directors = []

for url in urls:
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    summaries.append(soup.find('div', {'class':'overview'}).p.get_text())
    inf = soup.find('ul', {'class':'releases'}).find_next_siblings()
    languages.append(inf[0].text)
    runtimes.append(inf[1].text)
    budgets.append(inf[2].text)
    revenues.append(inf[3].text)
    directors.append(soup.find('li', {'class':'profile'}).a.get_text())
    section = soup.find('section',{'class':'genres right_column'})
    genres.append([li.text for li in section.find_all('li')])
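One caveat with chained calls like soup.find(...).p.get_text(): if a page is missing a section, find() returns None and the chain raises an AttributeError, crashing the loop partway through. A hypothetical helper (not part of the original script) that degrades gracefully might look like this:

```python
from bs4 import BeautifulSoup

def safe_text(tag):
    """Return the tag's stripped text, or None if the lookup found nothing."""
    return tag.get_text(strip=True) if tag is not None else None

# Hypothetical page that is missing the overview section
html_doc = '<div class="header"><h2>Some Movie</h2></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

overview = soup.find('div', {'class': 'overview'})
print(safe_text(overview))         # None instead of an AttributeError
print(safe_text(soup.find('h2')))  # Some Movie
```

Appending None for missing fields keeps all the lists the same length, which matters when zipping them into a DataFrame later.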
Put Data into DataFrame
import pandas as pd
ghibli = pd.DataFrame(list(zip(names, ranks, ratings, languages, runtimes, budgets, revenues, genres, summaries)),
columns=['name','rank','rating','language','runtime', 'budget','revenue','genre','summary'])
ghibli.head(10)
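Note that zip() silently truncates to the shortest input, so if one scrape failed and left a list short, rows would be dropped without warning; it's worth checking the list lengths first. A self-contained sketch of the same zip-into-DataFrame pattern, with dummy values standing in for the scraped lists:

```python
import pandas as pd

# Dummy values standing in for the scraped lists above
demo_names = ['Nausicaä of the Valley of the Wind', 'Grave of the Fireflies']
demo_ranks = [1, 2]
demo_ratings = [7.9, 8.4]

# Each zipped tuple becomes one row; columns= names the fields
demo = pd.DataFrame(list(zip(demo_names, demo_ranks, demo_ratings)),
                    columns=['name', 'rank', 'rating'])
print(demo.shape)  # (2, 3)
```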
Final Result
Use this method to extract whatever information you want. If you're not sure where to begin, try practicing on The Movie Database with a topic that interests you. You'll be astonished at how simple scraping is, and seeing what you can scrape with just a few lines of code is genuinely entertaining.
Contact 3i Data Scraping for any queries, or request a quote!
3i Data Scraping is an experienced web scraping services company in the USA, providing a complete range of web scraping, mobile app scraping, data extraction, data mining, and real-time data scraping (API) services, with 11+ years of experience delivering website data scraping solutions to hundreds of customers worldwide.