While regular expressions are useful for pattern matching in general, it's often easier to use an HTML parser that's explicitly designed for parsing HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.
Install Beautiful Soup
To install Beautiful Soup, you can run the following in your terminal:
$ python -m pip install beautifulsoup4
With this command, you're installing the latest version of Beautiful Soup into your global Python environment.
Create a BeautifulSoup Object
Type the following program into a new editor window:
# beauty_soup.py
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
This program does three things:

- Opens the URL http://olympus.realpython.org/profiles/dionysus using urlopen() from the urllib.request module
- Reads the HTML from the page as a string and assigns it to the html variable
- Creates a BeautifulSoup object and assigns it to the soup variable
The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python's built-in HTML parser.
Use a BeautifulSoup Object
Save and run the above program. When it's finished running, you can use the soup variable in the interactive window to parse the content of html in a variety of ways.
Note: If you're not using IDLE, then you can run your program with the -i flag to enter interactive mode. Something like python -i beauty_soup.py will first run your program and then leave you in a REPL where you can explore your objects.
For example, BeautifulSoup objects have a .get_text() method that you can use to extract all the text from the document and automatically remove any HTML tags.
Type the following code into IDLE's interactive window or at the end of the code in your editor:
>>> print(soup.get_text())


Profile: Dionysus

Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard

Favorite Color: Wine

There are a lot of blank lines in this output. These are the result of newline characters in the HTML document's text. You can remove them with the .replace() string method if you need to.
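For instance, assuming a small stand-in snippet with the same newline-heavy structure as the profile page, you could swap each newline for a space:

```python
from bs4 import BeautifulSoup

# A stand-in snippet with newline characters, like the profile page's HTML
html = "<h2>Name: Dionysus</h2>\nHometown: Mount Olympus"
soup = BeautifulSoup(html, "html.parser")

text = soup.get_text()
cleaned = text.replace("\n", " ")  # swap newlines for spaces
print(cleaned)  # -> Name: Dionysus Hometown: Mount Olympus
```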
Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions.
However, other times the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of <img> HTML tags.
In this case, you can use find_all() to return a list of all instances of that particular tag:
>>> soup.find_all("img")
[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]
This returns a list of all <img> tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they're actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.
You can explore this a little by first unpacking the Tag objects from the list:
>>> image1, image2 = soup.find_all("img")
Each Tag object has a .name property that returns a string containing the HTML tag type:
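For instance, here's a minimal, self-contained sketch using the same two image tags from the Dionysus page:

```python
from bs4 import BeautifulSoup

# The two image tags from the profile page, as a standalone snippet
html = '<img src="/static/dionysus.jpg"/><img src="/static/grapes.png"/>'
soup = BeautifulSoup(html, "html.parser")
image1, image2 = soup.find_all("img")

print(image1.name)  # -> img
```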
You can access the HTML attributes of the Tag object by putting their names between square brackets, just as if the attributes were keys in a dictionary.
For example, the <img src="/static/dionysus.jpg"/> tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link <a href="https://realpython.com" target="_blank"> has two attributes, href and target.
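As a quick self-contained sketch of that dictionary-style access, using the link tag above:

```python
from bs4 import BeautifulSoup

# The example link tag from above, as a standalone snippet
html = '<a href="https://realpython.com" target="_blank">Real Python</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a
print(link["href"])    # -> https://realpython.com
print(link["target"])  # -> _blank
```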
To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:
>>> image1["src"]
'/static/dionysus.jpg'
>>> image2["src"]
'/static/grapes.png'
Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the <title> tag in a document, you can use the .title property:
>>> soup.title
<title>Profile: Dionysus</title>
If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you'll notice that the <title> tag is written in all caps with spaces:
<TITLE >Profile: Dionysus</TITLE / >
Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.
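You can see this cleanup behavior with a made-up messy snippet (a stand-in for the live page source, not the actual profile HTML):

```python
from bs4 import BeautifulSoup

# A made-up messy tag: all caps, with a stray space before the bracket
messy = '<A HREF="/profiles" >Profiles</A >'
soup = BeautifulSoup(messy, "html.parser")

# Beautiful Soup lowercases the tag name and strips the extra spaces
print(soup.a)  # -> <a href="/profiles">Profiles</a>
```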
You can also retrieve just the string between the title tags with the .string property of the Tag object:
>>> soup.title.string
'Profile: Dionysus'
One of the features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the <img> tags that have a src attribute equal to the value /static/dionysus.jpg, then you can provide the following additional argument to .find_all():
>>> soup.find_all("img", src="/static/dionysus.jpg")
[<img src="/static/dionysus.jpg"/>]
This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from it. If you spend some time browsing various websites and viewing their page sources, then you'll notice that many websites have extremely complicated HTML structures.
When scraping data from websites with Python, you're often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.
Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag that you're interested in and extract the data you need.
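For instance, assuming a page whose markup gives the data you want a unique id attribute (a hypothetical snippet, not the Dionysus page), you could jump straight to that tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup where the interesting data carries a unique id
html = """
<div class="sidebar">Ads</div>
<div class="price" id="total">19.99</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Access the tag directly by its unique attribute, no regex needed
price = soup.find("div", id="total")
print(price.string)  # -> 19.99
```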
In some cases, you may find that Beautiful Soup doesn't offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you're comfortable using Beautiful Soup.
Note: HTML parsers like Beautiful Soup can save you a lot of time and effort when it comes to locating specific data in web pages. However, sometimes HTML is so poorly written and disorganized that even a sophisticated parser like Beautiful Soup can't interpret the HTML tags properly.
In this case, you're often left with using .find() and regular expression techniques to try to parse out the information that you need.
Beautiful Soup is great for scraping data from a website's HTML, but it doesn't provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then Beautiful Soup alone won't get you very far.
Check Your Understanding
Expand the block below to check your understanding.
Write a program that grabs the full HTML from the page at the URL http://olympus.realpython.org/profiles.
Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag.
The final output should look like this:
http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus
Make sure that you only have one slash (/) between the base URL and the relative URL.
You can expand the block below to see a solution:
First, import the urlopen function from the urllib.request module and the BeautifulSoup class from the bs4 package:
from urllib.request import urlopen
from bs4 import BeautifulSoup
Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:
base_url = "http://olympus.realpython.org"
You can build a full URL by concatenating base_url with a relative URL.
Now open the /profiles page with urlopen() and use .read() to get the HTML source:
html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")
With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:
soup = BeautifulSoup(html_text, "html.parser")
soup.find_all("a") returns a list of all the links in the HTML source. You can loop over this list to print out all the links on the web page:
for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)
You can access the relative URL for each link through the "href" subscript. Concatenate this value with base_url to create the full link_url.
When you're ready, you can move on to the next section.