Although regular expressions are great for pattern matching in general, it's often easier to use an HTML parser that's explicitly designed for parsing HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.
Install Beautiful Soup
To install Beautiful Soup, you can run the following in your terminal:

```
$ python -m pip install beautifulsoup4
```

With this command, you're installing the latest version of Beautiful Soup into your global Python environment.
Type the following program into a new editor window:
```python
# beauty_soup.py

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
```
This program does three things:

1. Opens the URL http://olympus.realpython.org/profiles/dionysus by using urlopen()
2. Reads the HTML from the page as a string and assigns it to the html variable
3. Creates a BeautifulSoup object and assigns it to the soup variable

The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python's built-in HTML parser.
Save and run the above program. When it's finished running, you can use the soup variable in the interactive window to parse the content of html in various ways.
Note: If you're not using IDLE, then you can run your program with the -i flag to enter interactive mode. Something like python -i beauty_soup.py will first run your program and then leave you in a REPL where you can explore your objects.
BeautifulSoup objects have a .get_text() method that you can use to extract all the text from the document and automatically remove any HTML tags.
Type the following code into IDLE's interactive window or at the end of the code in your editor:
```python
>>> print(soup.get_text())


Profile: Dionysus

Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard

Favorite Color: Wine

```
There are a lot of blank lines in this output. These are the result of newline characters in the HTML document's text. You can remove them with the .replace() string method if you need to.
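As a quick sketch of that cleanup, here's a self-contained example. The HTML string below is a made-up stand-in for the downloaded profile page, so this runs without a network connection:

```python
from bs4 import BeautifulSoup

# A small stand-in for the downloaded profile HTML
html = "<html><body><h2>Profile: Dionysus</h2>\n\n<p>Hometown: Mount Olympus</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

# Collapse the runs of blank lines left over from the HTML's newline characters
while "\n\n" in text:
    text = text.replace("\n\n", "\n")

print(text)
```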
Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions.
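For instance, you might extract all the text and then locate one labeled value with .find(). The HTML string here is a made-up fragment for illustration:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Name: Dionysus</p><p>Hometown: Mount Olympus</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Extract all the text, then use the .find() string method to locate a label
text = soup.get_text()
start = text.find("Hometown:") + len("Hometown:")
hometown = text[start:].strip()
print(hometown)
```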
However, other times the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of <img> HTML tags.
In this case, you can use find_all() to return a list of all instances of that particular tag:
```python
>>> soup.find_all("img")
[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]
```
This returns a list of all <img> tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they're actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.
You can explore this a little by first unpacking the Tag objects from the list:
>>> image1, image2 = soup.find_all("img")
Each Tag object has a .name property that returns a string containing the HTML tag type.
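For instance, here's a small stand-alone example (using an inline HTML string rather than the live page, so it runs on its own):

```python
from bs4 import BeautifulSoup

html = '<img src="/static/dionysus.jpg"/><img src="/static/grapes.png"/>'
soup = BeautifulSoup(html, "html.parser")

# Unpack the two Tag objects and inspect the tag type of the first one
image1, image2 = soup.find_all("img")
print(image1.name)  # img
```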
You can access the HTML attributes of the Tag object by putting their names between square brackets, just as if the attributes were keys in a dictionary.

For example, the <img src="/static/dionysus.jpg"/> tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link <a href="https://realpython.com" target="_blank"> has two attributes, href and target.
To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:
```python
>>> image1["src"]
'/static/dionysus.jpg'
>>> image2["src"]
'/static/grapes.png'
```
Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the <title> tag in a document, you can use the .title property:
```python
>>> soup.title
<title>Profile: Dionysus</title>
```
If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you'll notice that the <title> tag is written in all caps with extra spaces, something like <TITLE >Profile: Dionysus</title />. Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.
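You can see this cleanup in action by feeding Beautiful Soup a deliberately messy tag directly (the snippet below is a made-up example of the kind of malformed markup described above):

```python
from bs4 import BeautifulSoup

# An uppercase tag with a stray space and a slash in the closing tag
messy_html = "<TITLE >Profile: Dionysus</title / >"
soup = BeautifulSoup(messy_html, "html.parser")

# Beautiful Soup normalizes the tag name and drops the extra characters
print(soup.title)
print(soup.title.string)
```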
You can also retrieve just the string between the title tags with the .string property of the Tag object:
```python
>>> soup.title.string
'Profile: Dionysus'
```
One of the features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the <img> tags that have a src attribute equal to the value /static/dionysus.jpg, then you can provide the following additional argument to .find_all():
```python
>>> soup.find_all("img", src="/static/dionysus.jpg")
[<img src="/static/dionysus.jpg"/>]
```
This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from the example. If you spend some time browsing various websites and viewing their page sources, then you'll notice that many websites have extremely complicated HTML structures.

When scraping data from websites with Python, you're often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.
Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag that you're interested in and extract the data you need.
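As a sketch of that idea, suppose a page marks the data you want with a unique attribute. The HTML fragment and the id value below are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment where one element has a unique id
html = """
<div class="sidebar">Navigation</div>
<div id="favorite-animal">Leopard</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Jump straight to the tag you care about instead of searching the text
animal = soup.find("div", id="favorite-animal")
print(animal.string)
```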
In some cases, you may find that Beautiful Soup doesn't offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you're comfortable using Beautiful Soup.
Note: HTML parsers like Beautiful Soup can save you a lot of time and effort when it comes to locating specific data in web pages. However, sometimes HTML is so poorly written and disorganized that even a sophisticated parser like Beautiful Soup can't interpret the HTML tags properly. In this case, you're often left with using .find() and regular expression techniques to try to parse out the information that you need.
Beautiful Soup is great for scraping data from a website's HTML, but it doesn't provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then Beautiful Soup alone won't get you very far.
Check Your Understanding
Expand the block below to check your understanding.
Write a program that grabs the full HTML from the page at the URL http://olympus.realpython.org/profiles.

Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag.
The final output should look like this:
```
http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus
```
Make sure that you only have one slash (/) between the base URL and the relative URL.
You can expand the block below to see a solution:
First, import the urlopen function from the urllib.request module and the BeautifulSoup class from the bs4 package:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
```
Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:
base_url = "http://olympus.realpython.org"
You can build a full URL by concatenating base_url with a relative URL.
Now open the /profiles page with urlopen() and use .read() to get the HTML source:
```python
html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")
```
With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:
soup = BeautifulSoup(html_text, "html.parser")
soup.find_all("a") returns a list of all the links in the HTML source. You can loop over this list to print out all the links on the web page:
```python
for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)
```
You can access the relative URL for each link through the "href" subscript. Concatenate this value with base_url to create the full link_url.
When you're ready, you can move on to the next section.