A Practical Introduction to Web Scraping in Python – Real Python

by learningcode_x1mckf · October 17, 2022 · in Python


Though regular expressions are great for pattern matching in general, sometimes it's easier to use an HTML parser that's explicitly designed for parsing HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.

Install Beautiful Soup

To install Beautiful Soup, you can run the following in your terminal:

$ python -m pip install beautifulsoup4

With this command, you're installing the latest version of Beautiful Soup into your global Python environment.

Create a BeautifulSoup Object

Type the following program into a new editor window:

# beauty_soup.py

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

This program does three things:

  1. Opens the URL http://olympus.realpython.org/profiles/dionysus by using urlopen() from the urllib.request module

  2. Reads the HTML from the page as a string and assigns it to the html variable

  3. Creates a BeautifulSoup object and assigns it to the soup variable

The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python's built-in HTML parser.

Use a BeautifulSoup Object

Save and run the above program. When it's finished running, you can use the soup variable in the interactive window to parse the content of html in a variety of ways.

Note: If you're not using IDLE, then you can run your program with the -i flag to enter interactive mode. Something like python -i beauty_soup.py will first run your program and then leave you in a REPL where you can explore your objects.

For example, BeautifulSoup objects have a .get_text() method that you can use to extract all the text from the document and automatically remove any HTML tags.

Type the following code into IDLE's interactive window or at the end of the code in your editor:


>>> print(soup.get_text())


Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard

Favorite Color: Wine

There are a lot of blank lines in this output. These are the result of newline characters in the HTML document's text. You can remove them with the .replace() string method if you need to.
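As a quick sketch (using a short stand-in string rather than the real soup.get_text() output), one way to drop those blank lines is to split the text into lines, keep only the non-empty ones, and rejoin them:

```python
# A stand-in for the text that soup.get_text() returns,
# including the blank lines produced by newline characters.
text = "\n\nProfile: Dionysus\n\n\n\nName: Dionysus\n\nHometown: Mount Olympus\n"

# Keep only the lines that contain visible text, then rejoin them.
cleaned = "\n".join(line for line in text.splitlines() if line.strip())
print(cleaned)
```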

Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions.
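For instance, here's a minimal sketch of that approach, with a hypothetical stand-in for the extracted profile text: locate a label with .find(), then slice out the value that follows it.

```python
# Hypothetical stand-in for the text extracted by soup.get_text().
text = "Profile: Dionysus\nName: Dionysus\nHometown: Mount Olympus\n"

# Find where the "Hometown:" label ends, then read to the end of that line.
start = text.find("Hometown:") + len("Hometown:")
end = text.find("\n", start)
hometown = text[start:end].strip()
print(hometown)  # Mount Olympus
```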

However, other times the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of <img> HTML tags.

In this case, you can use find_all() to return a list of all instances of that particular tag:


>>> soup.find_all("img")
[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This returns a list of all <img> tags in the HTML document. The items in the list look like they might be strings representing the tags, but they're actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.
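As a small self-contained illustration (parsing an inline HTML string instead of the downloaded page), you can confirm the type of the items and collect every src value with a list comprehension:

```python
from bs4 import BeautifulSoup, Tag

# Inline HTML standing in for the downloaded profile page.
html = '<img src="/static/dionysus.jpg"/><img src="/static/grapes.png"/>'
soup = BeautifulSoup(html, "html.parser")

images = soup.find_all("img")
print(type(images[0]))                 # each item is a Tag, not a str
print([img["src"] for img in images])  # collect every src attribute
```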

You can explore this a little by first unpacking the Tag objects from the list:


>>> image1, image2 = soup.find_all("img")

Each Tag object has a .name property that returns a string containing the HTML tag type:

>>> image1.name
'img'

You can access the HTML attributes of the Tag object by putting their names between square brackets, just as if the attributes were keys in a dictionary.

For example, the <img src="/static/dionysus.jpg"/> tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link <a href="https://realpython.com" target="_blank"> has two attributes, href and target.
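As an illustration (again parsing an inline HTML string rather than the fetched page), you can combine the square-bracket lookup with the Tag object's .attrs dictionary and its .get() method, which returns a default instead of raising a KeyError for a missing attribute:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the link tag discussed above.
html = '<a href="https://realpython.com" target="_blank">Real Python</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

print(link["href"])           # square-bracket lookup, like a dictionary key
print(link.attrs)             # all of the tag's attributes as a dict
print(link.get("title", ""))  # .get() returns a default for missing attributes
```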

To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:


>>> image1["src"]
'/static/dionysus.jpg'

>>> image2["src"]
'/static/grapes.png'

Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the <title> tag in a document, you can use the .title property:


>>> soup.title
<title>Profile: Dionysus</title>

If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you'll notice that the <title> tag is written in all caps with spaces:

[Screenshot of the Dionysus website with source code]

Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.

You can also retrieve just the string between the title tags with the .string property of the Tag object:


>>> soup.title.string
'Profile: Dionysus'

One of the features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the <img> tags that have a src attribute equal to the value /static/dionysus.jpg, then you can provide the following additional argument to .find_all():


>>> soup.find_all("img", src="/static/dionysus.jpg")
[<img src="/static/dionysus.jpg"/>]

This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from it. If you spend some time browsing various websites and viewing their page sources, then you'll notice that many websites have extremely complicated HTML structures.

When scraping data from websites with Python, you're often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.

Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag that you're interested in and extract the data you need.

In some cases, you may find that Beautiful Soup doesn't offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you're comfortable using Beautiful Soup.

Note: HTML parsers like Beautiful Soup can save you a lot of time and effort when it comes to locating specific data in web pages. However, sometimes HTML is so poorly written and disorganized that even a sophisticated parser like Beautiful Soup can't interpret the tags correctly.

In this case, you're often left using .find() and regular expression techniques to try to parse out the information that you need.

Beautiful Soup is great for scraping data from a website's HTML, but it doesn't provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then Beautiful Soup alone won't get you very far.

Check Your Understanding

Use the following exercise to check your understanding.

Write a program that grabs the full HTML from the page at the URL http://olympus.realpython.org/profiles.

Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag.

The final output should look like this:

http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus

Make sure that you have only one slash (/) between the base URL and the relative URL.
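If you'd rather not manage the slashes by hand, the standard library's urljoin() is one way to guarantee exactly one slash between the two parts:

```python
from urllib.parse import urljoin

base_url = "http://olympus.realpython.org"

# urljoin() resolves the relative URL against the base,
# producing exactly one slash between the two parts.
full_url = urljoin(base_url + "/", "profiles/aphrodite")
print(full_url)
```

Note that urljoin(base_url + "/", "/profiles/aphrodite") produces the same result, since a leading slash in the relative URL replaces the base path.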

Here's one possible solution:

First, import the urlopen function from the urllib.request module and the BeautifulSoup class from the bs4 package:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:

base_url = "http://olympus.realpython.org"

You can build a full URL by concatenating base_url with a relative URL.

Now open the /profiles page with urlopen() and use .read() to get the HTML source:

html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")

With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:

soup = BeautifulSoup(html_text, "html.parser")

soup.find_all("a") returns a list of all the links in the HTML source. You can loop over this list to print out all the links on the web page:

for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)

You can access the relative URL for each link through the "href" subscript. Concatenate this value with base_url to create the full link_url.

While you’re prepared, you possibly can transfer on to the following part.





© 2022 Copyright Learning Code