Issue
I want to scrape a page from a website that includes the following HTML:
<div class="section">
<div ng-bind="1" class="item even">First item</div>
<div ng-bind="2" class="item odd">Second item</div>
<div ng-bind="3" class="item-alt even">Third item</div>
</div>
Here is my code:
from bs4 import BeautifulSoup
from urllib import request
my_url = 'https://some.site/some/file.html?param=value'
with request.urlopen(my_url) as r:
    soup = BeautifulSoup(r.read(), "html.parser")
result = soup.findAll('item')
print(result)
But I get an empty list ([]) as a result. I also tried:
result = soup.find('item')
print(result)
But that prints None.
Why doesn't my code find the items? I can see the items on the page in my browser, so I know they are there.
Solution
The above is a very common type of question about web scraping in general and BeautifulSoup in particular. The problem is usually one of the following (each explained below):
- trying to match a class, but using the syntax to match an element / tag
- trying to match part of the class name the element actually has
- trying to match a single class, when the elements needed have multiple
- trying to match elements that aren't loaded by the script (but are loaded in the browser)
Other common problems are a page not actually loading, i.e. an HTTP response status other than 200 is returned. A status code of 403 indicates access is not allowed and may be resolved by adding headers or cookies to the request. A status code of 500 indicates a server problem, which may be caused by making a request that triggers an error on the server side. It's also possible that a response is only correct after previous pages have been visited; again, providing the correct headers or cookies may resolve that.
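As a minimal sketch of checking the status before parsing (the function name and error messages are just illustrative), this avoids silently scraping an error page:

```python
from urllib import request
from urllib.error import HTTPError


def fetch_html(url, headers=None):
    """Fetch a page and return its HTML, raising a clear error on
    any non-200 status instead of silently parsing an error page."""
    req = request.Request(url, headers=headers or {})
    try:
        with request.urlopen(req) as r:
            if r.status != 200:
                raise RuntimeError(f"Unexpected status {r.status} for {url}")
            return r.read().decode("utf-8", errors="replace")
    except HTTPError as e:
        # e.g. a 403 may go away after adding a User-Agent header or cookies
        raise RuntimeError(f"Request failed with status {e.code} for {url}") from e
```

If a site returns 403, retrying with something like headers={'User-Agent': 'Mozilla/5.0'} is often enough, though some sites also require cookies from a previous visit.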
Matching a tag instead of a class
Where the code above reads:
result = soup.findAll('item')
If it instead read:
result = soup.findAll('div')
There would then be at least 4 matches: the first being the outer div with all its contents, and then each of the inner divs separately.
To actually match divs with the item class, the code would have to be:
result = soup.findAll('div', {"class": "item"})
To match multiple tag types with that class, for example both div and td:
result = soup.findAll(['div', 'td'], {"class": "item"})
Matching partial class name
In the previous example, one div still would not be matched by:
result = soup.findAll('div', {"class": "item"})  # only 2 results
since one of the divs actually has the class item-alt, which starts with item, but whose full class name is item-alt.
To match partial class names, you can make use of the fact that .findAll() accepts regular expressions and functions as values to compare attribute values (like class) against:
import re
result = soup.findAll('div', {"class": re.compile('item.*')}) # finds all 3
result = soup.findAll('div', {"class": lambda c: c and c.startswith('item')})  # also finds all 3; the "c and" guards against tags without a class
Regular expressions and lambdas are very powerful and there is plenty of documentation and tutorial material on how to use them; neither requires installing third-party packages, as both are part of standard Python.
Matching multiple classes
If only elements with the odd class need to be matched, the following does not work:
result = soup.findAll('div', {"class": ["item", "odd"]})
This instead matches any element that has either the item or the odd class, so in the question's example it would match both of the first 2 inner divs. Think of this as selecting one class or the other.
To match only the div that has both classes (the second inner div), using .select() is a good option:
result = soup.select('div.item.odd')  # .odd.item would also work
This selects only div elements that have both the one class and the other.
Using .select() could also work for some of the problems above, but it lacks some of the options of .findAll() and may perform differently. It is mainly useful when the search can be expressed as a CSS selector, and keep in mind that its support for pseudo-classes is very limited.
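To illustrate the difference on the question's HTML (a sketch; the tag-function below is just one way to express "has both classes" with .findAll()):

```python
from bs4 import BeautifulSoup

html = """<div class="section">
<div ng-bind="1" class="item even">First item</div>
<div ng-bind="2" class="item odd">Second item</div>
<div ng-bind="3" class="item-alt even">Third item</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: only divs that have BOTH classes
by_select = soup.select('div.item.odd')

# .findAll() equivalent: pass a function that receives the whole tag
by_findall = soup.findAll(
    lambda tag: tag.name == 'div'
    and 'item' in tag.get('class', [])
    and 'odd' in tag.get('class', [])
)

print([t.get_text() for t in by_select])   # ['Second item']
print([t.get_text() for t in by_findall])  # ['Second item']
```

Note that the function passed to .findAll() here filters on the whole tag, not on a single attribute, which is why it can check both classes at once.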
Matching elements that don't get loaded
Even if your BeautifulSoup code is perfect, you may still not find elements that you can clearly see when viewing the page in a browser.
This is because most users will load the HTML using urllib (like the example above) or a third-party library like requests. Both will load the HTML just fine, but neither will execute any scripts that a browser would load and run along with the page.
If the elements you are trying to scrape are generated by JavaScript, or fetched after the initial page load and inserted into the document by JavaScript, they won't be available in the loaded HTML itself.
The ng-bind attributes in the example above are a clear indication that this may be the case here, since they indicate that the page uses Angular. You may see other, similar attributes in pages built with other web frameworks. In general, you should load the HTML and save or print it, so you can inspect whether the elements you are trying to match are actually present in the HTML as loaded.
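A small helper along these lines (the function name and file path are just illustrative) makes that inspection step repeatable:

```python
def dump_and_check(html, needle, path="page_dump.html"):
    """Save fetched HTML to a file for manual inspection, and report
    whether the text you expect to scrape appears in the raw document."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return needle in html

# Typical use after fetching, e.g.:
#   with request.urlopen(my_url) as r:
#       html = r.read().decode("utf-8", errors="replace")
#   if not dump_and_check(html, "item"):
#       print("not in raw HTML - likely rendered by JavaScript")
```

If the check fails but the element shows up in the browser, the content is almost certainly rendered client-side, and a browser-driven approach is needed.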
If not, a solution using a third-party library like selenium may be required. Selenium allows a Python script to 'puppeteer' a browser, either visibly or invisibly in the background. It will load the page, and you can even use Selenium to interact with it (click buttons and links, fill out values, etc.).
A simple example matching the code above:
from bs4 import BeautifulSoup
from selenium import webdriver
from os import environ, pathsep

# make sure the browser driver (geckodriver for Firefox) can be found
environ["PATH"] += pathsep + './bin'

browser = webdriver.Firefox()
browser.get(my_url)  # my_url as defined in the question's code above
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()  # close the browser once the page source has been captured
This example uses Firefox, but you can use most common browsers, as long as you download the matching selenium browser driver for it and put it in a location you add to the path (./bin in this example).
Answered By - Grismar