Issue
I have this html_scraper function that takes three parameters: a URL, a rate limit, and a target element selector. The for loop calls the scraper function for each URL in url_list. I expect the function to return the scraped content of each page, but it only returns the scraped content of the last URL in the loop. The list that collects page_contents for each iteration is outside the for loop, so what am I doing wrong here?
from bs4 import BeautifulSoup
import lxml
import requests
import time

scraped_html = []

def html_scraper(url, ratelimit=1.0, target_element_selector='example-id-0'):
    page_contents = {
        'pageTitle': None,
        'content': None,
    }
    result = requests.get(url).text
    soup = BeautifulSoup(result, 'lxml')
    page_contents['pageTitle'] = soup.find('h1')
    page_contents['content'] = soup.find(id=target_element_selector)
    time.sleep(ratelimit)
    return page_contents
if __name__ == '__main__':
    url_list = [
        'https://example.com/page-1',
        'https://example.com/page-2',
        'https://example.com/page-3',
    ]
    for url in url_list:
        try:
            scraped = html_scraper(url, 0.5, 'example-id-1')
            scraped_html.append(scraped)
        except Exception as e:
            print(e)
    print(scraped_html)
# [{'pageTitle': None, 'content': None}, {'pageTitle': None, 'content': None}, {'pageTitle': Example Page 3 Title, 'content': <div id="example-id-1">Blah-blah-blah-blah-blah</div>}]
Solution
So, I replaced requests.get(url).text with urllib.request.urlopen(url) and the problem was resolved. When the html_scraper function is used on a single page, i.e. not inside a for loop, the requests package works, but inside a for loop it only retrieved the contents of the last page. Thanks to @JohnGordon!
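For reference, here is a sketch of what the fixed function can look like. BeautifulSoup accepts a file-like object, so the response from urlopen() can be passed to it directly. Splitting the parsing into its own helper (parse_page is my own naming, not from the original post) makes the extraction logic testable without a network call:

```python
import time
import urllib.request

from bs4 import BeautifulSoup


def parse_page(html, target_element_selector):
    """Extract the first <h1> and the target element from raw HTML
    (a string, bytes, or a file-like object)."""
    soup = BeautifulSoup(html, 'lxml')
    return {
        'pageTitle': soup.find('h1'),
        'content': soup.find(id=target_element_selector),
    }


def html_scraper(url, ratelimit=1.0, target_element_selector='example-id-0'):
    # urlopen() returns a file-like http.client.HTTPResponse,
    # which BeautifulSoup can consume directly.
    with urllib.request.urlopen(url) as response:
        page_contents = parse_page(response, target_element_selector)
    time.sleep(ratelimit)  # crude rate limit between requests
    return page_contents
```

Because parse_page builds a fresh dict on every call, each iteration of the loop appends an independent result, so the "only the last page" symptom here came from the failed requests, not from the list handling.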
I've dug a little deeper into why urllib.request.urlopen() works and requests.get().text doesn't for my use case. The former returns an <http.client.HTTPResponse object>, while the latter returned an 'HTTP 403 - Forbidden' error page. I found the answer to the following StackOverflow question relevant: "What are the differences between the urllib, urllib2, urllib3 and requests module?"
Answered By - japonix