Issue
I am trying to build a list of URLs from an online forum. Using BeautifulSoup is a requirement in my case. The goal is a list of URLs containing every page of a thread, e.g.
[http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html,
http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-2.html,
http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-3.html]
What works is this:
#import modules
import requests
from bs4 import BeautifulSoup
#define main-url
url = 'http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html'
#create a list of urls
urls = [url]
#load url
page = requests.get(url)
#parse it using BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
#search for the url of the next page
nextpage = soup.find("a", ["pag next"]).get('href')
#append the urls of the next page to the list of urls
urls.append(nextpage)
print(urls)
When I try to build a loop for the next pages as follows, it does not work. Why?
for url in urls:
    urls.append(soup.find("a", ["pag next"]).get('href'))
(soup.find("a", ["pag next"]).get('href') identifies the URL of the next page.)
Hard-coding the pagination into the URL is not an option because there will be many other threads to crawl. I'm using Jupyter Notebook server 5.7.4 on a MacBook Pro, with Python 3.7.1 and IPython 7.2.0.
I know about the existence of this post. For my beginner's knowledge the code there is written in too complicated a way, but maybe your experience allows you to apply it to my use case.
Solution
The URL pattern for pagination is consistent across this site, so you don't need to make a request for every page just to discover the next URL. Instead, you can parse the text of the pagination button that reads "Seite 1 von 10" ("Page 1 of 10") and construct all the page URLs once you know the final page number.
import re
import requests
from bs4 import BeautifulSoup

thread_url = "http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html"

# Fetch and parse the first page of the thread
r = requests.get(thread_url)
soup = BeautifulSoup(r.content, 'lxml')

# The pagination link reads e.g. "Seite 1 von 10"; capture the total page count
pattern = re.compile(r'Seite\s\d+\svon\s(\d+)', re.I)
pages = soup.find('a', text=pattern).text.strip()
pages = int(pattern.match(pages).group(1))

# Page 1 has no "-1" suffix, so keep the original URL and suffix pages 2 onward
page_urls = [thread_url] + [f"{thread_url[:-5]}-{p}.html" for p in range(2, pages + 1)]

for url in page_urls:
    print(url)
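As for why the original loop does not work: soup is parsed only once, from the first page, so soup.find("a", ["pag next"]) returns the same link on every iteration, and because the loop appends to urls while iterating over it, it never terminates. If the link-following approach is still wanted (e.g. for threads with inconsistent URL patterns), each page has to be fetched and re-parsed inside the loop. Here is a minimal sketch, assuming the "pag next" class from the question is correct and that the href attribute holds an absolute URL:
import requests
from bs4 import BeautifulSoup

url = 'http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html'
urls = [url]
while True:
    # Re-fetch and re-parse the most recently discovered page
    soup = BeautifulSoup(requests.get(urls[-1]).text, 'html.parser')
    # The "next" button is absent on the last page, so find() returns None
    nextpage = soup.find("a", ["pag next"])
    if nextpage is None:
        break
    urls.append(nextpage.get('href'))
print(urls)
This costs one request per page, which is why the pattern-based construction above is preferable when the URL scheme is known.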
Answered By - nicholishen