Issue
I am trying to build a list of URLs from an online forum. Using BeautifulSoup is a requirement in my case. The goal is a list of URLs containing every page of a thread, e.g.
[http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html,
http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-2.html,
http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-3.html]
What works is this:
#import modules
import requests
from bs4 import BeautifulSoup
#define main-url
url = 'http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html'
#create a list of urls
urls = [url]
#load url
page = requests.get(url)
#parse it using BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
#search for the url of the next page
nextpage = soup.find("a", ["pag next"]).get('href')
#append the urls of the next page to the list of urls
urls.append(nextpage)
print(urls)
When I try to build a loop for the next pages as follows, it does not work. Why?
for url in urls:
    urls.append(soup.find("a", ["pag next"]).get('href'))
(soup.find("a", ["pag next"]).get('href') identifies the URL of the next page.)
Hard-coding the pagination into the URL is not an option because there will be many other threads to crawl. I'm using Jupyter Notebook server 5.7.4 on a MacBook Pro, with Python 3.7.1 and IPython 7.2.0.
I know about the existence of this post. For my beginner's knowledge the code there is written in too complicated a way, but maybe your experience allows you to apply it to my use case.
Solution
The URL pattern for pagination is consistent across this site, so you don't need to make a request for every page just to discover the next URL. Instead, you can parse the text of the pagination button that reads "Seite 1 von 10" ("Page 1 of 10") and construct all the page URLs once you know the final page number.
import re
import requests
from bs4 import BeautifulSoup

thread_url = "http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html"

# Fetch and parse the first page of the thread
r = requests.get(thread_url)
soup = BeautifulSoup(r.content, 'lxml')

# The pagination link reads e.g. "Seite 1 von 10"; capture the total page count
pattern = re.compile(r'Seite\s\d+\svon\s(\d+)', re.I)
pages = soup.find('a', text=pattern).text.strip()
pages = int(pattern.match(pages).group(1))

# Page 1 has no "-1" suffix, so keep the original URL and suffix pages 2 onward
page_urls = [thread_url] + [f"{thread_url[:-5]}-{p}.html" for p in range(2, pages + 1)]

for url in page_urls:
    print(url)
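As for why the original loop does not work: soup is parsed only once, from the first page, so soup.find("a", ["pag next"]) returns the same link on every iteration, and because the loop appends to urls while iterating over it, it never terminates. If the link-following approach is still wanted (e.g. for threads with inconsistent URL patterns), each page has to be fetched and re-parsed inside the loop. Here is a minimal sketch, assuming the "pag next" class from the question is correct and that the href attribute holds an absolute URL:
import requests
from bs4 import BeautifulSoup

url = 'http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html'
urls = [url]
while True:
    # Re-fetch and re-parse the most recently discovered page
    soup = BeautifulSoup(requests.get(urls[-1]).text, 'html.parser')
    # The "next" button is absent on the last page, so find() returns None
    nextpage = soup.find("a", ["pag next"])
    if nextpage is None:
        break
    urls.append(nextpage.get('href'))
print(urls)
This costs one request per page, which is why the pattern-based construction above is preferable when the URL scheme is known.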
Answered By - nicholishen