Sunday, December 26, 2021

[FIXED] Web Scraping: scrape multiple webs by Python


Issue

from bs4 import BeautifulSoup
import requests

url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
  pg = url + '?page=' + str(pg)
  soup = BeautifulSoup(page.content, 'lxml')
  for paragraph in soup.find_all('p'):
     print(paragraph.text)

I want to scrape the rating, review text and review date from https://uk.trustpilot.com/review/thread.com. However, I don't know how to scrape multiple pages and build a pandas DataFrame from the results.

Solution

You need to send a request to each page and then process the response. Also, some items are not directly available as text within tags, so you have to get them either from the JavaScript (I got the date this way, using json.loads) or from the class name (I got the rating this way).
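
The two extraction tricks described above can be tried in isolation, without any network access. This is only a minimal sketch: the helper names, the example class list and the example JSON string are hypothetical, chosen to mirror what the answer's code reads from the page, and the rating is converted to int here although the answer keeps it as a string.

```python
import json


def rating_from_classes(class_list):
    """Return the star rating encoded in an element's class list."""
    # Take the second class name and keep whatever follows the last '-'.
    return int(class_list[1].split('-')[-1])


def published_date(json_text):
    """Return the YYYY-MM-DD part of the 'publishedDate' field."""
    data = json.loads(json_text)
    # ISO timestamps separate the date and the time with 'T'.
    return data['publishedDate'].split('T')[0]


# Hypothetical inputs, shaped like the values the answer's code expects.
classes = ['star-rating', 'star-rating-4', 'star-rating--medium']
payload = '{"publishedDate": "2019-01-13T10:15:00Z", "updatedDate": null}'

print(rating_from_classes(classes))  # 4
print(published_date(payload))       # 2019-01-13
```

Keeping these steps in small functions makes it easier to adjust them if the site changes its class names or JSON layout.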

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://uk.trustpilot.com/review/thread.com'
final_list = []  # rows that will become the DataFrame

for pg in range(1, 3):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for review in soup.find_all('section', class_='review__content'):
        title = review.find('h2', class_='review-content__title').text.strip()
        content = review.find('p', class_='review-content__text').text.strip()
        # the dates are embedded as JSON inside this element
        datedata = json.loads(review.find('div', class_='review-content-header__dates').text)
        date = datedata['publishedDate'].split('T')[0]
        # the rating is the last part of the second class name
        rating_class = review.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        final_list.append([title, content, date, rating])

df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)

Output

                                                Title                                            Content        Date Rating
0                      I ordered a jacket 2 weeks ago  I ordered a jacket 2 weeks ago.  Still hasn't ...  2019-01-13      1
1              I've used this service for many years…  I've used this service for many years and get ...  2018-12-31      4
2                                       Great website  Great website, tailored recommendations, and e...  2018-12-19      5
3              I was excited by the prospect offered…  I was excited by the prospect offered by threa...  2018-12-18      1
4       Thread set the benchmark for customer service  Firstly, their customer service is second to n...  2018-12-12      5
5                                    It's a good idea  It's a good idea.  I am in between sizes and d...  2018-12-02      3
6                             Great experience so far  Great experience so far. Big choice of clothes...  2018-10-31      5
7                    Absolutely love using Thread.com  Absolutely love using Thread.com.  As a man wh...  2018-10-31      5
8                 I'd like to give Thread a one star…  I'd like to give Thread a one star review, but...  2018-10-30      2
9            Really enjoying the shopping experience…  Really enjoying the shopping experience on thi...  2018-10-22      5
10                         The only way I buy clothes  I absolutely love Thread. I've been surviving ...  2018-10-15      5
11                                  Excellent Service  Excellent ServiceQuick delivery, nice items th...  2018-07-27      5
12             Convenient way to order clothes online  Convenient way to order clothes online, and gr...  2018-07-05      5
13                Superb - would thoroughly recommend  Recommendations have been brilliant - no more ...  2018-06-24      5
14                    First time ordering from Thread  First time ordering from Thread - Very slow de...  2018-06-22      1
15          Some of these criticisms are just madness  I absolutely love thread.com, and I can't reco...  2018-05-28      5
16                                       Top service!  Great idea and fantastic service. I just recei...  2018-05-17      5
17                                      Great service  Great service. Great clothes which come well p...  2018-05-05      5
18                                          Thumbs up  Easy, straightforward and very good costumer s...  2018-04-17      5
19                 Good idea, ruined by slow delivery  I really love the concept and the ordering pro...  2018-04-08      3
20                                      I love Thread  I have been using thread for over a year. It i...  2018-03-12      5
21      Clever simple idea but.. low quality clothing  Clever simple idea but.. low quality clothingL...  2018-03-12      2
22                      Initially I was impressed....  Initially I was impressed with the Thread shop...  2018-02-07      2
23                                 Happy new customer  Joined the site a few weeks ago, took a short ...  2018-02-06      5
24                          Style tips for mature men  I'm a man of mature age, let's say a "baby boo...  2018-01-31      5
25            Every shop, every item and in one place  Simple, intuitive and makes online shopping a ...  2018-01-28      5
26                     Fantastic experience all round  Fantastic experience all round.  Quick to regi...  2018-01-28      5
27          Superb "all in one" shopping experience …  Superb "all in one" shopping experience that i...  2018-01-25      5
28  Great for time poor people who aren’t fond of ...  Rally love this company. Super useful for thos...  2018-01-22      5
29                            Really is worth trying!  Quite cautious at first, however, love the way...  2018-01-10      4
30           14 days for returns is very poor given …  14 days for returns is very poor given most co...  2017-12-20      3
31                  A great intro to online clothes …  A great intro to online clothes shopping. Usef...  2017-12-15      5
32                           I was skeptical at first  I was skeptical at first, but the service is s...  2017-11-16      5
33            seems good to me as i hate to shop in …  seems good to me as i hate to shop in stores, ...  2017-10-23      5
34                          Great concept and service  Great concept and service. This service has be...  2017-10-17      5
35                                      Slow dispatch  My Order Dispatch was extremely slow compared ...  2017-10-07      1
36             This company sends me clothes in boxes  This company sends me clothes in boxes! I find...  2017-08-28      5
37          I've been using Thread for the past six …  I've been using Thread for the past six months...  2017-08-03      5
38                                             Thread  Thread, this site right here is literally the ...  2017-06-22      5
39                                       good concept  The website is a good concept in helping buyer...  2017-06-14      3

Note: Although I was able to "hack" my way to the result for this site, it is generally better to use Selenium to scrape dynamic pages.

Edit: Code to find the number of pages automatically

import json
import math

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://uk.trustpilot.com/review/thread.com'
final_list = []  # rows that will become the DataFrame

# make one request to get the number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_='header--inline').text
review_count = int(review_count_h2.strip().split(' ')[0].strip())
# there are 20 reviews per page, so the number of pages is
pages = math.ceil(review_count / 20)

# loop over pages 1 to pages (inclusive)
for pg in range(1, pages + 1):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for review in soup.find_all('section', class_='review__content'):
        try:
            title = review.find('h2', class_='review-content__title').text.strip()
            content = review.find('p', class_='review-content__text').text.strip()
            # the dates are embedded as JSON inside this element
            datedata = json.loads(review.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            # the rating is the last part of the second class name
            rating_class = review.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # skip reviews that are missing one of the expected elements
            pass

df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
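
The page-count arithmetic above can be checked on its own. The following is a small sketch, not part of the original answer: the function names and the sample header strings are hypothetical, and stripping thousands separators is an added assumption for counts such as "1,234".

```python
import math

REVIEWS_PER_PAGE = 20  # page size stated in the answer


def parse_review_count(header_text):
    """Extract the leading integer from text such as '40 reviews'."""
    first_token = header_text.strip().split(' ')[0]
    # Assumption: large counts may contain thousands separators.
    return int(first_token.replace(',', ''))


def page_count(review_count, per_page=REVIEWS_PER_PAGE):
    """Number of pages needed to list review_count reviews."""
    return math.ceil(review_count / per_page)


print(page_count(parse_review_count('40 reviews')))     # 2
print(page_count(parse_review_count('41 reviews')))     # 3
print(page_count(parse_review_count('1,234 reviews')))  # 62
```

Rounding up matters because a partially filled last page still has to be requested: 41 reviews need three pages, not two.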

Answered By - Bitto Bennichan

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
