Issue
from bs4 import BeautifulSoup
import requests
url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
    pg = url + '?page=' + str(pg)
    soup = BeautifulSoup(page.content, 'lxml')
    for paragraph in soup.find_all('p'):
        print(paragraph.text)
I want to scrape the rating, review text and review date from https://uk.trustpilot.com/review/thread.com. However, I don't have any clue how to scrape multiple pages and build a pandas DataFrame from the scraping result.
Solution
You need to send a request to each page and then process the response. Also, some items are not directly available as text within tags, so you have to get them either from the embedded JavaScript (I got the date this way, using json.loads) or from the class name (I got the rating this way).
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup
final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 3):
    pg = url + '?page=' + str(pg)
    r = requests.get(pg)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        title = paragraph.find('h2', class_='review-content__title').text.strip()
        content = paragraph.find('p', class_='review-content__text').text.strip()
        # the dates are stored as JSON text inside this div
        datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
        date = datedata['publishedDate'].split('T')[0]
        # the rating is the last '-'-separated part of the second class name
        rating_class = paragraph.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        final_list.append([title, content, date, rating])
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
Output
    Title                                               Content                                            Date        Rating
 0  I ordered a jacket 2 weeks ago                      I ordered a jacket 2 weeks ago. Still hasn't ...  2019-01-13  1
 1  I've used this service for many years…              I've used this service for many years and get ...  2018-12-31  4
 2  Great website                                       Great website, tailored recommendations, and e...  2018-12-19  5
 3  I was excited by the prospect offered…              I was excited by the prospect offered by threa...  2018-12-18  1
 4  Thread set the benchmark for customer service       Firstly, their customer service is second to n...  2018-12-12  5
 5  It's a good idea                                    It's a good idea. I am in between sizes and d...  2018-12-02  3
 6  Great experience so far                             Great experience so far. Big choice of clothes...  2018-10-31  5
 7  Absolutely love using Thread.com                    Absolutely love using Thread.com. As a man wh...  2018-10-31  5
 8  I'd like to give Thread a one star…                 I'd like to give Thread a one star review, but...  2018-10-30  2
 9  Really enjoying the shopping experience…            Really enjoying the shopping experience on thi...  2018-10-22  5
10  The only way I buy clothes                          I absolutely love Thread. I've been surviving ...  2018-10-15  5
11  Excellent Service                                   Excellent ServiceQuick delivery, nice items th...  2018-07-27  5
12  Convenient way to order clothes online              Convenient way to order clothes online, and gr...  2018-07-05  5
13  Superb - would thoroughly recommend                 Recommendations have been brilliant - no more ...  2018-06-24  5
14  First time ordering from Thread                     First time ordering from Thread - Very slow de...  2018-06-22  1
15  Some of these criticisms are just madness           I absolutely love thread.com, and I can't reco...  2018-05-28  5
16  Top service!                                        Great idea and fantastic service. I just recei...  2018-05-17  5
17  Great service                                       Great service. Great clothes which come well p...  2018-05-05  5
18  Thumbs up                                           Easy, straightforward and very good costumer s...  2018-04-17  5
19  Good idea, ruined by slow delivery                  I really love the concept and the ordering pro...  2018-04-08  3
20  I love Thread                                       I have been using thread for over a year. It i...  2018-03-12  5
21  Clever simple idea but.. low quality clothing       Clever simple idea but.. low quality clothingL...  2018-03-12  2
22  Initially I was impressed....                       Initially I was impressed with the Thread shop...  2018-02-07  2
23  Happy new customer                                  Joined the site a few weeks ago, took a short ...  2018-02-06  5
24  Style tips for mature men                           I'm a man of mature age, let's say a "baby boo...  2018-01-31  5
25  Every shop, every item and in one place             Simple, intuitive and makes online shopping a ...  2018-01-28  5
26  Fantastic experience all round                      Fantastic experience all round. Quick to regi...  2018-01-28  5
27  Superb "all in one" shopping experience …           Superb "all in one" shopping experience that i...  2018-01-25  5
28  Great for time poor people who aren’t fond of ...   Rally love this company. Super useful for thos...  2018-01-22  5
29  Really is worth trying!                             Quite cautious at first, however, love the way...  2018-01-10  4
30  14 days for returns is very poor given …            14 days for returns is very poor given most co...  2017-12-20  3
31  A great intro to online clothes …                   A great intro to online clothes shopping. Usef...  2017-12-15  5
32  I was skeptical at first                            I was skeptical at first, but the service is s...  2017-11-16  5
33  seems good to me as i hate to shop in …             seems good to me as i hate to shop in stores, ...  2017-10-23  5
34  Great concept and service                           Great concept and service. This service has be...  2017-10-17  5
35  Slow dispatch                                       My Order Dispatch was extremely slow compared ...  2017-10-07  1
36  This company sends me clothes in boxes              This company sends me clothes in boxes! I find...  2017-08-28  5
37  I've been using Thread for the past six …           I've been using Thread for the past six months...  2017-08-03  5
38  Thread                                              Thread, this site right here is literally the ...  2017-06-22  5
39  good concept                                        The website is a good concept in helping buyer...  2017-06-14  3
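The Date and Rating columns in the output above hold strings, since they were cut out of JSON text and class names. A minimal sketch of converting them before any analysis is shown here. The two sample rows are made up for illustration and only stand in for the scraped final_list.

```python
import pandas as pd

# Two hypothetical rows shaped like the entries of final_list (assumed data).
sample_rows = [
    ['Great website', 'Great website, tailored recommendations', '2018-12-19', '5'],
    ['Slow dispatch', 'My order dispatch was extremely slow', '2017-10-07', '1'],
]
df = pd.DataFrame(sample_rows, columns=['Title', 'Content', 'Date', 'Rating'])

# Convert the string columns so they can be sorted and aggregated.
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df['Rating'] = df['Rating'].astype(int)

print(df.dtypes)
print(df['Rating'].mean())
```

With numeric ratings and real dates, calls such as df.sort_values('Date') or df.groupby(df['Date'].dt.year)['Rating'].mean() behave as expected.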
Note: Although I was able to "hack" my way to the result for this site, it is better to use Selenium to scrape dynamic pages.
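The two tricks behind that "hack" (reading the date from embedded JSON and the rating from a class name) can be tried offline. The sketch below runs on a hand-written HTML fragment that only imitates the selectors used in the answer. The fragment, the script tag inside the dates div and the parse_review helper are assumptions for illustration, not the site's real markup. It reads the script tag's .string because newer Beautiful Soup releases may leave script contents out of .text.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the classes the answer's code looks for.
SAMPLE_HTML = """
<section class="review__content">
  <div class="star-rating star-rating-4 star-rating--medium"></div>
  <div class="review-content-header__dates">
    <script type="application/json">{"publishedDate": "2018-12-31T10:15:00Z"}</script>
  </div>
  <h2 class="review-content__title"> Used it for years </h2>
  <p class="review-content__text"> Still happy with it. </p>
</section>
"""

def parse_review(section):
    """Return [title, content, date, rating] for one review section."""
    title = section.find('h2', class_='review-content__title').text.strip()
    content = section.find('p', class_='review-content__text').text.strip()
    # The date lives in JSON text, so parse it instead of reading tag text.
    dates_div = section.find('div', class_='review-content-header__dates')
    date_data = json.loads(dates_div.find('script').string)
    date = date_data['publishedDate'].split('T')[0]
    # The rating is the last '-'-separated part of the second class name.
    rating_class = section.find('div', class_='star-rating')['class']
    rating = rating_class[1].split('-')[-1]
    return [title, content, date, rating]

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [parse_review(s) for s in soup.find_all('section', class_='review__content')]
print(rows)
```

Keeping the parsing in its own function makes it easy to re-check the selectors against a saved page whenever the site changes its markup.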
Edit: Code to find out the number of pages automatically
import json
import math
import pandas as pd
import requests
from bs4 import BeautifulSoup
final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
# make one request to read the total number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_="header--inline").text
review_count = int(review_count_h2.strip().split(' ')[0].strip())
# there are 20 reviews per page, so the number of pages is
pages = int(math.ceil(review_count / 20))
# loop over pages 1 to pages (inclusive)
for pg in range(1, pages + 1):
    pg = url + '?page=' + str(pg)
    r = requests.get(pg)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        try:
            title = paragraph.find('h2', class_='review-content__title').text.strip()
            content = paragraph.find('p', class_='review-content__text').text.strip()
            datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            rating_class = paragraph.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # skip reviews that lack one of the expected elements
            pass
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
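The page arithmetic in the code above can be checked on its own, without any network access. This is a small sketch: parse_review_count and page_urls are hypothetical helper names, the header text format ("<count> Reviews", possibly with a thousands comma) is an assumption, and per_page=20 simply repeats the figure stated in the answer.

```python
import math

def parse_review_count(header_text):
    """Read the leading number from header text such as '1,234 Reviews' (assumed format)."""
    first_token = header_text.strip().split(' ')[0]
    return int(first_token.replace(',', ''))

def page_urls(base_url, review_count, per_page=20):
    """Build one URL per result page, rounding the page count up."""
    pages = math.ceil(review_count / per_page)
    return [f'{base_url}?page={n}' for n in range(1, pages + 1)]

count = parse_review_count(' 41 Reviews ')
urls = page_urls('https://uk.trustpilot.com/review/thread.com', count)
print(len(urls))
print(urls[-1])
```

Rounding up matters here: 41 reviews at 20 per page need three requests, because the last page holds the single remaining review.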
Answered By - Bitto Bennichan