Issue
I am new to web scraping. I want to scrape the data (comments and their respective dates) from this web page: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938 The topic is paginated. This is how I am doing it:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import json

AllEntries = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    noofforumpagesvodafone = 1000
    currentpage = 1
    page = browser.new_page()
    # Load a search-results page and collect the topic links from it
    page.goto('https://search.donanimhaber.com/?q=vodafone&p=' + str(currentpage) + '&token=-1&hash=56BB9D1746DBCDA94D0B1E5825EFF47D&order=date&in=all&type=both&scope=all&range=all', timeout=0)
    html = page.inner_html("div.results")
    soup = BeautifulSoup(html, 'html.parser')
    xx = [x.get('href') for x in soup.find_all('a')]
    xxi = 0
    time = []
    while xxi < 1:
        if xx[xxi][0] == "/":
            entry = []
            # page.goto('https://search.donanimhaber.com' + str(xx[xxi]), timeout=0)
            page.goto("https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938")
            html = page.inner_html("div.kl-icerik")
            soup = BeautifulSoup(html, 'html.parser')
            # Each reply sits in a div with class "ki-cevapicerigi"
            for table in soup.findAll('div', {'class': 'ki-cevapicerigi'}):
                # The post date is in a span with class "mButon info"
                for t in table.findAll('span', {'class': 'mButon info'}):
                    print(t.text)
                # The comment body is in a span with class "msg"
                for links in table.findAll('span', {'class': 'msg'}):
                    for link in links.findAll('td'):
                        print(link.text)
                    for linko in links.findAll('p'):
                        print(linko.text)
        xxi += 1
This code works, but only for the first page: it prints all comments and dates from page 1, but nothing from pages 2, 3, 4, ... which appear as you scroll to the bottom.
How can I scrape those pages as well? Thank you.
Solution
In your particular case, each page has its own link: the base link plus the page number, with a hyphen (-) in between.
You can see this behaviour by clicking on the second page and comparing your base link with the link you have now: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938-2
(notice the -2 at the end)
One way to do it would be to change the URL in a for loop, iterating up to 24 (the number of pages this topic currently has), and scrape each of those pages individually, as in the sketch below.
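A minimal sketch of that loop, reusing the selectors from the question. The page count (24) is taken from this answer, and the assumption that page 1 has no "-1" suffix follows from the base URL shown above; adjust both if the forum markup or page count changes:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

BASE_URL = "https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938"
NUM_PAGES = 24  # current page count of the topic; hard-coded assumption

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for pagenum in range(1, NUM_PAGES + 1):
        # Page 1 is the base URL itself; pages 2+ append "-<n>" to it
        url = BASE_URL if pagenum == 1 else f"{BASE_URL}-{pagenum}"
        page.goto(url, timeout=0)
        html = page.inner_html("div.kl-icerik")
        soup = BeautifulSoup(html, "html.parser")
        # Same selectors as in the question: one div per reply,
        # date in "mButon info", comment text in "msg"
        for reply in soup.find_all("div", {"class": "ki-cevapicerigi"}):
            for date in reply.find_all("span", {"class": "mButon info"}):
                print(date.text)
            for msg in reply.find_all("span", {"class": "msg"}):
                for p_tag in msg.find_all("p"):
                    print(p_tag.text)
    browser.close()

If the topic keeps growing, you could read the page count from the pagination links on the first page instead of hard-coding it.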
Answered By - Dugnom