Issue
I'm trying to scrape a dynamic website and I need Selenium.
The links that I want to scrape only open when I click on the specific element. They are opened by jQuery, so my only option is to click on them, because there is no href attribute or anything else that would give me a URL.
My approach is this one:
# -*- coding: utf-8 -*-
import scrapy
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest


class AnofmSpider(scrapy.Spider):
    name = 'anofm'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1',
            callback=self.parse
        )

    def parse(self, response):
        driver = response.meta['driver']
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "tableRepeat2"))
            )
        finally:
            html = driver.page_source
            response_obj = Selector(text=html)

            links = response_obj.xpath("//tbody[@id='tableRepeat2']")
            for link in links:
                driver.execute_script("arguments[0].click();", link)

                yield {
                    'Ocupatia': response_obj.xpath("//div[@id='print']/p/text()[1]")
                }
But it doesn't work.
On the line where I want to click on that element, I get this error:
TypeError: Object of type Selector is not JSON serializable
I kind of understand this error, but I have no idea how to solve it. I somehow need to transform that object from a Selector to a Clickable button.
I checked online for solutions and also the docs, but I couldn't find anything useful.
Can anybody help me understand this error and how to fix it?
Thanks.
Solution
The data is actually loaded from an API call that returns a JSON response, so you can scrape it directly from the API without Selenium. Here is a working solution with pagination. Each page contains 8 items, and there are 32 items in total.
CODE:
import scrapy
import json


class AnofmSpider(scrapy.Spider):
    name = 'anofm'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit=8&localitate=',
            method='GET',
            callback=self.parse,
            # Carry the current limit along so parse() can compute the next one
            meta={'limit': 8}
        )

    def parse(self, response):
        # The endpoint returns JSON; the items are under lmv -> data
        resp = json.loads(response.body)
        hits = resp.get('lmv').get('data')
        for h in hits:
            yield {
                'Ocupatia': h.get('OCCUPATION')
            }

        # Keep requesting while the next limit does not exceed the reported total
        total_limit = resp.get('lmv').get('total')
        next_limit = response.meta['limit'] + 8
        if next_limit <= total_limit:
            yield scrapy.Request(
                url=f'https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit={next_limit}&localitate=',
                method='GET',
                callback=self.parse,
                meta={'limit': next_limit}
            )
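The spider above keeps offset fixed and increases limit on each request. If the endpoint instead treats offset as the number of rows to skip and limit as the page size (an assumption, not confirmed here), the page URLs could be built by stepping offset. A minimal sketch; page_urls is a hypothetical helper, and the parameter names are copied from the URL above:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php"


def page_urls(total, page_size=8, county="Covasna"):
    """Build one URL per page.

    Assumption: `offset` skips rows and `limit` is the page size.
    """
    urls = []
    for offset in range(0, total, page_size):
        query = urlencode({
            "offset": offset,
            "cauta": "",
            "select": county,
            "limit": page_size,
            "localitate": "",
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls
```

With total=32 and page_size=8 this gives four URLs, with offset values 0, 8, 16 and 24. Each one could be passed to scrapy.Request with the same parse callback.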
Answered By - Fazlul
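On the original error: execute_script serializes its arguments as JSON before sending them to the browser. Selenium knows how to pass its own WebElement objects, but a Scrapy Selector is just a Python object to it, hence "Object of type Selector is not JSON serializable". If clicking is still needed, the elements have to be located through the driver rather than through Scrapy. A rough sketch, assuming a Selenium driver object; click_each is a hypothetical helper name:

```python
def click_each(driver, xpath):
    """Click every element matching `xpath`; return how many were clicked."""
    # find_elements returns Selenium WebElement objects, which execute_script
    # can pass to the browser. "xpath" is the string value of By.XPATH.
    elements = driver.find_elements("xpath", xpath)
    for element in elements:
        driver.execute_script("arguments[0].click();", element)
    return len(elements)
```

Inside parse this might be called as click_each(driver, "//tbody[@id='tableRepeat2']/tr"), where the /tr step is a guess at the clickable rows. Anything revealed by a click would then have to be read from a fresh driver.page_source, since response_obj was built before the click.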