Issue
I'm a beginner at web scraping. My goal is to scrape the site 'https://buscatextual.cnpq.br/buscatextual/busca.do'. It is a scientific site, so I need to check the box "Assunto(Título ou palavra chave da produção)" and type the word "grafos" into the page's main input. How can I do this using Scrapy? I have been trying with the following code, but I got several errors, and I have never dealt with POST requests before.
import scrapy


class LattesSpider(scrapy.Spider):
    name = 'lattesspider'
    login_url = 'https://buscatextual.cnpq.br/buscatextual/busca.do'
    start_urls = [login_url]

    def parse(self, response):
        data = {'filtros.buscaAssunto': True,
                'textoBusca': 'grafos'}
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_profiles)

    def parse_profiles(self, response):
        yield {'url': response.url,
               'nome': response.xpath("//a/text()").get()
               }
Solution
If Scrapy feels unfamiliar and you're finding it hard to locate elements on the page, I suggest using Playwright instead. Playwright is a much newer library than Scrapy. I suggest it because it drives a real browser, which makes it very easy to locate buttons, tick checkboxes, and fill text boxes using either CSS selectors or XPath. I have put a link to the installation instructions and documentation at the bottom of my answer.
Here's some example code I pulled together that should work:
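from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=False opens a visible browser window, which helps while testing
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://buscatextual.cnpq.br/buscatextual/busca.do')
    # Tick the "Assunto" checkbox, then type the search term
    page.locator('input#buscaAssunto').check()
    page.locator('input#textoBusca').fill('grafos')
    # Pause for 5 seconds so the result can be seen before the browser closes
    page.wait_for_timeout(5000)
    browser.close()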
Here I used CSS selectors, but you could also use XPath; Playwright accepts both. Note that I launched Chromium here; each browser needs a slightly different launch line:
Chromium: browser = p.chromium.launch()
Chrome: browser = p.chromium.launch(channel="chrome")
Msedge: browser = p.chromium.launch(channel="msedge")
Firefox: browser = p.firefox.launch()
Webkit: browser = p.webkit.launch()
Just replace the launch line in the example with the one for your browser and it should work for you.
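The browser list above can also be captured in a small lookup, so the browser becomes a parameter instead of a hand-edited line. This is only a sketch: `BROWSER_SPECS`, `launch_spec`, and `launch_browser` are names I made up for illustration, and only the `launch()` calls themselves come from the list above.

```python
# Hypothetical lookup: browser name -> (Playwright engine attribute, extra launch kwargs).
# The entries mirror the launch lines listed above.
BROWSER_SPECS = {
    "chromium": ("chromium", {}),
    "chrome": ("chromium", {"channel": "chrome"}),
    "msedge": ("chromium", {"channel": "msedge"}),
    "firefox": ("firefox", {}),
    "webkit": ("webkit", {}),
}


def launch_spec(name):
    """Return (engine, kwargs) for a browser name, case-insensitively."""
    try:
        return BROWSER_SPECS[name.lower()]
    except KeyError:
        raise ValueError(f"unsupported browser: {name!r}") from None


def launch_browser(p, name, headless=True):
    """Launch the named browser from a Playwright object `p`.

    `p` is the object yielded by `sync_playwright()`; Chrome and Edge are
    launched through the Chromium engine with a `channel` argument.
    """
    engine, kwargs = launch_spec(name)
    return getattr(p, engine).launch(headless=headless, **kwargs)
```

Inside `with sync_playwright() as p:`, a call such as `browser = launch_browser(p, "firefox", headless=False)` would then stand in for the hard-coded launch line.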
Note that I also included the headless=False argument, which let me watch the browser open, tick the checkbox, and fill in the text box (mainly for testing). Remove that argument to run in headless mode, which is the default. I also included page.wait_for_timeout(5000) to wait 5 seconds before closing the browser.
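As for the XPath alternative mentioned earlier, here is a minimal sketch of the same two actions written with XPath selectors. The element ids are the ones from my example; `xpath_for_input` and `fill_search_form` are hypothetical helper names, and I'm assuming the ids on the live page still match.

```python
def xpath_for_input(element_id):
    """Build a Playwright XPath selector for an <input> with the given id."""
    # The "xpath=" prefix tells Playwright explicitly that this is XPath, not CSS.
    return f'xpath=//input[@id="{element_id}"]'


def fill_search_form(page, term):
    """Tick the "Assunto" checkbox and type the search term, using XPath.

    `page` is a Playwright page that has already navigated to the search form.
    """
    page.locator(xpath_for_input("buscaAssunto")).check()
    page.locator(xpath_for_input("textoBusca")).fill(term)
```

In the example above, `fill_search_form(page, 'grafos')` would replace the two `page.locator(...)` lines.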
Playwright: https://playwright.dev/python/docs/intro
Answered By - 5rod