Thursday, January 25, 2024

[FIXED] Trying to make a POST request using Scrapy

Labels: post, python, scrapy, web-scraping

Issue

I'm a beginner at web scraping. My goal is to scrape the site 'https://buscatextual.cnpq.br/buscatextual/busca.do'. It is a scientific site, so I need to check the box "Assunto(Título ou palavra chave da produção)" (subject: title or keyword of the publication) and type the word "grafos" into the main input of the page. How can I do this using Scrapy? I have been trying with the following code, but I got several errors, and I have never dealt with POST requests before.

import scrapy

class LattesSpider(scrapy.Spider):
    name = 'lattesspider'
    login_url = 'https://buscatextual.cnpq.br/buscatextual/busca.do'
    start_urls = [login_url]

    
    def parse(self, response):
        data = {'filtros.buscaAssunto': True,
                'textoBusca': 'grafos'}
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_profiles)
    
    def parse_profiles(self, response):
        yield {'url': response.url,
               'nome': response.xpath("//a/text()").get()
               }

Solution

If Scrapy feels difficult and unfamiliar, and you find it hard to locate certain elements on the page, I suggest using Playwright instead. It is the newer of the two libraries, and it drives a real browser. The reason I suggest it is that it makes it very easy to locate buttons and checkboxes and to fill in text boxes, using either CSS selectors or XPath. I have put a link to the installation instructions and documentation at the bottom of my answer.

Here's some example code I pulled together that should work:

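from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=False opens a visible browser window, which helps while testing
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://buscatextual.cnpq.br/buscatextual/busca.do')
    # Tick the "Assunto" checkbox, then type the search term into the main input
    page.locator('input#buscaAssunto').check()
    page.locator('input#textoBusca').fill('grafos')
    # Pause for 5 seconds so the result is visible before the browser closes
    page.wait_for_timeout(5000)
    browser.close()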

Here I used CSS selectors, but you could also use XPath; Playwright accepts both. Note that I launched Chromium here; each browser needs a slightly different launch line.
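As a rough sketch of the XPath alternative mentioned above, the same two elements could be located with Playwright's `xpath=` selector prefix. The helper names `xpath_for_input_id`, `fill_search_form`, and `main` are hypothetical, and the element ids are simply carried over from the CSS selectors in the example; they are assumed to match the live page.

```python
def xpath_for_input_id(element_id):
    """Build an XPath selector equivalent to the CSS selector input#<element_id>."""
    return f'xpath=//input[@id="{element_id}"]'


def fill_search_form(page, term):
    """Check the subject box and fill the search field using XPath selectors."""
    # Assumption: the ids below are the ones used in the CSS example.
    page.locator(xpath_for_input_id('buscaAssunto')).check()
    page.locator(xpath_for_input_id('textoBusca')).fill(term)


def main():
    # Imported here so the helpers above can be reused without launching a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto('https://buscatextual.cnpq.br/buscatextual/busca.do')
        fill_search_form(page, 'grafos')
        browser.close()
```

Calling `main()` runs the same steps as the CSS version; only the selector strings differ.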

Chromium: browser = p.chromium.launch()
Chrome: browser = p.chromium.launch(channel="chrome")
Msedge: browser = p.chromium.launch(channel="msedge")
Firefox: browser = p.firefox.launch()
Webkit: browser = p.webkit.launch()

Replace the launch line in the example with the one for the browser you want to use, and the rest of the code should work unchanged.
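If you switch browsers often, the launch lines above could be wrapped in a small lookup. This is only a sketch: `BROWSERS` and `launch_browser` are hypothetical names, not part of Playwright, and `p` is assumed to be the object yielded by `sync_playwright()`.

```python
# Maps a friendly browser name to (Playwright engine attribute, launch keyword arguments).
BROWSERS = {
    'chromium': ('chromium', {}),
    'chrome': ('chromium', {'channel': 'chrome'}),
    'msedge': ('chromium', {'channel': 'msedge'}),
    'firefox': ('firefox', {}),
    'webkit': ('webkit', {}),
}


def launch_browser(p, name='chromium', **launch_kwargs):
    """Launch the named browser from a Playwright instance `p`."""
    if name not in BROWSERS:
        raise ValueError(f'unknown browser: {name!r}')
    engine, options = BROWSERS[name]
    # Caller-supplied arguments (e.g. headless=False) override the defaults.
    return getattr(p, engine).launch(**{**options, **launch_kwargs})
```

Inside the `with sync_playwright() as p:` block, `browser = launch_browser(p, 'firefox', headless=False)` would then take the place of the hard-coded launch line.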

Note that I also passed headless=False, which let me watch the browser open, check the box, and fill in the text field (mainly for testing). Remove that argument to run in headless mode, which is the default. I also included page.wait_for_timeout(5000) to wait 5 seconds before closing the browser.
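If you would rather stay with Scrapy, one likely source of errors in the question's code is the form data: Scrapy's FormRequest expects the values in formdata to be strings, and True is a boolean. Below is a minimal sketch of a conversion helper. The name `to_formdata` is hypothetical, and the value the server actually expects for the checkbox ('true' here, but possibly something else such as 'on') is an assumption that you would need to confirm in the form's HTML or in the browser's network tab.

```python
def to_formdata(fields):
    """Return a copy of `fields` with every value converted to a string."""
    formdata = {}
    for key, value in fields.items():
        if isinstance(value, bool):
            # Assumption: the server accepts 'true'/'false' for the checkbox.
            formdata[key] = 'true' if value else 'false'
        else:
            formdata[key] = str(value)
    return formdata


data = to_formdata({'filtros.buscaAssunto': True, 'textoBusca': 'grafos'})
print(data)  # {'filtros.buscaAssunto': 'true', 'textoBusca': 'grafos'}
```

In the spider's parse method you would then pass formdata=data to scrapy.FormRequest, as in the question. scrapy.FormRequest.from_response(response, formdata=data, callback=...) may also be worth trying, since it pre-populates the form's other fields from the downloaded page.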

Playwright: https://playwright.dev/python/docs/intro

Answered By - 5rod

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
