Issue
I am trying to scrape the prices of various pharmacies on the site https://www.medizinfuchs.de for a specific drug (e.g., https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html).
The page uses infinite scrolling that is triggered via a load-more button. Using the network analysis in the developer tools, I can see that the page sends a POST request to https://www.medizinfuchs.de/ajax_apotheken when I click this button.
If I copy this POST request as cURL and then convert it with curl2scrapy, I get the following code:
from scrapy import Request

url = 'https://www.medizinfuchs.de/ajax_apotheken'
request = Request(
    url=url,
    method='POST',
    dont_filter=True,
)
fetch(request)
The network analysis shows that the response to this POST request is HTML (analogous to the original page), but it lists all pharmacies with their prices (not just the roughly ten pharmacies shown on the page before I click the load-more button).
My somewhat embarrassing question - I'm still an absolute beginner - is how I integrate this POST request into my existing Python code so that all pharmacies are scraped and I get the price information for every one of them. My current Python code is:
import scrapy


class MedizinfuchsSpider(scrapy.Spider):
    name = "medizinfuchs"

    start_urls = [
        'https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html'
    ]

    def parse(self, response):
        for apotheke in response.css('div.apotheke'):
            yield {
                'name': apotheke.css('a.name::text').getall(),
                'single': apotheke.css('div.single::text').getall(),
                'shipping': apotheke.css('div.shipping::text').getall(),
            }
I would be super grateful for support :-).
Christian
Solution
If you are open to a suggestion using only requests and BeautifulSoup, you can:
- use a requests.Session() to store cookies and perform a first call on the URL with s.get(url). This sets the product_history cookie, whose value is the full product id (a minimal sketch of this step follows the list)
- use requests.post to call the API you spotted in the Chrome dev tools, passing that id in the form data
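Here is a minimal sketch of just that first step, assuming (as described above) that the cookie is named product_history and holds the full product id:
import requests

# a first GET in a session sets the product_history cookie,
# whose value is the full product id used by ajax_apotheken
s = requests.Session()
s.get("https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html")
print(s.cookies.get_dict()["product_history"])  # e.g. 1104114918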
The following full example iterates over a list of products and performs the flow described above:
import requests
from bs4 import BeautifulSoup
import pandas as pd

products = [
    "https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html",
    "https://www.medizinfuchs.de/preisvergleich/alcohol-pads-b.braun-100-st-b.-braun-melsungen-ag-pzn-629703.html"
]

results = []

for url in products:
    # get the product id from the product_history cookie set by the first GET
    s = requests.Session()
    r = s.get(url)
    id = s.cookies.get_dict()["product_history"]

    soup = BeautifulSoup(r.text, "html.parser")
    pzn = soup.find("li", {"class": "pzn"}).text[5:]
    print(f'pzn: {pzn}')

    # call the ajax_apotheken endpoint with the id in the form data
    r = requests.post("https://www.medizinfuchs.de/ajax_apotheken",
                      data={
                          "params[ppn]": id,
                          "params[entry_order]": "single_asc",
                          "params[filter][rating]": "",
                          "params[filter][country]": 7,
                          "params[filter][favorit]": 0,
                          "params[filter][products_from][de]": 0,
                          "params[filter][products_from][at]": 0,
                          "params[filter][send]": 1,
                          "params[limit]": 300,
                          "params[merkzettel_sel]": "",
                          "params[merkzettel_reload]": "",
                          "params[apo_id]": ""
                      })
    soup = BeautifulSoup(r.text, "html.parser")
    data = [
        {
            "name": t.find("a").text.strip(),
            "single": t.find("div", {"class": "single"}).text.strip(),
            "shipping": t.find("div", {"class": "shipping"}).text.strip().replace("\t", "").replace("\n", " "),
        }
        for t in soup.findAll("div", {"class": "apotheke"})
    ]
    for t in data:
        results.append({
            "pzn": pzn,
            **t
        })

df = pd.DataFrame(results)
df.to_csv('result.csv', index=False)
print(df)
repl.it: https://replit.com/@bertrandmartel/ScrapeMedicinFuchs
Note that in the solution above, I'm only using requests.Session() in order to get the product_history cookie; the session is not needed for the subsequent calls. This way I get the product id directly without having to run a regex over the HTML/JS. Maybe there is a better way to get the product id, but we can't take it from the URL, since the URL only contains part of the id (4114918 instead of 1104114918), unless you are willing to hardcode the 110 prefix.
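As a follow-up for the original Scrapy setup, here is a rough, unverified sketch of how the same POST could be yielded from the existing parse callback. It reuses the form parameters above and, instead of reading the cookie, builds the product id by prepending the hardcoded 110 prefix to the PZN taken from the page (an assumption based on the note above):
import scrapy


class MedizinfuchsSpider(scrapy.Spider):
    name = "medizinfuchs"

    start_urls = [
        'https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html'
    ]

    def parse(self, response):
        # mirrors the BeautifulSoup extraction above; assumes the full product id is "110" + PZN
        pzn = response.css('li.pzn::text').get()[5:]
        yield scrapy.FormRequest(
            'https://www.medizinfuchs.de/ajax_apotheken',
            formdata={
                'params[ppn]': '110' + pzn,
                'params[entry_order]': 'single_asc',
                'params[filter][rating]': '',
                'params[filter][country]': '7',
                'params[filter][favorit]': '0',
                'params[filter][products_from][de]': '0',
                'params[filter][products_from][at]': '0',
                'params[filter][send]': '1',
                'params[limit]': '300',
                'params[merkzettel_sel]': '',
                'params[merkzettel_reload]': '',
                'params[apo_id]': '',
            },
            callback=self.parse_pharmacies,
            cb_kwargs={'pzn': pzn},
        )

    def parse_pharmacies(self, response, pzn):
        # the POST response is an HTML fragment listing all pharmacies
        for apotheke in response.css('div.apotheke'):
            yield {
                'pzn': pzn,
                'name': apotheke.css('a.name::text').get(default='').strip(),
                'single': apotheke.css('div.single::text').get(default='').strip(),
                'shipping': apotheke.css('div.shipping::text').get(default='').strip(),
            }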
Answered By - Bertrand Martel