Issue
I am trying to scrape the prices of various pharmacies on the site https://www.medizinfuchs.de for a specific drug (e.g., https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html).
The page uses infinite scrolling that is triggered via a load-more button. Using the network analysis in the developer tools, I can see that the page sends a POST request to https://www.medizinfuchs.de/ajax_apotheken when I click this button.
If I copy this POST request as cURL and then convert it with curl2scrapy, I get the following code:
from scrapy import Request

url = 'https://www.medizinfuchs.de/ajax_apotheken'
request = Request(
    url=url,
    method='POST',
    dont_filter=True,
)
fetch(request)
The network analysis shows that the response to this POST request is HTML (analogous to the original page), but it lists all pharmacies with their prices (not just the roughly ten pharmacies shown on the page before I click the load-more button).
My somewhat embarrassing question - I'm still an absolute beginner - is how I integrate this POST request into my existing Python code so that all pharmacies are scraped and I get the price information for every one of them. My current Python code is:
import scrapy


class MedizinfuchsSpider(scrapy.Spider):
    name = "medizinfuchs"

    start_urls = [
        'https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html'
    ]

    def parse(self, response):
        for apotheke in response.css('div.apotheke'):
            yield {
                'name': apotheke.css('a.name::text').getall(),
                'single': apotheke.css('div.single::text').getall(),
                'shipping': apotheke.css('div.shipping::text').getall(),
            }
I would be super grateful for support :-).
Christian
Solution
If you are open to a suggestion using only requests and BeautifulSoup, you can:
- use a requests.Session() to store cookies and perform a first call on the URL with s.get(url). This sets the product_history cookie, whose value is the full product id (a minimal sketch of this step follows the list)
- use requests.post to call the API you spotted in the Chrome dev tools, passing that id in the form data
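Here is a minimal sketch of just that first step, assuming (as described above) that the cookie is named product_history and holds the full product id:
import requests

# a first GET in a session sets the product_history cookie,
# whose value is the full product id used by ajax_apotheken
s = requests.Session()
s.get("https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html")
print(s.cookies.get_dict()["product_history"])  # e.g. 1104114918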
The following full example iterates over a list of products and performs the flow described above:
import requests
from bs4 import BeautifulSoup
import pandas as pd

products = [
    "https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html",
    "https://www.medizinfuchs.de/preisvergleich/alcohol-pads-b.braun-100-st-b.-braun-melsungen-ag-pzn-629703.html"
]

results = []

for url in products:
    # get the product id from the product_history cookie set by the first GET
    s = requests.Session()
    r = s.get(url)
    id = s.cookies.get_dict()["product_history"]

    soup = BeautifulSoup(r.text, "html.parser")
    pzn = soup.find("li", {"class": "pzn"}).text[5:]
    print(f'pzn: {pzn}')

    # call the ajax_apotheken endpoint with the id in the form data
    r = requests.post("https://www.medizinfuchs.de/ajax_apotheken",
                      data={
                          "params[ppn]": id,
                          "params[entry_order]": "single_asc",
                          "params[filter][rating]": "",
                          "params[filter][country]": 7,
                          "params[filter][favorit]": 0,
                          "params[filter][products_from][de]": 0,
                          "params[filter][products_from][at]": 0,
                          "params[filter][send]": 1,
                          "params[limit]": 300,
                          "params[merkzettel_sel]": "",
                          "params[merkzettel_reload]": "",
                          "params[apo_id]": ""
                      })
    soup = BeautifulSoup(r.text, "html.parser")
    data = [
        {
            "name": t.find("a").text.strip(),
            "single": t.find("div", {"class": "single"}).text.strip(),
            "shipping": t.find("div", {"class": "shipping"}).text.strip().replace("\t", "").replace("\n", " "),
        }
        for t in soup.findAll("div", {"class": "apotheke"})
    ]
    for t in data:
        results.append({
            "pzn": pzn,
            **t
        })

df = pd.DataFrame(results)
df.to_csv('result.csv', index=False)
print(df)
repl.it: https://replit.com/@bertrandmartel/ScrapeMedicinFuchs
Note that in the solution above, I'm only using requests.Session() in order to get the product_history cookie; the session is not needed for the subsequent calls. This way I get the product id directly without having to run a regex over the HTML/JS. Maybe there is a better way to get the product id, but we can't take it from the URL, since the URL only contains part of the id (4114918 instead of 1104114918), unless you are willing to hardcode the 110 prefix.
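As a follow-up for the original Scrapy setup, here is a rough, unverified sketch of how the same POST could be yielded from the existing parse callback. It reuses the form parameters above and, instead of reading the cookie, builds the product id by prepending the hardcoded 110 prefix to the PZN taken from the page (an assumption based on the note above):
import scrapy


class MedizinfuchsSpider(scrapy.Spider):
    name = "medizinfuchs"

    start_urls = [
        'https://www.medizinfuchs.de/preisvergleich/aspirin-complex-beutel-20-st-bayer-vital-gmbh-pzn-4114918.html'
    ]

    def parse(self, response):
        # mirrors the BeautifulSoup extraction above; assumes the full product id is "110" + PZN
        pzn = response.css('li.pzn::text').get()[5:]
        yield scrapy.FormRequest(
            'https://www.medizinfuchs.de/ajax_apotheken',
            formdata={
                'params[ppn]': '110' + pzn,
                'params[entry_order]': 'single_asc',
                'params[filter][rating]': '',
                'params[filter][country]': '7',
                'params[filter][favorit]': '0',
                'params[filter][products_from][de]': '0',
                'params[filter][products_from][at]': '0',
                'params[filter][send]': '1',
                'params[limit]': '300',
                'params[merkzettel_sel]': '',
                'params[merkzettel_reload]': '',
                'params[apo_id]': '',
            },
            callback=self.parse_pharmacies,
            cb_kwargs={'pzn': pzn},
        )

    def parse_pharmacies(self, response, pzn):
        # the POST response is an HTML fragment listing all pharmacies
        for apotheke in response.css('div.apotheke'):
            yield {
                'pzn': pzn,
                'name': apotheke.css('a.name::text').get(default='').strip(),
                'single': apotheke.css('div.single::text').get(default='').strip(),
                'shipping': apotheke.css('div.shipping::text').get(default='').strip(),
            }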
Answered By - Bertrand Martel