Issue
I am trying to scrape data (medicine names) from this link: https://www.1mg.com/drugs-all-medicines. The link has 841 pages with 30 items per page, but my code is somehow only picking up 20 items per page. I don't know what is causing it or how to solve it. This is the code I am using:
import requests
import json
import io
from bs4 import BeautifulSoup

medicine_name = []
f = io.open('data.txt', 'a', encoding='utf-8')

for i in range(1, 842):
    url = "https://www.1mg.com/drugs-all-medicines?page=" + str(i)
    r = requests.get(url)
    HTMLcontent = r.content
    soup = BeautifulSoup(HTMLcontent, 'html.parser')
    json_data = json.loads(
        soup.select_one("script").string
    )
    for data in json_data['itemListElement']:
        medicine_name.append(data['name'])
        f.write('\n' + data['name'])
    print("parsed --> " + str(len(medicine_name)) + " from page No. --> " + str(i))
    medicine_name = []

f.close()
I am getting this output:
PS E:\Practice\Python\1mg Scrapper> & D:/Python396/python.exe "e:/Practice/Python/1mg Scrapper/tool.py"
parsed --> 20 from page No. --> 1
parsed --> 20 from page No. --> 2
parsed --> 20 from page No. --> 3
parsed --> 20 from page No. --> 4
parsed --> 20 from page No. --> 5
parsed --> 20 from page No. --> 6
parsed --> 20 from page No. --> 7
parsed --> 20 from page No. --> 8
parsed --> 20 from page No. --> 9
...................................
<-----------Upto------------------>
...................................
parsed --> 20 from page No. --> 837
parsed --> 20 from page No. --> 838
parsed --> 20 from page No. --> 839
parsed --> 20 from page No. --> 840
parsed --> 20 from page No. --> 841
I am expecting output something like:
parsed --> 30 from page No. --> xxx
Solution
Try specifying a User-Agent HTTP header. Without it, the server returns a different type of page:
import json
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

for i in range(1, 842):
    url = "https://www.1mg.com/drugs-all-medicines?page=" + str(i)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    json_data = json.loads(
        soup.select_one('script[type="application/ld+json"]').string
    )
    for data in json_data["itemListElement"]:
        print(data["name"])
Prints 30 products per page:
Ascoril D Plus Syrup Sugar Free
Augmentin 625 Duo Tablet
Allegra 180mg Tablet
Azithral 500 Tablet
Ascoril LS Syrup
Avil 25 Tablet
Allegra 120mg Tablet
...and so on.
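For completeness, the fix can be folded back into the original file-writing script. The sketch below is an assumption about how you might combine the two (the `parse_page` helper name is mine, not from either snippet); it also uses the more specific `script[type="application/ld+json"]` selector from the answer so an unrelated first `<script>` tag on the page cannot break the JSON parsing:

```python
import json
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

def parse_page(html):
    # Extract all medicine names from the JSON-LD <script> block of one page.
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one('script[type="application/ld+json"]')
    json_data = json.loads(script.string)
    return [item["name"] for item in json_data["itemListElement"]]

if __name__ == "__main__":
    with open("data.txt", "a", encoding="utf-8") as f:
        for i in range(1, 842):
            url = "https://www.1mg.com/drugs-all-medicines?page=" + str(i)
            r = requests.get(url, headers=HEADERS)  # header is the actual fix
            names = parse_page(r.content)
            for name in names:
                f.write("\n" + name)
            print("parsed --> " + str(len(names)) + " from page No. --> " + str(i))
```

Keeping the parsing in a separate function also lets you test it offline against saved HTML, without hitting the site.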
Answered By - Andrej Kesely