Issue
I am trying to download multiple PDFs from the CAG website (https://cag.gov.in/en/state-accounts-report?defuat_state_id=64). I am using the following code:
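import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://cag.gov.in/en/state-accounts-report?defuat_state_id=64'
folder_location = 'pdfs'  # assumed output folder; not defined in the original snippet
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Print every link on the page whose href ends in .pdf
for link in soup.select("a[href$='.pdf']"):
    print(link)

# Download each of those links into folder_location
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)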
This downloads every PDF on the whole page, but I only want the PDFs under the 'Monthly Key Indicators' tab. Please suggest the changes needed in the code to do that.
Solution
You could try narrowing down the tab from which the links are selected. The tab id can be found using
tabId = soup.find(
    lambda t: t.name == 'a' and t.get('href') and
    t.get('href').startswith('#tab') and  # just in case
    'Monthly Key Indicators' == t.get_text(strip=True)
).get('href')
(Or, if it's always the same id, you can just set tabId = "#tab-360".) Then, you can just change your selection to
soup.select(f"{tabId} a[href$='.pdf']")
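To see how the tab lookup and the scoped selector fit together, here is a minimal sketch run against a small inline page. The markup in SAMPLE_HTML and the helper name find_tab_id are made up for illustration; the real CAG page may be structured differently, so treat this as a way to check the idea rather than a description of that site. Unlike the one-liner above, the helper returns None when no tab matches instead of raising an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page's structure may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="#tab-1">Annual Reports</a></li>
  <li><a href="#tab-2">Monthly Key Indicators</a></li>
</ul>
<div id="tab-1"><a href="/files/annual.pdf">Download</a></div>
<div id="tab-2">
  <a href="/files/april.pdf">View</a>
  <a href="/files/april.pdf">Download</a>
  <a href="/files/may.pdf">Download</a>
</div>
"""


def find_tab_id(soup, tab_title):
    """Return the '#...' href of the tab link whose text equals tab_title, or None."""
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("#tab") and a.get_text(strip=True) == tab_title:
            return a["href"]
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tab_id = find_tab_id(soup, "Monthly Key Indicators")

# The tab's href doubles as a CSS id selector, so it can prefix the PDF selector.
hrefs = [a["href"] for a in soup.select(f"{tab_id} a[href$='.pdf']")]
print(tab_id, hrefs)
```

Only links inside the matching tab's container are returned; the PDF in the other tab is left out.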
But aren't you downloading the same file 3x with each report? You could alter your for-loop to only download from the links with "Download" as text:
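pdfLinks = soup.select(f"{tabId} a[href$='.pdf']")
pdfLinks = [pl for pl in pdfLinks if pl.get_text(strip=True) == 'Download']
for link in pdfLinks:
    # same download body as in the question
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)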
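If the link text turns out not to be a reliable filter, another way to avoid fetching the same file several times is to de-duplicate on the resolved URL before downloading. The sketch below is one possible approach, not part of the original answer: plan_downloads is a hypothetical helper, and the example.com base URL and the pdfs folder are placeholders.

```python
import os
from urllib.parse import urljoin, urlparse


def plan_downloads(hrefs, base_url, folder):
    """Pair each distinct absolute URL with a local path, keeping first-seen order."""
    seen = set()
    plan = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute in seen:
            continue  # the same file was already linked earlier
        seen.add(absolute)
        # Take the file name from the URL path only, ignoring any query string.
        name = os.path.basename(urlparse(absolute).path)
        plan.append((absolute, os.path.join(folder, name)))
    return plan


# Placeholder inputs: three links, two of which point at the same file.
plan = plan_downloads(
    ["/files/april.pdf", "/files/april.pdf", "/files/may.pdf"],
    "https://example.com/en/reports?x=1",
    "pdfs",
)
for pdf_url, path in plan:
    print(pdf_url, "->", path)
```

Each (URL, path) pair can then go through the same requests.get(...).content write used in the question, so every file is fetched once.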
Answered By - PerpetuallyConfused