Issue
Is it possible to scrape Google for PDF files? For example, to download all ".pdf" files within a certain number of search results for a given term. Web scraping is pretty new to me, but I've been using beautifulsoup4, if it's possible with that.
Thanks in advance.
Solution
Here's what I would do.
Google allows you to search by file type by adding filetype:[your file type extension] to the query, for example filetype:pdf.
You can bypass the Google search page by using a direct URL and changing the query: https://www.google.com/search?q=these+are+keywords+filetype%3Apdf
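As a rough sketch of the direct-URL idea, the query string can be built with the standard library instead of being typed by hand. The function name build_search_url and its parameters are made up for illustration; only the URL shape comes from the example above.

```python
from urllib.parse import urlencode


def build_search_url(keywords, filetype="pdf"):
    """Build a Google search URL restricted to one file type.

    `build_search_url` is a hypothetical helper, not part of any library.
    """
    # Append the filetype: operator to the plain keywords.
    query = f"{keywords} filetype:{filetype}"
    # urlencode turns spaces into "+" and ":" into "%3A".
    return "https://www.google.com/search?" + urlencode({"q": query})


print(build_search_url("these are keywords"))
# prints: https://www.google.com/search?q=these+are+keywords+filetype%3Apdf
```

Letting urlencode do the escaping avoids mistakes when the keywords contain characters such as "&" or "#".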
You can use BeautifulSoup to find the URL of each search result (see the relevant question's answer). The most important part is that each search result has the class "g", so you can get the URL from each element that has that class.
From there, you can use BeautifulSoup to find the direct URL to the PDF. The URL will be in an "a" tag, in its href attribute (see the relevant question's answer).
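Here is a minimal sketch of that parsing step, assuming the results page has already been fetched into a string. It runs on a small hand-written HTML sample rather than a live Google page, since the real markup may differ and the class "g" may no longer match what Google serves. The function name extract_pdf_links and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_pdf_links(html):
    """Return the PDF links found in elements that have the class "g".

    `extract_pdf_links` is a hypothetical helper; the class name "g"
    is taken from the answer above and is an assumption about the page.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for result in soup.find_all(class_="g"):
        # The first <a> tag with an href inside the result holds the URL.
        anchor = result.find("a", href=True)
        if anchor is None:
            continue
        href = anchor["href"]
        # Keep only links that point at a .pdf file, without duplicates.
        if href.lower().endswith(".pdf") and href not in links:
            links.append(href)
    return links


# Hand-written stand-in for a results page (not real Google markup).
sample_html = """
<div class="g"><a href="https://example.com/report.pdf">Report</a></div>
<div class="g"><a href="https://example.com/page.html">Page</a></div>
<div class="other"><a href="https://example.com/ignored.pdf">Ignored</a></div>
"""

print(extract_pdf_links(sample_html))
# prints: ['https://example.com/report.pdf']
```

Each returned URL could then be downloaded with whatever HTTP client you already use.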
I'm not an expert, but maybe this will be enough to set you on your way. Others may chime in with better methods.
Answered By - TheKingElessar