Issue
I have Selenium opening many PDFs for me from Google Search (using f"https://www.google.com/search?q=filetype:pdf {search_term}" and then clicking the first link).
I want to know which pages contain my keyword WITHOUT downloading the PDF first. I believe I can use
Ctrl+F --> keyword --> {scrape page number} --> Tab (next match) --> {scrape page number} --> ... --> switch to next PDF
How can I accomplish the {scrape page number} part?
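For reference, a rough sketch of the Selenium setup I'm describing (assuming Selenium 4 with Chrome; the CSS selector for the first result is only a guess and may need adjusting):

from selenium import webdriver
from selenium.webdriver.common.by import By

search_term = "annual report 2021"  # example search phrase
driver = webdriver.Chrome()

# Search Google for PDFs matching the term
driver.get(f"https://www.google.com/search?q=filetype:pdf {search_term}")

# Click the first organic result (selector is approximate and may change)
driver.find_element(By.CSS_SELECTOR, "div#search a").click()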
Context
For each PDF I need to grab these numbers as a list, a Pandas DataFrame, or anything else I can feed into camelot.read_pdf() later.
The idea is that once I have these page numbers, I can selectively download just those pages of the PDFs and save on storage, memory and network bandwidth, rather than downloading and parsing the entire PDF.
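To illustrate the downstream step: assuming I already had the matching page numbers in a list (the list below is hypothetical), camelot accepts them as a comma-separated string in its pages argument:

import camelot

# Hypothetical output of the page-number scraping step
pages_with_keyword = [3, 17, 42]

# camelot.read_pdf expects pages as a string such as "3,17,42" or "all"
tables = camelot.read_pdf(
    "report.pdf",
    pages=",".join(str(p) for p in pages_with_keyword),
)
print(tables.n)  # number of tables detected on those pages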
Using BeautifulSoup
Chrome's PDF viewer shows a small gray box at the top with the current page number and the total number of pages, along with the option to jump around the PDF:
<input data-element-focusable="true" id="pageselector" class="c0191 c0189" type="text" value="151" title="Page number (Ctrl+Alt+G)" aria-label="Go to any page between 1 and 216">
The value attribute of this input tag contains the number I am looking for.
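If that markup were exposed as ordinary page HTML (the answer below explains why it is not in Chrome's built-in viewer), pulling the value out with BeautifulSoup would be straightforward, e.g.:

from bs4 import BeautifulSoup

# Example markup copied from the viewer's page selector box
html = '''<input data-element-focusable="true" id="pageselector"
    class="c0191 c0189" type="text" value="151"
    title="Page number (Ctrl+Alt+G)"
    aria-label="Go to any page between 1 and 216">'''

soup = BeautifulSoup(html, "html.parser")
current_page = int(soup.find("input", id="pageselector")["value"])
print(current_page)  # 151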
Other SO answers
I'm aware that reading PDFs programmatically is challenging, and I'm currently using this function (finding on which page a search string is located in a pdf document using python) to scrape the page numbers after downloading the whole PDF first. But Chrome searches PDFs quickly with Ctrl+F, which gives me hope that I can use browser functionality to collect this data, and I have already seen the data in the box at the top.
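For comparison, here is a minimal sketch of that download-first approach (assuming PyPDF2; an illustration only, not the exact code from the linked answer):

from PyPDF2 import PdfReader

def pages_containing(pdf_path, keyword):
    """Return 1-based page numbers whose extracted text contains keyword."""
    reader = PdfReader(pdf_path)
    hits = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if keyword.lower() in text.lower():
            hits.append(number)
    return hits

print(pages_containing("report.pdf", "revenue"))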
How do you get the page numbers in a PDF where a keyword is present?
Solution
Your question is built on several misconceptions, not helped by the way modern browsers obscure their workings.
Consider these points:
While viewing a 4096-page PDF I can disconnect from the web and still navigate end to end. (This is only possible because a PDF must be downloaded in FULL before viewing, searching, editing etc. can start; yes, there are viewers that display early pages, but most need 100% of the download first.)
I can add an annotation while the web address is still showing, but clearly I am not writing on the server copy. The downloaded file is converted to text and pixels using my local resources, so I have already paid the price of my own converted copy. Why keep repeating that cost over and over? Simply save it as my own searchable copy, which is far easier to grep offline.
It does not matter which browser extension you are using; they all hold the file somewhere in your file system. Note the difference here: the data says it is on the web, but the edit message shows otherwise. In this case the field is secured outside the browser; however, Ctrl+D + C gives me:
File: https://africau.edu/images/default/sample.pdf
Created: 3/1/2006 7:28:26 AM
Application: Rave (http://www.nevrona.com/rave)
PDF Producer: Nevrona Designs
PDF Version: 1.3
File Size: 2.96 KB (3,028 Bytes)
Number of Pages: 2
Page Size: 8.5 x 11.0 in (Letter)
Fonts: Helvetica (Type1; Ansi)
Mozilla's PDF.js is a different beast, so it may be more addressable, but just as you found you can use a hybrid approach in the index.htm of Chrome/Edge, you could equally do that offline.
So, on the basis that you have scraped a list of URLs, the simplest solution should be:
curl -o tmp.pdf URL && pdftotext tmp.pdf - | find "Keyword"
(use curl -O instead if you want to keep the remote filename)
You will need to adapt that a bit to show page and line numbers, but that is a different question or two:
https://stackoverflow.com/a/72440765/10802527
https://stackoverflow.com/a/72778117/10802527
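As a rough illustration of that adaptation only (assuming curl and Poppler's pdftotext are on the PATH): pdftotext writes a form feed character between pages, so splitting its output on \f recovers the page numbers:

import subprocess

def pages_with_keyword(url, keyword, tmp="tmp.pdf"):
    """Download a PDF and return 1-based page numbers containing keyword."""
    subprocess.run(["curl", "-s", "-o", tmp, url], check=True)
    # "-" sends the extracted text to stdout; pages are separated by \f
    text = subprocess.run(
        ["pdftotext", tmp, "-"], capture_output=True, text=True, check=True
    ).stdout
    return [
        number for number, page in enumerate(text.split("\f"), start=1)
        if keyword.lower() in page.lower()
    ]

print(pages_with_keyword("https://africau.edu/images/default/sample.pdf", "simple"))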
Answered By - K J