Issue
I have Selenium opening many PDFs for me from Google Search (using f"https://www.google.com/search?q=filetype:pdf {search_term}" and then clicking the first link).
I want to know which pages contain my keyword WITHOUT downloading the PDF first. I believe I can use
Ctrl+F --> keyword --> {scrape page number} --> Tab (next match) --> {scrape page number} --> ... --> switch to next PDF
How can I accomplish the {scrape page number} part?
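For reference, a rough sketch of the Selenium setup I'm describing (assuming Selenium 4 with Chrome; the CSS selector for the first result is only a guess and may need adjusting):

from selenium import webdriver
from selenium.webdriver.common.by import By

search_term = "annual report 2021"  # example search phrase
driver = webdriver.Chrome()

# Search Google for PDFs matching the term
driver.get(f"https://www.google.com/search?q=filetype:pdf {search_term}")

# Click the first organic result (selector is approximate and may change)
driver.find_element(By.CSS_SELECTOR, "div#search a").click()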
Context
For each PDF I need to grab these numbers as a list, a Pandas DataFrame, or anything else I can feed into camelot.read_pdf() later.
The idea is that once I have these page numbers, I can selectively download just those pages of the PDFs and save on storage, memory and network bandwidth, rather than downloading and parsing the entire PDF.
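To illustrate the downstream step: assuming I already had the matching page numbers in a list (the list below is hypothetical), camelot accepts them as a comma-separated string in its pages argument:

import camelot

# Hypothetical output of the page-number scraping step
pages_with_keyword = [3, 17, 42]

# camelot.read_pdf expects pages as a string such as "3,17,42" or "all"
tables = camelot.read_pdf(
    "report.pdf",
    pages=",".join(str(p) for p in pages_with_keyword),
)
print(tables.n)  # number of tables detected on those pages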
Using BeautifulSoup
Chrome's PDF viewer shows a small gray box at the top with the current page number and the total number of pages, along with the option to jump around the PDF:
<input data-element-focusable="true" id="pageselector" class="c0191 c0189" type="text" value="151" title="Page number (Ctrl+Alt+G)" aria-label="Go to any page between 1 and 216">
The value attribute of this input tag contains the number I am looking for.
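If that markup were exposed as ordinary page HTML (the answer below explains why it is not in Chrome's built-in viewer), pulling the value out with BeautifulSoup would be straightforward, e.g.:

from bs4 import BeautifulSoup

# Example markup copied from the viewer's page selector box
html = '''<input data-element-focusable="true" id="pageselector"
    class="c0191 c0189" type="text" value="151"
    title="Page number (Ctrl+Alt+G)"
    aria-label="Go to any page between 1 and 216">'''

soup = BeautifulSoup(html, "html.parser")
current_page = int(soup.find("input", id="pageselector")["value"])
print(current_page)  # 151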
Other SO answers
I'm aware that reading PDFs programmatically is challenging, and I'm currently using this function (finding on which page a search string is located in a pdf document using python) to scrape the page numbers after downloading the whole PDF first. But Chrome searches PDFs quickly with Ctrl+F, which gives me hope that I can use browser functionality to collect this data, and I have already seen the data in the box at the top.
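For comparison, here is a minimal sketch of that download-first approach (assuming PyPDF2; an illustration only, not the exact code from the linked answer):

from PyPDF2 import PdfReader

def pages_containing(pdf_path, keyword):
    """Return 1-based page numbers whose extracted text contains keyword."""
    reader = PdfReader(pdf_path)
    hits = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if keyword.lower() in text.lower():
            hits.append(number)
    return hits

print(pages_containing("report.pdf", "revenue"))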
How do you get the page numbers in a PDF where a keyword is present?
Solution
Your question is built on several misconceptions, not helped by the way modern browsers obscure their workings.
Consider these points:
While viewing a 4096-page PDF I can disconnect from the web and still navigate end to end. (This is only possible because a PDF must be downloaded in FULL before viewing, searching, editing etc. can start; yes, there are viewers that display early pages, but most need 100% of the download first.)
I can add an annotation while the web address is still showing, but clearly I am not writing on the server copy. The downloaded file is converted to text and pixels using my local resources, so I have already paid the price of my own converted copy. Why keep repeating that cost over and over? Simply save it as my own searchable copy, which is far easier to grep offline.
It does not matter which browser extension you are using; they all hold the file somewhere in your file system. Note the difference here: the data says it is on the web, but the edit message shows otherwise. In this case the field is secured outside the browser; however, Ctrl+D + C gives me:
File: https://africau.edu/images/default/sample.pdf
Created: 3/1/2006 7:28:26 AM
Application: Rave (http://www.nevrona.com/rave)
PDF Producer: Nevrona Designs
PDF Version: 1.3
File Size: 2.96 KB (3,028 Bytes)
Number of Pages: 2
Page Size: 8.5 x 11.0 in (Letter)
Fonts: Helvetica (Type1; Ansi)
Mozilla's PDF.js is a different beast, so it may be more addressable, but just as you found you can use a hybrid approach in the index.htm of Chrome/Edge, you could equally do that offline.
So, on the basis that you have scraped a list of URLs, the simplest solution should be:
curl -o tmp.pdf URL && pdftotext tmp.pdf - | find "Keyword"
(use curl -O instead if you want to keep the remote filename)
You will need to adapt that a bit to show page and line numbers, but that is a different question or two:
https://stackoverflow.com/a/72440765/10802527
https://stackoverflow.com/a/72778117/10802527
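As a rough illustration of that adaptation only (assuming curl and Poppler's pdftotext are on the PATH): pdftotext writes a form feed character between pages, so splitting its output on \f recovers the page numbers:

import subprocess

def pages_with_keyword(url, keyword, tmp="tmp.pdf"):
    """Download a PDF and return 1-based page numbers containing keyword."""
    subprocess.run(["curl", "-s", "-o", tmp, url], check=True)
    # "-" sends the extracted text to stdout; pages are separated by \f
    text = subprocess.run(
        ["pdftotext", tmp, "-"], capture_output=True, text=True, check=True
    ).stdout
    return [
        number for number, page in enumerate(text.split("\f"), start=1)
        if keyword.lower() in page.lower()
    ]

print(pages_with_keyword("https://africau.edu/images/default/sample.pdf", "simple"))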
Answered By - K J