Issue
I am new to extracting data from PDF files. I need help extracting the paragraph that contains a particular keyword. The problem is that the paragraph containing the keyword continues onto the next page, across the page separator \r\n\x0c, and all paragraphs are separated by the \r\n pattern.
The attachment below illustrates the issue.
Here is the link to the PDF: https://www.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/investor-relations/documents/annualreport-2022.pdf
I need to extract the content from "As these events…." to "economy is safe and secure."
Below is the relevant chunk of text as extracted from the PDF:
"\r\n\r\nAcross the globe, 2022 was another year of significant challenges: from a terrible war in Ukraine and growing geopolitical tensions -- particularly with China -- to a politically divided America. Almost all nations felt the effects of global economic uncertainty, including higher energy and food prices, mounting inflation rates and volatile markets, and, of course, COVID-19's lingering impacts. While all these experiences and associated turmoil have serious ramifications on our company, colleagues, clients and the countries in which we do business, their consequences on the world at large -- with the extreme suffering of the Ukrainian people and the potential restructuring of the global order -- are far more important.\r\nAs these events unfold, America remains divided within its borders, and its global leadership role is being challenged outside of its borders. Nevertheless, this is the moment when we should put aside our differences and work with other Western nations to come together in defense of democracy and essential\r\n2\r\n\r\n\x0cfreedoms, including free enterprise. During other times of great crisis, we have seen America, in partnership with other countries around the globe, unite for a common cause. This is that moment again, when our country needs to work across public and private sectors to lead while improving American competitiveness -- which also means re-establishing the American promise of providing equal access to opportunity for all. JPMorgan Chase, a company that historically has worked across borders and boundaries, will do its part to ensure the global economy is safe and secure.\r\n"
Please help me by providing the regex for extracting this content.
paragraph_pattern = re.compile(r'(?<=\r\n)([A-Z][^\n]+(\r\n\x0c(?!\n)[^\n]+)*)')
I tried this regex, but it only returns paragraphs that sit entirely on a single page. I need a pattern that starts after '\r\n' with an uppercase letter [A-Z] and ends with a period (.) followed by '\r\n'. It also needs to handle the page separator shown above.
The code that I am using is:
import re
import textract

def extract_paragraphs_from_pdf(pdf_path, keywords):
    paragraph_pattern = re.compile(r'(?<=\n)[A-Z].*(?:(?:\n|\r\n\x0c)(?!\n).*)*\.(?=\r\n)')
    extracted_paragraphs = []
    pdf_text = textract.process(pdf_path, method='pdftotext').decode('utf-8')
    temp = repr(pdf_text)
    #with open('sa.txt','w') as f:
    #    f.write(temp)
    matches = paragraph_pattern.findall(pdf_text)
    for paragraph in matches:
        if any(f' {keyword} ' in paragraph[0].lower() for keyword in keywords):
            extracted_paragraphs.append(paragraph[0].strip())
    return extracted_paragraphs

pdf_file_path = './jp.pdf'
search_keywords = ['partnership']
result_paragraphs = extract_paragraphs_from_pdf(pdf_file_path, search_keywords)
for res in result_paragraphs:
    print("-->>", res)
Solution
Note that instead of using paragraph[0].lower() you should use paragraph.lower(), or else you would get the first character of the string instead.
Using re.findall will return the capture group 1 values in a list.
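To see why indexing with [0] goes wrong, here is a minimal sketch of how re.findall shapes its result depending on the number of capture groups. The text and the simplified patterns are made up for illustration and are not the ones from the question.

```python
import re

# Illustrative text only; two short "paragraphs" separated by \r\n.
text = "\r\nAlpha one.\r\nBeta two.\r\n"

# No capture group: findall returns the full matches as strings.
no_group = re.findall(r'[A-Z][^.]*\.', text)

# One capture group: findall returns only what group 1 captured.
one_group = re.findall(r'\r?\n([A-Z][^.]*\.)', text)

# Two capture groups: findall returns one tuple per match.
two_groups = re.findall(r'\r?\n(([A-Z])[^.]*\.)', text)

print(no_group)         # ['Alpha one.', 'Beta two.']
print(one_group)        # ['Alpha one.', 'Beta two.']
print(two_groups)       # [('Alpha one.', 'A'), ('Beta two.', 'B')]

# Indexing a string result with [0] yields just its first character.
print(one_group[0][0])  # A
```

With zero or one group, every item is already a string, so paragraph.lower() is the right call. Only with two or more groups would paragraph[0] select a group.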
You could use a pattern with a capture group:
\r?\n([A-Z][^.]*(?:\.(?!\r?\n)[^.]*)*\.)(?=\r?\n)
The pattern in parts:
\r?\n - Match a newline
( - Capture group 1
  [A-Z][^.]* - Match an uppercase char A-Z and optional chars other than a dot
  (?:\.(?!\r?\n)[^.]*)* - Then repeat matching any character, and only dots that are not directly followed by a newline
  \. - Match a dot
) - Close group 1
(?=\r?\n) - Positive lookahead, assert a newline directly to the right
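As a quick check of the pattern, the sketch below runs it on a shortened, hand-made sample that imitates the structure of the extracted text (a paragraph, then a paragraph interrupted by a page number and the \x0c separator). The sample wording is abbreviated for illustration and is not the exact PDF output.

```python
import re

paragraph_pattern = re.compile(r'\r?\n([A-Z][^.]*(?:\.(?!\r?\n)[^.]*)*\.)(?=\r?\n)')

# Abbreviated stand-in for the extracted text, including the page break.
sample = (
    "\r\n\r\nAcross the globe, 2022 was hard. Almost all nations felt it.\r\n"
    "As these events unfold, America remains divided. We should defend essential\r\n"
    "2\r\n\r\n\x0cfreedoms, including free enterprise. We work in partnership "
    "with others so the global economy is safe and secure.\r\n"
)

matches = paragraph_pattern.findall(sample)

# Each match runs from an uppercase letter after a newline to the first
# dot that is directly followed by a newline.
for match in matches:
    print(repr(match))
```

Because [^.] also matches \r, \n and \x0c, the second match runs straight through the page break, so the page number and form feed remain inside the captured paragraph.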
The updated code:
import re
import textract

def extract_paragraphs_from_pdf(pdf_path, keywords):
    # Group 1 captures from an uppercase letter after a newline
    # up to a dot that is directly followed by a newline.
    paragraph_pattern = re.compile(r'\r?\n([A-Z][^.]*(?:\.(?!\r?\n)[^.]*)*\.)(?=\r?\n)')
    extracted_paragraphs = []
    pdf_text = textract.process(pdf_path, method='pdftotext').decode('utf-8')
    # With a single capture group, findall returns a list of strings.
    matches = paragraph_pattern.findall(pdf_text)
    for paragraph in matches:
        if any(f' {keyword} ' in paragraph.lower() for keyword in keywords):
            extracted_paragraphs.append(paragraph.strip())
    return extracted_paragraphs

pdf_file_path = './jp.pdf'
search_keywords = ["partnership"]
result_paragraphs = extract_paragraphs_from_pdf(pdf_file_path, search_keywords)
for res in result_paragraphs:
    print("-->>", res)
The result: the paragraph you want is the first match, but note that you get 3 results because you are filtering on "partnership".
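A matched paragraph that spans two pages still contains the page number and the \x0c separator from the extracted text. If that is unwanted, a small post-processing step could join the two halves. This is only a sketch: clean_paragraph and PAGE_BREAK are hypothetical names, and it assumes the page break always looks like the one in the question (an optional page number on its own line, blank lines, then \x0c).

```python
import re

# Assumed page-break shape: optional "\r\n<page number>", any newlines, then form feed.
PAGE_BREAK = re.compile(r'(?:\r?\n\d+)?(?:\r?\n)*\x0c')

def clean_paragraph(paragraph):
    """Join a paragraph that was split across pages into one line of text."""
    # Replace the page number and form feed with a single space.
    joined = PAGE_BREAK.sub(' ', paragraph)
    # Collapse any remaining line breaks inside the paragraph.
    return re.sub(r'\s*\r?\n\s*', ' ', joined).strip()

raw = "defense of democracy and essential\r\n2\r\n\r\n\x0cfreedoms, including free enterprise."
cleaned = clean_paragraph(raw)
print(cleaned)
# defense of democracy and essential freedoms, including free enterprise.
```

Applying it to each item before appending to extracted_paragraphs would keep the keyword filter unchanged. PDFs with running headers or footers would likely need a different PAGE_BREAK pattern.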
Answered By - The fourth bird