Issue
Update: when I write the results out (as the answerer said)
with open('results.txt', 'a', encoding="utf-8") as f:
    for line in results:
        f.write(line)
        f.write('\n')
all text pieces are appended correctly into result.txt. But when I go into cmd and run
magick -density 288 text:"result.txt" -alpha off -compress Group4 filename1.tif
it creates filename1.tif with all of the result.txt characters as a picture.
Original question: This code reads a folder of single-page .tif files and extracts textual data.
data = []
data1 = []
listOfPages = glob.glob(r"C:/Users/name/folder/*.tif")
for entry in listOfPages:
    if os.path.isfile(entry):
        filenames = entry
        data1.append(filenames)
        text1 = pytesseract.image_to_string(
            Image.open(entry), lang="en"
        )
        text = re.sub(r'\n',' ', text1)
        regex1 = re.compile(r'(www(i|ı)a\s+bbb(\:)?(\s+|\s+\.)?\s+(de(s|r(:)?))?)', flags = re.IGNORECASE)
        try:
            var1a = regex1.search(text)
            if var1a:
                var1 = var1a.group(1)
            else:
                var1 = None
        except:
            pass
        data.append([text, var1])
df0 = pd.DataFrame(data, columns =['raw_text', 'var1'])
df01= pd.DataFrame(data1,columns =['filename'])
df1 = pd.concat([df0, df01], axis=1)
I wanted to adjust it so that it works for multipage files, too. Therefore I am trying to convert each entry through Image.fromarray(), which raises the error shown after these two attempts:
text1 = pytesseract.image_to_string(np.array(entry), lang="en")
or
text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry)), lang="en")
TypeError: Cannot handle this data type: (1, 1), <U52
I use Python 3.9.7, pytesseract 0.3.8, numpy 1.21.2, and Pillow 8.3.2.
I read the question "PIL TypeError: Cannot handle this data type"
and came up with this:
text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.uint8)), lang="en")
which gives me the error: ValueError: invalid literal for int() with base 10: 'C:/Users/name/folder/test\\fff.tifC:/Users/name/folder/test\\ddddd.tif
which hints at the need to use float,
but when I do
text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.float)), lang="en")
I get
ValueError: could not convert string to float: 'C:/Users/name/folder/test\\ffff.tif
exiftool output
File Type : TIFF
File Type Extension : tif
MIME Type : image/tiff
Exif Byte Order : Little-endian (Intel, II)
Subfile Type : Full-resolution image
Image Width : 2472
Image Height : 3495
Bits Per Sample : 1
Compression : T6/Group 4 Fax
Photometric Interpretation : BlackIsZero
Thresholding : No dithering or halftoning
Fill Order : Reversed
Image Description : DN31
Camera Model Name : SCA
Strip Offsets : (Binary data 90 bytes, use -b option to extract)
Orientation : Horizontal (normal)
Samples Per Pixel : 1
Rows Per Strip : 213
Strip Byte Counts : (Binary data 73 bytes, use -b option to extract)
X Resolution : 300
Y Resolution : 300
Planar Configuration : Chunky
T6 Options : (none)
Resolution Unit : inches
Software : DACS Toolkit II
Modify Date : 1998:03:12 10:29:31
Image Size : 2472x3495
Megapixels : 8.6
Other suggestions on SO are:
im = Image.fromarray((img[0] * 255).astype(np.uint8))
If your image is greyscale, you need to pass PIL a 2-D array, i.e. the shape must be h,w not h,w,1.
i = Image.open('image.png').convert('RGB')
a = np.asarray(i, np.uint8)
print(a.shape)
b = abs(np.fft.rfft2(a,axes=(0,1)))
b = np.uint8(b)
j = Image.fromarray(b)
By default, it uses the last two axes: axes=(-2,-1). The third axis represents the RGB channels. Instead, it seems more plausible that one would want to perform an FFT over the spatial axes, axes=(0,1)
img = Image.fromarray(data[0][i].transpose(0,2).numpy().astype(np.uint8))
channel dimension will be the last (rather than the first)
Solution
I think you want something more like this to process multipage TIFFs. I have tried to replace nondescript variable names like data and var with more descriptive ones to make it more readable.
#!/usr/bin/env python3
import re
from glob import glob
import pytesseract
from PIL import Image, ImageSequence

def processPage(filename, pageNum, im):
    global results
    print(f'Processing: {filename}, page: {pageNum}')
    text = pytesseract.image_to_string(im, lang="eng")
    srchResult = regex.search(text)
    if srchResult is not None:
        results.append(srchResult.group(0))

# Compile regex just once, outside loop - it doesn't change
regex = re.compile(r'(\w+\s(Queen|President|Washington|London|security|architect)\s\w+)', flags = re.IGNORECASE)
results = []
# Get list of all filenames to be processed
filenames = glob('folder/*.tif')
# Iterate over all files
for filename in filenames:
    print(f'Processing file: {filename}')
    with Image.open(filename) as im:
        for pageNum, page in enumerate(ImageSequence.Iterator(im)):
            processPage(filename, pageNum, page)
print('\n'.join(results))
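As a side note on the errors in the question: entry is a filename string coming from glob, not pixel data. np.array(entry) therefore produces a string array (that is the <U52 in the TypeError), and entry * 255 repeats the path 255 times rather than scaling anything, which is why the ValueError shows paths glued together. A minimal sketch of that behaviour using only the standard library, with a made-up placeholder path and a repeat count of 2 to keep the output short:

```python
# Hypothetical path, standing in for one `entry` returned by glob
entry = "folder/test.tif"

# Multiplying a str by an int repeats the text; no pixel values are involved
repeated = entry * 2
print(repeated)

# Converting that text to a number is what fails, much like astype(np.uint8)
failed = False
try:
    int(repeated)
except ValueError as err:
    failed = True
    print(f"ValueError: {err}")
```

So the image has to be opened with Image.open(filename) first, as in the code above; no conversion through numpy should be needed.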
You don't need to do the rest of this stuff... it's just there so you can see how I generated TIFFs to test with...
I tested it by making 2 multi-page TIFFs from the Wikipedia entries for "Buckingham Palace" and "The White House" with ImageMagick by going to each of those pages, copying and saving the text as a.txt and then doing:
magick -density 288 text:"a.txt" -alpha off -compress Group4 WhiteHouse.tif
Sample Output
Processing file: folder/Buckingham.tif
Processing: folder/Buckingham.tif, page: 0
Processing: folder/Buckingham.tif, page: 1
Processing: folder/Buckingham.tif, page: 2
Processing file: folder/WhiteHouse.tif
Processing: folder/WhiteHouse.tif, page: 0
Processing: folder/WhiteHouse.tif, page: 1
Processing: folder/WhiteHouse.tif, page: 2
Processing: folder/WhiteHouse.tif, page: 3
Processing: folder/WhiteHouse.tif, page: 4
Processing: folder/WhiteHouse.tif, page: 5
the London residence
stricken Queen withdrew
the president of
George Washington occupied
President Washington
by architect Frederick
in Washington when
House security breaches
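The code in the question also kept the filename next to each match. If you want that, plus the page number, one possible variation is to return rows instead of appending to a global list. This is only a sketch: match_rows and the page_texts parameter are names I made up for illustration, and the demo feeds in plain strings in place of real OCR output so it runs without Tesseract.

```python
import re

# Same pattern as in the answer above
regex = re.compile(
    r'(\w+\s(Queen|President|Washington|London|security|architect)\s\w+)',
    flags=re.IGNORECASE,
)

def match_rows(filename, page_texts):
    """Return one dict per page whose text matches the regex.

    page_texts is any iterable of strings, one per page (hypothetical helper).
    """
    rows = []
    for page_num, text in enumerate(page_texts):
        found = regex.search(text)
        if found is not None:
            rows.append({'filename': filename,
                         'page': page_num,
                         'match': found.group(0)})
    return rows

# Stand-in strings instead of real OCR output
demo = match_rows('folder/Example.tif',
                  ['the London residence of', 'no keyword here'])
print(demo)
```

With real files, page_texts could be a generator that calls pytesseract.image_to_string(page, lang="eng") for each page from ImageSequence.Iterator(im), and the collected rows could then be passed to pd.DataFrame(rows) to get filename, page and match columns.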
Answered By - Mark Setchell