Wednesday, November 17, 2021

[FIXED] storing image (.tif) in np.array through PIL fromarray [TypeError: Cannot handle this data type][ValueError: invalid literal for int()]

Labels: numpy, python-imaging-library, python-tesseract, tiff

Issue

Update: when I write the results to a file (as the answerer suggested)

with open('results.txt', 'a', encoding="utf-8") as f:
    for line in results:
        f.write(line)
        f.write('\n')

all text pieces are appended correctly to result.txt. But when I go to cmd and run magick -density 288 text:"result.txt" -alpha off -compress Group4 filename1.tif, it creates filename1.tif with all of the result.txt characters rendered as a picture.

Original question: This code accesses a folder of single page .tif files and extracts textual data.

import glob
import os
import re

import pandas as pd
import pytesseract
from PIL import Image

data = []
data1 = []
# Compile the regex once, outside the loop
regex1 = re.compile(r'(www(i|ı)a\s+bbb(\:)?(\s+|\s+\.)?\s+(de(s|r(:)?))?)', flags=re.IGNORECASE)

listOfPages = glob.glob(r"C:/Users/name/folder/*.tif")
for entry in listOfPages:
    # Skip anything that is not a regular file so both lists stay aligned
    if not os.path.isfile(entry):
        continue
    data1.append(entry)
    text1 = pytesseract.image_to_string(Image.open(entry), lang="en")
    text = re.sub(r'\n', ' ', text1)

    # Keep the first match, or None when the pattern is absent
    var1a = regex1.search(text)
    var1 = var1a.group(1) if var1a else None

    data.append([text, var1])

df0 = pd.DataFrame(data, columns=['raw_text', 'var1'])
df01 = pd.DataFrame(data1, columns=['filename'])
df1 = pd.concat([df0, df01], axis=1)

I wanted to adjust it to work for multipage files, too. I am therefore trying to convert the image with Image.fromarray(), which raises the following error:

text1 = pytesseract.image_to_string(np.array(entry), lang="en")

or

text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry)), lang="en")

TypeError: Cannot handle this data type: (1, 1), <U52

I use Python 3.9.7, pytesseract 0.3.8, NumPy 1.21.2 and Pillow 8.3.2. I read the question "PIL TypeError: Cannot handle this data type" and came up with this:

text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.uint8)), lang="en")

which gives me the error ValueError: invalid literal for int() with base 10: 'C:/Users/name/folder/test\\fff.tifC:/Users/name/folder/test\\ddddd.tif

which hints at the need to use float, but when I do

text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry * 255).astype(np.float)), lang="en")

I get

ValueError: could not convert string to float: 'C:/Users/name/folder/test\\ffff.tif

exiftool output

File Type                       : TIFF
File Type Extension             : tif
MIME Type                       : image/tiff
Exif Byte Order                 : Little-endian (Intel, II)
Subfile Type                    : Full-resolution image
Image Width                     : 2472
Image Height                    : 3495
Bits Per Sample                 : 1
Compression                     : T6/Group 4 Fax
Photometric Interpretation      : BlackIsZero
Thresholding                    : No dithering or halftoning
Fill Order                      : Reversed
Image Description               : DN31
Camera Model Name               : SCA
Strip Offsets                   : (Binary data 90 bytes, use -b option to extract)
Orientation                     : Horizontal (normal)
Samples Per Pixel               : 1
Rows Per Strip                  : 213
Strip Byte Counts               : (Binary data 73 bytes, use -b option to extract)
X Resolution                    : 300
Y Resolution                    : 300
Planar Configuration            : Chunky
T6 Options                      : (none)
Resolution Unit                 : inches
Software                        : DACS Toolkit II
Modify Date                     : 1998:03:12 10:29:31
Image Size                      : 2472x3495
Megapixels                      : 8.6

Other suggestions on SO are:

im = Image.fromarray((img[0] * 255).astype(np.uint8))

If your image is greyscale, you need to pass PIL a 2-D array, i.e. the shape must be h,w not h,w,1.
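The point about greyscale shapes can be sketched as follows. This is only an illustration: the array img is a made-up stand-in for float pixel data in the range 0 to 1, not data from the question.

```python
import numpy as np
from PIL import Image

# Hypothetical greyscale data stored as floats in [0, 1] with a trailing
# channel axis of length 1, i.e. shape (height, width, 1).
img = np.linspace(0.0, 1.0, 4 * 6).reshape(4, 6, 1)

# Drop the trailing axis so the shape becomes (height, width).
img_2d = np.squeeze(img, axis=2)

# Scale to 0..255 and convert to uint8, which PIL maps to mode "L".
im = Image.fromarray((img_2d * 255).astype(np.uint8))

print(im.mode, im.size)  # size is reported as (width, height)
```

Passing the original (4, 6, 1) array instead would not work, because PIL has no image mode for a single trailing channel.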

i = Image.open('image.png').convert('RGB')
a = np.asarray(i, np.uint8)
print(a.shape)

b = abs(np.fft.rfft2(a,axes=(0,1)))
b = np.uint8(b)
j = Image.fromarray(b)

By default, np.fft.rfft2 uses the last two axes: axes=(-2,-1). For an RGB image the third axis holds the colour channels, so it seems more plausible that one would want to perform the FFT over the spatial axes, axes=(0,1).

img = Image.fromarray(data[0][i].transpose(0,2).numpy().astype(np.uint8))

After the transpose, the channel dimension will be the last (rather than the first).

Solution

I think you want something more like this to process multipage TIFFs. I have tried to replace nondescript variable names like data and var1 with more descriptive ones to make the code more readable.

#!/usr/bin/env python3

import re
from glob import glob
import pytesseract
from PIL import Image, ImageSequence

def processPage(filename, pageNum, im):
    global results
    print(f'Processing: {filename}, page: {pageNum}')

    text = pytesseract.image_to_string(im, lang="eng")
    srchResult = regex.search(text)
    if srchResult is not None:
        results.append(srchResult.group(0))

# Compile regex just once, outside loop - it doesn't change
regex = re.compile(r'(\w+\s(Queen|President|Washington|London|security|architect)\s\w+)', flags = re.IGNORECASE)

results = []

# Get list of all filenames to be processed
filenames = glob('folder/*.tif')

# Iterate over all files
for filename in filenames:
    print(f'Processing file: {filename}')
    with Image.open(filename) as im:
        for pageNum, page in enumerate(ImageSequence.Iterator(im)):
            processPage(filename, pageNum, page)

print('\n'.join(results))
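The question also kept the filename next to each piece of text in a DataFrame. A minimal sketch of how the same page loop might collect one row per page is shown here. The names extract_rows and fake_ocr are hypothetical, and the OCR step is passed in as a function so the demo can run with a stub instead of Tesseract.

```python
import io
import re

from PIL import Image, ImageSequence


def extract_rows(sources, ocr, pattern):
    """Return one dict per page: filename, page number, text, first match.

    sources: iterable of (name, path_or_file_object) pairs
    ocr:     callable taking a PIL image and returning its text
    pattern: compiled regular expression
    """
    rows = []
    for name, source in sources:
        with Image.open(source) as im:
            for page_num, page in enumerate(ImageSequence.Iterator(im)):
                text = ocr(page).replace("\n", " ")
                match = pattern.search(text)
                rows.append({
                    "filename": name,
                    "page": page_num,
                    "raw_text": text,
                    "var1": match.group(0) if match else None,
                })
    return rows


# Demo: an in-memory two-page TIFF and a stub in place of real OCR.
pages = [Image.new("L", (20, 10), color) for color in (0, 255)]
buf = io.BytesIO()
pages[0].save(buf, format="TIFF", save_all=True, append_images=pages[1:])
buf.seek(0)


def fake_ocr(page):
    # Stub: pretend the dark page mentions the Queen.
    if page.getpixel((0, 0)) == 0:
        return "the Queen\nwithdrew"
    return "nothing here"


rows = extract_rows([("demo.tif", buf)], fake_ocr, re.compile(r"queen", re.IGNORECASE))
for row in rows:
    print(row)
```

With real files, the pairs could be built as [(f, f) for f in filenames], the OCR function could be lambda page: pytesseract.image_to_string(page, lang="eng"), and pd.DataFrame(rows) would then give the filename, page, raw_text and var1 columns.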

You don't need the rest of this answer to solve the problem; it only shows how I generated TIFFs to test with.

I tested it by making 2 multi-page TIFFs from the Wikipedia entries for "Buckingham Palace" and "The White House" with ImageMagick by going to each of those pages, copying and saving the text as a.txt and then doing:

magick -density 288 text:"a.txt" -alpha off -compress Group4 WhiteHouse.tif
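If ImageMagick is not to hand, a multi-page test TIFF can probably be produced with Pillow alone. The sketch below is an alternative, not what was used for the output shown in this answer: the page texts are made up, and the default bitmap font is small, so OCR quality on such pages may be poor.

```python
import io

from PIL import Image, ImageDraw, ImageSequence

# Hypothetical page texts; any strings would do.
page_texts = ["the London residence", "President Washington", "by architect Frederick"]

pages = []
for text in page_texts:
    page = Image.new("L", (400, 100), 255)             # white greyscale page
    ImageDraw.Draw(page).text((10, 40), text, fill=0)  # black text, default font
    pages.append(page)

# Write all pages into one TIFF; a filename such as "test.tif" works here too.
buf = io.BytesIO()
pages[0].save(buf, format="TIFF", save_all=True, append_images=pages[1:])

# Read it back and count the pages.
buf.seek(0)
with Image.open(buf) as im:
    page_count = sum(1 for _ in ImageSequence.Iterator(im))

print(page_count)
```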

Sample Output

Processing file: folder/Buckingham.tif
Processing: folder/Buckingham.tif, page: 0
Processing: folder/Buckingham.tif, page: 1
Processing: folder/Buckingham.tif, page: 2
Processing file: folder/WhiteHouse.tif
Processing: folder/WhiteHouse.tif, page: 0
Processing: folder/WhiteHouse.tif, page: 1
Processing: folder/WhiteHouse.tif, page: 2
Processing: folder/WhiteHouse.tif, page: 3
Processing: folder/WhiteHouse.tif, page: 4
Processing: folder/WhiteHouse.tif, page: 5
the London residence
stricken Queen withdrew
the president of
George Washington occupied
President Washington
by architect Frederick
in Washington when
House security breaches
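As for the errors in the question: np.array(entry) wraps the filename string itself, not the pixels of the file it names. That would explain the <U52 in the TypeError, which is NumPy's dtype for a 52-character Unicode string. Likewise, entry * 255 repeats the string 255 times, and converting that to a number fails. The sketch below reproduces both with a hypothetical path (the file does not need to exist) and shows that an opened image converts without trouble.

```python
import numpy as np
from PIL import Image

path = "folder/example.tif"  # hypothetical path; the file need not exist

# A string becomes a zero-dimensional array with a Unicode dtype.
arr = np.array(path)
print(arr.dtype.kind, arr.shape)

# PIL has no image mode for a Unicode array, hence the TypeError.
type_error = None
try:
    Image.fromarray(arr)
except TypeError as exc:
    type_error = str(exc)
    print("TypeError:", exc)

# path * 255 is string repetition, and the result cannot be cast to integers.
value_error = None
try:
    np.array(path * 255).astype(np.uint8)
except ValueError as exc:
    value_error = str(exc)
    print("ValueError:", str(exc)[:60], "...")

# An image object, by contrast, converts to a numeric pixel array.
im = Image.new("L", (6, 4), 128)  # stand-in for Image.open(path)
pixels = np.array(im)
print(pixels.dtype, pixels.shape)
```

In other words, the file has to be opened with Image.open() before any array conversion, and pytesseract.image_to_string() accepts the opened image directly, so no conversion is needed at all.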

Answered By - Mark Setchell

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
