Issue
I have a data frame with multiple columns. I get it from pytesseract.image_to_data(img_pl, lang="eng", output_type='data.frame', config='--psm 11')
(psm 11 and 12 give the same result) and keep only the important columns. Let's look at the following columns:
# This is the data I get from the above command.
# I added it like this so that you can copy and test it.
import pandas as pd

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
output_test_min_agg = pd.DataFrame(data)
# Output:
+----+---+-----+------+-------+
|left|top|width|height| text|
+----+---+-----+------+-------+
| 154| 0| 576| 89| text1|
| 154| 3| 168| 10| text2|
| 200| 3| 162| 10| text3|
| 154| 7| 168| 10| text4|
| 201| 8| 155| 10| text5|
| 199| 12| 157| 10| text6|
+----+---+-----+------+-------+
Notice that some of the coordinates are off by a few pixels (from what I saw, at most 3-5 pixels). That is why the width should also be taken into account: for example, the left of "abc" and "abcdef" will be different, but adding the width shows that both end at the same position.
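To make the point about the width concrete, here is a small sketch on the sample data above. Adding left and width gives the right edge of each box, and boxes that belong to the same column end up within a few pixels of each other. The column name right is made up for this example and is not part of the pytesseract output.

```python
import pandas as pd

data = {'left': [154, 154, 200, 154, 201, 199],
        'width': [576, 168, 162, 168, 155, 157],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
df = pd.DataFrame(data)

# Right edge of each box; 'right' is an illustrative name, not a pytesseract column
df['right'] = df['left'] + df['width']

# Map each text to its right edge (tolist() gives plain Python ints)
right_edges = dict(zip(df['text'], df['right'].tolist()))
print(right_edges)
```

In this data text3, text5 and text6 start at left = 200, 201 and 199, and their right edges (362, 356, 356) are also only a few pixels apart, so either value needs a tolerance rather than an exact match.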
The expected result is shown below:
+-----+-------+-------+
|index| col 01| col 02|
+-----+-------+-------+
| 0| text1| |
| 1| text2| text3|
| 2| text4| text5|
| 3| | text6|
+-----+-------+-------+
The best result I get is from this:
output_test_min_agg = output_test_min_agg.sort_values('top', ascending=True)
output_test_min_agg = output_test_min_agg.groupby(['top', 'left'], sort=False)['text'].sum().unstack('left')
output_test_min_agg.reindex(sorted(output_test_min_agg.columns), axis=1).dropna(how='all')
But it's still not good, because if top or left differ by even 1 pixel, it creates a whole new column and row for them.
How can I accomplish such a task?
Solution
I accomplished it with the steps below, using three functions, one for each purpose:
1) Using your dummy data:
import pandas as pd
import numpy as np
# Create a dictionary of data for the DataFrame
data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
# Create the DataFrame
df = pd.DataFrame(data)
2) Creating a function that uses the code you supplied and adds handling of the NaN values:
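def optimizeDf(df: pd.DataFrame) -> pd.DataFrame:
    # Right edge of each box, used as the column key instead of 'left'
    df['left+width'] = df['left'] + df['width']
    df = df.sort_values(by=['top'], ascending=True)
    # Rows are keyed by 'top', columns by the right edge
    df = df.groupby(['top', 'left+width'], sort=False)['text'].sum().unstack('left+width')
    df = df.reindex(sorted(df.columns), axis=1).dropna(how='all').dropna(axis='columns', how='all')
    # Empty cells become '' so that cells can be concatenated in the next steps
    df = df.fillna('')
    return df

df = optimizeDf(df)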
3) Creating a function to merge columns whose names (the left+width values) are within a threshold of each other:
def mergeDfColumns(old_df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    new_columns = {}
    old_columns = old_df.columns
    i = 0
    while i < len(old_columns) - 1:
        # Adjacent column names less than `threshold` apart are treated as one column
        if any(old_columns[i+1] == old_columns[i] + x for x in range(1, threshold)):
            new_col = old_df[old_columns[i]] + old_df[old_columns[i+1]]
            new_columns[old_columns[i+1]] = new_col
            i += 1  # skip the column that was just merged
        else:
            new_columns[old_columns[i]] = old_df[old_columns[i]]
        i += 1
    # Keep the last column, unless it was already merged into its neighbour
    if i < len(old_columns):
        new_columns[old_columns[i]] = old_df[old_columns[i]]
    return pd.DataFrame.from_dict(new_columns).replace('', np.nan).dropna(axis='columns', how='all').fillna('')

df = mergeDfColumns(df)
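For integer column names, the any(...) test in mergeDfColumns is true exactly when the next name is larger than the current one by at least 1 and by less than threshold. A minimal sketch of that check on its own, using the column names the sample data produces after step 2 (322, 356, 362, 730); the helper name is_close is hypothetical and only used here:

```python
def is_close(current: int, nxt: int, threshold: int = 10) -> bool:
    """Same condition as the any(...) test: 0 < nxt - current < threshold."""
    return 0 < nxt - current < threshold

# Column names (left + width) produced by the sample data
columns = [322, 356, 362, 730]

# Compare each column name with its right-hand neighbour
pairs = [(a, b, is_close(a, b)) for a, b in zip(columns, columns[1:])]
print(pairs)
```

Only 356 and 362 are close enough to be merged, which is why the final table has the columns 322, 362 and 730.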
4) Creating a function to merge rows whose index values (the top coordinates) are within a threshold of each other:
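def mergeDfRows(old_df: pd.DataFrame, threshold: int = 2) -> pd.DataFrame:
    # Start from a copy of the first row so later assignments do not touch old_df
    new_df = old_df.iloc[:1].copy()
    for i in range(1, len(old_df)):
        # The index holds the 'top' values; rows with close tops are the same table row
        if abs(old_df.index[i] - old_df.index[i - 1]) < threshold:
            new_df.iloc[-1] = new_df.iloc[-1] + old_df.iloc[i]
        else:
            # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
            new_df = pd.concat([new_df, old_df.iloc[[i]]])
    return new_df.reset_index(drop=True)

df = mergeDfRows(df)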
The end result will be as follows:
+-+-----+-----+-----+
| | 322| 362| 730|
+-+-----+-----+-----+
|0| | |text1|
|1|text2|text3| |
|2|text4|text5| |
|3| |text6| |
+-+-----+-----+-----+
That is the best result I got from your dummy data. Notice, however, that text1 gets its own row and column. The reason is in the data: its width and height are huge compared to the others. My guess is that the table in your image has some sort of title very close to it, and pytesseract recognized it as part of the table. My suggestion is to try other config options or to use deep learning to detect your table better.
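As a more compact variant of the same idea (not the code above, just a sketch under the same thresholds), the nearby coordinates can be labelled first and the table built from the labels. The helper cluster_ids and the column names row and col are hypothetical names chosen for this example. Note that neighbours are chained, so values that drift gradually can all end up in one group.

```python
import pandas as pd

def cluster_ids(values: pd.Series, tol: int) -> pd.Series:
    """Give sorted neighbours that are less than tol apart the same integer id."""
    ordered = values.sort_values()
    # A new group starts wherever the gap to the previous value is >= tol
    ids = (ordered.diff().fillna(0) >= tol).cumsum()
    # Return the ids in the original row order
    return ids.reindex(values.index)

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
df = pd.DataFrame(data)

# Same tolerances as the defaults of mergeDfRows (2) and mergeDfColumns (10)
df['row'] = cluster_ids(df['top'], tol=2)
df['col'] = cluster_ids(df['left'] + df['width'], tol=10)

# Concatenate texts that fall into the same cell, then spread columns out
table = df.groupby(['row', 'col'])['text'].agg(''.join).unstack('col').fillna('')
print(table)
```

On the dummy data this gives the same four rows and three columns as the result above, with text1 again alone in its own row and column.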
Answered By - Lidor Eliyahu Shelef