Monday, November 13, 2023

[FIXED] Create a dataframe - order based on text coordinates

Issue

I have a data frame with multiple columns. I get it from pytesseract.image_to_data(img_pl, lang="eng", output_type='data.frame', config='--psm 11') (I tried psm 11 and 12, same result) and keep only the important columns. Let's look at the following columns:

# This is the data I get from the above command.
# It is written out like this so you can copy and test it.
import pandas as pd

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
output_test_min = pd.DataFrame(data)
# Output:
+----+---+-----+------+-------+
|left|top|width|height|   text|
+----+---+-----+------+-------+
| 154|  0|  576|    89|  text1|
| 154|  3|  168|    10|  text2|
| 200|  3|  162|    10|  text3|
| 154|  7|  168|    10|  text4|
| 201|  8|  155|    10|  text5|
| 199| 12|  157|    10|  text6|
+----+---+-----+------+-------+

Notice that some of the coordinates are off by a few pixels (from what I saw, at most 3-5 pixels). That is why the width can also be taken into account: for example, the left of "abc" and "abcdef" will be different, but adding the width shows that both reach the same right edge.
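To make the pixel drift concrete, here is a small sketch using the sample data above. The names `boxes`, `right`, `left_spread` and `right_spread` are illustrative only and are not part of the original post.

```python
import pandas as pd

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
boxes = pd.DataFrame(data)

# Right edge of every box: left coordinate plus width.
boxes['right'] = boxes['left'] + boxes['width']

# The three cells that belong in the second column of the expected result.
second_col = boxes[boxes['text'].isin(['text3', 'text5', 'text6'])]

# How far apart the edges of those cells are, in pixels.
left_spread = second_col['left'].max() - second_col['left'].min()
right_spread = second_col['right'].max() - second_col['right'].min()
print(left_spread, right_spread)
```

In this sample both edges stay within a few pixels of each other, so neither can be matched exactly; some tolerance is needed either way.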

The expected result is as below:

+-----+-------+-------+
|index| col 01| col 02|
+-----+-------+-------+
|    0|  text1|       |
|    1|  text2|  text3|
|    2|  text4|  text5|
|    3|       |  text6|
+-----+-------+-------+

The best result I get is from this:

output_test_min_agg=output_test_min.sort_values('top', ascending=True)
output_test_min_agg = output_test_min_agg.groupby(['top', 'left'], sort=False)['text'].sum().unstack('left')
output_test_min_agg.reindex(sorted(output_test_min_agg.columns), axis=1).dropna(how='all')

But it's still not good, because if top or left differ by even 1 pixel, a whole new row or column is created for them.
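The effect is easy to reproduce on the sample data. The sketch below is a simplified version of the attempt above (it uses `agg(''.join)` instead of `sum()` and skips the sorting); `wide` is an illustrative name.

```python
import pandas as pd

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
df = pd.DataFrame(data)

# Group on the exact pixel values: every distinct 'top' becomes a row
# and every distinct 'left' becomes a column.
wide = df.groupby(['top', 'left'])['text'].agg(''.join).unstack('left')
print(wide.shape)  # (5, 4) instead of the wanted 4 rows x 2 columns
```

The lefts 199, 200 and 201 each get their own column, and the tops 7 and 8 each get their own row.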

How can I accomplish such a task?

Solution

I accomplished it by doing the following:

I made three functions, one for each purpose.

1) Using your dummy data:

import pandas as pd
import numpy as np
# Create a dictionary of data for the DataFrame
data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
# Create the DataFrame
df = pd.DataFrame(data)

2) Creating a function from the code you supplied, with added handling of the `NaN` values:

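def optimizeDf(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Right edge of each box; used as the column key instead of 'left'
    df['left+width'] = df['left'] + df['width']
    df = df.sort_values(by=['top'], ascending=True)
    # One row per 'top' value, one column per right-edge value
    df = df.groupby(['top', 'left+width'], sort=False)['text'].sum().unstack('left+width')
    # Sort the columns and drop rows and columns that are entirely empty
    df = df.reindex(sorted(df.columns), axis=1).dropna(how='all').dropna(axis='columns', how='all')
    df = df.fillna('')
    return df
df = optimizeDf(df)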

3) Creating a function to merge adjacent columns whose names (the left+width values) are within a threshold of each other:

def mergeDfColumns(old_df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    new_columns = {}
    old_columns = old_df.columns
    i = 0
    while i < len(old_columns) - 1:
        # Neighbouring column names less than `threshold` pixels apart count as one column
        if 0 < old_columns[i+1] - old_columns[i] < threshold:
            new_col = old_df[old_columns[i]] + old_df[old_columns[i+1]]
            new_columns[old_columns[i+1]] = new_col
            i += 1  # skip the column that was just merged
        else:
            new_columns[old_columns[i]] = old_df[old_columns[i]]
        i += 1
    # Keep the last column, unless it was already merged into its neighbour
    if i < len(old_columns):
        new_columns[old_columns[i]] = old_df[old_columns[i]]
    return pd.DataFrame.from_dict(new_columns).replace('', np.nan).dropna(axis='columns', how='all').fillna('')
df = mergeDfColumns(df)

4) Creating a function to merge consecutive rows whose index values (the top coordinates) are within a threshold of each other:

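def mergeDfRows(old_df: pd.DataFrame, threshold: int = 2) -> pd.DataFrame:
    if old_df.empty:
        return old_df.reset_index(drop=True)
    # Collect the merged rows in a list (DataFrame.append was removed in pandas 2.0)
    rows = [old_df.iloc[0].copy()]
    for i in range(1, len(old_df)):
        if abs(old_df.index[i] - old_df.index[i - 1]) < threshold:
            # Close enough to the previous row: concatenate the cell texts
            rows[-1] = rows[-1] + old_df.iloc[i]
        else:
            rows.append(old_df.iloc[i].copy())
    return pd.DataFrame(rows).reset_index(drop=True)
df = mergeDfRows(df)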

The end result will be as follows:

+-+-----+-----+-----+
| |  322|  362|  730|
+-+-----+-----+-----+
|0|     |     |text1|
|1|text2|text3|     |
|2|text4|text5|     |
|3|     |text6|     |
+-+-----+-----+-----+
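For comparison, the same layout can be reached with a more compact labelling approach. This is only a sketch, not the method used above: `cluster_labels`, `right`, `row`, `col` and `table` are hypothetical names, and the tolerances 2 and 10 are simply the thresholds from the functions above. It assigns a cluster label to each coordinate (a new cluster starts wherever the gap between sorted neighbours reaches the tolerance) and then pivots on those labels.

```python
import pandas as pd

def cluster_labels(values: pd.Series, tol: int) -> pd.Series:
    """Label values so that sorted neighbours closer than `tol` share a label."""
    ordered = values.sort_values()
    # A new cluster starts wherever the gap to the previous value reaches tol.
    labels = (ordered.diff() >= tol).cumsum()
    # Return the labels in the original row order.
    return labels.reindex(values.index)

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
df = pd.DataFrame(data)

df['right'] = df['left'] + df['width']           # right edge of each box
df['row'] = cluster_labels(df['top'], tol=2)     # rows: tops within 2 px
df['col'] = cluster_labels(df['right'], tol=10)  # columns: right edges within 10 px

# Join the texts that land in the same cell and spread the columns out.
table = df.groupby(['row', 'col'])['text'].agg(''.join).unstack('col').fillna('')
print(table)
```

On the sample data this gives the same 4 x 3 arrangement as the table above, with the column labels 0, 1, 2 in place of 322, 362, 730. One difference to keep in mind: the labelling chains neighbours, so a long run of values that each sit just under the tolerance from the next would all end up in one cluster, whereas the functions above only merge adjacent pairs.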

That is the best result I got from your dummy data, but notice how text1 gets its own row and column. This comes from the data: its width and height are huge compared to the others. My guess is that the table in your image has some sort of title very close to it, and pytesseract recognized it as part of the table. My suggestion is to try other config options, or to use some deep learning to classify your table better.
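If the oversized box really is a title, one simple option is to filter it out before running the steps above. The sketch below is an assumption, not something from the original question: it drops every box whose height is more than `max_ratio` times the median height, and both `max_ratio` and the value 3 are hypothetical and would need tuning on real images.

```python
import pandas as pd

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
df = pd.DataFrame(data)

max_ratio = 3  # hypothetical cut-off, chosen only for this sample
median_height = df['height'].median()

# Keep only boxes whose height is in line with the typical cell height.
body = df[df['height'] <= max_ratio * median_height]
print(body['text'].tolist())
```

With the sample data the median height is 10, so the 89-pixel-high text1 box is dropped and the remaining five boxes can go through the same functions.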

Answered By - Lidor Eliyahu Shelef

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
