Issue
I have a Text data and I want to parse the content and extract the Brand, MPN, Condition, Qty, Price, Customer name.
The text data is in the format:
---ABB INSTALLATION PRODUCT---
54905BE06, NEW, qty 1687, $1.05, ABB INSTALLATION PRODUCT, *#6 1 hole
5490 Searched by: Gracile IT LLC
Oliver Wilson P:123 456 7890
---ARUBA NETWORKS---
R1C72A, NEW, qty 21, $155.00, ARUBA NETWORKS, *AP mount bracketAruba
R1C72A Searched by: Synergy Associates, LLC
N/A P:123 456 7890
---ATC CENTRAL SOURCING---
3X5HZYLBACKCOX, NEW, qty 353, $8.24, ATC CENTRAL SOURCING, The ATC Ce
3X Searched by: Golden Gate Communications
Mike Sweiss P:123 456 7890
3X5HZYLPWRCOX, NEW, qty 353, $8.24, ATC CENTRAL SOURCING, The ATC Cen
3X Searched by: Golden Gate Communications
Mike Sweiss P:123 456 7890
---CAMBIUM NETWORKS---
SFP-10G-LR, NEW, qty 360, $52.95, CAMBIUM NETWORKS, Cambium Networks
SFP-10G-LR Searched by: Diversity IT LLC
Jeremy Place P:123 456 7890
I am trying this with Python using regression technique to extract the data from text. But I am unable to identify how I can extract all the data for a particular column. The column name should be (Brand, MPN, Condition, Qty, Price, Search by, Customer name and phone number). I am attaching my code here. if anyone can help me out.
The format of text data are :
---Brand---
Product details
Contact details
import pandas as pd
import os
import re
def parse_email_content(email_content):
# Regular expression patterns
brand_pattern = r"---(.*?)---"
product_pattern = r"(.*?), (.*?), qty (.*?), \$(.*?), (.*?), (.+)"
contact_pattern = r"Searched by: (.+?)\n\s+(.+) P: (\S+)"
data = []
while email_content:
reg_match = _RegExLib(email_content)
if reg_match.brand_pattern:
Brand = reg_match.brand_pattern.group(1)
if reg_match.product_pattern:
Product = reg_match.product_pattern.group(1)
data['Product'] = data['Product'].str.split(',').str[0]
if reg_match.contact_pattern:
Contact_Person = reg_match.contact_pattern.group(1)
dict_of_data = {
'Brand': brand_pattern,
'Product': product_pattern,
'Contact_Person': contact_pattern,
value_type: value
}
data.append(dict_of_data)
email_content = file.readline()
email_content = file.readline()
data = pd.DataFrame(data)
data.set_index(['Brand', 'Product', 'Contact_Person'], inplace=True)
return data
Solution
To parse the data into a dataframe you can use following example:
import re
import pandas as pd
text = """\
---ABB INSTALLATION PRODUCT---
54905BE06, NEW, qty 1687, $1.05, ABB INSTALLATION PRODUCT, *#6 1 hole
5490 Searched by: Gracile IT LLC
Oliver Wilson P:123 456 7890
---ARUBA NETWORKS---
R1C72A, NEW, qty 21, $155.00, ARUBA NETWORKS, *AP mount bracketAruba
R1C72A Searched by: Synergy Associates, LLC
N/A P:123 456 7890
---ATC CENTRAL SOURCING---
3X5HZYLBACKCOX, NEW, qty 353, $8.24, ATC CENTRAL SOURCING, The ATC Ce
3X Searched by: Golden Gate Communications
Mike Sweiss P:123 456 7890
3X5HZYLPWRCOX, NEW, qty 353, $8.24, ATC CENTRAL SOURCING, The ATC Cen
3X Searched by: Golden Gate Communications
Mike Sweiss P:123 456 7890
---CAMBIUM NETWORKS---
SFP-10G-LR, NEW, qty 360, $52.95, CAMBIUM NETWORKS, Cambium Networks
SFP-10G-LR Searched by: Diversity IT LLC
Jeremy Place P:123 456 7890
"""
all_data = []
for name, group in re.findall(
r"^---(.*?)---(.*?)\s*(?=^---|\Z)", text, flags=re.S | re.M
):
m = re.search(
r"(\S+)\s*,\s*(\S+),\s*qty\s*(\d+).*\$([^,\s]+).*?Searched by:\s*(.*?)\s*P:([^\n]+)",
group,
flags=re.S,
)
if m:
mpn, condition, qty, price, customer_name, p_number = m.groups()
search_by, customer_name = map(str.strip, customer_name.split("\n"))
all_data.append(
[name, mpn, condition, qty, price, search_by, customer_name, p_number]
)
df = pd.DataFrame(
all_data,
columns=[
"name",
"mpn",
"condition",
"qty",
"price",
"search_by",
"customer_name",
"p_number",
],
)
print(df)
Prints:
name mpn condition qty price search_by customer_name p_number
0 ABB INSTALLATION PRODUCT 54905BE06 NEW 1687 1.05 Gracile IT LLC Oliver Wilson 123 456 7890
1 ARUBA NETWORKS R1C72A NEW 21 155.00 Synergy Associates, LLC N/A 123 456 7890
2 ATC CENTRAL SOURCING 3X5HZYLBACKCOX NEW 353 8.24 Golden Gate Communications Mike Sweiss 123 456 7890
3 CAMBIUM NETWORKS SFP-10G-LR NEW 360 52.95 Diversity IT LLC Jeremy Place 123 456 7890
Answered By - Andrej Kesely
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.