Issue
I found an instruction for reading the links in a single column that contains hyperlinks.
What I have is an Excel file with a column that holds two kinds of values: NaN, or a link made up of text and a hyperlink. Example:
NaN
NaN
NaN
Link_text_1:www.something...
Link_text_2:www.something_else...
NaN
When I load the Excel file, it returns only the text from the column, and the hyperlink is removed.
I tried special instructions, but they did not work.
The base of my code is:
path=r'path_to_file'
df = pd.read_excel(path, sheet_name='docs')
link_column = df[["Unnamed: 18"]]
print(link_column)
With this, I only see NaN or Link_text.
I tried:
df_2 = pd.read_excel(path, sheet_name='Jobs', converters={"Unnamed: 18": lambda x: str(x.value) + "|"+ str(x.hyperlink.target)})
But it returns an error:
df_2 = pd.read_excel(path, sheet_name='Jobs', converters={"Unnamed: 18": lambda x: str(x.value) + "|"+ str(x.hyperlink.target)}) # read sheet Jobs from excel file
AttributeError: 'str' object has no attribute 'value'
I have tried to google this error, but I could not find anything that works for me.
Solution
I'm not sure that pandas is capable of parsing Excel hyperlinks.
So, as a workaround, you can use the openpyxl.worksheet.hyperlink module to get a list of the hyperlinks in a worksheet/column, then create a series in your dataframe based on this list.
Try this:
from openpyxl import load_workbook
import numpy as np
import pandas as pd

wb = load_workbook('test.xlsx')
ws = wb['Sheet1']

# Collect the hyperlink target of each cell in column 1, skipping the header row
list_of_links = []
for i in range(2, ws.max_row + 1):
    try:
        list_of_links.append(ws.cell(row=i, column=1).hyperlink.target)
    except AttributeError:
        # cell.hyperlink is None when the cell has no link
        list_of_links.append(np.nan)

df = pd.read_excel('test.xlsx')
df['Links'] = list_of_links
# Output:
print(df)
          ColA                          Links
0          NaN                            NaN
1          NaN                            NaN
2          NaN                            NaN
3  Link_text_1       http://www.something.../
4  Link_text_2  http://www.something_else.../
5          NaN                            NaN
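If you need this for more than one column or sheet, the loop above could be wrapped in a small helper. The following is only a sketch: the function names `hyperlink_targets` and `read_link_column` and the `first_data_row` parameter are made up for illustration, and it assumes the sheet has a single header row and that the links are real cell hyperlinks (which openpyxl exposes as `cell.hyperlink`, or `None` when there is no link).

```python
def hyperlink_targets(cells):
    """Return the hyperlink target of each cell, or None where a cell has no link."""
    targets = []
    for cell in cells:
        # openpyxl sets cell.hyperlink to None when the cell carries no hyperlink
        link = getattr(cell, "hyperlink", None)
        targets.append(link.target if link is not None else None)
    return targets


def read_link_column(path, sheet_name, column, first_data_row=2):
    """Hypothetical helper: read the hyperlink targets of one worksheet column.

    `column` is 1-based, as in openpyxl; `first_data_row=2` assumes one header row.
    """
    from openpyxl import load_workbook  # imported here so the pure helper above has no dependency

    ws = load_workbook(path)[sheet_name]
    cells = (ws.cell(row=r, column=column) for r in range(first_data_row, ws.max_row + 1))
    return hyperlink_targets(cells)
```

With the `test.xlsx` file and `Sheet1` used above, `df['Links'] = read_link_column('test.xlsx', 'Sheet1', column=1)` should fill the same column, provided pandas and openpyxl agree on the number of data rows.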
Answered By - M92_