Thursday, January 6, 2022

[FIXED] How to merge multiple .xls files with hyperlinks in python?

January 06, 2022 excel, openpyxl, pandas, python, xlsx No comments

Issue

I am trying to merge multiple .xls files that have many columns, but 1 column with hyperlinks. I try to do this with Python but keep running into unsolvable errors.

Just to be concise, the hyperlinks are hidden under a text section. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).

In order to improve reproducibility, I have added .xls1 and .xls2 samples below.

xls1:

Title	Publication_Number
P_A	ES2866911 (T3)
P_B	EP3887362 (A1)

.xls2:

Title	Publication_Number
P_C	AR118706 (A2)
P_D	ES2867600 (T3)

Desired outcome:

Title	Publication_Number
P_A	ES2866911 (T3)
P_B	EP3887362 (A1)
P_C	AR118706 (A2)
P_D	ES2867600 (T3)

I am unable to get .xls file into Python without losing formatting or losing hyperlinks. In addition I am unable to convert .xls files to .xlsx. I have no possibility to acquire the .xls files in .xlsx format. Below I briefly summarize what I have tried:

1.) Reading with pandas was my first attempt. Easy to do, but all hyperlinks are lost in PD, furthermore all formatting from original file is lost.

2.) Reading .xls files with openpyxl.load

InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

3.) Converting .xls files to .xlsx

from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX(input.file.xls)
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element

import pyexcel as p
p.save_book_as(file_name=input_file.xls, dest_file_name=export_file.xlsx)
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration

4.) Even if we are able to read the .xls file with xlrd for example (meaning we will never be able to save the file as .xlsx, I can't even see the hyperlink:

import xlrd
wb = xlrd.open_workbook(file) # where vis.xls is your test file
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value   
'AR118706 (A2)' #Which is the name, not hyperlink

5.) I tried installing older versions of openpyxl==3.0.1 to overcome type error to no succes. I tried to open .xls file with openpyxl with xlrd engine, similar typerror "xml.entree.elementtree.element' error occured. I tried many ways to batch convert .xls files to .xlsx all with similar errors.

Obviously I can just open with excel and save as .xlsx but this defeats the entire purpose, and I can't do that for 100's of files.

Solution

Inspired by @Kunal, I managed to write code that avoids using Pandas libraries. .xls files are read by xlrd, and written to a new excel file by xlwt. Hyperlinks are maintened, and output file was saved as .xlsx format:

import os
import xlwt
from xlrd import open_workbook

# read and combine data
directory = "random_directory"
required_files = os.listdir(directory)

#Define new file and sheet to get files into
new_file = xlwt.Workbook(encoding='utf-8', style_compression = 0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok = True)

#Initialize header row, can be done with any file 
old_file = open_workbook(directory+"/"+required_files[0], formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in list(range(0, old_sheet.ncols)):
    new_sheet.write(0, column, old_sheet.cell(0, column).value) #To create header row

#Add rows from all files present in folder 
for file in required_files:
    old_file = open_workbook(directory+"/"+file, formatting_info=True) 
    old_sheet = old_file.sheet_by_index(0) #Define old sheet
    hyperlink_map = old_sheet.hyperlink_map #Create map of all hyperlinks
    for row in range(1, old_sheet.nrows): #We need all rows except header row
        if row-1 < len(hyperlink_map.items()): #Statement to ensure we do not go out of range on the lower side of hyperlink_map.items()
            Row_depth=len(new_sheet._Worksheet__rows) #We need row depth to know where to add new row           
            for col in list(range(old_sheet.ncols)): #For every column we need to add row cell
                if col is 1: #We need to make an exception for column 2 being the hyperlinked column
                    click=list(hyperlink_map.items())[row-1][1].url_or_path #define URL
                    new_sheet.write(Row_depth, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(click, old_sheet.cell(row, 1).value)))
                else: #If not hyperlinked column
                    new_sheet.write(Row_depth, col, old_sheet.cell(row, col).value) #Write cell

new_file.save("random_directory/output_file.xlsx")

Answered By - Rivered

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Thursday, January 6, 2022

[FIXED] How to merge multiple .xls files with hyperlinks in python?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels