Issue
I have to extract information from an .xml.p7m file (an Italian electronic invoice with a digital signature, at least I think so).
The extraction itself is already done and works fine with the usual XML files from Italy, but we also receive .xml.p7m files (which I only recently discovered), and I can't figure out how to deal with them.
I only want the XML part, so I start with these splits to remove the signature part:
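import re

with open(path, encoding='unicode_escape') as f:
    # re.escape makes '?' and '.' match literally instead of acting as regex operators
    txt = '<?xml version="1.0"' + re.split(re.escape('<?xml version="1.0"'), f.read())[1]
txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"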
What I'm stuck on now is that parts like this still remain in the XML:
""" <Anagrafica>
<Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>
</Anagraf♦♥èica>"""
which obviously makes the XML not well-formed, so the data extraction doesn't work.
I have to open the file with unicode_escape to remove those lines, because otherwise I get an error: the signature parts can't be decoded as UTF-8.
If I encode this part, I get:
b' <Anagrafica>\n <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>\n </Anagraf\xe2\x99\xa6\xe2\x99\xa5\xc3\xa8ica>'
Does anyone have an idea how to extract only the XML part from the .xml.p7m file? By the way, the XML should be UTF-8, but when I open the file there are already characters that don't belong to the UTF-8 charset, or something like that?
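One plausible explanation for the stray characters, offered as an assumption rather than something confirmed in the question: a .p7m file is a binary PKCS#7/CMS container, and when the XML payload is stored in chunks, ASN.1 chunk headers (tag byte 0x04 followed by length bytes) sit between pieces of the XML. The sketch below uses made-up bytes, not the real invoice, to show how such a header breaks a closing tag once the file is read as text.

```python
import xml.etree.ElementTree as ET

# Hypothetical bytes: an XML fragment interrupted by an ASN.1 OCTET STRING
# chunk header (tag 0x04, long-form length 0x82 0x03 0xE8 = 1000 bytes).
CHUNK_HEADER = b"\x04\x82\x03\xe8"
raw = (
    b"<Anagrafica><Denominazione>X</Denominazione></Anagraf"
    + CHUNK_HEADER
    + b"ica>"
)

# latin-1 maps every byte to exactly one code point, so decoding never fails,
# but the header bytes end up as stray characters inside the closing tag.
text = raw.decode("latin-1")


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


broken = is_well_formed(text)
clean = is_well_formed(text.replace(CHUNK_HEADER.decode("latin-1"), ""))
print(broken, clean)
```

Stripping such headers by hand is fragile, which is why extracting the payload with a PKCS#7-aware tool is the more robust route.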
Solution
Edit: The way I did it at first was really not optimal. There was too much manual work, so I searched further for a real solution and found this:
from OpenSSL import crypto
from OpenSSL._util import (
    ffi as _ffi,
    lib as _lib,
)

def removeSignature(fileString):
    # fileString holds the raw DER bytes of the .p7m file
    p7 = crypto.load_pkcs7_data(crypto.FILETYPE_ASN1, fileString)
    bio_out = crypto._new_mem_buf()
    # NOVERIFY | NOSIGS: write the content to bio_out without checking the signer certificate or the signatures
    res = _lib.PKCS7_verify(p7._pkcs7, _ffi.NULL, _ffi.NULL, _ffi.NULL, bio_out, _lib.PKCS7_NOVERIFY | _lib.PKCS7_NOSIGS)
    if res == 1:
        return crypto._bio_to_string(bio_out).decode('UTF-8')
    else:
        # collect the OpenSSL error details (useful when debugging, not used further here)
        errno = _lib.ERR_get_error()
        errstrlib = _ffi.string(_lib.ERR_lib_error_string(errno))
        errstrfunc = _ffi.string(_lib.ERR_func_error_string(errno))
        errstrreason = _ffi.string(_lib.ERR_reason_error_string(errno))
        return ""
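The function above relies on pyOpenSSL names that start with an underscore, which are private and may change between releases. As a hedged alternative, a similar extraction can be sketched with the openssl command-line tool, assuming it is installed and on the PATH. The helper names build_openssl_command and extract_with_openssl_cli are made up for this illustration. Unlike the flags used above, this command still checks the signature itself and only skips validation of the signer's certificate.

```python
import subprocess
from pathlib import Path


def build_openssl_command(src, dst):
    """Build the argv list that extracts the payload of a DER-encoded .p7m."""
    return [
        "openssl", "smime", "-verify",
        "-noverify",        # do not validate the signer's certificate
        "-inform", "DER",   # treat the input as binary DER rather than PEM
        "-in", str(src),
        "-out", str(dst),
    ]


def extract_with_openssl_cli(src, dst):
    """Run openssl and return the extracted XML text (raises on failure)."""
    subprocess.run(build_openssl_command(src, dst), check=True, capture_output=True)
    # Assumption: the embedded invoice is UTF-8 encoded.
    return Path(dst).read_text(encoding="utf-8")
```

A call could look like extract_with_openssl_cli("invoice.xml.p7m", "invoice.xml"), where both file names are placeholders.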
What I'm doing now is checking whether the file already contains XML or has to be Base64-decoded first. After that I remove the signature and build the XML tree, so I can do the XML processing I need:
import base64
import re
import xml.etree.ElementTree as ET  # assumed: ET is the standard-library ElementTree

if filePath.lower().endswith('p7m'):
    logger.infoLog(f"Try open file: {filePath}")
    with open(filePath, 'rb') as f:
        txt = f.read()
    # no opening tag to find --> no xml --> decode the file, save it, and get the text
    if not re.findall(b'<', txt):
        image_64_decode = base64.decodebytes(txt)
        # write the decoding result to disk, then read it back
        with open(path + 'decoded.xml', 'wb') as image_result:
            image_result.write(image_64_decode)
        with open(path + 'decoded.xml', 'rb') as f:
            txt = f.read()
    # try to parse the string
    try:
        logger.infoLog("Try parsing the first time")
        txt = removeSignature(txt)
        ET.fromstring(txt)
    except ET.ParseError:
        # placeholder handler (an assumption): the snippet shows no error handling
        raise
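The Base64 check above looks for a "<" byte in the file. A slightly different heuristic, sketched here as an assumption and not as the answer's method, looks at the first byte instead: a PKCS#7 structure is an ASN.1 SEQUENCE, so raw DER/BER data starts with 0x30, while its Base64 form starts with the letter "M". The helper name to_der_bytes is made up for this illustration.

```python
import base64
import binascii


def to_der_bytes(data: bytes) -> bytes:
    """Return raw DER/BER bytes, Base64-decoding the input first if needed."""
    # An ASN.1 SEQUENCE starts with the tag byte 0x30.
    if data[:1] == b"\x30":
        return data
    try:
        # validate=False skips characters outside the Base64 alphabet,
        # such as the line breaks found in wrapped Base64 text.
        decoded = base64.b64decode(data, validate=False)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("input is neither DER nor Base64") from exc
    if decoded[:1] != b"\x30":
        raise ValueError("decoded data does not start with an ASN.1 SEQUENCE")
    return decoded


# Tiny stand-in for a real file: SEQUENCE { INTEGER 5 } in DER.
sample_der = b"\x30\x03\x02\x01\x05"
sample_b64 = base64.encodebytes(sample_der)
from_raw = to_der_bytes(sample_der)
from_b64 = to_der_bytes(sample_b64)
```

This avoids writing a temporary decoded.xml file, since the decoded bytes can be passed straight to the signature-removal step.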
Answered By - user3793935