Monday, February 5, 2024

[FIXED] Python how to extract the xml part from xml.p7m file

Issue

I have to extract information from an xml.p7m file (an Italian electronic invoice with a digital signature, I think).

The extraction already works fine with the usual XML files from Italy, but we also receive xml.p7m files (which I only recently discovered), and I can't figure out how to deal with them.

I only want the XML part, so I start with these splits to remove the signature part:

with open(path, encoding='unicode_escape') as f:
    txt = '<?xml version="1.0"' + re.split('<?xml version="1.0"',f.read())[1]
    txt = re.split('</FatturaElettronica>', txt)[0] + "</FatturaElettronica>"

The problem now is that parts like this remain in the XML:

    """ <Anagrafica>
              <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>
            </Anagraf♦♥èica>"""

This obviously makes the XML not well-formed, so the data extraction fails.

I have to open the file with unicode_escape to remove those lines, because otherwise I get an error: the signature parts can't be decoded as utf-8.

If I encode this part, I get:

    b' <Anagrafica>\n          <Denominazione>AUTOCARROZZERIA CIANO S.R.L.</Denominazione>\n        </Anagraf\xe2\x99\xa6\xe2\x99\xa5\xc3\xa8ica>'

Does anyone have an idea how to extract only the XML part from the xml.p7m file? Btw the xml should be: but if I open the xml, there are already characters that don't belong to the utf-8 charset or something?

Solution

Edit: My first approach was far from optimal. It involved too much manual work, so I searched further for a real solution and found this:

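from OpenSSL import crypto
from OpenSSL._util import (
    ffi as _ffi,
    lib as _lib,
)

# Note: this relies on private pyOpenSSL helpers (_util, _new_mem_buf,
# _bio_to_string), which may change or disappear between releases.
def removeSignature(fileString):
    # fileString: the raw bytes of the DER-encoded .p7m file
    p7 = crypto.load_pkcs7_data(crypto.FILETYPE_ASN1, fileString)
    bio_out = crypto._new_mem_buf()
    # Write the signed content to bio_out without verifying certificates or signatures
    res = _lib.PKCS7_verify(p7._pkcs7, _ffi.NULL, _ffi.NULL, _ffi.NULL, bio_out, _lib.PKCS7_NOVERIFY | _lib.PKCS7_NOSIGS)

    if res == 1:
        return crypto._bio_to_string(bio_out).decode('UTF-8')
    else:
        # Collect the OpenSSL error details (not used further here, but handy for logging)
        errno = _lib.ERR_get_error()
        errstrlib = _ffi.string(_lib.ERR_lib_error_string(errno))
        errstrfunc = _ffi.string(_lib.ERR_func_error_string(errno))
        errstrreason = _ffi.string(_lib.ERR_reason_error_string(errno))
        return ""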
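If relying on pyOpenSSL internals is not an option, the same extraction can be delegated to the openssl command-line tool. The following is only a minimal sketch, assuming an `openssl` executable is available on the PATH and that the .p7m file is binary DER; the helper names `build_openssl_command` and `extract_xml_with_openssl` are made up for illustration and are not part of the original answer.

```python
import subprocess
from pathlib import Path


def build_openssl_command(p7m_path, xml_path):
    """Build the openssl call that writes the signed content to xml_path."""
    return [
        "openssl", "smime", "-verify",
        "-noverify",          # do not verify the signer's certificate chain
        "-inform", "DER",     # assumption: the .p7m is binary DER, not base64
        "-in", str(p7m_path),
        "-out", str(xml_path),
    ]


def extract_xml_with_openssl(p7m_path, xml_path):
    """Run openssl and return the extracted XML as bytes (hypothetical helper)."""
    # check=True raises CalledProcessError if openssl exits with an error
    subprocess.run(build_openssl_command(p7m_path, xml_path),
                   check=True, capture_output=True)
    return Path(xml_path).read_bytes()
```

Like the function above, this only strips the signature envelope and does not check that the signature is valid.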

What I'm doing now is checking whether the file already contains the XML or has to be base64-decoded first. After that I remove the signature and build the XML tree, so I can do the XML processing I need:

    if filePath.lower().endswith('p7m'):
        logger.infoLog(f"Try open file: {filePath}")
        with open(filePath, 'rb') as f:
            txt = f.read()
            # no opening tag found --> no xml --> base64-decode the file, save it, and use the decoded bytes
            if b'<' not in txt:
                decoded = base64.decodebytes(txt)
                # keep a copy of the decoded bytes on disk
                with open(path + 'decoded.xml', 'wb') as decoded_file:
                    decoded_file.write(decoded)
                txt = decoded
        # try to parse the string
        try:
            logger.infoLog("Try parsing the first time")
            txt = removeSignature(txt)
            ET.fromstring(txt)
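The base64 check can also be done in memory, without writing a temporary file. The sketch below uses a different heuristic from the one above: a DER-encoded PKCS#7 structure begins with the ASN.1 SEQUENCE tag 0x30, so anything that does not start with that byte is assumed to be base64 text. The function name `to_der_bytes` is hypothetical.

```python
import base64
import binascii


def to_der_bytes(data: bytes) -> bytes:
    """Return DER bytes, base64-decoding the input first if it is not DER already."""
    # DER-encoded PKCS#7 data starts with the SEQUENCE tag 0x30
    if data[:1] == b"\x30":
        return data
    try:
        # non-alphabet characters such as line breaks are ignored by default
        return base64.b64decode(data)
    except binascii.Error:
        # not valid base64 either: hand the data back unchanged
        return data
```

The result can then be passed to removeSignature in place of the raw file contents.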

Answered By - user3793935

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
