Friday, January 5, 2024

[FIXED] Can compile function output be used to detect changes in the source code

January 05, 2024 python, python-2.x No comments

Issue

Python's builtin function compile()

code = """def fact(x):
    if x <= 1:
        return 1
    print (1)
    return x*fact(x-1)
fact(10)"""
c = compile(code, "<string>", "exec")

code = """def fact(y):
    if y <= 1:
        return 1
    print (1)
    return y*fact(y-1)
fact(10)"""
d = compile(code, "<string>", "exec")

here c == d is False. Which is the expected behavior? (adding space in source code or making print 1 instead of print(1) doesn't result in a changed object, which is correct.)

My question is: Can compile object be used to detect changes in the Python source code?

Edit:

To explain it more clearly, I am working on a web application that will allow users to execute their Python code.

Every time a user executes code, its executed on the server. Even adding a space/bracket etc results in a new execution. I am trying to optimize this step by storing the compiled code and if the code is the same in the new request then not executing it.

How do I know that code has changed? (Is it the kind of change that requires executing the code again or a simple space addition.)

I do think there are other ways to achieve what I am trying to achieve by using hashed value or something like this but I want to do it this way because it seems more Pythonic and better than reinventing the wheel.

Solution

Don't do this, as an optimization it won't result in great speed ups.

compile is a built-in function, it is implemented in C, it is fast and is not a place where you should be looking to optimize. When a user tries to execute code you should allow it to be compiled and executed without any sort of caching.

Consider the following, writing code that tries to discover if there's any difference in text entered by a user, will most likely result in more execution time than actually just compiling the function and executing it. Apart from this, you also have the fact that you'll have to store and retrieve the compile code somewhere. The first adds unnecessary space requirements and, the second, adds to the overhead because of the look ups you have to do (depending on how you chose to store it, of course).

Apart from what I just said, you can try and compare the result of compiling Python code but the approach is limited and obfuscated. I will assume the code snippets you entered should compare equal since the only difference between them lies in the parameter names (whose name has no impact on code execution).

Use the byte code:

In general the approach I'm going to present only returns "equal" (True) if the code snippets in question only differ in white-space and/or parameter names, in all other cases, it returns False (you could attempt to make it more intelligent but that would probably require you taking many edge cases into consideration).

What you should generally be aware of is that compile returns a code object. code objects generally contain all the information Python needs in order to execute a piece of code. You can view the instructions contained in a code object by using the dis module:

from dis import dis
dis(c)
  1           0 LOAD_CONST               0 (<code object fact at 0x7fa7bc30e630, file "<string>", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (fact)

  6           9 LOAD_NAME                0 (fact)
             12 LOAD_CONST               1 (10)
             15 CALL_FUNCTION            1
             18 POP_TOP             
             19 LOAD_CONST               2 (None)
             22 RETURN_VALUE

This is what your code snippet does. It loads another code object (which represents the function fact), makes a function call and returns.

The same exact output is generated when you call dis(d), the only difference lies in the loading of the code object for fact:

dis(d)
  1           0 LOAD_CONST               0 (<code object fact at 0x7fa7bc30e5b0, file "<string>", line 1>)

As you can see, they are different:

code object fact at 0x7fa7bc30e630 != code object fact at 0x7fa7bc30e5b0

but their difference does not lie in their functionality, rather they are different instances representing a unit of functionality that is identical.

So here lies the question: do we care about the fact that they are different instances if they represent the same functionality? Similarly, do we care if two variables have different names if they represent the same value? I'm assuming we don't and this is why I think you can make a meaningful basic comparison of code objects if you compare their raw byte code.

Comparing the raw byte code:

Raw byte code is pretty much what I previously described, no names, no identities just a sequence of commands for the Python interpreter (pure functionality). Let's take a look at it:

c.co_code
'd\x00\x00\x84\x00\x00Z\x00\x00e\x00\x00d\x01\x00\x83\x01\x00\x01d\x02\x00S'

Okay, that's ugly, let's have a better look:

dis(c.co_code)
  0 LOAD_CONST          0 (0)
  3 MAKE_FUNCTION       0
  6 STORE_NAME          0 (0)
  9 LOAD_NAME           0 (0)
 12 LOAD_CONST          1 (1)
 15 CALL_FUNCTION       1
 18 POP_TOP        
 19 LOAD_CONST          2 (2)
 22 RETURN_VALUE

That looks better. This is exactly the same as the previous dis(c) command, the only difference is that the names are not present since they don't really play a part in all this.

So, what do we get if we compare d.co_code with c.co_code? Of course, equality since the commands executed are identical. But, there's a pit-fall here, in order to be 100% sure that d.co_code is equal to c.co_code we also need to compare the code objects for the functions loaded inside c and d (the code objects representing the function fact) if we don't compare these, we'll be getting false positives.

So where does the code object for the functions fact lie in each case? They lie in a field called co_consts inside the code object c and d respectively, co_consts is a list containing all the constants for a specific code object. If you take a peak inside there, you'll be able to see the definition for each of these:

# located at position 0 in this case.
# the byte code for function 'fact'
dis(c.co_consts[0])  
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               1 (<=)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               1 (1)
             15 RETURN_VALUE        

  4     >>   16 LOAD_CONST               1 (1)
             19 PRINT_ITEM          
             20 PRINT_NEWLINE       

  5          21 LOAD_FAST                0 (x)
             24 LOAD_GLOBAL              0 (fact)
             27 LOAD_FAST                0 (x)
             30 LOAD_CONST               1 (1)
             33 BINARY_SUBTRACT     
             34 CALL_FUNCTION            1
             37 BINARY_MULTIPLY     
             38 RETURN_VALUE

So, what do we do? We compare their raw byte code to see if they represent the same functionality, as we did previously.

As you can understand this turns out to be a recursive procedure where we first compare the raw byte code for the input, then scan through co_consts to see if another code object is present and repeat until no code objects can be found, if the code objects are in a different position in co_consts we'll return False.

In code, that should look like this:

from types import CodeType 

def equal_code(c1, c2):
    if c1.co_code != c2.co_code:
        return False
    for i, j in zip(c1.co_consts, c2.co_consts):
        if isinstance(i, CodeType):
            if isinstance(j, CodeType):
                return equal_code(i, j)
            else: # consts in different position, return false
                return False
    return True

Where CodeType from types is used to check for code instances.

I think this is pretty much the best you can do using only code objects generated from compile.

Answered By - Dimitris Fasarakis Hilliard

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, January 5, 2024

[FIXED] Can compile function output be used to detect changes in the source code

Issue

Solution

Use the byte code:

Comparing the raw byte code:

0 comments:

Post a Comment

Popular Posts

Labels