Issue
I'm trying to get the data from within the <script> tag, but the re.search() call that sets 'model_data' returns None.
When I run the code I get the error:
model_data = model_data.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
What is wrong here?
html_doc = """
<script>
var modelData = {
"hlsUrl": "null",
"account": "4LH7J44IYPAGEZEY6E3UL"
}
</script>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# locate the script, get the contents
script_text = soup.select_one("script").contents[0]
# get javascript object inside the script
model_data = re.search(r"modelData = ({.*?});", script_text, flags=re.S)
print(model_data) # RETURNS None - why?
model_data = model_data.group(1)
# "convert" the javascript object to json-valid object
model_data = re.sub(
r"^\s*([^:\s]+):", r'"\1":', model_data.replace("'", '"'), flags=re.M
)
# json decode the object
model_data = json.loads(model_data)
# print the data
print(model_data["account"])
Updated issue:
After accepting the answer, which worked with the given response, I found out that I had left out an important piece of information.
The full response is like this:
{
"hlsUrl": "null",
"account": "1V2FO4K7ME78RV09VXNEC",
"packageName": "null",
isActive: false
}
Here isActive is an unquoted key, which is fine in a JavaScript object literal but not valid JSON, so I now get the following error:
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 5 column 9 (char 111)
Solution
You're making this more complicated than it has to be.
Use .string, not .contents[0]. The latter works too, but .string doesn't need indexing and can be passed directly to re.search(). Also, IMHO, it's more readable.
Fix your regex.
This works:
modelData = ({.*?})
while this does not:
modelData = ({.*?});
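Notice there's no need for the ; at the end. The object in your script isn't followed by a semicolon, so a pattern that requires one never matches and re.search() returns None.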
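A minimal check of the two patterns against the script text from the question, using only the standard library. The variable names here are illustrative and not part of the original code:

```python
import re

# Script body from the question: there is no ';' after the closing brace.
script_text = """
var modelData = {
"hlsUrl": "null",
"account": "4LH7J44IYPAGEZEY6E3UL"
}
"""

# Pattern that requires a trailing semicolon (as in the question).
with_semicolon = re.search(r"modelData = ({.*?});", script_text, flags=re.S)
# Same pattern without the semicolon.
without_semicolon = re.search(r"modelData = ({.*?})", script_text, flags=re.S)

print(with_semicolon)              # None, so .group(1) would raise AttributeError
print(without_semicolon.group(1))  # the {...} object literal as a string
```

Note that the non-greedy .*? stops at the first closing brace, so this pattern assumes the object has no nested braces.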
Finally, you don't have to do all this:
# "convert" the javascript object to json-valid object
model_data = re.sub(
r"^\s*([^:\s]+):", r'"\1":', model_data.replace("'", '"'), flags=re.M
)
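Just pass the regex group(1) straight to json.loads. The object in your sample already uses double-quoted keys and values, so it is valid JSON as it stands.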
Full code:
import json
import re
from bs4 import BeautifulSoup
html_doc = """
<script>
var modelData = {
"hlsUrl": "null",
"account": "4LH7J44IYPAGEZEY6E3UL"
}
</script>
"""
soup = BeautifulSoup(html_doc, "html.parser")
script_text = soup.select_one("script").string
model_data = re.search(r"modelData = ({.*?})", script_text, re.S).group(1)
print(json.loads(model_data)["account"])
Output:
4LH7J44IYPAGEZEY6E3UL
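For the response in the updated issue, where isActive is an unquoted key, passing group(1) straight to json.loads raises the JSONDecodeError shown there. One possible approach, sketched here under the assumption that each key sits on its own line and that unquoted keys are plain identifiers, is to quote only the bare keys before decoding. The script_text value and the pattern below are illustrative and were not part of the original answer:

```python
import json
import re

# Script body built from the updated issue (assumed to be wrapped in
# "var modelData = ..." like the first sample).
script_text = """
var modelData = {
"hlsUrl": "null",
"account": "1V2FO4K7ME78RV09VXNEC",
"packageName": "null",
isActive: false
}
"""

raw = re.search(r"modelData = ({.*?})", script_text, flags=re.S).group(1)

# Quote only bare identifier keys at the start of a line.
# Keys that are already quoted start with '"' and are left untouched.
fixed = re.sub(r'^([ \t]*)([A-Za-z_$][\w$]*)\s*:', r'\1"\2":', raw, flags=re.M)

model_data = json.loads(fixed)
print(model_data["account"])   # 1V2FO4K7ME78RV09VXNEC
print(model_data["isActive"])  # False
```

The re.sub pattern from the question, r"^\s*([^:\s]+):", would not work on this mixed input because it also matches keys that are already quoted and wraps them in a second pair of quotes. The sketch above is still a line-based heuristic: it does not handle nested objects, several keys on one line, or single-quoted strings.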
Answered By - baduker