Issue
I would like to scrape data from a link, but I am having trouble: either I can't find the data, or I don't know how to select some lists and some text inside a ... . This is what I do with BeautifulSoup:
response = requests.get(LINK)
response.raise_for_status()
soup = bs4.BeautifulSoup(response.text,'html.parser')
for select in soup.select("script",type="text/javascript"):
    print(select)
where LINK is an HTTPS URL. As output I get this:
OTHER <script type="text/javascript"> WRITINGS
<script type="text/javascript">
$(function () {
    $('#chart_t_2021').highcharts({
        chart: {
            ...
        },
        title: {
            text: 'I WANT TO PRINT THIS TEXT'
        },
        ...
    })
});
</script>
<script type="text/javascript">
$(function () {
    $('#chart_2021').highcharts({
        title: {
            text: '...'
        },
        yAxis: {
            ...
        },
        xAxis: {tickPositions: [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] <!--I WOULD LIKE TO TAKE THIS LIST AND PUT IT IN A VARIABLE-->
        },
        legend: {
            layout: 'vertical',
            align: 'center',
            verticalAlign: 'bottom'
        },
        plotOptions: {
            series: {
                pointStart: 15
            }
        },
        series: [{
            name: 'I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE',
            data: [0,0,0,0,0,0,0,0,0,3,1,8,12,21,22,13]<!--I WOULD LIKE TO TAKE THIS LIST AND PUT IT IN A VARIABLE-->
        }, {
            name: 'I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE',
            data: [0,0,0,0,0,0,0,0,0,3,1,7,12,21,19,13]<!--I WOULD LIKE TO TAKE THIS LIST AND PUT IT IN A VARIABLE-->
        }]
    })
});</script>
OTHER <script type="text/javascript"> WRITINGS
I tried to do this:
for select1 in soup.select("script",type="text/javascript"):
    for select2 in select1.select("title"):
        print(select2)
but it does not print anything. Can someone help me print at least the first title shown in the output above?
Solution
The information you are trying to extract is inside JavaScript. BeautifulSoup does not parse the contents of a script tag; it treats the whole script as a single block of text, so there is no title element inside it to select. One approach is to use regular expressions to extract the parts you need and ast.literal_eval() to convert the captured text into Python values.
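The two steps can be tried in isolation before touching the real page. The following is a minimal sketch using only the standard library; the script_text value is a made-up, shortened stand-in for the text of one script tag, not the actual page content.

```python
import re
from ast import literal_eval

# Hypothetical snippet of script text, shortened from the question's output.
script_text = """
title: {
    text: 'Example title'
},
series: [{
    data: [0, 3, 1, 8]
}]
"""

# Step 1: the regular expression captures the literal as plain text.
raw_title = re.search(r"text: ('.*?')", script_text).group(1)
raw_data = re.search(r"data: (\[.*?\])", script_text).group(1)

# Step 2: literal_eval parses that text as a Python literal without executing code.
title = literal_eval(raw_title)
data = literal_eval(raw_data)

print(title)
print(data)
```

Note that literal_eval only accepts text that is also a valid Python literal. Quoted strings and lists of numbers qualify, but a JavaScript value such as true or null would raise an error and needs separate handling.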
For example:
import re
from ast import literal_eval

from bs4 import BeautifulSoup


def extract(pattern, script, var):
    # script.string is None when the tag has no inline text (for example, an external script)
    if script.string:
        for value in re.findall(pattern, script.string):
            # Convert the captured JavaScript literal into a Python value
            var.append(literal_eval(value))


html = """<<script text copied from question>>"""
soup = BeautifulSoup(html, 'html.parser')
titles = []
tickpositions = []
names = []
data = []

# The attribute selector keeps only scripts declared as text/javascript
for script in soup.select('script[type="text/javascript"]'):
    # Raw strings keep the regex backslashes intact
    extract(r"text: ('.*?')", script, titles)
    extract(r"tickPositions: (\[.*?\])", script, tickpositions)
    extract(r"name: ('.*?')", script, names)
    extract(r"data: (\[.*?\])", script, data)

print(titles)
print(tickpositions)
print(names)
print(data)
For the data you have provided, this would give you the following type of output:
['I WANT TO PRINT THIS TEXT', '...']
[[15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
['I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE', 'I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE']
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 8, 12, 21, 22, 13], [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 7, 12, 21, 19, 13]]
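Since names and data come back as two parallel lists, it may be convenient to pair them up afterwards. This is a sketch of one way to do that; the script_text value and the series names are made up for illustration, and the pairing assumes every series has exactly one name and one data entry, in the same order.

```python
import re
from ast import literal_eval

# Hypothetical script text, shortened from the second chart in the question.
script_text = """
series: [{
    name: 'Series A',
    data: [0, 3, 1, 8]
}, {
    name: 'Series B',
    data: [0, 3, 1, 7]
}]
"""

names = [literal_eval(v) for v in re.findall(r"name: ('.*?')", script_text)]
data = [literal_eval(v) for v in re.findall(r"data: (\[.*?\])", script_text)]

# zip pairs the n-th name with the n-th data list.
# A list of pairs is used instead of a dict so that duplicate names are not lost.
series = list(zip(names, data))

for name, values in series:
    print(name, values)
```

A list of pairs is chosen here because the two series in the question share the same name, and a dict keyed by name would keep only the last one.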
Answered By - Martin Evans