Issue
I'm working on a side project to scrape CrossFit workouts from a website that posts them daily.
When I parse the content, the output is truncated: it doesn't return the entire workout description.
I tried both lxml and BeautifulSoup and got the same result.
import requests
from lxml import html

response = requests.get('https://www.crossfitinvictus.com/wod/august-8-2023-performance')
tree = html.fromstring(response.text)
# Read the content attribute of the Open Graph description meta tag
description = tree.xpath('//meta[@property="og:description"]/@content')
print(description)
This is the output:
['Warm-up Two sets of: Assault Bike x 2 minutes Plank x 1 minute Bottom Squat KB Goblet Hold x 30 seconds Two sets of: Banded Face Pull x 20 reps Air Squat x 10 reps A. Every 2 minutes, for 12 minutes (6 sets): Front Squat *Set 1 – 3 reps @ 70% *Set 2…']
The lines that follow the "Set 2…" text are missing.
Is this a website server issue?
I get the same output when running the code in both VS Code and Jupyter Notebook.
Solution
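The og:description meta tag appears to hold only a shortened summary of the post (note the trailing "…" inside the attribute value), so the parsers are returning exactly what the page contains. To get the full content, read the text of the element with class="entry-content":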
import requests
from bs4 import BeautifulSoup
url = "https://www.crossfitinvictus.com/wod/august-8-2023-performance/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one(".entry-content").get_text(separator="\n", strip=True))
Prints:
Warm-up
Two sets of:
Assault Bike x 2 minutes
Plank x 1 minute
Bottom Squat KB Goblet Hold
...all the way to:
Barbell should start from the ground.
Compare results to April 4, 2023.
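As a rough offline illustration of the same idea, the sketch below runs against a small made-up HTML string instead of the live site. The SAMPLE_HTML markup and the extract_workout helper are hypothetical names introduced here, not part of the original answer. The helper prefers the .entry-content element and falls back to the og:description summary only when that element is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, made up for illustration. It only mimics the
# two places where the workout text shows up: a truncated meta summary and
# the full post body.
SAMPLE_HTML = """
<html>
  <head>
    <meta property="og:description" content="Warm-up Two sets of: Assault Bike x 2 minutes…">
  </head>
  <body>
    <div class="entry-content">
      <p>Warm-up</p>
      <p>Two sets of:</p>
      <p>Assault Bike x 2 minutes</p>
      <p>Plank x 1 minute</p>
    </div>
  </body>
</html>
"""


def extract_workout(html_text):
    """Return the full post text, or the meta summary if no body is found."""
    soup = BeautifulSoup(html_text, "html.parser")

    # Preferred source: the post body, one line per text node.
    body = soup.select_one(".entry-content")
    if body is not None:
        return body.get_text(separator="\n", strip=True)

    # Fallback: the Open Graph summary, which may be cut off by the site.
    meta = soup.select_one('meta[property="og:description"]')
    if meta is not None and meta.get("content"):
        return meta["content"]

    return None


print(extract_workout(SAMPLE_HTML))
```

With the live page, the same helper could be called on requests.get(url).content, assuming the site keeps using the entry-content class for the post body.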
Answered By - Andrej Kesely