Issue
I'm trying to download all the PGNs from this site.
I think I have to use urlopen
to open each url and then use urlretrieve
to download each pgn by accessing it from the download button near the bottom of each game. Do I have to create a new BeautifulSoup
object for each game? I'm also unsure of how urlretrieve
works.
import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup
url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
urlopen('http://chessgames.com'+link.get('href'))
Solution
There is no short answer to your question. I will show you a complete solution and comment this code.
First, import necessary modules:
from bs4 import BeautifulSoup
import requests
import re
Next, get index page and create BeautifulSoup
object:
req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")
I strongly advice to use lxml
parser, not common html.parser
After that, you should prepare game's links list:
pages = soup.findAll('a', href=re.compile('.*chessgame\?.*'))
You can do it by searching links containing 'chessgame' word in it. Now, you should prepare function which will download files for you:
def download_file(url):
path = url.split('/')[-1].split('?')[0]
r = requests.get(url, stream=True)
if r.status_code == 200:
with open(path, 'wb') as f:
for chunk in r:
f.write(chunk)
And final magic is to repeat all previous steps preparing links for file downloader:
host = 'http://www.chessgames.com'
for page in pages:
url = host + page.get('href')
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
file_link = soup.find('a',text=re.compile('.*download.*'))
file_url = host + file_link.get('href')
download_file(file_url)
(first you search links containing text 'download' in their description, then construct full url - concatenate hostname and path, and finally download file)
I hope you can use this code without correction!
Answered By - Roman Mindlin
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.