Tuesday, October 12, 2021

[FIXED] Python download multiple files from links on pages

Labels: beautifulsoup, python, python-3.x, urllib

Issue

I'm trying to download all the PGNs from this site.

I think I have to use urlopen to open each url and then use urlretrieve to download each pgn by accessing it from the download button near the bottom of each game. Do I have to create a new BeautifulSoup object for each game? I'm also unsure of how urlretrieve works.

import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com'+link.get('href'))

Solution

There is no short answer to your question, so I will show a complete solution and comment on the code.

First, import the necessary modules:

from bs4 import BeautifulSoup
import requests
import re

Next, fetch the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly advise using the lxml parser rather than the built-in html.parser (lxml is a separate package and must be installed). After that, prepare the list of links to the games:

# Raw string avoids an invalid-escape warning for '\?'; find_all is the current name of findAll.
pages = soup.find_all('a', href=re.compile(r'chessgame\?'))
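
Beautiful Soup tests a compiled pattern against each attribute value with the pattern's search() method, so a match can occur anywhere in the href. The following is a minimal sketch of that filtering step using only the re module. The href values are hypothetical and only imitate the shape of the links on the collection page.

```python
import re

# Hypothetical href values, similar in shape to those on a collection page.
hrefs = [
    "/perl/chessgame?gid=1000001",
    "/perl/chesscollection?cid=1014492",
    "/perl/chessplayer?pid=12345",
    None,  # an <a> tag that has no href attribute
]

# Raw string, so the backslash reaches the regex engine untouched.
pattern = re.compile(r"chessgame\?")

# search() matches anywhere in the string, so no leading or trailing '.*' is needed.
game_links = [h for h in hrefs if h is not None and pattern.search(h)]
print(game_links)
```

Only the first entry is kept, because it is the only one containing the literal text 'chessgame?'.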

The regular expression selects the links whose href contains 'chessgame'. Next, write a function that downloads a file for you:

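def download_file(url):
    # Use the last path segment, without the query string, as the file name.
    path = url.split('/')[-1].split('?')[0]
    # stream=True avoids loading the whole response into memory at once.
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)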
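
The file name above comes from splitting the URL on '/' and '?'. As a possible alternative, here is a small sketch that derives the name with urllib.parse instead. The helper name filename_from_url and both URLs are hypothetical, chosen only to show where the two approaches differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring query and fragment."""
    return PurePosixPath(urlsplit(url).path).name


# Hypothetical URLs, for illustration only.
simple = "http://www.example.com/pgn/game_1972.pgn?gid=1"
tricky = "http://www.example.com/pgn/game_1972.pgn?next=/home"

print(filename_from_url(simple))            # name taken from the path part
print(filename_from_url(tricky))            # a '/' in the query is ignored
print(tricky.split('/')[-1].split('?')[0])  # plain splitting picks up 'home' here
```

Both approaches agree on the simple URL. They differ only when the query string itself contains a slash, which may never happen on the site in question.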

The final step is to repeat the previous steps for each game page and pass the resulting link to the file downloader:

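host = 'http://www.chessgames.com'
for page in pages:
    # Fetch and parse each game page.
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    # Find the link whose text contains 'download' (string= is the newer name for text=).
    file_link = soup.find('a', string=re.compile('download'))
    if file_link is None:
        continue  # no download link on this page
    file_url = host + file_link.get('href')
    download_file(file_url)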

(First, search for the link whose text contains 'download'; then build the full URL by concatenating the host name and the path; finally, download the file.)
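
Concatenating the host name and the path works when every href starts with '/'. If that is not guaranteed, urllib.parse.urljoin from the standard library is one way to build the full URL. The sketch below uses the collection URL from the question as the base; the href values are hypothetical.

```python
from urllib.parse import urljoin

# The page the links were found on (URL taken from the question).
page_url = "http://www.chessgames.com/perl/chesscollection?cid=1014492"

# Hypothetical href values in three common shapes.
root_relative = urljoin(page_url, "/perl/chessgame?gid=1000001")
page_relative = urljoin(page_url, "chessgame?gid=1000001")
absolute = urljoin(page_url, "http://www.example.com/file.pgn")

print(root_relative)
print(page_relative)
print(absolute)
```

The first two calls resolve to the same address on www.chessgames.com, and an href that is already absolute is returned unchanged.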

I hope you can use this code without correction!

Answered By - Roman Mindlin

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
