Issue
So I started with Python a few days ago and tried to make a function that gives me all subpages of a website. I know it may not be the most elegant function, but I was pretty proud to see it working. For some reason unknown to me, though, the function does not work anymore. I could've sworn I haven't changed it since it last worked, but after hours of attempted debugging I am slowly doubting myself. Could you take a look at why my function no longer outputs to a .txt file? I just get handed an empty text file, though if I delete it, it at least creates a new (empty) one.
I tried moving the part that saves the strings out of the try block, which didn't work. I also tried all_urls.flush() to maybe save everything. I restarted the PC in the hope that something in the background had been accessing the file and preventing me from writing to it. I also renamed the file it is supposed to save to, so as to generate something truly fresh. Still the same problem. I also checked that the link from the loop is passed as a string, so that shouldn't be a problem. I also tried:
print(link, file=all_urls, end='\n')
as a replacement for
all_urls.write(link)
all_urls.write('\n')
with no result.
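For reference, here is a tiny standalone snippet (made-up filename, not my actual function) of how I understand file writes are supposed to behave, i.e. that buffered writes only land in the file on flush() or close():
# throwaway demo file, not the real all_urls.txt
f = open('demo.txt', 'w')
f.write('some link\n')   # goes into Python's write buffer first
f.flush()                # forces the buffered line out to the file
f.close()                # close() also flushes whatever is left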
My full function:
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links) > 0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                print(type(link))
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the .txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage to templinks
                for sublink in soup.findAll('a'):
                    templinks.append(sublink.get('href'))
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we still have the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    if templink.find(url) == 0 and templink not in links:
                        links.append(templink)
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR .txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links
Solution
I can't reproduce this, but I've had inexplicable [to me at least] errors with file handling that were resolved when I wrote from inside a with statement.
[Just make sure to remove the lines involving all_urls in your current code first, just in case - or just try this with a different filename while checking if it works.]
Since you're appending all the urls to tested_links anyway, you could just write it all at once after the while loop:
with open('all_urls.txt', 'w') as f:
    f.write('\n'.join(tested_links) + '\n')
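Note that this only touches the file once the while loop has finished, so if the crawl is interrupted partway through, nothing gets written at all; the link-by-link variant below avoids that.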
or, if you have to write link by link, you can append by opening with mode='a':
# before the while, if you're not sure the file exists
# [and/or to clear previous data from file]
# with open('all_urls.txt', 'w') as f: f.write('')
# and inside the try block:
with open('all_urls.txt', 'a') as f:
    f.write(f'{link}\n')
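If it helps, here's a rough sketch (untested, same names and crawl logic as in your question) of what the whole function could look like with both files handled by a single with block, so they're flushed and closed even if something inside the loop raises:
def get_subpages(url):
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # both files are opened by the with block, so they are flushed and
    # closed automatically, even if the loop below raises
    with open('all_urls.txt', 'w') as all_urls, \
         open('problematic_pages.txt', 'w') as problematic_pages:
        while len(to_test_links) > 0:
            for link in to_test_links:
                tested_links.append(link)
                try:
                    all_urls.write(link + '\n')
                    soup = BeautifulSoup(urlopen(Request(link)), features="html.parser")
                    # collect hrefs, dropping None values
                    templinks = [a.get('href') for a in soup.findAll('a') if a.get('href') is not None]
                    for templink in templinks:
                        # stay on the same site and avoid duplicates
                        if templink.find(url) == 0 and templink not in links:
                            links.append(templink)
                    to_test_links = list(set(links) ^ set(tested_links))
                except Exception:
                    # anything that fails ends up in the error file instead
                    problematic_pages.write(link + '\n')
    return links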
Answered By - Driftr95