Issue
So I started with Python a few days ago and tried to make a function that gives me all subpages of a website. I know it may not be the most elegant function, but I was pretty proud to see it working. For some reason unknown to me, though, the function does not work anymore. I could've sworn I haven't changed it since it last worked, but after hours of attempted debugging I am slowly doubting myself. Could you take a look at why my function no longer outputs to a .txt file? I just get handed an empty text file, though if I delete it, it at least creates a new (empty) one.
I tried moving the part that saves the strings out of the try block, which didn't work. I also tried all_urls.flush() to maybe save everything. I restarted the PC in the hope that something in the background had been accessing the file and preventing me from writing to it. I also renamed the file it is supposed to save to, so as to generate something truly fresh. Still the same problem. I also checked that the link from the loop is passed as a string, so that shouldn't be a problem. I also tried:
print(link, file=all_urls, end='\n')
as a replacement for
all_urls.write(link)
all_urls.write('\n')
with no result.
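For reference, here is a tiny standalone snippet (made-up filename, not my actual function) of how I understand file writes are supposed to behave, i.e. that buffered writes only land in the file on flush() or close():
# throwaway demo file, not the real all_urls.txt
f = open('demo.txt', 'w')
f.write('some link\n')   # goes into Python's write buffer first
f.flush()                # forces the buffered line out to the file
f.close()                # close() also flushes whatever is left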
My full function:
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links) > 0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                print(type(link))
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the .txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage to templinks
                for sublink in soup.findAll('a'):
                    templinks.append(sublink.get('href'))
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we still have the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    if templink.find(url) == 0 and templink not in links:
                        links.append(templink)
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR .txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links
Solution
I can't reproduce this, but I've had inexplicable [to me at least] errors with file handling that were resolved when I wrote from inside a with statement.
[Just make sure to remove the lines involving all_urls in your current code first, just in case - or just try this with a different filename while checking if it works.]
Since you're appending all the urls to tested_links anyway, you could just write it all at once after the while loop:
with open('all_urls.txt', 'w') as f:
    f.write('\n'.join(tested_links) + '\n')
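Note that this only touches the file once the while loop has finished, so if the crawl is interrupted partway through, nothing gets written at all; the link-by-link variant below avoids that.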
or, if you have to write link by link, you can append by opening with mode='a':
# before the while, if you're not sure the file exists
# [and/or to clear previous data from file]
# with open('all_urls.txt', 'w') as f: f.write('')
# and inside the try block:
with open('all_urls.txt', 'a') as f:
    f.write(f'{link}\n')
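If it helps, here's a rough sketch (untested, same names and crawl logic as in your question) of what the whole function could look like with both files handled by a single with block, so they're flushed and closed even if something inside the loop raises:
def get_subpages(url):
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # both files are opened by the with block, so they are flushed and
    # closed automatically, even if the loop below raises
    with open('all_urls.txt', 'w') as all_urls, \
         open('problematic_pages.txt', 'w') as problematic_pages:
        while len(to_test_links) > 0:
            for link in to_test_links:
                tested_links.append(link)
                try:
                    all_urls.write(link + '\n')
                    soup = BeautifulSoup(urlopen(Request(link)), features="html.parser")
                    # collect hrefs, dropping None values
                    templinks = [a.get('href') for a in soup.findAll('a') if a.get('href') is not None]
                    for templink in templinks:
                        # stay on the same site and avoid duplicates
                        if templink.find(url) == 0 and templink not in links:
                            links.append(templink)
                    to_test_links = list(set(links) ^ set(tested_links))
                except Exception:
                    # anything that fails ends up in the error file instead
                    problematic_pages.write(link + '\n')
    return links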
Answered By - Driftr95