Issue
I'm very new to Python and I'm trying to write a program that extracts the text inside HTML tags (without the tags themselves) and writes it to a separate text file for later analysis. I referred to this and this as well, and was able to come up with the code below. But how can I write this as a separate function? Something like
def read_list('file1.txt')
that then does the same scraping? The reason I'm asking is that the output of this code (op1.txt) will be used for stemming and then for other calculations afterwards. The output also isn't written line by line as intended. Thank you very much for any input!
f = open('file1.txt', 'r')
for line in f:
    url = line
    html = urlopen(url)
    bs = BeautifulSoup(html, "html.parser")
    content = bs.find_all(['title','h1', 'h2','h3','h4','h5','h6','p'])
    with open('op1.txt', 'w', encoding='utf-8') as file:
        file.write(f'{content}\n\n')
        file.close()
Solution
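Wrap the loop in a function that takes the file name as a parameter. Open op1.txt once, before the loop, because mode 'w' truncates the file every time it is opened. Join the .text of each matched element so the output contains one element's text per line instead of the printed list of tags.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # Tags whose text we want: title, p, and h1 through h6
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    # Open the output once so every URL's text ends up in the file
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as out:
        for line in f:
            url = line.strip()  # drop the trailing newline
            if not url:
                continue  # skip blank lines
            html = urlopen(url).read().decode("utf8")
            bs = BeautifulSoup(html, "html.parser")
            # One element's text per line, without the surrounding tags
            content = '\n'.join(x.text for x in bs.find_all(tags))
            out.write(f'{content}\n\n')

read_list('file1.txt')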
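One likely reason op1.txt ends up with less than expected is the file mode: opening a file with 'w' truncates it, so reopening it inside the loop, as the question's code does, keeps only what was written last. The sketch below shows that effect with the standard library only. The function names and the `pages` list are hypothetical stand-ins for the text scraped from each URL; nothing here touches the network.

```python
import os
import tempfile

def write_reopening_each_time(path, chunks):
    # Mirrors the question's loop: mode 'w' truncates the file on every open,
    # so each iteration wipes what the previous one wrote.
    for chunk in chunks:
        with open(path, 'w', encoding='utf-8') as out:
            out.write(f'{chunk}\n\n')

def write_opening_once(path, chunks):
    # Open once, before the loop, and keep writing to the same handle.
    with open(path, 'w', encoding='utf-8') as out:
        for chunk in chunks:
            out.write(f'{chunk}\n\n')

def read_back(path):
    with open(path, encoding='utf-8') as f:
        return f.read()

# Hypothetical stand-ins for the text extracted from three pages
pages = ['text of page 1', 'text of page 2', 'text of page 3']

with tempfile.TemporaryDirectory() as tmp:
    bad_path = os.path.join(tmp, 'bad.txt')
    good_path = os.path.join(tmp, 'good.txt')
    write_reopening_each_time(bad_path, pages)
    write_opening_once(good_path, pages)
    bad = read_back(bad_path)
    good = read_back(good_path)

print(repr(bad))   # only the last page survives
print(repr(good))  # all three pages, separated by blank lines
```

Opening with mode 'a' inside the loop would also keep every page, but it would keep appending across runs of the script as well, which is probably not what you want for a file that feeds a later stemming step.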
Answered By - Wasif