Issue
I'm starting to work with Python again after 8 years. I'm trying to write a program with BeautifulSoup that takes an array argument. I pass the array argument medios to the URL function count_words, but it doesn't work. Is there a way to fix it, or to search for a word in multiple websites using BeautifulSoup?
import requests
from bs4 import BeautifulSoup

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    # print(words)
    return len(words)

def main():
    url = 'https://www.nytimes.com/'
    medios = {
        'Los Angeles Times': ['http://www.latimes.com/'],
        'New York Times': ['http://www.nytimes.com/']
    }
    word = 'Trump'
    # count = count_words(url, word)
    cuenta = count_words(medios, word)
    # print('\n El Sitio: {}\n Contiene {} occurrencias de la palabra: {}'.format(url, count, word))
    print('\n La palabra: {} aparece {} occurrencias en el New York Times'.format(word, cuenta))

if __name__ == '__main__':
    main()
Solution
There are 3 problems here:
- medios is a dict, but count_words only accepts a single URL string. You have to loop through its keys and values and call the function once per URL.
- BeautifulSoup's find method returns only the first match, or None when nothing matches, so len(words) gives the length of that one string (or fails on None) rather than the number of occurrences. If you want to count how many times the word appears, use count on the string.
- You have to send a User-Agent header with the request, otherwise you may get a 403 or 301 response.
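To make the second point concrete, here is a small offline sketch of how find, find_all, and count behave. The HTML snippet is made up for illustration, and it uses string=, which is the current name of the text= argument from the question.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (an assumption for illustration).
html = (
    "<html><body>"
    "<h1>Trump speaks</h1>"
    "<p>Trump and Trump again</p>"
    "<p>Nothing here</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def has_word(s):
    """Return True for text nodes that contain the word."""
    return s is not None and "Trump" in s


# find returns only the FIRST matching text node, so len() is its character length.
first = soup.find(string=has_word)

# find_all returns every matching text node; count then tallies the word in each.
matches = soup.find_all(string=has_word)
total = sum(s.count("Trump") for s in matches)

# When nothing matches, find returns None, and len(None) would raise TypeError.
missing = soup.find(string=lambda s: s is not None and "Biden" in s)

print(repr(first), len(first))
print(len(matches), total)
print(missing)
```

Here len(first) is 12, the length of "Trump speaks", while the word actually appears 3 times across 2 text nodes.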
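import requests
from bs4 import BeautifulSoup  # kept from the question; not used by this raw-text count

# Browser-like User-Agent so the sites do not reject the request
headers = {'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}

def count_words(url, the_word):
    # Download the page; the timeout keeps a slow site from hanging the script
    r = requests.get(url, headers=headers, timeout=10)
    # Lowercase both sides so the match is case-insensitive
    return r.text.lower().count(the_word.lower())

def main():
    # Site name -> list of URLs to check
    medios = {
        'Los Angeles Times': ['http://www.latimes.com/'],
        'New York Times': ['http://www.nytimes.com/']
    }
    word = 'trump'
    # Call count_words once per URL, since it accepts a single URL string
    for web_name, urls in medios.items():
        for url in urls:
            cuenta = count_words(url, word)
            print('La palabra: {} aparece {} occurrencias en el {}'.format(word, cuenta, web_name))

if __name__ == '__main__':
    main()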
Output:
La palabra: trump aparece 47 occurrencias en el Los Angeles Times
La palabra: trump aparece 194 occurrencias en el New York Times
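Note that counting on r.text looks at the raw HTML, so it also picks up matches inside scripts and attributes, and inside longer words such as "trumpet". If only the visible text matters, a sketch along these lines may help. It is one possible approach rather than part of the original answer: count_visible_word is a hypothetical helper name, the sample HTML is made up, and the whole-word regex is an assumption about what should count as a match.

```python
import re

from bs4 import BeautifulSoup


def count_visible_word(html, word):
    """Count whole-word, case-insensitive matches in the page's visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so their contents are not counted.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # \b anchors keep "trump" from matching inside "trumpet".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


# Made-up page used instead of a live request (an assumption for illustration).
sample_html = """
<html><head><script>var trump = "trump";</script></head>
<body><h1>Trump speaks</h1><p>A trumpet played as Trump left.</p></body></html>
"""

raw_count = sample_html.lower().count("trump")
visible_count = count_visible_word(sample_html, "trump")
print(raw_count, visible_count)
```

On this sample the raw count is 5 while the visible whole-word count is 2. To use it with the code above, pass r.text as the html argument.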
Answered By - bigbounty