I have made a very simple web scraper which iterates through a list of links and scrapes the text and dates from them and outputs this into a text file.
So far everything has been working fine but I have received an error which only occurs for certain links.
For reference, every link points to the same website (Yahoo Finance), which uses the same HTML structure on each page.
Below is an example of the issue I am running into. The function successfully retrieves the date from the first link, but fails on the second one:
Below is the entire code for the web scraper:
import datetime

import requests
import yfinance as yf
from bs4 import BeautifulSoup


def unique_filename():
    # The right-hand side was lost in the source; datetime.datetime.now() is assumed.
    current_date = datetime.datetime.now()
    day_month_string = current_date.strftime("%S-%M-%H-%d")
    filename = day_month_string
    return filename


def get_currency_info(name):
    currency = yf.Ticker(name)
    history = currency.history(period="1mo")
    return history


def scrape_all(list, curr):
    filename = unique_filename()
    # Open the output file once and reuse the handle for every link.
    with open(filename, 'a') as file:
        for link in list:
            print("=======================NEW LINK========================")
            source = requests.get(link).text
            soup = BeautifulSoup(source, 'lxml')
            # This is the line that fails for some links.
            date = soup.find(class_='caas-attr-time-style').find('time')
            dateToAppend = (date.get_text().split('at'))[0]
            file.write(f"Date: {dateToAppend}")
            texts = soup.find_all('p')
            for text in texts:
                text_to_append = text.get_text()
                # The write call was lost in the source; this line is assumed.
                file.write(text_to_append)


if __name__ == "__main__":
    listOfLinks = []  # the links were omitted in the source
    scrape_all(listOfLinks, "BTC")
    print("=======================LIST END========================")
Based on your comment, the page loads in your browser but not through requests.
You can pass a User-Agent header to requests.get() so that the site treats the request as coming from a browser rather than a bot:
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36' }
source = requests.get(link, headers=HEADERS).text
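Since the error page shown further down reports a 404, it may also help to check the response status before parsing. The following is only a sketch: fetch_html and its get parameter are hypothetical names introduced here, the User-Agent value is a shortened placeholder, and it assumes the server really answers the failing links with a non-200 status code.

```python
import requests

# Shortened placeholder; substitute a full browser User-Agent string.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, get=requests.get):
    """Return the page HTML, or None when the server does not answer 200 OK.

    `get` is injectable only so the function can be exercised without
    touching the network; it defaults to requests.get.
    """
    response = get(url, headers=HEADERS, timeout=10)
    if response.status_code != 200:
        return None
    return response.text
```

Inside the loop, a link for which fetch_html(link) returns None can then be skipped before BeautifulSoup is ever called.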
Original Answer
I think the error is clear:
AttributeError: 'NoneType' object has no attribute 'find'
This means that, in your code:
date = soup.find(class_='caas-attr-time-style').find('time')
the call soup.find(class_='caas-attr-time-style') is returning None, so the chained .find('time') has nothing to operate on.
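The behaviour can be reproduced without any network access. This is a minimal sketch using a made-up HTML snippet that lacks the class in question; the built-in html.parser is used here instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup with no element of class 'caas-attr-time-style'.
html = "<div><p>Will be right back...</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches.
container = soup.find(class_="caas-attr-time-style")
print(container)  # None

# Chaining another .find() onto None raises the error from the question.
try:
    container.find("time")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```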
To investigate the problem, I caught the exception and printed out the soup to see what was going on:
soup = BeautifulSoup(source, 'lxml')
try:
    date = soup.find(class_='caas-attr-time-style').find('time')
except AttributeError:
    print("Attribute Error\n", soup.prettify())
which prints:
<!-- status code : 404 -->
<!-- Not Found on Server -->
<img alt="Yahoo Logo" src=""/>
<h1 style="margin-top:20px;">
Will be right back...
<p id="message-1">
Thank you for your patience.
<p id="message-2">
Our engineers are working quickly to resolve the issue.
So, from the above HTML it seems like the page is currently down.
To solve the problem, you can use a try/except block and fall back to a placeholder if the value doesn't exist:
for link in list:
    print("=======================NEW LINK========================")
    source = requests.get(link).text
    soup = BeautifulSoup(source, 'lxml')
    try:
        date = soup.find(class_='caas-attr-time-style').find('time')
        dateToAppend = (date.get_text().split('at'))[0]
    except AttributeError:
        # The element is missing, e.g. because an error page was returned.
        dateToAppend = 'No date found'
    with open(filename, 'a') as file:
        file.write(f"Date: {dateToAppend}")
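An alternative to catching the exception is to test for None explicitly at each step. The sketch below is one way to do that, not the answer's original method: extract_date and its default parameter are hypothetical names, the sample markup is invented, and html.parser stands in for lxml to keep the example dependency-free.

```python
from bs4 import BeautifulSoup


def extract_date(soup, default="No date found"):
    """Return the date portion of the article timestamp, or `default`."""
    container = soup.find(class_="caas-attr-time-style")
    # Only look for <time> when the container itself was found.
    time_tag = container.find("time") if container is not None else None
    if time_tag is None:
        return default
    # Same split as in the question, with surrounding whitespace removed.
    return time_tag.get_text().split("at")[0].strip()


# Invented sample pages: one with the expected element, one without.
good = BeautifulSoup(
    '<div class="caas-attr-time-style"><time>May 1, 2021 at 10:00 AM</time></div>',
    "html.parser",
)
bad = BeautifulSoup("<h1>Will be right back...</h1>", "html.parser")

print(extract_date(good))  # May 1, 2021
print(extract_date(bad))   # No date found
```

The explicit checks keep an unrelated AttributeError elsewhere in the block from being silently swallowed, at the cost of a few more lines.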
Answered By - MendelG