Issue
I'm new to web scraping and am having an issue with .find_all: when I run the code below, it says the object has no attribute find_all. I was able to scrape the first page of https://books.toscrape.com/ successfully, but when I tried to scrape the second page for testing it threw "AttributeError: 'NoneType' object has no attribute 'find_all'". I tried scraping all fifty pages of the website as well, and got the same error. Can someone please help me scrape all 50 pages?
The code to scrape the 2nd page:
import requests
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/catalogue/page-2.html'  # second page of the catalogue
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# find all the book titles and their links under the h3 tags
books = soup.find_all('h3')
book_extracted = 0

# iterate through the books and extract the information for each book
for book in books:
    book_url = book.find('a')['href']  # grab the href attribute of the book link
    book_response = requests.get(url + book_url)  # request the book's detail page
    book_soup = BeautifulSoup(book_response.content, 'html.parser')
    title = book_soup.find('h1').text  # extract the title of the book
    category = book_soup.find('ul', class_='breadcrumb').find_all('a')[2].text.strip()
    rating = book_soup.find('p', class_='star-rating')['class'][1]  # two classes: star-rating and the rating word
    price = book_soup.find('p', class_='price_color').text.strip()
    availability = book_soup.find('p', class_='availability').text.strip()
    book_extracted += 1
    print(f"Title: {title}")
    print(f"Category: {category}")
    print(f"Rating: {rating}")
    print(f"Price: {price}")
    print(f"Availability: {availability}")
    print("***************")
The error I am getting is:
AttributeError Traceback (most recent call last)
Cell In[80], line 18
14 book_soup = BeautifulSoup(book_response.content,'html.parser')
17 title = book_soup.find('h1').text #extracting title of the 1st book
---> 18 category = book_soup.find('ul',class_='breadcrumb').find_all('a')[2].text.strip() #extracting category
19 rating = book_soup.find('p',class_='star-rating')['class'][1] #having two classes star_rating and rating(three)
20 price = book_soup.find('p',class_='price_color').text.strip()
AttributeError: 'NoneType' object has no attribute 'find_all'
And the loop to scrape all 50 pages:
books_data = []

# loop through all 50 pages
for page_num in range(1, 51):
    url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # find all the book titles and their links under the h3 tags of the current page
    books = soup.find_all('h3')
    for book in books:
        book_url = book.find('a')['href']  # grab the href attribute of the book link
        book_response = requests.get(url + book_url)  # request the book's detail page
        book_soup = BeautifulSoup(book_response.content, 'html.parser')
        title = book_soup.find('h1').text  # extract the title of the book
        category = book_soup.find('ul', class_='breadcrumb').find_all('a')[2].text.strip()
        rating = book_soup.find('p', class_='star-rating')['class'][1]
        price = book_soup.find('p', class_='price_color').text.strip()
        availability = book_soup.find('p', class_='availability').text.strip()
        # append the extracted data to the list
        books_data.append([title, category, rating, price, availability])

print(books_data)
The error when scraping all 50 pages:
AttributeError Traceback (most recent call last)
Cell In[82], line 20
16 book_soup = BeautifulSoup(book_response.content,'html.parser')
19 title = book_soup.find('h1').text #extracting title of the 1st book
---> 20 category = book_soup.find('ul',class_='breadcrumb').find_all('a')[2].text.strip() #extracting category
21 rating = book_soup.find('p',class_='star-rating')['class'][1] #having two classes star_rating and rating(three)
22 price = book_soup.find('p',class_='price_color').text.strip()
AttributeError: 'NoneType' object has no attribute 'find_all'
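A quick way to see what is going wrong here is to print the URL the inner request actually fetches. The concatenation url + book_url glues the relative href onto the full page address, producing a URL that does not exist; the site then serves its "404 Not Found" page, which still contains an h1 (so the title line succeeds) but no breadcrumb ul, so find() returns None. A minimal sketch, assuming the page-2 URL from the question and one representative relative href of the kind the catalogue's h3 tags contain:

import requests

# Assumed example values: the page-2 URL and a representative relative href.
url = 'https://books.toscrape.com/catalogue/page-2.html'
book_url = 'a-light-in-the-attic_1000/index.html'

# The same concatenation the code above performs.
full_url = url + book_url
response = requests.get(full_url)
print(full_url)              # ...catalogue/page-2.htmla-light-in-the-attic_1000/index.html
print(response.status_code)  # 404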
Solution
I have changed three lines here. One is creating a baseurl constant, which holds the root URL upon which all the other URLs are based. Two is changing the url = line to build from that. Three is changing the second requests.get call to build the book URL from the base.
books_data = []
baseurl = 'https://books.toscrape.com/catalogue/'

# loop through all 50 pages
for page_num in range(1, 51):
    url = baseurl + f'page-{page_num}.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # find all the book titles and their links under the h3 tags of the current page
    books = soup.find_all('h3')
    for book in books:
        book_url = book.find('a')['href']  # grab the href attribute of the book link
        book_response = requests.get(baseurl + book_url)  # build the book URL from the base
        book_soup = BeautifulSoup(book_response.content, 'html.parser')
        title = book_soup.find('h1').text  # extract the title of the book
        category = book_soup.find('ul', class_='breadcrumb').find_all('a')[2].text.strip()
        rating = book_soup.find('p', class_='star-rating')['class'][1]  # second class is the rating word
        price = book_soup.find('p', class_='price_color').text.strip()
        availability = book_soup.find('p', class_='availability').text.strip()
        # append the extracted data to the list
        books_data.append([title, category, rating, price, availability])

print(books_data)
This is just how HTML works. All of the links on this page are "relative" URLs: they do not start with http://. Instead, they are resolved relative to the page being displayed. That means you have to build those URLs from the directory that CONTAINED the page. You were trying to build the URL from the name of the specific page being displayed. That's wrong.
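If you would rather not track the base directory by hand, the standard library's urllib.parse.urljoin resolves a relative href against the URL of the page it appeared on, the same way a browser does. A sketch of how it could replace the string concatenation (the example href is an assumed one, as above):

from urllib.parse import urljoin

page_url = 'https://books.toscrape.com/catalogue/page-2.html'
book_url = 'a-light-in-the-attic_1000/index.html'  # assumed relative href from an h3

# urljoin drops the final path segment ('page-2.html') and resolves the
# relative reference against the containing directory ('/catalogue/').
print(urljoin(page_url, book_url))
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

Inside the loop, that would be requests.get(urljoin(url, book_url)), which keeps working even if the links later become absolute or the directory layout changes.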
Answered By - Tim Roberts