Issue
I'm new to web scraping and am having an issue with .find_all: when I run the code below, it says the object has no attribute find_all. I was able to scrape the first page of https://books.toscrape.com/ successfully, but when I tried to scrape the second page for testing it threw "AttributeError: 'NoneType' object has no attribute 'find_all'". I tried scraping all fifty pages of the website as well, and got the same error. Can someone please help me scrape all 50 pages?
The code to scrape the 2nd page:
import requests
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/catalogue/page-2.html'  # second page of the catalogue
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# find all the book titles and their links under the h3 tags
books = soup.find_all('h3')
book_extracted = 0

# iterate through the books and extract the information for each book
for book in books:
    book_url = book.find('a')['href']  # grab the href attribute of the book link
    book_response = requests.get(url + book_url)  # request the book's detail page
    book_soup = BeautifulSoup(book_response.content, 'html.parser')
    title = book_soup.find('h1').text  # extract the title of the book
    category = book_soup.find('ul', class_='breadcrumb').find_all('a')[2].text.strip()
    rating = book_soup.find('p', class_='star-rating')['class'][1]  # two classes: star-rating and the rating word
    price = book_soup.find('p', class_='price_color').text.strip()
    availability = book_soup.find('p', class_='availability').text.strip()
    book_extracted += 1
    print(f"Title: {title}")
    print(f"Category: {category}")
    print(f"Rating: {rating}")
    print(f"Price: {price}")
    print(f"Availability: {availability}")
    print("***************")
The error I am getting is:
AttributeError Traceback (most recent call last)
Cell In[80], line 18
14 book_soup = BeautifulSoup(book_response.content,'html.parser')
17 title = book_soup.find('h1').text #extracting title of the 1st book
---> 18 category = book_soup.find('ul',class_='breadcrumb').find_all('a')[2].text.strip() #extracting category
19 rating = book_soup.find('p',class_='star-rating')['class'][1] #having two classes star_rating and rating(three)
20 price = book_soup.find('p',class_='price_color').text.strip()
AttributeError: 'NoneType' object has no attribute 'find_all'
And the loop to scrape all 50 pages:
books_data = []

# loop through all 50 pages
for page_num in range(1, 51):
    url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # find all the book titles and their links under the h3 tags of the current page
    books = soup.find_all('h3')
    for book in books:
        book_url = book.find('a')['href']  # grab the href attribute of the book link
        book_response = requests.get(url + book_url)  # request the book's detail page
        book_soup = BeautifulSoup(book_response.content, 'html.parser')
        title = book_soup.find('h1').text  # extract the title of the book
        category = book_soup.find('ul', class_='breadcrumb').find_all('a')[2].text.strip()
        rating = book_soup.find('p', class_='star-rating')['class'][1]
        price = book_soup.find('p', class_='price_color').text.strip()
        availability = book_soup.find('p', class_='availability').text.strip()
        # append the extracted data to the list
        books_data.append([title, category, rating, price, availability])

print(books_data)
The error when scraping all 50 pages:
AttributeError Traceback (most recent call last)
Cell In[82], line 20
16 book_soup = BeautifulSoup(book_response.content,'html.parser')
19 title = book_soup.find('h1').text #extracting title of the 1st book
---> 20 category = book_soup.find('ul',class_='breadcrumb').find_all('a')[2].text.strip() #extracting category
21 rating = book_soup.find('p',class_='star-rating')['class'][1] #having two classes star_rating and rating(three)
22 price = book_soup.find('p',class_='price_color').text.strip()
AttributeError: 'NoneType' object has no attribute 'find_all'
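A quick way to see what is going wrong here is to print the URL the inner request actually fetches. The concatenation url + book_url glues the relative href onto the full page address, producing a URL that does not exist; the site then serves its "404 Not Found" page, which still contains an h1 (so the title line succeeds) but no breadcrumb ul, so find() returns None. A minimal sketch, assuming the page-2 URL from the question and one representative relative href of the kind the catalogue's h3 tags contain:

import requests

# Assumed example values: the page-2 URL and a representative relative href.
url = 'https://books.toscrape.com/catalogue/page-2.html'
book_url = 'a-light-in-the-attic_1000/index.html'

# The same concatenation the code above performs.
full_url = url + book_url
response = requests.get(full_url)
print(full_url)              # ...catalogue/page-2.htmla-light-in-the-attic_1000/index.html
print(response.status_code)  # 404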
Solution
I have changed three lines here. One is creating a baseurl constant, which holds the root URL upon which all the other URLs are based. Two is changing the url = line to build from that. Three is changing the second requests.get call to build the book URL from the base.
books_data = []
baseurl = 'https://books.toscrape.com/catalogue/'

# loop through all 50 pages
for page_num in range(1, 51):
    url = baseurl + f'page-{page_num}.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # find all the book titles and their links under the h3 tags of the current page
    books = soup.find_all('h3')
    for book in books:
        book_url = book.find('a')['href']  # grab the href attribute of the book link
        book_response = requests.get(baseurl + book_url)  # build the book URL from the base
        book_soup = BeautifulSoup(book_response.content, 'html.parser')
        title = book_soup.find('h1').text  # extract the title of the book
        category = book_soup.find('ul', class_='breadcrumb').find_all('a')[2].text.strip()
        rating = book_soup.find('p', class_='star-rating')['class'][1]  # second class is the rating word
        price = book_soup.find('p', class_='price_color').text.strip()
        availability = book_soup.find('p', class_='availability').text.strip()
        # append the extracted data to the list
        books_data.append([title, category, rating, price, availability])

print(books_data)
This is just how HTML works. All of the links on this page are "relative" URLs: they do not start with http://. Instead, they are resolved relative to the page being displayed. That means you have to build those URLs from the directory that CONTAINED the page. You were trying to build the URL from the name of the specific page being displayed. That's wrong.
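If you would rather not track the base directory by hand, the standard library's urllib.parse.urljoin resolves a relative href against the URL of the page it appeared on, the same way a browser does. A sketch of how it could replace the string concatenation (the example href is an assumed one, as above):

from urllib.parse import urljoin

page_url = 'https://books.toscrape.com/catalogue/page-2.html'
book_url = 'a-light-in-the-attic_1000/index.html'  # assumed relative href from an h3

# urljoin drops the final path segment ('page-2.html') and resolves the
# relative reference against the containing directory ('/catalogue/').
print(urljoin(page_url, book_url))
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

Inside the loop, that would be requests.get(urljoin(url, book_url)), which keeps working even if the links later become absolute or the directory layout changes.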
Answered By - Tim Roberts