Tuesday, April 5, 2022

[FIXED] What is the fix for this Error: 'NoneType' object has no attribute 'prettify'

Labels: beautifulsoup, pandas, python, web-scraping

Issue

I want to scrape this URL https://aviation-safety.net/wikibase/type/C206.

I don't understand what this error means: 'NoneType' object has no attribute 'prettify'

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request

url = 'https://aviation-safety.net/wikibase/type/C206'
req = Request(url , headers = {
                          'accept':'*/*',
                          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})

data = []

while True:
    print(url)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('tbody').prettify())[0])

    if soup.select_one('div.pagenumbers + div a[href]'):
        url = soup.select_one('div.pagenumbers + div a')['href']
    else:
        break
df = pd.concat(data)
df.to_csv('206.csv',encoding='utf-8-sig',index=False)

Solution

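The error means that soup.select_one('tbody') returned None: no element matched the selector, so there is nothing to call .prettify() on. That happens because requests.get(url) is called without your headers (the Request object you build with urllib is never used), so you're not getting the right HTML. Also, the table you're after is the second one on the page, not the first. I'd highly recommend using requests rather than urllib.request.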
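To make the failure mode concrete, here is a minimal offline sketch of what happens when a selector matches nothing. The HTML string is a made-up stand-in for an unexpected response, and prettify_or_fail is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a response that contains no table at all.
html = "<html><body><p>Access denied</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# select_one returns None when nothing matches the CSS selector.
tbody = soup.select_one("tbody")
print(tbody)  # None


def prettify_or_fail(soup, selector):
    """Return the prettified match, or raise a readable error if there is none."""
    node = soup.select_one(selector)
    if node is None:
        raise ValueError(f"no element matches {selector!r}; inspect the response body")
    return node.prettify()
```

Checking for None before calling .prettify() turns the confusing AttributeError into a message that points at the real problem: the page did not contain what the selector expected.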

So, having said that, here's how to get all the tables from all the pages:

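from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://aviation-safety.net/wikibase/type/C206'
# Send a browser-like User-Agent with every request; without it the site
# does not return the expected HTML.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

data = []
with requests.Session() as s:
    # The last link in the pagination bar holds the total number of pages.
    total_pages = int(
        BeautifulSoup(s.get(url, headers=headers).text, "lxml")
        .select("div.pagenumbers > a")[-1]
        .get_text()
    )

    for page in range(1, total_pages + 1):
        print(f"Getting page: {page}...")
        html = s.get(f"{url}/{page}", headers=headers).text
        # Recent pandas versions deprecate passing a literal HTML string to
        # read_html, so wrap it in StringIO.
        # The table we want is the second one on the page, hence [1].
        data.append(pd.read_html(StringIO(html), flavor="lxml")[1])

df = pd.concat(data)
df.to_csv('206.csv', sep=";", index=False)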
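The page-count step can be tried offline. The following is a sketch under assumptions: the sample markup is invented to mirror the div.pagenumbers > a selector used above, count_pages is a hypothetical helper, and the fallback to a single page is an added safeguard rather than something the live site is known to need.

```python
from bs4 import BeautifulSoup


def count_pages(html: str) -> int:
    """Return the number of result pages, or 1 if no pagination links are found."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("div.pagenumbers > a")
    if not links:
        # Assumption: a page without pagination links is a single-page result.
        return 1
    # As in the code above, the last link's text is taken to be the page count.
    return int(links[-1].get_text(strip=True))


# Invented sample markup, shaped to match the selector.
sample = """
<div class="pagenumbers">
  <span class="current">1</span>
  <a href="/wikibase/type/C206/2">2</a>
  <a href="/wikibase/type/C206/3">3</a>
</div>
"""

print(count_pages(sample))  # 3
print(count_pages("<p>no pagination here</p>"))  # 1
```

Guarding the empty case avoids an IndexError on [-1] if the pagination bar is missing, for instance when the request is blocked or the aircraft type has only one page of results.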

Answered By - baduker

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
