Issue
I want to scrape this website: https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production
There are two sets of links on the page: SI units and Oil Field units.
I tried to scrape the list of links for SI units and created a function called get_gas_links:
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs, SoupStrainer
import re

url = "https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production"
first_page = requests.get(url)
soup = bs(first_page.content)

def pasrse_page(link):
    print(link)
    df = pd.read_html(link, skiprows=1, headers=1)
    return df

def get_gas_links():
    glinks=[]
    gas_links = soup.find_all("a", href = re.compile("si.htm"))
    for i in gas_links:
        glinks.append("https://ens.dk/" + i.get("herf"))
    return glinks

get_gas_links()
My main goal is to scrape 3 tables from every link.
Before scraping the tables, however, I am trying to collect the list of links,
but it raises an error: TypeError: must be str, not NoneType
Solution
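The TypeError comes from i.get("herf"): the attribute name is misspelled, so .get() returns None, and concatenating a string with None fails. The regex is also looser than intended, because the unescaped . in "si.htm" matches any character. The version below takes every link from the first table on the page and reads the href attribute directly. You can validate extracted_link however you want.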
def get_gas_links():
    glinks = []
    gas_links = soup.find('table').find_all('a')
    for i in gas_links:
        extracted_link = i['href']
        # you can validate the extracted link however you want
        glinks.append("https://ens.dk/" + extracted_link)
    return glinks
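One possible way to do the validation mentioned in the comment is sketched below. It is only an illustration: the helper name to_absolute_links, the SI_PATTERN regex, and the sample hrefs are assumptions, not something taken from the page itself. It assumes the SI-unit pages end in si.htm (as the regex in the question suggests), skips missing hrefs, and uses urljoin so that relative and absolute hrefs both resolve cleanly.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://ens.dk/"
# Assumption: SI-unit pages end in "si.htm", as the question's regex suggests.
SI_PATTERN = re.compile(r"si\.html?$", re.IGNORECASE)


def to_absolute_links(hrefs, base=BASE_URL, pattern=SI_PATTERN):
    """Keep the hrefs that match `pattern` and resolve them against `base`."""
    links = []
    for href in hrefs:
        # Tag.get("href") returns None when the attribute is missing; skip those.
        if not href:
            continue
        if pattern.search(href):
            # urljoin handles leading slashes and already-absolute URLs.
            links.append(urljoin(base, href))
    return links


# Hypothetical hrefs, only to show the behaviour.
sample = [
    "/files/prod2017si.htm",
    "/files/prod2017of.htm",
    None,
    "https://ens.dk/files/prod2016si.htm",
]
si_links = to_absolute_links(sample)
```

With the soup object from the question, this could be called as to_absolute_links(a.get("href") for a in soup.find("table").find_all("a")).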
Answered By - Nazmul Hasan