Wednesday, September 14, 2022

[FIXED] How to extract all h2 texts from some URLs and store to CSV?

Labels: beautifulsoup, csv, html, python, web-scraping

Issue

I need to extract all h2 texts from some links. I tried using BeautifulSoup, but it didn't work.

I also want to write them to a CSV file.

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests
import csv

r01 = requests.get("https://www.seikatsu110.jp/library/vermin/vr_termite/23274/") 
r02 = requests.get("https://yuko-navi.com/termite-control-subsidies")


soup_content01 = BeautifulSoup(r01.content, "html.parser")
soup_content02 = BeautifulSoup(r02.content, "html.parser")

alltxt01 = soup_content01.get_text()
alltxt02 = soup_content02.get_text()

with open('h2.csv', 'w+',newline='',encoding='utf-8') as f:
    n = 0

    for subheading01 in soup_content01.find_all('h2'):
        sh01 = subheading01.get_text()

        writer = csv.writer(f, lineterminator='\n')
        writer.writerow([n, sh01])
        n += 1

    for subheading02 in soup_content02.find_all('h2'):
        sh02 = subheading02.get_text()

        writer = csv.writer(f, lineterminator='\n')
        writer.writerow([n, sh01, sh02])
        n += 1
pass

The expected CSV output is as follows:

sh01	sh02
シロアリ駆除に適用される補助金や保険は？	1章　シロアリ駆除工事に補助金はない！
シロアリ駆除の費用を補助金なしで抑える方法	2章　確定申告時に「雑損控除」申請がおすすめ
シロアリ駆除の費用ってどれくらいかかる？	3章　「雑損控除」として負担してもらえる金額
要件を満たせば加入できるシロアリ専門の保険がある？	4章　「雑損控除」の申請方法
シロアリには5年保証がある！	5章　損したくないなら信頼できる業者を選ぼう！
まとめ	まとめ
この記事の監修者　ナカザワ氏について
この記事の監修者　ナカザワ氏について
シロアリ駆除のおすすめ記事
関連記事カテゴリ一覧
シロアリ駆除の記事アクセスランキング
シロアリ駆除の最新記事
カテゴリ別記事⼀覧
シロアリ駆除の業者を地域から探す
関連カテゴリから業者を探す
シロアリ駆除業者ブログ
サービスカテゴリ
生活110番とは
加盟希望・ログイン

Could somebody please tell me what's wrong with this code?

Solution

In addition to the great itertools approach by @Barry the Platipus: pandas is also my favorite, and there is an alternative way with a native dict comprehension.

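Iterate over your URLs and create a dict that uses a number or the URL as the key and a list of heading texts as the value. Because the lists can differ in length, wrap each one in a pd.Series so the shorter columns are padded with NaN; the result can then easily be turned into a DataFrame and exported to CSV: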

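d = {}
for e, url in enumerate(urls, 1):
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # One column per URL: sh1, sh2, ...
    d[f'sh{e}'] = [h.get_text() for h in soup.find_all('h2')]

# pd.Series pads the shorter columns with NaN, since the lists differ in length
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
# df.to_csv('yourfile.csv', index=False)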
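The pd.Series wrapper matters because the pages have different numbers of h2 headings. Here is a minimal offline sketch of that padding, with made-up heading lists standing in for the scraped ones (the strings below are placeholders, not real page content):

```python
import pandas as pd

# Hypothetical stand-ins for the scraped heading lists; lengths differ on purpose
d = {
    'sh1': ['heading A', 'heading B', 'heading C'],
    'sh2': ['heading X'],
}

# pd.DataFrame(d) would raise ValueError here because the lists differ in length.
# Wrapping each list in a Series aligns them on the index and fills gaps with NaN.
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})

print(df.shape)                # (3, 2)
print(df['sh2'].isna().sum())  # 2
```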

Example

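from bs4 import BeautifulSoup
import requests
import pandas as pd

urls = [
    'https://www.seikatsu110.jp/library/vermin/vr_termite/23274/',
    'https://yuko-navi.com/termite-control-subsidies',
]

d = {}
for e, url in enumerate(urls, 1):
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    d[f'sh{e}'] = [h.get_text() for h in soup.find_all('h2')]

# Wrap each list in pd.Series so columns of unequal length are padded with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
print(df)
# df.to_csv('yourfile.csv', index=False)  # uncomment to write the CSV file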

Output

sh1	sh2
シロアリ駆除に適用される補助金や保険は？	1章　シロアリ駆除工事に補助金はない！
シロアリ駆除の費用を補助金なしで抑える方法	2章　確定申告時に「雑損控除」申請がおすすめ
シロアリ駆除の費用ってどれくらいかかる？	3章　「雑損控除」として負担してもらえる金額
要件を満たせば加入できるシロアリ専門の保険がある？	4章　「雑損控除」の申請方法
シロアリには5年保証がある！	5章　損したくないなら信頼できる業者を選ぼう！
まとめ	まとめ
この記事の監修者　ナカザワ氏について	nan
この記事の監修者　ナカザワ氏について	nan
シロアリ駆除のおすすめ記事	nan
関連記事カテゴリ一覧	nan
シロアリ駆除の記事アクセスランキング	nan
シロアリ駆除の最新記事	nan
カテゴリ別記事⼀覧	nan
シロアリ駆除の業者を地域から探す	nan
関連カテゴリから業者を探す	nan
シロアリ駆除業者ブログ	nan
サービスカテゴリ	nan
生活110番とは	nan
加盟希望・ログイン	nan
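
If pandas is not available, the same column layout can probably be produced with only the standard library. The sketch below is an assumption about how that could look, not the itertools answer referred to above: write_columns and the sample headings are hypothetical, and itertools.zip_longest pads the shorter column with empty strings instead of NaN.

```python
import csv
import io
from itertools import zip_longest


def write_columns(columns, out):
    """Write a dict of {column name: list of values} as CSV columns.

    Hypothetical helper: shorter lists are padded with empty strings.
    """
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(columns.keys())
    # zip_longest turns the per-URL lists into rows, padding missing cells
    writer.writerows(zip_longest(*columns.values(), fillvalue=''))


# Made-up headings standing in for the scraped h2 texts
sample = {'sh1': ['heading A', 'heading B'], 'sh2': ['heading X']}

buffer = io.StringIO()
write_columns(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a file object opened with open('h2.csv', 'w', newline='', encoding='utf-8') in place of the StringIO buffer.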

Answered By - HedgeHog

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
