Saturday, March 5, 2022

[FIXED] how to find the pair content using beautifulsoup in Python3


Issue

I am using Beautiful Soup to scrape a web page, http://www.jukuu.com/search.php?q=apple. I want to get each English example sentence together with its paired Chinese translation. So far I can find the sentence rows with this code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def parseDictWeb():
    print("parse....")
    url = "http://www.jukuu.com/search.php?q=apple"
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, 'html.parser')
    mydivs = soup.find_all("tr", {"class": "e"})

The problem is that I cannot pair each English sentence with its Chinese sentence, and I want to be sure every sentence is paired correctly. One option is to fetch all the English rows and all the Chinese rows separately and pair them by list index:

mydivseng = soup.find_all("tr", {"class": "e"})
mydivszh = soup.find_all("tr", {"class": "c"})
# pair the two lists by index

But when a Chinese sentence is missing, the pairing breaks and an English sentence may be matched to the wrong translation. How can I get the pairs in one pass and guarantee that they are matched correctly? I want the output to look like this:

lst = [
   {'e': ..., 'c': ...},
   {'e': ..., 'c': ...}
]
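
To make the failure concrete, here is a small sketch of the index-based approach run on made-up markup. The SAMPLE_HTML string and its sentences are assumptions for illustration only, not the real jukuu.com page; the second English row deliberately has no Chinese row.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: "The apple is red." has no Chinese row.
SAMPLE_HTML = """
<table>
  <tr class="e"><td>I ate an apple.</td></tr>
  <tr class="c"><td>我吃了一个苹果。</td></tr>
  <tr class="e"><td>The apple is red.</td></tr>
  <tr class="e"><td>She likes apples.</td></tr>
  <tr class="c"><td>她喜欢苹果。</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
eng = [tr.get_text(strip=True) for tr in soup.find_all("tr", {"class": "e"})]
zh = [tr.get_text(strip=True) for tr in soup.find_all("tr", {"class": "c"})]

# Pairing by position: every translation after the gap shifts up by one,
# and zip() silently drops the last English sentence.
naive_pairs = list(zip(eng, zh))
print(naive_pairs)
```

Here "The apple is red." ends up next to the translation of "She likes apples.", which is exactly the mismatch described above.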

Solution

Try:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def parseDictWeb():
    print("parse....")
    url = "http://www.jukuu.com/search.php?q=apple"
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, 'html.parser')

    # Select only English and Chinese rows
    mydivs = soup.find_all('tr', {'class': ['e', 'c']})
    
    lst = []
    idx = -1
    for row in mydivs:
        # initialize a new element of the list if english
        if 'e' in row.attrs['class']:
            lst.append({'eng': row.text})
            idx += 1
        # update existing element of the dict if chinese
        else:
            lst[idx].update({'zh': row.text})
            
    return lst
    
out = parseDictWeb()
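
The same single-pass idea can be tried offline. The following is a minimal sketch, not the answer's exact code: pair_rows and SAMPLE_HTML are hypothetical names, the markup is made up rather than taken from jukuu.com, and it assumes (as the answer does) that each Chinese row directly follows its English row. It uses the 'e' and 'c' keys from the question and stores None when a translation is missing.

```python
from bs4 import BeautifulSoup

def pair_rows(html):
    """Pair each English row with the Chinese row that follows it, if any."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # find_all returns matches in document order, so one pass is enough.
    for row in soup.find_all("tr", class_=["e", "c"]):
        text = row.get_text(strip=True)
        if "e" in row.get("class", []):
            # Start a new pair; the translation is unknown until we see it.
            pairs.append({"e": text, "c": None})
        elif pairs and pairs[-1]["c"] is None:
            # Attach the Chinese text to the most recent unpaired English row.
            pairs[-1]["c"] = text
        # A Chinese row with no unpaired English row before it is skipped.
    return pairs

# Hypothetical sample markup: the second English row has no translation.
SAMPLE_HTML = """
<table>
  <tr class="e"><td>I ate an apple.</td></tr>
  <tr class="c"><td>我吃了一个苹果。</td></tr>
  <tr class="e"><td>The apple is red.</td></tr>
  <tr class="e"><td>She likes apples.</td></tr>
  <tr class="c"><td>她喜欢苹果。</td></tr>
</table>
"""

result = pair_rows(SAMPLE_HTML)
for pair in result:
    print(pair)
```

Storing None for a missing translation keeps every English sentence in the list, so a gap shows up as an explicit empty value instead of shifting later pairs.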

Answered By - Corralien

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
