Issue
I am using Beautiful Soup to scrape a web page: http://www.jukuu.com/search.php?q=apple. I want to get each English example sentence together with its paired Chinese translation. So far I can find the sentence rows with this code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
def parseDictWeb():
    print("parse....")
    url = "http://www.jukuu.com/search.php?q=apple"
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, 'html.parser')
    mydivs = soup.find_all("tr", {"class": "e"})
The problem is that I cannot pair each English sentence with its Chinese sentence. I want to get the sentences as pairs and be sure every pair is matched correctly. I could fetch all the English rows and all the Chinese rows separately and pair them by list index:
mydivseng = soup.find_all("tr", {"class": "e"})
mydivszh = soup.find_all("tr", {"class": "c"})
# pair the two lists by index
But when a Chinese sentence is missing, the pairing breaks and an English sentence may be matched to the wrong translation. How can I get the pairs in one pass and make sure they are always matched correctly? I want the output to look like this:
lst = [
    {'e': ..., 'c': ...},
    {'e': ..., 'c': ...}
]
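To make the failure concrete, here is a small sketch of index-based pairing. The sentences are made up for illustration and are not real page data; the list names `eng`, `zh`, and `naive_pairs` are likewise only placeholders.

```python
# Hand-written sample lists (not real page data): three English sentences
# but only two Chinese ones, because the second has no translation.
eng = ["I ate an apple.", "Apples are red.", "She likes apple pie."]
zh = ["我吃了一个苹果。", "她喜欢苹果派。"]

# Pair by position, as in the two-list approach.
naive_pairs = [{"e": e, "c": c} for e, c in zip(eng, zh)]

for pair in naive_pairs:
    print(pair)
```

Here "Apples are red." ends up with the translation that belongs to "She likes apple pie.", and the third English sentence is dropped because `zip` stops at the shorter list.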
Solution
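Select the English and Chinese rows in a single query so they stay in page order, then attach each Chinese row to the English row before it. Try: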
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
def parseDictWeb():
    print("parse....")
    url = "http://www.jukuu.com/search.php?q=apple"
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, 'html.parser')
    # Select only English and Chinese rows, in document order
    mydivs = soup.find_all('tr', {'class': ['e', 'c']})
    lst = []
    idx = -1
    for row in mydivs:
        # Start a new entry in the list for each English row
        if 'e' in row.attrs['class']:
            lst.append({'eng': row.text})
            idx += 1
        # Attach a Chinese row to the most recent English entry
        else:
            lst[idx].update({'zh': row.text})
    return lst

out = parseDictWeb()
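Note that the loop above raises an IndexError if a Chinese row appears before any English row, and it leaves the 'zh' key out when a translation is missing. A minimal sketch of a more defensive variant follows. It is kept separate from the network fetch so it can be checked on hand-written data. The function name `pair_rows` and the `(css_class, text)` tuple format are assumptions made for this illustration, and it uses the 'e'/'c' keys from the desired output in the question.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def pair_rows(rows: Iterable[Tuple[str, str]]) -> List[Dict[str, Optional[str]]]:
    """Pair each English row with the Chinese row that directly follows it.

    `rows` is a hypothetical intermediate format: (css_class, text) tuples
    in page order, where css_class is "e" (English) or "c" (Chinese).
    """
    pairs: List[Dict[str, Optional[str]]] = []
    for css_class, text in rows:
        if css_class == "e":
            # Every English row opens a new entry; "c" stays None until filled.
            pairs.append({"e": text.strip(), "c": None})
        elif css_class == "c" and pairs and pairs[-1]["c"] is None:
            # Attach the translation to the latest English row still missing one.
            pairs[-1]["c"] = text.strip()
        # Anything else (an orphan or duplicate Chinese row) is skipped.
    return pairs


# Hand-written sample rows (not real page data); the second English
# sentence has no translation.
sample = [
    ("e", "I ate an apple."),
    ("c", "我吃了一个苹果。"),
    ("e", "Apples are red."),
    ("e", "She likes apple pie."),
    ("c", "她喜欢苹果派。"),
]
result = pair_rows(sample)
```

With Beautiful Soup, the tuples could be built from the rows selected above, for example `[(row['class'][0], row.get_text()) for row in mydivs]`, assuming each row carries a single class. Entries whose 'c' value is None can then be filtered out or reported as unpaired.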
Answered By - Corralien