Issue
I'm trying to scrape some HTML that looks like the sample below. Sometimes the data I need is in div[0], sometimes in div[1], and so on.
Imagine everyone takes 3-5 classes. One of them is always Biology. Their report card is always alphabetized. I want everybody's Biology grade.
I've already scraped all this HTML into a string. How do I fish out the Biology grades?
<div class = "student">
<div class = "score">Algebra C-</div>
<div class = "score">Biology A+</div>
<div class = "score">Chemistry B</div>
</div>
<div class = "student">
<div class = "score">Biology B</div>
<div class = "score">Chemistry A</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
</div>
<div class = "student">
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
<div class = "score">Chemistry C+</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Bangladeshi History C</div>
<div class = "score">Biology B</div>
</div>
I'm using Beautiful Soup, and I think I'm going to have to find divs whose text includes "Biology"?
This is only for a quick scrape and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, but I don't know a thing about WSDL.
Short version: I'm scraping http://www.legis.ga.gov/Legislation/en-US/Search.aspx to find the date of last action on every bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, they don't emit an empty div; the div is simply missing. So sometimes the timeline is in div 3, sometimes div 2, etc.
Solution
(1) If you only need the Biology grade, it is almost a one-liner.
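import re
import bs4

# html is the string of scraped markup from the question
soup = bs4.BeautifulSoup(html, "html.parser")
# match every text node that contains "Biology"
score_strings = soup.find_all(string=re.compile('Biology'))
# the grade is the last whitespace-separated token
scores = [score_string.split()[-1] for score_string in score_strings]
print(score_strings)
print(scores)
The output looks like this:
['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']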
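If you also want to keep each grade tied to its student, one option is to loop over the student blocks and search inside each one, so the position of the Biology div no longer matters. This is only a minimal sketch, assuming the markup matches the sample in the question: the html string is a shortened copy of that sample, biology_grade is a hypothetical helper name, and the string= argument assumes Beautiful Soup 4.4.0 or later.

```python
import re
import bs4

# Shortened copy of the sample markup from the question.
html = """
<div class = "student">
<div class = "score">Algebra C-</div>
<div class = "score">Biology A+</div>
<div class = "score">Chemistry B</div>
</div>
<div class = "student">
<div class = "score">Biology B</div>
<div class = "score">Chemistry A</div>
</div>
"""


def biology_grade(student):
    """Hypothetical helper: Biology grade inside one student block, or None."""
    div = student.find("div", class_="score", string=re.compile(r"^Biology\b"))
    return div.get_text().split()[-1] if div is not None else None


soup = bs4.BeautifulSoup(html, "html.parser")
# One entry per student, in document order.
grades = [biology_grade(s) for s in soup.find_all("div", class_="student")]
print(grades)  # ['A+', 'B']
```

Searching within each student block means a student without a Biology div yields None instead of silently shifting the results out of line.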
(2) If you need the tags themselves for further work, locate the matching text first and then take its parent:
import re
import bs4

soup = bs4.BeautifulSoup(html, "html.parser")
scores = soup.find_all(string=re.compile('Biology'))
# .parent moves from each matched text node up to its enclosing <div>
divs = [score.parent for score in scores]
print(divs)
Output looks like this:
[<div class="score">Biology A+</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>]
*In conclusion, you can use find_next_siblings, find_previous_siblings, .parent, and similar methods and attributes to move around the HTML tree.*
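As a rough illustration of moving around the tree from a matched node, the sketch below starts at one Biology div and looks up and sideways. It assumes the same markup as the question; the html string is one student block copied from the sample, and the variable names are only for illustration.

```python
import re
import bs4

# One student block copied from the sample in the question.
html = """
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
# The matched text node's parent is the <div class="score"> tag.
biology = soup.find(string=re.compile("Biology")).parent

# Going up once more reaches the enclosing student block.
student = biology.parent
print(student["class"])  # ['student']

# find_previous_siblings walks backwards, nearest sibling first.
earlier = [d.get_text() for d in biology.find_previous_siblings("div")]
print(earlier)  # ['Algebra A', 'Alchemy D']

# Biology is the last score here, so nothing follows it.
later = [d.get_text() for d in biology.find_next_siblings("div")]
print(later)  # []
```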
More information about how to navigate the tree is in the Beautiful Soup documentation. Good luck with your work.
Answered By - B.Mr.W.