Issue
I am creating a Python web scraper, and I have it print the title and span of the web page I enter. I've been looking around, but cannot find what other elements a web page has.
Are there any other parts of a website that Python can access using bs4 / BeautifulSoup / requests?
I've found a head element, but I'm sure there have to be more.
Solution
You can search for any HTML tag that appears in the page. In bs4, you generally use the find or findAll methods to scrape a page (findAll is spelled find_all in current bs4, and the old name remains as an alias). The first parameter of these methods is the name of the tag you are searching for. Here are some examples of how to use the findAll method: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#The%20basic%20find%20method:%20findAll(name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs)
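As a rough illustration of find and find_all, here is a minimal sketch. The HTML string, the variable names, and the "intro" class are made up for the example; a real scraper would pass the text of a downloaded page instead.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a live request
HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Heading</h1>
    <p class="intro">First <span>paragraph</span>.</p>
    <p>Second paragraph with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find returns the first matching tag (or None if nothing matches)
title_text = soup.find("title").get_text()

# find_all returns every matching tag
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Tag attributes can be read like dictionary keys
links = [a["href"] for a in soup.find_all("a")]

# A second argument filters by attributes, here the class
intro = soup.find("p", {"class": "intro"}).get_text()

print(title_text)
print(paragraphs)
print(links)
print(intro)
```

Any tag name works as the first argument, so the same pattern applies to h1, div, table, img, and so on.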
Alternatively you can traverse the document tree like so:
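    def walker(soup):
        # Text nodes have name None and no children, so only tags are descended into
        if soup.name is not None:
            for child in soup.children:
                # process node
                print(str(child.name) + ":" + str(type(child)))
                walker(child)

    # soup is a BeautifulSoup object, e.g. BeautifulSoup(html, "html.parser")
    walker(soup)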
Taken from: http://makble.com/parsing-and-traversing-dom-tree-with-beautifulsoup
This goes through each node in the tree from the root, <html>, in a depth-first search. It does so by recursively looking at the children of each node, then the children's children, and so on.
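To see which tags a page actually contains, the same depth-first idea can collect tag names together with their nesting depth. This is only a sketch: the tag_names helper and the sample HTML are hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample document
HTML = "<html><head><title>T</title></head><body><div><p>Hi <b>there</b></p></div></body></html>"

def tag_names(node, depth=0):
    """Yield (depth, tag name) pairs in depth-first order."""
    for child in node.children:
        # Skip text nodes; only Tag objects have a tag name and children
        if isinstance(child, Tag):
            yield depth, child.name
            yield from tag_names(child, depth + 1)

soup = BeautifulSoup(HTML, "html.parser")
found = list(tag_names(soup))

# Print the tree with two spaces of indentation per nesting level
for depth, name in found:
    print("  " * depth + name)
```

Running this on a real page gives a quick outline of its structure, which helps when deciding what to pass to find or find_all.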
Answered By - Calder White