Issue
Basically, I am trying to build a program that can identify login pages by URL. My idea is to parse each page looking for text boxes (and then identify them by name and type). Here is the code:
import requests
from bs4 import BeautifulSoup

# parse page html (soup)
def parse(soup):
    found = []
    for a in soup.find_all('input'):
        if a['type'] in ['text', 'password', 'email']:
            found.append(a['name'])
    return found

# get site's html
def get_site_content(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html5lib')
    textBoxes = parse(soup)
    print("Found in: " + url)
    print(textBoxes)

if __name__ == '__main__':
    get_site_content('https://login.facebook.com')
    get_site_content('https://www.instagram.com/accounts/login/')
    get_site_content('https://instagram.com')
    get_site_content('https://instagram.com/login')
    get_site_content('https://login.yahoo.com')
It seems to work just fine, but for some reason I've had problems with Instagram's login page. Here is the output:
Found in: https://login.facebook.com
['email', 'pass']
Found in: https://www.instagram.com/accounts/login/
[]
Found in: https://instagram.com
[]
Found in: https://instagram.com/login
[]
Found in: https://login.yahoo.com
['username', 'passwd']
Process finished with exit code 0
After trying different libraries for fetching the HTML and different parsers, I've come to understand that the problem is with the html = requests.get(url) line: it just doesn't get the full HTML.
Any ideas on how to fix this?
Thanks in advance!
By the way, if you have a better idea for what I am trying to accomplish, I would love to hear it :)
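One quick way to confirm the raw HTML really contains no input tags (so the problem is the fetch, not the parser) is a small standard-library check; the class and function names below are my own, not from any library:

```python
from html.parser import HTMLParser

class InputCounter(HTMLParser):
    # Counts <input> start tags using only the standard library,
    # so no third-party parser can be blamed for dropping them.
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            self.count += 1

def count_inputs(html):
    parser = InputCounter()
    parser.feed(html)
    return parser.count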
Solution
Alright, so thanks to @user:14460824 (HedgHog) I have come to realize that the problem was the need to render the page, since it is rendered dynamically from JavaScript. Personally, I didn't like Selenium and used requests-html instead. It works the same way as Selenium but feels easier to use, and in the future, once I figure out how to identify whether a web page is rendered dynamically from JavaScript, this library will make it easier to avoid wasting resources. Here is the code:
from requests_html import HTMLSession
import requests

# parse page html
def parse(html):
    found = []
    for a in html.find('input'):
        if a.attrs.get('type') in ['text', 'password', 'email'] and 'name' in a.attrs:
            found.append(a.attrs['name'])
    return found

# get site's html
def get_site_content(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        # if(JAVASCRIPT):                    # here I need to find a way to tell whether
        #     response.html.render(timeout=20)  # the page is rendered dynamically from JavaScript
        response.html.render(timeout=20)  # for now, render all pages
        return response.html
    except requests.exceptions.RequestException as e:
        print(e)

def find_textboxes(url):
    html = get_site_content(url)
    if html is None:  # the request failed, nothing to parse
        return
    textBoxes = parse(html)
    print("Found in: " + url)
    print(textBoxes)

if __name__ == '__main__':
    find_textboxes('https://login.facebook.com')
    find_textboxes('https://www.instagram.com/accounts/login/')
    find_textboxes('https://instagram.com')
    find_textboxes('https://login.yahoo.com')
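For the open question in the commented-out block (deciding when to render), one crude heuristic is: if the static HTML ships no input tags but does reference scripts, the form is probably built client-side and worth rendering. This is a sketch under that assumption, and the name needs_render is mine:

```python
def needs_render(html_text):
    # Render only when the static HTML has no <input> tags at all
    # but does load JavaScript, i.e. the form is likely built client-side.
    lowered = html_text.lower()
    has_inputs = '<input' in lowered
    has_scripts = '<script' in lowered
    return not has_inputs and has_scripts
```

You could call this on response.text before deciding whether to invoke response.html.render(), saving the cost of spinning up Chromium for pages that already contain their form in the static HTML.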
Answered By - user10696838