Issue
Basically, I am trying to build a program that can identify login pages by URL. My idea is to parse each page looking for text boxes (and then identify them by name and type). Here is the code:
import requests
from bs4 import BeautifulSoup

# parse page html (soup)
def parse(soup):
    found = []
    for a in soup.find_all('input'):
        if a['type'] in ['text', 'password', 'email']:
            found.append(a['name'])
    return found

# get site's html
def get_site_content(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html5lib')
    textBoxes = parse(soup)
    print("Found in: " + url)
    print(textBoxes)

if __name__ == '__main__':
    get_site_content('https://login.facebook.com')
    get_site_content('https://www.instagram.com/accounts/login/')
    get_site_content('https://instagram.com')
    get_site_content('https://instagram.com/login')
    get_site_content('https://login.yahoo.com')
It seems to work just fine, but for some reason I've had problems with Instagram's login page. Here is the output:
Found in: https://login.facebook.com
['email', 'pass']
Found in: https://www.instagram.com/accounts/login/
[]
Found in: https://instagram.com
[]
Found in: https://instagram.com/login
[]
Found in: https://login.yahoo.com
['username', 'passwd']
Process finished with exit code 0
After trying different libraries for fetching the HTML and different parsers, I've come to understand that the problem is with the html = requests.get(url) line: it just doesn't get the full HTML.
Any ideas on how to fix this?
Thanks in advance!
By the way, if you have a better idea for what I am trying to accomplish, I would love to hear it :)
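One quick way to confirm the raw HTML really contains no input tags (so the problem is the fetch, not the parser) is a small standard-library check; the class and function names below are my own, not from any library:

```python
from html.parser import HTMLParser

class InputCounter(HTMLParser):
    # Counts <input> start tags using only the standard library,
    # so no third-party parser can be blamed for dropping them.
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            self.count += 1

def count_inputs(html):
    parser = InputCounter()
    parser.feed(html)
    return parser.count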
Solution
Alright, so thanks to @user:14460824 (HedgHog) I have come to realize that the problem was the need to render the page, since it is rendered dynamically from JavaScript. Personally, I didn't like Selenium and used requests-html instead. It works the same way as Selenium but feels easier to use, and in the future, once I figure out how to identify whether a web page is rendered dynamically from JavaScript, this library will make it easier to avoid wasting resources. Here is the code:
from requests_html import HTMLSession
import requests

# parse page html
def parse(html):
    found = []
    for a in html.find('input'):
        if a.attrs.get('type') in ['text', 'password', 'email'] and 'name' in a.attrs:
            found.append(a.attrs['name'])
    return found

# get site's html
def get_site_content(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        # if(JAVASCRIPT):                    # here I need to find a way to tell whether
        #     response.html.render(timeout=20)  # the page is rendered dynamically from JavaScript
        response.html.render(timeout=20)  # for now, render all pages
        return response.html
    except requests.exceptions.RequestException as e:
        print(e)

def find_textboxes(url):
    html = get_site_content(url)
    if html is None:  # the request failed, nothing to parse
        return
    textBoxes = parse(html)
    print("Found in: " + url)
    print(textBoxes)

if __name__ == '__main__':
    find_textboxes('https://login.facebook.com')
    find_textboxes('https://www.instagram.com/accounts/login/')
    find_textboxes('https://instagram.com')
    find_textboxes('https://login.yahoo.com')
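For the open question in the commented-out block (deciding when to render), one crude heuristic is: if the static HTML ships no input tags but does reference scripts, the form is probably built client-side and worth rendering. This is a sketch under that assumption, and the name needs_render is mine:

```python
def needs_render(html_text):
    # Render only when the static HTML has no <input> tags at all
    # but does load JavaScript, i.e. the form is likely built client-side.
    lowered = html_text.lower()
    has_inputs = '<input' in lowered
    has_scripts = '<script' in lowered
    return not has_inputs and has_scripts
```

You could call this on response.text before deciding whether to invoke response.html.render(), saving the cost of spinning up Chromium for pages that already contain their form in the static HTML.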
Answered By - user10696838