Monday, April 11, 2022

[FIXED] Can't get info of a lxml site with Request and BeautifulSoup

Labels: beautifulsoup, python, python-requests, web-scraping

Issue

I'm trying to build a test project that scrapes information from a specific site, but with no success.
I followed some tutorials I found and even a post on Stack Overflow. After all this, I'm stuck!
Please help me: I'm new to programming in Python and I can't let my projects stall.

More info: this is a lottery website that I was trying to scrape so I could run some analysis and pick a lucky number.

I have followed these tutorials:
https://towardsdatascience.com/how-to-collect-data-from-any-website-cb8fad9e9ec5

https://beautiful-soup-4.readthedocs.io/en/latest/

Using BeautifulSoup in order to find all "ul" and "li" elements

You all have my gratitude!

from bs4 import BeautifulSoup as bs
import requests
import html5lib
#import urllib3  # another attempt at requesting the URL ------ failed

url = 'https://loterias.caixa.gov.br/Paginas/Mega-Sena.aspx'

# another attempt to read the results from the <ul>, but nothing useful comes back (== None)
def parse_ul(elem):  # https://stackoverflow.com/questions/50338108/using-beautifulsoup-in-order-to-find-all-ul-and-li-elements
    result = {}
    for sub in elem.find_all('li', recursive=False):
        if sub.li is None:
            continue
        data = {k: v for k, v in sub.attrs.items()}
        if sub.ul is not None:
            # recurse down
            data['children'] = parse_ul(sub.ul)
        result[sub.li.get_text(strip=True)] = data
    return result

page = requests.get(url)  # fetch the page from the website

print(page.encoding)  # == UTF-8

soup = bs(page.content, features="lxml")  # parse the downloaded HTML with BeautifulSoup

numbers = soup.find(id='ulDezenas')  # look up this specific id // another try: soup.find('ul', {'class': ''})

result = parse_ul(soup)  # try to parse the list items, but nothing is found, even with the original function

print(numbers)  # the result is below:
'''<ul class="numbers megasena" id="ulDezenas">
<li ng-repeat="dezena in resultado.listaDezenas ">{{dezena.length &gt; 2 ? dezena.slice(1) : dezena}}</li>
</ul>'''
print(result)  # == "{}" nothing found

#with open('''D:\Documents\python\_abretesesame.txt''', 'wb') as fd:
#    for chunk in page.iter_content(chunk_size=128):
#        fd.write(chunk)
# ======= writing the document (HTML) to a file: still no success in getting the numbers

Solution

The main issue is that the page content is rendered dynamically by JavaScript, so the HTML returned by requests contains only the unrendered template. You can get the same information from another URL:

jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()

This will give you the following JSON:

{'tipoJogo': 'MEGA_SENA', 'numero': 2468, 'nomeMunicipioUFSorteio': 'SÃO PAULO, SP', 'dataApuracao': '02/04/2022', 'valorArrecadado': 158184963.0, 'valorEstimadoProximoConcurso': 3000000.0, 'valorAcumuladoProximoConcurso': 0.0, 'valorAcumuladoConcursoEspecial': 36771176.89, 'valorAcumuladoConcurso_0_5': 33463457.98, 'acumulado': False, 'indicadorConcursoEspecial': 1, 'dezenasSorteadasOrdemSorteio': ['022', '041', '053', '042', '035', '057'], 'listaResultadoEquipeEsportiva': None, 'numeroJogo': 2, 'nomeTimeCoracaoMesSorte': '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'tipoPublicacao': 3, 'observacao': '', 'localSorteio': 'ESPAÇO DA SORTE', 'dataProximoConcurso': '06/04/2022', 'numeroConcursoAnterior': 2467, 'numeroConcursoProximo': 2469, 'valorTotalPremioFaixaUm': 0.0, 'numeroConcursoFinal_0_5': 2470, 'listaDezenas': ['022', '035', '041', '042', '053', '057'], 'listaDezenasSegundoSorteio': None, 'listaMunicipioUFGanhadores': [{'posicao': 1, 'ganhadores': 1, 'municipio': 'SANTOS', 'uf': 'SP', 'nomeFatansiaUL': '', 'serie': ''}], 'listaRateioPremio': [{'faixa': 1, 'numeroDeGanhadores': 1, 'valorPremio': 122627171.8, 'descricaoFaixa': '6 acertos'}, {'faixa': 2, 'numeroDeGanhadores': 267, 'valorPremio': 34158.18, 'descricaoFaixa': '5 acertos'}, {'faixa': 3, 'numeroDeGanhadores': 20734, 'valorPremio': 628.38, 'descricaoFaixa': '4 acertos'}], 'id': None, 'ultimoConcurso': True, 'exibirDetalhamentoPorCidade': True, 'premiacaoContingencia': None}

Simply extract listaDezenas (the numbers in ascending order; dezenasSorteadasOrdemSorteio holds them in draw order) and process it with a list comprehension:

[n[1:] if len(n) > 2 else n for n in jsonData['listaDezenas']]

Result will be:

['22', '35', '41', '42', '53', '57']
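
A minimal sketch of the same step, written to follow the Angular expression visible in the question's output (`dezena.length > 2 ? dezena.slice(1) : dezena`): entries longer than two characters lose their first character, and shorter ones pass through unchanged. The helper name `strip_padding` is hypothetical, and the sample values are copied from the JSON shown above rather than fetched.

```python
def strip_padding(dezenas):
    """Drop the first character of every entry longer than two characters."""
    return [n[1:] if len(n) > 2 else n for n in dezenas]


# Values copied from the JSON above (no network access needed).
lista_dezenas = ['022', '035', '041', '042', '053', '057']        # ascending order
ordem_sorteio = ['022', '041', '053', '042', '035', '057']        # draw order

print(strip_padding(lista_dezenas))   # ['22', '35', '41', '42', '53', '57']
print(strip_padding(ordem_sorteio))   # ['22', '41', '53', '42', '35', '57']
```

Entries that are already two characters long, such as '22', are returned as they are.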

Example

import requests

jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
print([n[1:] if len(n) > 2 else n for n in jsonData['listaDezenas']])
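
A slightly more defensive variant could look like the sketch below. It is only an illustration: the helper names `has_unrendered_template` and `fetch_latest_draw`, the 10-second timeout, and the regular expression are assumptions, not part of the original answer. The first helper checks whether downloaded HTML still contains `{{ ... }}` placeholders, which is the sign that the values are filled in by JavaScript; the second wraps the request from the example with a timeout and an HTTP status check. Only the offline check runs here, on the `<ul>` printed in the question.

```python
import re

# URL taken from the answer above.
API_URL = "https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/"

# Matches Angular-style {{ ... }} placeholders left in the static HTML.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def has_unrendered_template(html):
    """Return True if the markup still contains {{ ... }} placeholders."""
    return PLACEHOLDER.search(html) is not None


def fetch_latest_draw(url=API_URL, timeout=10):
    """Fetch the JSON payload and raise on HTTP errors (hypothetical helper)."""
    import requests  # imported lazily so the offline check needs no third-party package

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.json()


# Offline demo: the <ul> that BeautifulSoup returned in the question.
static_html = (
    '<ul class="numbers megasena" id="ulDezenas">'
    '<li ng-repeat="dezena in resultado.listaDezenas ">'
    '{{dezena.length &gt; 2 ? dezena.slice(1) : dezena}}</li></ul>'
)
print(has_unrendered_template(static_html))  # True
```

Calling `fetch_latest_draw()['listaDezenas']` would then perform the actual request. Whether the endpoint accepts it without extra headers may change over time.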

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
