Issue
I am trying to scrape some transaction data, but the page refreshes every few seconds. I want to limit each grab to the latest block only, then rescan and pick up the next block as it arrives. Any idea would be very helpful.
Goal #1 - Continuity with grabbed blocks
Goal #2 - Eliminate Duplicates
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')   # strip everything except digits, commas and dots
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.find_all('table')[0].find_all('tr')
    for row in blocktxsInternal[1:]:          # skip the header row
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = row.find_all('td')[3].text
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        if transval >= 1:
            print("Doing something with the data -> " + str(block) + " " + str(transval))
        else:
            pass
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
Current Output: #-- will be different when you run the script
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322 #-- grab only until here and reload the scan
Doing something with the data -> 10186992 9.0
Doing something with the data -> 10186991 2.98
Doing something with the data -> 10186991 1.0
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
Doing something with the data -> 10186992 9.0
-> Whole Page Scanned: 2
Wanted Output:
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
-> Whole Page Scanned: 2
Solution
I used Pandas here: it uses BeautifulSoup under the hood anyway, and since the data is a table, I let pandas parse it, which makes the table easy to manipulate.
It looks like you only want the latest/max "Block" and then any values greater than or equal to 1. Does this give you what you want?
import pandas as pd
from time import sleep
import requests

url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    df = pd.read_html(reqtxsInternal.text)[0]               # first table on the page
    df = df[df['Block'] == max(df['Block'])]                # keep only the latest block
    df['Value'] = df['Value'].str.extract(r'(^\d*.*\d+)')   # pull the numeric part of the value
    df = df[df['Value'].astype(float) >= 1]                 # keep values >= 1
    print(df[['Block', 'Value']])
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
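If the page is rescanned before a new block shows up, the same rows of the current max block come back and get handled again. To also cover Goal #2, one option is to remember which transactions have already been processed and drop them on later scans. A minimal sketch of that idea, assuming the parsed table exposes a 'Txn Hash' column (that column name, and the filter_new_rows helper, are assumptions for illustration):

import pandas as pd

def filter_new_rows(df, seen_hashes):
    """Keep only rows whose 'Txn Hash' has not been processed yet and
    record the new hashes in seen_hashes ('Txn Hash' is an assumed column name)."""
    new_rows = df[~df['Txn Hash'].isin(seen_hashes)]
    seen_hashes.update(new_rows['Txn Hash'].tolist())
    return new_rows

# usage inside the scan loop, after df has been reduced to the latest block:
# seen_hashes = set()   # created once, before the while loop
# df = filter_new_rows(df, seen_hashes)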
Your other option is to have it check whether the current 'block' is greater than the previous one, and add that logic so it only prints when it is:
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')   # strip everything except digits, commas and dots
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0
previous_block = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.find_all('table')[0].find_all('tr')
    for row in blocktxsInternal[1:]:                 # skip the header row
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = row.find_all('td')[3].text
        if float(block) > float(previous_block):     # a newer block has appeared
            previous_block = block
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        if transval >= 1 and block == previous_block:   # only rows from the newest block
            print("Doing something with the data -> " + str(block) + " " + str(transval))
        else:
            pass
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
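One caveat with this version: if no new block has arrived between two scans, the rows of the newest block are printed a second time. A small variant is to keep a high-water mark of the highest block already processed and only print rows from a strictly newer block; a minimal sketch of that idea (the handle_rows helper and its (block, value) input format are hypothetical):

def handle_rows(rows, last_processed_block):
    """rows: (block, value) pairs parsed from one page scan.
    Prints the newest block's rows only if that block has not been
    processed before, then returns the updated high-water mark."""
    newest = max(block for block, _ in rows)
    if newest > last_processed_block:
        for block, value in rows:
            if block == newest and value >= 1:
                print("Doing something with the data ->", block, value)
    return max(last_processed_block, newest)

# example: a repeated scan of an unchanged page prints nothing the second time
last = 0
last = handle_rows([(10186993, 27.97), (10186993, 1.23), (10186992, 9.0)], last)
last = handle_rows([(10186993, 27.97), (10186993, 1.23), (10186992, 9.0)], last)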
Answered By - chitown88