Issue
I wrote code to scrape the Binance announcement page (https://www.binance.com/en/support/announcement/c-48?navId=48), get the latest <a> tag and do something with it. The problem is that when Binance publishes a new announcement with a new <a> tag, my code only detects it 3-5 minutes later. I also tried the same code on my personal site, and there it works without any delay. Why is that, and what might cause this issue?
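import requests_cache
from bs4 import BeautifulSoup

siteUrl = "https://www.binance.com/en/support/announcement/c-48?navId=48"  # page from the question
session = requests_cache.CachedSession('demo_cache')

####### first check of <a> ########
def getFirstLink():
    # Fetch the announcement page and parse it
    pageForFirstCheck = session.get(siteUrl)
    soupForFirstCheck = BeautifulSoup(pageForFirstCheck.content, "html.parser")
    # Drill down to the container that holds the newest announcement
    resultForFirstCheck = soupForFirstCheck.find('div', class_='css-6f91y1')
    firstDiv = resultForFirstCheck.find('div', class_='css-vurnku')
    firstLink = firstDiv.find('a')
    prevLink = firstLink.get_text()  # the topmost <a>
    return prevLink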
Also, I wrap this function inside a while True loop:
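import random
import time
while True:
    time.sleep(random.randint(1, 5))  # wait 1-5 seconds between checks
    try:
        stringThatCameFromLink = getFirstLink()
        # and it does something with that link
    except Exception as err:  # handler not shown in the original post
        print(f"check failed: {err}")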
Thank you in advance!
Solution
I think the problem is that the Cloudflare server is caching the documents. Or it was done deliberately by the Binance programmers, so that a narrow circle of people can react to the news faster than everyone else. Either way, this is a big problem if you want fresh data.
If you look at the HTTP response headers, you will notice that the "Date:" header does not change between requests. It comes from the server's cache, which means the entire content of the document is cached. I managed to get 2 different "Date:" values by adding or removing the gzip header "accept-encoding: gzip, deflate".
I am using the page https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15 If you change the "pageSize" parameter, you can get a freshly cached response from the server. But that still doesn't solve the 5 minute delay issue, and I still see the old page.
Your link, https://www.binance.com/en/support/announcement/c-48?navId=48, like mine, https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15, is also cached for 5 seconds. My guess is that it will have the 5 minute delay as well. I have not found a solution to this problem.
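The header check described above can be reproduced with a short script. This is only a minimal sketch, assuming the endpoint still answers plain GET requests; it uses the standard library's urllib instead of requests so that no local cache sits in between. The helper names build_url, fetch_headers, same_date and main are hypothetical and not part of any Binance or library API.

```python
import time
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the answer above.
BASE_URL = "https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query"


def build_url(page_size=15):
    """Build the catalog URL. The answer reports that changing pageSize
    returns a freshly cached response."""
    query = urlencode({"catalogId": 48, "pageNo": 1, "pageSize": page_size})
    return f"{BASE_URL}?{query}"


def fetch_headers(url, extra_headers=None):
    """Return the response headers for url, with lower-cased names."""
    request = urllib.request.Request(url, headers=extra_headers or {})
    with urllib.request.urlopen(request, timeout=10) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def same_date(first, second):
    """Hypothetical helper: True when both header dicts carry the same Date value."""
    a = {name.lower(): value for name, value in first.items()}
    b = {name.lower(): value for name, value in second.items()}
    return "date" in a and a["date"] == b.get("date")


def main():
    plain = fetch_headers(build_url())
    time.sleep(2)  # let the clock advance so an uncached Date would differ
    again = fetch_headers(build_url())
    gzipped = fetch_headers(build_url(), {"Accept-Encoding": "gzip, deflate"})
    resized = fetch_headers(build_url(page_size=16))

    print("same Date on repeated request:", same_date(plain, again))
    for label, headers in (("plain", plain), ("gzip", gzipped), ("pageSize=16", resized)):
        # Age and Cache-Control are standard HTTP caching headers; they may be absent.
        print(label, headers.get("date"), headers.get("age"), headers.get("cache-control"))
```

Calling main() prints the headers for each variant. An identical "Date:" value on two requests made a few seconds apart suggests the response came from a cache, but it does not show where the 3-5 minute delay comes from.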
Answered By - user3210461