Issue
I've seen questions and posts on how to scrape the tweets of a specific handle, but not on how to get all the replies to a particular tweet using Python in a Jupyter Notebook.
Example: I want to scrape and export to Excel all 340 replies to this public BBC tweet "Microplastics found in fresh Antarctic snow for the first time" (https://twitter.com/BBCWorld/status/1534777385249390593).
I need the following info: the Reply date, the Reply to (so I only get the replies to BBC, and not to other users in this thread), and the Reply text.
Inspecting the elements of the page, I see that the reply container's class is named css-1dbjc4n. Likewise:
- The Reply date's class is: css-1dbjc4n r-1loqt21 r-18u37iz r-1ny4l3l r-1udh08x r-1qhn6m8 r-i023vh r-o7ynqc r-6416eg
- The Reply to's class is: css-4rbku5 css-18t94o4 css-901oao r-14j79pv r-1loqt21 r-1q142lx r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-3s2u2q r-qvutc0
- And the Reply text's class is: css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0
I have tried to run the code below, but the lists remain empty :(
Results so far:
Empty DataFrame
Columns: [Date of Tweet, Replying to, Tweet]
Index: []
Can anyone help me, please? Many thanks! :)
Code:
import sys
sys.path.append("path to site-packages in your pc")
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path=r"C:chromedriver path in your pc")
dates=[] #List to store date of tweet
replies=[] #List to store reply to info
comments=[] #List to store comments
driver.get("https://twitter.com/BBCWorld/status/1534777385249390593")
twts=[]
content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('div', href=True, attrs={'class':'css-1dbjc4n'}):
    datetweet=a.find('div', attrs={'class':'css-1dbjc4n r-1loqt21 r-18u37iz r-1ny4l3l r-1udh08x r-1qhn6m8 r-i023vh r-o7ynqc r-6416eg'})
    replytweet=a.find('div', attrs={'class':'css-4rbku5 css-18t94o4 css-901oao r-14j79pv r-1loqt21 r-1q142lx r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-3s2u2q r-qvutc0'})
    commenttweet=a.find('div', attrs={'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
    dates.append(datetweet.text)
    replies.append(replytweet.text)
    comments.append(commenttweet.text)
df = pd.DataFrame({'Date of Tweet':dates,'Replying to':replies,'Tweet':comments})
df.to_csv('tweets.csv', index=False, encoding='utf-8')
print(df)
Solution
I found two problems:
- The page uses JavaScript to add elements, and JavaScript may need time to add all of them to the HTML. You may need time.sleep(...) before you get driver.page_source, or use waits in Selenium to wait for some elements (before you get driver.page_source). See the sketch after this list.
- HTML doesn't use <div href="...">, so your findAll('div', href=True, ...) is wrong. You have to remove href=True.
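For illustration, here is a minimal sketch that combines both fixes: an explicit Selenium wait instead of reading page_source right away, and findAll without href=True. It assumes chromedriver is available on your PATH, and note that auto-generated classes like css-1dbjc4n change often, so that selector may already be stale:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on PATH

driver.get("https://twitter.com/BBCWorld/status/1534777385249390593")

# wait up to 20 seconds for at least one tweet <article> to be added by JavaScript,
# instead of reading page_source immediately
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, '//article[@data-testid="tweet"]'))
)

soup = BeautifulSoup(driver.page_source, 'html.parser')

# a <div> has no href attribute, so href=True is removed from the original call
containers = soup.find_all('div', attrs={'class': 'css-1dbjc4n'})
print(len(containers))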
EDIT:
Here is the code I created. It still needs to scroll the page to load more tweets, and later it may also need to click Show more replies to get even more of them (see the sketch after the code).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
import pandas as pd
import time

#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))

driver.get("https://twitter.com/BBCWorld/status/1534777385249390593")
time.sleep(10)

# TODO: scroll page to get more tweets
#for _ in range(2):
#    last = driver.find_elements(By.XPATH, '//div[@data-testid="cellInnerDiv"]')[-1]
#    driver.execute_script("arguments[0].scrollIntoView(true)", last)
#    time.sleep(3)

all_tweets = driver.find_elements(By.XPATH, '//div[@data-testid]//article[@data-testid="tweet"]')

tweets = []

print(len(all_tweets)-1)

for item in all_tweets[1:]:  # skip first tweet because it is BBC tweet
    #print('--- item ---')
    #print(item.text)

    print('--- date ---')
    try:
        date = item.find_element(By.XPATH, './/time').text
    except:
        date = '[empty]'
    print(date)

    print('--- text ---')
    try:
        text = item.find_element(By.XPATH, './/div[@data-testid="tweetText"]').text
    except:
        text = '[empty]'
    print(text)

    print('--- replying_to ---')
    try:
        replying_to = item.find_element(By.XPATH, './/div[contains(text(), "Replying to")]//a').text
    except:
        replying_to = '[empty]'
    print(replying_to)

    tweets.append([date, replying_to, text])

df = pd.DataFrame(tweets, columns=['Date of Tweet', 'Replying to', 'Tweet'])
df.to_csv('tweets.csv', index=False, encoding='utf-8')
print(df)
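As a follow-up on the TODO above, here is a hedged sketch of the scrolling step. The data-testid="cellInnerDiv" locator is reused from the commented-out code; the Show more replies XPath is an assumption and may need adjusting to the current page markup:
import time
from selenium.webdriver.common.by import By

def scroll_to_load_all(driver, pause=3, max_rounds=20):
    # scroll the last loaded cell into view until no new cells appear
    seen = 0
    for _ in range(max_rounds):
        cells = driver.find_elements(By.XPATH, '//div[@data-testid="cellInnerDiv"]')
        if not cells or len(cells) == seen:
            break  # nothing new was loaded, assume we reached the end of the thread
        seen = len(cells)
        driver.execute_script("arguments[0].scrollIntoView(true)", cells[-1])
        time.sleep(pause)
        # assumed selector: a button whose visible text is "Show more replies"
        buttons = driver.find_elements(
            By.XPATH, '//div[@role="button"][.//span[text()="Show more replies"]]')
        for button in buttons:
            try:
                button.click()
                time.sleep(pause)
            except Exception:
                pass  # button may be stale or hidden; keep scrolling
Calling scroll_to_load_all(driver) after the initial time.sleep(10) and before the all_tweets query should load most of the remaining replies before scraping.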
Answered By - furas