Issue
I am trying to collect URLs from a webpage with RSelenium but am getting an InvalidSelector error.
I am using R 3.6.0 on a Windows 10 PC, with RSelenium 1.7.5 and the Chrome webdriver (chromever = "75.0.3770.8").
library(RSelenium)
rD <- rsDriver(browser=c("chrome"), chromever="75.0.3770.8")
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
remDr$navigate(url)
tt <- remDr$findElements(using = "xpath", "//a[contains(@href,'http://twitter.com/')]/@href")
I expect to collect the URLs of the Twitter accounts of the politicians listed. Instead, I get the following error:
Selenium message:
invalid selector: The result of the xpath expression "//a[contains(@href,'http://twitter.com/')]/@href" is: [object Attr]. It should be an element.
(Session info: chrome=75.0.3770.80)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/invalid_selector_exception.html
Build info: version: '4.0.0-alpha-1', revision: 'd1d3728cae', time: '2019-04-24T16:15:24'
System info: host: 'ALEX-DELL-17', ip: '10.0.75.1', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_191'
Driver info: driver.version: unknown
Error: Summary: InvalidSelector Detail: Argument was an invalid selector (e.g. XPath/CSS). class: org.openqa.selenium.InvalidSelectorException Further Details: run errorDetails method
When I make a similar search for a very specific element, everything works fine. For example:
tt <- remDr$findElement(value = '//a[@href = "http://twitter.com/AlboMP"]')
then
tt$getElementAttribute('href')
returns the URL I need.
What am I doing wrong?
Solution
I don't know anything about R, so I am posting an answer in Python first. Since this question is about R, I learned some R basics and am posting an R version as well.
The error occurs because Selenium's findElements can only return element nodes, while an XPath ending in /@href selects an attribute node. The easiest way to get the Twitter URLs is therefore to select all the links on the page, iterate over them, and check whether each href contains the word 'twitter'.
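The filtering step itself is plain string matching and can be sketched without Selenium at all, using a hypothetical list of href values standing in for the ones collected from the page:

```python
# Hypothetical href values as they might be collected from the page
hrefs = [
    "http://twitter.com/AlboMP",
    "https://www.aph.gov.au/Senators_and_Members",
    "https://twitter.com/terrimbutler",
]

# Keep only the links whose href contains the word 'twitter'
twitter_links = [h for h in hrefs if "twitter" in h]
print(twitter_links)
```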
In Python (which works fine):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96')
links = driver.find_elements_by_xpath("//a[@href]")
for link in links:
    if 'twitter' in link.get_attribute("href"):
        print(link.get_attribute("href"))
Result:
http://twitter.com/AlboMP
http://twitter.com/SharonBirdMP
http://twitter.com/Bowenchris
http://twitter.com/tony_burke
http://twitter.com/lindaburneymp
http://twitter.com/Mark_Butler_MP
https://twitter.com/terrimbutler
http://twitter.com/AnthonyByrne_MP
https://twitter.com/JEChalmers
http://twitter.com/NickChampionMP
https://twitter.com/LMChesters
http://twitter.com/JasonClareMP
https://twitter.com/SharonClaydon
https://www.twitter.com/LibbyCokerMP
https://twitter.com/JulieCollinsMP
http://twitter.com/fitzhunter
http://twitter.com/stevegeorganas
https://twitter.com/andrewjgiles
https://twitter.com/lukejgosling
https://www.twitter.com/JulianHillMP
http://twitter.com/stephenjonesalp
https://twitter.com/gedkearney
https://twitter.com/MikeKellyofEM
http://twitter.com/mattkeogh
http://twitter.com/PeterKhalilMP
http://twitter.com/CatherineKingMP
https://twitter.com/MadeleineMHKing
https://twitter.com/ALEIGHMP
https://twitter.com/RichardMarlesMP
https://twitter.com/brianmitchellmp
http://twitter.com/#!/RobMitchellMP
http://twitter.com/ShayneNeumannMP
https://twitter.com/ClareONeilMP
http://twitter.com/JulieOwensMP
http://www.twitter.com/GrahamPerrettMP
http://twitter.com/tanya_plibersek
http://twitter.com/AmandaRishworth
http://twitter.com/MRowlandMP
https://twitter.com/JoanneRyanLalor
http://twitter.com/billshortenmp
http://www.twitter.com/annewerriwa
http://www.twitter.com/stemplemanmp
https://twitter.com/MThistlethwaite
http://twitter.com/MariaVamvakinou
https://twitter.com/TimWattsMP
https://twitter.com/joshwilsonmp
In R (this may not be perfect, but it gives the idea):
library(XML)
library(RCurl)

url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
doc <- getURL(url)            # download the raw HTML
parser <- htmlParse(doc)      # parse it into a document tree
links <- xpathSApply(parser, "//a[@href]", xmlGetAttr, "href")  # all href values
for (link in links) {
  if (grepl("twitter", link)) {
    print(link)
  }
}
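To stay with RSelenium and fix the original selector directly: drop the trailing /@href from the XPath so it selects the <a> elements themselves, then read their href attributes afterwards. A minimal sketch, assuming the remDr session from the question is already open:

```r
# Select the <a> elements themselves; an XPath ending in /@href
# returns an attribute node, which Selenium rejects as an invalid selector
links <- remDr$findElements(using = "xpath",
                            "//a[contains(@href, 'twitter.com')]")

# getElementAttribute() returns a list, so unlist() the collected values
urls <- unlist(lapply(links, function(el) el$getElementAttribute("href")))
print(urls)
```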
I don't even know whether this code will work, but the idea is the same: get all the URLs on the page, iterate over them, and check whether the word 'twitter' appears in each. My answer is based on this
Answered By - Prasanth Ganesan