Issue
Here is my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome(service=Service(executable_path=ChromeDriverManager().install()))
driver.maximize_window()
driver.get('https://quotes.toscrape.com/')

df = pd.DataFrame(
    {
        'Quote': [''],
        'Author': [''],
        'Tags': [''],
    }
)

quotes = driver.find_elements(By.CSS_SELECTOR, '.quote')
for quote in quotes:
    text = quote.find_element(By.CSS_SELECTOR, '.text')
    author = quote.find_element(By.CSS_SELECTOR, '.author')
    tags = quote.find_elements(By.CSS_SELECTOR, '.tag')
    for tag in tags:
        quote_tag = tag
    df = df.append(
        {
            'Quote': text.text,
            'Author': author.text,
            'Tags': quote_tag.text,
        },
        ignore_index = True
    )

df.to_csv('C:/Users/Jay/Downloads/Python/!Learn/practice/scraping/selenium/quotes.csv', index=False)
I should be getting this result:
Quote | Author | Tags |
---|---|---|
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | Albert Einstein | change deep-thoughts thinking world |
Instead I'm getting this:
Quote | Author | Tags |
---|---|---|
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | Albert Einstein | world |
I'm getting just the last item in the Tags column instead of all four items.
If I run:
quotes = driver.find_elements(By.CSS_SELECTOR, '.quote')
for quote in quotes:
    tags = quote.find_elements(By.CSS_SELECTOR, '.tag')
    for tag in tags:
        quote_tag = tag
        print(quote_tag.text)
I get:
change
deep-thoughts
thinking
world
etc
So that piece of code works.
Why isn't the Tags column being populated appropriately?
Solution
For your loop, use this code:
quote_tags = []
for tag in tags:
    quote_tags.append(tag.text)
df = df.append(
    {
        'Quote': text.text,
        'Author': author.text,
        'Tags': ' '.join(quote_tags),
    },
    ignore_index = True
)
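One caveat about the snippet above: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install it raises AttributeError. Here is a minimal sketch of the same idea without it, assuming you collect one dict per quote in a list and build the DataFrame once at the end. The scraped list and the build_quotes_frame helper are hypothetical stand-ins for the Selenium elements, so the example runs without a browser:

```python
import pandas as pd


def build_quotes_frame(scraped):
    """Build one row per quote from (text, author, tags) tuples."""
    rows = []
    for text, author, tags in scraped:
        rows.append(
            {
                'Quote': text,
                'Author': author,
                # Join every tag so none of them is lost.
                'Tags': ' '.join(tags),
            }
        )
    # Name the columns explicitly so an empty input still has headers.
    return pd.DataFrame(rows, columns=['Quote', 'Author', 'Tags'])


# Hypothetical stand-in for what the Selenium loop would collect.
scraped = [
    (
        '“The world as we have created it is a process of our thinking. '
        'It cannot be changed without changing our thinking.”',
        'Albert Einstein',
        ['change', 'deep-thoughts', 'thinking', 'world'],
    ),
]

df = build_quotes_frame(scraped)
print(df.loc[0, 'Tags'])  # change deep-thoughts thinking world
```

In the real scraper you would append (text.text, author.text, [tag.text for tag in tags]) to the list inside the for quote in quotes: loop and call df.to_csv(...) as before. Building the frame this way also avoids the blank first row that the initial pd.DataFrame({'Quote': [''], ...}) creates.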
Notice that the only tag being added (world) happens to be the very last tag, and that's not a coincidence. Your loop assigns each tag to the quote_tag variable but never does anything with it, so each iteration just overwrites the value set by the previous one. When the loop is over, quote_tag holds only the last tag. Collecting each tag's text in a list and joining the list afterwards keeps all of them.
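To see the difference in isolation, here is a small sketch with no Selenium involved. The tag_texts list is a hypothetical stand-in for the text of the four tag elements on the first quote:

```python
# Stand-in for the text of the four tag elements on the first quote.
tag_texts = ['change', 'deep-thoughts', 'thinking', 'world']

# Pattern from the question: every pass rebinds the same name,
# so only the value from the final pass survives the loop.
for tag_text in tag_texts:
    quote_tag = tag_text
last_only = quote_tag

# Pattern from the answer: every pass adds to a list,
# and the list is joined once the loop is done.
quote_tags = []
for tag_text in tag_texts:
    quote_tags.append(tag_text)
all_tags = ' '.join(quote_tags)

print(last_only)  # world
print(all_tags)   # change deep-thoughts thinking world
```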
Answered By - richardec