Issue
This is my code:
from bs4 import BeautifulSoup
import requests, lxml
import re
from urllib.parse import urljoin
from googlesearch import search
import pandas as pd

query = 'A M C College of Engineering, Bangalore'
link = []
for i in search(query, tld='co.in', start=0, stop=1):
    print(i)
    soup = BeautifulSoup(requests.get(i).text, 'lxml')
    for link in soup.select("a[href$='.pdf']"):
        if re.search(r'nirf', str(link), flags=re.IGNORECASE):
            fUrl = urljoin(i, link['href'])
            print(fUrl)
            link.append(fUrl)

print(link)
df = pd.DataFrame(link, columns=['PDF LINKS'])
print(df)
Here is my output after running the code:
https://www.amcgroup.edu.in/AMCEC/index.php
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf
# Printing the list: expected links, but got a tag instead
<a href="image/gallery/Swami Vivekananda.pdf" target="_black">For Invitation Click here...</a>

# The dataframe where I want to store the list
  PDF LINKS
0  For Invitation Click here...
I should get the list of links shown in the output, but when I print the list I get the whole tag instead of the links. Also, I want to push all the links I got for a query into a single row of the dataframe, like this:

  PDF LINKS
0 link1 link2 link3   # for query1
1 link1 link2         # for another query

How can I achieve this? And what is the problem with my code that I am getting a tag instead of the list? Thanks in advance.
Solution
Use a different variable name for the list and for the tag in the for-loop. In your code, `link` starts out as a list, but `for link in soup.select(...)` rebinds the same name to each `<a>` tag. Inside the loop, `link.append(fUrl)` therefore calls BeautifulSoup's `Tag.append()` (which adds a child to the tag) rather than `list.append()`, so no error is raised, and after the loop `link` is the last matched tag, not your list of URLs. That is why `print(link)` and the dataframe show the tag's contents.
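Here is a minimal sketch of the same shadowing pattern in isolation (the names below are illustrative, not from the original code):

links = []                    # `links` is a list...
for links in ["<a>", "<b>"]:  # ...but the loop rebinds the same name to each item
    pass
print(links)                  # prints "<b>": the list object is no longer reachable

The fix is simply to give the list its own name (`all_data` below):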
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

query = "A M C College of Engineering, Bangalore"

all_data = []
for i in ["https://www.amcgroup.edu.in/AMCEC/index.php"]:
    soup = BeautifulSoup(requests.get(i).text, "lxml")

    for link in soup.select("a[href$='.pdf']"):  # <-- `link` is different than `all_data` here!
        if re.search(r"nirf", link["href"], flags=re.IGNORECASE):
            fUrl = urljoin(i, link["href"])
            all_data.append(fUrl)

df = pd.DataFrame(all_data, columns=["PDF LINKS"])
print(df)
Prints:
                                                        PDF LINKS
0  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf
1  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf
2  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf
3  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf
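The hard-coded URL list above stands in for the Google search while testing. If you want to search again, the loop header can be swapped back for the same `search()` call your question already uses (this assumes the same `googlesearch` package and the `tld`/`start`/`stop` parameters from your original code):

from googlesearch import search

for i in search(query, tld='co.in', start=0, stop=1):  # first Google hit for `query`
    soup = BeautifulSoup(requests.get(i).text, 'lxml')
    # ...rest of the loop body unchanged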
EDIT: To have all the links from one page in a single row:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

query = "A M C College of Engineering, Bangalore"

all_data = []
for i in ["https://www.amcgroup.edu.in/AMCEC/index.php"]:
    soup = BeautifulSoup(requests.get(i).text, "lxml")

    row = []
    for link in soup.select("a[href$='.pdf']"):  # <-- `link` is different than `all_data` here!
        if re.search(r"nirf", link["href"], flags=re.IGNORECASE):
            fUrl = urljoin(i, link["href"])
            row.append(fUrl)

    if row:
        all_data.append(row)

df = pd.DataFrame({"PDF LINKS": all_data})
print(df)
Prints:
                                           PDF LINKS
0  [https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf]
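If you would rather have each cell be a plain space-separated string ("link1 link2 link3"), as in the layout sketched in the question, one small variation (my suggestion, not part of the original answer) is to join each row before storing it:

if row:
    all_data.append(" ".join(row))  # one space-separated string per page, instead of a list

df = pd.DataFrame({"PDF LINKS": all_data})
print(df)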
Answered By - Andrej Kesely