Issue
This is my code:
from bs4 import BeautifulSoup
import requests, lxml
import re
from urllib.parse import urljoin
from googlesearch import search
import pandas as pd

query = 'A M C College of Engineering, Bangalore'
link = []
for i in search(query, tld='co.in', start=0, stop=1):
    print(i)
    soup = BeautifulSoup(requests.get(i).text, 'lxml')
    for link in soup.select("a[href$='.pdf']"):
        if re.search(r'nirf', str(link), flags=re.IGNORECASE):
            fUrl = urljoin(i, link['href'])
            print(fUrl)
            link.append(fUrl)

print(link)
df = pd.DataFrame(link, columns=['PDF LINKS'])
print(df)
Here is my output after running the code:
https://www.amcgroup.edu.in/AMCEC/index.php
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf
# Printing the list: expected links, but got a tag instead
<a href="image/gallery/Swami Vivekananda.pdf" target="_black">For Invitation Click here...</a>

# The dataframe where I want to store the list
  PDF LINKS
0  For Invitation Click here...
I should get the list of links shown in the output, but when I print the list I get the whole tag instead of the links. Also, I want to push all the links I got for a query into a single row of the dataframe, like this:

  PDF LINKS
0 link1 link2 link3   # for query1
1 link1 link2         # for another query

How can I achieve this? And what is the problem with my code that I am getting a tag instead of the list? Thanks in advance.
Solution
Use a different variable name for the list and for the tag in the for-loop. In your code, `link` starts out as a list, but `for link in soup.select(...)` rebinds the same name to each `<a>` tag. Inside the loop, `link.append(fUrl)` therefore calls BeautifulSoup's `Tag.append()` (which adds a child to the tag) rather than `list.append()`, so no error is raised, and after the loop `link` is the last matched tag, not your list of URLs. That is why `print(link)` and the dataframe show the tag's contents.
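Here is a minimal sketch of the same shadowing pattern in isolation (the names below are illustrative, not from the original code):

links = []                    # `links` is a list...
for links in ["<a>", "<b>"]:  # ...but the loop rebinds the same name to each item
    pass
print(links)                  # prints "<b>": the list object is no longer reachable

The fix is simply to give the list its own name (`all_data` below):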
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

query = "A M C College of Engineering, Bangalore"

all_data = []
for i in ["https://www.amcgroup.edu.in/AMCEC/index.php"]:
    soup = BeautifulSoup(requests.get(i).text, "lxml")

    for link in soup.select("a[href$='.pdf']"):  # <-- `link` is different than `all_data` here!
        if re.search(r"nirf", link["href"], flags=re.IGNORECASE):
            fUrl = urljoin(i, link["href"])
            all_data.append(fUrl)

df = pd.DataFrame(all_data, columns=["PDF LINKS"])
print(df)
Prints:
                                                        PDF LINKS
0  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf
1  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf
2  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf
3  https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf
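The hard-coded URL list above stands in for the Google search while testing. If you want to search again, the loop header can be swapped back for the same `search()` call your question already uses (this assumes the same `googlesearch` package and the `tld`/`start`/`stop` parameters from your original code):

from googlesearch import search

for i in search(query, tld='co.in', start=0, stop=1):  # first Google hit for `query`
    soup = BeautifulSoup(requests.get(i).text, 'lxml')
    # ...rest of the loop body unchanged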
EDIT: To have all the links from one page in a single row:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

query = "A M C College of Engineering, Bangalore"

all_data = []
for i in ["https://www.amcgroup.edu.in/AMCEC/index.php"]:
    soup = BeautifulSoup(requests.get(i).text, "lxml")

    row = []
    for link in soup.select("a[href$='.pdf']"):  # <-- `link` is different than `all_data` here!
        if re.search(r"nirf", link["href"], flags=re.IGNORECASE):
            fUrl = urljoin(i, link["href"])
            row.append(fUrl)

    if row:
        all_data.append(row)

df = pd.DataFrame({"PDF LINKS": all_data})
print(df)
Prints:
                                           PDF LINKS
0  [https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf]
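If you would rather have each cell be a plain space-separated string ("link1 link2 link3"), as in the layout sketched in the question, one small variation (my suggestion, not part of the original answer) is to join each row before storing it:

if row:
    all_data.append(" ".join(row))  # one space-separated string per page, instead of a list

df = pd.DataFrame({"PDF LINKS": all_data})
print(df)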
Answered By - Andrej Kesely