Issue
So I'm scraping TripAdvisor to get some information and here's one of the lists that I have:
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel. Great Staff. Wonderful walking tour with David.</span></span></a></div>
I basically want to get rid of everything but the links (e.g. /ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html).
What's the easiest way to do this in Python?
Here's a screenshot of the code of one of the review pages, if it helps: [screenshot: TripAdvisor page source]
Solution
Perfect job for BeautifulSoup:
import re
from bs4 import BeautifulSoup
html = """
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div class="fpMxB MC _S b S6 H5 _a" data-test-target="review-title" dir="ltr"><a class="fCitC" dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel. Great Staff. Wonderful walking tour with David.</span></span></a></div>
"""
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(html, "html.parser")
links = []
# Match any review link, not only the Sheraton ones, so the Corpo Santo links are kept too.
pattern = re.compile(r"/ShowUserReviews-.*\.html")
for a in soup.find_all("a"):
    href = a.get("href", "")
    if pattern.match(href):
        links.append(href)
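As an alternative sketch, not part of the original answer: since every title sits in a div with data-test-target="review-title", a CSS selector can pick the links without a regex. The function name extract_review_links, the shortened sample markup, and the base URL https://www.tripadvisor.com are assumptions made for illustration; the base URL is only used to turn the relative hrefs into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the site's base URL, used only to build absolute links.
BASE_URL = "https://www.tripadvisor.com"


def extract_review_links(html, base_url=None):
    """Return the href of every <a> inside a review-title <div>.

    Hypothetical helper for illustration. If base_url is given,
    relative hrefs are resolved against it.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Select anchors that have an href and sit inside a review-title div.
    anchors = soup.select('div[data-test-target="review-title"] a[href]')
    hrefs = [a["href"] for a in anchors]
    if base_url:
        hrefs = [urljoin(base_url, h) for h in hrefs]
    return hrefs


# Shortened sample: two review titles plus one unrelated link.
sample = """
<div data-test-target="review-title"><a href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div data-test-target="review-title"><a href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div class="other"><a href="/Unrelated.html">skip me</a></div>
"""

for link in extract_review_links(sample, BASE_URL):
    print(link)
```

Selecting on data-test-target instead of the class names (fpMxB, fCitC, ...) is a deliberate choice here: those class names look auto-generated and may change between page builds, although the attribute could change as well.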
Answered By - Code Different