JoGorska/trustpilot-reviews

Trustpilot reviews

The quest

  1. Scrape company details from https://www.cylex-uk.co.uk/

  2. Find those companies on https://uk.trustpilot.com/

  3. Grab the review summary and a handful of reviews

  4. Put them into a database AND an excel file.

  5. Send us the results & the code used to get them

  6. repository link here

  7. repository link here

Initial set up

Create a virtual environment and install the packages listed in requirements.txt so that the program works correctly.

4. Trustpilot

What data I need:

  • company name - This can be taken from the search on Yell

  • review summary - the number next to the stars, which combines the average review rating with other factors that Trustpilot takes into consideration

<p class="typography_typography__QgicV typography_bodysmall__irytL typography_color-gray-7__9Ut3K typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3" data-rating-typography="true">3.6</p>
  • handful of reviews - each review is in the container:

    one review container: '

    '

    smaller review container for h1 and p only: '

    '

  • review header: anchor tag with review header text:

<a href="/reviews/5e88aa23086b6409542fb601" rel="nofollow" target="_self" class="link_internal__7XN06 link_wrapper__5ZJEx styles_linkwrapper__73Tdy">If you are in need of an hones tGarage, look no further.</a>

h2 tag containing anchor tag with the review header text:

<h2 class="typography_typography__QgicV typography_h4__E971J typography_color-black__5LYEn typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3 styles_reviewTitle__04VGJ" data-service-review-title-typography="true">
    <a href="/reviews/5e88aa23086b6409542fb601" rel="nofollow" target="_self" class="link_internal__7XN06 link_wrapper__5ZJEx styles_linkwrapper__73Tdy">If you are in need of an hones tGarage, look no further.
    </a>
</h2>
  • the content of the review - description / story
<p class="typography_typography__QgicV typography_body__9UBeQ typography_color-black__5LYEn typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3" data-service-review-text-typography="true">
    If you are in need of an honest, reliable and friendly garage, you found it.<br>Booked my car in for the second year for a major service and MOT this time.<br>I was not disappointed.<br>The service was as thorough as can be expected and the MOT flagged up an advisory regarding my tyres.<br>not a big surprise to me, since I did get a lot of use out of them. <br>So, next week the tyres are getting fitted and I get them to check my wheel alignment too, since the front n/s seems to be worn unevenly. the tyres are going to be an absolute bargain. Under £50 per tyre on an HYUNDAI I30.<br>And regarding the video footage.<br>Yes, true. this guys and girls will provide you with the evidence as well, since they are an honest bunch who will not tell you a lot of BS.<br>Will definitely use them again next time.
</p>
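
The fields listed above all carry a `data-*` attribute in the sample HTML, which is probably a steadier hook than the hashed class names. The sketch below is a minimal, dependency-free illustration using Python's standard `html.parser`; the `ReviewParser` class, the `FIELDS` mapping and the shortened `sample` markup are hypothetical names and data introduced for this example, not code from this repository.

```python
from html.parser import HTMLParser

# Data attributes copied from the HTML samples above. The class names are
# ignored on purpose: their hashed suffixes look auto-generated.
FIELDS = {
    "data-rating-typography": "rating",
    "data-service-review-title-typography": "title",
    "data-service-review-text-typography": "text",
}


class ReviewParser(HTMLParser):
    """Collects the rating, review titles and review texts from a page."""

    def __init__(self):
        super().__init__()
        self.rating = None
        self.titles = []
        self.texts = []
        self._field = None  # field currently being captured, if any
        self._tag = None    # tag that opened the capture
        self._depth = 0     # nesting depth of that same tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._field is None:
            for name, _value in attrs:
                if name in FIELDS:
                    self._field, self._tag = FIELDS[name], tag
                    self._depth, self._chunks = 1, []
                    return
        elif tag == self._tag:
            self._depth += 1
        elif tag == "br":
            self._chunks.append(" ")  # keep words apart across <br>

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._field is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            value = " ".join("".join(self._chunks).split())
            if self._field == "rating":
                self.rating = float(value)
            elif self._field == "title":
                self.titles.append(value)
            else:
                self.texts.append(value)
            self._field = None


# Shortened, illustrative markup based on the samples above.
sample = (
    '<p data-rating-typography="true">3.6</p>'
    '<h2 data-service-review-title-typography="true">'
    '<a href="/reviews/5e88aa23086b6409542fb601">'
    "If you are in need of an hones tGarage, look no further.</a></h2>"
    '<p data-service-review-text-typography="true">'
    "I was not disappointed.<br>Will definitely use them again next time.</p>"
)

parser = ReviewParser()
parser.feed(sample)
```

In a Scrapy spider the same attributes could be targeted with CSS attribute selectors, for example `response.css("p[data-rating-typography]::text").get()`.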
   

Locate companies with Trustpilot profiles

Companies can be located with the search form on Trustpilot, which accepts either a company name or a company website. This might require installing Selenium or submitting the form with Scrapy.

Another way is to use the sitemaps listed in robots.txt:

  • use Scrapy's sitemap support to go through each page
  • get the whole list of domains from the Trustpilot sitemaps, extract the company domains and cross-reference them with the list of domains obtained from Yell, then scrape only the Trustpilot review pages of companies that also exist in Yell

A sample of Trustpilot review page URLs from the sitemaps:

<url><loc>https://uk.trustpilot.com/review/firstcharterbus.com/location/little-rock</loc></url>
<url><loc>https://uk.trustpilot.com/review/www.brewers.co.uk/location/farnborough</loc></url>
<url><loc>https://uk.trustpilot.com/review/www.fiveguys.co.uk/location/solihull</loc></url>
<url><loc>https://uk.trustpilot.com/review/eurocell.co.uk/location/hayes</loc></url>
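
The cross-referencing step can be sketched as follows. This is a minimal illustration using only the standard library, assuming each sitemap has been saved as XML text; `normalise`, `review_domains`, and the `yell_websites` list are hypothetical names and inputs made up for the example (only the sitemap URLs and the Central Autopoint address come from this README).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def normalise(value):
    """Reduce a URL or bare domain to a lowercase host without 'www.'."""
    value = value.strip().lower()
    if "://" not in value:
        value = "//" + value  # lets urlparse treat a bare domain as a host
    host = urlparse(value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def review_domains(sitemap_xml):
    """Map each normalised company domain to its Trustpilot review URL."""
    domains = {}
    for element in ET.fromstring(sitemap_xml).iter():
        # Tags may carry the sitemap namespace, so match on the suffix.
        if not element.tag.endswith("loc") or not element.text:
            continue
        url = urlparse(element.text.strip())
        parts = url.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] == "review":
            profile = f"{url.scheme}://{url.netloc}/review/{parts[1]}"
            domains.setdefault(normalise(parts[1]), profile)
    return domains


sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://uk.trustpilot.com/review/www.brewers.co.uk/location/farnborough</loc></url>
<url><loc>https://uk.trustpilot.com/review/eurocell.co.uk/location/hayes</loc></url>
</urlset>"""

# Illustrative Yell results; in the project these would come from the Yell spider.
yell_websites = ["http://www.brewers.co.uk/", "http://www.motcorby.co.uk/"]

trustpilot = review_domains(sitemap_xml)
matches = {
    normalise(site): trustpilot[normalise(site)]
    for site in yell_websites
    if normalise(site) in trustpilot
}
```

Only the URLs in `matches` would then need to be scraped. A company whose Trustpilot profile is registered under a different domain than the one listed on Yell will not be matched by this approach.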

I worry that Trustpilot might have some kind of protective measures set up against spiders, since robots.txt disallows spiders on review pages. I want to avoid scraping every review page, so I will only scrape the review pages that I need.

I have downloaded all 9 sitemaps containing domain lists.

Issues

  • the reviews URL is disallowed by robots.txt, so to scrape it I needed to change the settings

  • I tried changing 'USER_AGENT' to the one found in Chrome dev tools on the Trustpilot page, but this was rejected

  • I changed the setting to 'ROBOTSTXT_OBEY = False', which worked

  • when I created a second item (reviews) in items.py, the first item generated an error. I am not sure how to change the pipeline so that it handles two different items from items.py. The safest solution seems to be to start another project aimed at Trustpilot

  • this gives me the possibility to get more companies out of Yell by creating more spiders for other counties or categories; they would all return the same item

  • after running a for loop to compare the results between Trustpilot and Yell, it didn't return my example company

Central Autopoint in Corby: the real website address, 'http://www.motcorby.co.uk/', is entered correctly in Yell, but the Trustpilot profile lists 'www.centralautopoint.com', which is not the real website address.
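
For the two-item pipeline problem listed above, one possible approach is to branch on the item type inside `process_item`, the method Scrapy calls on a pipeline for every item a spider yields. The sketch below is a hedged illustration, not this project's code: `CompanyItem`, `ReviewItem`, their fields and `StoragePipeline` are hypothetical, and plain lists stand in for the database and Excel writers.

```python
from dataclasses import asdict, dataclass


@dataclass
class CompanyItem:  # hypothetical item for the Yell results
    name: str
    website: str


@dataclass
class ReviewItem:  # hypothetical item for one Trustpilot review
    company_website: str
    title: str
    text: str


class StoragePipeline:
    """Routes each item type to its own store."""

    def open_spider(self, spider):
        # Stand-ins for a database table or spreadsheet sheet per item type.
        self.companies = []
        self.reviews = []

    def process_item(self, item, spider):
        if isinstance(item, CompanyItem):
            self.companies.append(asdict(item))
        elif isinstance(item, ReviewItem):
            self.reviews.append(asdict(item))
        return item  # hand the item on to any later pipeline


# Exercising the pipeline directly, without running a spider.
pipeline = StoragePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    CompanyItem(name="Central Autopoint", website="http://www.motcorby.co.uk/"),
    spider=None,
)
pipeline.process_item(
    ReviewItem(
        company_website="www.centralautopoint.com",
        title="If you are in need of an hones tGarage, look no further.",
        text="I was not disappointed.",
    ),
    spider=None,
)
```

With this shape, both spiders could share one project and one pipeline, at the cost of the pipeline knowing about every item type.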

Robots.txt

Robots.txt file can be found here
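
Whether a given URL is covered by a robots.txt rule can also be checked offline with Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` below are an assumption written only to mirror the statement in Issues that review URLs are disallowed; they are not a copy of Trustpilot's actual file, and `allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules for illustration only, not Trustpilot's real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /review/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="*"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return robots.can_fetch(user_agent, url)


review_ok = allowed("https://uk.trustpilot.com/review/eurocell.co.uk")
home_ok = allowed("https://uk.trustpilot.com/")
```

Under these assumed rules `review_ok` is false and `home_ok` is true, which matches the behaviour described in Issues, where review pages were only fetched after `ROBOTSTXT_OBEY` was set to `False`.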

Sitemap

  • the sitemap contains the list of all companies' review pages. Each page URL contains the full domain of the company

Special Thanks to:

About

Scrapy app to get trustpilot reviews
