How to Scrape Amazon for your Data Science Project

If you are interested in anything that is sold nowadays, there is no way around Amazon. If your data science project requires product data, you may wonder how to access Amazon's catalogue programmatically. Put simply, you have two options: talk to the Amazon product API or scrape the website directly.

Average review, date of publication: the results on Amazon include a lot of interesting metadata.

Why the Amazon API may not be the right tool for you

If you can get access to the Amazon product API – great, use it! However, this isn’t as straightforward and reliable as one might think. You need an active partner account, and people actually need to purchase things through your links for your API key to keep working.

Of course, it makes sense. Amazon does not need to feed hungry data scientists through an open API. This interface is designed to drive retail business, so it’s supposed to be used by eCommerce sites and the like. From what I can tell, this may not have been much of an issue in the past, but when I tried my old API key from when I still had referral links online, it wasn’t working anymore:

{"__type":"com.amazon.paapi5#TooManyRequestsException","Errors":[{"Code":"TooManyRequests","Message":"The request was denied due to request throttling. Please verify the number of requests made per second to the Amazon Product Advertising API."}]}

Fair enough, let’s try the more exciting route and scrape the website instead.

Why scraping from the terminal may not work for you

If you’re coming from any kind of data science background, your tool of choice is probably Python, so you fire up a notebook and grab a current copy from an Amazon product search. But behold, what’s this? Ah, we look like a bot, fair enough.

Unless you want to teach your scraper how to fill in captchas, a headless scraper may not work on many modern websites.

We may be able to get away with setting the user agent string and faking a user session somehow. But why not pause on Python and delve into JavaScript land again?
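
For completeness, here is a minimal sketch of what "setting the user agent string" could look like with nothing but Python's standard library. The search URL and the user agent value are assumptions chosen for illustration, not values from my actual experiments, and there is no guarantee that this alone gets you past the bot detection.

```python
from urllib.request import Request, urlopen

# Hypothetical values for illustration; neither is taken from the original experiment.
SEARCH_URL = "https://www.amazon.com/s?k=tolkien"
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"
)


def build_request(url: str, user_agent: str = BROWSER_USER_AGENT) -> Request:
    """Create a GET request that announces a browser-like user agent."""
    headers = {
        "User-Agent": user_agent,
        "Accept-Language": "en-US,en;q=0.9",
    }
    return Request(url, headers=headers)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page as text. The site may still answer with a captcha page."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch_html() does.
request = build_request(SEARCH_URL)
print(request.get_header("User-agent"))
```

Calling `fetch_html(SEARCH_URL)` would perform the actual download. Faking a full user session (cookies and so on) is a separate problem that this sketch does not attempt.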

How to scrape right from your browser

When looking for a simple web scraper, I found artoo.js: an older but cute little JavaScript project to scrape right from your browser.

And yes, it works! artoo.js works through a simple bookmarklet, and then with your own scripts right inside the browser console. The scraped results are downloaded as CSV or JSON.

When writing a scraper, I can recommend writing some debug CSS to highlight the data you want to select.

I spent some time this weekend creating a scraping script that fetches a number of parameters for all items in an Amazon book search. A result then looks like this:

{
  "title": "Unfinished Tales",
  "url": "/Unfinished-Tales-J-R-Tolkien/dp/B00A2M4VZG/ref=sr_1_90?dchild=1&keywords=tolkien&qid=1589053005&s=books&sr=1-90",
  "img": "https://m.media-amazon.com/images/I/41YaQLNWFWL._AC_UY218_.jpg",
  "author": "J. R. R. Tolkien",
  "published": "Jan 1, 1979",
  "rating": 5,
  "reviews": 1,
  "price_1": 21.74,
  "price_2": null,
  "price_3": null
}
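
Once the results are downloaded, a little post-processing makes them easier to analyse. The following is a rough sketch, not part of the scraping script itself: it turns the relative URL into an absolute one, parses the publication date, and picks the lowest available price. The base URL `https://www.amazon.com` and the helper name `clean_record` are assumptions made for this example.

```python
import json
from datetime import datetime
from urllib.parse import urljoin, urlsplit

# Assumption: the relative URLs belong to amazon.com.
BASE_URL = "https://www.amazon.com"

# The example record from above, as it would appear in the JSON export.
raw_record = """
{
  "title": "Unfinished Tales",
  "url": "/Unfinished-Tales-J-R-Tolkien/dp/B00A2M4VZG/ref=sr_1_90?dchild=1&keywords=tolkien&qid=1589053005&s=books&sr=1-90",
  "img": "https://m.media-amazon.com/images/I/41YaQLNWFWL._AC_UY218_.jpg",
  "author": "J. R. R. Tolkien",
  "published": "Jan 1, 1979",
  "rating": 5,
  "reviews": 1,
  "price_1": 21.74,
  "price_2": null,
  "price_3": null
}
"""


def clean_record(record: dict) -> dict:
    """Return a copy of one scraped record with analysis-friendly fields."""
    cleaned = dict(record)
    # Drop the search-specific query string and make the URL absolute.
    cleaned["url"] = urljoin(BASE_URL, urlsplit(record["url"]).path)
    # "Jan 1, 1979" -> datetime.date(1979, 1, 1); assumes English month names.
    cleaned["published"] = datetime.strptime(record["published"], "%b %d, %Y").date()
    # Keep the lowest of the price columns that actually hold a value.
    prices = [record[key] for key in ("price_1", "price_2", "price_3")
              if record[key] is not None]
    cleaned["min_price"] = min(prices) if prices else None
    return cleaned


book = clean_record(json.loads(raw_record))
print(book["title"], book["published"].isoformat(), book["min_price"])
```

For a full export, the same function could be applied to every entry of the downloaded JSON list.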

Want to scrape books on Amazon? Find the code on GitHub: https://github.com/florianletsch/scrape-amazon-books

Attending a Remote Hackathon

2020 is the year of doing things remotely. It was therefore my home office and a healthy internet connection that provided the space to participate in the AI for Good hackathon last weekend, organized by Deep Berlin. The task description was broad, but it pointed the teams towards working on something related to climate change, specifically the occurrence of wildfires.

As a team of four, we spent the weekend looking at the relation between human activity and wildfires. We focused on data about tourist activity in Northern Spain, an area that has seen intense wildfire seasons in the past.

Final presentation video

(Excuse the nervous beginning, anyone who has attended a hackathon before will be familiar with the last minute push, in this case submitting a final video to the hackathon organizers on time.)

Some notes

A few thoughts on what we did and what I took away from the weekend.

Pandas and scikit-learn

In my day job, I mostly work with Python and am familiar with deep learning libraries like PyTorch and TensorFlow/Keras. The hackathon was a welcome opportunity to do some hands-on data science work again, and I enjoyed using Pandas and scikit-learn for quick data analysis and plotting. What a nice ecosystem.

Free location data

OpenStreetMap is an amazing community project providing labeled location data from all around the globe. OpenStreetMap location data is provided in the osm format. To read these files in Python, we used the osmium package. Reading the file and filtering the nodes for our use case was straightforward, but loading from that format can take surprisingly long.
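
To give an idea of what "filtering the nodes" means, here is a small sketch. Note that it is not our hackathon code and does not use osmium: it reads the XML flavour of the osm format with Python's standard library so that it runs without extra dependencies. The sample data is made up, and filtering on the `tourism` tag key is an assumption about what a tourism-related use case might look for.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in OSM XML; ids, coordinates and names are made up.
SAMPLE_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <node id="1" lat="42.8806" lon="-8.5457">
    <tag k="tourism" v="hostel"/>
    <tag k="name" v="Example Hostel"/>
  </node>
  <node id="2" lat="42.8800" lon="-8.5400"/>
  <node id="3" lat="43.0100" lon="-7.5600">
    <tag k="tourism" v="viewpoint"/>
  </node>
</osm>
"""


def iter_tagged_nodes(source, key):
    """Yield (id, lat, lon, value) for every node that carries the tag `key`."""
    # iterparse streams the file instead of building the whole tree up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "node":
            continue
        tags = {tag.get("k"): tag.get("v") for tag in elem.findall("tag")}
        if key in tags:
            yield (
                int(elem.get("id")),
                float(elem.get("lat")),
                float(elem.get("lon")),
                tags[key],
            )
        elem.clear()  # drop the node's children once it has been processed


tourism_nodes = list(iter_tagged_nodes(io.BytesIO(SAMPLE_OSM), "tourism"))
for node in tourism_nodes:
    print(node)
```

For a real extract, `source` would be the path to the downloaded file instead of the in-memory sample. A dedicated reader such as osmium is the better choice for large files, and it also handles the binary formats that this sketch cannot read.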

Free geo data

Once you start looking, you discover some interesting, freely available datasets out there. We used the MOD14A1 dataset, which provides satellite data including very recent recordings (as recent as a few days ago), with access to multiple levels of abstraction in the data format.

Pretty maps in folium

Our team member Markus spent some time creating pleasing map visualisations in folium.

A map of north-western Spain with two data distributions shown. The yellow/red clusters show the locations of wildfires in the past 10 years. The blue outlined areas show locations of high touristic activity. Clearly visible is the Camino de Santiago which extends all the way through Galicia. Map rendered using folium.

Code and resources

Our results were interesting, but anecdotal. You can find the collective resources from our team on GitHub: https://github.com/florianletsch/fire-tourism

How does a remote hackathon feel?

What I valued during that weekend was being in my default work environment. Our team quickly developed a working rhythm where we would have a video call for 30 minutes, then disconnect and spend a focused two to three hours on our own. I’ve never experienced such a focused working environment at an on-site hackathon.

Obviously, the main shortcoming of being remote was not having the chance to talk to people outside your team, or to just bump into someone. Also, there was no way of passively observing what everyone was up to. From what I gathered on Slack, many teams never properly formed, and some stranded participants then tried to join other teams, which wasn’t as easy for them as it might have been in person.

Would I join a remote hackathon again? Yes, to really get something done in two days. For actively socializing, though, it isn’t the right thing for me.