mid-semester-project-update.txt

Quinn Thompson 
IS310
Mid-Semester Project Data Update

For this project, I want to explore online shopping. To do this, I plan to scrape data from the Amazon website. Amazon keeps a list of its top 100 best-selling items for each of about 41 different categories, which I will scrape to create my data set. The categories sometimes appear to have subcategories, which then have subcategories of their own, but to keep the project manageable, I plan to scrape only the topmost categories, at least at first. Because these lists change constantly, it is important that I scrape all of the data at about the same time. This also means that the items on the lists will provide insight into what is important to customers at the moment I create my data set.
Because Amazon is a somewhat complex website, I plan to use Selenium, a Python module I am already somewhat familiar with, which should make things much simpler. The data collection process will have two steps. One script will collect links to the store pages of each item in the top 100, along with its position in that list and the category Amazon has placed it under. I plan to have this part of my project done by the midterm deadline. A second script, which I can work on once this first data is collected, will collect more data by visiting each item's store page. This will require more detailed planning, but the information I plan to collect includes item names, prices (including any sale price), ratings and number of ratings, brand names, and an image of each product.
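The first collection step could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual script: the function names, the field names, and the CSS selector argument are all hypothetical, and the driver is assumed to be a Selenium WebDriver created elsewhere. A timestamp is stored with each row because the lists change constantly.

```python
import csv
from dataclasses import dataclass, asdict, fields
from datetime import datetime, timezone


@dataclass
class BestSellerRow:
    """One row of the first data set (field names are hypothetical)."""
    category: str    # category Amazon has placed the item under
    rank: int        # 1-based position in that category's list
    link: str        # URL of the item's store page
    scraped_at: str  # when the list was read, as an ISO timestamp


def collect_links(driver, list_url, item_selector):
    """Return the store-page links found on one list page, in page order.

    `driver` is assumed to be a Selenium WebDriver. `item_selector` is a
    placeholder CSS selector; the real one depends on Amazon's page markup.
    """
    driver.get(list_url)
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", item_selector)
    return [element.get_attribute("href") for element in elements]


def build_rows(category, links, scraped_at=None):
    """Pair each link with its category and its 1-based rank."""
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc).isoformat()
    return [
        BestSellerRow(category, rank, link, scraped_at)
        for rank, link in enumerate(links, start=1)
    ]


def write_rows(rows, path):
    """Write the rows to a CSV file with a header line."""
    names = [field.name for field in fields(BestSellerRow)]
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=names)
        writer.writeheader()
        for row in rows:
            writer.writerow(asdict(row))
```

Keeping the page-reading function separate from the row-building function means the ranking logic can be checked with a plain list of links, without opening a browser.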


There is a lot of obvious potential for bias in this data set. Even before collecting any data, I can see that Amazon's categories are extremely arbitrary. At the top of the list, there are a few categories, such as "Amazon Devices & Accessories," "Amazon Renewed," and "Audible Books & Originals," which are very clearly Amazon branded. How Amazon collects its own internal data on which items are best sellers is a complete mystery, so the rankings should be treated with some scepticism. Amazon Basics items can also be found at the top of many categories. As a result, I think this data will be only somewhat useful as an insight into what Amazon customers value, and more useful as an insight into what Amazon pushes.

The first part of the project, and the one that I completed for this milestone, is the "collectbestproducts.py" script and the data set it generates. A few categories were difficult to handle, specifically "Digital music," "Digital educational resources," and "Entertainment collectibles." Digital music linked to the Amazon Music website, which I decided was too different a product to include in the data set. Digital educational resources had no items in its list, which I had to alter my script to account for. Entertainment collectibles had a few items in its list that had been removed because they were discontinued, which I again had to alter my script to account for.
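One way these three special cases could be handled is sketched below. This is not the code in "collectbestproducts.py"; the function name, the set of skipped categories, and the convention of using None for a removed item are all assumptions made for illustration. The sketch also assumes that removed items should keep their place in the ranking, so later items are not renumbered.

```python
# Categories left out entirely (assumed set; here only the one that
# links to a separate site).
SKIPPED_CATEGORIES = {"Digital music"}


def rank_available_items(category, links):
    """Return (category, rank, link) tuples for items that still have a page.

    `links` holds the store-page links in list order, with None standing in
    for an item that was removed from the list (an assumed convention).
    """
    if category in SKIPPED_CATEGORIES:
        return []
    rows = []
    for rank, link in enumerate(links, start=1):
        if link is None:
            # Removed item: skip it, but leave later ranks unchanged.
            continue
        rows.append((category, rank, link))
    # An empty list simply produces no rows, so no special case is needed.
    return rows
```

For example, a category with links ["a", None, "c"] would yield rows for ranks 1 and 3 only, and a category with an empty list would yield nothing.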
The second part of the project is the "infofromproductpage.py" script, which will extract more detailed product data from each store page. I may re-run my first script to get a more up-to-date data set once I write this script, but I am not sure yet. I am also not yet sure how I want to analyze this data, but I have many ideas.