I've spent ridiculous amounts of time thinking about scraping, first as one of the founders of Two Tap, a sort of Plaid approach but for eCommerce, and then for a year at Honey (Honey acquired Two Tap).
The standard approach in scraping is building Puppeteer or Selenium scripts that gather information or perform specific actions based on selectors. Imagine that you want to scrape the titles on reddit.com. First, you have to figure out a path to those titles (for instance '.posts .title') and then how to navigate the pagination (for example, clicking on '.pagination .next a').
But things always break. The most common situations are that either the pages change and the selectors don't work anymore, or the sites themselves have odd problems; they have developers who can write buggy code too.
At Two Tap we had a fantastic team dedicated to creating new integrations, managing existing ones, and handling edge cases. The system was built around handling failures so that whenever something odd happened, someone could intervene and resolve the issue without the end customer noticing.
A rough estimate would be that the team spent 75% of their time with product pages, 15% avoiding anti-robot CDNs, and 5% on estimates, with checkout and everything else splitting the other 5%. We never thought of going deeper into automation because product pages had too many edge cases.
Extensions have a completely different problem. Since actions happen in a user's browser there's no way for humans to intervene and resolve failures. Things silently break all the time.
Earlier this year, I left Honey with the goal of traveling and clearing my head after everything that happened there. I wanted to take a break after seven years of nonstop work, but COVID-19 happened. Being locked up, I started thinking. I don't believe product pages can be automated, but could something be built for applying coupons without selectors? The answer is YES!
You can split applying coupons into three things:
1) figuring out if you are on a page with a promo field
2) applying the coupon
3) understanding if the coupon worked and its value
To find out if we can apply a coupon, we look at all the visible text on the page and use cosine sentence similarity to match it against a bunch of keywords.
Here we find "Enter Promo Code." Then we check if there's an input and a button nearby. If these elements are super close we assume everything is all OK.
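To make the matching step concrete, here is a minimal sketch of cosine similarity over simple bag-of-words vectors. The phrase list, the tokenizer and the function names are my own assumptions for illustration; the actual keywords and vectorization in the project may differ.

```python
import math
import re
from collections import Counter

# Hypothetical keyword phrases; the real list is not given in this post.
COUPON_PHRASES = [
    "enter promo code",
    "apply coupon code",
    "discount code",
    "promotion code",
]


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coupon_score(text: str) -> float:
    """Best similarity between a piece of visible text and any known phrase."""
    vector = tokenize(text)
    return max(cosine_similarity(vector, tokenize(p)) for p in COUPON_PHRASES)
```

You would call coupon_score on every visible text node and keep the ones above some threshold as candidates, then look for an input and a button near each candidate.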
However, there are situations where this doesn't work, like below.
We can see that "Apply promotion code" is a link that makes the necessary fields visible once tapped.
Our approach makes the following assumption: if cosine similarity is high (meaning the text most likely refers to coupon codes), the element is clickable, and there is a functional pricing cluster nearby, then we decide we can apply coupons.
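The rule above could be sketched roughly like this. The Box type, the gap function and both thresholds (0.8 and 300 pixels) are assumptions I made up for the example, not values taken from the real code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    """Bounding box of a visible element, in page pixels."""
    x: float
    y: float
    width: float
    height: float


def gap(a: Box, b: Box) -> float:
    """Shortest distance between two boxes; 0 if they touch or overlap."""
    dx = max(a.x - (b.x + b.width), b.x - (a.x + a.width), 0.0)
    dy = max(a.y - (b.y + b.height), b.y - (a.y + a.height), 0.0)
    return (dx * dx + dy * dy) ** 0.5


def can_apply_coupons(
    similarity: float,
    clickable: bool,
    element: Box,
    price_cluster: Optional[Box],
    min_similarity: float = 0.8,   # hypothetical threshold
    max_distance: float = 300.0,   # hypothetical threshold, in pixels
) -> bool:
    """All three conditions must hold: coupon-like text, clickable, prices nearby."""
    if similarity < min_similarity or not clickable:
        return False
    if price_cluster is None:
        return False
    return gap(element, price_cluster) <= max_distance
```

Keeping the thresholds as parameters makes it easy to loosen them for a specific site later.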
The next step involves just determining the total price. This is a bit tricky; we use OPTICS clustering and some rules to find the most likely candidate group for price information and try to guess the most likely total price.
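To give a feel for this step without pulling in a clustering library, here is a deliberately simplified sketch. It does not use OPTICS; it swaps in a plain vertical-gap grouping as a stand-in. The price regex, the 40 pixel gap, and the "prefer a line labelled total, otherwise take the largest amount" rule are all assumptions of mine, not the rules the real code uses.

```python
import re

# Matches US-style dollar amounts such as "$5", "$89.99" or "$1,299.00".
PRICE_RE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d{2})?)")


def find_prices(texts):
    """Return (y, label, amount) for every visible text containing a dollar price."""
    found = []
    for y, text in texts:
        match = PRICE_RE.search(text)
        if match:
            amount = float(match.group(1).replace(",", ""))
            found.append((y, text.lower(), amount))
    return sorted(found)


def group_by_gap(prices, max_gap=40.0):
    """Split y-sorted prices into groups wherever the vertical gap exceeds max_gap."""
    groups = []
    for item in prices:
        if groups and item[0] - groups[-1][-1][0] <= max_gap:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


def guess_total(texts, max_gap=40.0):
    """Guess the order total from the largest group of nearby prices."""
    groups = group_by_gap(find_prices(texts), max_gap)
    if not groups:
        return None
    cluster = max(groups, key=len)
    labelled = [p for p in cluster if "total" in p[1] and "subtotal" not in p[1]]
    # Prefer a line labelled "total"; otherwise fall back to the largest amount.
    return labelled[-1][2] if labelled else max(p[2] for p in cluster)
```

Here texts is a list of (y position, visible text) pairs. A density-based algorithm like OPTICS handles two-dimensional layouts and uneven spacing much better than this one-dimensional grouping.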
We apply the coupon using the fields we previously discovered, fetch the total price again, and diff the visible texts before and after applying it.
Determining if the coupon works pretty much means comparing the original total price with the total after each coupon we apply.
We can also diff what is visible before we apply the coupon with what is after, and most times, this returns the error message if the coupon failed. We could again use cosine similarity with some known failure messages to be super confident about failure.
In the same diff we can figure out if we need to perform specific actions like clicking an OK button or removing the coupon so the next one can be applied.
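The before and after comparison can be sketched as follows. This is a simple set-based diff under my own assumptions (the function name and the shape of the result are made up); the real code may diff the page differently.

```python
def coupon_outcome(total_before, total_after, texts_before, texts_after):
    """Compare page state before and after a coupon attempt."""
    seen = set(texts_before)
    # Texts that only appear after the attempt, kept in page order.
    new_texts = [t for t in texts_after if t not in seen]
    # Round to cents to avoid floating point noise in the difference.
    discount = round(total_before - total_after, 2)
    return {
        "worked": discount > 0,
        "discount": max(discount, 0.0),
        "new_texts": new_texts,
    }
```

When the coupon fails, new_texts usually contains the error message. When it succeeds, it contains things like the discount line or a Remove button, which tells you what to click before trying the next code.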
Here is the website for it where you can play around with a bookmarklet. You can find the code for all this on GitHub.
In my tests, it worked about 70–80% of the time. You can find the situations where it didn't work and why here. Some of those situations could be resolved with per-site configuration options, such as accepting a bigger distance between the pricing cluster and the apply coupon link.
Mobile web is annoying for a lot of scraping companies since sites can be utterly different from their desktop versions. This approach mostly works on mobile with few or no changes. Wouldn't it be amazing if there were a way to try out coupons when shopping in mobile Safari or Chrome? Someone could build an app that inserts the algorithm in a webview and lets shoppers browse sites.
Both OPTICS and cosine similarity algorithms are fast. Initially, I played with tensorflow.js and USE (the Universal Sentence Encoder), but it was killing my laptop. I don't think it makes sense at this stage.
It took me about a week and a half to write this, and I did it mostly as a pet peeve. I have no intention of doing anything with this code myself, but if I did, here is how I would approach it.
Step 1. Clean it up. Make most of the hardcoded values configurable. When something fails, it should report why it failed. Version it.
Step 2. Build a dashboard where humans can test this approach on each retailer site. No need for complicated selectors; they have to add products to the cart, go to the cart page, and run everything once to make sure things are OK. If all is good, they can whitelist (or enable) the site. If it's not OK, let them try out different configuration options based on where it failed.
For instance, if it doesn't work because the price clusters are too far away from the coupon link, you can give the person trying it out an option to increase that distance. Sure, a random set of configuration options would be confusing, but you can have a dashboard that asks questions when something breaks. Like: "It seems like these elements are too far apart. Try increasing the maximum distance but make sure it doesn't trigger the pop up on other pages."
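As a rough illustration of Steps 1 and 2, per-site options and failure hints could look something like this. Every name here (CouponConfig, the option names, the failure reason key, the example host) is hypothetical; only the hint text comes from the example above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CouponConfig:
    """Values that would otherwise be hardcoded; defaults are made up."""
    min_similarity: float = 0.8
    max_cluster_distance: float = 300.0


DEFAULT_CONFIG = CouponConfig()

# Hypothetical per-site overrides saved from the dashboard.
SITE_OVERRIDES = {
    "shop.example.com": {"max_cluster_distance": 500.0},
}

# Hypothetical failure reasons mapped to a question for the person testing.
HINTS = {
    "cluster_too_far": (
        "It seems like these elements are too far apart. Try increasing the "
        "maximum distance but make sure it doesn't trigger the pop up on other pages."
    ),
}


def config_for(host: str) -> CouponConfig:
    """Default config with any overrides for this host applied on top."""
    return replace(DEFAULT_CONFIG, **SITE_OVERRIDES.get(host, {}))


def hint_for(reason: str) -> str:
    """Dashboard message for a reported failure reason."""
    return HINTS.get(reason, "No suggestion available for this failure.")
```

The point is that the algorithm reports a reason when it fails, and the dashboard turns that reason into one concrete option to tweak instead of showing every knob at once.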
Step 3. Bake the code inside the extension, as Google might be upset about dynamic code (even though loading it dynamically would be ideal). Since you want to improve the code over time, make sure you keep track of which code version is linked to which extension release. For instance, code v0.1 might not work on footlocker, but code v0.2 does. Extension vE35.0 has code v0.2, so you should keep track of which code version each site was tested with.
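One simple way to keep that bookkeeping, sketched under my own assumptions: the two lookup tables, the E34.0 release and the example-shoes.com host are invented for the example, while E35.0, v0.1, v0.2 and footlocker follow the scenario above.

```python
# Which algorithm version ships inside which extension release (hypothetical data).
EXTENSION_CODE_VERSION = {
    "E34.0": "0.1",
    "E35.0": "0.2",
}

# Sites a human has tested and enabled for each algorithm version (hypothetical data).
TESTED_SITES = {
    "0.1": {"example-shoes.com"},
    "0.2": {"example-shoes.com", "footlocker.com"},
}


def is_site_enabled(extension_version: str, host: str) -> bool:
    """True only if the code bundled in this release was tested OK on this host."""
    code_version = EXTENSION_CODE_VERSION.get(extension_version)
    return host in TESTED_SITES.get(code_version, set())
```

With this, users still on an older release simply don't see the coupon pop up on sites that only work with the newer code.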
If you are doing this on mobile, you don't need to worry about anything. You can dynamically insert javascript and update the code as needed.
Step 4. Finding coupons is pretty easy; you can use FMTC. If you want more, you can always subscribe to store newsletters and social media and track offers.
That’s about it.
I feel you. I wish I could have built something using deep learning and impressed everybody.
The approach presented above seems so pre-2010 ImageNet. It reminds me of how translation was handled before deep learning.
Simultaneously, it feels like the problem area is significantly smaller, which means this approach could be the direction for a full-on solution.
Meanwhile, I heard Honey has built something similar using a contemporary machine learning approach, which I find pretty cool. It would be great if they published how they did it.
I hope you enjoyed this. The future of scraping has to be something better than selectors.