So recently, my friend needed a web scraper that could check the text on a page and see if it had changed. This is a pretty common problem that also has a lot of roadblocks. This is my journey of creating the world's worst scraper (but it works!).
Part 0: Specs
As this was a fun personal project, it wasn't formal but the idea was:
- Scrape webpage/API for info
- Send some kind of message in order to tell the user the info had changed.
Part 1: What approach?
When it comes to webscraping, I usually go with one of three things:
- A Python script using a combination of Requests and BeautifulSoup
- Web test runner such as Selenium or Playwright
- Sketchy Tampermonkey with Javascript
You can guess which we ended up using but I actually tried each one.
Part 2: What is actually available?
I spent some time looking through the network logs to see if there were any accessible REST endpoints we could just query for JSON data. There were a few, but they were either encrypted or inaccessible to any client that wasn't the website itself.
We ended up deciding we'd have to use webscraping. I was able to come up with a quick document.querySelectorAll(...).forEach hack to grab the data I wanted using the devtools console.
What does querySelectorAll do? It returns every element on the page that matches a CSS selector, which let me search for elements with a custom attribute. I was able to find a custom attribute attached to a <tr> that contained all the data I needed. Though this was in Javascript, that selector would be useful for any other scraper/script I used.
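As a rough sketch of that idea, the helper below gathers the text of every element matching an attribute selector. The name collectRowText is made up for illustration, and abc stands in for the real attribute name, which isn't given here.

```javascript
// Collect the trimmed text of every element matching a CSS selector.
// `root` is anything that has querySelectorAll (normally `document`).
function collectRowText(root, selector) {
  return Array.from(root.querySelectorAll(selector), (el) => el.textContent.trim());
}

// In the devtools console, assuming the rows carry a custom attribute named `abc`:
//   collectRowText(document, "tr[abc]");
```

Taking the root as a parameter is just a convenience: the same function can be pointed at a single table instead of the whole document.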
Part 2: The approaches we ended up not using
So first I tried to use Requests and BeautifulSoup. Long story short, it became a pain in the butt. It wasn't a JSON endpoint but an actual page rendered using Javascript/React, so the data wasn't in the HTML on page load and we couldn't get it this way.
So from there I tried to use Playwright since it actually uses a headless browser and user actions to run tests/scripts. Unfortunately, I was too lazy to go in this direction: I anticipated the site would detect a test framework this popular, and I already had the Javascript snippet, so I ended up ignoring this route.
Part 3: Tampering with monkeys
I ended up using Tampermonkey to run the script on page load. For those unfamiliar with Tampermonkey, it's a browser extension where you can write custom Javascript code that runs on every page load. It's how people used to (and sometimes still) install custom themes/actions on sites like Reddit or Facebook without creating a whole extension for it.
Originally the code looked like this:
function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}
(async function() {
    'use strict';
    // Give the page time to render the rows.
    await sleep(5*1000);
    document.querySelectorAll("tr[abc]").forEach(async (row) => {
        // Strings have includes(), not contains().
        if(row.textContent.includes("what we're looking for")) {
            console.log("found!");
        } else {
            console.log("not found yet");
        }
    });
    // Reload after 30s to check again.
    setTimeout(function(){ location.reload(); }, 30*1000);
})();
Basically, we would sleep for 5s to wait for the page to load, iterate over all the <tr>s that fit our criteria, find out if we had what we were looking for, then refresh after another 30s to avoid 429 Too Many Requests errors.
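The fixed 5s sleep is a guess at how long the page takes to render. One possible alternative, sketched here and not what the script actually did, is to poll until the rows show up. The name waitForRows and the timeout/interval defaults are assumptions for illustration.

```javascript
// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Poll `root` until `selector` matches at least one element or the timeout passes.
// Returns the matched elements as an array (empty if nothing showed up in time).
async function waitForRows(root, selector, { timeoutMs = 10000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const rows = Array.from(root.querySelectorAll(selector));
    if (rows.length > 0 || Date.now() >= deadline) {
      return rows;
    }
    await sleep(intervalMs);
  }
}

// Inside the userscript's async function, this could replace the fixed sleep:
//   const rows = await waitForRows(document, "tr[abc]");
```

This returns as soon as the rows exist on a fast load and still gives up after the timeout on a slow one, so the reload timer can run either way.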
I didn't want to mess with trying to send a message to the user because that would require 3rd party APIs which would be pretty annoying to try to use within a simple Tampermonkey script. I used a "beeping" sound that I had found on StackOverflow that would go off every time the textContent matched our criteria. It was actually pretty cool because you were able to alert a user audibly without any extra resources. It would just take the actual wav content that we provided as a string and play it.
But... my friend was adamant that he needed a way that could notify him on his phone so that he'd know while being away from his computer.
Part 4: Tele...phone? Tele...gram?
My friend had a couple options for how he wanted to receive the message:
- Email
- Text
- Some messaging platform
For me, email felt annoying: the options were either to send from an existing email account, which usually requires a 3rd party library, or to run my own SMTP server or script that then contacts another API, which I didn't want to write.
Text was in a similar boat. I've sent texts using email before (I don't know if it still works but this is how I used to) but that would require emailing. Another alternative to send a text was creating a Twilio account or something which seemed annoying too.
So I landed on messaging platforms. Again, I didn't want to have to import a library because doing that in Tampermonkey seemed like a giant pain.
I looked around online and realized that Telegram has its own REST API: all you need to send a request is a bot token and a chat ID (more information in the Telegram Bot API docs).
Luckily my friend already had a Telegram account so I had him register a bot to get an API token and chat ID. From there it was child's play.
To actually send the request, I used the fetch API. Here's the gist of the code:
// we're in the for loop
if(row.textContent.includes("what we're looking for")) {
    const regex = /(custom.regex)/;
    // match[0] is the full match, match[1] is the first capture group
    const match = row.textContent.match(regex);
    const data = {
        "chat_id": "1234567890",
        "text": match[0] + " " + match[1]
    };
    await fetch("https://api.telegram.org/bot<BOT_TOKEN>/sendMessage", {
        headers: {
            "content-type": "application/json"
        },
        method: 'POST',
        body: JSON.stringify(data)
    });
} else {
    console.log("not found yet");
}
Pretty simple, right? All I needed to do was POST to an endpoint with the token in the URI and send a stringified object with the chat_id and my parsed message! Super easy and super fast.
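If you want to reuse that call, it can be pulled into a small helper. This is only a sketch: buildTelegramRequest and sendTelegramMessage are names I made up, and the fetchImpl parameter exists only so the function can be tried out with a stub. The one thing it relies on from Telegram is that Bot API responses are JSON objects with a boolean ok field.

```javascript
// Build the URL and fetch options for a Telegram Bot API sendMessage call.
function buildTelegramRequest(botToken, chatId, text) {
  return {
    url: `https://api.telegram.org/bot${botToken}/sendMessage`,
    options: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ chat_id: chatId, text }),
    },
  };
}

// Send the message and report whether Telegram accepted it.
// `fetchImpl` defaults to the global fetch; pass a stub to test without a network.
async function sendTelegramMessage(botToken, chatId, text, fetchImpl = fetch) {
  const { url, options } = buildTelegramRequest(botToken, chatId, text);
  const response = await fetchImpl(url, options);
  const payload = await response.json();
  return response.ok && payload.ok === true;
}

// Inside the loop, with placeholder credentials:
//   const sent = await sendTelegramMessage("<BOT_TOKEN>", "1234567890", "it changed!");
//   if (!sent) console.log("Telegram rejected the message");
```

Checking the result is optional for a hack like this, but it makes a bad token or chat ID show up in the console instead of failing silently.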
Part 5: Did it work?
Yeah it worked.
We're still getting rate-limited though so we might have to tweak the sleep time.
Could we have made it smarter with backoffs or used an actual API? Sure. But this was fast, easy, and hacky. Which is my favorite part of software engineering.
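For the curious, a backoff could look something like the sketch below. We didn't do this. The function name, the 30s base, and the 10 minute cap are all assumptions, and how the script would notice a 429 is left out since it depends on the site.

```javascript
// Pick the next reload delay in milliseconds.
// After a rate-limited attempt, double the previous delay (capped at maxMs);
// after a normal attempt, fall back to the base delay.
function nextReloadDelay(previousMs, wasRateLimited, { baseMs = 30000, maxMs = 600000 } = {}) {
  if (!wasRateLimited) {
    return baseMs;
  }
  return Math.min(Math.max(previousMs, baseMs) * 2, maxMs);
}

// Because location.reload() wipes script state, the previous delay would have
// to live somewhere that survives a reload, for example sessionStorage:
//   const prev = Number(sessionStorage.getItem("reloadDelayMs")) || 30000;
//   const next = nextReloadDelay(prev, sawRateLimit); // sawRateLimit is hypothetical
//   sessionStorage.setItem("reloadDelayMs", String(next));
//   setTimeout(() => location.reload(), next);
```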
Hopefully this is helpful for someone who just needs a quick bot to notify them of website changes!
Good luck hacking!