Saturday, January 17, 2015

How I Track Website Changes

Several months ago I failed to act on a pop in New Century Group HK. It was a painful lost opportunity. While I am a long-term investor, I realize it sometimes helps to be prepared to act fast. On that day several things went wrong. One was that I got no notice of the early earnings announcement posted on the company website the previous Friday. If I had, I would have been better prepared for the next market open and at least wouldn't have been caught napping, literally. I only found out about the announcement after the pop, when I went to the website because I knew something was up.

All US-listed companies are required to promptly file any material investor announcements with the SEC. Websites can track the SEC's EDGAR site and alert users to any changes. I use secfilings.com. But I have a problem with my small foreign holdings and with US companies not required to file with the SEC. All such companies that I own are responsible companies that report anything relevant to the public on the investor portion of their website. But I generally cannot get notifications from them. In the past, I just counted on going to the website regularly to check, especially around the time of their quarterly announcements.

I've occasionally thought about tackling the problem of how to track website changes. I researched online and tried some of the nicer online tools, but they all charge for the service (visualping.io, for example). That is reasonable, since such a tool can only learn of changes by brute force, querying the site at periodic intervals.

But the New Century fiasco spurred me to action. Instead of paying, I decided, as with everything else I do related to investing, to go DIY. This method requires a computer that runs Linux or a Unix-like system such as Mac OS or Android. If you must use a Windows PC, you can install Cygwin on it. The computer must be connected to the internet, and preferably should be always on. I have a Linux home computer connected to the internet 24/7.

This method checks the websites using a simple Perl script. Perl is a basic command-line tool included in virtually all such installations. If Perl isn't installed, you can easily install it manually. The script, when run the first time, creates a base copy of each webpage that I want to track. Then every hour it checks those webpages again and compares each current page with its base copy. If the script detects a meaningful difference, it stops. The next time I check the script window, I'll know which website changed. To resume, I first tell the script to overwrite the old base copy with the newly changed page.
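The first-run and resume steps can be sketched roughly as follows. This is not the actual script, only a minimal illustration: the helper names `ensure_base` and `accept_change` are hypothetical, and it assumes the fresh download and the base copy are ordinary files on disk.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# First run: if no base copy exists yet, the fresh download becomes the base.
# Returns 1 when a base copy was created, 0 when one already existed.
sub ensure_base {
    my ($fresh, $base) = @_;
    return 0 if -e $base;
    copy($fresh, $base) or die "cannot create $base: $!";
    return 1;
}

# After reviewing a change: overwrite the old base copy with the new page
# so the next run compares against the current version.
sub accept_change {
    my ($fresh, $base) = @_;
    copy($fresh, $base) or die "cannot update $base: $!";
    return 1;
}
```

For example, `ensure_base("temp.html", "putprop.html")` would be called after every download, and `accept_change("temp.html", "putprop.html")` only once I have looked at the change and want the script to carry on.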

The first part of the script is a list of website URLs. For each URL, I also give it a name and keywords to ignore. The ignored keywords prevent the script from excessively flagging minor changes such as the date or the current stock quote. See below.
$url[$i]{url} = "http://www.putprop.co.za/content/1997/1982/sens-announcements";
$url[$i]{name} = "putprop";
$url[$i]{exception} = "Parsing Time:";
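With several sites, the list might be built along these lines. This is only a sketch: the `add_site` helper is hypothetical, and the second entry (the example.com address, its name, and its pattern) is made up for illustration. Because the exception is handed to `grep -E`, several keywords can be combined with `|`.

```perl
use strict;
use warnings;

my @url;    # list of sites to track, one hash per site

# Hypothetical helper: append one site so the index $i never has to be
# managed by hand. Returns the number of sites registered so far.
sub add_site {
    my ($address, $name, $exception) = @_;
    push @url, { url => $address, name => $name, exception => $exception // "" };
    return scalar @url;
}

add_site("http://www.putprop.co.za/content/1997/1982/sens-announcements",
         "putprop", "Parsing Time:");
# Made-up entry: a page whose only noise is a date stamp and a quote line.
add_site("http://www.example.com/investors", "example",
         "Last updated|Share price");

# List what will be tracked.
for my $i (0 .. $#url) {
    print "$url[$i]{name}: $url[$i]{url}\n";
}
```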
The second part of the script iterates through all the URLs in the list. For each URL, it downloads a copy of the webpage into temp.html. Next the script filters out any lines that match the exception keywords. Then it compares temp.html with the previously stored base copy of the webpage. In the above example, the base copy is putprop.html.
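use File::Compare;    # provides compare(), which returns 0 when two files match

# The lines below run once per URL, inside the loop over $i.
$urls = $url[$i]{url};
$o    = $url[$i]{name} . ".html";    # base copy, e.g. putprop.html

# Download the current page; the URL is quoted so the shell leaves it alone.
$ret = system("wget -O temp.html '$urls'");

# Drop any lines that match the exception pattern before comparing.
$temp = $url[$i]{exception};
if ($temp ne "") {
    $temp = "grep -v -E '$temp' temp.html > xx";
    print("exception: $temp\n");
    system($temp);
    system("mv xx temp.html");
}

print("======================================\n");
if (compare("temp.html", $o) == 0) {
    print("they are equal $o\n");
} else {
    print("they are NOT equal $o $urls\n");
    exit(1);    # stop here so the change gets noticed
}
print("======================================\n");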

As the code shows, if the two files match, the script proceeds to the next URL. But if they differ, the script aborts. The next time I look at the script window, I'll know which website I should check out.
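The same filter-and-compare idea can also be written in pure Perl, without calling out to grep. The sketch below is an alternative illustration, not the script's own code: `filter_lines` and `page_changed` are hypothetical names, the sample page text is invented, and the pattern is treated as a Perl regular expression, which behaves like `grep -E` for simple keywords but is not identical in every case. Unlike the script, it filters both the base copy and the fresh page.

```perl
use strict;
use warnings;

# Drop every line that matches the exception pattern (like grep -v -E).
sub filter_lines {
    my ($text, $pattern) = @_;
    return $text if !defined $pattern || $pattern eq "";
    # split /^/ breaks the text into lines and keeps each newline.
    return join "", grep { !/$pattern/ } split /^/, $text;
}

# True when the page differs from the base copy once noise is removed.
sub page_changed {
    my ($base, $fresh, $pattern) = @_;
    return filter_lines($base, $pattern) ne filter_lines($fresh, $pattern);
}

# Invented sample: only the "Parsing Time:" line differs between the two.
my $base  = "Results due in March\nParsing Time: 0.12s\n";
my $fresh = "Results due in March\nParsing Time: 0.48s\n";
print page_changed($base, $fresh, "Parsing Time:") ? "changed\n" : "same\n";
```

With the exception in place the two samples count as the same page; with an empty exception the timing line alone would be flagged as a change.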

The final part of the script is a loop that wakes up once an hour and repeats the above process. I won't show that portion of the code, but below is a snapshot of how the program looks on my Linux box. Note that it last woke up at 9:24 AM and has run for 16 iterations without finding any differences. The last URL it looked at belonged to Combined Motor Holdings, a South African company.



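Since the hourly loop is not shown, here is one way such a loop could look. It is a guess at the general shape, not the script's actual code: `check_all`, `watch`, and `site_changed` are hypothetical names, and the per-site check is passed in as a code reference standing in for the download, filter, and compare steps.

```perl
use strict;
use warnings;

# One pass over all sites. $check is a code reference that returns true
# when the given site has changed. Returns the names of changed sites.
sub check_all {
    my ($sites, $check) = @_;
    return map { $_->{name} } grep { $check->($_) } @$sites;
}

# Wake up every $interval seconds until some site changes, then return
# the number of passes that ran followed by the names of changed sites.
sub watch {
    my ($sites, $check, $interval) = @_;
    my $iteration = 0;
    while (1) {
        $iteration++;
        print "iteration $iteration at ", scalar(localtime()), "\n";
        my @changed = check_all($sites, $check);
        return ($iteration, @changed) if @changed;
        sleep $interval;
    }
}

# Usage (not run here; site_changed is a hypothetical per-site check):
#   my ($passes, @changed) = watch(\@url, \&site_changed, 3600);
```

Returning from the loop on the first change, rather than carrying on, mirrors the stop-and-wait behavior described above: nothing is overwritten until someone has looked at the change.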
If you'd like a copy of the script, please make a request in the comment section.

9 comments:

  1. I would appreciate a copy of the script. My email address is charleshang228@yahoo.com.

    Thank you for the idea! I've never thought of using a Perl script in this context before; this is a much appreciated concept.

  2. I'd appreciate a copy of the script as well.

  3. I tried to write a script like this in Python several months ago but I didn't succeed.
    I would greatly appreciate receiving a copy of your script
    my email: amiryiz@gmail.com
    thanx

  4. Sorry if this is a double post, I lost the original comment..........somewhere.

    I use page2rss.com
    It notes changes and sends them to your RSS reader as a link to that page. I used it for otcmarket.com for companies filing their financials.

    I don't think you can set the intervals like you have with each hour. I remember getting 2 or 3 a day.

    I'm new to Linux so I may give your method a try. Thanks for sharing. :)

    Replies
    1. I just saw they now have a Google Chrome add-on. Apparently the page2rss logo is shown in your address bar when changes have been detected on the pages you're subscribed to.

      Pretty neat.

    2. Hi Augustabound,

      You can use otcmarket.com because the site supports RSS, but my problem is tracking websites that don't have RSS support. This is my issue, unless I am misunderstanding RSS or what you are saying.

    3. Hm, not too sure. From their page, "Page2RSS is a service that helps you monitor web sites that do not publish feeds. It will check any web page for updates and deliver them to your favorite RSS reader."

      I had assumed it just monitored the page for changes and when a change happened it sent a link to your RSS feed.

  5. I would like a copy of this script, john@griswell.com
