Scrape Performance Tweaks

Until this point, the scrape used a brute-force technique: request every possible thread number from the Megatokyo Forums, whether or not the thread exists. We can take a little shortcut by exploiting one fact: old threads, upon conversion from Infocap’s ubb.Threads to Invision Power Board, are numbered the same as the first post within them.

Once we have downloaded a thread and added its post numbers to the local list, we know that none of those post numbers is a topic number we still need to try: each one either belongs to a thread with a different number, or is the first post of a thread already downloaded. Thus, the next thread to download is the first number greater than the last thread downloaded that is not found in the local list.

There is still a chance that this particular thread will not exist, but at least we have reduced the number of download attempts by (PostCount - ThreadCount), since every post that is not the first post of its thread is a number we no longer request.
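The skip rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scraper's actual code: `next_candidate`, `scrape`, the `fetch_thread` callable, and the toy `FAKE_FORUM` data are all hypothetical names invented for the example, and a thread is assumed to come back as a list of its post numbers (or `None` when it does not exist).

```python
def next_candidate(last_thread_id, known_post_ids):
    """Return the first number above last_thread_id that is not a known post number."""
    candidate = last_thread_id + 1
    while candidate in known_post_ids:
        candidate += 1
    return candidate


def scrape(fetch_thread, max_id):
    """Try candidate thread numbers up to max_id, skipping known post numbers.

    fetch_thread(thread_id) is a hypothetical callable that returns the list
    of post numbers in that thread, or None if no such thread exists.
    """
    known_post_ids = set()   # the "local list" of downloaded post numbers
    attempted = []           # every thread number we actually requested
    candidate = next_candidate(0, known_post_ids)
    while candidate <= max_id:
        attempted.append(candidate)
        post_ids = fetch_thread(candidate)
        if post_ids is not None:
            known_post_ids.update(post_ids)
        candidate = next_candidate(candidate, known_post_ids)
    return attempted, known_post_ids


# Toy data (assumption): three threads, each numbered after its first post.
FAKE_FORUM = {1: [1, 2, 3], 4: [4, 6], 5: [5, 7, 8]}

attempted, known = scrape(FAKE_FORUM.get, max_id=10)
print(attempted)  # [1, 4, 5, 9, 10]
```

In this toy run, brute force would make 10 requests for numbers 1 through 10. The skip rule makes 5, a saving of 8 posts minus 3 threads. Numbers 9 and 10 are still requested even though no thread exists there, which is the remaining chance of a miss mentioned above.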

This is a huge performance improvement, and will let us hit our target of complete conversion much sooner.

3 Comments

  1. Posted 2004-10-21

    Thus, we can start downloading threads numbering at the first number greater than the last thread downloaded, which is not found in the local list.

    Are the post numbers within threads really all consecutive? That would be impressive, considering that they weren’t under UBB. Do you scan for breaks in consecutive numbering in the local list before setting the highest post number therein as the next potential thread number to retrieve?

    Or am I mis-parsing how you described your shortcut?

  2. Posted 2004-10-21

    Looking for breaks in consecutive numbering struck me as too complex a routine, so I opted for something a little simpler, if a bit more processor-hungry.

    Before downloading possible-thread-X, I look through the list of posts I’ve already downloaded. If there is a post with ID number X there already, then X can’t be an existing thread that I have yet to download. Thus, I skip it, and increment X until this test fails.

    This doesn’t guarantee that X is an actual thread number, as there could be threads that once existed but have since been deleted, but it reduces the number of unnecessary downloads by not pulling down pages that are already known to contain no threads.

  3. ooklah
    Posted 2004-11-03

    You’re downloading all of the mt forums… what possible use does this have, and why?

One Trackback

  1. [...] nder: coding cwdb — codepoetica @ 10:28h We have attained completion. Our goal of downloading the first 1.66 Million (possibly existing) forum [...]
