Scrape Performance Tweaks
Until this point, the scrape had been designed to use a brute-force technique: download every possible thread number, existing or not, from the Megatokyo Forums. We can take a little shortcut by exploiting the fact that old threads, upon conversion from Infopop’s UBB.threads to Invision Power Board, are numbered the same as the first post within them.
If we have downloaded a thread and added its post numbers to the local list, then we know that these post numbers are not valid topic numbers: each one is either a reply contained in a differently numbered thread, or the first post of a thread already downloaded. Thus, the next thread to download is the first number greater than the last thread downloaded that is not found in the local list.
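The skip rule can be sketched in a few lines of Python. This is only an illustration, not the scraper's actual code: `next_candidate`, `scrape`, `fetch_thread`, and `max_id` are hypothetical names, and `fetch_thread` is assumed to return the list of post numbers in a thread, or `None` if no thread has that number.

```python
def next_candidate(last_thread_id, seen_post_ids):
    """Return the first number above last_thread_id that is not a known post number."""
    candidate = last_thread_id + 1
    while candidate in seen_post_ids:
        candidate += 1
    return candidate


def scrape(fetch_thread, max_id):
    """Download threads in increasing order, skipping numbers already seen as posts.

    fetch_thread(n) is a hypothetical callable: it returns the post numbers
    in thread n, or None when thread n does not exist.
    """
    seen_post_ids = set()   # the "local list" of post numbers
    threads = {}
    candidate = 1
    while candidate <= max_id:
        post_ids = fetch_thread(candidate)
        if post_ids is not None:
            threads[candidate] = post_ids
            seen_post_ids.update(post_ids)
        # Jump past every number we already know to be a post.
        candidate = next_candidate(candidate, seen_post_ids)
    return threads
```

Keeping the local list in a set makes each membership check cheap, so the cost of skipping is negligible next to a single HTTP request.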
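There is still a chance that a particular candidate thread will not exist, but at least we have reduced the quantity of thread downloads by (PostCount - ThreadCount), since every post that is not the first post of its thread is a number we no longer request.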
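To make the arithmetic concrete, here is a small worked example on made-up data. The function name `download_counts` and the toy forum are assumptions for illustration only, and the calculation assumes every post number falls within the range being scanned.

```python
def download_counts(threads, max_id):
    """Compare request counts for brute force versus the skip rule.

    threads maps each thread number to the post numbers it contains
    (hypothetical data); max_id is the highest number to be scanned.
    """
    post_count = sum(len(posts) for posts in threads.values())
    thread_count = len(threads)
    brute_force = max_id                      # one request per number
    saved = post_count - thread_count         # replies we never request
    return brute_force, brute_force - saved


# Toy forum: 4 threads holding 8 posts, numbers 1 through 9 scanned.
toy_forum = {1: [1, 2, 3], 4: [4, 7], 5: [5, 6], 8: [8]}
brute, skipping = download_counts(toy_forum, 9)
```

Here brute force would make 9 requests, while the skip rule makes 9 - (8 - 4) = 5: the four real threads plus the one number (9) that turns out not to exist.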
This is a huge performance improvement: the more replies each thread has, the more requests we skip, which will let us hit our target of complete conversion much sooner.
About this entry
You’re currently reading “Scrape Performance Tweaks”, an entry on VerseLogic
- Published:
- 10.20.04 / 8pm
- Category:
- Uncategorized