
9 June 2006

Wikilocalpedia

For one of my research projects this summer, I'm looking at Wikipedia content. Rather than tax WP's servers with thousands of queries, I thought it might be useful to run a local WP mirror. Because I am interested in the functioning of the MediaWiki software as well as WP content, I wanted a local MediaWiki install, and not just a database mirror. Getting this to work has been a challenge, so I thought I'd note a few things I've learned here for future reference.

The creation of a local WP requires at least four types of fiddling:

  1. a working install of MediaWiki on a PHP-sporting HTTP server,
  2. a properly configured MySQL database,
  3. a dump of relevant content from Wikipedia, and
  4. a script of some sort to import the dump into the database.

Relevant notes for each section follow. My sadly underpowered desktop box (Athlon 64 3000, 512MB RAM) is running Ubuntu Linux 6.06.

1. HTTP server and MediaWiki
Used a default Apache install with PHP and MySQL connectors through Synaptic. MediaWiki 1.4 is also available in the repositories, but those clever MediaWiki hackers have already released versions 1.5 and 1.6. Wanting the latest and greatest, I installed 1.6 manually (very easy, using MW's slick mostly-browser-based installer. Later, as I had trouble using various import scripts, I installed version 1.5. In hindsight, I probably should have just used the package install.

2. MySQL
Installed MySQL through Synaptic. I ended up changing some of the default settings to speed the import process. Basically, I made my /etc/mysql/my.cnf file match the settings in "my-large.cnf" example configuration file. Also, I disabled the log-bin option. Note that you will want to save your original my.cnf and change most of these options back after you complete the import, as these changes basicaly allow MySQL to use as much of your system's resources as it wants. Also doublecheck the MySQL data directory -- even a basic WP mirror will eat up a lot of gigs. I ended up sacrificing a 20GB hard drive to the ravenous Wikipedia database.

3. Wikipedia dump
Dumps of Wikimedia Foundation project content are available at http://download.wikimedia.org/enwiki. I just wanted the most recent revisions of English Wikipedia articles, so I snagged http://download.wikimedia.org/enwiki/20060607/enwiki-20060607-pages-articles.xml.bz2. Save the file somewhere you won't forget it.
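
Grabbing it from the command line is a one-liner (/data/dumps is just an example directory with enough free space):

    cd /data/dumps
    wget http://download.wikimedia.org/enwiki/20060607/enwiki-20060607-pages-articles.xml.bz2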

4. Import the dump using a script
The big database dumps are provided in XML format, which requires some massaging if you want to cram it back into a SQL database. A PHP script is provided with MediaWiki 1.5 and higher, in /maintenance/importDump.php. This script works as advertised, but is very slow (as few as 8 pages inserted per second -- with over 3 million pages in my smallish dump, this would be a long, lonely road).
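
For completeness, the PHP route looks roughly like this -- importDump.php reads the XML on standard input, so the dump can be decompressed on the fly (paths are from my setup and will differ on yours):

    cd /var/www/wiki
    bzcat /data/dumps/enwiki-20060607-pages-articles.xml.bz2 | php maintenance/importDump.php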

An alternative is MWDumper, a Java program that imports pages much faster. You'll need to install Java, of course; I have the Blackdown JRE package installed, and it seems to work. Using MWDumper, I am currently getting about 100 pages imported per second, and am watching red links in local WP articles slowly turn blue.
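
The basic invocation is along these lines: MWDumper converts the XML to SQL for the 1.5 schema, and you pipe the output straight into MySQL (the database and user names below are placeholders, and if your fresh wiki already contains its default pages, you may need to empty the page, revision, and text tables before importing):

    java -jar mwdumper.jar --format=sql:1.5 \
        /data/dumps/enwiki-20060607-pages-articles.xml.bz2 \
        | mysql -u wikiuser -p wikidb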

Woot. TFSMIF.
