Database download

See also BambooWeb:MediaWiki to get the software to run the wiki. Another page has just the database schema or layout.

Why not just retrieve data from BambooWeb.org at runtime?

Suppose you are building a piece of software that at certain points displays information that came from BambooWeb. If you want your program to display the information differently from the live version, you will probably need the wikicode used to enter it rather than the finished HTML.

Also, if you want all of the data, you will probably want to transfer it as efficiently as possible. The BambooWeb.org servers need to do quite a bit of work to convert the wikicode into HTML. That is time-consuming both for you and for the BambooWeb.org servers, so simply spidering all pages is not the way to go.

To access any article in XML, one at a time, link to the following (after logging in):

http://en.BambooWeb.org/articles/S/p/Special:Export/Title_of_the_article

Read more about this at Special:Export.
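
As a rough illustration of working with a single exported article, the Python sketch below builds the export URL shown above and pulls the wikicode out of an export document. The function names are hypothetical, and it assumes the export uses the 0.3 schema namespace with page, title, revision and text elements; check a real export file before relying on it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Assumption: the URL prefix shown above for the English wiki.
EXPORT_PREFIX = "http://en.BambooWeb.org/articles/S/p/Special:Export/"
# Assumption: the export uses the 0.3 schema namespace; other versions differ.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.3/"}


def export_url(title):
    """Build the Special:Export URL for one article title."""
    return EXPORT_PREFIX + quote(title.replace(" ", "_"), safe="/:")


def extract_wikitext(export_xml):
    """Map each page title in an export document to its first listed revision text."""
    root = ET.fromstring(export_xml)
    pages = {}
    for page in root.findall("mw:page", NS):
        title = page.findtext("mw:title", default="", namespaces=NS)
        text = page.findtext("mw:revision/mw:text", default="", namespaces=NS)
        pages[title] = text
    return pages
```

The sketch only parses a document you have already retrieved; fetching is left out because the export page requires a logged-in session.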

Weekly database dumps

SQL database dumps on download.wikimedia.org have historically been updated approximately twice weekly. However, as of 2005-08-15, the most recent database dump dates from 2005-06-23. The status of the download server is discussed in BambooWeb talk:Database download. The dumps can be read into a MySQL relational database for leisurely analysis, testing of the BambooWeb software and, with appropriate preprocessing, perhaps offline reading. There is also a fuller archive of database dumps, containing tables other than cur and old.

The database schema is explained in schema.doc. The cur tables contain the current revisions of all pages; the old tables contain the prior edit history. Approximate file sizes are given for the compressed dumps; uncompressed they'll be significantly larger. The files for the larger wikis are currently split into files of about 2GB called xaa, xab, etc. See here for information on sticking them back together.
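
Joining the pieces amounts to concatenating them in name order. A minimal Python sketch follows, assuming the pieces sit in one directory and are named with an x followed by two lower-case letters (xaa, xab, ...); the function name and the output file name are made up for the example.

```python
import shutil
from pathlib import Path


def join_chunks(directory, output_name):
    """Concatenate split pieces (xaa, xab, ...) in name order into one file."""
    directory = Path(directory)
    # Assumption: pieces are named x plus two lower-case letters.
    parts = sorted(directory.glob("x[a-z][a-z]"))
    if not parts:
        raise FileNotFoundError(f"no split pieces found in {directory}")
    target = directory / output_name
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as piece:
                # Copy in blocks so multi-gigabyte pieces never sit in memory.
                shutil.copyfileobj(piece, out)
    return target
```

For example, join_chunks(".", "cur_table.sql.bz2") would write the joined file next to the pieces; the result still has to be decompressed afterwards.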

Windows users may not have a bzip2 decompressor on hand. A command-line Windows version of bzip2 is available free of charge under a BSD license, and 7-Zip, an LGPL-licensed graphical file archiver, can also open bz2-compressed files. Mac OS X ships with the command-line bzip2 tool and StuffIt Expander, a graphical decompressor.

Currently (as of July 2005), a compressed database dump of just the English BambooWeb is 23340 MB (1011 MB for just the current revisions). If you thought that's 23.34 gigabytes, you're absolutely correct. On a standard 56 kbit/s dial-up modem connection, it will take you only about 40 days to download!
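
The estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 1 MB is 10^6 bytes, 1 kbit is 1000 bits, and a modem that sustains its nominal rate with no protocol overhead, none of which holds exactly in practice.

```python
def download_days(size_megabytes, link_kbit_per_s):
    """Days needed to transfer size_megabytes over a link, ignoring all overhead."""
    bits = size_megabytes * 1_000_000 * 8          # assumption: 1 MB = 10**6 bytes
    seconds = bits / (link_kbit_per_s * 1_000)     # assumption: 1 kbit = 1000 bits
    return seconds / 86_400                        # seconds per day


full_dump = download_days(23340, 56)      # all revisions
current_only = download_days(1011, 56)    # current revisions only
print(f"full: {full_dump:.1f} days, current only: {current_only:.1f} days")
# prints "full: 38.6 days, current only: 1.7 days"
```

Under these assumptions the full dump comes out slightly under the 40 days quoted above, while the current-revisions dump takes less than two days.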

Doing SQL queries on the current database dump

You can run SQL queries on the current database dump as a replacement for the disabled function Special:Asksql.

More information about this service is available at de:Benutzer:Filzstift/wikisign.org (in German only).


An XML export wraps its content in an envelope like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
           version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>BambooWeb</sitename>
    <base>http://en.BambooWeb.org/articles/M/a/Main_Page</base>
    <generator>MediaWiki 1.5beta3</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
      <namespace key="2">User</namespace>
      <namespace key="3">User talk</namespace>
      <namespace key="4">BambooWeb</namespace>
      <namespace key="5">BambooWeb talk</namespace>
      <namespace key="6">Image</namespace>
      <namespace key="7">Image talk</namespace>
      <namespace key="8">MediaWiki</namespace>
      <namespace key="9">MediaWiki talk</namespace>
      <namespace key="10">Template</namespace>
      <namespace key="11">Template talk</namespace>
      <namespace key="12">Help</namespace>
      <namespace key="13">Help talk</namespace>
      <namespace key="14">Category</namespace>
      <namespace key="15">Category talk</namespace>
    </namespaces>
  </siteinfo>
</mediawiki>
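
The namespace keys in this header are the same numbers stored in the cur_namespace column of the dumps, so it can be handy to read them into a lookup table. The Python sketch below does that; the function name is hypothetical and it assumes the 0.3 schema namespace shown in the header.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the export-0.3 namespace shown in the header.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.3/"}


def namespace_table(export_xml):
    """Return {namespace number: namespace name} from the siteinfo block."""
    root = ET.fromstring(export_xml)
    table = {}
    for ns in root.findall("mw:siteinfo/mw:namespaces/mw:namespace", NS):
        # The main article namespace (key 0) has no name, so use an empty string.
        table[int(ns.get("key"))] = ns.text or ""
    return table
```

With the header above, the table maps 12 to Help and 10 to Template, and 0 to the empty string.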

Images and uploaded files

Unlike the article text, many images are not released under GFDL or the public domain. These images are owned by external parties who may not have consented to their use in BambooWeb. BambooWeb uses such images under the doctrine of fair use under United States law. Use of such images outside the context of BambooWeb or similar works may be illegal. Also, many images legally require a credit or other attached copyright information, and this copyright information is contained within the text dumps available from download.wikimedia.org. Some images may be restricted to non-commercial use, or may even be licensed exclusively to BambooWeb. Hence, download these images at your own risk.

As of 2004-07-19, the image archive is unavailable for unknown reasons (the links point to non-existent files). However, the archive at download.wikimedia.org/archives/en/ is working; the image files there are named 20040702_BambooWeb_en_upload.tar.aa and 20040702_BambooWeb_en_upload.tar.ab. Before that, only the files uploaded to the English BambooWeb were available to download. These might be reinstated, and others may follow later. The file archives, like the text archives, are available at http://download.wikimedia.org/#images and are split into 1.9 GB chunks.

As of 2005-08-11, the image dumps are 16.7 GB files. wget and Mozilla Firefox are known to have trouble downloading files this large, so a recent version of curl is recommended.

Static HTML tree dumps for mirroring or CD distribution

Terodump is an alpha-quality BambooWeb-to-static-HTML dumper, made from BambooWeb code. A static HTML dump (beta quality) is available as BambooWeb-terodump-0.1.tar.bz. This dump was made from a database that is some months old. - User:Tero

Wiki2static is an experimental program set up by User:Alfio to generate HTML dumps, including images, a search function and an alphabetical index. Experimental dumps and the script itself can be downloaded from the linked site. As an example, it was used to generate these copies: English BambooWeb 24 April 04 and Simple BambooWeb 1 May 04 (old database format), and English BambooWeb 24 July 04, Simple BambooWeb 24 July 04 and BambooWeb Francais 27 Juillet 2004 (new format). BozMo uses a version to generate periodic static copies at fixed reference.

If you'd like to help set up an automatic dump-to-static function, please drop us a note on the developers' mailing list.

See also BambooWeb:TomeRaider database.

Possible problems during local import

See BambooWeb:Database dump import problems.

Please do not use a web crawler

Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of BambooWeb. Our robots.txt restricts bots to one page per second and blocks many ill-behaved bots.
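
If a small tool does have to fetch a handful of live pages, it can read the delay from robots.txt instead of hard-coding it. The Python sketch below uses the standard urllib.robotparser module on robots.txt text that has already been fetched; the function name is hypothetical, and the robots.txt content in the usage note is an invented sample, not the site's real file.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_txt, user_agent="*"):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)
```

Called on a sample file containing "User-agent: *" followed by "Crawl-delay: 1", this returns 1; a client would then sleep at least that many seconds between requests.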

Sample blocked crawler email

IP address nnn.nnn.nnn.nnn was retrieving up to 50 pages per second from BambooWeb.org addresses. Robots.txt has a rate limit of one per second set using the Crawl-delay setting. Please respect that setting. If you must exceed it a little, do so only during the least busy times shown in our site load graphs at http://wikimedia.org/stats/live/org.wikimedia.all.squid.requests-hits.html . It's worth noting that to crawl the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it - we'll just block your whole IP range.

If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place. More details are available at http://en.BambooWeb.org/articles/B/a/BambooWeb:Database_download.

Instead of an email reply you may prefer to visit #mediawiki at irc.freenode.net to discuss your options with our team.

Importing sections of a dump

The following Perl script is a parser for extracting the Help sections from the SQL dump:

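# Run with "perl -n", so the body is applied to each input line in turn.
s/^INSERT INTO cur VALUES //gi;
s/\n// if (($j++ % 2) == 0);
# Put every record on its own line.
s/(\'\d+\',\'\d+\'\)),(\(\d+,\d+,)/$1\;\n$2/gs;
foreach (split /\n/) {
    # Keep only records in namespace 12 (Help).
    next unless (/^\(\d+,12,\'/);
    s/^\(\d+,\d+,/INSERT INTO cur \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
        cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
        cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \(12,/;
    # Remove the line breaks and indentation from the field list above.
    s/\n\s+//g;
    s/$/\n/;
    print;
}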

NOTE: In the current meta.special dump (as of 2005-05-16), the order of the fields in the cur table has changed: inverse_timestamp now comes BEFORE cur_touched. This may cause Windows users no end of grief, because all of a sudden your MediaWiki starts sprouting PHP errors about dates that are negative or occur before 1 January 1970 being passed to the gmdate and gmmktime functions in GlobalFunctions.php. The reason is that the fields are swapped around, so there is rubbish data in these two fields. Maybe the Unix versions of these functions are smarter or do not cause PHP to spit a warning message into the HTML output, or else people have php.ini configured not to display these.

In other words, check that the field order in the script aligns with those in the dump. Better still, we should look at changing the script to retain whatever field order the dump uses 8-)

You can run the script and get a resulting help.sql file with this command:

bzip2 -dc <Date>_cur_table.sql.bz2 | perl -n <Script Name> > help.sql
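
The same streaming idea works without a shell pipe. As a sketch, the Python generator below reads a compressed dump line by line, so the uncompressed file never has to exist on disk. The function name is hypothetical, and the UTF-8 default is an assumption; older dumps may use a different encoding, which is why undecodable bytes are replaced rather than raising an error.

```python
import bz2


def iter_dump_lines(path, encoding="utf-8"):
    """Yield decoded lines from a .bz2 dump without unpacking it on disk."""
    # Assumption: the dump is text in the given encoding; bad bytes are replaced.
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as dump:
        for line in dump:
            yield line
```

Each INSERT statement in a dump can be very long, so a single yielded line may still be large even though the file as a whole is streamed.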

The script can be easily modified to acquire any section you need with a few minor changes. Currently, it is set to get all records from namespace 12, the Help namespace. You can change the two 12's to grab a different namespace, or slightly change a couple of regular expressions to get, say, all articles that begin with Q:

next unless (/^\(\d+,\d+,\'[qQ]/);
s/^\(\d+,/INSERT INTO cur \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
    cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
    cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \(/;

Or you can use a more generic version of this script from User:Msm/extract.pl.
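
Another way to select records, sketched here in Python rather than Perl, is to split the VALUES list into rows with a small quote-aware scanner and then filter on the namespace column. Because each row is kept exactly as it appears, the field order of the dump is preserved. This is not the script referred to above; the function names are hypothetical, and it assumes backslash-escaped quotes and that the namespace is the second field, as the regular expressions above imply.

```python
def parse_rows(values):
    """Split a MySQL VALUES list into rows, each a list of raw field strings."""
    rows, fields, buf = [], [], []
    in_row = in_str = False
    i = 0
    while i < len(values):
        ch = values[i]
        if in_str:
            if ch == "\\":                    # keep the escape and the escaped char
                buf.append(values[i:i + 2])
                i += 2
                continue
            if ch == "'":
                in_str = False                # closing quote of a string field
            buf.append(ch)
        elif ch == "'" and in_row:
            in_str = True
            buf.append(ch)
        elif ch == "(" and not in_row:
            in_row = True                     # start of a new row
        elif ch == ")" and in_row:
            fields.append("".join(buf))
            rows.append(fields)
            fields, buf, in_row = [], [], False
        elif ch == "," and in_row:
            fields.append("".join(buf))       # field separator inside a row
            buf = []
        elif in_row:
            buf.append(ch)
        i += 1
    return rows


def rows_in_namespace(values, namespace, column=1):
    """Keep rows whose field at `column` (assumed: the namespace) matches."""
    wanted = str(namespace)
    return [row for row in parse_rows(values)
            if len(row) > column and row[column] == wanted]
```

Text outside the parentheses is skipped, so a whole INSERT line can be passed in as long as the part before the first row contains no quote characters. Calling rows_in_namespace(line, 12) keeps the Help records and rows_in_namespace(line, 10) the Template records.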

NOTE: While this sounds like a really straightforward way to grab the Help namespace (#12) for use on your newly implemented MediaWiki site, you need more than just that. You also need the Template namespace (#10), since many of the Help: pages rely on templates in some form or another. Of course, you then end up with hundreds of templates that are NOT used by the Help: pages too. Has anyone got a better idea for a script to do this? - Armistej

Rsync

You can use rsync to download the database. For example, this command will download the current English database:

rsync rsync://download.wikimedia.org/dumps/BambooWeb/en/cur_table.sql.bz2 . --partial --progress

The "--partial" switch prevents rsync from deleting the file in the event the download is interrupted. You may then issue the very same command again to resume the download. The "--progress" switch will show the download progress; for less verbose output, do not use this switch.

The rsync utility is designed to synchronize files in such a way that only the differences between them are transferred. This provides a considerable performance enhancement, especially when synchronizing large files that have relatively few changes. However, if a file is compressed or encrypted, rsync will not perform well; in fact, it may perform worse than downloading a fresh copy of the file. Many of the database files are only available compressed. Therefore, there is little, if anything, to be gained by using rsync to speed up an update of an older SQL dump.

If the SQL dumps were available uncompressed, this process should work extremely well, especially if rsync is invoked with the on-the-fly compression switch (-z). It is uncertain whether uncompressed database dumps will become available. However, rsync remains a useful and expedient tool for resuming downloads that have been interrupted, repairing downloads that have become corrupted, or updating any files that are not compressed (e.g. upload.tar). For more information, see rsync.

Technical notes

See also


Also downloadable from the BigPond file server: images at http://files.bigpond.com/library/index.php?go=details&id=17361 and articles at http://files.bigpond.com/library/index.php?go=details&id=17362



-This article has been brought to you by BambooWeb and Wikipedia-



This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License.