Friday, September 14, 2012

2012 Australian Paralympic Team Book

John Vandenberg, President of Wikimedia Australia, has created a book collecting all the material on the "2012 Australian Paralympic Team". It is available free online as a PDF download or an ODT word-processor file (a printed book can also be purchased). Wikimedia Australia's volunteers took on the task of documenting the 2012 Australian Paralympic Team, and this book is one result of the efforts of many people (who are listed in the book).

This Wikipedia book (or Wikibook) is created from a collection of Wikipedia articles, and each time someone requests a book it is generated afresh from the latest version of Wikipedia.

One problem is that Wikibooks can be large. The Paralympic book is about 90 MB. The HTML format used for the web is usually very efficient in its use of file space, so I examined the book file to see what the problem was.

I downloaded the ODT version of the book and unzipped it to examine the source files. The images take up about 90% of the space. There are hundreds of small images, which are flags of countries and clip art of medals. There are also a few multi-megabyte images, which are photographs of the athletes.
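An ODT file is a ZIP archive, so a size check like this one can also be done without unpacking anything. The following is only a minimal sketch in Python: the function name `image_share` and the list of image extensions are assumptions for illustration, not part of the book tool.

```python
import zipfile
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to match what the archive contains.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".jpe", ".gif", ".svg"}


def image_share(odt_file):
    """Return (image_bytes, total_bytes), using compressed sizes in the archive."""
    image_bytes = 0
    total_bytes = 0
    with zipfile.ZipFile(odt_file) as archive:
        for info in archive.infolist():
            total_bytes += info.compress_size
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix in IMAGE_EXTENSIONS:
                image_bytes += info.compress_size
    return image_bytes, total_bytes


# Hypothetical usage (the file name is a placeholder):
# images, total = image_share("book.odt")
# print(f"images: {images / total:.0%} of {total} bytes")
```

Compressed sizes are used because they are what determines the size of the download.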

There appears to be a problem with the book-generating tool, resulting in duplicated images. For example, each time the image of a medal appears in a Wikipedia entry, a new copy of the same medal image is included in the book, although only one copy of each image should be needed.
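Whether the files in the archive really are byte-for-byte copies can be checked by hashing each one. The sketch below assumes nothing about how the book tool works; the function name `find_duplicate_members` is hypothetical, and it only reports archive members whose contents are identical.

```python
import hashlib
import zipfile
from collections import defaultdict


def find_duplicate_members(odt_file):
    """Return {sha256 digest: [member names]} for contents stored more than once."""
    by_digest = defaultdict(list)
    with zipfile.ZipFile(odt_file) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries have no content to compare
            digest = hashlib.sha256(archive.read(info)).hexdigest()
            by_digest[digest].append(info.filename)
    # Keep only digests shared by two or more members.
    return {d: names for d, names in by_digest.items() if len(names) > 1}
```

Any group in the result is a set of files with identical content. Files that merely point to another image would show up here only if the pointers themselves were identical.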

Also, the photographs of athletes are included in the book at high resolution. This is useful for high-quality printing, but wastes space for on-screen viewing. An option for using smaller versions of the images would be useful.

Perhaps Wiki-HQ could make some changes to the book-generating software. Removing the duplicated images would speed up the book creation process on the server and the download time, as well as reduce the file size for the user. As an example, image optimization should reduce the Paralympic book from 90 MB to less than 15 MB.

1 comment:

Volker Haas said...

Hi Tom,
I am one of the developers of the book tool and I investigated the problem you described.

You are absolutely correct that images are responsible for at least 90% of the file size. But images are not duplicated: I also unzipped the ODT file and discovered lots of files with the extension "jpe", but these only seem to be links to the "real" image files (and thus take up little or no space).

The book you mention is indeed pretty big - I am sorry to say that we currently do not plan to offer any options to customize the output (like using smaller images). The reasoning for using large images is - as you suspected - the use in the printed books we are producing.

Best Regards,
Volker Haas