pointing SPARQL at Christ Church

I’ve been playing around with SPARQL, trying it out on DBpedia, a database of information extracted from Wikipedia. It has been an instructive exercise in data mining, not only for improving my expertise in this area but also for showing the potential and limitations of the approach.

I am responsible for much of the content on the Wikipedia page for Christ Church Bath. I started by getting out the information about the church, and then asked questions about other churches with this dedication in Britain. In particular, I decided to test the hypothesis that most such churches were founded in the 19th century or later. (Christ Church Bath was actually dedicated in 1798.)

The difficulty is that while I pulled out many Christ Churches from the database, it was hard to isolate the British ones, and harder still to get a date of foundation/building/dedication. Those slashes indicate the reason: there are several possible dates which could be used to give the age of a church, such as foundation, dedication or consecration. You can search on each possibility, but a further problem arises: the data has to be in the database in the first place.

The data in DBpedia comes mostly from the boxes on the right of Wikipedia pages, and SPARQL searches can only find what is in the database. So the success of a query depends on that information being in the box. If the dedication date (or the date when the building was completed or the first service was held) is mentioned only in the text in the centre of the page, it will be missed. In practice, although pages about churches usually have a box of information, it often doesn’t have very much in it. (Consider for example this page, for a nearby church, where the box has only the denomination, country, grid reference and a photograph.)
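As a rough illustration of searching on each possible date at once, here is a minimal Python sketch that builds a SPARQL query with one OPTIONAL clause per candidate property and takes the first date that is present. The property names (`dbp:foundedDate`, `dbp:dedicatedDate`, `dbp:consecratedDate`) and the helper name `build_date_query` are assumptions for illustration, not names checked against DBpedia, and `COALESCE` needs an endpoint that supports SPARQL 1.1.

```python
# ASSUMPTION: illustrative guesses at infobox-derived property names;
# check them against the actual DBpedia data before relying on them.
DATE_PROPERTIES = ("dbp:foundedDate", "dbp:dedicatedDate", "dbp:consecratedDate")


def build_date_query(name_fragment, date_properties=DATE_PROPERTIES):
    """Return a SPARQL query that tries each candidate date property in turn.

    name_fragment is pasted into the query as-is, so it must not contain quotes.
    """
    variables = [f"?d{i}" for i in range(len(date_properties))]
    # OPTIONAL keeps a church in the results even when a given date is absent.
    optionals = [
        f"  OPTIONAL {{ ?church {prop} {var} }}"
        for prop, var in zip(date_properties, variables)
    ]
    coalesce_args = ", ".join(variables)
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX dbp: <http://dbpedia.org/property/>",
        # COALESCE returns the first bound date, in the order listed above.
        f"SELECT ?church ?y (COALESCE({coalesce_args}) AS ?date)",
        "WHERE {",
        "  ?church rdfs:label ?y .",
        *optionals,
        f'  FILTER (regex(str(?y), "{name_fragment}"))',
        "}",
    ]
    return "\n".join(lines)


print(build_date_query("Christ Church"))
```

Churches whose box holds none of these dates would still be returned, but with `?date` left empty, which makes the gaps in the data visible rather than hiding them.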

A further problem concerns location. This is available for many churches, as Google Maps and other tools make calculating latitude and longitude easy. But SPARQL, whose syntax resembles SQL, is inflexible as a query language. It is not hard (though rather fiddly) to draw a square or rectangular box around a point and ask for all matching resources within it.

The following does this for 0.5 degrees of latitude and 0.5 of longitude around Christ Church Bath:

filter (regex(?y,"Christ Church") &&
?lat - 51.387 <= 0.5 && 51.387 - ?lat <= 0.5 && ?long - -2.362 <= 0.5 && -2.362 - ?long <= 0.5 )
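To show where a filter like this sits in a complete query, here is a small Python sketch that generates one for any centre point. It assumes that `?y` is bound through `rdfs:label` and that coordinates come from the W3C `geo:lat` and `geo:long` properties; the filter above does not say how its variables are bound, so those bindings, like the helper name `build_box_query`, are assumptions.

```python
def build_box_query(lat, long, half_width=0.5, name_fragment="Christ Church"):
    """Return a SPARQL query for labelled resources in a box around (lat, long).

    name_fragment is pasted into the query as-is, so it must not contain quotes.
    """
    # Pre-compute the edges of the box so the filter is four plain comparisons.
    south, north = lat - half_width, lat + half_width
    west, east = long - half_width, long + half_width
    return "\n".join([
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>",
        "SELECT ?church ?y ?lat ?long",
        "WHERE {",
        # ASSUMPTION: label and coordinates are bound via these properties.
        "  ?church rdfs:label ?y ;",
        "          geo:lat ?lat ;",
        "          geo:long ?long .",
        f'  FILTER (regex(str(?y), "{name_fragment}") &&',
        f"          ?lat >= {south:.4f} && ?lat <= {north:.4f} &&",
        f"          ?long >= {west:.4f} && ?long <= {east:.4f})",
        "}",
    ])


print(build_box_query(51.387, -2.362))
```

Working out the edges of the box beforehand is equivalent to the subtractions in the filter above, and makes the fiddly part a one-off calculation.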

But it would be good to have a way of drawing a circle of a given radius around your point and searching within that. (I realise that the curvature of the earth will interfere with precise calculation of this, but an approximation would do for most purposes.) Perhaps this is possible, and if so I would be interested to hear how.
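One possible approximation, offered as a sketch rather than a definitive answer, treats the area around the point as flat: a degree of latitude is about 111 km everywhere, while a degree of longitude shrinks by the cosine of the latitude. The squared distance can then be compared with the squared radius using only addition, subtraction and multiplication, which a SPARQL filter can do. The helper names `circle_filter` and `in_circle` and the constant `KM_PER_DEGREE` are illustrative, and the result is only reasonable for small radii away from the poles.

```python
import math

# Approximate length of one degree of latitude, from a mean Earth radius of 6371 km.
KM_PER_DEGREE = 2 * math.pi * 6371 / 360


def circle_filter(lat, long, radius_km):
    """Return a SPARQL FILTER approximating a circle of radius_km around (lat, long)."""
    r_deg = radius_km / KM_PER_DEGREE      # radius expressed in degrees of latitude
    k = math.cos(math.radians(lat))        # shrink factor for degrees of longitude
    # The constants are computed here, so the filter needs no trigonometry.
    return (
        f"FILTER ((?lat - ({lat})) * (?lat - ({lat})) + "
        f"{k * k:.6f} * (?long - ({long})) * (?long - ({long})) "
        f"<= {r_deg * r_deg:.6f})"
    )


def in_circle(lat, long, radius_km, p_lat, p_long):
    """Apply the same flat-earth test in Python, to sanity-check the filter."""
    r_deg = radius_km / KM_PER_DEGREE
    k = math.cos(math.radians(lat))
    return (p_lat - lat) ** 2 + (k * (p_long - long)) ** 2 <= r_deg ** 2


print(circle_filter(51.387, -2.362, 20))
```

The generated filter could replace the four box comparisons, or be combined with them so that the cheaper box test narrows the candidates first.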

So DBpedia and SPARQL are potentially valuable, but much of the value is still locked away. There needs to be a culture among Wikipedians of using boxes to store information. And SPARQL itself is limited in the queries it can express.