
Where did my work go?


As I look around for work (my current search is for freelance work in the digital humanities), I’ve been working out how much of what I did in my sixteen years at ILRT at Bristol University has survived in a recognisable form. Obviously there are publications, such as an article in Ariadne and more recently a prizewinning essay. But my online legacy is harder to trace.

Many projects I worked on were pilots for services which never appeared, or, in one case, planning for what might be done if suitable funding could be found. Others were never intended to last long (e.g. surveys), or have changed out of all recognition over time. Few websites provide anything like the same service as they did in the last millennium, when I started working. The mighty edifice that was SOSIG and then Intute (in which I had a small hand) was frozen when its funding was pulled. Other services are now inaccessible to me: for example, BOPCRIS and the 18th century parliamentary papers are behind a paywall, and INASP may or may not be using the repository I designed for them.

A recurring pattern seems to be that my clients (the people I worked and communicated with directly) and their superiors had different ideas about what should actually be done. A case in point was From History to Her Story (for WYAS); our site was on the point of release when the content ‘had to’ be moved to another service provider. Most of the content vanished from the new site, and what remained was largely broken. Years later the site moved again to a third host and the content reappeared – at the cost of paying twice for the same piece of work! Something similar, but with less duplication of effort, happened to EPPI. I have written elsewhere about some of the aftermath of BOPCRIS. I could give several more examples.

Also, looking more closely and bearing in mind the passage of time, what strikes me is not that there isn’t much online now, but that a lot of things I worked on never saw the light of day at all. Perhaps this was the nature of work there, that we undertook a lot of pilot, proof-of-concept and experimental projects.

So what remains? Hidden Lives Revealed (for the Children’s Society), which won a prize from the Institute of Archivists, was handed over smoothly to another service provider and is still active. Another functioning continuation of my work is Bristol University’s Research Data Management service, whose pilot phase, data.bris, I was involved in. data.bris made enough of an impact that its name is still attached to the service, although not all the staff could be retained. The Bristol University InfoSafe tutorial (aimed at support staff), which I helped to write, is also still available, as is Jisc Digital Media’s ‘Developing Community Collections’ resource, although in the latter case I was only involved with putting the content online, not creating it. There are other tutorials and the like around the place which I helped to put together. It is not easy to point out my contributions to a potential employer, but there is a body of work there.

does Twitter work?

I’ve taken over doing social media for a couple of organisations, including a local choir. This choir has posted over 4,000 tweets and now has over 1,500 followers. As a rough rule of thumb, I follow back about half of those who follow us.

Is this translating into extra ticket sales and new choir members? It’s very hard to quantify this. I would say that only about 1 in 6 of the new followers we get are individuals rather than organisations. Some of these individuals are agents, musical promoters and others with a professional interest in what we do; others live overseas so are unlikely to join us or attend concerts. Does this matter?

I think it is less important than it seems:

  • Behind every organisation that follows us there are people who may see our tweets
  • The professionals may be useful to us at some point
  • There are still enough interested individuals out there to make it worthwhile
  • Retweets broadcast our information much more widely

Perhaps the real value is just in our presence, which tells the world we’re there and raises awareness of us.

Leaves, snow and birds: covering the country

A while back I compared three sites that crowdsource data about the natural world: the RSPB’s Big Garden Birdwatch, NatureLocator’s Leaf Watch and the UK snow map. I return to these sites and consider another aspect: geographical unevenness.

(Leaf Watch has, I think, ended, but other NatureLocator surveys have been running and the same principles apply to them.)

All these sites ask people to submit information about what is visible around them. So you are likely to get more reports from where there are many people, such as big cities, and fewer from sparsely inhabited areas such as the Highlands or mid-Wales.

For the snow map this may not matter too much. It’s clear there are people in remote areas sending in reports, which appear as dots or flakes on the map. There are some in the ‘cefn gwlad’ behind Aberystwyth as I write. And there appears to be a limit on how many dots can appear in the same small area. Nevertheless, snowfall in London generates a busier looking map than snowfall in the countryside. But the snow map is not a matter of record; it is ephemeral and serves to inform people right now about where snow is falling so they can plan journeys/get the sledge out/feel Schadenfreude. So reading it just needs a little thought, and a mental map of the major centres of population, to factor out the weighting towards cities.

The Big Garden Birdwatch does not attempt to cover the country evenly. It’s expected that the birdwatching will take place in gardens (hence the name) or public parks, and the survey is specifically intended to measure avian activity near human settlements. (This will cause variations, for example due to hard weather in some years which will drive birds from open country into gardens. There are further sources of bias, for example caused by one species of bird being mistaken for another and by less conspicuous species such as wrens being undercounted. And since the count takes place in January, summer visitors will be missed altogether. But that’s going beyond the scope of this post.)

NatureLocator is a scientific project which was driven by biologists needing data. Although it was produced by colleagues, I haven’t been involved enough to know what exactly is done with the information submitted. Since what was being measured was the overall distribution of leaf miner moth infestation and the proportion of trees affected, the main risk due to geographical unevenness is not getting enough reports in some areas. There might also be a bias due to users being more likely to report the presence of the moth than its absence. The results are summarised here where the geographical issue is acknowledged.

pointing SPARQL at Christ Church

I’ve been playing around with SPARQL and trying it out on DBpedia, a database of structured information extracted from Wikipedia. It’s been an instructive exercise in data mining, not only for improving my expertise in this area, but also for showing the potential and limitations of the approach.

I am responsible for much of the content on the Wikipedia page for Christ Church Bath. I started by getting out the information about the church, and then asking questions about other churches with this dedication in Britain. In particular I decided to test the hypothesis that most such churches were founded in the 19th century or later. (Christ Church Bath was actually dedicated in 1798).

The difficulty is that while I pulled out many Christ Churches from the database, it was hard to isolate the British ones, and harder still to get a date of foundation/building/dedication. Those slashes indicate the reason: there are several possible dates which could be used to give the age of a church, such as foundation, completion of the building, dedication or consecration. You can search on each possibility, but a further problem arises: the data has to be in the database in the first place.
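To make that concrete, here is the kind of query I mean, to be run against the DBpedia SPARQL endpoint (http://dbpedia.org/sparql). It is only a sketch: the dbp: property names below are guesses at plausible infobox keys rather than a definitive list, and in practice you have to experiment to see which of them are actually populated.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>

# Christ Churches, with whichever of several candidate dates
# happens to be present in the infobox (property names are guesses)
SELECT ?church ?name ?founded ?dedicated ?consecrated WHERE {
  ?church a dbo:Church ;
          rdfs:label ?name .
  OPTIONAL { ?church dbp:founded ?founded }
  OPTIONAL { ?church dbp:dedication ?dedicated }
  OPTIONAL { ?church dbp:consecratedDate ?consecrated }
  FILTER ( regex(str(?name), "Christ Church") && lang(?name) = "en" )
}
LIMIT 100

Putting each candidate date in its own OPTIONAL block means a church is not discarded just because one particular property is missing; but if none of them is filled in, the query can tell you nothing about its age.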

The data in DBpedia comes mostly from the boxes on the right of Wikipedia pages, and SPARQL searches can only find what is in the database. So the success of a query depends on that information being in the box. If the dedication date (or date when the building was completed or the first service was held) is mentioned only in the text in the centre of the page, it will be missed. In practice although pages about churches usually have a box of information, it often doesn’t have very much in it. (Consider for example this page, for a nearby church, where the box has only the denomination, country, grid reference and a photograph.)

A further problem concerns location. This is available for many churches, as Google Maps and other tools make finding latitude and longitude easy. But SPARQL’s filter expressions (whose syntax owes a good deal to SQL) are inflexible. It is not hard (though rather fiddly) to draw a square or rectangular box around a point and ask for all matching resources within it.

The following filter does this for 0.5 degrees of latitude and 0.5 of longitude either side of Christ Church Bath:

filter (regex(?y,"Christ Church") &&
?lat - 51.387 <= 0.5 && 51.387 - ?lat <= 0.5 && ?long - -2.362 <= 0.5 && -2.362 - ?long <= 0.5 )

But it would be good to have a way of drawing a circle of a given radius around your point and searching within that (I realise that the curvature of the earth will interfere with precise calculation of this, but an approximation would do for most purposes). Perhaps this is possible, and if so I would be interested to hear how.
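One rough-and-ready answer, using nothing beyond ordinary SPARQL arithmetic, is to approximate the circle: a degree of latitude is about 111 km, and a degree of longitude shrinks by the cosine of the latitude. The figures below are a sketch (not tested against the live endpoint) for a circle of roughly 5 km radius around Christ Church Bath: 0.389 is the square of the cosine of the latitude, and 0.002 is (5/111) squared.

# approximate 5 km circle around Christ Church Bath
# 0.389 ≈ cos²(51.387°); 0.002 ≈ (5 km / 111 km)²
filter (
  (?lat - 51.387) * (?lat - 51.387)
  + (?long + 2.362) * (?long + 2.362) * 0.389
  < 0.002
)

For larger radii, or anything needing real precision, a proper great-circle calculation (or a geospatial extension function, if the endpoint provides one) would be wanted, but for ‘churches near this one’ a flat approximation is probably good enough.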

So DBpedia and SPARQL are potentially valuable but much of the value is still locked away. There needs to be a culture among Wikipedians of using boxes to store information. And SPARQL itself is limited in what queries it can ask.

Moving email subscriptions

I have recently changed email address, and with it moved many subscriptions to mailing lists. I’ve used my numerous filter rules to identify what these are, as some are very occasional. In the process a few dormant ones can be weeded out, and some others of course have outlived their usefulness and can be dropped (ideally by unsubscribing so that the listowner doesn’t have to delete the address). Others must be moved.

In many cases, where no personal account is involved, I can unsubscribe from one address and re-subscribe from the other. Where there’s an account, things get more complicated. Usually logging in and changing my email address works, possibly after a few days when the list of subscribers is updated with changes. But there have been a few where I’ve changed my email address everywhere possible and still weeks later messages arrive at the old address.

For some reason two of the worst offenders are supermarkets and their points cards: Waitrose and Sainsbury’s are both continuing to use the old address to tell me how many points are on my card. Sainsbury’s has the added complication of a broken form which does not work on all browsers (there’s no excuse for that these days). Wales Millennium Centre is also happily ignoring the new address; perhaps they take a long time to update their subscription list.

How to identify a name

If I do a Google search on my own name (let’s face it, we all do vanity searches from time to time) I see at the bottom of the results page a warning sentence ‘Some results may have been removed under data protection law in Europe’. This message is now appearing after all sets of search results where Google thinks you’ve searched on a personal name.

My name is mildly unusual (though I share it with a sometime First Lady of a large U.S. State), but it’s clearly a name, even though my surname is also a common noun. My husband’s name (unique, we believe, with a very unusual surname in Europe) also generates the message. A quick test on other names produces mixed results. A Chinese name is identified as such, for example. Quotes and capital letters are disregarded and initials + surname are also recognised as a potential name. However….

I know someone called ‘Fondant Fancy’. OK, that isn’t actually her name, but she has a name which could also easily be a kind of cake. Searching on it produces references to her, as well as cake recipes. But no warning message. Nor does searching for a bogus surname invented by a friend produce the message, even in combination with a real forename. Nor does searching on my surname, or my husband’s, with an invented forename.

This system for identifying names is obviously never going to work 100%. Clearly they have a list of forenames (so my name gets identified) and surnames (so my friend’s bogus name does not). But some kinds of name are going to cause problems:

  • very unusual forenames or surnames. It’s now quite common to ‘invent’ names out of bits of other names, for example, in an effort to be unique. (And to cause a lifetime of problems every time your child has to give their name. But at least it fools Google.)
  • names where both parts are unusual (at least according to Google’s lists).

I imagine Google has some way of trawling directories of known names, and other sources, in order to get its raw material. Meanwhile there will soon be ways of getting at information which has been withdrawn from Google search results in the European domain. Sites will spring up offering searching at Google.com, or comparing the results of Google.com with Google.co.uk, or even offering to search pages which are found by one and not the other.

How not to design a downloadable form

I have recently had to download and print out a form from a financial institution. For legal reasons it requires a physical signature so it cannot be completed online.

There are a number of things badly wrong with this form:

a) The small print (there is quite a bit of it) is REALLY small. Almost illegibly so. It can be read on screen – by zooming if necessary – but on the printout, physical magnification is required.

b) The form covers four pages, and I am instructed not to staple the two sheets together. The first sheet (pages 1-2) has the details of the account I am opening. The second (pages 3-4) requests my signature, and otherwise has only the generic small print referred to above, and some boxes for staff at the issuing institution to complete. Anyone else see a problem with this?

c) Nowhere on the form is there any address to send it to. Probably there was one on the web page from which I downloaded it, but once the form is printed out and taken away that is lost.

This has all the hallmarks of a case where a document looks fine on screen, but no one has thought about the practical problems when it is transferred to hard copy.

When it’s bad to have an informative website

I occasionally go to concerts in London, and like to look up what is on in the coming months. Until recently it was hard to get classical music listings for such major venues as the South Bank Centre and the Barbican. The South Bank Centre in particular only allowed you to see classical events a few at a time. It has now improved, though the site is still slow.

Why make it so hard? Is it just careless website design, or is there an ulterior reason for putting barriers between classical music lovers and the information they seek?

Major London concert venues send out listings to their subscriber mailing lists every month. Unlike local venues such as St. George’s or the Wiltshire Music Centre, they charge for the subscription. There are of course other benefits, such as priority booking (though when I was a South Bank subscriber I still wasn’t sent the advance brochure for their ‘international concert series’!), but for many subscribers the brochures are the main reason to join.

If you make it too easy to get information over the web, you may lose income from selling subscriptions. On the other hand, if you make it easy to get information that way, you may sell tickets to non-subscribers who might not otherwise find out about your events. It looks like the latter view has now prevailed. I estimate that the profit from a couple of extra tickets sold would more than cover a lost subscription.

A similar issue applies to the BBC site, which does not put listings for Radio 3 in an easily printable form for the whole week, presumably so as not to dent Radio Times sales. (You can however get them here.)

Simulating denial of service

I was recently given a password for a new website. I made the mistake of giving my email address in two different forms in different places (a bad piece of design – if you have to have the same email address in both, why not copy it over automatically?). In attempting to log in again, I used the ‘wrong’ form of the address and quickly found myself locked out, unable even to make further login attempts. Phone calls were necessary to release me and give me access again.

The problem is that the behaviour of someone who believes they have the right login credentials, but hasn’t, looks very like a denial of service attack – bombarding a site with login attempts in quick succession. It’s a challenge to separate the two, but I think the human-generated pattern has certain characteristics. The human user will leave at least a few seconds between attempts. Also, he or she will not vary the attempts very much – they are likely to use the same few IDs and passwords in various combinations repeatedly. It shouldn’t be too difficult to be more generous to a login pattern of this type than to a suspected denial of service attack. A handful of login attempts several seconds apart is not going to hack or break a server, unless many attempts of this kind arrive simultaneously.

It’s the fuzz

Which is more annoying, fuzzy searching that is excessively fuzzy, or searching that is painfully literal?

A fine example of the first is the default search on our incident logging system, TopDesk. Operators can search all the incidents, and the default (which can only be switched off in advanced mode) is a fuzzy search. This is so fuzzy that it is almost useless, returning several times as many results as actually contain the term you’re searching on. Furthermore, the search term is not highlighted in the results, and since the interface doesn’t display all the text in a record at once – it is distributed among several boxes – you cannot easily tell whether a given result contains your search term or not. The database now runs to many thousands of records, so if you are not careful you will get several screens of results to sift through to find the one you want.

The opposite problem, over-literalism, can be found with Diigo. It will search only on the exact words you give it, with no plurals or other inflected forms. This means many searches have to be done two or more times with different variants, and I’ve taken to filling the descriptions of my bookmarks with likely search terms.