Understanding how users search Parliament’s website, and what they search for, is crucial to continually improving the search function.
We wanted to understand if there was a distinct difference between parliamentary language and what users search for on the website. Parliamentary procedure is complex and its terminology can sometimes be difficult to understand. For example, before I started working at Parliament I didn’t know that a vote is called a ‘division’.
The Commons Library has a team of librarians responsible for cataloguing parliamentary material, called the Indexing and Data Management Section (IDMS). Besides indexing all parliamentary material, they also maintain a controlled vocabulary (taxonomy). The controlled vocabulary is made up of preferred terms and their synonyms or acronyms (non-preferred terms).
Read on to find out about the work we've done to match search terms with concepts from the IDMS’ controlled vocabulary in order to understand any ‘language barrier’ between website users and parliamentary language.
What we did
We used two data sources for this work.
First, the controlled vocabulary. We used a list of all vocabulary terms that had been used by IDMS up to September 2018. For example, for concept 8483, “avian influenza” is the preferred term and “bird flu” is a non-preferred term. Terms are also grouped into classes (parliamentary procedure terms, content types, legislation names, people, and organisation names). In total there were 41,936 preferred terms and 27,130 non-preferred terms.
Second, three months of search term data. This was taken from four different search engines: internal and external parliamentary search (where users can find parliamentary material), Hansard, and what we call the default search. In total, 136,667 unique search terms were used.
The method for matching is quite simple. We used regex to match search terms against the vocabulary. There were a couple of things we had to consider for our analysis:
- There are many acronyms in the vocabulary, and a lot of them can be confused with ordinary words. For example, the acronym “DATA” (Design and Technology Association) can be confused with the word “data”. To minimise this, vocabulary terms that are entirely upper case get a slightly stricter regex, which only matches if the search term is also upper case.
- Search terms are converted to lower case, except for the ones that have words with two to five upper case letters because these are most likely acronyms.
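To make the two rules above concrete, here is a minimal Python sketch of the matching. It is an illustration rather than our actual code: the function names, the whole-word matching, the reading of "two to five upper case letters" as a whole word of that length, and every vocabulary entry other than concept 8483 and the “DATA” acronym are assumptions made for the example.

```python
import re

# Toy vocabulary rows: (concept_id, term, is_preferred).
# Only concept 8483 and the DATA acronym come from the post;
# the other ids and rows are made up for illustration.
VOCABULARY = [
    (8483, "avian influenza", True),
    (8483, "bird flu", False),
    (1, "DATA", False),      # hypothetical concept id
    (2, "division", True),   # hypothetical concept id
]

# Assumption: a "likely acronym" is a whole word of 2 to 5 upper case letters.
LIKELY_ACRONYM = re.compile(r"\b[A-Z]{2,5}\b")


def normalise(search_term: str) -> str:
    """Lower-case a search term unless it contains a likely acronym."""
    if LIKELY_ACRONYM.search(search_term):
        return search_term
    return search_term.lower()


def compile_term(term: str) -> re.Pattern:
    """Build a whole-word pattern for one vocabulary term.

    Upper case vocabulary terms are matched case-sensitively (the stricter
    rule); all other terms ignore case.
    """
    flags = 0 if term.isupper() else re.IGNORECASE
    return re.compile(r"\b" + re.escape(term) + r"\b", flags)


PATTERNS = [(cid, term, pref, compile_term(term)) for cid, term, pref in VOCABULARY]


def match(search_term: str):
    """Return the vocabulary rows whose pattern is found in the search term."""
    text = normalise(search_term)
    return [(cid, term, pref) for cid, term, pref, pat in PATTERNS if pat.search(text)]
```

With this sketch, `match("Bird Flu")` finds the non-preferred term for concept 8483, `match("data protection")` finds nothing, and `match("DATA membership")` finds the acronym.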
What we found
Overall, 57% of search terms matched vocabulary terms. Moreover, non-preferred terms represented 12% of the total matches.
Out of the 41,936 preferred terms, 14% had at least one match. We saw similar numbers for non-preferred terms as well (15%).
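These percentages use different denominators (search terms, matches, and vocabulary terms), which is easy to mix up. The short Python sketch below shows how each figure could be computed from a list of matches. The data is invented toy data, not our real results, and the function and key names are hypothetical.

```python
# Toy inputs, invented for illustration only (not the real data).
preferred = {"avian influenza", "division", "select committees"}
non_preferred = {"bird flu", "vote"}
search_terms = {
    "bird flu outbreak",
    "division list",
    "how to vote",
    "avian influenza",
    "wifi password",
}
# One (search term, matched vocabulary term) pair per match.
matches = [
    ("bird flu outbreak", "bird flu"),
    ("division list", "division"),
    ("how to vote", "vote"),
    ("avian influenza", "avian influenza"),
]


def summarise(search_terms, matches, preferred, non_preferred):
    """Compute the four proportions, each with its own denominator."""
    matched_searches = {s for s, _ in matches}
    matched_vocab = {t for _, t in matches}
    return {
        # Share of unique search terms with at least one match.
        "search_terms_matched": len(matched_searches) / len(search_terms),
        # Share of all matches that hit a non-preferred term.
        "non_preferred_share_of_matches":
            sum(t in non_preferred for _, t in matches) / len(matches),
        # Share of preferred terms hit by at least one search.
        "preferred_terms_matched": len(matched_vocab & preferred) / len(preferred),
        # Share of non-preferred terms hit by at least one search.
        "non_preferred_terms_matched":
            len(matched_vocab & non_preferred) / len(non_preferred),
    }


summary = summarise(search_terms, matches, preferred, non_preferred)
```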
Breaking down the matches by search engine, we found a higher proportion of internal searches matching preferred terms and a higher proportion of default searches matching non-preferred terms. This was not a surprising finding, as we know our internal users are a lot more familiar with parliamentary terms.
What we want to work on in the future
We now have a simple, straightforward method for matching search terms with the vocabulary.
Besides gaining a better understanding of what users are searching for, this analysis can also be used to support the librarians’ work on the continuous maintenance of the taxonomy.
The next step will be to look at the list of search queries that didn’t match any terms from the controlled vocabulary. This will help us understand if there are any topics, themes, or words that users search for that are not in the vocabulary and could potentially be included as synonyms for what’s already there.
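One possible starting point, sketched in Python, is to count the most frequent words among the unmatched queries. This is only an idea of how that step might look, not something we have built: the function name, the stopword list, and the example queries are all invented for illustration.

```python
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be longer.
STOPWORDS = frozenset({"the", "of", "and", "in", "to", "for", "a"})


def frequent_unmatched_words(search_terms, matched, top=10):
    """Count the most common words in search terms that had no match.

    `search_terms` is an iterable of query strings and `matched` is the set
    of queries that matched at least one vocabulary term.
    """
    unmatched = [s for s in search_terms if s not in matched]
    counts = Counter(
        word
        for query in unmatched
        for word in query.lower().split()
        if word not in STOPWORDS
    )
    return counts.most_common(top)


# Invented example queries, not taken from the real search logs.
queries = ["brexit deal", "bird flu", "brexit vote date", "wifi"]
top_word = frequent_unmatched_words(queries, matched={"bird flu"}, top=1)
```

Words that appear often in this list but nowhere in the vocabulary would be candidates for the librarians to review.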
Read more posts on the work of the data and search team.