TL;DR there's a much shorter presentation here.
Since April I've been working as part of the Data and Search team in the Parliamentary Digital Service. It's a new team so we've spent some time figuring out what we do, what we don't do, how we work together and with other people. And we've made a fair bit of progress in identifying what's currently broken and what we need to do to fix it.
Dan's already blogged about the full remit of the team. I’m going to focus on 3 big parts of it:
- building a data platform that works to power the new website, Parliamentary open data and the business
- designing and developing a data model that properly ties together Parliamentary people, processes and outputs
- improving search (find?) both internally and in a general web search sense
For the purposes of this post I'm skipping the technical details of the data platform and concentrating on the data model while touching on some of the search work.
What's wrong with the website?
Well ... probably quite a few things. As with any website that serves a wide variety of users with a wide variety of needs, it's grown organically over the last few years. For now we're concentrating on the overall information architecture.
The current site is part powered by data coming from internal business systems cobbled together with a CMS. Nothing is quite connected in a way that matches user mental models or maps to reality, so you can be on a page showing navigation around the various aspects of the passage of a bill, click on a link from the navigation and get taken to a different page that looks different, feels different and loses the navigation you clicked on to get there.
There are roughly 100 different sub-domains under parliament.uk with bits of Parliamentary business like Lords Amendments sticking out at strange angles from the main body of the site. It combines all the complications of Parliamentary business with a hugely complicated interface that happens to be complicated in a completely different way.
To understand why the site is structured this way you need to look under the surface at the data and CMS that power it. There's less to say about the CMS side; it's a standard issue editing, structuring, storing, publishing application with all the usual drawbacks that entails.
On to the data:
Parliament has two triple stores:
- Search and Indexing is used to subject index Parliamentary material for search. The public output can be seen at http://search-material.parliament.uk/
- data.parliament is the current data platform for Parliament. It has two instances. One internal to feed the current website. One external used to publish open data
We started by attempting to investigate the general shape of the data in each store.
But if you attempt to dereference http://data.parliament.uk/schema/parl# you get a 404. The lack of any published ontology means the RDF we make available is really more tag soup than actual RDF.
We initially assumed this was a publishing error and that someone must have written a spec for the Parliament namespace. Some hunting around project documentation eventually turned up a spreadsheet with a column for predicate, a column for domain and a column for range.
Unfortunately the value for every range that wasn't a literal was defined as 'URI'. In effect the range of each property could be anything, which wasn't very helpful when attempting to determine the shape of the data.
We decided to take a different approach and attempted to work backwards from the instance data to a model.
We started with two queries: one to get a list of all classes and one to get a list of all predicates with all possible domains and ranges for each. The results are published on Heroku. Terms reused from common vocabularies are highlighted; terms in the parl namespace aren't.
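As a rough illustration of what those two queries ask for, here's a minimal sketch in plain Python over a handful of made-up triples. The real queries ran as SPARQL against the triple stores; the sample data, the `ex:` identifiers and the function names below are all assumptions for illustration, not what we actually ran.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical instance data as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:bill1", RDF_TYPE, "parl:Bill"),
    ("ex:member1", RDF_TYPE, "foaf:Person"),
    ("ex:bill1", "parl:sponsor", "ex:member1"),
    ("ex:bill1", "rdfs:label", "Example Bill"),
    ("ex:member1", "rdfs:label", "Example Member"),
]


def types_by_subject(triples):
    """Map each typed resource to the set of classes it is declared as."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
    return types


def list_classes(triples):
    """Query 1: every class that appears as the object of rdf:type."""
    return sorted({o for _, p, o in triples if p == RDF_TYPE})


def predicate_usage(triples):
    """Query 2: for each predicate, the observed domain and range classes."""
    types = types_by_subject(triples)
    usage = defaultdict(lambda: {"domain": set(), "range": set()})
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        usage[p]["domain"] |= types.get(s, {"(untyped)"})
        # Objects with no rdf:type are lumped together here; a real query
        # would distinguish literals from untyped URIs.
        usage[p]["range"] |= types.get(o, {"(literal or untyped)"})
    return dict(usage)
```

Note that the domains and ranges here are observed from usage, not declared anywhere, which is exactly why the results are a sketch of the data rather than a model of it.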
By comparing the common domains and ranges of predicates we managed to cluster some of the classes and produce a diagram depicting the general shape of each triple store:
Both diagrams should be taken as a rough sketch rather than a formal model as they're derived by working back from the instance data and both triple stores have probably gained more data sets since the queries were run.
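The clustering was done by eye, but the idea of grouping classes that share predicates can be sketched in code. What follows is one possible way to do it, assuming a Jaccard overlap score and an arbitrary threshold of 0.5; the class and predicate names are invented sample data.

```python
from itertools import combinations

# Hypothetical input: class -> predicates observed on its instances.
CLASS_PREDICATES = {
    "parl:Bill": {"dc:title", "parl:session", "parl:sponsor"},
    "parl:Amendment": {"dc:title", "parl:session", "parl:bill"},
    "foaf:Person": {"foaf:name", "parl:party"},
    "schema:Person": {"foaf:name", "parl:party", "parl:house"},
}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_classes(class_predicates, threshold=0.5):
    """Group classes whose predicate sets overlap by at least `threshold`."""
    parent = {c: c for c in class_predicates}

    def find(c):
        # Follow parent links to the cluster's representative.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in combinations(class_predicates, 2):
        if jaccard(class_predicates[a], class_predicates[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for c in class_predicates:
        clusters.setdefault(find(c), set()).add(c)
    return sorted(sorted(members) for members in clusters.values())
```

With the sample data the two Person classes land in one cluster and Bill and Amendment in another, which is the sort of grouping the diagrams try to show.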
Why we did it and what we found
Our goal had been to work back from instance data to general shape to formal (OWL) ontology but given the state of the data the last step proved impossible. Some things we found:
- Properties and classes from different common vocabularies are used to describe the same thing, so we have a mixture of foaf:Person and schema.org:Person. Name labels use a mixture of rdfs:label, skos:prefLabel, ld-vocab:label etc.
- Properties and classes are used with different semantic meaning, so http://data.parliament.uk/schema/parl#Bill is used to represent both the Bill document and the passage of the Bill.
- There's a general lack of abstraction, so classes aren't grouped into higher-level classes. There's no superclass of Document, so the "ontology" we managed to create has a parl:containsStatistics predicate with 34 different classes in the domain when it should probably have one.
- Some URIs are used with no typing at all. In the case of dc:creator, the URI that the property points to has not been associated with a type. It's an untyped URI and we couldn't work out what type of thing it should be.
- Some URIs (such as parl:Session) are used as both a class and a property, which would be potentially dangerous if loaded into a triple store with inference turned on.
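Two of the problems above (URIs used as both class and property, and URIs with no type) are mechanical enough to check for. Here's a minimal sketch over made-up triples; the `ex:` identifiers and function names are assumptions for illustration, and the second check assumes the predicate in question points at URIs rather than literals.

```python
RDF_TYPE = "rdf:type"

# Hypothetical instance data reproducing two of the problems listed above.
TRIPLES = [
    ("ex:bill1", RDF_TYPE, "parl:Bill"),
    ("ex:session1", RDF_TYPE, "parl:Session"),
    ("ex:bill1", "parl:Session", "ex:session1"),  # parl:Session as a property
    ("ex:bill1", "dc:creator", "ex:someone"),     # ex:someone is never typed
]


def used_as_class_and_property(triples):
    """URIs that appear both as an rdf:type object and in predicate position."""
    classes = {o for _, p, o in triples if p == RDF_TYPE}
    properties = {p for _, p, _ in triples}
    return sorted(classes & properties)


def untyped_objects(triples, predicate):
    """Objects of `predicate` that never appear as the subject of rdf:type."""
    typed = {s for s, p, _ in triples if p == RDF_TYPE}
    return sorted({o for _, p, o in triples if p == predicate and o not in typed})
```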
Strings and things
There are wider challenges with the data that go back to source systems (some of which are over 10 years old) and information flows in the business. Boundary objects and inter-departmental processes are sometimes unclear and knowledge about what happened before and what happens next can be missing.
As a result three problems seem to emerge:
- Business applications are often commissioned by individual departments and offices and, because to date it hasn't been anyone's job to figure out how all these applications and the data they produce should fit together, some of these departmental boundaries get baked into code and into data models.
- The primary purpose of the business applications is to help with the day-to-day functions of Parliament. Data gets captured at boundaries of departmental responsibility but there's enough human context around the data at the time of transfer that meaning travels in people-space alongside data. Once the data makes its way to data.parliament the human context is lost and the value of the information degrades.
- From the high-level view of data.parliament it's pretty obvious that the individual datasets are not well linked. This is partly a result of the points above but also because lots of datasets should share reference data. It hasn't, however, been anyone's job to create and maintain this reference data, and managing reference data adds development costs to business applications. So instead of a drop-down to select from a controlled list of government departments, you might see a free-text entry box, which results in data like 'Department of Health', 'Health', 'health', 'doh', 'DoH' etc.
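To make the reference data point concrete, here's a minimal sketch of what resolving those free-text values against a controlled list might look like. The register contents, the identifier and the function names are all hypothetical; a real register would be maintained by someone whose job it is, which is rather the point.

```python
# Hypothetical register: canonical identifier -> known labels and abbreviations,
# stored lower-cased so lookups can ignore case.
DEPARTMENT_REGISTER = {
    "department-of-health": {"department of health", "health", "doh"},
}


def normalise(text):
    """Lower-case and collapse whitespace so trivially different entries match."""
    return " ".join(text.lower().split())


def resolve_department(free_text, register=DEPARTMENT_REGISTER):
    """Return the canonical identifier for a free-text entry, or None if unknown."""
    key = normalise(free_text)
    for identifier, labels in register.items():
        if key in labels:
            return identifier
    return None
```

Anything that comes back as None still needs a human to decide what it means, so this only patches the symptom. A drop-down backed by the same register avoids creating the mess in the first place.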
The Government Digital Service (GDS) are looking to solve similar challenges with their Registers project.
We've started a conversation with GDS and the National Audit Office about where we might have shared interests. A register of government departments is an obvious example.
IDMS and indexing. And stapling
The Parliamentary website and the open data portal are the tip of the iceberg - the business data under the waterline isn't properly joined up because parts of the business aren't properly joined up and the interfaces between government and Parliament are definitely not joined up.
Stating the obvious, if the data doesn't join, the website won't link (blink tags needed here) which makes fixing the information architecture of the website decidedly non-trivial.
Redesigning the website isn't a surface polishing job. There's a chain of dependencies from the website to the data platform to the business apps to the business.
Luckily, elsewhere in Parliament there's also the Indexing and Data Management Section (IDMS); a team of librarians responsible for cataloguing Parliamentary material for search. They deal with two sets of data: the stuff originating from Parliamentary business (from Hansard to Library Briefings) and a taxonomy of subject headings. Their main job is maintaining the taxonomy and indexing Parliamentary material against that taxonomy.
Where equivalences exist between business things and taxonomic things, mappings aren't maintained (except in the case of members) so a bill in the business space has no mapping to the same bill in the taxonomy space:
In order to power search IDMS need to add semantic links between business objects but they don't have access to make and edit links on the business data side of the graph. So instead they add semantic links between objects in the taxonomy:
All this results in two graphs of data: 'things' on the business side and SKOS concepts on the taxonomy side. Business semantics are expressed both from business object to business object and from SKOS concept to SKOS concept, with only weak (subject indexing) links between the two graphs. The whole thing is incredibly difficult to query: there are non-taxonomic links between taxonomic objects but no sense of object class on the taxonomy side, so there are no boundary objects and no idea where to stop querying.
Part of our work is to give IDMS the right tools to do their job: so something that still allows them to subject index Parliamentary material (ideally without having to know the entire contents of the taxonomy); and something that allows them to edit semantic links between objects on the business side of the graph. So far we've been referring to the last bit as 'stapling'. It probably needs a better label but first labels usually stick:
Even given the right tools to index and 'staple' there are bits of business data that can't be patched up after the fact. Once the human context surrounding data transfer in the business is lost it's almost impossible to recreate. IDMS can control reference data and add links between fairly static things but adding links between more dynamic data is probably too hard to ask. The other half of the problem still needs to be solved by reconfiguring business applications and (where possible) reconfiguring bits of the business process.
What is the shape of this thing?
From about 10,000 feet we think it looks like this:
Data (and content) flow from business apps to the data platform (where they're augmented by IDMS indexing and stapling) and from there to the website.
But there are dependencies all the way down and an escalating gradient of difficulty. Making a website is easy, commodity stuff; making a data platform is a little harder and making the right tools for IDMS is harder still. And the further back into the business you need to make changes the harder it gets until you reach the event horizon of impossible. Especially when some of the constraints are constitutional:
From about 5,000 feet we think (so far) it looks something like this diagram [PDF].
We didn't want to repeat past mistakes so before digging into the models we wanted to understand how the data should fit together. This meant trying to understand how the various bits of the business fit together. Working with Silver Oliver from datalanguage we ran a series of workshops with various Parliamentary offices where we got them to talk about and sketch out their bits of the world.
We started with a fairly basic model of members (and people in general) [PDF], worked with the Journal Office on secondary legislation [PDF], and the Commons and Lords Public Bill Offices on passage of public bills [PDF]. We still need to look into hybrid bills and Parliamentary ping-pong but that should happen soon. The whole thing has been combined into an overarching domain model of Parliamentary business [PDF] (which is still missing some of the bill passage work as I type). In that model:
- The top-left bubble describes people (members, committees etc).
- The top-right bubble describes the things Parliament is operating on and the outputs of processes.
- The bottom-left bubble describes places things happen in.
- The bottom-right bubble describes time periods (from Parliaments to sessions to days).
- The middle bubble describes events by people, in a place, at a time, with outputs.
A couple of people have already pointed out the overarching map looks not unlike a crazily complicated version of Yves Raimond's Event Ontology. But then everything does eventually...
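Stripped right down, the five bubbles of the overarching model could be sketched as a few Python dataclasses. This is only an illustration of the overall shape; the class names, field names and example values are assumptions and none of them come from the actual domain model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class Agent:
    """People bubble: members, committees etc."""
    name: str


@dataclass(frozen=True)
class Place:
    """Places bubble: where things happen."""
    name: str


@dataclass(frozen=True)
class TimePeriod:
    """Time bubble: Parliaments, sessions, days."""
    label: str
    start: date
    end: date


@dataclass(frozen=True)
class Thing:
    """Things bubble: what Parliament operates on and the outputs of processes."""
    title: str


@dataclass
class Event:
    """Middle bubble: an event by people, in a place, at a time, with outputs."""
    label: str
    agents: List[Agent]
    place: Place
    period: TimePeriod
    outputs: List[Thing] = field(default_factory=list)


# Hypothetical example values, purely to show how the pieces join up.
example_event = Event(
    label="Example debate",
    agents=[Agent("Example Member")],
    place=Place("Example Chamber"),
    period=TimePeriod("Example sitting day", date(2016, 1, 4), date(2016, 1, 4)),
    outputs=[Thing("Example report")],
)
```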
At this stage we're only really interested in having a general map of the territory; it doesn't matter too much if the map is missing a Madagascar but we really don't want to be missing a South America. So if you spot something that's not there that you think should be, please do get in touch.
As we start to develop software and services around particular areas we'll need to work with the business to zoom into the map and identify some of the trickier details and edge cases. But starting from a rough map of everything will help us to identify where things should be linked up so we don't accidentally create more data (and website and business app) islands.
The website is the API
In designing the data models we're also thinking about how Parliamentary data is made accessible to external users. Rather than provide a separate API endpoint like data.parliament.uk we've decided that, wherever possible, the data views will be delivered alongside the HTML. So adding .csv will return a comma-separated file, .json will return a JSON-LD representation, .xml will return an XML representation, .ics will return subscribable calendar data etc. We're also planning to use content negotiation to return appropriate representations, and all available representations will be linked to from rel-alternate links in the HTML.
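None of this is built yet, but the behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the extension-to-media-type mapping, the example path, the function names and the rule that an explicit extension beats the Accept header are all assumptions, not decisions we've made.

```python
import posixpath

# Extensions named in the text mapped to media types (the mapping is an assumption).
EXTENSION_TYPES = {
    ".csv": "text/csv",
    ".json": "application/ld+json",
    ".xml": "application/xml",
    ".ics": "text/calendar",
}
DEFAULT_TYPE = "text/html"


def parse_accept(header):
    """Return media types from an Accept header, highest q-value first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def choose_representation(path, accept=""):
    """Pick a media type: an explicit extension wins, then the Accept header."""
    base, extension = posixpath.splitext(path)
    if extension in EXTENSION_TYPES:
        return base, EXTENSION_TYPES[extension]
    available = set(EXTENSION_TYPES.values()) | {DEFAULT_TYPE}
    for media_type in parse_accept(accept):
        if media_type in available:
            return path, media_type
        if media_type == "*/*":
            return path, DEFAULT_TYPE
    # Nothing acceptable matched; this sketch falls back to HTML rather than
    # answering 406 Not Acceptable.
    return path, DEFAULT_TYPE


def alternate_links(path):
    """rel-alternate links for the HTML page, one per available representation."""
    return [
        f'<link rel="alternate" type="{media_type}" href="{path}{extension}">'
        for extension, media_type in EXTENSION_TYPES.items()
    ]
```

So, for a hypothetical path, `choose_representation("/bills/1.csv")` picks CSV from the extension alone, while `choose_representation("/bills/1", "application/ld+json")` gets to the JSON-LD through content negotiation.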
Your Parliament is not a snowflake
We're also interested in how UK Parliamentary data can be made interoperable with the web. Although we'll be writing our own internal ontology (to isolate our triple store from external changes we can't control) we're keen to map to common vocabularies for publishing. With search in mind we're looking to both use and expand schema.org where possible. For now schema.org doesn't include many of the things we'd like to describe. Given we're only one Parliament on one small island the chances of the schema.org consortium accepting our changes are minimal, so we're looking to find common data model ground with other parliaments around the world and work together to push commonly found data patterns into schema.org.
For that reason we're planning to develop our data models in public. Following a conversation with Dan Brickley from Google / schema.org we decided to use the W3C Open Government Community Group as a place to collaborate with other legislatures, with schema.org and with other groups interested in parliamentary data models. If you're interested in those conversations please join the group and the mailing list.
Complications (and complexity)
It's fairly obvious that Parliament is pretty complicated. Even at this stage in the exploration the domain model is the largest and most interwingled I've ever worked on, but there are also chunks of activity in Parliament that are deliberately designed to be complex in the adaptive and evolutionary Cynefin sense.
If you draw any process map sooner or later you're guaranteed to find a bit that disappears off-stage into the usual channels. Parliamentary processes fade out and party politics and whips take over. For obvious reasons all of this is off our map.
There is also complexity in the precedent-based rather than rule-set-based nature of Parliamentary procedure. We think it's possible to document what's happened in the past (including events in the past that scheduled an event in the future) but given a certain state of a thing it's not usually possible to predict what will happen next. Large chunks of Parliament are not deterministic and don't really lend themselves to computers. I guess if the procedural index to Votes and Proceedings were properly digitised it might be possible to apply statistics-based machine learning but probably not in this working lifetime.
The foreseeable future
So for the foreseeable future it looks like we'll be moving from a data.parliament "model" to something shaped roughly like this. Which is the most complicated model I've ever worked on (this thing is not iPlayer). We need to remain naive enough to ask the questions we need to ask while not accidentally concreting naivety into data models. We need to design the model to be stable enough to build on top of but flexible enough to cope with changes in Parliament in rapidly changing times. We need to be agile enough to allow for services to be built around us as we work. We need to balance the user needs of external website users with the reporting needs of the business and the capability of the business (and IDMS) to populate that model. In addition to developing new tools for IDMS, we need to design the information architecture of the website and the business apps while attempting to agree on common models with other Parliaments. We need to tread a tightrope between all the things that are complicated without accidentally trying to model the things that are complex. We need to work with the business to see what processes they want to change and are capable of changing while identifying all the things we can't do because they're impossible because constitution because democracy.