Why is collecting and analysing data about public procurement so damned difficult? Data scientists explain some common problems

Written by Elizabeth David-Barrett and originally published on the ACE-Global Integrity blog.

Open data is often lauded as a magic pill for anti-corruption: reveal what’s going on, inform the public, and, presto, government will become more accountable. Oh, and big data just means bigger gains, right?

Not quite. We have written elsewhere about the institutional and political challenges that can hinder the transformation of transparency into accountability. But even the very first stage, collecting the data, is much harder than it seems. Building indicators that can be used for analysis requires a whole series of validation steps.

In our 2016–17 ACE project, we focused on collecting data from development aid donors and lenders rather than from national governments, assuming that this would be a more reliable source since such agencies face a lot of pressure to be transparent, and also have the capacity to collect data. Many national governments came a little later to the open government agenda, and often lower-income countries lack the necessary data infrastructure.

However, even for aid data, our initial efforts to collect data from a range of agencies encountered problems. USAID doesn’t collect data on the contracts it funds when the money is spent outside the United States by aid recipients. We were able to collect data from the World Bank, the Inter-American Development Bank, and EuropeAid, but even then, assembling a full dataset required accessing numerous sources and a long process of cleaning and checking the data.

Where national procurement data is concerned, we have often found that governments make big claims that they are fully transparent and publish everything, but when we come to collect the data, we encounter a range of problems: large amounts of missing data; lack of consistency in how data is published from one year to another; or failure to provide essential information that is necessary for meaningful analysis, such as organisational IDs.
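To make the missing-data problem concrete, here is a minimal sketch of one check that might be run on a freshly collected dataset: the share of records in each year that lack a given field. The records and field names (`buyer_id`, `supplier_id`, `value`) are invented for illustration and are not taken from any real portal or from our own pipeline.

```python
from collections import defaultdict

# Hypothetical contract records; field names and values are illustrative only.
records = [
    {"year": 2015, "buyer_id": "B1", "supplier_id": "S1", "value": 1000.0},
    {"year": 2015, "buyer_id": None, "supplier_id": "S2", "value": 250.0},
    {"year": 2016, "buyer_id": "B2", "supplier_id": "", "value": None},
    {"year": 2016, "buyer_id": "B1", "supplier_id": None, "value": 400.0},
]


def missing_share_by_year(rows, fields):
    """Return {year: {field: share of rows where the field is None or empty}}."""
    missing = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for row in rows:
        year = row["year"]
        totals[year] += 1
        for field in fields:
            # Treat both None and the empty string as "not published".
            if row.get(field) in (None, ""):
                missing[year][field] += 1
    return {
        year: {field: missing[year][field] / totals[year] for field in fields}
        for year in totals
    }


shares = missing_share_by_year(records, ["buyer_id", "supplier_id", "value"])
for year in sorted(shares):
    print(year, shares[year])
```

A sharp jump in the missing share from one year to the next, as with `supplier_id` in this toy data, is often a sign that the publication format changed rather than that the underlying procurement did.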

For example, if we have all the calls for tenders but cannot easily match them with contract awards, this means we cannot construct key red flags. If we lack codes for suppliers and buyers, we cannot build indicators of supplier and buyer risk.
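The matching problem can be sketched in a few lines. The example below assumes, hypothetically, that tenders and awards share a `tender_id` key and that each tender has at most one award; real datasets frequently lack such a key, which is exactly the difficulty. The single-bid check at the end is only an illustrative flag that needs information from both sides of the match, not a description of how our indicators are built.

```python
# Hypothetical tender notices and award notices; all names are illustrative.
tenders = [
    {"tender_id": "T1", "buyer_id": "B1"},
    {"tender_id": "T2", "buyer_id": "B1"},
    {"tender_id": "T3", "buyer_id": "B2"},
]
awards = [
    {"tender_id": "T1", "supplier_id": "S1", "bids_received": 1},
    {"tender_id": "T3", "supplier_id": "S2", "bids_received": 4},
    {"tender_id": "T9", "supplier_id": "S3", "bids_received": 2},
]


def link_tenders_to_awards(tenders, awards):
    """Join tenders to awards on tender_id and report what fails to match."""
    # Assumes one award per tender; a later duplicate would overwrite an earlier one.
    awards_by_tender = {a["tender_id"]: a for a in awards}
    tender_ids = {t["tender_id"] for t in tenders}

    matched = [
        {**t, **awards_by_tender[t["tender_id"]]}
        for t in tenders
        if t["tender_id"] in awards_by_tender
    ]
    unmatched_tenders = [
        t["tender_id"] for t in tenders if t["tender_id"] not in awards_by_tender
    ]
    orphan_awards = [
        a["tender_id"] for a in awards if a["tender_id"] not in tender_ids
    ]
    return matched, unmatched_tenders, orphan_awards


matched, unmatched_tenders, orphan_awards = link_tenders_to_awards(tenders, awards)

# Illustrative flag only: it can be computed solely for matched records.
single_bid = [m["tender_id"] for m in matched if m["bids_received"] == 1]

print("match rate:", len(matched) / len(tenders))
print("tenders without an award:", unmatched_tenders)
print("awards without a tender:", orphan_awards)
print("single-bid contracts:", single_bid)
```

Every tender or award that falls into the unmatched lists drops out of any indicator that needs both sides, so the match rate is worth reporting alongside the indicators themselves.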

In our new Red Flags Explainer, we draw on our experience of building and analysing datasets of government procurement over the past ten years to answer some Frequently Asked Questions about our work. Liz David-Barrett, Mihaly Fazekas, Agnes Czibik, Bence Toth, and Isabelle Adam explain some of the challenges and what can be done to fix or work around them.