Data must adhere to agreed standards of quality to be usable. Data quality standards refer to the level of accuracy, currency, precision, and reliability of the data. Each data sharing initiative sets its own quality standards, defining what is and is not acceptable based on the objectives it sets out to achieve. Clearly, some use cases (e.g., in healthcare or humanitarian relief) require higher data quality standards than others. As such, no universal definition of “good enough” data can be established.
Ensuring adherence to agreed-upon data quality standards consumes time and resources. However, setting up a system to accomplish this saves resources at later stages of a project: errors and bias can be spotted before the data is put to use, avoiding costly correction efforts once the initiative or platform is established.
Adopting appropriate quality frameworks at the data collection phase and establishing transparent approaches to limit and mitigate bias at the data analysis phase are useful steps for increasing data quality.
Defining appropriate data quality approaches
Approaches to data quality and veracity vary by initiative. For example, data exchange platforms like the Humanitarian Data Exchange (HDX) do not check the quality of the data they receive from partners; HDX’s data architecture is not geared toward cleaning submitted data. The initiative instead adopts a “buyer beware” approach, leaving users to evaluate the veracity of the data themselves.
Other initiatives work extensively with data partners to ensure the quality of the data that is shared. Organizers of Global Fishing Watch (GFW), for instance, can spend months conducting quality checks of the data received from governments because each country reports its data differently. The GFW team standardizes the data format and checks for errors. Discrepancies such as missing data fields or wrong time zones are common, and the GFW team works with governments to fix them. Only once the GFW team is convinced of the data’s quality does it proceed to the analysis stage.
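To illustrate the kind of checks described above, a minimal sketch in Python of required-field validation and time-zone normalization. The field names and rules here are assumptions for illustration only; the source does not describe GFW’s actual schema or tooling.

```python
from datetime import datetime, timezone

# Hypothetical schema for illustration -- not GFW's actual data format.
REQUIRED_FIELDS = {"vessel_id", "timestamp", "lat", "lon"}

def check_record(record):
    """Return a list of quality issues found in a single report record."""
    issues = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")
    ts = record.get("timestamp")
    if ts is not None:
        try:
            parsed = datetime.fromisoformat(ts)
            if parsed.tzinfo is None:
                issues.append("timestamp lacks a time zone")
        except ValueError:
            issues.append(f"unparseable timestamp: {ts!r}")
    return issues

def normalize_timestamp(ts):
    """Convert a time-zone-aware ISO 8601 timestamp to UTC."""
    return datetime.fromisoformat(ts).astimezone(timezone.utc).isoformat()

# Example: one clean record, one with a missing field and a naive timestamp.
records = [
    {"vessel_id": "A1", "timestamp": "2021-06-01T12:00:00+02:00",
     "lat": 10.0, "lon": 20.0},
    {"vessel_id": "B2", "timestamp": "2021-06-01T12:00:00", "lat": 10.0},
]
for r in records:
    print(r.get("vessel_id"), check_record(r))
```

In practice, flagged discrepancies would be reported back to the data provider rather than silently corrected, mirroring the back-and-forth with governments described above.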
Another approach is to put the onus of data cleaning and quality control on the data suppliers. This is generally discussed at the beginning of the initiative, and the data partners agree to the initiative’s data format requirements. For example, INSPIRE requires partners to complete the necessary data cleaning, quality checks, and quality assurance measures before sharing.
Transparency for bias mitigation and limitation
Haiti witnessed widespread violence in April 2022 due to clashes between two gangs. The conflict displaced approximately 35,000 people from the affected area. Flowminder is a nonprofit foundation that specializes in analyzing big data, such as call detail records, satellite imagery, and household surveys, to address development problems. To provide more evidence and detail about the displacement, it formed a data sharing partnership with a telecom operator holding 74% of the national market share.
The objective of this partnership was to generate evidence to better understand the large-scale movements of the displaced population and to support appropriate policy responses. In its final report, Flowminder provided an extensive disclaimer about the limitations of mobile network operator data, which is not statistically representative because access to phones is not universal. The report cautions readers to weigh these limitations when drawing conclusions from its findings.