National statistics offices (NSOs) play a crucial role in the provision of accurate and timely data. This data is used by policymakers, businesses, researchers, and the public to inform policies, investments, and actions for sustainable development and effective data governance.
In response to the increasing demand for accurate and timely data to support various statistical endeavors, Colombia’s NSO (DANE, by its acronym in Spanish) released a framework for the production of Experimental Statistics in 2021. Since then, as part of the Data for Now initiative, DANE has been developing various projects that combine the use of non-traditional methods and sources with official data for statistical production. In 2022, the Directorate in charge of Data Collection within DANE proposed using machine learning (ML) and natural language processing (NLP) to capture and process high-quality data from audio recordings more efficiently.
Recognizing the importance of hunger, food security, and nutrition indicators in improving agricultural and food systems, DANE opted to integrate ML models and NLP into their operations, with a focus on the daily compilation of food prices and quantities. Through a collaborative effort with the Global Partnership, DANE engaged in a capacity-building program to create a semi-automated mechanism for collecting and analyzing data, within the Information System for Prices and Supply of the Agricultural Sector (SIPSA).
Aligning with the SDGs
DANE’s efforts align with specific targets of the Sustainable Development Goals (SDGs), particularly SDG 2 – Zero Hunger. The most recent data reveals that in 2022, 28.1% of Colombian households experienced moderate food insecurity, with 4.9% facing severe food insecurity. When examining these prevalence rates at the subnational level, it becomes apparent that these figures can be as high as 59.7% for moderate food insecurity and 17.5% for severe food insecurity. This underlines the need for increasing efforts to diminish undernourishment within the country, emphasizing the crucial role of data related to this phenomenon in shaping public policies.
When looking for potential causes of this problem, the Food and Agriculture Organization of the United Nations (FAO) links increases in market food prices and the unaffordability of healthy diets with higher levels of moderate or severe food insecurity. Thus, improving the collection of data on the prices and quantities of food offered every day in the market allows for more precise information to understand potential causes of undernourishment in Colombia.
Transformative impact: natural language processing models enhance data collection and SDG analysis
Through collaboration and training, DANE has successfully implemented NLP models to improve data collection and SDG analysis within the institution, paving the way for further improvements in institutional effectiveness. DANE's team has gained expertise in NLP models and their applications, enabling them to identify opportunities for improving data collection processes, as well as the know-how required to develop a prototype for measuring interlinkages between SDG indicators.
For the Information System for Prices and Supply of the Agricultural Sector (SIPSA), the envisioned model promises to revolutionize data collection by eliminating the need for manual transcription of data from paper to digital tools. This automation will minimize errors and expedite the preparation of daily market technical bulletins.
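To make the idea of automated capture concrete, the sketch below shows one illustrative post-transcription step: turning a transcribed market report into a structured price record. This is not DANE's actual pipeline; the phrase template, the `parse_report` helper, and the assumption that spoken numbers have already been converted to digits by the speech-to-text stage are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical pattern for phrases such as "papa criolla 2500 pesos por kilo".
# Assumes the transcription step has already rendered spoken numbers as digits.
PRICE_PATTERN = re.compile(
    r"(?P<product>[a-záéíóúñ ]+?)\s+"    # product name (lowercase Spanish text)
    r"(?P<price>\d+)\s*pesos\s+por\s+"   # integer price in Colombian pesos
    r"(?P<unit>\w+)"                     # unit of sale (kilo, bulto, ...)
)

def parse_report(text: str) -> Optional[dict]:
    """Extract product, price, and unit from one transcribed phrase."""
    match = PRICE_PATTERN.search(text.lower().strip())
    if match is None:
        return None  # phrase does not fit the assumed template
    return {
        "product": match.group("product").strip(),
        "price": int(match.group("price")),
        "unit": match.group("unit"),
    }

record = parse_report("Papa criolla 2500 pesos por kilo")
```

A production system would of course need far more robust handling of dialectal variation, ASR errors, and ambiguous units, but the structured output sketched here is what would feed the daily market bulletins directly, without manual transcription.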
"Getting to know different tools in the Machine Learning Workshop offered within the framework of the D4N initiative allowed the SIPSA working team to propose a transcription strategy for the collected audio information and, then, a pilot project in collaboration with the Data Management team. The project has made significant progress in optimizing the digitization of SIPSA's information while maintaining the required conditions of quality and timeliness for such information."
– Paola Galvis, Thematic expert, SIPSA, DANE
Enhancing data through training, collaboration, and innovation
In this framework, DANE and the Global Partnership engaged the services of machine learning (ML) consultants to provide comprehensive training to DANE's team and formulate a robust implementation roadmap. This initiative not only empowered DANE’s team to improve their data analysis procedures but also enabled real-time enhancements to their statistical operations, advancing their capacity for evidence-based decision-making. In the first training, during the first half of 2023, 14 participants from the NSO gained skills in designing and implementing tools for reading and storing audio files to be used in natural language processing models.
The collaboration comprised three phases: ML training, co-creation of ML models, and the design and implementation of a roadmap. One of the most important lessons learned during the training was the need to adapt the content to the specific needs of DANE’s team and their projects. This included focusing on examples and use cases that were relevant to the context of data analysis in the public policy area. The training highlighted the importance of fostering a collaborative learning environment among the participants, one suitable for discussing ideas and solutions to specific problems in their areas of work.
The training phase
DANE's team learned about automated text analysis, image recognition, and programming in Python and the Orange software. The goal was to use this knowledge to improve the capture, processing, and quality control of data in multiple formats. Participants were able to understand the theoretical and practical foundations of ML, as well as learn to use specific tools and libraries for text, audio, and image analysis.
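As a minimal flavor of the text-analysis exercises described above, the sketch below tokenizes free-text notes and counts term frequencies, a typical first step toward bag-of-words features for ML models. The sample sentences and the `term_frequencies` helper are illustrative, not drawn from DANE's training material.

```python
import re
from collections import Counter

def term_frequencies(documents):
    """Return overall word counts across a list of text documents."""
    counts = Counter()
    for doc in documents:
        # Keep only alphabetic tokens, including accented Spanish characters.
        tokens = re.findall(r"[a-záéíóúñ]+", doc.lower())
        counts.update(tokens)
    return counts

# Illustrative field notes, loosely in the spirit of market price reports.
notes = [
    "Precio del tomate estable en el mercado",
    "El precio de la papa subió en el mercado mayorista",
]
freqs = term_frequencies(notes)
```

In practice such counts would be filtered for stop words and fed into a vectorizer, but the core idea of mapping raw text to countable features is the same.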
The co-creation phase
The co-creation phase gave the participants an opportunity to practice what they learned by working on real-world data sets tied to current challenges in their work. During this phase, DANE and the Global Partnership worked together to develop ML models that were tailored to DANE's specific needs. This involved joint scripting, technical sessions, and guidance in configuring the models. Some key elements for this phase were effective communication, adaptability to DANE's framework, in-person visits to the data collection centers, and collaboration. Overcoming challenges and fostering a learning environment proved instrumental to the successful execution of this initiative.
"The co-creation process represents a huge accomplishment in enhancing the technical capacities in DANE. It enables us to bring together our vast experience in data quality assurance with the use of emerging data sources and techniques. While additional capacities are still required, this is a substantial step forward."
– Andrés Arévalo, Thematic expert, SDG Unit, DANE
The implementation and design of a roadmap phase
To ensure the successful implementation of ML models in DANE, the team in charge of the SIPSA data collection process developed a detailed roadmap with the support of an ML consultant. The first step on the roadmap was to identify areas where ML models could be extended to other DANE research applications. DANE technicians analyzed the impact of the models and identified opportunities to adapt and improve them, based on the needs of other projects and areas of work. Once the lines of work had been drawn, the DANE team developed ML tools that were designed to be functional, scalable, and integrated with existing systems.
Currently, DANE is working with a second consultant, an expert in ML models, on the implementation of the roadmap, under a co-creation scheme with permanent joint work between DANE’s team and the expert. So far, an exhaustive exploration of various audio preprocessing techniques has been carried out with the aim of improving the quality of the information contained in the data set. Additionally, an extensive preprocessing phase has been conducted to maximize the utility and consistency of the data set, thus providing a solid foundation for subsequent analyses. DANE focused on good practices to guarantee the quality and long-term maintenance of the implemented solutions.
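Two preprocessing steps of the kind mentioned above are peak normalization and silence trimming. The sketch below illustrates both on a toy waveform; a real pipeline would operate on sampled audio loaded with a library such as librosa or soundfile, and these particular functions and thresholds are assumptions for illustration, not DANE's implementation.

```python
def normalize(samples, target_peak=1.0):
    """Scale the waveform so its loudest sample reaches target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # all-silence input: nothing to scale
    return [s * target_peak / peak for s in samples]

def trim_silence(samples, threshold=0.02):
    """Drop leading and trailing samples below the amplitude threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

# Toy waveform: quiet edges around a louder middle section.
raw = [0.0, 0.01, 0.2, -0.4, 0.1, 0.005, 0.0]
clean = normalize(trim_silence(raw))
```

Steps like these help ensure that downstream speech-recognition models see consistent input levels and are not wasting computation on empty audio.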
Next steps and future implications
The successful implementation of the project in DANE is expected to establish a strong business case for machine learning models by identifying specific advantages that can empower future innovation, such as improved data quality, enhanced timeliness, streamlined production processes, and innovation in data analysis. It has also identified the main barriers to mainstream use, such as the need for adequate infrastructure, managing organizational change, ensuring continuous learning, and maintaining data privacy and ethics.
Looking ahead, there are several key milestones for refining the natural language processing (NLP) algorithm. These steps include transitioning from an NLP algorithm prototype to its final version, piloting the model, and scaling the solution to various projects and statistical operations. For this, DANE continues to work with the expert ML consultant. The success and results of this project will be key to scaling the solution.
To ensure the success of machine learning models, high-quality data is essential. Poor-quality data can lead to severe degradation in the results and have further consequences when decisions are based on those outputs. Equally important is the use of pre-trained language models (PLMs), which learn universal representations from large corpora in a self-supervised manner. These pre-trained models and the representations they learn can benefit a range of downstream NLP tasks.
In light of these achievements and the ongoing commitment to advancing data processes using machine learning, the Global Partnership will continue to support countries on their journeys toward further enhancements and innovation in data analysis.
Revisions from Víctor Andrés Arévalo Cabra, Professional at the SDGs Unit in DANE; editorial reviews from Stephanie Welstead, Communications Consultant, Global Partnership.