What drew you to taking part in this fellowship program and how have you found the overall experience?
With over ten years of experience in lecturing at various institutions between Ethiopia and South Africa, I am passionate about educating problem solvers to address issues affecting the world today. I am also an avid programmer. The opportunity to participate in the Global Partnership-AIMS Fellowship gave me the added advantage of applying my programming skills to a real-world problem.
Before the project began, I was able to virtually meet my supervisor, Mr. Molla Hunegnaw Asmare, a statistician at the African Centre for Statistics, ECA, during the onboarding workshop organized by the Global Partnership and AIMS. We explored a variety of potential projects I could work on remotely, given that I am based in Bahir Dar, while the ECA offices are in Addis Ababa. While there was a steep learning curve, I had the full support of my supervisor and his team throughout the project to ensure my specific skills and knowledge matched the assignment. Overall, it was a great learning experience.
Can you give an overview of the project and its objectives?
My project entailed automating the collection of economic data for the ECA’s Price Watch Center for Africa. The Center is Africa's first-of-its-kind online platform that contains price and exchange rate data for all African countries. The platform, which was launched in August 2020, aims to support the provision of open, reliable, and harmonized statistics and information, which are the foundations for policy prediction, planning, and adjustments. The data will be publicly available and accessible to citizens, decision-makers, businesses, and other stakeholders. Therefore, this platform has significant potential to positively impact governance and development efforts across the continent.
The platform relies on data from national statistical offices (NSOs) and national revenue authorities across the continent. The ECA has been collecting Consumer Price Index (CPI) data, a monthly statistic that tracks the change in prices of goods and services purchased by consumers, from statistical agencies across Africa to feed into the platform. This involves regularly requesting the NSOs to share their CPI data reports, or downloading the reports from NSO websites (the majority of which contain a range of such essential, detailed, and frequently updated information), and then manually transferring the data to a centralized database within the Price Watch Center for Africa. This process proved to be very time- and resource-intensive, both for the ECA and for the NSOs responding to the data requests. There was a need to automate the identification, tracking, and uploading of this data.
As such, the overall goal of my project was to develop a low-cost and efficient system for consolidating, validating, and publishing up-to-date and comparable national CPI data from NSO websites across the continent. The key activities of this project were:
- to automate price data collection from at least 10 NSOs’ websites
- to design a web application through which the harmonized price data on Africa can be disseminated.
This required me to apply several data science skills, primarily web scraping, which makes the collection and analysis process much faster. It also provides an opportunity to explore big datasets and develop methodologies that are appropriate for such data, should other big data sources (e.g. point-of-sale scanner data) be introduced into consumer price statistics.
What exactly is web scraping and how does it work?
Web scraping, also known as web harvesting or web data extraction, is the process of retrieving information from a website and saving it in a structured format such as a spreadsheet. Search engine crawlers work in a similar way: web scrapers use the same internet techniques that browsers use to visit websites and extract relevant information from them. The process can be done manually (simple copy/paste), using custom scripts, or using web scraping tools.
I wrote scripts using different Python modules for web and PDF scraping, and for cleaning up the retrieved data. This allowed me to automatically download the PDF files containing CPI data as soon as countries publish them on their websites, and to scrape that data from the PDF files into an Excel spreadsheet. The scripts cover 25 countries, surpassing the original 10-country target.
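To make the idea of detecting newly published reports concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual script: the class and function names, the sample HTML, and the nso.example.org URLs are all hypothetical, and a real NSO page would need its own handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href="..."> that points at a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_new_reports(html, base_url, already_seen):
    """Return PDF links on the page that have not been downloaded before."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return [url for url in parser.links if url not in already_seen]


# Hypothetical page content and URLs, for illustration only.
SAMPLE_HTML = """
<ul>
  <li><a href="/reports/cpi-2021-01.pdf">CPI January 2021</a></li>
  <li><a href="/reports/cpi-2021-02.pdf">CPI February 2021</a></li>
  <li><a href="/about.html">About us</a></li>
</ul>
"""
BASE_URL = "https://nso.example.org/publications/"
seen = {"https://nso.example.org/reports/cpi-2021-01.pdf"}

new_reports = find_new_reports(SAMPLE_HTML, BASE_URL, seen)
print(new_reports)
```

In a setup like this, the set of already seen URLs would be stored between runs, so that only the reports published since the last check are downloaded and passed on to the PDF extraction step.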
I also developed a Django-based web application that incorporates all these processes and enables development and maintenance of a webpage for uploading the unified data. Finally, I scripted an automated way of cleaning data presented in different formats and layouts by different countries, so that it is presented in a harmonized format on the webpage.
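One small part of such cleaning is reading numbers that countries write with different decimal conventions. The sketch below is an illustration under stated assumptions rather than the project's code: the function names and the record layout are hypothetical, and the rule that the last separator in a figure is the decimal mark is a simplification (a value such as "1,234" stays ambiguous and is read here as 1.234).

```python
def parse_index_value(text):
    """Convert a CPI figure written with either decimal convention to a float.

    Handles "1,234.5" (comma thousands), "1.234,5" or "1 234,5" (comma decimal)
    and plain "123.4". Returns None for empty cells or placeholders like "-".
    """
    cleaned = text.strip().replace("\u00a0", "").replace(" ", "")
    if cleaned.lower() in {"", "-", "..", "n/a"}:
        return None
    if cleaned.rfind(",") > cleaned.rfind("."):
        # Comma is the decimal separator; dots (if any) group thousands.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Dot is the decimal separator; commas (if any) group thousands.
        cleaned = cleaned.replace(",", "")
    return float(cleaned)


def harmonize_rows(country, period, rows):
    """Turn (label, raw_value) pairs into records that share one layout."""
    return [
        {
            "country": country,
            "period": period,
            "item": label.strip(),
            "index": parse_index_value(raw),
        }
        for label, raw in rows
    ]


# Hypothetical values, for illustration only.
records = harmonize_rows(
    "Country A", "2021-02", [(" Food ", "123,4"), ("Transport", "1,234.5"), ("Health", "-")]
)
for record in records:
    print(record)
```

Records in a single layout like this can be written to a spreadsheet or loaded into a web application's database without per-country special cases further down the pipeline.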
What ethical considerations did you make with respect to how you handled the data you were working with?
There are various ethical considerations that one should make when web scraping, and they actually make the scraping easier and more successful as well. The most important for me was ensuring that the scraped data was in the public domain and would be used in an ethical manner. In this case, it was for the purpose of creating new value from the data, not just for the sake of duplicating it. I only scrape and keep the data I need, and I always provide a User-Agent string that gives the NSO web administrator a way to contact me with questions or concerns. It is also ethical to request data at a reasonable rate, in line with the rate at which the data is being produced. Failure to do so can result in the website mistaking your requests for a malicious bot or a DDoS attack and blocking your access to the data. Some websites don’t allow web scraping at all, but so far no country’s website has blocked my attempts, and I have not received any queries or concerns. To me, this indicates that I have handled the data ethically.
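Two of these practices, identifying yourself and pacing your requests, can be sketched with the Python standard library. This is a hedged illustration, not the project's code: the User-Agent text, contact address, URL, and the `Throttle` class are hypothetical, and the request is only built, not sent, so the example runs offline.

```python
import time
from urllib.request import Request

# Hypothetical contact details; replace with a real project page and mailbox.
USER_AGENT = "cpi-collector/0.1 (+https://example.org/cpi-project; contact@example.org)"


def build_request(url):
    """Create a request that identifies the scraper and how to reach its owner."""
    return Request(url, headers={"User-Agent": USER_AGENT})


class Throttle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the interval; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            delay = max(0.0, self.min_interval - (now - self._last_request))
        if delay:
            self._sleep(delay)
        self._last_request = now + delay
        return delay


throttle = Throttle(min_interval_seconds=5)
request = build_request("https://nso.example.org/reports/cpi-2021-02.pdf")
throttle.wait()  # no delay on the first call
# urllib.request.urlopen(request) would send the request here.
print(request.get_header("User-agent"))
```

Since CPI reports appear roughly once a month, an interval of several seconds between requests, combined with one check per day, stays far below anything a web server would notice as load.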
What are some of the challenges you faced during the project, and how did you overcome them?
The main challenge stemmed from the fact that different NSOs present their CPI data in a range of formats and standards. The United Nations Classification of Individual Consumption According to Purpose (COICOP) has been the standard international format for CPI reporting since 2015, and most, but not all, countries subscribe to this format; those that do not have various reasons, including gaps in their data. Further, although most countries publish CPI data monthly, the actual day of the month varies from country to country.
This means that an automated daily check is needed to see whether new data has been published. Different methods are also needed to download and extract the data from different file formats, while structuring it to match COICOP specifications and storing it in an Excel spreadsheet. To overcome these challenges, I wrote a script to automate effective data cleaning and index compilation procedures for each country. I also had to write a script that adapts the scraping to the language used on non-English websites.
This challenge highlighted to me how important it is for government organizations such as NSOs to produce data in a standardized format and to apply international standards of statistical classification when organizing and presenting statistics. Just as one of the objectives of the Price Watch Center for Africa is harmonization of CPI data, the process of collecting and monitoring socio-economic indicators can be considerably improved if all NSOs strive to ensure that the availability and accessibility of their data match the requirements and demand for data for development.
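As an illustration of matching category labels to COICOP across languages, the following sketch maps English and French label variants to COICOP division codes. It shows one possible approach and is not the project's script: the alias table is a small illustrative subset, the function names are hypothetical, and real reports would need many more variants per country.

```python
import unicodedata

# COICOP division codes with a few label variants (the variants are illustrative).
COICOP_ALIASES = {
    "01": ["Food and non-alcoholic beverages",
           "Produits alimentaires et boissons non alcoolisées"],
    "06": ["Health", "Santé"],
    "07": ["Transport", "Transports"],
    "10": ["Education", "Enseignement"],
}


def normalize(label):
    """Lowercase, strip accents and collapse whitespace so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", label)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


# Reverse index: normalized label -> division code.
LOOKUP = {
    normalize(alias): code
    for code, aliases in COICOP_ALIASES.items()
    for alias in aliases
}


def to_coicop(label):
    """Return the COICOP division code for a label, or None if it is unknown."""
    return LOOKUP.get(normalize(label))


for raw_label in ["  SANTÉ ", "Transports", "Food and Non-Alcoholic Beverages", "Loisirs"]:
    print(repr(raw_label), "->", to_coicop(raw_label))
```

Returning None for unknown labels, rather than guessing, lets a daily run flag categories that need a new alias when a country changes the wording or layout of its report.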
There are additional potential challenges including information technology limitations, the risk of missing data due to website changes, or skill requirements to build and maintain the web scrapers. However, I was able to transfer some of the necessary skills to the ECA team. As long as there is someone on hand to fix occasional problems, then large volumes of data can be collected cost-effectively and efficiently.
What did the capacity transfer process entail and how have you ensured sustainability of this project after your fellowship ends?
I have been carrying out knowledge transfer with two selected members of staff at the ECA. They were able to install and use the web application locally. I have taught them all they need to know about how the web application works, and the entire system can be easily replicated for other data collection activities (e.g. to support the collection of exchange rate data from national revenue authorities) with little modification. They also asked me to write an operations manual on how the system works, how to add countries to the system, and how to inspect and maintain the system in case of any failures. I am currently finalizing the manual, and I believe they are in a position to maintain the system as well as to develop additional scripts to compute sub-regional aggregates of the data.