<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://portfolio.yedite.ch/feed.xml" rel="self" type="application/atom+xml" /><link href="https://portfolio.yedite.ch/" rel="alternate" type="text/html" /><updated>2026-06-02T17:25:46+00:00</updated><id>https://portfolio.yedite.ch/feed.xml</id><title type="html">Fatima Galaudu</title><subtitle>Backend developer. Python, Django, systems.</subtitle><author><name>Fatima Galaudu</name></author><entry><title type="html">Building ETL Pipelines for a World Bank Education Programme</title><link href="https://portfolio.yedite.ch/2025/09/29/building-etl-pipelines-for-a-world-bank-education-programme/" rel="alternate" type="text/html" title="Building ETL Pipelines for a World Bank Education Programme" /><published>2025-09-29T00:00:00+00:00</published><updated>2025-09-29T00:00:00+00:00</updated><id>https://portfolio.yedite.ch/2025/09/29/building-etl-pipelines-for-a-world-bank-education-programme</id><content type="html" xml:base="https://portfolio.yedite.ch/2025/09/29/building-etl-pipelines-for-a-world-bank-education-programme/"><![CDATA[<p>This process took a couple of months to perfect, across three states in Nigeria and numerous data collection exercises. I never meant to be a data engineer, but here I am, and this is what my current workflow looks like!</p>

<h3 id="step-1-extract--designing-the-kobotoolbox-forms">Step 1: Extract – Designing the KoboToolbox Forms</h3>

<p>Field officers visit schools at the beginning of the school year to enrol students. They use mobile forms built in <a href="https://kf.kobotoolbox.org/">KoboToolbox</a>, an open-source data collection platform designed for humanitarian and development work.</p>

<p>Designing a good form matters more than it sounds. A badly structured form produces data that’s nearly impossible to clean later (been there: the data from my first form was a nightmare that I had to learn from). Good forms are built with constraints and validation baked in: required fields, numeric-only inputs for account numbers, and cascading dropdowns that prevent officers from selecting an invalid school/LGA combination. I have found Claude to be extremely competent at making forms. Forms that would take me 4 hours to build now take just a few minutes. Interesting time to be alive!</p>
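<p>For illustration, here is a minimal Python sketch of the same kinds of rules: required fields, a numeric-only account number, and a school that must belong to the selected LGA. KoboToolbox expresses these inside the form itself, not in Python, so this only mirrors the logic. The field names, the <code>SCHOOLS_BY_LGA</code> lookup and the 10-digit account length are assumptions made for the example.</p>

```python
import re

# Hypothetical lookup of which schools belong to which LGA (illustrative values only).
SCHOOLS_BY_LGA = {
    "LGA A": {"School 1", "School 2"},
    "LGA B": {"School 3"},
}

# Assumption: account numbers are exactly 10 digits.
ACCOUNT_NUMBER = re.compile(r"\d{10}")


def validate_submission(row):
    """Return a list of problems found in one form submission (a dict)."""
    problems = []
    # Required fields: must be present and not blank.
    for field in ("student_name", "lga", "school", "account_number"):
        if not str(row.get(field, "")).strip():
            problems.append(f"{field} is required")
    # Numeric-only, fixed-length account number.
    account = str(row.get("account_number", ""))
    if account and not ACCOUNT_NUMBER.fullmatch(account):
        problems.append("account_number must be exactly 10 digits")
    # Cascading choice: the school must belong to the selected LGA.
    lga, school = row.get("lga"), row.get("school")
    if lga and school and school not in SCHOOLS_BY_LGA.get(lga, set()):
        problems.append(f"{school} is not in {lga}")
    return problems
```

<p>An empty list means the submission passes. Running a check like this over an old export is one way to see how much a stricter form would have caught up front.</p>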

<p>The forms are deployed to field officers’ phones and sync automatically when they have connectivity. A single week’s collection can produce thousands of rows.</p>

<p><img src="https://softcopyofme.yedi.com.ng/wp-content/uploads/2026/05/Screenshot-2026-05-29-at-15.07.04-1024x418.png" alt="Sample KoboToolbox form" /></p>

<p><em>Sample KoboToolbox form</em></p>

<h3 id="step-2-transform--cleaning-and-validation">Step 2: Transform – Cleaning and Validation</h3>

<p>Raw field data is messy. Names are misspelled. Bank account numbers have extra spaces. The same student appears under two slightly different names. This is where most of the work lives.</p>

<p>I export the raw data from KoboToolbox and run it through a Python/pandas pipeline, but first I do the manual bits of data cleaning in Excel. (My preference is actually WPS Office; it’s much more lightweight and a lot faster on a Mac.)</p>

<p>The first thing I do is delete any rows where a numeric field does not meet its minimum length, if it has one. (This has been fixed in later forms with a strict number length.) Then on to duplication. Duplicates can mostly be identified through repeated values in the numeric fields (NIN, BVN, phone numbers, account numbers). I then delete the rows that also have the same values across all the other columns.</p>
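<p>I do this part by hand, but the same two steps could be sketched in pandas roughly as follows. The column names and required lengths in <code>REQUIRED_LENGTHS</code> are assumptions for the example, and <code>drop_bad_and_duplicate_rows</code> is a hypothetical helper, not part of my actual pipeline.</p>

```python
import pandas as pd

# Hypothetical column names and required lengths; adjust to the real export.
REQUIRED_LENGTHS = {"NIN": 11, "BVN": 11, "Account Number": 10}


def drop_bad_and_duplicate_rows(df):
    """Remove rows with wrong-length IDs, then rows that repeat an earlier one."""
    cleaned = df.copy()
    for column, length in REQUIRED_LENGTHS.items():
        # Keep IDs as text so leading zeros survive, and strip stray spaces.
        cleaned[column] = cleaned[column].astype(str).str.replace(" ", "", regex=False)
        # Keep only rows whose ID is exactly the required number of digits.
        is_valid = cleaned[column].str.fullmatch(rf"\d{{{length}}}")
        cleaned = cleaned.loc[is_valid].copy()
    # Exact duplicates: every column matches an earlier row.
    return cleaned.drop_duplicates().reset_index(drop=True)
```

<p>Rows that share an ID but differ elsewhere are deliberately left alone here, since those usually need a human to decide which record is right.</p>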

<p>Sometimes we collect bank account numbers, but the bank names are not standardised. So I got ChatGPT to make a JSON file that maps the casually typed bank names to the actual bank names. Here’s a small sample of the file.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"polaris bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polaris Bank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"first bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"First Bank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"eco bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Ecobank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"access bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Access Bank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"keystone bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Keystone Bank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"barnabas mary frank"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
  </span><span class="nl">"zenith bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Zenith Bank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"gtbank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"GTBank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"polarisbank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polaris Bank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"gt bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"GTBank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"union bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Union Bank"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"uba bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"United Bank for Africa"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"unity bank"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Unity Bank"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The file contains every value that was entered as a bank name (the keys), so I can get the standardised name (the value) simply by looking up the key.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">json</span>

<span class="n">banksjson</span> <span class="o">=</span> <span class="sh">'</span><span class="s">/Training /clean_new_Cohort/bank_mapping.json</span><span class="sh">'</span>
<span class="n">bank_mapping</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">banksjson</span><span class="p">,</span> <span class="sh">'</span><span class="s">r</span><span class="sh">'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">'</span><span class="s">utf-8</span><span class="sh">'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">bank_mapping</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">AI Generated Mapping:</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="nf">dumps</span><span class="p">(</span><span class="n">bank_mapping</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>

<span class="c1"># Normalise the raw names (trim, lowercase) so they match the mapping keys
</span><span class="n">df_students</span><span class="p">[</span><span class="sh">'</span><span class="s">Caregiver Bank Name2</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_students</span><span class="p">[</span><span class="sh">'</span><span class="s">Caregiver Bank Name</span><span class="sh">'</span><span class="p">].</span><span class="nf">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">strip</span><span class="p">().</span><span class="nb">str</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span>
<span class="c1"># Apply the mapping to the normalised column, not the raw one
</span><span class="n">df_students</span><span class="p">[</span><span class="sh">'</span><span class="s">Caregiver Bank Name2</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_students</span><span class="p">[</span><span class="sh">'</span><span class="s">Caregiver Bank Name2</span><span class="sh">'</span><span class="p">].</span><span class="nf">replace</span><span class="p">(</span><span class="n">bank_mapping</span><span class="p">)</span>

<span class="c1"># Check results on the cleaned column
</span><span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="se">\n</span><span class="s">Cleaned bank names:</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">df_students</span><span class="p">[</span><span class="sh">'</span><span class="s">Caregiver Bank Name2</span><span class="sh">'</span><span class="p">].</span><span class="nf">value_counts</span><span class="p">())</span>

<span class="n">df_students</span><span class="p">.</span><span class="nf">to_excel</span><span class="p">(</span><span class="sh">"</span><span class="s">/Training /clean_new_Cohort/FILES/2ndCohort_enrollment_nodupes_bvn_actno_banks.xlsx</span><span class="sh">"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
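<p>One thing to watch with a lookup like this: <code>replace</code> leaves any value without a matching key unchanged, so a new misspelling slips through silently. A small check along these lines can list the names that still need adding to the JSON file. <code>find_unmapped</code> is a hypothetical helper, and the sample names are made up for the example.</p>

```python
def find_unmapped(raw_names, bank_mapping):
    """Return the normalised names that have no key in the mapping, sorted."""
    # Normalise the same way as the cleaning step: trim, then lowercase.
    normalised = {str(name).strip().lower() for name in raw_names}
    return sorted(normalised - set(bank_mapping))


# Illustrative data only.
sample_mapping = {"gt bank": "GTBank", "polarisbank": "Polaris Bank"}
sample_names = [" GT Bank", "Wema Bank", "polarisbank"]
unmapped = find_unmapped(sample_names, sample_mapping)
print(unmapped)
```

<p>With a DataFrame, the first argument could be the raw bank name column, and anything the function returns gets reviewed and added to the mapping by hand.</p>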

<p>There’s a lot that goes on in this stage, depending on the data I have. But the more robust the form I built in Step 1, the less cleaning there is to do.</p>

<h3 id="step-3-analysis--data-collection-metrics">Step 3: Analysis – Data Collection Metrics</h3>

<p>Before uploading, I generate per-school, per-LGA summaries. These feed into the disbursement logic later down the line, when it’s time for student verification. For this I mostly use Tableau. But if the data isn’t so varied, Excel graphs work just fine for an overview of what was collected. I tried Metabase and wanted it to become my new go-to, but I found Tableau to be better. I don’t really plan on becoming a data engineer or analyst, but if I did I would probably explore <a href="http://scikit-learn.org/">scikit-learn</a>.</p>
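<p>A per-school, per-LGA count is also easy to produce in pandas before the data ever reaches a charting tool. This is only a sketch: the <code>LGA</code> and <code>School</code> column names are assumptions, and <code>enrolment_summary</code> is a hypothetical helper.</p>

```python
import pandas as pd


def enrolment_summary(df):
    """Count enrolled students for each (LGA, School) pair."""
    # groupby sorts the group keys, so the output is ordered by LGA, then School.
    return df.groupby(["LGA", "School"]).size().reset_index(name="students")


# Illustrative rows only.
enrolment = pd.DataFrame({
    "LGA": ["LGA A", "LGA A", "LGA B"],
    "School": ["School 1", "School 1", "School 3"],
    "Student": ["A", "B", "C"],
})
summary = enrolment_summary(enrolment)
print(summary)
```

<p>The resulting table has one row per school, which is a convenient shape to export for Tableau or an Excel chart.</p>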

<h3 id="step-4-load--uploading-to-mysql">Step 4: Load – Uploading to MySQL</h3>

<p>Once the data is approved, it goes into the production MySQL database that powers the programme’s PHP dashboard. This can be done with a Python script, or with a plain CSV upload straight into the database. I found the CSV upload problematic because of the number of rows I have, so the script is mostly the way.</p>
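<p>As a rough idea of what a script-based load can look like, the sketch below writes a DataFrame in batches with pandas’ <code>to_sql</code> and its <code>chunksize</code> parameter. It is not my production script. An in-memory SQLite database stands in for MySQL so the example runs anywhere; for MySQL, pandas expects a SQLAlchemy connection instead. The table and column names are made up.</p>

```python
import sqlite3

import pandas as pd


def load_in_chunks(df, connection, table, chunk_size=1000):
    """Append df to the table, writing chunk_size rows per batch."""
    df.to_sql(name=table, con=connection, if_exists="append",
              index=False, chunksize=chunk_size)
    return len(df)


# Stand-in database so the sketch runs anywhere; production would use MySQL.
connection = sqlite3.connect(":memory:")
students = pd.DataFrame({"name": ["A", "B", "C"], "school": ["S1", "S1", "S2"]})
loaded = load_in_chunks(students, connection, "students", chunk_size=2)
row_count = connection.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(loaded, row_count)
```

<p>Comparing the number of rows sent with the number of rows in the table afterwards is a cheap sanity check that nothing was dropped on the way in.</p>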

<hr />

<h4 id="reflections">Reflections</h4>

<p>To be honest, when I first read about ETL pipelines, I thought everything was automated and that I was doing something wrong. After all, I am a software engineer who found herself doing data engineering. But it turns out what I have is still considered ETL, just not the most advanced kind that uses a Spark cluster or an Airflow DAG. I’ll be looking into how else I can make it more robust and automated. But the key things I have realised are:</p>

<ul>
  <li>collect data that is as clean as possible</li>
  <li>minimise manual data entry, use drop-downs and selects where possible</li>
  <li>let enumerators type free text only as a last resort</li>
</ul>

<p>Working at this intersection of technology and development taught me something that purely commercial projects don’t always surface: data quality is a human problem before it’s a technical one. The validation logic exists because a missing digit in a bank account number means a mother doesn’t receive money for her daughter’s education that term.</p>]]></content><author><name>Fatima Galaudu</name></author><category term="data-engineering" /><category term="python" /><category term="pandas" /><category term="kobo" /><category term="etl" /><category term="world-bank" /><summary type="html"><![CDATA[This process took a couple of months to perfect, across three states in Nigeria and numerous data collection exercises. I never meant to be a data engineer, but here I am, and this is what my current workflow looks like!]]></summary></entry><entry><title type="html">Hello World</title><link href="https://portfolio.yedite.ch/2024/01/01/hello-world/" rel="alternate" type="text/html" title="Hello World" /><published>2024-01-01T00:00:00+00:00</published><updated>2024-01-01T00:00:00+00:00</updated><id>https://portfolio.yedite.ch/2024/01/01/hello-world</id><content type="html" xml:base="https://portfolio.yedite.ch/2024/01/01/hello-world/"><![CDATA[<p>Your post content here.</p>]]></content><author><name>Fatima Galaudu</name></author><category term="general" /><summary type="html"><![CDATA[Your post content here.]]></summary></entry></feed>