Better Data Engineering, Fewer Surprises

By Óttar Guðmundsson

Bad data has a way of slipping through unnoticed until it shows up in reports. This was a known challenge for Samband Íslenskra Sveitarfélaga as data from external sources wasn’t always reliable. To address this, Sambandið decided to take control and become the trusted source of data for Iceland’s municipalities.

When Gangverk partnered with Sambandið in 2025 to rethink their data infrastructure, the goal was simple: make data quality everyone’s responsibility. That meant catching issues earlier and making them visible right away before they end up in reports.

A Growing System Under Pressure

Sambandið's IT division had spent the past several years building a data lake to collect financial data from external sources: salary data, financial plans, and the quarterly results of Iceland's municipalities. The original solution served its purpose, but as the number of data sources grew, so did the requirements, and the existing tools were not designed for governance or daily operational needs. Data validation happened late in the process, re-running data processing was cumbersome, and the ability to trace data origins was limited. Furthermore, the data lacked standardization, which required frequent manual adjustments in the code to make it usable.

This pushed data quality issues downstream, resulting in manual review, delays, duplicated effort, and a constant data ping-pong between consumers and producers. A small number of individuals became a bottleneck for validating the incoming data, and with multiple municipalities submitting growing numbers of data sources, the centralized data lake approach became difficult to scale.

Placeholder: shift-left ingestion versus late downstream debugging
In data engineering, upstream refers to data ingestion, while downstream refers to transformation and analytics. The red curve shows effort spent fixing issues late in the pipeline, while the green curve shows effort shifted upstream to catch issues early.

Fixing Quality Early

Our suggestion was to move responsibility upstream, closer to the data producers, distributing validation and quality control to the point of ingestion. This "shift-left" approach resolves data quality issues at the source, before data enters the system. To achieve this, we introduced a standardized API and data workflows for submitted data that

  • Standardizes how data is submitted across sources

  • Provides well-documented API instructions for producers, with code examples and testing environments

  • Gives data producers clear and immediate feedback on issues

  • Ensures only high-quality data moves forward in the process

  • Post-processes submissions to detect anomalies based on historical and peer comparisons, providing non-blocking warnings
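As an illustrative sketch of validation at ingestion (the field names and rules below are hypothetical, not Sambandið's actual schema), the core idea is a check that returns blocking errors to the producer immediately on submission:

```python
# Hypothetical ingestion-time validation -- field names and rules are
# illustrative, not the real API's schema.
REQUIRED_FIELDS = {"municipality_id", "period", "amount_isk"}

def validate_submission(record: dict) -> list[str]:
    """Return a list of blocking errors; an empty list means the record passes."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
        return errors  # no point checking values that are absent
    if not isinstance(record["amount_isk"], (int, float)):
        errors.append("amount_isk must be numeric")
    if str(record["period"]).count("-") != 1:  # expects e.g. "2025-Q1"
        errors.append("period must look like '2025-Q1'")
    return errors

# A producer gets feedback immediately on submission:
bad = {"municipality_id": "3000", "period": "2025Q1", "amount_isk": "x"}
print(validate_submission(bad))
```

The key design choice is that errors come back in the same request/response cycle, so the producer can fix and resend without waiting for a downstream review.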

This allowed data producers to become more self-sufficient, with the ability to validate, troubleshoot, and review the data we flag for potential issues.

By catching problems early, we reduced the feedback loop from days to seconds. Data producers received feedback on submission, which they could act on directly, simply resending the data after fixing it on their end. Note that the producers were never trying to send us bad data; they had no way of knowing their data was incorrect.
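The non-blocking anomaly warnings mentioned earlier can be sketched as a comparison against historical values. The z-score threshold and statistics below are illustrative, not the actual detection logic:

```python
import statistics

def anomaly_warnings(value: float, history: list[float], z_threshold: float = 3.0) -> list[str]:
    """Non-blocking warnings: flag values far outside the historical range."""
    if len(history) < 2:
        return []  # not enough history to compare against
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return [] if value == mean else [f"value {value} differs from constant history {mean}"]
    z = abs(value - mean) / stdev
    if z > z_threshold:
        return [f"value {value} is {z:.1f} standard deviations from the historical mean"]
    return []

history = [100.0, 102.0, 98.0, 101.0]
print(anomaly_warnings(1000.0, history))  # warned, but ingestion is not blocked
print(anomaly_warnings(99.0, history))    # no warning
```

Because these checks only warn, legitimate outliers (a one-off budget change, say) still flow through while someone gets a chance to review them.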

So instead of continuing to apply manual fixes within the system, we created a feedback loop that encourages communication between systems and teams. This improves data quality where the data is created and builds a shared understanding, reinforcing the principle that data quality is everyone's responsibility. There is still (and always will be) an operational burden on the IT department, but a scalable, collaborative data workflow reduces it.

This created a strong and trusted data foundation that improves current operations and enables entirely new ways of interacting with data.

Exploring the Next Frontier of Data Access

With trusted and well-governed data in place, we have also begun exploring how this can enable the next generation of data access. We have been experimenting with natural language interfaces powered by AI agents that allow users to ask complex questions in plain Icelandic and have them translated into precise analytical queries.

The key enabler is our high confidence in the data: quality, structure, and governance are in place, so the building blocks of the project are reliable. Rather than exposing raw data to a model, the agent operates with an understanding of the data's structure and context, allowing it to generate optimized queries on request. The model never sees the data itself; it only formulates the query, which is then executed against the data lake.
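One way to sketch this separation: the agent's prompt contains only schema metadata, never rows of data, and a guardrail checks that generated SQL stays inside the known schema. The table and column names below are hypothetical, as is the naive guardrail:

```python
import re

# Hypothetical schema metadata -- the model only ever sees this, never rows.
SCHEMA = {
    "quarterly_results": ["municipality_id", "period", "revenue_isk", "expenses_isk"],
    "salaries": ["municipality_id", "period", "total_salary_isk"],
}

def build_agent_prompt(question: str) -> str:
    """The prompt contains the question and the schema -- no data values."""
    tables = "\n".join(f"- {t}({', '.join(cols)})" for t, cols in SCHEMA.items())
    return f"Translate this question into SQL.\nTables:\n{tables}\nQuestion: {question}"

def references_known_tables(sql: str) -> bool:
    """Naive guardrail: every FROM/JOIN target must be a known table."""
    targets = re.findall(r"\b(?:from|join)\s+(\w+)", sql.lower())
    return all(t in SCHEMA for t in targets)

print(references_known_tables("SELECT revenue_isk FROM quarterly_results"))  # True
print(references_known_tables("SELECT * FROM secret_table"))                 # False
```

A production guardrail would use a real SQL parser rather than a regex, but the division of responsibility is the point: the model formulates, the platform validates and executes.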

This is a more intuitive way of working with data, where business departments and executives can interact with complex datasets using natural language while ensuring privacy and cost efficiency. It is an early example of how better data engineering not only improves pipelines but creates new ways for non-technical people to engage with trusted information within their company.

We are only beginning to explore this exciting frontier. In the not-so-distant future, we believe interacting with data will become as natural as asking a question, given that the correct foundations are in place.

Placeholder: natural language data access demo
A plain question in Icelandic is automatically translated into a structured data query by an AI agent. The data itself is never exposed to the agent.

AWS as a Foundation

The real enabler for this shift was the underlying infrastructure. We rebuilt Sambandið's data platform on AWS, combined with Prefect as the data orchestration tool to manage data workflows.

All data pipelines, storage, and monitoring now live within a single secure, managed environment. Every step in the data flow, from upstream to downstream, is traceable, making it easier to understand what is happening and where issues occur. Issues can be investigated or reprocessed without unnecessary effort, which makes the system far easier to operate in practice.
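As a simplified illustration of that traceability (not the actual Prefect setup), each pipeline step can attach a lineage record describing what it did, when, and how the row counts changed:

```python
import datetime

def run_step(name: str, func, data: list, lineage: list[dict]) -> list:
    """Run one pipeline step and record where the data came from and when."""
    result = func(data)
    lineage.append({
        "step": name,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "rows_in": len(data),
        "rows_out": len(result),
    })
    return result

lineage: list[dict] = []
raw = [{"amount": 1}, {"amount": -2}, {"amount": 3}]
clean = run_step("drop_negative", lambda rows: [r for r in rows if r["amount"] >= 0], raw, lineage)
final = run_step("passthrough", lambda rows: rows, clean, lineage)
for entry in lineage:
    print(entry["step"], entry["rows_in"], "->", entry["rows_out"])
```

When something looks wrong downstream, a record like this shows exactly which step changed the data and when, which is the property an orchestration tool provides out of the box.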

The platform was built using modern software engineering practices. A unified codebase, automated deployment pipelines, and infrastructure defined as code ensure that changes are consistent and safe to deploy across evolving environments.

The system also includes built-in communication mechanisms, so Sambandið can notify stakeholders or their representatives when a data delivery is due, when analytics find anything suspicious, or when data was ingested correctly as expected.

This made it possible to build and scale the platform while maintaining a high level of reliability and oversight. Most importantly, the system is now maintainable: it supports change as a part of normal operations.

Shorter Loops and Stronger Trust

This shift moved data quality from a centralized bottleneck to a shared responsibility. The impact was as much cultural as technical: data producers in siloed institutions could collaborate and resolve issues at the source, reducing delays and manual effort. Analysts spend less time cleaning data and more time using it within a trustworthy data ecosystem. In practice, this led to measurable improvements: automation increased by 85%, manual review decreased by 75%, and overall data quality improved by more than 30%.

We’re not done. We’re expanding this shift-left approach across more datasets and collaborating with producers to standardize how they deliver data into the system. Good data unlocks better decisions and transparency. It’s not just about technology. It’s about methodology. It’s about data engineering.