GCP: Aurai’s approach to a data platform on Google Cloud

As data engineers, we most often encounter data platforms built on Azure or Amazon Web Services. However, we’re convinced that Google Cloud Platform (GCP) is a strong and underrated alternative. With its focus on data and AI, GCP provides a solid foundation for modern data platforms.

In this post, we’ll share our perspective on designing and implementing a data platform in GCP. We’ll outline our core architectural principles, explain the rationale behind our approach, and walk through how we would build a scalable and maintainable solution using GCP’s native services.

Key stages of the data platform and materialization

Before diving into implementation, we start by defining the key stages of the platform. These stages form the foundation for consistent and reliable data flows, with a focus on traceability and scalability. By defining them early on, we create a basis for maintainability and ensure the platform remains adaptable and future-proof as requirements evolve.

  • Landing Zone

The first stage of our data platform focuses on ingestion: bringing in data from various sources. Once ingested, the data is stored in what we refer to as the Landing Zone. This layer acts as the raw data archive, where data is stored in its original format and schema, exactly as received from the source systems. By keeping the data untouched at this stage, we preserve its integrity and enable full traceability. The Landing Zone serves as the first foundational layer of the platform, supporting auditability, debugging, and reprocessing without data loss.

  • Raw

From the landing zone, data is loaded into the Raw Layer. At this stage, we standardize the data by renaming columns and setting appropriate data types. The goal is not to alter the core structure of the data, but to create a clean and consistent version that stays as close as possible to the original source. This preparation ensures the data is ready for more advanced transformations and analysis downstream. The Raw Layer forms the bridge between initial ingestion and the next stage, where we start modelling the data for analytical purposes.

  • Data warehouse

In this layer, the data is structured according to a predefined data model. The focus is on storing data efficiently: no information is lost, while unnecessary duplication is avoided. This layer should be optimized for querying and support ad hoc data exploration. Data analysts should have easy access to this layer and need to be able to rely on accurate, well-structured data for their projects.

  • Data mart

In the data mart layer, the focus is on serving data for specific use cases and end users. The data models should be designed for immediate consumption by dashboards or reports, with performance in mind. This could mean, for example, pre-computing specific calculations that a dashboard needs. The key end users of this layer are data analysts who build dashboards, or other consumers who don’t want to deal with raw or complex data sets.

The data mart stage serves as your showcase for actionable insights. Picture by Muneeb Syed on Unsplash.

Tools

Ingestion 

The goal of ingestion is to get data from suppliers and store it on our own platform. Since this is often a repetitive task, we want a tool that is reliable and efficient. Google Cloud Platform has native options for this part of the platform, such as BigQuery Data Transfer Service (DTS), which mainly focuses on pulling in data from online sources.

However, in a specific case where we mainly worked with GCP, we went with Fivetran. The client was already using it, and we had no objections that prevented us from sticking with the tool. Connecting sources such as Meta, Bing or Google Ads took only a couple of clicks and less than five minutes, which is significantly faster than figuring out how each API works, how to query the data from these sources, and then building custom Python scripts.

While Fivetran is very time efficient and easy to use during implementation, its limited flexibility and customization options may make it a less ideal solution for every data platform. Since Fivetran can be more expensive in the long run compared to other ingestion methods, it’s worth evaluating whether the investment is justified and whether you might get better value for your money by allocating those resources elsewhere on the platform.
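If you do decide to build ingestion yourself, a minimal sketch could look like the following. This assumes a hypothetical source API, bucket name and folder layout; the payload is fetched and written to the landing zone in GCS exactly as received.

# Minimal custom ingestion sketch, as an alternative to a managed connector.
# The API endpoint, bucket name and folder layout are hypothetical placeholders.
import datetime

import requests
from google.cloud import storage


def ingest_to_landing_zone(api_url: str, bucket_name: str, source_name: str) -> str:
    """Fetch a payload from a source API and store it untouched in the landing zone."""
    response = requests.get(api_url, timeout=60)
    response.raise_for_status()

    # Keep the payload exactly as received; transformations happen in later layers.
    run_date = datetime.date.today().isoformat()
    blob_path = f"landing/{source_name}/{run_date}/extract.json"

    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(blob_path).upload_from_string(
        response.text, content_type="application/json"
    )
    return blob_path


if __name__ == "__main__":
    print(ingest_to_landing_zone(
        api_url="https://api.example.com/v1/ads/daily",  # hypothetical source API
        bucket_name="my-landing-zone-bucket",            # hypothetical bucket
        source_name="example_ads",
    ))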

 

Storage

After loading the data, the next step is to store it reliably and efficiently. In the GCP environment, Google Cloud Storage (GCS) is the preferred option for this purpose. GCS serves as the landing zone, according to our predefined key stages, where raw or ingested data is initially stored before further processing. Additionally, storing data in GCS allows for seamless integration with other GCP services, such as BigQuery and Dataflow, enabling efficient downstream processing and analytics.
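As a small illustration of that integration, a file that has landed in GCS can be loaded into a BigQuery landing dataset with a single load job. The bucket, path and table names below are hypothetical, and the schema is auto-detected so the data keeps its original shape.

# Sketch of loading a landed file from GCS into a BigQuery landing table.
# URIs, project and dataset names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                      # keep the source schema as received
    write_disposition="WRITE_APPEND",
)

load_job = client.load_table_from_uri(
    "gs://my-landing-zone-bucket/landing/example_ads/2024-05-01/extract.json",
    "my-project.landing_example_ads.campaigns",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish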

Besides storing your data efficiently and reliably, make sure you store it neatly and keep it organized. Picture by Jesse Orrico on Unsplash.

Data warehouse

After data ingestion, we want to load our data from the landing zone in GCP into the next layers. In the GCP environment, the data warehouse solution is BigQuery. A common best practice is to organize your BigQuery environment by creating separate projects or datasets for each key stage of the data platform. For example:

  • In the landing zone and raw layer, it’s advisable to maintain distinct datasets for each data source to preserve data provenance and simplify traceability.
  • Within the data warehouse layer, datasets should be organized according to business domains or models, reflecting the logical structure of your organization’s data.
  • Finally, the data mart layer consolidates these business-specific datasets into curated, consumption-ready tables that enable insightful analytics and reporting.

Overall, BigQuery is a user-friendly data warehouse. Its interface is clean and well-organized, just as we’ve come to expect from Google. In terms of performance, it was sufficient for our workloads and, in my experience, behaved similarly to other tools such as Snowflake and Databricks.

That said, the specific way you structure your BigQuery environment should be tailored to your organization’s needs. For example, switching between separate projects, and therefore between the key stages, can be rather cumbersome. This is especially inefficient and confusing during development phases. 

Another important consideration is the choice of dataset locations. BigQuery datasets reside in specific geographic regions, and cross-region queries are not supported. Since dataset locations cannot be changed after creation, a wrong choice can lead to issues later on. While this is not a major drawback, it’s something to be mindful of when setting up your datasets.

For smaller data platforms, I would recommend using separate datasets for each stage within a single project. Separate projects can then be created for development, acceptance (if needed), and production environments.
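As a rough sketch of that setup, the datasets for each stage could be created once per environment as follows. Project and dataset names are hypothetical, and the location is set explicitly at creation time because it cannot be changed afterwards.

# Sketch of a per-environment BigQuery setup: one project per environment,
# one dataset per key stage. All names are hypothetical.
from google.cloud import bigquery

PROJECT = "my-data-platform-dev"  # e.g. my-data-platform-prd for production
STAGE_DATASETS = [
    "landing_example_ads",  # landing zone, one dataset per source
    "raw_example_ads",      # raw layer, one dataset per source
    "dwh_sales",            # data warehouse, organized by business domain
    "mart_sales",           # data mart, consumption-ready tables
]

client = bigquery.Client(project=PROJECT)
for dataset_name in STAGE_DATASETS:
    dataset = bigquery.Dataset(f"{PROJECT}.{dataset_name}")
    dataset.location = "EU"  # fixed after creation, so choose deliberately
    client.create_dataset(dataset, exists_ok=True)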

Transformation & Modelling

For the transformation part of a Google-native platform, the solution is Dataform. Dataform was acquired by Google in 2020 and now serves as Google’s answer to dbt. The IDE is directly accessible through the BigQuery interface and is intuitive to use, making it a suitable solution for users with a less technical background than a typical data engineer.

The configuration and structure of models in Dataform are very similar to those in dbt. With prior experience in dbt Cloud, I found Dataform easy to pick up right away. Features like data lineage are also supported, which is quite insightful. Additionally, Dataform integrates with GitHub repositories for version control and collaboration. For someone like me, who is very convinced of what dbt has to offer, Dataform felt like home, albeit with a few quirks.

One key drawback was the difficulty of working on multiple features simultaneously. The suggested environment is set up so that each developer has their own personal workspace, with each workspace corresponding to a single branch of the linked Git repository. Because my personal workspace maps to one branch, I can’t work on different features at the same time: switching to another branch for a quick fix or new feature means I have to either commit or discard my current work. Very unproductive. Another limitation is that Dataform is only available via the web IDE. There’s no support for working in a local development environment like VS Code, which I personally prefer. And the biggest downside of all: no dark mode.

In our proposed approach, orchestration is split across two tools: Fivetran handles and orchestrates the ingestion process, while the rest of the ETL workflow is managed with Dataform, which orchestrates the data transformations within BigQuery.

Data serving

As discussed, the data mart is primarily designed for serving data to end users. Within GCP, there are several tools well-suited for this purpose. For dashboarding and visualization, Looker is an excellent choice, offering seamless integration with BigQuery and making it an ideal solution for those looking to stay within the Google ecosystem. However, if you prefer a tool outside of the Google ecosystem, Tableau is also a strong alternative based on our experience, providing a stable integration with BigQuery as well.

It’s worth noting, in the case of data serving, that data security and governance can be effectively managed using IAM (Identity and Access Management) roles in GCP. Permissions using IAM roles can be defined granularly at the table, view, dataset, or even project level, providing fine-grained control over data access. IAM roles can also be connected to custom-defined tags and labels, allowing for flexible and scalable data governance policies that can adapt to complex organizational security requirements across any layer of the data architecture.
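As a small illustration, read access on a single dataset could be granted to an analyst through the BigQuery client’s access entries, which map onto IAM. The project, dataset and email address below are hypothetical.

# Sketch of granting dataset-level read access; names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

dataset = client.get_dataset("my-project.mart_sales")
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",  # hypothetical analyst account
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])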

Beyond traditional dashboarding, GCP also offers AI/ML solutions such as Vertex AI for machine learning model development and deployment, AutoML for automated model training, and BigQuery ML for running machine learning models directly within BigQuery using SQL. 
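As a hedged sketch of what BigQuery ML looks like in practice, the two queries below train a simple linear regression and score rows with it, entirely in SQL. The project, dataset, table and column names are made up for illustration.

# Sketch of BigQuery ML: train and apply a model using plain SQL.
# All project, dataset, table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Train a simple linear regression on historical performance data.
client.query("""
    CREATE OR REPLACE MODEL `my-project.dwh_sales.revenue_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['revenue']) AS
    SELECT ad_spend, clicks, revenue
    FROM `my-project.dwh_sales.daily_performance`
""").result()

# Score rows with the trained model.
predictions = client.query("""
    SELECT *
    FROM ML.PREDICT(
        MODEL `my-project.dwh_sales.revenue_model`,
        TABLE `my-project.dwh_sales.daily_performance`
    )
""").result()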

Picture by path digital on Unsplash.

Materialization strategies

As discussed, four key stages are applied in the data platform: landing zone, raw, data warehouse and data mart. Since we want to avoid storing duplicate data, it is not desirable to materialize the data in the same way across all layers.

In the landing zone, the data needs to be stored exactly as received from the source. Light transformations will be applied in the raw layer, such as setting the correct data types. Since these transformations do not require heavy computation, views are preferred over tables. That said, views aren’t always cheaper than tables, so it’s worth checking what makes the most sense for your specific setup.
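A raw-layer view of this kind might look like the sketch below, with hypothetical dataset and column names: it only renames columns and casts data types, leaving the landing data itself untouched.

# Sketch of a raw-layer view: rename columns and cast types, nothing more.
# Dataset, table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE VIEW `my-project.raw_example_ads.campaigns` AS
    SELECT
        CampaignId                  AS campaign_id,
        SAFE_CAST(Cost AS NUMERIC)  AS cost,
        DATE(ReportDate)            AS report_date
    FROM `my-project.landing_example_ads.campaigns`
""").result()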

In the data warehouse layer, the data is modelled with analytical purposes in mind. To avoid applying the same transformations to the same data every day, incremental loads are the best materialization method here. This allows applying the required transformations while keeping processing costs under control.
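One way to express such an incremental load in plain BigQuery SQL is a MERGE that only touches recent records, as sketched below with hypothetical table names; in Dataform you would typically declare the table as incremental and let the framework generate similar statements for you.

# Sketch of an incremental load into the warehouse layer via MERGE.
# Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    MERGE `my-project.dwh_sales.orders` AS target
    USING (
        SELECT order_id, customer_id, order_date, amount
        FROM `my-project.raw_webshop.orders`
        WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    ) AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN
        UPDATE SET amount = source.amount
    WHEN NOT MATCHED THEN
        INSERT (order_id, customer_id, order_date, amount)
        VALUES (source.order_id, source.customer_id, source.order_date, source.amount)
""").result()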

For the data mart, tables are the most suitable method. Since in some cases the mart layer serves only dashboards, we can expect that some queries are executed repeatedly. Materializing the data as tables improves performance and reduces compute overhead. But, if you want to optimize performance even further, consider using materialized views.
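A mart-level materialized view could be as simple as the sketch below, using hypothetical names: a pre-aggregated result that dashboards can query repeatedly without re-scanning the underlying warehouse table.

# Sketch of a materialized view in the data mart layer; names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE MATERIALIZED VIEW `my-project.mart_sales.daily_revenue` AS
    SELECT order_date, SUM(amount) AS total_revenue
    FROM `my-project.dwh_sales.orders`
    GROUP BY order_date
""").result()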

Logging & alerting

When implementing a data platform with BigQuery, we recommend establishing a robust monitoring strategy that goes beyond basic pipeline failure alerts. While simple email notifications for failed jobs are a good start, they often don’t provide the full visibility needed to ensure data reliability and business continuity.

A key step is to implement automated data quality tests within your transformation workflows. For example, using tools like Dataform, you can configure tests such as uniqueness checks on primary keys or non-null constraints on mandatory columns. These tests should be applied in the data warehouse layer to verify the integrity and consistency of your data.
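Outside Dataform, equivalent checks can be expressed as queries that should return zero rows, as in the sketch below with hypothetical table and column names; in Dataform itself these would be configured as assertions, for example a unique key and a non-null constraint.

# Sketch of data quality checks as zero-row queries; names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

checks = {
    "orders_unique_order_id": """
        SELECT order_id, COUNT(*) AS n
        FROM `my-project.dwh_sales.orders`
        GROUP BY order_id
        HAVING n > 1
    """,
    "orders_order_date_not_null": """
        SELECT order_id
        FROM `my-project.dwh_sales.orders`
        WHERE order_date IS NULL
    """,
}

for name, sql in checks.items():
    offending = list(client.query(sql).result())
    if offending:
        # Failing loudly here lets the orchestration skip the data mart refresh.
        raise ValueError(f"Data quality check '{name}' failed: {len(offending)} rows")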

To maintain data reliability downstream, it’s best practice to define dependencies between layers. For instance, if a data quality test fails, the data mart layer should not be refreshed in that run. This approach prevents dashboards from displaying incomplete or incorrect data, while still allowing them to serve data, even if it’s slightly outdated.

In addition to standard data quality checks, we advise implementing assertions to monitor whether your data aligns with expected business logic. Assertions are a Dataform feature: queries designed to return zero results under normal conditions. Any returned rows indicate a potential data issue, such as an ingestion failure or even a business anomaly.

These assertions can then be integrated with GCP’s Logs Explorer, which enables automated alerting via email when an assertion fails. This proactive monitoring helps to identify problems at short notice. Furthermore, the alerting capabilities allow temporary snoozing of alerts, reducing noise when issues are acknowledged but not immediately resolved.

By combining these monitoring practices (data quality tests and layer dependencies, assertions, and proactive alerting through Logs Explorer) you create a monitoring framework that supports fast troubleshooting and reduces the mean time to resolution (MTTR).

Picture by Mimi Thian on Unsplash

Final Thoughts

Overall, Google Cloud Platform offers a comprehensive set of solutions to build a solid, reliable, and scalable data platform. It provides all the core capabilities found in other leading platforms. One area with significant room for improvement is the integration with GitHub repositories. Currently, Dataform workspaces do not support simultaneous development of multiple features or fixes, which can lead to inefficiencies in both development and maintenance. Additionally, better support for local development with Dataform, for example in Visual Studio Code, would significantly improve the engineering experience. But at Aurai, we prefer the sandwich method when giving feedback, so to end on a positive note: 8/10 would recommend.