Using dataform to improve data quality in BigQuery

There is one thing everyone will tell you if you start down the path of a data-driven project: you will spend most of your time dealing with data quality issues. This is especially true when dealing with data collected from legacy or ad hoc systems. You will suffer considerable data quality pains, since more often than not there are unclear or even undefined rules for data input in the source systems. Discovering and creating the data cleansing rules needed to mitigate these quality issues will be a considerable time-sink in any project of significant size, and will involve large parts of your organisation. This is one of the main drivers behind the data-mesh paradigm shift, where you treat data as a product and require the data sources to deliver good data quality in the first place, which is, you know, about time and a really good thing! This post, however, is not about implementing a new architecture to fix future problems. Reality is now, and data quality is still something you have to deal with.

The trend in recent years has been ELT over ETL, where transformations happen at the database level, since it is faster, cheaper and requires less infrastructure than taking data out of the database just to put it back in. One of the products recently acquired by Google Cloud is dataform, a tool to orchestrate your transformations (or SQL scripts, if you like). Views and tables are abstracted away into entities, and dataform will help you execute everything in the correct order and materialize the results as views, tables or incremental tables. This is by no means new: dbt has been doing this for years! Considering that dataform will be integrated into BigQuery, I will use that product to give you an idea of what you can do on the platform.

Dataform also comes with a general-purpose programming language: JavaScript, in the form of Node.js. This means that you can programmatically do (almost) anything to create your SQL scripts. It is, for example, easy to create a view or historical table for every table from your source system, which we will look into shortly.

Dealing with character input

Character input is probably any data scientist’s nightmare, since you are basically allowed to input whatever you want. As an example, let us take a look at a database dealing with phone numbers. In Norway there are at least three major ways to write a phone number: “123 45 678”, “12 34 56 78” and “12345678”, and the country code can be written as “+47” or “0047”. And then there are a million other ways to write it.

When dealing with character input, one of the first things you should do is get an overview of the different formats your input has. By getting the facts on the table you can reasonably tell how many rules you need to create, and most likely identify data formats that are beyond rule-based repair. It will also allow you to identify outliers pretty quickly, and you are going to be surprised by some of the input!

If we substitute every digit with the letter “d”, we would get “ddd dd ddd” and “dd dd dd dd” (and probably a few more). Then we can group by format and count to figure out the frequencies of the variations, and we would know exactly what cleansing rules to implement to mitigate issues, and also find issues we cannot mitigate. NULLs, for example, would be hard, and phone numbers with textual input such as “yes”, “i have” or “not telling” would be impossible to translate to numbers. You could replace text with “w”, for example, to create a category for these.

A SQL expression to do this could look like this:

REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(PHONE, r"[a-zA-Z]+", "w"), r"\d{5,}", "d+"), r"\d{4}", "dddd"), r"\d{3}", "ddd"), r"\d{2}", "dd"), r"\d{1}", "d") as PHONE_PROFILE,

If you have more elegant solutions to achieve the same thing please let me know in the comments!
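To get the frequency overview described above, you can group on the profiled value and count. Here is a minimal sketch, assuming a PHONE column on the example_schema.PHONE_TABLE table used later in this post (the names are just placeholders):

SELECT
  REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(
    PHONE, r"[a-zA-Z]+", "w"), r"\d{5,}", "d+"), r"\d{4}", "dddd"), r"\d{3}", "ddd"), r"\d{2}", "dd"), r"\d{1}", "d") AS format,
  COUNT(*) AS cnt
FROM example_schema.PHONE_TABLE
GROUP BY format
ORDER BY cnt DESC

This is the same profiling expression as above, just wrapped in an aggregation so you get one row per observed format.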

Dataform to the rescue

Remember that dataform can be custom programmed to help you create the SQL you need.

A utility script that takes any table and applies this profiling to a set of character columns could, for example, look like this:

function profile_table_columns(source, targetSchema, targetTable, columns) {
   // Build a comma separated list of profiling expressions, one per column.
   var profile_col_list = "";
   var first = true;

   for (const i in columns) {
      var c = columns[i];
      if (!first) {
         profile_col_list += ",";
      } else {
         first = false;
      }
      // Letters become "w", runs of five or more digits become "d+",
      // and shorter digit runs become one "d" per digit.
      profile_col_list += 'REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(' + c + ', r"[a-zA-Z]+", "w"), r"\\d{5,}", "d+"), r"\\d{4}", "dddd"), r"\\d{3}", "ddd"), r"\\d{2}", "dd"), r"\\d{1}", "d") as ' + c + '_PROFILE';
   }
   // Publish a view that selects everything from the source table plus the profile columns.
   publish({name: targetTable, schema: targetSchema}).query(ctx => `SELECT *, ${profile_col_list} FROM ${ctx.ref(source)}`).type("view");
}

The function takes the source table as input, along with a target schema and table name for where to put the result. It also takes a list of columns to generate profiling code for.

For each table you want to generate a profiling view for, the call would look something like this:

data_profile.profile_table_columns(
   {schema: 'example_schema', name: 'PHONE_TABLE'},
   'table_profiles',
   'PHONE_PROFILE',
   ['PHONE', 'COUNTRY_CODE', ...]
)

This would create a view at table_profiles.PHONE_PROFILE with content along these lines (showing only the PHONE column profile):

SELECT *, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(PHONE, r"[a-zA-Z]+", "w"), r"\d{5,}", "d+"), r"\d{4}", "dddd"), r"\d{3}", "ddd"), r"\d{2}", "dd"), r"\d{1}", "d") as PHONE_PROFILE FROM example_schema.PHONE_TABLE

Understanding the data

This SQL selects everything from the source table, but also translates the PHONE column into a format rather than a value. Most BI tools can then aggregate counts and visualize the values and formats using filters. It would also be a small step to create a history table on top of this view to track changes over time, so you can see data quality trends (a sketch of such a query follows after the table below). A result could look like this:

format       count
ddd dd ddd     700 (looks well formatted)
dd dd dd dd    500 (looks well formatted)
d+            1100 (a run of five or more digits, e.g. "12345678")
w               26 (text?)
ddw dd ddd       1 (a misspelled 0 as o)
dddd             1 (possibly not a number)
+dd d+           4 (includes country code?)
etc.
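To build the history table mentioned above, one approach is to aggregate the profiling view and stamp each row with the date the counts were taken, then append the result on a schedule (in dataform this could be materialized as an incremental dataset). A minimal sketch, assuming the table_profiles.PHONE_PROFILE view created above and that daily granularity is enough:

SELECT
  CURRENT_DATE() AS snapshot_date,   -- the day the counts were taken
  PHONE_PROFILE AS format,
  COUNT(*) AS cnt
FROM table_profiles.PHONE_PROFILE
GROUP BY format

Plotting cnt per format over snapshot_date then gives you the data quality trend lines.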

Wrapping up

With that knowledge you can start to mitigate data quality issues: create cleansing rules, set up alerts and monitoring, and establish feedback loops to the source systems so they can fix the issues you cannot. More importantly, you will be able to make a decent estimate of the time and effort you will have to put into this!
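To give an idea of what a cleansing rule could look like, here is a minimal sketch that removes whitespace and strips the "+47"/"0047" country code prefixes mentioned earlier. The table and column names are the same placeholders as above, and the rule is only an illustration, not a complete normalization:

SELECT
  REGEXP_REPLACE(
    REGEXP_REPLACE(PHONE, r"\s", ""),    -- remove all whitespace
    r"^(\+47|0047)", "") AS PHONE_CLEAN  -- strip a leading country code
FROM example_schema.PHONE_TABLE

A rule like this can of course be published through dataform in the same way as the profiling view, so the cleansed values live next to the raw input.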

There are some limitations and design decisions in dataform you need to be aware of. Unlike dbt, dataform is completely sandboxed, so you cannot generate SQL using the database itself. That means you cannot, for example, query INFORMATION_SCHEMA to generate queries, so column lists like the one above have to be maintained in code.

If I ever get the time I will publish a data quality package for dataform so these utility functions become generally available, since they are quite handy! But for now, I hope this sparks some inspiration on how you can automate data quality monitoring, which in my mind is a vital part of dataops!

