Refining Raw Turn Data in BigQuery

Nathan Begbie Updated by Nathan Begbie

While Turn.io does offer an internal insights dashboard as well as built-in metrics for things like Reminders, we know that there will always be edge cases or specific data queries that customers have. In order to meet this need, customers can export their data to BigQuery. This enables them to be able to run performant queries against all of their data, as well as syncing this data to powerful dash-boarding tools like Looker Studio (previously called Data Studio).

However, it can sometimes be overwhelming working with raw data, straight from the database. There are also some gotchas that many of our clients face when dealing with BigQuery data. The main one we’ll deal with in this article is how to deal with outdated data.

When we stream data to BigQuery, we don’t overwrite existing data, we simply append new data to the table.

This means that there are multiple rows on a database that could represent the same contact. So doing something like:

SELECT count(id)
FROM contacts

Will give you an inflated count of contacts. You can confirm this by running the following:

SELECT id, count(*) as count
FROM contacts
GROUP BY id
ORDER BY count DESC
LIMIT 10

So in order to get a table of only the latest contact data with the following query:

WITH
ranked_contacts as (
SELECT
*,
-- this allows us to deal with duplicate rows
ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) as rank
FROM contacts
)
SELECT *
FROM ranked_contacts
WHERE rank = 1

  

For messages, you would take a similar approach:

WITH
ranked_messages as (
SELECT
*,
-- this allows us to deal with duplicate rows
ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) as rank
FROM messages
)
SELECT *
FROM ranked_messages
WHERE rank = 1

Another way that we might want to use this strategy is when dealing with the message status table. In this case, we only have duplicate statuses per message in very rare circumstances. Instead, what we want to do is get the last status that was given to a message. We can do so by using the same PARTITION BY __ ORDER BY __ pattern:

WITH ranked_statuses AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY inserted_at DESC) as rank
FROM statuses)
SELECT *
FROM ranked_statuses
WHERE rank = 1;

Was this article helpful?

How to top up your 'available amount' for conversations?

Example queries for your data

Contact