Businesses that want to leverage location data must procure high-quality datasets, because erroneous data results in false insights and, therefore, poor decision making. However, not all market participants are transparent about their data practices. In this article, we share background information about how the team at Quadrant analyses the quality of the location data we provide to our buyers, and some of the steps we take to ensure it is of the highest quality possible for their particular use cases.
It all starts with the source!
We source location data only from SDKs or server-to-server integrations, because IP address, bidstream, and cell-tower triangulation data are not nearly as accurate. Peeling back the layers of location data to assess its overall quality means we are always looking at a variety of key data metrics and attributes.
Let’s take a closer look at some of these metrics and attributes below:
- DAU/MAU Ratio
- Data Completeness
- Horizontal Accuracy
- Days Seen Per Month
- Hours Seen Per Day
- Overlapping Data
1: DAU/MAU Ratio
One of the baseline metrics we look at when analysing location data for quality is the Daily Active Users (DAU) and Monthly Active Users (MAU) ratio. In a nutshell, this helps us approximate how consistent a panel (group of mobile devices) is over the course of a month. The higher the number, the better.
As this is a high-level metric, publishers are usually able to provide these numbers immediately to help us get a general idea of the dataset.
Some data buyers may prioritise feeds with higher DAU/MAU ratios, but at Quadrant we source data from a variety of SDKs and publishers, which means we sometimes use data with a lower DAU/MAU ratio if it complements our existing dataset.
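To make the ratio concrete, here is a minimal sketch of how it can be computed from raw sightings. This is an illustration only, not Quadrant's actual pipeline; the input shape (device ID, date pairs for a single calendar month) is an assumption.

```python
from datetime import date

def dau_mau_ratio(events):
    """Average daily active users divided by monthly active users.

    events: iterable of (device_id, day) sightings within one month.
    Returns a value in [0, 1]; higher suggests a steadier panel.
    """
    daily = {}       # day -> set of device ids seen that day
    monthly = set()  # every device id seen in the month
    for device_id, day in events:
        daily.setdefault(day, set()).add(device_id)
        monthly.add(device_id)
    if not monthly:
        return 0.0
    # Average DAU over the days that have any activity
    avg_dau = sum(len(ids) for ids in daily.values()) / len(daily)
    return avg_dau / len(monthly)

sightings = [
    ("device-a", date(2023, 1, 1)),
    ("device-b", date(2023, 1, 1)),
    ("device-a", date(2023, 1, 2)),
]
ratio = dau_mau_ratio(sightings)  # average DAU of 1.5 over an MAU of 2
```

A device that appears every day of the month pushes the ratio towards 1, which is why a higher value indicates a more consistent panel.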
2: Data Completeness
The amount of data captured is dependent on a number of factors including device hardware, SDK collection methodology, user opt-in permission, etc. As such, one common issue seen with location data is incomplete or missing data fields.
At Quadrant, we developed a metric known as “Data Completeness” (the percentage of each data attribute that contains verifiable data). This allows data buyers to quickly and easily assess the percentage of missing data points in each attribute.
As an example, latitude, longitude, timestamp, and horizontal accuracy are core attributes of location data. Without these fields, the data is essentially useless from a geospatial practitioner point of view.
When onboarding data, we want to understand how much of the data in those fields is missing; too much missing data, and we would simply be wasting time and resources.
At Quadrant, we always aim for datasets with 100 per cent completeness for the core fields. Other attributes, such as country code, operating system, or user agent tend to be given a bit more leeway.
It is worth pointing out that we are able to filter our data feeds for our buyers, such as in cases where they want missing fields removed and only those fields with 100 per cent completeness included.
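As a rough sketch of how such a completeness check might work, the snippet below reports the percentage of records carrying a usable value for each attribute. The record layout and field names are illustrative assumptions, not our internal schema.

```python
def completeness(records, fields):
    """Percentage of records carrying a non-empty value for each field."""
    total = len(records)
    pct = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        pct[field] = 100.0 * present / total if total else 0.0
    return pct

rows = [
    {"latitude": 1.30, "longitude": 103.80, "timestamp": 1680000000, "country": "SG"},
    {"latitude": 1.29, "longitude": 103.85, "timestamp": 1680000060, "country": ""},
]
report = completeness(rows, ["latitude", "timestamp", "country"])
```

In this toy batch, the core fields score 100 per cent while the optional country field scores 50 per cent, which mirrors the leeway described above.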
3: Horizontal Accuracy
Another key metric we always consider is Horizontal Accuracy (HA). An HA of 10 metres or below is generally considered very good (for GPS data). In fact, we tend to reject data sources with high HA values (i.e. low precision) because they are not suitable for analytical purposes. It is worth noting that HA can vary with a user's environment and the weather. For example, in certain built-up areas, or if there is bad weather, readings can be less accurate. Conversely, clear skies and an open line of sight to satellites will likely result in better HA.
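A simple accuracy filter along these lines might look like the following sketch. The `ha_metres` field name and the decision to drop records with no reported accuracy are assumptions for illustration.

```python
def filter_by_accuracy(events, max_ha_metres=10.0):
    """Keep only fixes whose reported horizontal accuracy is within threshold.

    Records missing an accuracy value are dropped, since their precision
    cannot be verified.
    """
    return [
        e for e in events
        if e.get("ha_metres") is not None and e["ha_metres"] <= max_ha_metres
    ]

fixes = [
    {"lat": 1.30, "lon": 103.80, "ha_metres": 5.0},    # GPS-grade fix
    {"lat": 1.29, "lon": 103.85, "ha_metres": 150.0},  # likely cell- or IP-derived
    {"lat": 1.28, "lon": 103.86, "ha_metres": None},   # unverifiable
]
good = filter_by_accuracy(fixes)
```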
4: Days Seen Per Month
Days Seen Per Month is a metric that gets even more granular than DAU/MAU: it lets us see how devices are distributed by the number of days on which each was observed over the course of a month.
At Quadrant, we are always on the lookout for quality location data feeds in which devices are seen above a threshold number of days per month (the higher, the better).
However, as with other metrics, in some cases we will accept data with a lower number of days seen per month if it complements our existing dataset. This is particularly true if it helps fill in missing information on a user’s journey, such as between two locations.
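Computing this distribution is straightforward; the sketch below counts the distinct days on which each device appears, again assuming (device ID, date) sightings rather than our actual feed format.

```python
from collections import defaultdict
from datetime import date

def days_seen_per_month(events):
    """Map each device to the count of distinct days it was observed.

    events: iterable of (device_id, day) sightings within one month.
    """
    seen = defaultdict(set)
    for device_id, day in events:
        seen[device_id].add(day)  # a set ignores repeat sightings on one day
    return {device_id: len(days) for device_id, days in seen.items()}

sightings = [
    ("device-a", date(2023, 1, 1)),
    ("device-a", date(2023, 1, 2)),
    ("device-a", date(2023, 1, 2)),  # repeat sighting on the same day
    ("device-b", date(2023, 1, 5)),
]
counts = days_seen_per_month(sightings)
```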
5: Hours Seen Per Day
The number of Hours Seen Per Day, like Days Seen Per Month, is usually more valuable for most use cases when it is higher.
This should be obvious: a higher number means we are recording a more complete picture of a user's daily activity in terms of where they are located on an hour-by-hour basis.
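The same distinct-count idea applies within a single day. As a minimal sketch, counting the distinct clock hours covered by one device's fixes gives the metric:

```python
from datetime import datetime

def hours_seen(timestamps):
    """Count the distinct clock hours covered by one device's fixes in a day."""
    return len({ts.hour for ts in timestamps})

day_of_fixes = [
    datetime(2023, 1, 1, 8, 0),
    datetime(2023, 1, 1, 8, 45),  # same hour as the first fix
    datetime(2023, 1, 1, 12, 30),
]
coverage = hours_seen(day_of_fixes)  # 2 distinct hours covered
```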
6: Overlapping Data
Most vendors consolidate data from multiple sources to generate their datasets. As a result, suppliers often provide datasets that have a substantial amount of overlapping records. Usually, vendors do not remove such data because retaining it makes their feeds more voluminous. This is a problem because clients require unique values to execute analyses that result in meaningful insights.
Our data science team has created a robust overlap analysis model that gauges the degree of commonality or uniqueness across multiple supplier feeds. This allows us to determine whether to retain or remove a particular feed based on its contribution to the overall data pool. Our selectivity around supplier feeds and understanding of a client's use case allows us to deliver customised, cost-effective datasets that contain minimal overlapping data.
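The model itself is proprietary and not described here, but a Jaccard-style overlap measure over event keys (for example, device, coordinate, and timestamp tuples) captures the basic idea:

```python
def overlap_fraction(feed_a, feed_b):
    """Jaccard-style overlap of two feeds' event keys.

    0.0 means the feeds are disjoint; 1.0 means they are identical.
    """
    a, b = set(feed_a), set(feed_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

keys_a = {("dev-1", 1.30, 103.80, 1680000000),
          ("dev-2", 1.29, 103.85, 1680000060)}
keys_b = {("dev-2", 1.29, 103.85, 1680000060),
          ("dev-3", 1.28, 103.86, 1680000120)}
shared = overlap_fraction(keys_a, keys_b)  # 1 shared key out of 3 total
```

A feed whose overlap with the existing pool is very high adds little unique signal, which is the basis for the retain-or-remove decision described above.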
We have been working on improving our Data Noise Algorithm to improve efficiency and analytical performance and reduce latency, and over the years we have made some developments that promise significant benefits. The Data Noise Algorithm weeds out events that occurred more than seven days before the data was received. By filtering out these outdated events, we ensure that the data we deliver to our customers is recent and relevant. Reducing latency also reduces file sizes, making data delivery more efficient.
The Data Noise Algorithm is also used by our data engineering team to perform latency analysis. Our goal is to gather and deliver data on the most recent events. A latency analysis is performed immediately after a supplier's file is delivered into our systems, which helps us determine which feeds are adding the most value. Using the Data Noise Algorithm, we can determine the speed at which a supplier delivers data and assess quality based on how much data is filtered out. By consistently delivering low-latency, high-quality data, we significantly improve the value of the data for buyers.
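The algorithm itself is in-house, but the seven-day staleness filter it applies can be sketched as follows. The `timestamp` field name and record shape are assumptions for illustration.

```python
from datetime import datetime, timedelta

def drop_stale_events(events, received_at, max_age_days=7):
    """Discard events timestamped more than max_age_days before receipt."""
    cutoff = received_at - timedelta(days=max_age_days)
    return [e for e in events if e["timestamp"] >= cutoff]

received = datetime(2023, 1, 10, 12, 0)
batch = [
    {"id": "fresh", "timestamp": datetime(2023, 1, 9, 8, 0)},
    {"id": "stale", "timestamp": datetime(2023, 1, 1, 8, 0)},
]
kept = drop_stale_events(batch, received)
```

The fraction of a supplier's batch that survives this filter is one simple signal of how fresh, and therefore how valuable, that feed is.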
In addition to leveraging the aforementioned methods and metrics to evaluate location data, we also execute several other processes to make sure our datasets are of superior quality.
To further minimise the presence of duplicate data in our final feed, we utilise a unique identifier called Quad ID (which we generate by hashing device IDs and device IDFVs with an encryption key). Our deduplication algorithm analyses datasets and isolates multiple events that have four attributes (Quad ID, latitude, longitude, and timestamp) in common. The algorithm then retains only one copy of the event and deletes all the other records.
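In outline, that deduplication step amounts to keeping the first record per four-attribute key. This is a simplified sketch, not our production algorithm, and the field names are assumptions.

```python
def deduplicate(events):
    """Keep the first record per (quad_id, latitude, longitude, timestamp) key."""
    seen = set()
    kept = []
    for e in events:
        key = (e["quad_id"], e["latitude"], e["longitude"], e["timestamp"])
        if key not in seen:
            seen.add(key)
            kept.append(e)
    return kept

records = [
    {"quad_id": "q1", "latitude": 1.30, "longitude": 103.80, "timestamp": 1680000000},
    {"quad_id": "q1", "latitude": 1.30, "longitude": 103.80, "timestamp": 1680000000},  # duplicate
    {"quad_id": "q1", "latitude": 1.29, "longitude": 103.85, "timestamp": 1680000060},
]
unique = deduplicate(records)
```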
At Quadrant, we pride ourselves on providing our partners the highest quality location data, based on their specific data needs. Although this blogpost does not delve into all our in-house data evaluation procedures, we hope it provides valuable insights on how to assess the location data feeds you are investing in.
Access high quality mobile location datasets for your next BI project
Fill out the form and one of our data consultants will get in touch with you!