Data analysis plays a crucial role in detecting bad, misleading, and even fraudulent data. The data economy is full of such low-quality data, especially when dealing with location data.
Providing the highest-quality location data should be the goal of data providers, and so this is an issue close to home for those of us playing this role in the industry - an issue to be taken seriously.
Let’s consider some of the various sources of location data, and what data feed characteristics tend to indicate about quality.
Bidstream data, which is collected from advertising bid requests, is one of the lowest-quality sources of location data. Every time a user opens an app on their smartphone and an advertisement pops up, the app routes data such as device ID, IP address, and more back to the ad servers.
The accuracy of bidstream data is weak at best. Given this fact, it’s no surprise that this source of location data is not exactly preferred by advertisers trying to run highly targeted campaigns. But it goes beyond advertisers, impacting any location data buyers who want to build location-based solutions.
Image of Bidstream Data in United States
Bidstream data, though available in large quantities compared to other sources, is an unreliable source of location data. Latitude and longitude values sourced from bidstream data are most likely to be derived from IP addresses. This manner of obtaining lat/long figures is notoriously inaccurate.
Some database providers claim to be able to convert IP addresses to lat/long coordinates to determine device location, but this is highly unlikely to work to within any useful accuracy.
In the analysis of location data sets, it’s not uncommon to identify large amounts of data pointing to the exact same coordinates. This can only mean one thing – extreme inaccuracy. The now infamous Kansas farm case illustrates this misleading phenomenon well.
As reported at the time, a Kansas family’s remote farm was visited “countless times” by police trying to find missing people, hackers, identity fraudsters and stolen cars, all because of a glitch at a digital mapping company. Multiple IP-to-location database companies have this issue.
This was the result of attempts to derive exact location data from weak sources such as IP addresses and bidstream data. Clearly poor at the best of times, it can be a disaster at the worst.
So what’s going on? According to a report in The Guardian, the problem stems from the fact that when location database providers can’t find lat/long coordinates for a device, they automatically point it to the middle of the U.S. – which just happens to be Kansas.
This and similar practices in other countries can result in tens or hundreds of thousands of people appearing to be located in the middle of a park, a suburban area, or even a lake!
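As a rough illustration of how such pile-ups might be spotted in a feed, the sketch below counts distinct devices per exact coordinate and reports coordinates shared by an implausible number of them. The record layout `(device_id, lat, lon)`, the function name, and the `min_devices` threshold are all assumptions made for this example, not a description of any provider's actual pipeline.

```python
from collections import defaultdict


def find_coordinate_pileups(pings, min_devices=100):
    """Return {(lat, lon): device_count} for coordinates shared by many devices.

    `pings` is an iterable of (device_id, lat, lon) tuples. Both the layout
    and the default threshold are illustrative assumptions.
    """
    devices_at = defaultdict(set)
    for device_id, lat, lon in pings:
        # A set ensures repeated pings from one device are counted once.
        devices_at[(lat, lon)].add(device_id)
    return {
        coord: len(devices)
        for coord, devices in devices_at.items()
        if len(devices) >= min_devices
    }


# Made-up sample: three devices report the exact same default coordinate.
sample = [
    ("a", 38.0, -97.0),
    ("b", 38.0, -97.0),
    ("c", 38.0, -97.0),
    ("a", 38.0, -97.0),
    ("d", 51.5, -0.12),
]
pileups = find_coordinate_pileups(sample, min_devices=3)
```

A sensible threshold depends on the venue: a stadium can legitimately hold thousands of devices, so in practice the cut-off would need to be tuned against what is reasonable for the place in question.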
Image of Bidstream Data Comparison
Mobile GPS data
Which brings us to the question of higher quality sources of location data.
GPS is far higher quality than bidstream data and is of the most value to data buyers. The downside is that it’s available in less volume and is, unsurprisingly, more expensive. But the adage is true: you tend to get what you pay for.
As mentioned in the opening, ensuring supply of the highest quality location data is a priority for data providers across the industry. Performing consistent quality checks on data supply chains is a crucial step. Here are some red flags that location data providers can look out for:
- Lack of movement – this tends to be an indicator of low-quality location data, whereas high-quality data shows lots of movement.
- “Kansas farm” (and other similar phenomena) – lots of people at the same coordinate, beyond what’s to be reasonably expected, is always a red flag.
- Teleportation – by this we mean the same device appearing in multiple countries or regions within the same 24-hour period.
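The lack-of-movement and teleportation flags can be approximated with simple per-device checks. The sketch below is one possible approach under stated assumptions: records are `(device_id, unix_timestamp, lat, lon)` tuples, and teleportation is approximated by an implausible implied speed between consecutive pings rather than by an actual country or region lookup, which would need a reverse geocoder. The function names and both thresholds are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def flag_devices(pings, min_move_km=0.05, max_speed_kmh=1000.0):
    """Return (stationary_ids, teleporting_ids) as two sets of device IDs.

    Thresholds are illustrative assumptions: a device is 'stationary' if its
    whole track covers less than `min_move_km`, and 'teleporting' if any hop
    implies a speed above `max_speed_kmh`.
    """
    by_device = defaultdict(list)
    for device_id, ts, lat, lon in pings:
        by_device[device_id].append((ts, lat, lon))

    stationary, teleporting = set(), set()
    for device_id, track in by_device.items():
        track.sort()  # order pings by timestamp
        total_km = 0.0
        for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
            km = haversine_km(la0, lo0, la1, lo1)
            total_km += km
            # Floor the gap at one second to avoid division by zero.
            hours = max(t1 - t0, 1) / 3600.0
            if km / hours > max_speed_kmh:
                teleporting.add(device_id)
        if len(track) >= 2 and total_km < min_move_km:
            stationary.add(device_id)
    return stationary, teleporting
```

A stationary device is not proof of bad data on its own (a phone left on a desk is stationary too), so these flags are better read as shares across a whole feed than as verdicts on individual devices.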
Manual data visualisation helps to spot such instances of low-quality or fraudulent data and remove them before they reach data buyers. In addition, our data noise filtering technology plays a key part in our analysis, helping to ensure high-quality data for data buyers (read more about that here).
Image of SDK Data
Educating the industry on quality location data
Going forward, one of our key focuses remains educating data buyers and the wider industry on these problems. The reality is that many simply care about volume and metrics like daily active users (DAUs) over deeper quality considerations.
But a deep dive into the nature of the data being purchased will often reveal that it is not quite up to par with expectations. Some DAU data feeds may on the surface point to high quality, but in reality are full of low-quality data when examined beyond volume and device counts.
Bidstream data fits this description – while it usually includes more DAUs, most devices are spotted in stationary positions, which makes it easy for publishers to sell the data for higher profits.
So how do we decide when data is fraudulent and when it’s simply misleading or low-quality? As this is the subject of the article, it seems important for us to attempt to make a distinction. Intent to mislead or trick seems to be the key consideration here.
Manipulation of the source data, for example, would more likely point to data fraud. Providers can do this by changing the time stamps by a few seconds in order to make it seem like a different data set. Alternatively, they may tweak the lat/long coordinates by a few decimal places.
Our analysis, however, always catches these instances of tampering.
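To make the idea concrete, here is a minimal sketch of one way such tampering might be surfaced: measuring how many records in one feed closely shadow a record in another feed for the same device. This is an illustration only and not necessarily the method used in our analysis; the record layout `(device_id, unix_timestamp, lat, lon)`, the function name, and the tolerances are all assumptions.

```python
from collections import defaultdict


def near_duplicate_share(feed_a, feed_b, max_dt_s=5, max_deg=0.0001):
    """Share of records in feed_b that closely shadow a record in feed_a.

    Two records 'shadow' each other when they share a device ID, their
    timestamps differ by at most `max_dt_s` seconds, and both coordinates
    differ by at most `max_deg` degrees. Tolerances are illustrative.
    """
    a_by_device = defaultdict(list)
    for device_id, ts, lat, lon in feed_a:
        a_by_device[device_id].append((ts, lat, lon))

    feed_b = list(feed_b)
    if not feed_b:
        return 0.0

    hits = 0
    for device_id, ts, lat, lon in feed_b:
        # Compare only against records for the same device in feed_a.
        for ts_a, lat_a, lon_a in a_by_device.get(device_id, ()):
            if (abs(ts - ts_a) <= max_dt_s
                    and abs(lat - lat_a) <= max_deg
                    and abs(lon - lon_a) <= max_deg):
                hits += 1
                break  # count each feed_b record at most once
    return hits / len(feed_b)
```

A share close to 1.0 between two supposedly independent feeds would suggest that one is a lightly perturbed copy of the other. The inner loop is quadratic per device, so a production version would likely sort or index records by time first.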
The motivation? Almost always to maximise profits by selling location data known to be low quality. But the industry’s demand for volume over quality has also made it easier for data fraudsters to go undetected and get away with their tricks.
Find out more about how you can evaluate the data you are investing in with five simple metrics here.