May 27, 2021 Suman Joshi

Common Problems With Location Data and How to Fix Them

Geospatial data has the potential to uncover valuable insights about the physical world. Governments can use it to save lives and improve public welfare. Businesses can use it to drive consumer acquisition by capturing the attention of the right people at the right time. Anonymized location data has a lot of value to offer without breaching people’s privacy.

However, there are several concerns surrounding the reliability and authenticity of location data. Many location data providers fixate on accumulating large volumes of data to entice and win buyers, and quality often suffers as a result. The outcome? Despite sophisticated analysis and resources, companies end up with mediocre results at best. Even the most advanced algorithms cannot differentiate between good and bad data.

The sources of location data, the vendors you choose to work with, and their internal business practices all affect the quality of data and, ultimately, your business decisions and ROI. Research by Gartner shows that poor data quality costs a business about $15 million per year on average.

In this blog post, we discuss some challenges facing the location data economy and how buyer awareness can help combat these.

Problems With Bidstream Data ➡️ Volume ≠ Quality

Often, buyers only care about volume and metrics like daily active users (DAUs) over deeper quality considerations. In reality, most high-DAU data feeds are full of low-quality records.

Bidstream data is a good example: impressive volume, but questionable quality. Bidstream data is collected from bid requests in programmatic advertising. Every time someone uses an app on their phone and an advertisement pops up, the app sends data such as the device ID and IP address back to the ad servers. In this case, the location data is derived from cached information stored within the phone’s location settings or by converting IP addresses to default coordinates (lat/long). These methods return highly inaccurate location data.

Even though it offers high volumes, Bidstream is considered one of the lowest-quality sources of location data. So, despite its weak reputation and known flaws, why do businesses still use Bidstream data? The answer: it is cheap! But you get what you pay for.

The Solution: Identify Bidstream Data and Seek Mobile Location Data

There are a few easy identifiers of Bidstream data. Check if your vendor is sharing specific app names. App names are typically only provided if the data is being gathered via Bidstream. Apps in the Religious and Lifestyle & Health categories are considered sensitive and personal, and disclosure of such information is in violation of developer guidelines and data privacy laws, both for the sellers and buyers. The presence of app names in a dataset is a dead giveaway that the data has been procured from real-time bidding (RTB), where location data is a ‘by-product’.

Also, check if your data supplier is providing GPS data. A dataset derived from the Bidstream via reverse IP lookup will not have GPS signals. Being aware of these details is valuable when evaluating location data sources.

Better-informed businesses typically use mobile location data or GPS data. In this case, an SDK collects the user's real-world GPS location from the apps on the user's phone. This type of location data requires a direct integration of the SDK into individual applications, so the volumes will be significantly lower compared to Bidstream. SDKs collect high-precision GPS location data promising more accuracy and reliability. Ask your vendor about their data sources! Conducting proper knowledge transfer and evaluation of data before making a purchase decision can reveal the most glaring issues upfront, saving you valuable time and business resources.

However, improper handling of SDK-derived location data can also result in the degradation of quality. Issues like high data preparation costs, duplicates, and fraudulent records are common with mobile location data.

Cleaning and Normalizing Location Data is Complex and Costly

Raw mobile location data is often riddled with issues such as missing fields, irregular timestamps, and wrongly formatted values. In a location dataset, attributes such as device ID, latitude, longitude, horizontal accuracy, and IP address are all important, and if any of these fields contain invalid or unrealistic data, the entire row becomes worthless.
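As an illustration, a row-level check for the issues described above might look like the following minimal sketch. The field names (`device_id`, `latitude`, `longitude`, `horizontal_accuracy`, `timestamp`) and the ISO 8601 timestamp format are assumptions made for the example; real feeds will use their own schema.

```python
from datetime import datetime

# Hypothetical field names; adjust to the schema of the feed being evaluated.
REQUIRED_FIELDS = ("device_id", "latitude", "longitude",
                   "horizontal_accuracy", "timestamp")


def is_valid_record(record):
    """Return True if every required field is present and plausible."""
    # Reject rows with missing or empty fields.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
        accuracy = float(record["horizontal_accuracy"])
        # Assumes ISO 8601 timestamps such as "2021-05-27T10:00:00".
        datetime.fromisoformat(record["timestamp"])
    except (TypeError, ValueError):
        return False  # wrongly formatted value
    # Coordinates must fall inside the valid latitude/longitude ranges.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return False
    # Horizontal accuracy is a radius, so it cannot be negative (or NaN).
    if not accuracy >= 0:
        return False
    return True
```

Running a check like this over a vendor's sample file and counting the rejected rows gives a quick first estimate of how much preparation the data would need.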

Processing high volumes of data on a regular basis can be tedious and time-consuming, and it requires powerful processing infrastructure. Additional data preparation is expensive, and most businesses do not have the in-house infrastructure and resources for it; they prefer plug-and-play data for their analytical platforms and apps. Moreover, excessively filtering location data can reduce the number of records to a degree where it is impossible to derive insights from them.

The Solution: Assess Vendors’ In-House Quality Assurance

Without proper quality assessment, you are likely to be burdened with preparing the raw data for analysis yourself, which can be time-consuming and expensive. Most capable data practitioners have in-house teams and processes to clean, normalise and prepare data before transferring it to customers.

At Quadrant, we have built a proprietary Data Noise Algorithm to cleanse data sets and offer ready-to-use mobile location data that requires little to no preparation. Those who understand the location data industry know which sources are reliable and have experience identifying and correcting flaws, rather than drastically reducing data counts. Buyers are advised to ask vendors about their data normalisation process and test sample data for missing and incorrect values.

Duplicate Data is More Common Than You Think

The number of SDKs publishers install into their apps is constrained by battery use and device performance. Therefore, the number of original data sources is limited and one original record is often purchased by several data traders. While this is not done maliciously, buyers are at risk of purchasing the exact same data or a significant amount of overlapping data from two or more sellers.

An easy method to identify whether two sellers are providing the same data is by simply comparing their data and looking for any duplicates or overlaps. However, bad actors know this and will often modify their data in an attempt to deny the presence of duplicate values.
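A simple version of this comparison can be sketched as follows. The record layout and the choice of key (device ID, latitude, longitude, timestamp) are assumptions for illustration, and the check only finds exact matches, so records that have been deliberately altered will slip through.

```python
def record_key(record):
    """Build a comparison key from hypothetical field names."""
    return (record["device_id"], record["latitude"],
            record["longitude"], record["timestamp"])


def overlap_ratio(records_a, records_b):
    """Share of the smaller dataset's unique records found in the other one."""
    keys_a = {record_key(r) for r in records_a}
    keys_b = {record_key(r) for r in records_b}
    if not keys_a or not keys_b:
        return 0.0
    shared = keys_a & keys_b
    return len(shared) / min(len(keys_a), len(keys_b))
```

A ratio close to 1.0 between samples from two sellers suggests that both are reselling data from the same original source.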

The Solution: De-Duplicate Data to Reveal Unique Data Counts

When it comes to location data, less can really be more. Duplicate records make the data seem voluminous, but quantity does not mean quality. Combining data from multiple sources requires normalisation and deduplication for consistency. To avoid paying more based on data counts, buyers must evaluate sample data to scan and weed out duplicate values.

Quadrant’s mobile location data is processed through a deduplicating algorithm focusing on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. This algorithm checks if any of the rows contain the exact same combination of these four attributes and keeps only one copy, eliminating duplicate values and ensuring that customers only pay for complete, unique datasets. We also perform exhaustive overlap analysis to efficiently eliminate duplicate data coming in from suppliers.
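The idea of keeping one copy per combination of these four attributes can be sketched in a few lines. This is an illustrative example under assumed field names, not Quadrant's production algorithm.

```python
def deduplicate(records):
    """Keep the first record seen for each (device, lat, long, time) combination."""
    seen = set()
    unique = []
    for record in records:
        # Field names are hypothetical; map them to the actual schema.
        key = (record["device_id"], record["latitude"],
               record["longitude"], record["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Comparing `len(records)` with `len(deduplicate(records))` on a sample shows how much of the advertised volume is actually unique.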

Be Aware of Fake or Fraudulent Data

The demand for volume over quality has led to the prevalence of fraudsters, and the lack of due diligence helps them continue with their malpractice, undetected. Spotting fraudulent data can be difficult. For example, data providers can change the timestamps by a few seconds, manipulating data to look more voluminous or diverse. They might also tweak the coordinates by a few decimal points. The motivation? To maximise profits by selling high volumes of location data. Since these data are bound to be low quality and erroneous, buyers can end up with meaningless, or worse, wrong insights and eventually unprofitable business decisions.

The Solution: Perform Basic Visual Analysis on Sample Data

Data buyers should avoid focusing exclusively on volume: no amount of data or algorithms can compensate for inherent inaccuracies. Without a detailed analysis (which is time-consuming and resource-intensive), it might be impossible to differentiate between legitimate and tampered location data. To detect fraudulent data in the evaluation phase, plot sample data on a map and look for sudden teleportation (the same device covering an unrealistic distance in a short period of time), lack of movement, too many devices in an unexpected location (e.g., the middle of the ocean), and similar anomalies. A large number of such anomalies is a good indicator of fraudulent data.
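The teleportation check can also be automated alongside the visual inspection. The sketch below computes the great-circle (haversine) distance between consecutive pings of each device and flags pairs that imply an implausible speed. The input layout and the 1,000 km/h threshold are assumptions chosen for the example and should be tuned to the use case.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 1000.0  # assumed ceiling, roughly the cruising speed of an airliner


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def find_teleports(pings, max_speed_kmh=MAX_SPEED_KMH):
    """Return (device_id, earlier_ts, later_ts, speed_kmh) for implausible jumps.

    `pings` is an iterable of (device_id, unix_seconds, latitude, longitude).
    """
    by_device = defaultdict(list)
    for device_id, ts, lat, lon in pings:
        by_device[device_id].append((ts, lat, lon))

    flagged = []
    for device_id, points in by_device.items():
        points.sort()  # order each device's pings by time
        for (t1, lat1, lon1), (t2, lat2, lon2) in zip(points, points[1:]):
            distance = haversine_km(lat1, lon1, lat2, lon2)
            hours = (t2 - t1) / 3600.0
            if hours > 0:
                speed = distance / hours
            else:
                # Same timestamp: any displacement implies infinite speed.
                speed = math.inf if distance > 0 else 0.0
            if speed > max_speed_kmh:
                flagged.append((device_id, t1, t2, speed))
    return flagged
```

A handful of flagged pairs may simply reflect GPS noise, so the share of devices with flagged jumps is usually more telling than any single hit.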

Watch Out for Latency

The speed and frequency of data delivery translate into faster analysis and therefore faster decisions and actions. This matters especially in the ever-evolving location data landscape, where even a slight movement of devices can mean entirely different insights and results. Most data buyers request low-latency data. However, a substantial percentage of supplier data is weeks, and sometimes even months, old. This can negatively impact analysis for businesses, leaving them with subpar and outdated insights.

The Solution: Determine Adequate Latency for Your Use Case

The impact of latency can vary based on your specific use case and business need. Determine whether and how a few days of latency will impact your decision-making. For example, solving traffic issues, preventing crime, and monitoring social distancing all need the most recent data. But use cases like measuring OOH advertisement ROI, finding travel patterns, or researching audience behaviors can still benefit even if the data is a little older. In fact, in some cases, historic data is required and requested to perform meaningful analysis. Once you have performed an internal assessment, you will be able to determine if a particular vendor's data suits your needs.

Based on customer requests, Quadrant’s Data Noise Algorithm weeds out events that occurred more than 7 days before the data was received (unless historic data is requested). By filtering out these outdated events, we ensure that the data we deliver to our customers is recent and relevant. Removing stale events also reduces file sizes, making data delivery more efficient.
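Buyers can apply the same kind of recency filter to sample data on their side. The following is a minimal sketch, assuming each event is a dictionary with a `timestamp` field holding a `datetime`; the 7-day window mirrors the figure above, and the function is an illustration rather than Quadrant's implementation.

```python
from datetime import datetime, timedelta


def drop_stale_events(events, received_at, max_age=timedelta(days=7)):
    """Keep only events no older than `max_age` at the time of receipt."""
    cutoff = received_at - max_age
    # `timestamp` is a hypothetical field name holding a datetime object.
    return [event for event in events if event["timestamp"] >= cutoff]
```

Comparing the number of events before and after the filter shows what share of a delivery is older than the window your use case can tolerate.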
