Achieving Optimal Location Data Quality with Deduplication and Noise Filtering


A few weeks ago, we discussed the seven most important parameters for data buyers to consider before purchasing location data. One of those factors is the health and quality of data. Is the data complete, accurate, and usable? Does it need additional preparation? How many unique values does the dataset actually contain?

A location data provider typically aggregates large volumes of information from a variety of sources in different formats. However, the lack of data preparation standards can lead to false insights and, ultimately, to wrong business decisions.

Therefore, customers must be vigilant about where they purchase data and what their vendor’s data quality assurance procedure is.

Without proper data quality assessment, you are likely to be burdened with preparing the raw data for analysis yourself, which can be time-consuming and expensive.

Here’s how Quadrant cleanses its mobile location data and delivers it ready to use, requiring little to no preparation depending on your use case.

 

Data Noise Filters

As a data buyer, you must understand how your vendor achieves a high degree of precision by identifying and removing noise. In location data, attributes such as Device ID, latitude, longitude, horizontal accuracy, and IP address are all important; if any of these fields contains invalid or unrealistic data, the entire row becomes worthless. (Visit our Knowledge Base to learn about the various location data attributes.)

Here are some examples of what noisy data looks like:

• Device ID (character length greater than 36): fd4678e5-8bb2-4bde-ae01 fac856df80b08984-6749-4386-bb7d-b368abf6a2ad
• Horizontal Accuracy (unrealistically low value): 0.01 meters
• Latitude / Longitude (invalid coordinates): 0
• Device ID type (invalid value): AAAID
• Any field (missing data): NULL, empty, or 0 values

Even as little as 1 percent noise can significantly reduce the usefulness of a dataset, so it is important to remove as much of it as possible. At Quadrant, we remove bad data using our internal noise filtering algorithm.

Our noise filters:

1. Remove invalid Device IDs and Device IDs that are NULL or empty (e.g. value 0000-000…).
2. Eliminate rows that contain lat/long coordinates with invalid values.
3. Remove records whose IP address or timestamp is 0, NULL, or empty.
4. Filter out records whose horizontal accuracy falls outside a realistic threshold.
5. Normalise Device ID types so that only AAID and IDFA remain.
6. Normalise Device OS values so that only Android and iOS remain.

For example, we know that the Device ID field should contain exactly 36 characters. When processing the data, the noise filters identify rows where the Device ID field has an invalid entry and scrub them from the overall dataset. Learn more about Quadrant’s Noise Filtering algorithm.
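To make these checks concrete, here is a minimal sketch of this kind of noise filtering in pandas. This is not Quadrant’s actual algorithm: the column names, accuracy bounds, and sample rows are assumptions for illustration only.

```python
import pandas as pd

# Toy raw feed -- the schema, thresholds, and values below are illustrative
# assumptions, not Quadrant's actual pipeline parameters.
raw = pd.DataFrame({
    "device_id": ["fd4678e5-8bb2-4bde-ae01-fac856df80b0", "0000-000", None],
    "id_type":   ["aaid", "AAAID", "idfa"],
    "os":        ["android", "Android", "ios"],
    "latitude":  [1.3521, 0.0, 48.8566],
    "longitude": [103.8198, 0.0, 2.3522],
    "horizontal_accuracy": [12.0, 0.01, 25.0],
    "ip_address": ["203.0.113.7", "0", "198.51.100.24"],
    "timestamp":  [1672531200, 1672531260, 0],
})

# A valid Device ID is a 36-character UUID (8-4-4-4-12 hex groups).
UUID_PATTERN = r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"

MIN_ACCURACY_M, MAX_ACCURACY_M = 1.0, 200.0  # assumed realistic accuracy bounds

valid = (
    raw["device_id"].fillna("").str.match(UUID_PATTERN)    # well-formed 36-char ID
    & raw["id_type"].str.upper().isin(["AAID", "IDFA"])    # reject types like AAAID
    & raw["latitude"].between(-90, 90)
    & raw["longitude"].between(-180, 180)
    & ~((raw["latitude"] == 0) & (raw["longitude"] == 0))  # drop (0, 0) coordinates
    & raw["horizontal_accuracy"].between(MIN_ACCURACY_M, MAX_ACCURACY_M)
    & (raw["ip_address"].fillna("0") != "0")               # no NULL or zero IPs
    & (raw["timestamp"].fillna(0) != 0)                    # no NULL or zero timestamps
)

clean = raw[valid].copy()
# Normalise the ID type and OS labels to the two canonical pairs.
clean["id_type"] = clean["id_type"].str.upper()
clean["os"] = clean["os"].str.lower().map({"android": "Android", "ios": "iOS"})
```

Each failed check invalidates the entire row, mirroring the principle above that one bad field makes the whole record worthless; a production pipeline would apply the same rules at scale rather than in pandas, but the logic is identical.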

 

Deduplication

Combining data from multiple sources requires normalisation and deduplication for consistency. Due to the interconnected nature of the location data ecosystem, the same data is often purchased by different entities and subsequently sold or transferred between them, so the same record frequently appears in a dataset multiple times. Duplicate records make the data seem voluminous, but quantity does not equal quality.

Duplicate data will artificially inflate your event count per device. The picture below is an example of this scenario:

[Figure: sample dataset containing duplicate records ingested from multiple sources]

Device ID ca990773-6a5a-4152-a233-740a0d5abdf2 will be counted as having three events, while Device ID d3b7ad73-f300-4f3b-b573-10721f0b896d will only have one event.

In reality, both of these Device IDs have only one event each; it just so happens that the former’s data was ingested from multiple sources. Extrapolating this relatively simple example to the whole dataset, skipping deduplication significantly inflates metrics such as Events Per Device Per Day, giving the data a perceived higher quality than it actually has.
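To see that inflation concretely, here is a small hypothetical example in pandas. Only the Device IDs come from the figure above; the coordinates and timestamps are invented for illustration.

```python
import pandas as pd

# Hypothetical raw feed: the first device's single event arrived via three sources.
events = pd.DataFrame({
    "device_id": ["ca990773-6a5a-4152-a233-740a0d5abdf2"] * 3
               + ["d3b7ad73-f300-4f3b-b573-10721f0b896d"],
    "latitude":  [1.3521, 1.3521, 1.3521, 1.2806],
    "longitude": [103.8198, 103.8198, 103.8198, 103.8500],
    "timestamp": [1672531200, 1672531200, 1672531200, 1672531260],
})

# Raw event counts per device: the duplicated device looks three times as active.
print(events.groupby("device_id").size())
```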

As part of Quadrant’s standard data cleansing and quality assurance process, all our data is run through a deduplication algorithm that focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp.

Our algorithm will check if any of the rows contain the exact same combination of these four attributes and keep only one copy. Taking the dataset above as an example, here is what it would look like after deduplication:

[Figure: the same dataset after deduplication, with one record per Device ID]
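Continuing the hypothetical sketch above, this kind of exact-duplicate removal is a single pandas call (again an illustration, not Quadrant’s production pipeline):

```python
# Keep exactly one copy of each (device_id, latitude, longitude, timestamp) combination.
deduped = events.drop_duplicates(
    subset=["device_id", "latitude", "longitude", "timestamp"]
)

print(deduped.groupby("device_id").size())  # now one event per device
```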

It is important to be conscious of the pitfalls of simple metrics, such as the number of events per device per day, when evaluating multiple datasets across providers.

If you would like to learn more about how to deduplicate your data, even if it is from other data providers you work with, let us know and we would be happy to help!


