Data Anomaly: That Grumpy Face of Industrial Raw Data

Suparna
Feb 11, 2024


Source: VectorStock

Yes!! We can all relate to this face after looking at industrial raw data for the first time😐.

So, this article is all about ‘that weird face of the data’, before we start processing the world’s new treasure.

We have all heard that industrial raw data is very different from a cleaned/processed dataset, and neither looks anything like the datasets on Kaggle (the most popular data platform among Data Scientists) 😬. So, without beating around the bush, let’s break the ice and help newbies see what’s behind the curtain.

Hey, wait! This isn’t the same old trite content easily available all over the internet. Rather, let’s dive deep into a pragmatic understanding of the beauty of data anomalies. As a data professional once said, “Data Science is 80% processing raw data and 20% complaining about it!” 😅

Source: Data_query_memes

1. Absence of a proper data dictionary/metadata:

A data dictionary/metadata document is one of the most essential aids for understanding the data. It outlines everything about the data we deal with, i.e. the structure, source, format, and meaning of each field, and any transformations or processing steps applied to it (now or ever).

But the worst situation is when no proper data dictionary can be found, and you are left to understand the data the hard way.
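When that happens, I usually bootstrap a rough dictionary from the data itself. Here is a minimal sketch with pandas (the file name is just a placeholder):

```python
import pandas as pd

# Hypothetical file name -- replace with your own raw dataset
df = pd.read_csv("raw_events.csv")

# Build a rough, makeshift data dictionary straight from the data
data_dict = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(2),
    "n_unique": df.nunique(),
    "sample_value": df.apply(
        lambda col: col.dropna().iloc[0] if col.notna().any() else None
    ),
})

print(data_dict)
```

It’s no replacement for real documentation, but it gives you a first map of the fields you’re about to fight with.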

2. Corrupt files and datatype discrepancies:

A. Corrupted File:

When it comes to playing with the data, it all starts with accessing a corrupted file. Though I’m not a fan of accessing data in file format for analytics (why? I’ll discuss that in another article), this situation still comes up often.

One of the most common and ambiguous problems is that a file may contain unexpected or missing characters/delimiters (very common for .csv, .json, and .txt files).

You may find this kind of error very frequently:
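For example, pandas raises a ParserError when a line has a different number of fields than the header. A minimal sketch of how one might inspect or skip such lines (the file name is a placeholder, and on_bad_lines needs pandas 1.3+):

```python
import pandas as pd

try:
    # Strict read: fails as soon as a malformed line is found
    df = pd.read_csv("raw_export.csv")
except pd.errors.ParserError as err:
    print(f"Strict read failed: {err}")
    # Fall back: warn about (and skip) lines with an unexpected number of fields
    df = pd.read_csv("raw_export.csv", on_bad_lines="warn")

print(df.shape)
```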

B. Issues in API calls:

While fetching the data, one may face a frustrating situation where an API call returns an empty DataFrame after keeping the kernel busy for hours.

Another scenario is when an API request fails for a particular date range or for a few events.

Sometimes insignificant numerical data or blank rows appear out of the blue.
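One way I cope with this is to fetch the data in small chunks (say, one day at a time), check each response before trusting it, and retry the chunks that come back empty or broken. A rough sketch, where the endpoint, parameters, and schema are purely hypothetical:

```python
import time
import requests
import pandas as pd

# Hypothetical endpoint -- adjust for the real API you are calling
URL = "https://api.example.com/events"

def fetch_day(day: str, retries: int = 3) -> pd.DataFrame:
    """Fetch one day of events, retrying on failures or empty payloads."""
    for attempt in range(retries):
        try:
            resp = requests.get(URL, params={"date": day}, timeout=30)
            payload = resp.json() if resp.ok else []
        except (requests.RequestException, ValueError):
            payload = []
        if payload:                       # non-empty payload -> trust it
            return pd.DataFrame(payload)
        time.sleep(2 ** attempt)          # simple backoff before retrying
    print(f"No usable data for {day} after {retries} attempts")
    return pd.DataFrame()

frames = [fetch_day(d) for d in ["2024-01-01", "2024-01-02"]]
events = pd.concat(frames, ignore_index=True)
```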

C. UnicodeDecodeError:

Apart from all of the above, one may hit an encoding issue while reading a file. Unicode is a standard that assigns a unique number (code point) to each character, and UTF-8 is its most popular encoding.

Image added by author (here we find an ‘invalid start byte’ error; cause: the byte sequence starts with a byte that is not valid for the chosen encoding)

Usually the error message describes the specific problem encountered during decoding. A few of them are listed below (a fallback-read sketch follows the list):

- ‘invalid continuation byte’
- ‘unexpected end of data’
- ‘ordinal not in range’
- ‘invalid start byte’
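A simple workaround is to try a few likely encodings and fall back to a lossy read only as a last resort. A minimal sketch (the file name is a placeholder):

```python
def read_with_fallback(path: str) -> str:
    """Try a few common encodings before falling back to a lossy decode."""
    for enc in ("utf-8", "utf-8-sig", "cp1252", "latin-1"):
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            print(f"{enc} failed, trying the next encoding...")
    # Last resort: replace undecodable bytes instead of crashing
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

text = read_with_fallback("raw_log.txt")
```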

D. Datatype formats:

  • Now, after creating a DataFrame, there is a high chance of encountering weird datatype formats. For example, only in very few cases is the timestamp already in datetime or a human-readable format.
Image shared by Author (Masked Data). We can see two formats of the datetime field here.
  • Another challenge is when you find a single field (or multiple fields) storing data in different languages (e.g. English, Anglais (French), английский (Russian), ইংরেজি (Bengali), język angielski (Polish), अंग्रेज़ी (Hindi)).
  • It is very common to find numerical values stored as strings (e.g. ‘10 min’ or ‘-240 min’) along with unexpected characters, so checking the datatype of each and every field is important (a small cleanup sketch follows this list).
Image by Author (Masked Data). Here, the ‘event.Property’ field stores data in multiple languages and ‘event.spentTime’ has some data with unexpected characters.
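A small cleanup pass usually fixes both problems: strip the unexpected characters before casting to numeric, and parse each timestamp according to the format it actually arrives in. A minimal sketch, assuming column names like ‘event.spentTime’ and ‘event.timestamp’ similar to the masked samples above:

```python
import pandas as pd

# Toy data mimicking the masked samples above (column names are assumptions)
df = pd.DataFrame({
    "event.spentTime": ["10 min", "-240 min", "15min??", None],
    "event.timestamp": ["2024-02-01 10:15:00", "1706778900000", "2024-02-03", None],
})

# Keep only the sign and digits, then coerce to a number (unparsable -> NaN)
df["spent_minutes"] = pd.to_numeric(
    df["event.spentTime"].str.replace(r"[^0-9\-]", "", regex=True),
    errors="coerce",
)

def parse_ts(val):
    """Parse either epoch milliseconds or a human-readable timestamp."""
    if pd.isna(val):
        return pd.NaT
    s = str(val)
    if s.isdigit() and len(s) >= 12:   # looks like epoch milliseconds
        return pd.to_datetime(int(s), unit="ms")
    return pd.to_datetime(s, errors="coerce")

df["ts"] = df["event.timestamp"].map(parse_ts)
print(df[["spent_minutes", "ts"]])
```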

3. Validation

Now let’s come to the point of whether the data is valid or not.

A. Impossible Data:

Mostly we treat these records as outliers when their count is not huge (<5-10%); when it is, we use our power of code and statistical methods to correct or manipulate them.

I’ll mention here a few cases that I found really misleading and worth discussing (a few sanity checks that catch some of them are sketched after the list):

  • I found multiple users whose single device kept scrolling/watching for well over 24 hrs in a day, and this incident kept repeating for months.
Image shared by Author (Masked Data). Here, a user with a single device has a watch time of more than 10 days on a particular day.
  • Another thing I noticed was impossible timestamps; a sample is shared below.
Image shared by Author (Masked Data). The red-marked dates make the data even more unreliable.
  • Invalid object IDs, transaction IDs, or even product IDs can also be present.
  • Sometimes you may face a situation where you get confused about whether the data is correct or not. For example, a failed transaction may generate a valid transaction ID with an actual timestamp but no credit/debit value (i.e. no amount deducted on the user’s end), and still leave the user status as ‘Subscribed’.
  • Valid data keeps recurring with different timestamps, so removing duplicate values won’t work. For example, many service providers don’t allow their users to buy more than one plan, yet you can find data in which a bunch of people purchase a yearly/monthly plan every day in a recurring way. No, hold your horses, there is no loophole in the service platform’s payment or plan page; it’s just how the data is collected or how the responsible API behaves, since the only value that differs is the transaction ID (a transaction ID is always unique for every transaction and never repeats). The sample below explains more.
Image by Author (Masked Data). This sample shows the same user buying a different plan every day in a single account.
  • Though it’s very rare, a dataset may contain multiple IDs for the same object title, in which none of the IDs is invalid.
Image by Author (Masked Data). Here, ‘prop.productId’ contains multiple product IDs for the same product (possible in user interaction data).
  • It’s very common for the tags and properties of objects to change over time while their object IDs remain constant. This situation creates a contradictory mess when you are building a recsys model, as the historical interaction data doesn’t agree well with the current one.
Image by Author (Masked Data). Here, the same product has contradictory or different product properties (marked in red).
  • An uncleaned dataset may have multiple columns (with different column names) storing similar values, yet they are not duplicates. The sample below will help you understand more.
Image by Author (Masked Data) — We find four types of similar fields here
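None of these cases show up in a quick .info() or .describe(), so I find it helps to codify them as explicit sanity checks. A rough sketch, where the file and the column names (userId, deviceId, date, watchTimeMin, productTitle, productId, transactionId) are only assumptions:

```python
import pandas as pd

# Hypothetical interaction export with the assumed columns above
df = pd.read_csv("interactions.csv")

# 1) Impossible data: more than 24 hours of watch time per device per day
daily = df.groupby(["userId", "deviceId", "date"])["watchTimeMin"].sum()
impossible = daily[daily > 24 * 60]
print(f"{len(impossible)} user/device/day combinations exceed 24 hours")

# 2) The same product title mapped to several product IDs
ids_per_title = df.groupby("productTitle")["productId"].nunique()
print("Titles with multiple IDs:\n", ids_per_title[ids_per_title > 1].head())

# 3) 'Duplicates' that only differ by transaction ID and timestamp
key_cols = [c for c in df.columns if c not in ("transactionId", "timestamp")]
dup_mask = df.duplicated(subset=key_cols, keep=False)
print(f"{dup_mask.sum()} rows are identical except for transaction ID / timestamp")
```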

B. Other Data Anomalies:

These types of anomalies reduce data reliability.

  • Wrong information about an object, e.g. an incorrect age, gender, or location of a user, a wrong weight of a product, or a wrong expiry date of a medicine.
  • Correct information about a product, but misspelled, may decrease the accuracy of a model (see the sketch after this list).
Image by Author (Masked Data). ‘object.Property’ contains misspelled data (highlighted in red).
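A small fuzzy-matching pass against a list of known-good values can catch many of these misspellings. A minimal sketch using Python’s standard difflib, where the column name and the vocabulary are assumptions:

```python
import difflib
import pandas as pd

# Assumed list of known-good property values
valid_values = ["thriller", "documentary", "comedy", "romance"]

df = pd.DataFrame({"object.Property": ["thriler", "comedy", "documentry", "romanse"]})

def fix_spelling(value: str) -> str:
    """Map a value to its closest known-good spelling, if one is close enough."""
    match = difflib.get_close_matches(value, valid_values, n=1, cutoff=0.8)
    return match[0] if match else value

df["property_clean"] = df["object.Property"].map(fix_spelling)
print(df)
```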

4. Quality of data

The following cases are a dime a dozen (a quick profiling sketch follows the list):

  • Too many null values
  • Huge numbers of duplicate rows
  • Missing data
  • Very few numerical fields alongside high-cardinality categorical fields (>1000 unique values), which mostly contain nominal values (where the ratio is almost 1:5000)
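A quick profiling pass makes all of these visible before any modelling starts. A minimal sketch (the file name is a placeholder):

```python
import pandas as pd

# Hypothetical raw dump -- replace with your own file
df = pd.read_csv("raw_dump.csv")

print("Null % per column:")
print((df.isna().mean() * 100).round(2).sort_values(ascending=False))

print(f"\nFully duplicated rows: {df.duplicated().sum()}")

# High-cardinality categorical fields are expensive to encode and often noisy
cat_cols = df.select_dtypes(include="object").columns
cardinality = df[cat_cols].nunique().sort_values(ascending=False)
print("\nCategorical fields with more than 1000 unique values:")
print(cardinality[cardinality > 1000])
```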

And that’s all! Everything I discussed here was part of my learning, which ultimately became a blessing in disguise. In the end, I learned that all we try to do is make the data less dirty🤓 than its previous state.


Suparna

A Data Scientist by profession. An epicurean by passion, who also loves to play with colors on canvas. Email: suparna.mondal.ofc@gmail.com