How to Deal With NaN Values — datatest 0.12.0.dev1 documentation (2024)


IEEE 754

While the behavior of NaN values can seem strange, it’s actually the result of an intentionally designed specification. The behavior was standardized in IEEE 754, a technical standards document first published in 1985 and implemented by many popular programming languages (including Python).

When checking certain types of data, you may encounter NaN values. Working with NaNs can be frustrating because they don’t always act as one might expect.

About NaN values:

NaN is short for “Not a Number”.
NaN values represent undefined or unrepresentable results from certain mathematical operations.
Mathematical operations involving a NaN will either return a NaN or raise an exception.
Equality and ordering comparisons involving a NaN will return False (only the != comparison returns True).
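The properties listed above can be observed with nothing but the standard library. The following is a minimal sketch, not part of datatest itself; the variable names are chosen here purely for illustration.

```python
import math

nan = float('nan')
inf = float('inf')

# NaN arises from undefined operations such as inf - inf.
undefined = inf - inf
print(math.isnan(undefined))   # True

# Arithmetic involving a NaN propagates the NaN...
propagated = nan + 1
print(math.isnan(propagated))  # True

# ...or raises an exception when no sensible result exists.
try:
    math.floor(nan)
    raised = False
except ValueError:
    raised = True
print(raised)                  # True

# Equality and ordering comparisons are all False; only != is True.
comparisons = [nan == nan, nan < 1, nan > 1, nan <= nan]
print(comparisons)             # [False, False, False, False]
print(nan != nan)              # True
```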

Checking for NaN Values

To make sure data elements do not contain NaN values, you can use a helper function:

    from math import isnan
    from datatest import validate

    data = [5, 6, float('nan')]

    def not_nan(x):
        """Values should not be NaN."""
        return not isnan(x)

    validate(data, not_nan)

You can also do this using an inverted Predicate match:

    from math import isnan
    from datatest import validate, Predicate

    data = [5, 6, float('nan')]

    requirement = ~Predicate(isnan)

    validate(data, requirement)
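One practical caveat: math.isnan() raises TypeError when it is given a non-numeric value such as a string or None. If your data mixes types, a more tolerant helper may be useful. The sketch below uses only the standard library and assumes that non-numeric values should count as "not NaN"; that policy is an assumption made for illustration, not something datatest prescribes.

```python
from math import isnan


def not_nan(x):
    """Return True unless x is a real-number NaN.

    Non-numeric values (strings, None, and so on) make math.isnan()
    raise TypeError; this helper treats them as "not NaN".
    """
    try:
        return not isnan(x)
    except TypeError:
        return True


data = [5, 6.5, 'abc', None, float('nan')]

# Keep only the elements that pass the check.
nan_free = [x for x in data if not_nan(x)]
print(nan_free)  # [5, 6.5, 'abc', None]
```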

Accepting NaN Differences

If validation fails and returns NaN differences, you can accept them as you would any other difference:

    from math import nan
    from datatest import validate, accepted, Extra

    data = [5, 6, float('nan')]
    requirement = {5, 6}

    with accepted(Extra(nan)):
        validate(data, requirement)

Like other values, NaNs can also be accepted as part of a list, set, or mapping of differences:

    from math import nan
    from datatest import validate, accepted, Missing, Extra

    data = [5, 6, float('nan')]
    requirement = {5, 6, 7}

    with accepted([Missing(7), Extra(nan)]):
        validate(data, requirement)

Note

The math.nan value is new in Python 3.5. NaN values can also be created in any Python version using float('nan').

Dropping NaNs Before Validation

Sometimes it’s OK to ignore NaN values entirely. If this is appropriate in your circumstance, you can simply remove all NaN records and validate the remaining data.

If you’re using Pandas, you can call the Series.dropna() and DataFrame.dropna() methods to drop records that contain NaN values:

    import pandas as pd
    from datatest import validate

    source = pd.Series([1, 1, 2, 2, float('nan')])

    data = source.dropna()  # Drop NaN valued elements.

    requirement = {1, 2}

    validate(data, requirement)
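If you are not using Pandas, the same idea can be expressed with a list comprehension. This is a minimal standard-library sketch; the rows variable and the has_nan() helper are hypothetical names introduced here for illustration.

```python
from math import isnan


def has_nan(record):
    """Return True if any float field in the record is NaN."""
    return any(isinstance(v, float) and isnan(v) for v in record)


rows = [
    ('a', 1.0),
    ('b', float('nan')),
    ('c', 2.0),
]

# Drop every record that contains a NaN field.
clean = [row for row in rows if not has_nan(row)]
print(clean)  # [('a', 1.0), ('c', 2.0)]
```

The cleaned records can then be validated like any other data.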

Requiring NaN Values

If necessary, it’s possible to require that NaNs appear in your data. But putting NaN values directly into a requirement can be fraught with problems and should usually be avoided. The most robust way to do this is by replacing NaN values with a special token and then requiring the token.

Below, we define a custom NanToken object and use it to replace actual NaN values.

If you’re using Pandas, you can call the Series.fillna() and DataFrame.fillna() methods to replace NaNs with a different value:

    import pandas as pd
    from datatest import validate

    class NanToken(object):
        def __repr__(self):
            return self.__class__.__name__

    # Rebind the name to a single shared instance of the class.
    NanToken = NanToken()

    source = pd.Series([1, 1, 2, 2, float('nan')])

    data = source.fillna(NanToken)  # Replace NaNs with NanToken.

    requirement = {1, 2, NanToken}

    validate(data, requirement)
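The token approach does not depend on Pandas. As a rough sketch using only the standard library, NaNs in a plain list can be swapped for a sentinel object that compares equal to itself. The names NanTokenType and NAN_TOKEN are hypothetical and chosen here to avoid rebinding the class name.

```python
from math import isnan


class NanTokenType:
    """Sentinel that stands in for NaN and compares equal to itself."""

    def __repr__(self):
        return 'NanToken'


NAN_TOKEN = NanTokenType()  # single shared instance

source = [1, 1, 2, 2, float('nan')]

# Replace each float NaN with the sentinel; leave other values alone.
data = [
    NAN_TOKEN if isinstance(x, float) and isnan(x) else x
    for x in source
]

print(data)                            # [1, 1, 2, 2, NanToken]
print(set(data) == {1, 2, NAN_TOKEN})  # True
```

Because the sentinel is an ordinary object, it behaves predictably in sets and equality checks, which actual NaN values do not.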

A Deeper Understanding

Equality: NaN ≠ NaN

NaN values don’t compare as equal to anything—even themselves:

    >>> x = float('nan')
    >>> x == x
    False

To check if a value is NaN, it’s common for modules and packages to provide a function for this purpose (e.g., math.isnan(), numpy.isnan(), pandas.isna(), etc.):

    >>> import math
    >>> x = float('nan')
    >>> math.isnan(x)
    True
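The standard library also covers NaNs of other numeric types, each with its own check. The sketch below is illustrative only and shows that math.isnan() does not accept complex numbers, for which cmath.isnan() exists instead.

```python
import cmath
import decimal
import math

# Each numeric type has its own NaN check.
float_check = math.isnan(float('nan'))
complex_check = cmath.isnan(complex('nan'))
decimal_check = decimal.Decimal('nan').is_nan()
print(float_check, complex_check, decimal_check)  # True True True

# math.isnan() rejects complex values outright.
try:
    math.isnan(complex('nan'))
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```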

While NaN values cannot be compared directly, they can be compared as part of a difference object. In fact, difference comparisons treat all NaN values as equal, even when the underlying type is different:

    >>> import decimal, math, numpy
    >>> from datatest import Invalid
    >>> Invalid(math.nan) == Invalid(float('nan'))
    True
    >>> Invalid(math.nan) == Invalid(complex('nan'))
    True
    >>> Invalid(math.nan) == Invalid(decimal.Decimal('nan'))
    True
    >>> Invalid(math.nan) == Invalid(numpy.nan)
    True
    >>> Invalid(math.nan) == Invalid(numpy.float32('nan'))
    True
    >>> Invalid(math.nan) == Invalid(numpy.float64('nan'))
    True

Identity: NaN is NaN, Except When it Isn’t

Some packages provide a NaN constant that can be referenced in user code (e.g., math.nan and numpy.nan). While it may be tempting to use these constants to check for matching NaN values, this approach is not reliable in practice.

To optimize performance, Numpy and Pandas must strictly manage the memory layouts of the data they contain. When numpy.nan is inserted into an ndarray or Series, the value is coerced into a compatible dtype when necessary. When a NaN’s type is coerced, a separate instance is created and the ability to match using the is operator no longer works as you might expect:

    >>> import pandas as pd
    >>> import numpy as np
    >>> np.nan is np.nan
    True
    >>> s = pd.Series([10, 11, np.nan])
    >>> s[2]
    nan
    >>> s[2] is np.nan
    False

We can verify that the types are now different:

    >>> type(np.nan)
    <class 'float'>
    >>> type(s[2])
    <class 'numpy.float64'>

Generally speaking, it is not safe to assume that NaN is NaN. This means that, for reliable validation, it’s best to remove NaN records entirely or replace them with some other value.
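The same pitfall shows up with plain Python containers, which test identity before equality when looking for a member. The following standard-library sketch is illustrative only (the names a and b are arbitrary). It shows that a NaN matches only when it is the very same object.

```python
a = float('nan')
b = float('nan')  # a second, distinct NaN object

print(a is a)  # True
print(a is b)  # False
print(a == a)  # False

# Membership checks identity first, then equality.
same_object_found = a in [a]
other_object_found = b in [a]
print(same_object_found)   # True
print(other_object_found)  # False

# Sets deduplicate the same object but not two distinct NaNs.
print(len({a, a}))  # 1
print(len({a, b}))  # 2
```

This is one reason a requirement such as a set containing a NaN may or may not match, depending on where each NaN object came from.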