Type-hinting is one of the greatest things to happen to Python.

There is a famous saying that software is read many more times than it is written.

Type hinting makes reading code much, much easier.


Introduction

In this article, we talk about:

  1. Why type-hinting is useful (even for data science).
  2. How to set up and use mypy to do type hinting.
  3. How to use Python type-hinting to make your code more readable.
  4. Some limitations of type-hinting.

Type-hinting

As you know, Python is a dynamically typed language: the types of your variables are not checked before the code runs.

That means, if you have a function

def transform(x):
    return x + 1

you can pass in a str, and you only find out it was a mistake when you run your code and get an error.
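
To see this in action, here is a quick sketch (the try/except wrapper and the variable names are mine, just for illustration). Python happily accepts the call and only complains once the addition actually runs.

```python
def transform(x):
    return x + 1

# Nothing stops us from calling transform with a str...
try:
    result = transform("1")
except TypeError as err:
    # ...the mistake only surfaces when the addition is executed.
    result = None
    error_message = str(err)
    print(f"Failed at runtime: {error_message}")
```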

In languages such as Java where you need to compile code before running it, these types of mistakes are far less likely to appear since the compiler will complain if you try to pass a str to a function that needs an int.

Using a static type checker allows programmers to pick up mistakes before running their code.

In the above example, it’s pretty obvious you need a numerical data type, be it an int or float.

What about the function below?

def process(data):
    ...
    return data_processed

So… what is this data? Is it a dict? numpy array? pandas dataframe? You would have no idea until you read through the code.

What about the data_processed? Is it still the same data type as data? Or did something change?

You wouldn’t know…

With type-hinting, you can help readers understand your code better by doing this:

def process(data: Dict[str, str]) -> np.ndarray:
    ...
    return data_processed

Without knowing anything about type-hinting, you can probably guess that the data parameter requires a dict type and returns a numpy array.

So now that we have a clear use case, let’s talk about how to write Python code with type-hints.

Setup

When we use type-hints in Python, we will need something to help us check the code against the hints we have given.

We will use mypy in this article. There are alternatives but I’ve only used mypy and haven’t found a reason to switch.

Installing mypy is fairly straightforward.

pip3 install mypy

We can create a straightforward example to test if our mypy installation was successful.

# code.py

def function(x: int) -> int:
    return x + 1

if __name__ == "__main__":
    x = 1
    print(f"output: {function(x)}")

Now just run mypy code.py in the terminal and it should report no issues.

Now let’s make a slight change…

# code.py

def function(x: int) -> int:
    return x + 1

if __name__ == "__main__":
    x = '1'
    print(f"output: {function(x)}")

Running mypy now gives you an error, because we pass a str to a function that expects an int.

This is a simple example of how static type-checking provides a layer of protection from bugs.
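
It is worth remembering that the interpreter itself ignores type-hints, which is why you need a checker like mypy in the first place. Here is a minimal sketch (the double function is made up for illustration): the call below violates the hint but runs without any error, and this kind of silent bug is exactly what a static checker is there to flag.

```python
def double(x: int) -> int:
    return x * 2

# The hint says int, but Python does not enforce it at runtime.
# str * 2 is valid Python, so this call "works" and returns a str.
wrong = double("1")
print(repr(wrong))
```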

You might think that this is unnecessary as you can simply run the program to test it.

Small codebases will probably run quite quickly so your overall testing time might stay the same.

In large codebases, it is expensive to spin up a remote server to start a build.

Throwing things into a Kubernetes pod is also not cheap.

Running Python code is always going to be more expensive than reading “text” using mypy.

Basic Usage

Let’s have a look at some basic syntax.

Type-hints for variables

Type-hinting for variables is fairly straightforward.

For the most common data types, we can do the following:

x: int = 3
y: float = 3.0

name: str = "x"
flag: bool = False

If we want to type-hint for “composed” data structures, we’ll need to import the typing package.

from typing import Dict, List, Set, Tuple

d: Dict[str, int] = {'a': 1}
l: List[str] = ['a', 'b', 'c']
s: Set[str] = {'a', 'b', 'c'}
t: Tuple[str, str, str] = ('a', 'b', 'c')

UPDATE: Since Python 3.9, the standard collection types such as dict can be used in type-hints directly, without importing anything from typing. This was proposed in PEP 585. As a consequence, you can just write

d: dict[str, int] = {'a': 1}
l: list[str] = ['a', 'b', 'c']
s: set[str] = {'a', 'b', 'c'}
t: tuple[str, str, str] = ('a', 'b', 'c')

You might be thinking, isn’t it pretty obvious what x is when we see x = 3?

That’s right, but there are actually 2 interesting use cases that make type-hinting variables useful, especially when the variable holds a more complicated type such as a dictionary.

I will talk about this more in part 2.

Type-hints in functions

You have already seen an example of how to type-hint for functions.

We simply write arg: <type> for each parameter and, at the end of the signature before the colon, we type-hint the return type with ->.

def function(a: int, b: str) -> float:
    pass

Fairly straightforward. If a function doesn’t return anything, we can type-hint the return type with None.
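
Here is a small sketch of a function that returns nothing (log_message is a made-up example, not from any library). The hints are stored on the function itself, and typing.get_type_hints lets you peek at them.

```python
from typing import get_type_hints

def log_message(message: str, repeat: int = 1) -> None:
    # Prints the message and returns nothing, hence -> None.
    for _ in range(repeat):
        print(message)

# Maps each parameter name (and 'return') to its hinted type.
hints = get_type_hints(log_message)
print(hints)
```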

Type-hints in classes

This is similar for classes:

from typing import Tuple

class Foo:
    def __init__(self, x: int, y: Tuple[str, int]) -> None:
        self.x = x
        self.y = y

Here, Tuple[str, int] means that we are expecting a tuple whose 1st element is a str and 2nd is an int.

Hopefully this isn’t too complicated.

Multiple Types?

Let’s say you have a function that can take in any numerical value, both int and float. You can specify this by doing the following

from typing import Union

def add(x: Union[int, float], y: Union[int, float]):
    pass

This can be super annoying to type repeatedly, so you can create an alias for it

from typing import Union

Numeric = Union[int, float]

def add(x: Numeric, y: Numeric):
    pass

This is super useful if your data can come in multiple formats… e.g. numpy arrays and pandas dataframes.

import numpy as np
import pandas as pd

from typing import Union

Data = Union[pd.DataFrame, np.ndarray]

def process(data: Data) -> Data:
    pass
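
To make the alias idea concrete, here is a small runnable sketch of the Numeric alias from earlier with an actual function body (the body and the variable names are my own illustration).

```python
from typing import Union

# One alias, reused wherever "int or float" is expected.
Numeric = Union[int, float]

def add(x: Numeric, y: Numeric) -> Numeric:
    return x + y

int_total = add(1, 2)        # both arguments are ints
float_total = add(1, 2.5)    # mixing int and float is allowed too
```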

One of the most convenient things about type-hints is that my variable names became simpler.

Prior to type-hinting, I would always attach the variable type to my variable name e.g.

df_sales = pd.DataFrame(...)
...
df_sales_region = ...
df_sales_region_store = ...

# Even in functions I would do this.
def func(df_sales):
    pass

I did this to make it easier to identify what my variable was when I came back to the code later.

However, with type-hints, I can just read the function signature and I would know. So my variables became

def func(sales: pd.DataFrame):
    pass

Type-hinting function arguments

What happens if the parameter is a function? You can use Callable.

from typing import Callable, List

def normalise(
        func: Callable[..., List[float]],
        data: List[float]
    ) -> List[float]:
    return func(data)

if __name__ == "__main__":
    def min_max(data: List[float]) -> List[float]:
        min_d = min(data)
        max_d = max(data)

        range_d = max_d - min_d
        # Scale every value into [0, 1]; assumes max_d != min_d.
        return [(d - min_d) / range_d for d in data]

    data = [1.0, 2.0, 3.0]  # example input
    data_normed = normalise(min_max, data)

The literal ellipsis (...) means the function can take an arbitrary number of arguments.

To type-hint specifics, you simply write Callable[[int, int], float]. This is a type-hint that requires a function with 2 int parameters and a float return type.
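
As a quick sketch of the more specific form (apply and divide are names I made up for this example), here is a function that only accepts callables taking two ints and returning a float.

```python
from typing import Callable

def apply(op: Callable[[int, int], float], a: int, b: int) -> float:
    # op must accept exactly two ints and return a float.
    return op(a, b)

def divide(a: int, b: int) -> float:
    return a / b

ratio = apply(divide, 1, 2)
```

Passing a function with a different signature, say one that takes a single str, is something a type checker would flag before you ever run the code.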

More complicated use cases

Using Literal

Say you are training 3 different types of models and have a function

# train.py

from typing import Any, Dict, Literal, Union

def train_model(model: str, config: Dict[str, Union[str, int, float]]) -> bool:
    if model == "baseline":
        # train model A
        pass
    elif model == "xgboost":
        # train model B
        pass
    elif model == "neural_net":
        # train model C
        pass
    else:
        raise ValueError(f"Model={model} not supported!")

So your train model function supports only 3 types of models and you want to make sure that a user only passes in these 3 strings.

You can’t stop people doing it at run-time but you can use Literal to tell users when type-checking what values you’re expecting.

Of course, you can also raise ValueError like above. But with type-hinting you can prevent this before runtime.

from typing import Any, Dict, Literal, Union

Model = Literal["baseline", "xgboost", "neural_net"]

def train_model(model: Model, config: Dict[str, Union[str, int, float]]) -> bool:
    pass
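
If you want both the static check and a runtime check without listing the model names twice, one option is typing.get_args. This is a minimal sketch, not the only way to do it (SUPPORTED_MODELS and check_model are names I made up for this example).

```python
from typing import Literal, get_args

Model = Literal["baseline", "xgboost", "neural_net"]

# get_args pulls the allowed strings back out of the Literal,
# so the runtime check and the type-hint share one definition.
SUPPORTED_MODELS = get_args(Model)

def check_model(model: str) -> bool:
    return model in SUPPORTED_MODELS
```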

NOTE: You can also enforce Literal-like behaviour using enums, but errors will be caught during runtime and are therefore more expensive.

If you’re interested shoot me a message and I’ll do a quick writeup about it 🤓

Using NewType

Code can sometimes become easier to read if it follows domain logic.

import pandas as pd
from typing import Dict, List, NewType, Union

# Create new subtypes of pd.DataFrame to encode domain logic
UsersData = NewType('UsersData', pd.DataFrame)
ItemsData = NewType('ItemsData', pd.DataFrame)

At runtime, the NewType function is an identity map - your variables are treated as dataframes during runtime.

But during type-checking we can enforce domain logic.
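
Here is a dependency-free sketch of that idea, using plain ints instead of dataframes (UserId, ItemId and describe_user are hypothetical names for this example only).

```python
from typing import NewType

# Hypothetical domain types built on int to keep the example small.
UserId = NewType("UserId", int)
ItemId = NewType("ItemId", int)

def describe_user(user_id: UserId) -> str:
    return f"user #{user_id}"

uid = UserId(42)
iid = ItemId(42)

# At runtime both are ordinary ints...
print(type(uid) is int, uid == iid)

# ...but a type checker treats UserId and ItemId as distinct types,
# so describe_user(iid) would be flagged while describe_user(uid) is fine.
label = describe_user(uid)
```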

If you want to see the actual implementation of NewType - have a look here.

So using the above example we have

# new_type.py

import pandas as pd
from typing import Dict, List, NewType, Union

# Create new subtypes of pd.DataFrame to encode domain logic
UsersData = NewType('UsersData', pd.DataFrame)
ItemsData = NewType('ItemsData', pd.DataFrame)

def filter_sensitive_info(users: UsersData):
    pass

def select_region(items: ItemsData, region: str):
    pass

if __name__ == "__main__":
    users = UsersData(pd.DataFrame({'name': ['Ash'], 'pokemon': ['Pikachu']}))
    items = ItemsData(pd.DataFrame({'name': ['Potion'], 'effect': ['Restores 20 HP']}))

    filter_sensitive_info(users)
    select_region(items, region="Kanto")

    # trying to select_region(users) will fail static type-checking
    select_region(users, region="Kanto")

Too many types

The documentation for the typing module is pretty much like infinite scrolling… there are many types you can use for the same function.

If your machine learning workflow is straightforward it makes sense to keep things simple.

Nevertheless, I’ll give one example: Iterable, Sequence and List.

When to use which one?

First, let’s establish a hierarchy.

A List is a type of Sequence which is a type of Iterable.

Well, if your variable is a list or array as you would normally expect in Python, use List. Most of the time this is fine.

A Sequence is an Iterable that can be indexed, sliced, and has a length (i.e. you can call len() on it). E.g. tuples and strings are sequences in Python.

An Iterable is anything that has an __iter__ method. In practice, that means you can call iter(var) on it and things will work.

Iterables expand to include sets and dictionaries.

In practice, something is an Iterable IF you can use a for loop on it.

So depending on the needs of your function, your type-hint may vary.
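
If you are ever unsure where a type sits in this hierarchy, you can check at runtime. This sketch uses isinstance against the abstract classes in collections.abc, which the typing versions correspond to (the samples and checks dictionaries are just for illustration).

```python
from collections.abc import Iterable, Sequence

samples = {
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "str": "abc",
    "set": {1, 2, 3},
    "dict": {"a": 1},
}

# For each sample, record whether it counts as an Iterable and as a Sequence.
checks = {
    name: (isinstance(value, Iterable), isinstance(value, Sequence))
    for name, value in samples.items()
}

for name, (is_iterable, is_sequence) in checks.items():
    print(f"{name:>5}: Iterable={is_iterable}, Sequence={is_sequence}")
```

So a function that only loops over its input can ask for an Iterable and accept all five, while one that indexes into its input should ask for a Sequence.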

Type-hints as Documentation

This is a function I wrote a few years back without type-hints… this is actual code (with a few adjustments of course… also there are like 10 more arguments… 😓)

def run_model_training(
    region,
    query_dict,
    save_dir,
    base_training_data,
    features,
    target_var_name,
    hyperparams_config,
    date_start,
    date_end,
):
    """
    Parameters
    ----------
    region: str
        region code, supports only 'Kanto' or 'Jhoto'.
    query_dict: {str, str script}
        The dictionary storing SQL scripts.
    save_dir: Path
        base directory where everything will be saved.
    base_training_data: pd.DataFrame
        DataFrame with essential features
    features:
        Additional features
    target_var_name: str
       Column name of the label in the output DataFrame
    hyperparams_config: {str, Anything}
    date_start: <yyyy-mm-dd>
    date_end: <yyyy-mm-dd>

    Returns
    -------
    dict
    Dictionary of metadata related to the trained model.
    """

Let me try to rewrite this with type hints

import numpy as np
import pandas as pd
from pathlib import Path
from typing import Any, Dict, List, Literal, Union

ModelMetadata = Dict[str, Any]
Data = Union[pd.DataFrame, np.ndarray]
Region = Literal['Kanto', 'Jhoto']

def run_model_training(
    region: Region,
    query_dict: Dict[str, str],
    save_dir: Path,
    base_training_data: Data,
    features: Data,
    target_var_name: str,
    hyperparams_config: Dict[str, Any],
    date_start: str,
    date_end: str,
) -> ModelMetadata:
    """
    Parameters
    ----------
    ...
    date_start: <yyyy-mm-dd>
    date_end: <yyyy-mm-dd>

    Returns
    -------
    {
        "name": <name>,
        "status": true,
        "shape": (<rows>, <cols>),
        "save_path": str,
        "features": List[str],
        "date_start": "2018-01-01",
        "date_end": "2018-12-31",
    }
    """

So much easier to read, no? 🙃

Limitations of Type-hinting


from __future__ import annotations

# Needed when a class refers to itself in its own type-hints (unless you quote the class name).
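
The usual reason for this import is a class that mentions itself in its own hints: inside the class body, the class name isn’t defined yet. Here is a minimal sketch (the Node class is made up for illustration) that uses quoted hints instead, which work without the import. With from __future__ import annotations at the very top of the file, you could drop the quotes.

```python
from typing import Optional

class Node:
    def __init__(self, value: int, parent: Optional["Node"] = None) -> None:
        self.value = value
        self.parent = parent

    def root(self) -> "Node":
        # Walk up the parents until we reach the top of the tree.
        node = self
        while node.parent is not None:
            node = node.parent
        return node

top = Node(1)
leaf = Node(3, parent=Node(2, parent=top))
```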

The main limitation of type-hinting is that it’s not available for every package. When typing was first released in Python 3.5, it didn’t support the data-science packages like pandas and numpy.

This meant that although you could write the hints in your code, mypy couldn’t actually check that code for correctness.

So using type-hints was purely for increased readability.

You will know which libraries are missing because when you run mypy you will get hit with


main.py:1: error: Library stubs not installed for "requests" (or incompatible with Python 3.8)
main.py:2: error: Skipping analyzing 'django': found module but no type hints or library stubs
main.py:3: error: Cannot find implementation or library stub for module named "this_module_does_not_exist"

In this case, you would need to add a comment to indicate that this library has no type-hints available.

import requests  # type: ignore

More recently, numpy has added support for type-hints. You can read the documentation here.

It also takes a while to learn how to write type-hints that are genuinely helpful to your users. The learning curve becomes steep quite quickly, and for personal projects, the benefits may not outweigh the effort.

Nevertheless, type-hints are extremely useful, especially for people who productionise Python code, for ML or otherwise.

That brings us to the end of part one.

Stay tuned for part two! 🙃