Type-hinting is one of the greatest things to happen in Python.
There is a famous saying that software is read many more times than it is written.
Type hinting makes reading code much, much easier.
Table of Contents
- Introduction
- Type-hinting
- Setup
- Basic Usage
- More complicated use cases
- Too many types
- Type-hints as Documentation
- Limitations of Type-hinting
Introduction
In this article, we talk about
- Why type-hinting is useful (even for data science).
- How to set up and use mypy to do type hinting.
- How to use Python type-hinting to make your code more readable.
- Some limitations of type-hinting.
Type-hinting
As you know, Python is dynamically typed, so nothing checks the types of your variables before the code runs.
That means, if you have a function
def transform(x):
    return x + 1
you can pass in a str and you will only find out about the mistake when you run your code and get an error.
In languages such as Java, where you need to compile code before running it, these types of mistakes are far less likely to appear since the compiler will complain if you try to pass a str to a function that needs an int.
Using a static type checker allows programmers to pick up mistakes before running their code.
In the above example, it’s pretty obvious you need a numerical data type, be it an int or float.
What about the function below?
def process(data):
    ...
    return data_processed
So… what is this data? Is it a dict? A numpy array? A pandas dataframe? You would have no idea until you read through the code.
What about data_processed? Is it still the same data type as data? Or did something change?
You wouldn’t know…
With type-hinting, you can help readers understand your code better by doing this:
def process(data: Dict[str, str]) -> np.ndarray:
    ...
    return data_processed
Without knowing anything about type-hinting, you can probably guess that the data parameter requires a dict and that the function returns a numpy array.
So now that we have a clear use case, let’s talk about how to write Python code with type-hints.
Setup
When we use type-hints in Python, we will need something to help us check the code against the hints we have given.
We will use mypy in this article. There are alternatives but I’ve only used mypy and haven’t found a reason to switch.
Installing mypy is fairly straightforward.
pip3 install mypy
We can create a straightforward example to test if our mypy installation was successful.
# code.py
def function(x: int) -> int:
    return x + 1

if __name__ == "__main__":
    x = 1
    print(f"output: {function(x)}")
Now just run mypy code.py in the terminal and it should report no issues.
Now let’s make a slight change…
# code.py
def function(x: int) -> int:
    return x + 1

if __name__ == "__main__":
    x = '1'
    print(f"output: {function(x)}")
Running mypy now gives you an error, because a str is being passed to a function that expects an int.
This is a simple example of how static type checking provides a layer of protection from bugs.
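It is worth stressing that the hints themselves do nothing at runtime; only the type checker reads them. A minimal sketch, reusing the function from code.py (the try/except wrapper is only there for illustration):

```python
def function(x: int) -> int:
    return x + 1


# The annotations are stored as metadata on the function...
print(function.__annotations__)

# ...but Python does not check them when the function is called.
# Passing a str only fails once `x + 1` is actually evaluated.
try:
    function("1")
except TypeError as err:
    print(f"runtime error: {err}")
```

In other words, without running mypy the mistake would only surface when that line of code is executed.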
You might think that this is unnecessary as you can simply run the program to test it.
Small codebases will probably run quite quickly, so your overall testing time might stay the same.
In large codebases, it is expensive to spin up a remote server to start a build.
Throwing things into a Kubernetes pod is also not cheap.
Running Python code will generally be more expensive than having mypy read it as “text”.
Basic Usage
Let’s have a look at some basic syntax.
Type-hints for variables
Type-hinting for variables is fairly straightforward.
For the most common data types, we can do the following:
x: int = 3
y: float = 3.0
z: str = "x"
flag: bool = False
If we want to type-hint for “composed” data structures, we’ll need to import the typing package.
from typing import Dict, List, Set, Tuple
d: Dict[str, int] = {'a': 1}
l: List[str] = ['a', 'b', 'c']
s: Set[str] = {'a', 'b', 'c'}
t: Tuple[str, str, str] = ('a', 'b', 'c')
UPDATE: Since Python 3.9, the standard collection types such as dict can be used in type-hints directly, without importing anything from typing. This was proposed in PEP 585. As a consequence you can just write
d: dict[str, int] = {'a': 1}
l: list[str] = ['a', 'b', 'c']
s: set[str] = {'a', 'b', 'c'}
t: tuple[str, str, str] = ('a', 'b', 'c')
You might be thinking, isn’t it pretty obvious what x is when we see x = 3?
That’s right, but there are actually 2 interesting use-cases that make type-hinting variables useful, especially for a more complicated dictionary type.
I will talk about this more in part 2.
Type-hints in functions
You have already seen an example of how to type-hint for functions.
We simply write arg: <type> for each argument and, at the end of the signature before the colon, we type-hint the return type with a ->
def function(a: int, b: str) -> float:
    pass
Fairly straightforward. If a function does not return anything, we can type-hint the return type with None.
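As a small sketch of the None case next to a function that does return a value (both function names are made up for illustration):

```python
from typing import List


def log_scores(scores: List[float]) -> None:
    # No return statement, so the function implicitly returns None.
    for score in scores:
        print(f"score: {score:.2f}")


def mean_score(scores: List[float]) -> float:
    return sum(scores) / len(scores)


result = log_scores([0.5, 0.75])
print(result)
print(mean_score([0.5, 0.75]))
```

A type checker should complain if you try to use the result of log_scores as if it were a number.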
Type-hints in classes
This is similar for classes
from typing import Tuple

class Foo:
    def __init__(self, x: int, y: Tuple[str, int]) -> None:
        self.x = x
        self.y = y
Here, Tuple[str, int] means that we are expecting a tuple whose 1st element is a str and 2nd is an int.
Hopefully this isn’t too complicated.
Multiple Types?
Let’s say you have a function that can take in any numerical value, both int and float. You can specify this by doing the following
from typing import Union

def add(x: Union[int, float], y: Union[int, float]):
    pass
This can be super annoying to type repeatedly, so you can create an alias for it
from typing import Union

Numeric = Union[int, float]

def add(x: Numeric, y: Numeric):
    pass
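To make the alias concrete, here is a minimal sketch with the function body filled in (the body and the return annotation are additions for illustration):

```python
from typing import Union

# Alias: anywhere Numeric appears, the type checker reads Union[int, float].
Numeric = Union[int, float]


def add(x: Numeric, y: Numeric) -> Numeric:
    return x + y


print(add(1, 2))    # int + int
print(add(1, 2.5))  # int + float
```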
This is super useful if your data can come in multiple formats… e.g. numpy arrays and pandas dataframes.
import numpy as np
import pandas as pd
from typing import Union

# Note the "=": this defines an alias, not an annotated variable.
Data = Union[pd.DataFrame, np.ndarray]

def process(data: Data) -> Data:
    pass
One of the most convenient things about type-hints is that my variable names became simpler.
Prior to type-hinting, I would always attach the variable type to my variable name e.g.
df_sales = pd.DataFrame(...)
...
df_sales_region = ...
df_sales_region_store = ...
# Even in functions I would do this.
def func(df_sales):
    pass
I did this to make it easier to identify what my variable was when I came back to the code later.
However, with type-hints, I can just read the function signature and I would know. So my variables became
def func(sales: pd.DataFrame):
    pass
Type-hinting function arguments
What happens if the parameter is a function? You can use Callable.
from typing import Callable, List

def normalise(
    func: Callable[..., List[float]],
    data: List[float]
) -> List[float]:
    return func(data)

if __name__ == "__main__":
    def min_max(data: List[float]) -> List[float]:
        min_d = min(data)
        max_d = max(data)
        range_d = max_d - min_d
        # Scale each value into the range [0, 1].
        return [(d - min_d) / range_d for d in data]

    data = [1.0, 2.0, 4.0]
    data_normed = normalise(min_max, data)
The literal ellipsis (...) means an arbitrary number of arguments.
To type-hint specifics you simply write Callable[[int, int], float]. This is a type-hint that requires a function with 2 int parameters and a float return type.
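A short sketch of the more specific form, Callable[[int, int], float]. The names apply and ratio are hypothetical and exist only for this example:

```python
from typing import Callable


def apply(func: Callable[[int, int], float], a: int, b: int) -> float:
    # func must accept exactly two ints and return a float.
    return func(a, b)


def ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator


print(apply(ratio, 1, 4))
```

Passing a function with a different signature, for example one that takes a single argument, should be rejected by the type checker.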
More complicated use cases
Using Literal
Say you are training 3 different types of models and have a function
# train.py
from typing import Any, Dict, Literal, Union

def train_model(model: str, config: Dict[str, Union[str, int, float]]) -> bool:
    if model == "baseline":
        # train model A
        pass
    elif model == "xgboost":
        # train model B
        pass
    elif model == "neural_net":
        # train model C
        pass
    else:
        raise ValueError(f"Model={model} not supported!")
So your train_model function supports only 3 types of models and you want to make sure that a user only passes in one of these 3 strings.
You can’t stop people doing it at run-time but you can use Literal to tell users, at type-checking time, what values you’re expecting.
Of course, you can also raise ValueError like above. But with type-hinting you can catch the mistake before runtime.
from typing import Any, Dict, Literal, Union

Model = Literal["baseline", "xgboost", "neural_net"]

def train_model(model: Model, config: Dict[str, Union[str, int, float]]) -> bool:
    pass
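Since Literal is only checked statically, values that arrive at runtime (say, from a config file) still need a check. One possible sketch, assuming Python 3.8+ for typing.get_args, with validate_model as a hypothetical helper name:

```python
from typing import Literal, get_args

Model = Literal["baseline", "xgboost", "neural_net"]


def validate_model(model: str) -> str:
    # get_args(Model) returns the tuple of allowed strings,
    # so the Literal stays the single source of truth.
    if model not in get_args(Model):
        raise ValueError(f"Model={model} not supported!")
    return model


print(get_args(Model))
print(validate_model("xgboost"))
```

This keeps the list of supported models in one place instead of repeating the strings in both the type-hint and the runtime check.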
NOTE: You can also enforce Literal-like behaviour using enums, but converting a raw string into an enum member can only fail at runtime, which is more expensive.
If you’re interested shoot me a message and I’ll do a quick writeup about it 🤓
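In the meantime, a rough sketch of what an enum version might look like. The class name, member names and placeholder function body are assumptions for illustration, not taken from a real codebase:

```python
from enum import Enum


class Model(Enum):
    BASELINE = "baseline"
    XGBOOST = "xgboost"
    NEURAL_NET = "neural_net"


def train_model(model: Model) -> bool:
    # Placeholder body: a real implementation would train the chosen model.
    return isinstance(model, Model)


# Converting a raw string goes through the Enum constructor,
# which raises ValueError at runtime for unknown values.
chosen = Model("xgboost")
print(chosen, train_model(chosen))

try:
    Model("svm")
except ValueError as err:
    print(f"runtime error: {err}")
```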
Using NewType
Code can sometimes become easier to read if it follows domain logic.
import pandas as pd
from typing import Dict, List, NewType, Union

# Create distinct subtypes of pd.DataFrame
UsersData = NewType('UsersData', pd.DataFrame)
ItemsData = NewType('ItemsData', pd.DataFrame)
At runtime, the NewType function is an identity map - your variables are treated as dataframes during runtime.
But during type-checking we can enforce domain logic.
If you want to see the actual implementation of NewType - have a look here.
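The identity-map behaviour is easy to see with a built-in type. A minimal sketch, where UserId and ItemId are hypothetical names used only here:

```python
from typing import NewType

UserId = NewType("UserId", int)
ItemId = NewType("ItemId", int)

user = UserId(42)

# At runtime NewType just hands back the value it was given.
print(type(user))  # a plain int
print(user + 1)
```

Only the type checker treats UserId and ItemId as different types; the interpreter sees ordinary ints.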
So using the above example we have
# new_type.py
import pandas as pd
from typing import Dict, List, NewType, Union

# Create distinct subtypes of pd.DataFrame
UsersData = NewType('UsersData', pd.DataFrame)
ItemsData = NewType('ItemsData', pd.DataFrame)

def filter_sensitive_info(users: UsersData):
    pass

def select_region(items: ItemsData, region: str):
    pass

if __name__ == "__main__":
    users = UsersData(pd.DataFrame({'name': ['Ash'], 'pokemon': ['Pikachu']}))
    items = ItemsData(pd.DataFrame({'name': ['Potion'], 'effect': ['Restores 20 HP']}))
    filter_sensitive_info(users)
    select_region(items, region="Kanto")
    # passing users to select_region will fail static type-checking
    select_region(users, region="Kanto")
Too many types
The documentation for the typing module is pretty much like infinite scrolling… there are many types you can use for the same function.
If your machine learning workflow is straightforward it makes sense to keep things simple.
Nevertheless, I’ll give one example: Iterable, Sequence and List.
When to use which one?
First, let’s establish a hierarchy.
A List is a type of Sequence which is a type of Iterable.
Well, if your variable is a plain list as you would normally expect in Python, use List. Most of the time this is fine.
A Sequence is an Iterable that can be indexed, sliced, and has a length. E.g. tuples and strings are sequences in Python.
An Iterable is anything that has an __iter__ method. In practice, that means you can call iter(var) on it and things will work.
Iterables expand to include sets and dictionaries.
In practice, something is an Iterable IF you can use a for loop on it.
So depending on the needs of your function, your type-hint may vary.
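The hierarchy can be checked at runtime with the abstract base classes in collections.abc, which the typing versions correspond to. A small sketch:

```python
from collections.abc import Iterable, Sequence

examples = {
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "str": "abc",
    "set": {1, 2, 3},
    "dict": {"a": 1},
}

# Every example is an Iterable, but only list, tuple and str are Sequences.
for name, value in examples.items():
    print(
        f"{name:5} iterable={isinstance(value, Iterable)} "
        f"sequence={isinstance(value, Sequence)}"
    )
```

So a function that only loops over its input can accept an Iterable, while one that indexes or slices needs at least a Sequence.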
Type-hints as Documentation
This is a function I wrote a few years back without type-hints… this is actual code (with a few adjustments of course… also there are like 10 more arguments… 😓)
def run_model_training(
    region,
    query_dict,
    save_dir,
    base_training_data,
    features,
    target_var_name,
    hyperparams_config,
    date_start,
    date_end,
):
    """
    Parameters
    ----------
    region: str
        region code, supports only 'Kanto' or 'Jhoto'.
    query_dict: {str, str script}
        The dictionary storing SQL scripts.
    save_dir: Path
        base directory where everything will be saved.
    base_training_data: pd.DataFrame
        DataFrame with essential features
    features:
        Additional features
    target_var_name: str
        Column name of the label in the output DataFrame
    hyperparams_config: {str, Anything}
    date_start: <yyyy-mm-dd>
    date_end: <yyyy-mm-dd>

    Returns
    -------
    dict
        Dictionary of metadata related to the trained model.
    """
Let me try to rewrite this with type hints
import numpy as np
import pandas as pd
from pathlib import Path
from typing import Any, Dict, List, Literal, Union

ModelMetadata = Dict[str, Any]
Data = Union[pd.DataFrame, np.ndarray]
Region = Literal['Kanto', 'Jhoto']

def run_model_training(
    region: Region,
    query_dict: Dict[str, str],
    save_dir: Path,
    base_training_data: Data,
    features: Data,
    target_var_name: str,
    hyperparams_config: Dict[str, Any],
    date_start: str,
    date_end: str,
) -> ModelMetadata:
    """
    Parameters
    ----------
    ...
    date_start: <yyyy-mm-dd>
    date_end: <yyyy-mm-dd>

    Returns
    -------
    {
        "name": <name>,
        "status": true,
        "shape": (<rows>, <cols>),
        "save_path": str,
        "features": List[str],
        "date_start": "2018-01-01",
        "date_end": "2018-12-31",
    }
    """
So much easier to read no? 🙃
Limitations of Type-hinting
from __future__ import annotations
# Required for class type-hinting.
The main limitation of type-hinting is that it’s not available for every package. When typing was first released in Python 3.5, the data-science packages like pandas and numpy did not come with type-hints.
This meant that although you could write the hints in your code, mypy couldn’t actually check your code for correctness.
So using type-hints was purely for increased readability.
You will know which libraries are missing because when you run mypy you will get hit with
main.py:1: error: Library stubs not installed for "requests" (or incompatible with Python 3.8)
main.py:2: error: Skipping analyzing 'django': found module but no type hints or library stubs
main.py:3: error: Cannot find implementation or library stub for module named "this_module_does_not_exist"
In this case, you would need to add a comment to indicate that this library has no type-hints available.
import requests # type: ignore
More recently, numpy has added support for type-hints. You can read the documentation here.
It takes a while to learn how to write effective type-hints. The learning curve becomes steep quite quickly, and for personal projects, the benefits may not outweigh the effort.
Nevertheless, type-hints are extremely useful, especially for people who productionise Python code for ML or otherwise.
That brings us to the end of part one.
Stay tuned for part two! 🙃