As a Data Scientist you will be spending a large chunk of your time doing repetitive work across projects. You get a new data set and you run the same boilerplate code for your analysis: Import data, check for missing values, distribution check, cleansing, yada yada …

I know, datasets differ a lot, yet I still see how often the very same steps are repeated blindly in notebooks over and over again. Despite common software engineering principles telling us not to repeat ourselves, we still do it. I started noticing this pattern in my own work and got frustrated. Is there a way to break this recurring cycle?

Over and over and over and over, like a monkey with a miniature cymbal, the joy of repetition really is in you. Over and Over - Hot Chip

Master your craft and build your personal toolkit. Photo by NeONBRAND.

My tip: build a personal library of frequently needed analysis helper functions to make your life easier. See yourself as a data science practitioner and master your tools over time. Rather than having them scattered all over your workbench, you need a central place to store them. That is your personal toolkit. It looks different for everybody, but you will recognize its importance and appreciate it more the more you use it. Upgrade and sharpen your knives regularly. It is your toolkit.

In this post I will show you some good practices around Python packaging with a focus on Data Science applications. In detail, we are going to:

  • create a package of frequently used functions,
  • test our code with unit-tests,
  • and finally distribute it through Git.

The first steps: create your personal tool library

In order to get started, think about the steps that you regularly go through when starting a new data project (e.g. at school, at work, or for a personal project).

Here is a list of recurring tasks I can think of right off the top of my head:

  • Analysis
    • Plot distributions for all columns/features
    • Outlier detection for various data types (categorical, numeric, timestamps, etc.); see the sketch after this list
    • Missing values detection and handling
  • Database and File IO
    • Write helper functions for easier access to the data stores specific to your work. It might be the ugly ERP you are using at your BigCorp™️ or just plain database connectors. Think of useful abstractions to read from and write to these.
    • Writing files to S3 and other file stores.
  • Machine Learning
    • Feature creation
    • Train-Test splits beyond sklearn
    • Model Validation
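
To give a flavor of what such helpers can look like, here is a minimal sketch of a numeric outlier detector based on the interquartile range. The function name and the default factor of 1.5 are my own choices and not part of the sample package:

import pandas as pd

def iqr_outliers(series: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return the values that fall outside factor * IQR of the series."""
    # coerce to numeric so mixed or messy columns do not break the check
    numeric = pd.to_numeric(series, errors="coerce").dropna()
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return numeric[(numeric < lower) | (numeric > upper)]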

Many posts have been written about structuring Python packages; I can warmly recommend the following for more in-depth discussions on this topic:

  1. A Practical Guide to Using Setup.py - GoDataDriven
  2. Structuring Your Project - The Hitchhiker's Guide to Python
  3. Python Application Layouts: A Reference – Real Python
  4. Packaging Python Projects - Python Packaging User Guide

Obviously your Data Science toolkit should be tailored to your own work and needs. So be selfish and name it after yourself … or come up with a cool-sounding name. How about Swahili character names?

For the example project in this post, we are going to start with a simple package called frnz-sample, named after yours truly.

Find a link to the GitHub repository here to have a look at the exact structure.

The project looks something like this:

frnz-sample
├── LICENSE
├── README.md
├── frnz
│   ├── __init__.py
│   ├── analytics.py
│   └── data.py
├── setup.py
└── tests
    └── test_analytics.py

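A minimal setup.py for a package like this could look roughly like the following sketch; the metadata values are illustrative placeholders and may differ from the actual repository:

# a minimal setup.py sketch; name, version and dependencies are placeholders
from setuptools import find_packages, setup

setup(
    name="frnz-sample",
    version="0.1.0",
    description="Personal data science helper functions",
    packages=find_packages(exclude=["tests"]),
    install_requires=["pandas"],
    python_requires=">=3.7",
)
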
The frnz directory contains two core modules, the first one being analytics.py:

import pandas as pd
from typing import Dict

def count_missing(dataframe: pd.DataFrame) -> Dict[str, int]:
    return dict(dataframe.isna().sum())

The count_missing function returns the number of missing values for every column in the supplied DataFrame. This is just one example of a convenience function that can become part of your toolkit. Add more over time!
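
To illustrate the call, here is a tiny self-contained example; the DataFrame is made up on the spot and not part of the package:

import pandas as pd
from frnz.analytics import count_missing

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"]})
print(count_missing(df))  # one missing value in column "a", none in "b"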

Shipping toy-data 🕹️

Another key aspect of the Data Science toolkit is of course data! Shipping sample data is useful for two primary reasons:

  • use it as a starting point for analysis tasks or data pipelines,
  • most importantly: use it to unit-test your core functions. More on that later.

Shipping a little toy-dataset with your package avoids external API calls to download (potentially) larger datasets. You can find similar dataset-loading utilities in other packages such as scikit-learn.

Keep in mind the main functionality you want your code to cover and design your toy-data accordingly. Typos, missing data and outliers are only some of the things you will encounter in your work.

In our case, I created a small Pandas DataFrame containing city names that can be used for other analysis functions. Even though we don't have many rows here, we can potentially test a few important things:

  • (fuzzy) duplicates
    • there are two "Cape Towns" in this dataset; how do we find the real one?
  • outliers
    • numerical: Mumbai is certainly populous, but is it really that populous?
    • categorical: Sydney is in Australia, but its three-letter country code deviates from the two-letter ISO codes used for the other cities. How do we find these anomalies programmatically?
  • missing values
    • the country for cape-town and the population for Sydney are missing; how do we detect and handle them?

import pandas as pd

cities = pd.DataFrame(
    data={
        "city": [
            "Berlin",
            "Vienna",
            "Montreal",
            "Mumbai",
            "cape-town",
            "Cape Town",
            "Sydney",
        ],
        "country": ["DE", "AT", "CA", "IN", pd.NA, "ZA", "AUS"],
        "population": [3750000, 1900000, 1780000, 184100000, 430000, 440000, pd.NA],
    }
)
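
The sample package simply exposes cities as a module-level variable. If you prefer the loader-function pattern that scikit-learn uses, a minimal sketch could look like this (the load_cities name is my own and not part of the repository):

import pandas as pd

from frnz.data import cities


def load_cities() -> pd.DataFrame:
    # return a fresh copy so downstream code cannot mutate the shared DataFrame
    return cities.copy()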

You have finished your first core functions? Time for testing!

Unit-testing the toolkit

Unit-testing our code ensures that it performs as expected. Especially when working ad hoc in Jupyter notebooks, we run the risk of relying on untested functions. That is fine for a quick analysis, but once those functions are used repeatedly, they should become part of your library. And that means writing tests.

Here is an example of a unit test for our count_missing function from the analytics module. I will use the pytest framework for this, as it is one of the simplest ways to get started with testing in Python.

In a folder called tests, create the following file test_analytics.py:

from frnz.data import cities
from frnz.analytics import count_missing


def test_counting_nas_return_correct_count():
    # the toy dataset has exactly one missing country and one missing population
    result = count_missing(cities)
    expected = {"city": 0, "country": 1, "population": 1}
    assert len(result) == len(expected)
    assert result["country"] == expected["country"]

First, we have to import all objects that we need for that specific test. In our case, this test will also import the sample DataFrame that we created earlier.

The structure of a test file is quite simple: you create a new function that tests a specific behavior of your code by comparing an expected outcome with the actual result. In this case, we test the length and the content of the result of calling count_missing on our sample cities DataFrame. Because we already know how many missing values each column of our sample data contains, we can explicitly create a dictionary of expected results in the test.

The assert statements compare the expected and the actual results and thus make sure the function works correctly.
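
As the toolkit grows, pytest's parametrize decorator helps keep the test file compact. Here is a sketch of how the same check could be split per column; this test is an illustration and not part of the repository:

import pytest

from frnz.analytics import count_missing
from frnz.data import cities


# run the same assertion once per column of the toy dataset
@pytest.mark.parametrize(
    "column, expected_missing",
    [("city", 0), ("country", 1), ("population", 1)],
)
def test_count_missing_per_column(column, expected_missing):
    assert count_missing(cities)[column] == expected_missing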

To test our code, we simply run the pytest command from the CLI in the root of the package. This starts the test runner and automatically discovers the unit test we just wrote. Make sure the names of the Python files containing your tests start with the test_ prefix; that is the cue for pytest to pick them up. Pytest then runs your tests and reports whether they passed or failed.

Distribute through GitHub

The great thing about using Git for version control is that you can directly pip install your package from GitHub (without publishing to PyPI). Just use:

pip install -e git+https://github.com/alfranz/frnz-sample.git@master#egg=frnz

and/or specify it as part of a requirements.txt in your project like:

frnz @ git+https://github.com/alfranz/frnz-sample.git@master#egg=frnz

Using the package in the wild 💪

Now you should be able to simply install your new package through pip install and call your functions like this:

import frnz.analytics
import frnz.data

df = frnz.data.cities

print(frnz.analytics.count_missing(df))

Awesome, your package should now be working. You are now ready to not repeat yourself as frequently as before and super-charge your analytics workflows.

More to come!

This is the first post on the Data Science toolkit series. The next posts will go into more detail regarding CI/CD, the use of Docker containers and some additional things I have found useful in my work.

Did you enjoy this post? I would love to hear your feedback. What kind of functions did you add to your toolkit? Let me know!

Alex