Setting up python's Windows embeddable distribution (properly)


Embeddable distribution

If there is anything I like about Windows as a Pythonista, it must be that you can use the embeddable distribution of Python.

The embedded distribution is a ZIP file containing a minimal Python environment. It is intended for acting as part of another application, rather than being directly accessed by end-users.

In my opinion, it is a portable, ready-to-ship virtualenv. However, the embedded distribution comes with some limitations:

Third-party packages should be installed by the application installer alongside the embedded distribution. Using pip to manage dependencies as for a regular Python installation is not supported with this distribution, though with some care it may be possible to include and use pip for automatic updates. In general, third-party packages should be treated as part of the application (“vendoring”) so that the developer can ensure compatibility with newer versions before providing updates to users.

Sounds scary, right? It even says pip isn't supported. Don't worry: follow these simple steps and you will have a fully workable embedded environment.

Get the distribution

  1. Go to https://www.python.org/downloads/windows/
  2. Choose the Python version you like and download the corresponding Windows x86-64 embeddable zip file.
  3. Unzip the file.

To make this tutorial easy to follow, I am assuming that you have downloaded Python 3.7 and unzipped it to C:\python\.

Get pip

The distribution does not come with pip installed; you need to install it yourself:

  1. Download get-pip.py from https://bootstrap.pypa.io/get-pip.py
  2. Save it to C:\python\get-pip.py
  3. In the command line, run C:\python\python get-pip.py
  4. pip is now installed

Config path

The runtime of this distribution does not add the empty string '' to sys.path, which means the current directory is not searched for imports. To solve the problem, you need to:

  1. Open C:\python\python37._pth.
  2. Uncomment the line #import site and save.
  3. Create a new .py file and save it as c:\python\sitecustomize.py:
import sys
sys.path.insert(0, '')
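For reference, after step 2 your python37._pth should look roughly like this (the exact contents can vary slightly between releases; the key point is that import site is no longer commented out):

python37.zip
.

# Uncomment to run site.main() automatically
import site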

lib2to3 issue

You will encounter the following error when you try to install some packages:

error: [Errno 0] Error: 'lib2to3\\Grammar3.6.5.final.0.pickle'
  1. Unzip C:\python\python37.zip to a new folder
  2. Delete C:\python\python37.zip
  3. Rename the new folder to python37.zip (yes, a new folder called python37.zip)

Python's import machinery is able to treat a zip file as a folder; however, it cannot read the pickle files inside the zip, so we unzip it and give the resulting folder the original name.
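If you want a quick sanity check after the rename (a hypothetical snippet, assuming the C:\python\ layout above), run this with C:\python\python:

import os
import lib2to3

print(os.path.isdir(r'C:\python\python37.zip'))  # True: it is now a real folder
print(lib2to3.__file__)                          # resolves to a file inside that folder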

Running pip

If you don't want to mess with your PATH, you can simply do the following in your Windows command prompt:

  1. CD C:\python\Scripts
  2. pip install xxxxx

Running Scripts

Again, if you don't want to mess with your PATH, you can simply run the following in your Windows command prompt:

  1. C:\python\python <path to your script>
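For example, a tiny script to verify the setup (a sketch; save it as, say, check_setup.py and run it with the command above):

import sys

print(sys.executable)  # should point at C:\python\python.exe
print('' in sys.path)  # True once sitecustomize.py is in place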

Done!

Object-Oriented Design architecture with Pandas


Hi guys. Today I would like to talk about implementing Object-oriented (OO) Design to build a Business Intelligence module with Python and Pandas.

Pandas is widely used for data analytics. It is easy to use, and one can quickly build a script of hundreds of lines of pandas code to handle complex data. However, having worked with many data analysts, I found that most of the scripts they produce fall into the category of "procedural code", which is not reusable: changing the context a bit often results in a complete rewrite. That is fine if you are in the "research" area, where your code is used for exploration or experiments, but if you are writing code that contributes to a larger system, or code that will be run repeatedly, you need to implement some architecture.

My Background

I work in finance, where I handle data with complex relationships daily. The raw data I receive is usually so "raw" that extracting any useful information from it requires a lot of steps. Code can easily become impossible to maintain because of the mismatch between the raw data structure and the business logic.

The solution

Soon I realized that I needed an OO layer on top of the raw data to handle all the complexity. The reason I use the term OO layer here is that OO is essentially an abstraction layer: it lets you focus on the business logic only, instead of worrying about how to derive it from the raw data. Architecture is all about reducing your cognitive burden when you edit a codebase.

The Data

We will try to build an OO model for some trade data:

| Account | Stock_Code | BuySell | Date     | Quantity | Unit_Price |
|---------|------------|---------|----------|----------|------------|
| 1001    | 001        | B       | 20190105 | 8        | 2          |
| 1001    | 002        | B       | 20190105 | 4        | 3          |
| 1001    | 002        | S       | 20190106 | 3        | 1          |
| 1001    | 003        | B       | 20190106 | 6        | 5          |
| 1001    | 003        | S       | 20190106 | 4        | 6          |
| 1001    | 001        | S       | 20190107 | 6        | 3          |
| 1002    | 001        | B       | 20190105 | 6        | 2          |
| 1002    | 002        | B       | 20190105 | 8        | 3          |
| 1002    | 002        | S       | 20190106 | 3        | 1          |
| 1002    | 003        | B       | 20190106 | 5        | 5          |
| 1002    | 003        | S       | 20190106 | 4        | 6          |
| 1002    | 001        | S       | 20190107 | 6        | 3          |



some account information:

| Account | Name  |
|---------|-------|
| 1001    | David |
| 1002    | Tom   |



and some closing price data:

| Stock_Code | Closing_Price | Date     |
|------------|---------------|----------|
| 001        | 2             | 20190105 |
| 001        | 3             | 20190106 |
| 001        | 2             | 20190107 |
| 002        | 2             | 20190105 |
| 002        | 3             | 20190106 |
| 002        | 5             | 20190107 |
| 003        | 5             | 20190105 |
| 003        | 6             | 20190106 |
| 003        | 7             | 20190107 |

First, we will have a "Data Layer" to handle the I/O of the source data. Since we don't have an actual data source, we will mock it up:

import pandas as pd
from io import StringIO

class Data():

    def trade_df(self):
        # This method should handle the import of source data
        data = StringIO()

        s = """
        Account|Stock_Code|BuySell|Date|Quantity|Unit_Price
        1001|001|B|20190105|8|2
        1001|002|B|20190105|4|3
        1001|002|S|20190106|3|1
        1001|003|B|20190106|6|5
        1001|003|S|20190106|4|6
        1001|001|S|20190107|6|3
        1002|001|B|20190105|6|2
        1002|002|B|20190105|8|3
        1002|002|S|20190106|3|1
        1002|003|B|20190106|5|5
        1002|003|S|20190106|4|6
        1002|001|S|20190107|6|3
        """

        data.write(s.replace(' ',''))
        data.seek(0)
        df = pd.read_csv(data, sep='|', dtype={'Account': str,
                                               'Stock_Code': str,
                                               'Date': str})
        df.Date = pd.to_datetime(df.Date,format='%Y%m%d')
        return df

    def account_df(self):
        data = StringIO()

        s = """
        Account|Name
        1001|David
        1002|Tom
        """

        data.write(s.replace(' ',''))
        data.seek(0)
        df = pd.read_csv(data, sep='|', dtype={'Account': str,
                                                  'Name': str})
        return df

    def stock_prices(self):

        data = StringIO()

        s = """
        Stock_Code|Closing_Price|Date
        001|2|20190105
        001|3|20190106
        001|2|20190107
        002|2|20190105
        002|3|20190106
        002|5|20190107
        003|5|20190105
        003|6|20190106
        003|7|20190107
        """

        data.write(s.replace(' ',''))
        data.seek(0)
        df = pd.read_csv(data, sep='|', dtype={'Stock_Code': str,
                                                  'Date': str})
        df.Date = pd.to_datetime(df.Date,format='%Y%m%d')
        return df

The Data object should handle the import and validation of the data only; it should not implement any business logic.
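As an illustration of what that validation might look like (a hypothetical helper, not part of the example), the Data layer would be limited to sanity checks on the raw data, with no business rules:

def validated_trade_df(data):
    # Sanity checks only: value ranges and flags, no business logic.
    df = data.trade_df()
    assert df.BuySell.isin(['B', 'S']).all(), 'unexpected BuySell flag'
    assert (df.Quantity > 0).all(), 'non-positive quantity'
    return df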

Then on top of the Data object, we will build our first OO layer:

class Book():
    def __init__(self):
        self._data = Data()

    @property
    def trade_book_df(self):
        if not hasattr(self,'_trade_book_df'):
            df = self._data.trade_df().join(self._data.account_df().set_index('Account'),on='Account')
            df['Trade_Amount'] = df.Quantity * df.Unit_Price
            self._trade_book_df = df
        return self._trade_book_df

    def stock_prices(self,date=None):
        df = self._data.stock_prices()
        if date:
            df = df[df.Date == date]
        return df

    @property
    def holdings(self):
        return Holdings(self)

    @property
    def accounts(self):
        return Accounts(self.holdings)

Book().trade_book_df

|    |   Account |   Stock_Code | BuySell   | Date                |   Quantity |   Unit_Price | Name   |   Trade_Amount |
|---:|----------:|-------------:|:----------|:--------------------|-----------:|-------------:|:-------|---------------:|
|  0 |      1001 |          001 | B         | 2019-01-05 00:00:00 |          8 |            2 | David  |             16 |
|  1 |      1001 |          002 | B         | 2019-01-05 00:00:00 |          4 |            3 | David  |             12 |
|  2 |      1001 |          002 | S         | 2019-01-06 00:00:00 |          3 |            1 | David  |              3 |
|  3 |      1001 |          003 | B         | 2019-01-06 00:00:00 |          6 |            5 | David  |             30 |
|  4 |      1001 |          003 | S         | 2019-01-06 00:00:00 |          4 |            6 | David  |             24 |
|  5 |      1001 |          001 | S         | 2019-01-07 00:00:00 |          6 |            3 | David  |             18 |
|  6 |      1002 |          001 | B         | 2019-01-05 00:00:00 |          6 |            2 | Tom    |             12 |
|  7 |      1002 |          002 | B         | 2019-01-05 00:00:00 |          8 |            3 | Tom    |             24 |
|  8 |      1002 |          002 | S         | 2019-01-06 00:00:00 |          3 |            1 | Tom    |              3 |
|  9 |      1002 |          003 | B         | 2019-01-06 00:00:00 |          5 |            5 | Tom    |             25 |
| 10 |      1002 |          003 | S         | 2019-01-06 00:00:00 |          4 |            6 | Tom    |             24 |
| 11 |      1002 |          001 | S         | 2019-01-07 00:00:00 |          6 |            3 | Tom    |             18 |
  1. The Book object has a _data attribute which owns the Data object.
  2. The trade_book_df property implements two pieces of business logic:
    • The joining of trade_df and account_df.
    • The definition of Trade_Amount -> Quantity * Unit_Price.
  3. The stock_prices method extracts the stock prices of a specific date (see the snippet below).
  4. It provides the gateways to the Holdings and Accounts contexts, which we are going to talk about next.
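For instance, a quick usage sketch of the stock_prices method (assuming the classes above are in scope):

from datetime import date

book = Book()
book.stock_prices(date(2019, 1, 6))  # closing prices for 2019-01-06 only
book.stock_prices()                  # no argument: the full price history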

Contexts

A "context" is a dimension along which you want to view the data; all views that share the same dimension should be encapsulated in one object.

For example, to view the data in the Holdings context, we need to:

  1. Aggregate the trade data up to a certain point in time.
  2. Multiply the holding quantity by the stock's closing price to obtain the market value.

class Holdings():
    def __init__(self,book):
        self._book = book

    def holdings_of(self,date):
        trades_df = self._book.trade_book_df
        # .copy() avoids pandas' SettingWithCopyWarning when adding the helper column
        date_hld_df = trades_df[trades_df.Date <= date].copy()
        date_hld_df['qnt_change'] = date_hld_df['BuySell'].map({'B':1,'S':-1}) * date_hld_df.Quantity
        hld_df = date_hld_df.groupby(['Account','Stock_Code'],as_index=False)\
            .agg({'qnt_change':sum})\
            .rename(columns={'qnt_change':'Holdings'})
        hld_df = hld_df.join(self._book.stock_prices(date).set_index('Stock_Code'),on='Stock_Code')
        hld_df['Market_Value'] = hld_df.Closing_Price * hld_df.Holdings
        return hld_df

from datetime import date
Book().holdings.holdings_of(date(2019,1,6))

|    |   Account |   Stock_Code |   Holdings |   Closing_Price | Date                |   Market_Value |
|---:|----------:|-------------:|-----------:|----------------:|:--------------------|---------------:|
|  0 |      1001 |          001 |          8 |               3 | 2019-01-06 00:00:00 |             24 |
|  1 |      1001 |          002 |          1 |               3 | 2019-01-06 00:00:00 |              3 |
|  2 |      1001 |          003 |          2 |               6 | 2019-01-06 00:00:00 |             12 |
|  3 |      1002 |          001 |          6 |               3 | 2019-01-06 00:00:00 |             18 |
|  4 |      1002 |          002 |          5 |               3 | 2019-01-06 00:00:00 |             15 |
|  5 |      1002 |          003 |          1 |               6 | 2019-01-06 00:00:00 |              6 |

You can then build contexts on top of contexts. For example, you can have an Accounts context built on top of the Holdings context:

class Accounts():
    def __init__(self,holdings):
        self._holdings = holdings

    def account_value(self,date):
        df = self._holdings.holdings_of(date)
        return df.groupby('Account',as_index=False).agg({'Market_Value':sum,
                                                         'Date':'first'})

Book().accounts.account_value(date(2019,1,6))

|    |   Account |   Market_Value | Date                |
|---:|----------:|---------------:|:--------------------|
|  0 |      1001 |             39 | 2019-01-06 00:00:00 |
|  1 |      1002 |             39 | 2019-01-06 00:00:00 |

Summary

The above is a simple example of how OO design can be applied to your pandas scripts. The architecture allows you to scale up easily as the number of data sources increases. This has only been a very brief introduction; there are many more modeling techniques you can use to manage complex data. Finally, I want to stress that these techniques are becoming more and more important. Traditionally, this kind of problem was handled by a database and the relational model, but given the growing complexity and fragmentation of data, you are increasingly likely to encounter situations where your database cannot handle everything and you will need some supplement from your code, especially when you need data from multiple sources.
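For example, if we later needed to view the data per stock, a hypothetical Stocks context (not part of the example above) could be added without touching the existing layers:

class Stocks():
    # Hypothetical context: views the trade book along the stock dimension.
    def __init__(self, book):
        self._book = book

    def traded_amount(self, date):
        # Total traded amount per stock on a given date.
        df = self._book.trade_book_df
        df = df[df.Date == date]
        return df.groupby('Stock_Code', as_index=False).agg({'Trade_Amount': 'sum'})

# Usage sketch: Stocks(Book()).traded_amount(date(2019, 1, 6))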

Easy (and effective) python type checker


The Problem

Types are handled implicitly in Python. Flexible as it may seem, developers often find that this causes confusion when managing a large project, especially those coming from a strongly typed language.

Annotation

Newer versions of Python (3.5+) allow you to put type hints into a function definition. However, type checking is not performed by Python itself, and it is up to developers to implement their own runtime type checking functionality. According to PEP 484:

While the proposed typing module will contain some building blocks for runtime type checking -- in particular the get_type_hints() function -- third party packages would have to be developed to implement specific runtime type checking functionality, for example using decorators or metaclasses. Using type hints for performance optimizations is left as an exercise for the reader.

Although there are open-source libraries like mypy that can do type checking for you, this article aims to give you the minimum knowledge (and a hack) you need to implement your own type checking, if you want to avoid the dependencies brought in by a full-blown library.

The Basic Type Hinting

The following function is intended to take two integers as arguments and return their sum. You can specify the type of a function argument by adding :type after it, and the type of the return value by adding ->type:

def add(a:int, b:int)->int:
    return a + b

add(1,2)
# 3

However, Python does next to nothing with the types of the values you pass in:

add('nah ', 'i dont care')
# 'nah i dont care'

In fact, all Python did was add the type hinting information to the function's __annotations__ attribute:

add.__annotations__
# {'a': <class 'int'>, 'b': <class 'int'>, 'return': <class 'int'>}

Accessing magic attributes directly is fragile and doesn't handle some edge cases; fortunately, the typing module comes with a handy function, get_type_hints, for accessing an object's annotations:

import typing
typing.get_type_hints(add)
# {'a': <class 'int'>, 'b': <class 'int'>, 'return': <class 'int'>}
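One such edge case (a small illustration, not from the original post) is string annotations: __annotations__ keeps the raw strings, while get_type_hints resolves them to actual types:

def sub(a: 'int', b: 'int') -> 'int':
    return a - b

sub.__annotations__
# {'a': 'int', 'b': 'int', 'return': 'int'}
typing.get_type_hints(sub)
# {'a': <class 'int'>, 'b': <class 'int'>, 'return': <class 'int'>}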

Signature

To perform type checking, you will need to analyse the function's signature at runtime. The inspect module's signature function provides a convenient utility (the Signature object) to do so.

import inspect
sig = inspect.signature(add)
sig
# <Signature (a: int, b: int) -> int>
sig.bind_partial(1,2)
# <BoundArguments (a=1, b=2)>
sig.bind_partial(b=2,a=2)
# <BoundArguments (a=2, b=2)>
sig.bind_partial(b=2,a=2).arguments
# OrderedDict([('a', 2), ('b', 2)])

The bind_partial method of the Signature object maps arguments to their corresponding parameters. We can use it together with the annotation information to create a simple function decorator that does type checking:

from functools import wraps
def type_check(fn):
    sig = inspect.signature(fn)
    annotation = typing.get_type_hints(fn)
    return_type = annotation.pop('return',None)
    @wraps(fn)
    def wrapped(*args,**kwargs):
        if len(annotation) > 0:
            arguments = sig.bind_partial(*args,**kwargs).arguments
            assert all(isinstance(arguments[k],v) for k,v in annotation.items())
        return_value = fn(*args,**kwargs)
        if return_type:
            assert isinstance(return_value,return_type)
        return return_value
    return wrapped

@type_check
def add(a:int, b:int)->int:
    return a + b

add(1,2)
# 3

add('1',2)
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "/Projects/aw/shit.py", line 37, in wrapped
#    assert all(isinstance(arguments[k],v) for k,v in annotation.items())
#AssertionError

The above is the most basic type checker you can create; however, it relies on the inspect module, which is notoriously slow.

Let's timeit

import inspect
from timeit import timeit
import typing
from functools import wraps

def type_check(fn):
    sig = inspect.signature(fn)
    annotation = typing.get_type_hints(fn)
    return_type = annotation.pop('return',None)
    @wraps(fn)
    def wrapped(*args,**kwargs):
        if annotation:
            arguments = sig.bind_partial(*args,**kwargs).arguments
            assert all(isinstance(arguments[k],v) for k,v in annotation.items())
        return_value = fn(*args,**kwargs)
        if return_type:
            assert isinstance(return_value,return_type)
        return return_value
    return wrapped

def useless_wrapper(fn):
    @wraps(fn)
    def wrapped(*args,**kwargs):
        return fn(*args,**kwargs)
    return wrapped


def add(a:int, b:int)->int:
    return a + b

base_add = useless_wrapper(add)
tc_add = type_check(add)

t= timeit('add(1,1)',
       setup='from __main__ import add', number=100000)
print('it takes ',t,' seconds to run add 100000 times')


t= timeit('base_add(1,1)',
       setup='from __main__ import base_add', number=100000)
print('it takes ',t,' seconds to run base_add 100000 times')


t= timeit('tc_add(1,1)',
       setup='from __main__ import tc_add', number=100000)
print('it takes ',t,' seconds to run tc_add 100000 times')

Result:

#it takes  0.013391613000000024  seconds to run add 100000 times
#it takes  0.029804532999999994  seconds to run base_add 100000 times
#it takes  0.708789169  seconds to run tc_add 100000 times

Adding a function decorator adds a little overhead, while the type checker puts a huge overhead on the original function. The reason is that the bind_partial method dynamically analyses how *args and **kwargs map onto the signature, in native Python code, whereas the handling of *args and **kwargs for an actual function call is optimized in C. What if we could leverage that?

The Hack

fn_s = """
def magic_func {0}:
    {1}
                """

def type_check_fast(fn):
    sig = inspect.signature(fn)
    annotation = typing.get_type_hints(fn)
    return_type = annotation.pop('return',None)
    if annotation:
        assert_str = 'assert ' + ' and '.join(["isinstance({k},{v})".format(k=k,v=v.__name__) for k,v in annotation.items()])
        print('compiling:\n', fn_s.format(sig, assert_str))
        exec(fn_s.format(sig,assert_str))
        func = locals()['magic_func']
    @wraps(fn)
    def deced(*args,**kwargs):
        if annotation:
            func(*args,**kwargs)
        return_value = fn(*args,**kwargs)
        if return_type:
            assert isinstance(return_value,return_type)
        return return_value
    return deced

Yes, exec is used here. The trick is to compile a function whose signature follows the target function's, and to construct the assert statement dynamically, so that the string

fn_s = """
def magic_func {0}:
    {1}
                """

got formatted to:

#def magic_func (a: int, b: int) -> int:
#    assert isinstance(a,int) and isinstance(b,int)

The string is then evaluated by the exec call.

The new function, defined in the local scope, is then accessible via locals()['magic_func'].
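Here is the trick in isolation, as a minimal sketch (it relies on the CPython behaviour that exec() with no namespace arguments writes into the enclosing function's locals(); newer Python versions may tighten these semantics):

def make_magic():
    src = (
        "def magic_func(a: int, b: int) -> int:\n"
        "    assert isinstance(a, int) and isinstance(b, int)"
    )
    exec(src)                      # compiles magic_func into this frame's local namespace
    return locals()['magic_func']  # fetch the freshly defined function

checker = make_magic()
checker(1, 2)      # passes silently
# checker('1', 2)  # would raise AssertionError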

Let's put it all together:

import inspect
from timeit import timeit
import typing
from functools import wraps
fn_s = """
def magic_func {0}:
    {1}
                """

def type_check_fast(fn):
    sig = inspect.signature(fn)
    annotation = typing.get_type_hints(fn)
    return_type = annotation.pop('return',None)
    if annotation:
        assert_str = 'assert ' + ' and '.join(["isinstance({k},{v})".format(k=k,v=v.__name__) for k,v in annotation.items()])
        print('compiling:\n', fn_s.format(sig, assert_str))
        exec(fn_s.format(sig,assert_str))
        func = locals()['magic_func']
    @wraps(fn)
    def deced(*args,**kwargs):
        if annotation:
            func(*args,**kwargs)
        return_value = fn(*args,**kwargs)
        if return_type:
            assert isinstance(return_value,return_type)
        return return_value
    return deced

def type_check(fn):
    sig = inspect.signature(fn)
    annotation = typing.get_type_hints(fn)
    return_type = annotation.pop('return',None)
    @wraps(fn)
    def wrapped(*args,**kwargs):
        if annotation :
            arguments = sig.bind_partial(*args,**kwargs).arguments
            assert all(isinstance(arguments[k],v) for k,v in annotation.items())
        return_value = fn(*args,**kwargs)
        if return_type:
            assert isinstance(return_value,return_type)
        return return_value
    return wrapped

def useless_wrapper(fn):
    @wraps(fn)
    def wrapped(*args,**kwargs):
        return fn(*args,**kwargs)
    return wrapped


def add(a:int, b:int)->int:
    return a + b

base_add = useless_wrapper(add)
tc_add = type_check(add)
fast_tc_add = type_check_fast(add)

t= timeit('add(1,1)',
       setup='from __main__ import add', number=100000)
print('it takes ',t,' seconds to run add 100000 times')


t= timeit('base_add(1,1)',
       setup='from __main__ import base_add', number=100000)
print('it takes ',t,' seconds to run base_add 100000 times')


t= timeit('tc_add(1,1)',
       setup='from __main__ import tc_add', number=100000)
print('it takes ',t,' seconds to run tc_add 100000 times')


t= timeit('fast_tc_add(1,1)',
       setup='from __main__ import fast_tc_add', number=100000)
print('it takes ',t,' seconds to run fast_tc_add 100000 times')

Result:

#compiling:
# 
#def magic_func (a: int, b: int) -> int:
#    assert isinstance(a,int) and isinstance(b,int)
#                
#it takes  0.013479943000000001  seconds to run add 100000 times
#it takes  0.030140912  seconds to run base_add 100000 times
#it takes  0.713209548  seconds to run tc_add 100000 times
#it takes  0.07377745000000002  seconds to run fast_tc_add 100000 times

The new type checker is roughly ten times faster than the original one. Given that adding a "useless" decorator (one extra function invocation) adds about 0.017 seconds of overhead, getting down to 0.07 seconds with fast_tc_add, which involves essentially two extra function invocations, is about as good as it gets.

Hack 2 - auto type check

Adding a decorator to every function you have written is very, very ugly. What if we could:

  1. At the end of each module, access all the variables declared in the module scope.
  2. Pick out the variables that belong to the "current" module and are functions.
  3. Wrap those functions with the type_check_fast decorator.

Save the following as tc.py:

import inspect
import typing
from functools import wraps
fn_s = """
def magic_func {0}:
    {1}
                """

def type_check_fast(fn):
    sig = inspect.signature(fn)
    annotation = typing.get_type_hints(fn)
    return_type = annotation.pop('return',None)
    if annotation:
        assert_str = 'assert ' + ' and '.join(["isinstance({k},{v})".format(k=k,v=v.__name__) for k,v in annotation.items()])
        exec(fn_s.format(sig,assert_str))
        func = locals()['magic_func']
    @wraps(fn)
    def deced(*args,**kwargs):
        if annotation:
            func(*args,**kwargs)
        return_value = fn(*args,**kwargs)
        if return_type:
            assert isinstance(return_value,return_type)
        return return_value
    return deced

def auto_dec(name,dic_locals):
    for k,v in dic_locals.items():
        if hasattr(v,'__module__') and v.__module__ == name and inspect.isfunction(v):
            dic_locals[k] = type_check_fast(v)

Then, in another .py file, put auto_dec(__name__,locals()) after all functions are declared:

from tc import auto_dec

def add(a:int,b:int)->int:
    return a+b

def otherfunc(a:int,b:int)->int:
    return a+b

def otherotherfunc(a:int,b:int)->int:
    return a+b

auto_dec(__name__,locals())

if __name__ == '__main__' :
    print(add(1,2))
    print(otherfunc(1,2))
    print(otherotherfunc('nah','got string'))

Result:

3
3
Traceback (most recent call last):

  File "/Projects/aw/test.py", line 17, in <module>
    print(otherotherfunc('nah','got string'))
  File "/Projects/aw/tc.py", line 20, in deced
    func(*args,**kwargs)
  File "<string>", line 3, in magic_func
AssertionError

Source code of this post can be found here.