I’m currently reading High Performance Python by Micha Gorelick & Ian Ozsvald, and it’s been really eye-opening how quickly and easily you can speed up CPU-intensive operations in Python using Cython. In this example, I’m going to calculate the powers of 5 million random numbers to demonstrate how easy it is to speed up this code and how much quicker it runs with Cython. I’ll be using Cython and line_profiler in these examples.
This is our baseline example.
Runtime: 11.9378s
main_func.py
import random

@profile
def total_of_pwrs(input_list):
    total = 0
    for i in range(len(input_list)):
        total += calc_power_of(input_list[i])
    return total

def calc_power_of(num):
    if num <= 5:
        # return num squared
        return pow(num, 2)
    else:
        # return num cubed
        return pow(num, 3)

if __name__ == "__main__":
    randomlist = [random.randrange(1, 10) for i in range(5_000_000)]
    total = total_of_pwrs(randomlist)
    print(total)
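The @profile decorator is for line_profiler; to collect the per-line timings yourself, you can run the script with something like kernprof -l -v main_func.py.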
The next step is to move calc_power_of into a cython_calc.pyx file and compile the extension module using a setup.py file. The compiled version of calc_power_of will then be imported into main_func.py. I haven’t done anything to calc_power_of inside of cython_calc.pyx yet - let’s see how that performs.
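For reference, here’s a minimal setup.py sketch for building the extension (this assumes the Cython file is named cython_calc.pyx, as above):
setup.py
from setuptools import setup
from Cython.Build import cythonize

# compile cython_calc.pyx into an importable extension module
setup(ext_modules=cythonize("cython_calc.pyx"))
Running python setup.py build_ext --inplace compiles the module so main_func.py can import it.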
Runtime: 8.03s
import cython_calc

def total_of_pwrs(input_list):
    total = 0
    for i in range(len(input_list)):
        total += cython_calc.calc_power_of(input_list[i])
    return total
Now, let’s add types.
Runtime: 6.98486s
cpdef int calc_power_of(int num):
    if num <= 5:
        # return num squared
        return pow(num, 2)
    else:
        # return num cubed
        return pow(num, 3)
This little amount of work resulted in an almost 40% reduction in execution time.
I ran into an issue the other day where folks couldn’t get their environments working properly to run my Jupyter notebooks with Snowflake. Surprisingly, it’s relatively simple to run Jupyter notebooks with the Snowflake connector using Docker. All I needed to do was start from the base notebook image that Jupyter provides, install snowflake-connector-python, and mount a volume.
Create a Dockerfile with
FROM jupyter/r-notebook
RUN pip install snowflake-connector-python
then build it with a repo and tag
docker build -t {repo}/{tag} .
then run
docker run --rm -p 8888:8888 -v "${PWD}":/home/jovyan/work {repo}/{tag}
Any files you save in the work folder will be saved to your local directory. Now you can run
import snowflake.connector
and follow Snowflake’s instructions to connect to your database.
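For reference, a minimal connection sketch looks something like this (the user, password, and account values below are placeholders - fill them in with your own Snowflake credentials):
import snowflake.connector

# placeholder credentials - replace with your own account details
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
)
cur = conn.cursor()
cur.execute("select current_version()")
print(cur.fetchone())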
The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. Excerpt From: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. “An Introduction to Statistical Learning.”
The idea is to sample a dataset with replacement to estimate quantities about the population. It’s powerful because it doesn’t require you to make assumptions about the shape of the underlying data, and it’s fairly straightforward to implement. Here we’ll walk through a quick implementation using SQL and DBT.
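To make the resampling idea concrete, here’s a quick Python sketch of the same idea before the SQL version (the yes/no data here is made up purely for illustration):
import random

# made-up data: 1s and 0s representing yes/no outcomes
data = [random.randint(0, 1) for _ in range(2000)]

# draw 100 bootstrap samples with replacement and record each sample's mean
sample_means = []
for _ in range(100):
    sample = random.choices(data, k=len(data))  # sampling with replacement
    sample_means.append(sum(sample) / len(sample))

# the spread of the sample means estimates the uncertainty of the observed mean
sample_means.sort()
print(sample_means[2], sample_means[97])  # rough 95% interval from the bootstrap distribution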
First, create the outer loop for the total number of samples that will make up your distribution. The inner data block holds all your transformations, and the order by random() limit 2000 returns 2000 random samples. If I were sampling the data for the occurrence of yeses and nos (1 or 0), then I would select avg(sample statistic) from the data. The union combines all your different samples into one table, so you’ll end up with one row per sample. Pretty neat, eh?
{% for num in range(1, 100) %}
(
    with data as (
        -- perform any and all transformations, then
        order by random() limit 2000
    )
    select sample statistic from data
)
{% if not loop.last %} union {% endif %}
{% endfor %}
By performing this in DBT, I was able to get it up and running fairly quickly with our existing data warehouse, without having to implement it in Python and deal with the queries there.
Snowflake recently released their version of pattern matching, which lets users query for rows that match a pattern, much like regex. At first glance it looks wordy, but it actually breaks down fairly easily, and I think you’ll pick it up quickly if you’re familiar with window functions.
Let’s say we’re interested in user churn - there are many ways to approach this problem, and most people have seen a user retention by cohort analysis graph or developed a clever way to track user account statuses over time. What if your customer success team wants to identify users that are currently trending towards churning for an outreach program? Pattern matching could be a really simple way to get that list.
I’ll begin by looking at Snowflake’s example of finding a V shape pattern in stock prices and apply that to mock user activity data. Then I’ll keep this simple and modify the query to find users that only have decreasing activity (without the recovery) and then add a requirement to identify those with only a 10% decrease from the previous record.
select * from mock_data
match_recognize(
    partition by user_id
    order by date
    measures
        match_number() as match_number,
        first(date) as start_date,
        last(date) as end_date,
        count(*) as rows_in_sequence,
        count(row_with_activity_decrease.*) as num_decreases,
        count(row_with_activity_increase.*) as num_increases
    one row per match
    after match skip to last row_with_activity_increase
    pattern(row_before_activity row_with_activity_decrease+ row_with_activity_increase+)
    define
        row_with_activity_decrease as user_activities < lag(user_activities),
        row_with_activity_increase as user_activities > lag(user_activities)
)
order by user_id, match_number
Snowflake has really great documentation, and the pattern being matched is this part of the query:
pattern(row_before_activity row_with_activity_decrease+ row_with_activity_increase+)
This is really similar to regex, where you define the pattern and attach a quantifier to match multiple instances of that pattern. We’re going to modify this pattern to identify only rows with decreasing user activity.
pattern(row_before_activity row_with_activity_decrease+)
We’ll also remove any reference to row_with_activity_increase in the measures and change “after match skip to last row_with_activity_increase” to skip to the last row_with_activity_decrease instead. The last step is to return only records where the drop in user activity is at least 10% from the prior time period.
select * from mock_data
match_recognize(
    partition by user_id
    order by date
    measures
        match_number() as match_number,
        first(date) as start_date,
        last(date) as end_date,
        count(*) as rows_in_sequence,
        count(row_with_activity_decrease.*) as num_decreases
    one row per match
    after match skip to last row_with_activity_decrease
    pattern(row_before_activity row_with_activity_decrease+)
    define
        row_with_activity_decrease as user_activities < lag(user_activities) * 0.90
)
order by user_id, match_number
And voilà! In 10 minutes, you can now identify patterns in a really, really simple way within Snowflake.