1.4. SciPy - Library of scientific algorithms for Python#

The SciPy framework builds on top of the low-level NumPy framework for multidimensional arrays, and provides a large number of higher-level scientific algorithms. Today we will discuss a few that are most useful for data science.

In addition, you may wish to review Lecture 3 of Johansson, which covers others.

We can import the entire scipy module, or specific functions:

import scipy as sp
from scipy import linalg as la
import numpy as np #we almost always need numpy too

1.4.1. Numerical integration: fixed data#

You can use the trapezoidal rule or Simpson’s rule to integrate (x,y) data:

from scipy.integrate import trapezoid, simpson # named trapz and simps in older SciPy versions

x = np.linspace(1,100,100)
x3 = x**3
x3_integral = ((100**4)/4) - (1/4)

x3_trapz = trapezoid(x3, x=x)
x3_simps = simpson(x3, x=x)

print(x3_trapz)
print(x3_simps)
print(x3_integral)
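
If you need the running (cumulative) integral rather than a single number, scipy.integrate also provides cumulative_trapezoid (called cumtrapz in older SciPy versions). A minimal sketch using the same data:

from scipy.integrate import cumulative_trapezoid

# running integral of x**3 from 1 up to each value of x
x3_cumulative = cumulative_trapezoid(x3, x, initial=0)

print(x3_cumulative[-1]) # the last entry matches the trapezoidal result above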

1.4.2. Numerical integration: quadrature#

Quadrature allows integration of the form:

\(\displaystyle \int_a^b f(x) dx\)

and is the most accurate approach when \(f(x)\) is known as a function rather than as fixed data points. SciPy can do single, double, or triple integrals with quad, dblquad, and tplquad.

from scipy.integrate import quad, dblquad, tplquad

def f(x):
    return x**3
x_lower = 1 # the lower limit of x
x_upper = 100 # the upper limit of x

val, abserr = quad(f, x_lower, x_upper)

print("integral value = {}, absolute error = {}, real answer= {}".format(val,abserr,x3_integral))

Extra arguments to the integrand can be handled with default parameter values, or passed explicitly through the args keyword of quad:

def f(x, n=3):
    return x**n

args = (3,) # extra arguments are passed to f after x
val, abserr = quad(f, x_lower, x_upper, args=args)

print(val, abserr)

Higher-dimensional integration works similarly, except that boundary “curves” have to be defined. This is a place where lambda functions are very useful.

def integrand(x, y):
    # dblquad calls the integrand as f(y, x); the argument order does not
    # matter here because the function is symmetric in x and y
    return np.exp(-x**2-y**2)

x_lower = -np.inf  
x_upper = np.inf #infinity can be used as a limit
y_lower = 0
y_upper = 10

val, abserr = dblquad(integrand, x_lower, x_upper, lambda x: y_lower, lambda x: y_upper)

print(val, abserr)
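
Triple integrals follow the same pattern with tplquad: the inner limits are given as functions of the outer variables. As a minimal sketch, the integral of \(xyz\) over the unit cube (which is 1/8 analytically):

def integrand3(z, y, x):
    # tplquad calls the integrand as f(z, y, x)
    return x * y * z

val, abserr = tplquad(integrand3, 0, 1,                 # x limits
                      lambda x: 0, lambda x: 1,         # y limits
                      lambda x, y: 0, lambda x, y: 1)   # z limits

print(val, abserr) # exact value is 1/8 = 0.125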

1.4.3. Interpolation#

Interpolation is simple and convenient in scipy: the interp1d function, given arrays of X and Y data, returns an object that behaves like a function. It can be called for an arbitrary value of x (within the range covered by X) and returns the corresponding interpolated y value:

from scipy.interpolate import interp1d
import matplotlib.pyplot as plt # needed for the plots below

def f(x):
    return np.sin(x)

n = np.arange(0, 10)  
x = np.linspace(0, 9, 100)

y_meas = f(n) + 0.1 * np.random.randn(len(n)) # simulate measurement with noise
y_real = f(x)

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(n, y_meas, 'bs', label='noisy data')
ax.plot(x, y_real, 'k', lw=2, label='true function')
linear_interpolation = interp1d(n, y_meas, kind='linear')
y_interp1 = linear_interpolation(x)

cubic_interpolation = interp1d(n, y_meas, kind='cubic')
y_interp2 = cubic_interpolation(x)

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(n, y_meas, 'bs', label='noisy data')
ax.plot(x, y_real, 'k', lw=2, label='true function')
ax.plot(x, y_interp1, 'r', label='linear interp')
ax.plot(x, y_interp2, 'g', label='cubic interp')
ax.legend(loc=3);
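
To compare the two interpolants quantitatively, we can look at the maximum deviation from the true function (the exact numbers depend on the random noise):

print("max error, linear:", np.max(np.abs(y_interp1 - y_real)))
print("max error, cubic: ", np.max(np.abs(y_interp2 - y_real)))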

1.4.4. Optimization#

Optimization (finding minima or maxima of a function) is a large field in mathematics, and optimization of complicated functions or in many variables can be rather involved. Here we will only look at a few very simple cases. For a more detailed introduction to optimization with SciPy see: http://scipy-lectures.github.com/advanced/mathematical_optimization/index.html

1.4.4.1. Finding a minimum#

Let’s first look at how to find the minimum of a simple function of a single variable:

from scipy import optimize

def f(x):
    return 4*x**3 + (x-2)**2 + x**4

fig, ax  = plt.subplots()
x = np.linspace(-5, 3, 100)
ax.plot(x, f(x));

There are many types of optimizers available. We will use the common BFGS and CG optimizers here, but you can read more in the documentation.

from scipy.optimize import minimize

x_min = minimize(f, 0.5, method='CG') # returns an OptimizeResult object
print(x_min)
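
This particular function has two local minima, so the answer depends on the starting point. A minimal sketch with the BFGS method and a different initial guess (the quoted locations are approximate):

# starting near x = -4 converges to the other local minimum (around x ≈ -2.67),
# while the CG run above, started at 0.5, found the one around x ≈ 0.47
x_min_bfgs = minimize(f, -4, method='BFGS')
print(x_min_bfgs.x)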

1.4.4.2. Solving an equation#

  • Solve equations via root finding by expressing the equation in the form \(f(x) = 0\).

  • Use the fsolve function, which requires an initial guess.

def f(x):
    return np.sin(3*x)*(1/x)
fig, ax  = plt.subplots(figsize=(10,4))
x = np.linspace(1, 10, 200)
y = f(x)
ax.plot(x, y)
ax.plot([0,11],[0,0],'k')
optimize.fsolve(f, 3)
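
fsolve only returns a root near the initial guess, so different guesses pick out different roots. A small sketch (the roots of \(\sin(3x)/x\) are at multiples of \(\pi/3\)):

for guess in (1, 2, 3):
    root = optimize.fsolve(f, guess)
    print("guess = {}, root = {}".format(guess, root[0]))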

1.4.5. Statistics#

The scipy.stats module contains a large number of statistical distributions, statistical functions and tests. For a complete documentation of its features, see http://docs.scipy.org/doc/scipy/reference/stats.html.

from scipy import stats

# create a (continuous) random variable with a normal distribution
Y = stats.norm()
print(Y)
type(Y)
x = np.linspace(-5,5,100)

fig, axes = plt.subplots(3,1, sharex=True) #< we will get to this later...

# plot the probability distribution function (PDF)
axes[0].plot(x, Y.pdf(x))

# plot the cumulative distribution function (CDF)
axes[1].plot(x, Y.cdf(x));

# plot histogram of 1000 random realizations of the stochastic variable Y
axes[2].hist(Y.rvs(size=1000), bins=50);
# We can also create a Poisson distribution (this is used in modeling the frequency of random events)

X = stats.poisson(3) # Poisson distribution with an expectation value of 3

n = np.arange(0,20)

fig, axes = plt.subplots(3,1, sharex=True)

# plot the probability mass function (PMF)
axes[0].step(n, X.pmf(n))

# plot the cumulative distribution function (CDF)
axes[1].step(n, X.cdf(n))

# plot histogram of 1000 random realizations of the stochastic variable X
axes[2].hist(X.rvs(size=1000),bins=n);

We can easily calculate the statistics of these distributions:

print(Y.mean(), Y.std(), Y.var()) # normal distribution

print(X.mean(), X.std(), X.var()) # Poisson distribution
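
Going in the other direction, the distribution classes can estimate their parameters from data with the fit method. A minimal sketch using samples drawn from Y:

samples = Y.rvs(size=1000)
loc, scale = stats.norm.fit(samples) # maximum-likelihood estimates of the mean and standard deviation
print(loc, scale) # should be close to 0 and 1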

1.4.5.1. Statistical tests#

We can use a t-test to check whether two sets of (independent) random data have the same mean. The null hypothesis is that they do:

t_statistic, p_value = stats.ttest_ind(Y.rvs(size=1000) + 3, X.rvs(size=1000), equal_var=False)
# shift the normal sample so both sets have mean 3; ttest_ind assumes equal
# variances by default, so we pass equal_var=False (Welch's t-test)

print("t-statistic =", t_statistic)
print("p-value =", p_value)

We can also use a 1-sample t-test to test whether the mean of a single sample of data is 0.1 (the true mean is 0.0):

t_statistic, p_value = stats.ttest_1samp(Y.rvs(size=1000), 0.1)
print("t-statistic =", t_statistic)
print("p-value =", p_value)

1.4.6. Linear algebra#

The linear algebra module of scipy contains more advanced matrix-related functions than numpy. Here we will look at:

  • linear equation solving

  • eigenvalue solvers

However, numerous other advanced features are available. The detailed documentation is available at: http://docs.scipy.org/doc/scipy/reference/linalg.html

1.4.6.1. Linear equation systems#

Linear equation systems in matrix form,

\(A x = b\)

where \(A\) is a matrix and \(x, b\) are vectors, can be solved as follows:

from scipy.linalg import solve

N = 3
A = np.random.rand(N,N)
b = np.random.rand(N)

x = solve(A, b)

print(x)
# check
np.dot(A, x) - b
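
When the same matrix has to be solved against many right-hand sides, it can pay to factorize it once and reuse the factorization. A minimal sketch using scipy.linalg.lu_factor and lu_solve:

from scipy.linalg import lu_factor, lu_solve

lu, piv = lu_factor(A)        # factorize A once
x_lu = lu_solve((lu, piv), b) # reuse the factorization for each right-hand side

print(np.allclose(x_lu, x)) # agrees with the solve() result above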

We can also do the same with

\(A X = B\)

where \(A, B, X\) are matrices:

A = np.random.rand(N,N)
B = np.random.rand(N,N)

X = solve(A, B)
# check
np.dot(A, X) - B

1.4.6.2. Eigenvalues and eigenvectors#

The eigenvalue problem for a matrix \(A\):

\(\displaystyle A v_n = \lambda_n v_n\)

where \(v_n\) is the \(n\)th eigenvector and \(\lambda_n\) is the \(n\)th eigenvalue.

To calculate the eigenvalues of a matrix, use eigvals; to calculate both eigenvalues and eigenvectors, use the function eig:

from scipy.linalg import eigvals, eig
evals = eigvals(A)
print(evals)

evals, evecs = eig(A)
print(evals)
print(evecs)

\(\displaystyle A v_n = \lambda_n v_n\)

The eigenvector corresponding to the \(n\)th eigenvalue (stored in evals[n]) is the \(n\)th column of evecs, i.e., evecs[:,n]. To verify this, let’s multiply the eigenvector by the matrix and compare the result to the product of the eigenvector and the eigenvalue:

n = 2

np.dot(A, evecs[:,n]) - evals[n] * evecs[:,n]
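
The same check can be done for all eigenpairs at once: multiplying \(A\) by evecs should scale each column by the corresponding eigenvalue (note that eig returns complex values in general):

# A @ evecs scales the n-th column of evecs by evals[n]
print(np.allclose(np.dot(A, evecs), evecs * evals))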