Numpy - multidimensional data arrays

1.3. Numpy - multidimensional data arrays#

1.3.1. Introduction#

Numpy is not part of the “standard library”, but it might as well be for engineers. Numpy is Python’s answer to Matlab - the “back end” is implemented in C so its performance is very fast (comparable to Matlab).

!pip install numpy

Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.7/site-packages (1.17.2)

import numpy as np

1.3.2. Creating `numpy` arrays#

There are a number of ways to initialize new numpy arrays, for example from

a Python list or tuples
using functions that are dedicated to generating numpy arrays, such as arange, linspace, etc.
reading data from files

# a vector: the argument to the array function is a Python list
v = np.array([1,2,3,4])

print(v)

# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])

print(M)

type(v), type(M)

[1 2 3 4]
[[1 2]
 [3 4]]

(numpy.ndarray, numpy.ndarray)

The difference between the v and M arrays is only their shapes. We can get information about the shape and size of an array by using the shape and size properties.

print(v.shape)
print(M.shape)
print(v.size)
print(M.size)

(4,)
(2, 2)
4
4

Arrays are similar to lists, but they must contain a single type:

M[0,0] = "hello"

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-43-e1f336250f69> in <module>
----> 1 M[0,0] = "hello"

ValueError: invalid literal for int() with base 10: 'hello'

If we want, we can explicitly define the type of the array data when we create it, using the dtype keyword argument:

M = np.array([[1, 2], [3, 4]], dtype=complex)

M

array([[1.+0.j, 2.+0.j],
       [3.+0.j, 4.+0.j]])

1.3.2.1. Creating arrays with functions#

It is often more efficient to generate large arrays instead of creating them from lists. There are a few useful functions for this in numpy:

np.arange - create a range with a specified step size (endpoints not included)
np.linspace - create a range with a specified number of points (endpoints are included)
np.logspace - create a range with a specified number of points in log space (endpoints are included)
np.mgrid - create points on a multi-dimensional grid (similar to meshgrid in matlab)
np.random.rand - create random number matrix from a uniform distribution
np.random.randn - create random number matrix from a standard normal distribution
np.zeros - create a matrix of zeros
np.ones - create a matrix of ones
np.eye - create identity matrix

x = np.arange(0, 10, 0.5) # arguments: start, stop, step
print(x)
x = np.linspace(0,10,15)
print(x)
x = np.logspace(0,3,10,base=10)
print(x)
print([np.log10(xi) for xi in x])

[0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5 6.  6.5 7.  7.5 8.  8.5
 9.  9.5]
[ 0.          0.71428571  1.42857143  2.14285714  2.85714286  3.57142857
  4.28571429  5.          5.71428571  6.42857143  7.14285714  7.85714286
  8.57142857  9.28571429 10.        ]
[   1.            2.15443469    4.64158883   10.           21.5443469
   46.41588834  100.          215.443469    464.15888336 1000.        ]
[0.0, 0.33333333333333337, 0.6666666666666666, 1.0, 1.3333333333333333, 1.6666666666666665, 2.0, 2.333333333333333, 2.6666666666666665, 3.0]

x, y = np.mgrid[0:5, 0:5] # similar to meshgrid in MATLAB
print(x)
print(y)

[[0 0 0 0 0]
 [1 1 1 1 1]
 [2 2 2 2 2]
 [3 3 3 3 3]
 [4 4 4 4 4]]
[[0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]]

# uniform random numbers in [0,1]
rand_uniform = np.random.rand(3,3)
print(rand_uniform)
# standard normal distributed random numbers
rand_normal = np.random.randn(3,3)
print(rand_normal)

[[0.48012987 0.47237413 0.69472084]
 [0.96773638 0.90948751 0.98234481]
 [0.04180248 0.35583907 0.86649923]]
[[ 0.74037594  0.44313288 -0.1305822 ]
 [-0.27459825 -0.6022369   0.80975182]
 [ 1.92298721  1.21303185  1.75858636]]

z = np.zeros((3,3)) #note that these take 1 tuple argument instead of multiple integers
one = np.ones((3,3))
I = np.eye(3,3) #but not this one... this is an annoying inconsistency.
print(z)
print(one)
print(I)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

1.3.3. File I/O#

Numpy has built-in functionality for reading/writing CSV or TSV (tab-separated value) files

Consider the following example:

!head stockholm_td_adj.dat

1  1    -6.1    -6.1    -6.1 1
1  2   -15.4   -15.4   -15.4 1
1  3   -15.0   -15.0   -15.0 1
1  4   -19.3   -19.3   -19.3 1
1  5   -16.8   -16.8   -16.8 1
1  6   -11.4   -11.4   -11.4 1
1  7    -7.6    -7.6    -7.6 1
1  8    -7.1    -7.1    -7.1 1
1  9   -10.1   -10.1   -10.1 1
1 10    -9.5    -9.5    -9.5 1

data = np.genfromtxt('stockholm_td_adj.dat')
print(data.shape)

(77431, 7)

Numpy can also write csv files from arrays:

M = np.random.rand(6,6)
np.savetxt("random-matrix.csv", M)
M1 = np.genfromtxt("random-matrix.csv")
print(M1==M)

[[ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True  True  True]]

1.3.4. Manipulating arrays#

Once we generate numpy arrays, we need to interact with them. This involves a few operations:

indexing - accessing certain elements
index “slicing” - accessing certain subsets of elements
fancy indexing - combinations of indexing and slicing

This is not very different from Matlab.

We can index elements in an array using square brackets and indices:

# v is a vector, and has only one dimension, taking one index
print(v[0])
# M is a matrix, or a 2 dimensional array, taking two indices 
print(M[1,1])
# If an index is ommitted then the whole row is returned
print(M[1])
# This means that we can also index with multiple brackets if we want to type more:
print(M[1][1] == M[1,1])

1
0.48592621526273416
[0.39320197 0.48592622 0.36324379 0.01187931 0.16797644 0.59601437]
True

The same thing can be achieved with using : instead of an index:

print(M[1,:]) # row 1
print(M[:,1]) # column 1

[0.39320197 0.48592622 0.36324379 0.01187931 0.16797644 0.59601437]
[0.00146124 0.48592622 0.63602386 0.34178976 0.26215253 0.90135028]

We can assign new values to elements or rows in an array using indexing:

M[0,0] = 1
print(M)
M[:,2] = -1
print(M)

[[1.         0.00146124 0.65320253 0.53819379 0.85305548 0.15992202]
 [0.39320197 0.48592622 0.36324379 0.01187931 0.16797644 0.59601437]
 [0.12604362 0.63602386 0.87630864 0.81996614 0.84820733 0.83141949]
 [0.86420059 0.34178976 0.24237393 0.82042579 0.46236983 0.72346029]
 [0.15369664 0.26215253 0.290678   0.70923425 0.72081884 0.79705541]
 [0.41168725 0.90135028 0.8154973  0.22475208 0.32129672 0.36781922]]
[[ 1.          0.00146124 -1.          0.53819379  0.85305548  0.15992202]
 [ 0.39320197  0.48592622 -1.          0.01187931  0.16797644  0.59601437]
 [ 0.12604362  0.63602386 -1.          0.81996614  0.84820733  0.83141949]
 [ 0.86420059  0.34178976 -1.          0.82042579  0.46236983  0.72346029]
 [ 0.15369664  0.26215253 -1.          0.70923425  0.72081884  0.79705541]
 [ 0.41168725  0.90135028 -1.          0.22475208  0.32129672  0.36781922]]

1.3.4.1. Index slicing#

Index slicing is the name for the syntax M[lower:upper:step] to extract a subset of an array.

A = np.arange(1,20)
print(A)
print(A[1:8:2])
print(A[1:8]) #This is the most common usage
print(A[5:])
print(A[-3:])

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[2 4 6 8]
[2 3 4 5 6 7 8]
[ 6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[17 18 19]

Array values can also be assigned using slicing:

A[1:3] = [-2,-3]
print(A)

[ 1 -2 -3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

Index slicing works exactly the same way for multidimensional arrays:

R = np.random.rand(10,10,10)
print(R.shape)
subR = R[3:5, 1:4, 0]
print(subR.shape)
print(subR)

(10, 10, 10)
(2, 3)
[[0.58999857 0.72976466 0.56025706]
 [0.03397562 0.33914918 0.33279598]]

1.3.4.2. Fancy indexing#

Fancy indexing is the name for when an array or list is used in-place of an index:

R = np.random.rand(4,4)
print(R)
print('-'*10)
row_indices = [1, 3]
print(R[row_indices])

[[0.66063383 0.78185722 0.08538858 0.50967112]
 [0.96025493 0.9992953  0.81209343 0.64494647]
 [0.88079298 0.35600591 0.59912385 0.27916972]
 [0.0380985  0.52180105 0.84137092 0.99267959]]
----------
[[0.96025493 0.9992953  0.81209343 0.64494647]
 [0.0380985  0.52180105 0.84137092 0.99267959]]

col_indices = [1, -1] # remember, index -1 means the last element
print(R[row_indices, col_indices])

[0.9992953  0.99267959]

1.3.4.3. Transposing arrays#

Arrays can easily be transposed with .T.

skinny = np.random.rand(8,2)
print(skinny)
print(skinny.shape)
fat = skinny.T
print(fat)
print(fat.shape)

[[0.8828413  0.95798832]
 [0.63045611 0.16988725]
 [0.00438886 0.78269257]
 [0.55362746 0.00678702]
 [0.92757275 0.74391061]
 [0.26054478 0.94333052]
 [0.27704329 0.62743922]
 [0.59401807 0.54693468]]
(8, 2)
[[0.8828413  0.63045611 0.00438886 0.55362746 0.92757275 0.26054478
  0.27704329 0.59401807]
 [0.95798832 0.16988725 0.78269257 0.00678702 0.74391061 0.94333052
  0.62743922 0.54693468]]
(2, 8)

1.3.5. Linear algebra in Numpy#

Formulating your code as matrix-matrix and matrix-vector operations in Numpy will make it much more efficient. We will briefly cover syntax for:

scalar*vector
scalar*matrix
matrix*vector
matrix*matrix
inverse
determinant
solve Ax=b

1.3.5.1. Scalar-array operations#

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.

v1 = np.arange(0, 5)
print(v1)
print('-'*10)
print(v1*2)
print('-'*10)
print(v1+2)

[0 1 2 3 4]
----------
[0 2 4 6 8]
----------
[2 3 4 5 6]

Same goes for matrices:

M = np.random.rand(2,2)
print(M)
print('-'*10)
print(M*2)
print('-'*10)
print(M+2)

[[0.80135151 0.98493697]
 [0.87576621 0.15106052]]
----------
[[1.60270302 1.96987393]
 [1.75153243 0.30212105]]
----------
[[2.80135151 2.98493697]
 [2.87576621 2.15106052]]

1.3.5.2. Element-wise array-array operations#

When we add, subtract, multiply and divide arrays with each other, the default behaviour is element-wise operations. This is different from Matlab!

v1 = np.arange(2,6)
print(v1)
print(v1*v1)
print(v1/v1)

print('-'*10)

M = np.array([[1,2],[3,4]])
print(M)
print(M*M)

[2 3 4 5]
[ 4  9 16 25]
[1. 1. 1. 1.]
----------
[[1 2]
 [3 4]]
[[ 1  4]
 [ 9 16]]

1.3.5.3. Matrix algebra#

What about matrix mutiplication?

use the dot function (recommended)
use the matrix class (+, *, - use matrix algebra)

A = np.eye(3,3)
v = np.array([1,2,3])
print(np.dot(A,v))
print(np.dot(A,A))
print(np.dot(v,v))

A = np.matrix(A)
v = np.matrix(v)
print(A*v.T)
print(A*A)
print(v*v.T)

[1. 2. 3.]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
14
[[1.]
 [2.]
 [3.]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[[14]]

1.3.5.4. Common matrix operations#

We can easily calculate the inverse and determinant using inv and det

A = np.array([[-1,2],[3,-1]])
print(A)
print(np.linalg.inv(A))
print(np.linalg.det(A))

[[-1  2]
 [ 3 -1]]
[[0.2 0.4]
 [0.6 0.2]]
-5.000000000000001

1.3.6. Data processing in with Numpy arrays#

Numpy provides a number of functions to calculate statistics of datasets in arrays.

For example, let’s calculate some properties from the Stockholm temperature dataset used above.

# reminder, the tempeature dataset is stored in the data variable:
print(data.shape)
print('Y: {}, M: {}, D: {}, Avg: {}, Low: {}, Hi: {}, Loc: {}'.format(*data[0, :]))

(77431, 7)
Y: 1800.0, M: 1.0, D: 1.0, Avg: -6.1, Low: -6.1, Hi: -6.1, Loc: 1.0

We can use numpy to easily calculate:

mean
standard deviation
variance
min/max

print(np.mean(data))
print(data.mean())
# the mean of the entire dataset is pretty meaningless...

# the temperature data is in column 3
print(data[:,3].mean())

0810699664401
0810699664401
197109684751585

#We can calculate standard deviation, variance, min, and max in the same way:
print('stdev:',np.std(data[:,3]))
print('variance:',np.var(data[:,3]))
print('min',np.min(data[:,3]))
print('max',np.max(data[:,3]))

#note that all of these are also *methods* of the array *class*
print(data[:,3].std())

stdev: 8.282271621340573
variance: 68.59602320966341
min -25.8
max 28.3
8.282271621340573

1.3.6.1. Calculations with higher-dimensional data#

Sometimes we want to apply an operation across a single dimension. For example, we might want the mean of very column. This is controlled with the axis argument:

avgs = np.mean(data,axis=0)
print(data.shape)
print(avgs.shape)
print(avgs)

(77431, 7)
(7,)
[1.90550041e+03 6.52304633e+00 1.57292945e+01 6.19710968e+00
 5.83217058e+00 5.78546190e+00 1.00000000e+00]

R = np.random.rand(3,3,3)
print(R.mean())
print(R.mean(axis=2))

0.5852023410024356
[[0.495667   0.53786811 0.62776447]
 [0.5308831  0.65776438 0.5564225 ]
 [0.67219267 0.55226669 0.63599215]]

print(R)

[[[0.20205199 0.43400518 0.85094382]
  [0.90902935 0.53537367 0.16920129]
  [0.86653719 0.18185072 0.8349055 ]]

 [[0.65254124 0.02925621 0.91085185]
  [0.53407727 0.9578823  0.48133358]
  [0.54450772 0.94190815 0.18285163]]

 [[0.68145727 0.80588745 0.52923329]
  [0.07569367 0.87914578 0.70196062]
  [0.11497311 0.95688812 0.83611522]]]

1.3.7. Reshaping and resizing arrays#

The shape of a Numpy array can be modified without copying the underlaying data, which makes it a fast operation even for large arrays. There are rules that govern how this reshaping takes place.

print(R.shape)
n,m,p = R.shape
Q = R.reshape((n, m*p))
print(Q.shape)
F = R.flatten() #the "flatten" function turns the whole array into a vector
print(F.shape)

(3, 3, 3)
(3, 9)
(27,)

Two common pitfalls in reshaping arrays:

Reshaping rules do not behave as expected
Reshaping provides a different “view” of the data, but does not copy it

print(R[0,0,0])
print(F[0])
print(R[0,1,0])
print(F[1])
print(F[3])

20205199044075084
20205199044075084
9090293527816719
4340051832902998
9090293527816719

print(R[0,0,0])
Q[0] = 10
print(R[0,0,0]) #resize does not copy the data
F[0] = 6
print(R[0,0,0]) #flatten makes copies

20205199044075084
0
0

1.3.7.1. Making “deep copy”#

If you really want a copy of an array, use the np.copy function:

A = np.array([[1, 2], [3, 4]])
print(A)
B = A
B[0,0] = 10
print(A)
Acopy = np.copy(A)
Acopy[1,1] = 6
print(A)

[[1 2]
 [3 4]]
[[10  2]
 [ 3  4]]
[[10  2]
 [ 3  4]]

1.3.8. Using arrays in conditions#

if statements and other boolean expressions are ambiguous with arrays.

any checks to see if any members are true/false
all checks to see if all members are true/false

print(M)
print(M>1)

[[1.+0.j 2.+0.j]
 [3.+0.j 4.+0.j]]
[[False  True]
 [ True  True]]

if (M > 1).any():
    print("at least one element in M is larger than 1")
else:
    print("no element in M is larger than 1")

at least one element in M is larger than 1

if (M > 1).all():
    print("all elements in M are larger than 1")
else:
    print("all elements in M are not larger than 1")

all elements in M are not larger than 1