DataCamp - Intermediate Python

DataCamp course notes on data visualization, dictionaries, pandas, logic, control flow and filtering, and loops.

Data Visualization

Matplotlib

Line graph
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop) #(x, y)
plt.show()# before this step, you can add labels or other things to customize your plot.
Scatter plot
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
col = ['red', 'green', 'blue', 'yellow'] # example colors, one per point
plt.scatter(year, pop, s = pop, c = col, alpha = 0.8) # s sets the size of each bubble, c sets the colors, alpha sets the opacity (0 to 1)
plt.xscale('log') # show the x-axis on a logarithmic scale
plt.show()
Histogram
  • Explore dataset
  • Get idea about distribution
    import matplotlib.pyplot as plt
    values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6] # example data
    # bins defaults to 10; the more bins, the finer the detail of the distribution's shape
    plt.hist(values, bins = 3)
    plt.show()
    plt.clf() # clears the figure so that you can start fresh
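To see what the bins argument does without drawing anything, the counts can be computed directly with np.histogram, which plt.hist relies on for its binning. This is only a small sketch; the values list is example data assumed for illustration.

```python
import numpy as np

# Example data, assumed for illustration
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]

# Split the range [0, 6] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin boundaries: 0, 2, 4, 6
print(counts)  # number of values per bin: 4, 6, 2
```

With more bins the same data is cut into narrower intervals, which is why the histogram shows more detail.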

Customization

import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop) #(x, y)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10],['0', '2B', '4B', '6B', '8B', '10B']) # set the tick positions on the y-axis and the labels shown at them
plt.text(1990, 5.5, '5.263B') # show text at a certain (x, y), given in data coordinates
plt.grid(True) #show gridlines
plt.show()# before this step, you can add labels or other things to customize your plot.

Dictionaries

  • make it easier to connect two lists without using index as a bridge
  • syntax: {key1:value1, key2:value2 ...}. dict_name[key] will return the corresponding value

    world = {'a':1, 'b':2, 'c':3}
    world['a'] # will return 1

    europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
    # print out the keys
    europe.keys()

    # To add data into the existing dictionary
    world['d'] = 4
    'd' in world #now this will return True
    # To update data in a dictionary
    world['d'] = 5
    # To delete a pair of values
    del(world['d'])
  • keys have to be unique.

  • keys have to be “immutable” objects. “Immutable” means that an object’s content cannot be changed after it is created, e.g. strings, booleans, integers. Lists are mutable objects, since you can change their content after creation, so they cannot be used as keys.
world = {0:'hello', True:'dear', 'two':'world'} #a valid dictionary

# {['just', 'to', 'test']: 'value'} # invalid dictionary: raises a TypeError because a list cannot be a key
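The rule about immutable keys can be checked directly. The sketch below uses a tuple (immutable) as a key and catches the error raised for a list; the coords dictionary and its values are made up for illustration.

```python
# A tuple is immutable, so it can be used as a dictionary key
coords = {(40.4, -3.7): 'madrid', (48.9, 2.4): 'paris'}
print(coords[(40.4, -3.7)])  # madrid

# A list is mutable, so Python refuses to use it as a key
try:
    bad = {['just', 'to', 'test']: 'value'}
except TypeError as err:
    message = str(err)
    print(message)
```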

List vs Dictionary

  • Similarity: both use square brackets [] to select, update and remove elements.
  • Difference: a list is indexed by a range of numbers, while a dictionary is indexed by unique keys.
  • Use a list when:
  1. you have a collection of values
  2. the order matters
  3. you want to easily select entire subsets
  • Use a dictionary when you need a lookup table with unique keys and want the lookup to be fast.
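As a small illustration of the difference, the sketch below finds the capital of Germany twice: once with two parallel lists, where the index is the bridge, and once with a dictionary. The list names countries and capitals are chosen here for the example.

```python
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# With two lists, the index is the bridge between them
ind_ger = countries.index('germany')
print(capitals[ind_ger])  # berlin

# With a dictionary, the key leads straight to the value
europe = dict(zip(countries, capitals))
print(europe['germany'])  # berlin
```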

Dictionary of dictionaries

europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
'france': { 'capital':'paris', 'population':66.03 },
'germany': { 'capital':'berlin', 'population':80.62 },
'norway': { 'capital':'oslo', 'population':5.084 } }
# To subset element in the sub-dictionary
europe['france']['capital']

# Create sub-dictionary data
data = {'capital':'rome', 'population':59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

Pandas

Pandas is a high-level data manipulation tool built on Numpy. It stores tabular data in a DataFrame. A Numpy array is less useful here because it holds a single data type, while the columns of a table often have different types.
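A quick way to see why a Numpy array struggles with mixed types is to build one from text and numbers and compare it with a DataFrame. This is a minimal sketch with two made-up rows, not part of the course material.

```python
import numpy as np
import pandas as pd

# A Numpy array holds a single type, so mixed input is converted to strings
arr = np.array([['Brazil', 8.516], ['Russia', 17.10]])
print(arr.dtype.kind)  # 'U' means unicode string

# A DataFrame keeps one type per column
df = pd.DataFrame({'country': ['Brazil', 'Russia'], 'area': [8.516, 17.10]})
print(df.dtypes)
```

The area column stays numeric in the DataFrame, so it can still be used in calculations.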

DataFrame from Dictionary

d = {'country':['Brazil', 'Russia', 'India'],
'capital':['Brasilia', 'Moscow', 'New Delhi'],
'area':[8.516, 17.10, 3.286],
'population':[200.4, 143.5, 1252]}
import pandas as pd
brics = pd.DataFrame(d)
#To specify the row index
brics.index = ['BR', 'RU', 'IN']

DataFrame from csv file

import pandas as pd
brics = pd.read_csv("path/to/brics.csv", index_col = 0) #index_col = 0 helps to set the first column as the row index.

Index and Select Data

  • Square brackets: limited functionality

    #Column Access
    brics['country'] # this is a pandas series, which can be thought of as a 1D labelled array. A dataframe is several series pasted together
    brics[['country']] # this keeps the dataframe type
    brics[['country', 'capital']]

    #Row Access
    brics[1:4] # select the rows at positions 1 to 3 (the end of the slice is excluded)
  • Advanced methods: loc(label-based), iloc(integer position-based)

loc

#Row access
brics.loc['RU'] #Row as pandas series
brics.loc[['RU']] #row as dataframe
brics.loc[['RU', 'IN']]

#Row & Column
brics.loc[['RU', 'IN'], ['country', 'capital']]

#Column access
brics.loc[:, ['country', 'capital']]

iloc

#Row access
brics.loc[['RU']] #row as dataframe
brics.iloc[[1]]

brics.loc[['RU', 'IN']]
brics.iloc[[1, 2]]

#Row & Column
brics.loc[['RU', 'IN'], ['country', 'capital']]
brics.iloc[[1, 2], [0, 1]]

#Column access
brics.loc[:, ['country', 'capital']]
brics.iloc[:, [0, 1]]
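The pairs above can be checked against each other. The sketch below rebuilds a three-row version of brics and confirms that a loc selection and its iloc counterpart return the same data; it also shows one place where the two differ, namely slices.

```python
import pandas as pd

brics = pd.DataFrame({'country': ['Brazil', 'Russia', 'India'],
                      'capital': ['Brasilia', 'Moscow', 'New Delhi'],
                      'area': [8.516, 17.10, 3.286]},
                     index=['BR', 'RU', 'IN'])

by_label = brics.loc[['RU', 'IN'], ['country', 'capital']]
by_position = brics.iloc[[1, 2], [0, 1]]
print(by_label.equals(by_position))  # True

# Slices differ: loc includes the end label, iloc excludes the end position
print(len(brics.loc['BR':'RU']))  # 2
print(len(brics.iloc[0:1]))       # 1
```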

ix: a way to combine loc and iloc (not covered in the course). It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc and iloc instead.

Logic, Control Flow and Filtering

Comparison Operators

Comparison operators tell how two Python values relate to each other, and the result is a boolean.

  • Numeric operators

    2 < 3
    2 > 3
    2 == 3
    2 <= 3
    2 >= 3
    2 != 3
  • Other comparisons

    'carl' < 'chris' #according to alphabet
  • Boolean operators: and, or, not

    #and
    True and True #returns true
    x = 12
    x > 5 and x < 15 #returns true

    #or
    False or True #returns true

    #not
    not True #returns false
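Two details of the operators above are worth a short sketch. First, a range test such as x > 5 and x < 15 can also be written as a chained comparison. Second, strings are compared character by character using their Unicode code points, so uppercase letters sort before lowercase ones. The example strings are made up.

```python
x = 12
print(x > 5 and x < 15)  # True
print(5 < x < 15)        # True, the same test written as a chain
print(not (5 < x < 15))  # False

# String comparison follows Unicode code points, not a dictionary ordering
print('carl' < 'chris')  # True: 'a' comes before 'h'
print('Zoe' < 'adam')    # True: uppercase 'Z' comes before lowercase 'a'
```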

When working with Numpy: logical_and(), logical_or(), and logical_not()

import numpy as np
bmi = np.array([21.852, 20.975, 21.75, 24.757, 21.441])
#logical_and()
np.logical_and(bmi > 21, bmi < 22) #returns an array of T/F
bmi[np.logical_and(bmi > 21, bmi < 22)] #subset the desired data
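As a side note to the functions above, Numpy arrays also support the & operator for elementwise "and". The sketch below shows that it gives the same mask as np.logical_and, and why the plain and keyword does not work on arrays.

```python
import numpy as np

bmi = np.array([21.852, 20.975, 21.75, 24.757, 21.441])

# The & operator works elementwise; each comparison needs its own parentheses
mask = (bmi > 21) & (bmi < 22)
print(np.array_equal(mask, np.logical_and(bmi > 21, bmi < 22)))  # True
print(bmi[mask])

# Plain `and` fails because an array has no single truth value
try:
    bmi > 21 and bmi < 22
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```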

Conditional Statements: if, else, elif

z = 4
if z % 2 == 0 :
    print('z is even') # if this branch runs, the elif and else branches are skipped
elif z % 3 == 0 :
    print('z is divisible by 3')
else :
    print('z is odd') # else needs no condition

Filtering Pandas DataFrame

brics[brics['area'] > 8] # select rows that satisfy the condition

import numpy as np
brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]

Loops

While loop

A while loop is like a repeated if statement: it runs the same block over and over as long as its condition is true, and stops once the condition becomes false. This is helpful when a step has to be repeated until a particular condition is met.

Example

  • Error starts at 50
  • Divide error by 4 on every run
  • Continue until error no longer > 1
    error = 50.0
    while error > 1 :
        error = error / 4
        print(error)
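To make the stopping behaviour concrete, the same loop can record each value instead of printing it. The history list is a name added here for illustration.

```python
error = 50.0
history = []  # values of error after each pass through the loop
while error > 1:
    error = error / 4
    history.append(error)

print(history)  # [12.5, 3.125, 0.78125]
```

The loop runs three times: after the third division the error is 0.78125, the condition error > 1 is false, and the loop ends.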

For loop

Basics
fam = [1.73, 1.68, 1.71, 1.89]
for height in fam :
    print(height)

# To have access to the index
for index, height in enumerate(fam) :
    print('index ' + str(index) + ': ' + str(height)) # the output looks like index 0: 1.73

# Loop over a string
for c in "family" :
    print(c) # prints each letter in the word

# Example
# house list of lists
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["living room", 20.0],
         ["bedroom", 10.75],
         ["bathroom", 9.50]]
# Build a for loop from scratch
for room in house:
    print('the ' + room[0] + ' is ' + str(room[1]) + ' sqm')
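The house loop can also be written with unpacking, so that each inner list is split into two named variables, and with an f-string instead of string concatenation. This is one possible variation, not the course's solution; the shortened house list and the lines list are used here only to keep the example small.

```python
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["living room", 20.0]]

lines = []
# start=1 makes enumerate count from 1; name and area unpack each inner list
for number, (name, area) in enumerate(house, start=1):
    lines.append(f'{number}. the {name} is {area} sqm')

print('\n'.join(lines))
```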
Looping Data Structures
  • Dictionary : use a method. for key, val in my_dict.items() :
  • Numpy Array: use a function. for val in np.nditer(my_array) :
  • DataFrame: use a method. for lab, row in my_df.iterrows() :
    # Dictionary
    world = {'afghanistan':30.55,
             'albania':2.77,
             'algeria':39.21}

    # To print out the keys and corresponding values
    # key and value can be given other names, such as k and v, but the position matters: the first is the key, the second is the value
    for key, value in world.items() :
        print(key + ' -- ' + str(value)) # since Python 3.7 the output follows insertion order; in older versions dictionaries were unordered

    # Numpy Arrays
    import numpy as np
    np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
    np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
    bmi = np_weight / np_height ** 2
    for val in bmi :
        print(val)

    ## 2D Numpy Arrays
    meas = np.array([np_height, np_weight])
    for val in np.nditer(meas) :
        print(val) # prints every element one by one: all heights first, then all weights

    # DataFrame
    import pandas as pd
    brics = pd.read_csv('brics.csv', index_col = 0)

    for val in brics :
        print(val) # this only prints the column names

    # To loop over each row in a dataframe: iterrows()
    for lab, row in brics.iterrows() : # lab is the row label, and row is the data of that row as a pandas series
        print(lab)
        print(row)

    ## Selective print
    for lab, row in brics.iterrows() :
        print(lab + ': ' + row['capital'])

    ## Add column to the existing dataframe
    for lab, row in brics.iterrows() :
        brics.loc[lab, 'name_length'] = len(row['country']) # inefficient when the dataset is large, since it creates a series in each iteration

    ### Using apply without a for loop
    brics['name_length'] = brics['country'].apply(len)
    ### When applying a method instead of a function
    brics['COUNTRY'] = brics['country'].apply(str.upper)
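For string columns, pandas also offers the .str accessor, which applies a string operation to the whole column without apply or a loop. The sketch below is an alternative to the apply calls above, using a small stand-in for brics built in place so that it runs without the csv file.

```python
import pandas as pd

# Small stand-in for brics, assumed for illustration
brics = pd.DataFrame({'country': ['Brazil', 'Russia', 'India']},
                     index=['BR', 'RU', 'IN'])

brics['name_length'] = brics['country'].str.len()  # length of each name
brics['COUNTRY'] = brics['country'].str.upper()    # uppercase copy of each name
print(brics)
```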

Tips:

  1. print(x, end = ' '): the end argument specifies what is printed after x. The default is a newline, so end = ' ' keeps the next printout on the same line, separated by a space.
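The effect of end is easiest to see in the raw output. The sketch below captures what print writes into a string buffer (using the standard library's redirect_stdout, which is only there to make the result inspectable) and then shows it with repr.

```python
import io
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    for c in "fam":
        print(c, end=' ')  # a space instead of the default newline
    print()                # finish the line

output = buffer.getvalue()
print(repr(output))  # 'f a m \n'
```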

Case Study: Hacker Statistics

To generate random numbers

np.random.seed(123): the same seed generates the same sequence of random numbers, which ensures reproducibility
np.random.rand(): generates a random float between 0 and 1 (0 included, 1 excluded)
np.random.randint(0, 2): randomly generates 0 or 1, since the upper bound 2 is not included

# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Starting step
step = 50

# Roll the dice
dice = np.random.randint(1,7)

# Finish the control construct
if dice <= 2 :
    step = step - 1
elif dice < 6:
    step = step + 1
else :
    step = step + np.random.randint(1, 7)

# Print out dice and step
print(dice)
print(step)
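The rules in the control construct can be checked for every possible roll by moving them into a small function. next_step and its bonus parameter are hypothetical names introduced here; bonus stands in for the second dice roll so that the result is deterministic.

```python
def next_step(step, dice, bonus):
    """Return the new step for one dice roll; bonus replaces the second roll used on a 6."""
    if dice <= 2:
        return step - 1
    elif dice < 6:
        return step + 1
    else:
        return step + bonus

# Try every dice value from step 50, with a fixed bonus of 3
for dice in range(1, 7):
    print(dice, next_step(50, dice, bonus=3))
```

Rolls of 1 and 2 go down one step, rolls of 3, 4 and 5 go up one step, and a 6 moves up by the bonus roll.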

Simulate random walk

# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Initialize random_walk
random_walk = [0]

for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)

    if dice <= 2:
        # use max to make sure step can't go below 0
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    random_walk.append(step)

print(random_walk)

Visualize the random walk

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Plot random_walk
plt.plot(random_walk)

# Show the plot
plt.show()

Distribution

What is the chance that you reach at least step 60 after throwing the dice 100 times?

  • Each random walk has an end point
  • Simulate this walk 10,000 times, then you get 10,000 end points: Distribution
  • Calculate the chances

Create a list that contains the path of 10 random walks and visualize it

import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
for i in range(10) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        # Implement clumsiness (there is a 0.001 chance that the climber falls and restarts from step 0)
        if np.random.rand() <= 0.001 :
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)

np_aw = np.array(all_walks) # Convert all_walks to Numpy array: np_aw
plt.plot(np_aw)
plt.show()
plt.clf() # Clear the figure
np_aw_t = np.transpose(np_aw) # Transpose np_aw: np_aw_t. Each row now holds one step number across all walks, so each column (one walk) is plotted as its own line.
plt.plot(np_aw_t)
plt.show()

ends = np_aw_t[-1] # Select last row from np_aw_t: ends
plt.hist(ends)# Plot histogram of ends, display plot
plt.show()
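The last step in the list above, calculating the chances, can be sketched as follows. The helper simulate_end and the sample size of 500 walks are assumptions made here to keep the run short; the rules of the walk are the ones used in the code above.

```python
import numpy as np

np.random.seed(123)

def simulate_end(n_steps=100, p_fall=0.001):
    """Return the final step of one random walk (hypothetical helper)."""
    step = 0
    for _ in range(n_steps):
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        # Clumsiness: small chance of falling back to step 0
        if np.random.rand() <= p_fall:
            step = 0
    return step

n_walks = 500  # assumed sample size; more walks give a steadier estimate
ends = np.array([simulate_end() for _ in range(n_walks)])

# Fraction of walks that end at step 60 or higher
chance = np.mean(ends >= 60)
print(chance)
```

The comparison ends >= 60 gives an array of booleans, and its mean is the share of True values, which serves as the estimated chance.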