DataCamp course notes on data visualization, dictionaries, pandas, logic, control flow and filtering, and loops.

Data Visualization

Matplotlib

Line graph

import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop) #(x, y)
plt.show() # before this step, you can add labels or other things to customize your plot.

Scatter plot

import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
col = ['red', 'green', 'blue', 'yellow'] # example colors, one per point
plt.scatter(year, pop, s = pop, c = col, alpha = 0.8) # s controls the size of the bubbles, c controls their color, alpha controls their opacity (0-1)
plt.xscale('log') # show x-axis on a logarithmic scale
plt.show()

Histogram

Explore a dataset

Get an idea of the distribution

import matplotlib.pyplot as plt
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6] # example data
# bins = 10 by default; the more bins, the finer the detail of the distribution
plt.hist(values, bins = 3)
plt.show()
plt.clf() # this clears the current figure so that you can start fresh
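
To see what the bins argument does without drawing anything, the counts per bin can be computed directly with np.histogram, which matplotlib uses to bin the data. This is only a minimal sketch; the values list is made-up example data.

```python
import numpy as np

# Hypothetical example data
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]

# Split the range from min to max into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin boundaries
print(counts)  # number of values per bin
```

With fewer bins each bar covers a wider range, so detail is lost; with too many bins most bars hold only one or two values.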

Customization

import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop) #(x, y)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10], ['0', '2B', '4B', '6B', '8B', '10B']) # specify the tick positions on the y-axis and the labels shown for them
plt.text(1550, 71, 'India') # show text at a certain (x, y) in data coordinates; these values lie outside the range of this plot, so adjust them to your axes
plt.grid(True) # show gridlines
plt.show() # before this step, you can add labels or other things to customize your plot.

Dictionaries

Dictionaries make it easier to connect two lists of related values without using an index as a bridge.

Syntax: {key1:value1, key2:value2, ...}. dict_name[key] returns the corresponding value.

world = {'a':1, 'b':2, 'c':3}
world['a'] # will return 1

europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# print out the keys
europe.keys()

# To add data into the existing dictionary
world['d'] = 4
'd' in world #now this will return True
# To update data in a dictionary
world['d'] = 5
# To delete a key-value pair
del world['d']

Keys have to be unique.
Keys have to be “immutable” objects. “Immutable” means that the object's content cannot be changed after it is created, e.g. strings, booleans, integers. Lists are mutable objects, since you can change their content after creation, so they cannot be used as keys.

world = {0:'hello', True:'dear', 'two':'world'} #a valid dictionary

{['just', 'to', 'test']: 'value'} #invalid dictionary: raises a TypeError because a list is not hashable
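
The rule about immutable keys can be checked directly. The sketch below reuses the keys from the examples above and adds a tuple, which is an immutable alternative to a list; the variable names valid, invalid and message are only illustrative.

```python
# Immutable objects such as strings, integers and tuples work as keys
valid = {'two': 'world', 0: 'hello', ('just', 'to', 'test'): 'value'}
print(valid[('just', 'to', 'test')])

# A list is mutable, so using it as a key raises a TypeError
try:
    invalid = {['just', 'to', 'test']: 'value'}
except TypeError as err:
    message = str(err)
    print('TypeError:', message)
```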

List vs Dictionary

Similarity: both use [] to select, update and remove elements.
Difference: a list is indexed by a range of numbers, while a dictionary is indexed by unique keys.
Use a list when you have:

a collection of values
where the order matters
and you want to easily select entire subsets

Use a dictionary when you need a lookup table with unique keys and want the lookup to be quick.
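
The difference shows up when looking up a capital. This small sketch uses the europe data from above, once as two parallel lists and once as a dictionary:

```python
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# With two lists, the index of the country is the bridge to its capital
ind_ger = countries.index('germany')
capital_from_lists = capitals[ind_ger]

# With a dictionary, the key leads straight to the value
europe = dict(zip(countries, capitals))
capital_from_dict = europe['germany']

print(capital_from_lists, capital_from_dict)
```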

Dictionary of dictionaries

europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }
# To select an element in the sub-dictionary
europe['france']['capital']

# Create sub-dictionary data
data = {'capital':'rome', 'population':59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

Pandas

Pandas is a high-level data manipulation tool built on NumPy. It stores tabular data in a DataFrame. A NumPy array is less useful for this, since it holds a single data type while the columns of a table may be of different types.
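
A quick way to see the difference in type handling, as a minimal sketch with two rows of the brics data used below (arr and df are illustrative names):

```python
import numpy as np
import pandas as pd

# A NumPy array holds one data type: mixing text and numbers turns everything into strings
arr = np.array([['Brazil', 8.516], ['Russia', 17.10]])
print(arr.dtype)

# A DataFrame keeps a separate type for each column
df = pd.DataFrame({'country': ['Brazil', 'Russia'], 'area': [8.516, 17.10]})
print(df.dtypes)
```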

DataFrame from Dictionary

import pandas as pd
d = {'country':['Brazil', 'Russia', 'India'],
     'capital':['Brasilia', 'Moscow', 'New Delhi'],
     'area':[8.516, 17.10, 3.286],
     'population':[200.4, 143.5, 1252]}
brics = pd.DataFrame(d)
# To specify the row index
brics.index = ['BR', 'RU', 'IN']

DataFrame from csv file

import pandas as pd
brics = pd.read_csv("path/to/brics.csv", index_col = 0) #index_col = 0 sets the first column as the row index.

Index and Select Data

Square brackets: limited functionality

#Column Access
brics['country'] # this is a pandas Series, which can be thought of as a 1D labelled array. DataFrame = several Series pasted together
brics[['country']] # this keeps the DataFrame type
brics[['country', 'capital']]

#Row Access
brics[1:4] # select the rows at positions 1 to 3 (the end of the slice is not included)

Advanced methods: loc(label-based), iloc(integer position-based)

loc

#Row access
brics.loc['RU'] #Row as pandas series
brics.loc[['RU']] #row as dataframe
brics.loc[['RU', 'IN']]

#Row & Column
brics.loc[['RU', 'IN'], ['country', 'capital']]

#Column access
brics.loc[:, ['country', 'capital']]

iloc

#Row access
brics.loc[['RU']] #row as dataframe
brics.iloc[[1]] 

brics.loc[['RU', 'IN']]
brics.iloc[[1, 2]]

#Row & Column
brics.loc[['RU', 'IN'], ['country', 'capital']]
brics.iloc[[1, 2], [0, 1]]

#Column access
brics.loc[:, ['country', 'capital']]
brics.iloc[:, [0, 1]]
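
To check that a loc selection and an iloc selection pick the same cells, the two results can be compared. This is a small sketch on the three-row brics DataFrame built from the dictionary above; by_label and by_position are illustrative names.

```python
import pandas as pd

brics = pd.DataFrame({'country': ['Brazil', 'Russia', 'India'],
                      'capital': ['Brasilia', 'Moscow', 'New Delhi'],
                      'area': [8.516, 17.10, 3.286],
                      'population': [200.4, 143.5, 1252]},
                     index=['BR', 'RU', 'IN'])

# Label-based selection: rows 'RU' and 'IN', columns 'country' and 'capital'
by_label = brics.loc[['RU', 'IN'], ['country', 'capital']]

# Position-based selection: rows 1 and 2, columns 0 and 1
by_position = brics.iloc[[1, 2], [0, 1]]

print(by_label.equals(by_position))
```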

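ix was a way to combine loc and iloc (not covered in the course). It has since been deprecated and removed from pandas (as of version 1.0), so use loc and iloc instead.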

Logic, Control Flow and Filtering

Comparison Operators

Comparison operators tell how two Python values relate to each other, and result in a boolean.

Numeric operators

2 < 3
2 > 3
2 == 3
2 <= 3
2 >= 3
2 != 3

Other comparisons

'carl' < 'chris' # strings are compared alphabetically

Boolean operators: and, or, not

#and
True and True #returns True
x = 12
x > 5 and x < 15 #returns True

#or
False or True #returns True

#not
not True #returns False

When working with NumPy arrays, use np.logical_and(), np.logical_or() and np.logical_not()

import numpy as np
bmi = np.array([21.852, 20.975, 21.75, 24.757, 21.441])
#logical_and()
np.logical_and(bmi > 21, bmi < 22) #returns an array of True/False values
bmi[np.logical_and(bmi > 21, bmi < 22)] #subset the desired data
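
The reason for these functions is that the plain and, or and not keywords do not work elementwise on arrays. A short sketch using the same bmi values; and_failed, between, outside and extreme are illustrative names:

```python
import numpy as np

bmi = np.array([21.852, 20.975, 21.75, 24.757, 21.441])

# Plain `and` fails on arrays with more than one element: the truth value is ambiguous
try:
    bmi > 21 and bmi < 22
    and_failed = False
except ValueError:
    and_failed = True

between = np.logical_and(bmi > 21, bmi < 22)   # elementwise and
outside = np.logical_not(between)              # elementwise not
extreme = np.logical_or(bmi < 21, bmi > 24)    # elementwise or

print(bmi[between])
print(bmi[outside])
print(bmi[extreme])
```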

Conditional Statements: if, else, elif

z = 4
if z % 2 == 0 :
    print('z is even') # if this branch is executed, the elif and else branches are skipped.
elif z % 3 == 0 :
    print('z is divisible by 3')
else :
    print('z is odd') # else needs no condition
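
Only the first branch whose condition is true runs, which matters for a value like 6 that is both even and divisible by 3. The sketch below wraps the same construct in a hypothetical helper function, describe, so that several values can be tried:

```python
def describe(z):
    # Return the message of the first branch whose condition is true
    if z % 2 == 0:
        return 'z is even'
    elif z % 3 == 0:
        return 'z is divisible by 3'
    else:
        return 'z is odd'

for z in [4, 6, 9, 7]:
    print(z, '->', describe(z))
```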

Filtering Pandas DataFrame

brics[brics['area'] > 8] # select rows that satisfy the condition

import numpy as np
brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]
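
Filtering happens in two steps: a comparison builds a boolean Series, and that Series then selects the rows. The sketch below spells the steps out on a cut-down version of brics and also shows the & operator, which pandas accepts as an alternative to np.logical_and; is_huge and medium are illustrative names.

```python
import pandas as pd

brics = pd.DataFrame({'country': ['Brazil', 'Russia', 'India'],
                      'area': [8.516, 17.10, 3.286]},
                     index=['BR', 'RU', 'IN'])

# Step 1: the comparison gives a boolean Series, one True/False per row
is_huge = brics['area'] > 8
print(is_huge)

# Step 2: the boolean Series selects the rows where it is True
print(brics[is_huge])

# Two conditions combined with & (each condition needs its own parentheses)
medium = brics[(brics['area'] > 8) & (brics['area'] < 10)]
print(medium)
```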

Loops

While loop

A while loop works like an if statement that is repeated: the code block runs over and over for as long as the condition is true. This is helpful when the same step has to be repeated until a particular condition is met, at which point the loop stops.

Example

The error starts at 50
Divide the error by 4 on every run
Continue until the error is no longer > 1

error = 50.0
while error > 1 :
    error = error / 4
    print(error)
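
Tracing the loop makes the stopping rule visible. This variant of the example adds a counter, runs, which is not part of the original code:

```python
error = 50.0
runs = 0

while error > 1:
    error = error / 4
    runs = runs + 1
    print('run', runs, '-> error =', error)

# The condition is checked again before every run, so the loop stops once error <= 1
print('stopped after', runs, 'runs')
```

If the body never changed error, the condition would stay true and the loop would run forever.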

For loop

Basics

fam = [1.73, 1.68, 1.71, 1.89]
for height in fam :
    print(height)

#  To have access to the index
for index, height in enumerate(fam) :
    print('index ' + str(index) + ': ' + str(height)) #the output is like index 0: 1.73

# loop over strings
for c in "family" :
    print(c) #print out each letter in the word

# Example
# house list of lists
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
# Build a for loop from scratch
for room in house:
    print('the ' + room[0] + ' is ' + str(room[1]) + ' sqm')

Looping Data Structures

Dictionary: use a method. for key, val in my_dict.items() :
Numpy Array: use a function. for val in np.nditer(my_array) :
DataFrame: use a method. for lab, row in my_df.iterrows() :

# Dictionary
world = {'afghanistan':30.55,
         'albania':2.77,
         'algeria':39.21}

# To print out the keys and corresponding values
for key, value in world.items() : # key and value can be given other names, like k and v, but the position matters: the first is the key, the second is the value.
    print(key + ' -- ' + str(value)) # since Python 3.7 dictionaries keep insertion order, so the output follows the order defined above; older versions did not guarantee this

# Numpy Arrays
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
for val in bmi :
    print(val)

## 2D Numpy Arrays
meas = np.array([np_height, np_weight])
for val in np.nditer(meas) :
    print(val) # prints every element one by one: first all heights, then all weights (a plain "for val in meas" would print array by array)

# DataFrame
import pandas as pd
brics = pd.read_csv('brics.csv', index_col = 0)

for val in brics :
    print(val) #this will only print out the column names

# to loop over each row in a dataframe: iterrows().
for lab, row in brics.iterrows() : #lab is the row label, and row is the data of that row as a pandas Series.
    print(lab)
    print(row)

## Selective print
for lab, row in brics.iterrows() :
    print(lab + ': ' + row['capital'])

## Add column to the existing dataframe
for lab, row in brics.iterrows() : 
    brics.loc[lab, 'name_length'] = len(row['country']) #inefficient when the dataset is large, since a Series is created in each iteration

### Using apply instead of a for loop
brics['name_length'] = brics['country'].apply(len)
### When applying a method instead of a function (cars is another DataFrame with a 'country' column)
cars['COUNTRY'] = cars['country'].apply(str.upper)
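
To confirm that the loop and apply give the same result, both can be run side by side on a small DataFrame. This is a sketch on a cut-down brics; the column names len_loop and len_apply are made up for the comparison.

```python
import pandas as pd

brics = pd.DataFrame({'country': ['Brazil', 'Russia', 'India'],
                      'capital': ['Brasilia', 'Moscow', 'New Delhi']},
                     index=['BR', 'RU', 'IN'])

# Variant 1: loop over the rows and fill in the new column one cell at a time
for lab, row in brics.iterrows():
    brics.loc[lab, 'len_loop'] = len(row['country'])

# Variant 2: apply the function len to every element of the column
brics['len_apply'] = brics['country'].apply(len)

# Applying a string method works the same way
brics['COUNTRY'] = brics['country'].apply(str.upper)

print(brics)
```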

Tips:

print(x, end = ' '): the end argument specifies what is printed after x. The default is a newline; with end = ' ' the next printout continues on the same line, separated by a space.
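
The effect of end is easiest to see in the raw characters that are printed. The sketch below captures the printed text in a buffer (an illustrative trick, not from the course) and shows it with repr:

```python
import io
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    for c in 'abc':
        print(c)             # default end='\n': one letter per line
    for c in 'abc':
        print(c, end=' ')    # end=' ': letters stay on one line, each followed by a space
    print()                  # finish the line

output = buffer.getvalue()
print(repr(output))
```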

Case Study: Hacker Statistics

To generate random numbers

np.random.seed(123): the same seed generates the same sequence of random numbers, which ensures reproducibility
np.random.rand(): generates a random float between 0 and 1 (1 is not included)
np.random.randint(0, 2): randomly generates 0 or 1, since the upper bound 2 is not included
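
Reproducibility can be verified by resetting the seed and drawing again. A minimal sketch; first, second, coin and x are illustrative names:

```python
import numpy as np

np.random.seed(123)
first = [np.random.randint(1, 7) for _ in range(5)]   # five dice rolls, 1 to 6

np.random.seed(123)   # resetting the seed restarts the same sequence
second = [np.random.randint(1, 7) for _ in range(5)]

print(first)
print(second)
print(first == second)

coin = np.random.randint(0, 2)   # 0 or 1
x = np.random.rand()             # float from 0 up to, but not including, 1
```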

# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Starting step
step = 50

# Roll the dice
dice = np.random.randint(1,7)

# Finish the control construct
if dice <= 2 :
    step = step - 1
elif dice < 6:
    step = step + 1
else :
    step = step + np.random.randint(1, 7)

# Print out dice and step
print(dice)
print(step)

Simulate random walk

# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Initialize random_walk
random_walk = [0]

for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)

    if dice <= 2:
        # use max to make sure step can't go below 0
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    random_walk.append(step)

print(random_walk)

Visualize the random walk

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Plot random_walk
plt.plot(random_walk)

# Show the plot
plt.show()

Distribution

What is the chance that you reach step 60 after throwing the dice 100 times?

Each random walk has an end point
Simulate this walk 10,000 times to get 10,000 end points: a distribution
Calculate the chances from that distribution

Create a list that contains the path of 10 random walks and visualize it

import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
for i in range(10) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        # Implement clumsiness (there is a 0.001 chance that the climber falls and restarts from step 0)
        if np.random.rand() <= 0.001 :
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)

np_aw = np.array(all_walks) # Convert all_walks to a Numpy array: np_aw
plt.plot(np_aw) # each row is one walk, but plt.plot draws one line per column, so this plot is hard to read
plt.show()
plt.clf() # Clear the figure
np_aw_t = np.transpose(np_aw) # Transpose np_aw: np_aw_t. Each row now contains the 1st, 2nd, 3rd... step of all the random walks, so each walk is drawn as one line.
plt.plot(np_aw_t)
plt.show()

ends = np_aw_t[-1] # Select the last row from np_aw_t: ends
plt.hist(ends) # Plot a histogram of ends, then display the plot
plt.show()
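
The last step, calculating the chances, can be sketched as the share of end points at step 60 or higher. This assumes that "reaching 60 steps" means ending at 60 or above, and it uses 500 walks instead of 10,000 only to keep the run short; n_walks, np_ends and chance are illustrative names.

```python
import numpy as np

np.random.seed(123)

n_walks = 500   # assumption: fewer than 10,000 walks, to keep the run short
ends = []
for i in range(n_walks):
    step = 0
    for x in range(100):
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        # Clumsiness: 0.1% chance of falling back to step 0
        if np.random.rand() <= 0.001:
            step = 0
    ends.append(step)   # only the end point of each walk is kept

np_ends = np.array(ends)

# Share of walks that end at step 60 or higher
chance = np.mean(np_ends >= 60)
print(chance)
```

The comparison np_ends >= 60 gives an array of True/False values, and the mean of that array is the fraction of True entries, which serves as the estimate.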