DataCamp course notes on data visualization, dictionaries, pandas, logic, control flow and filtering, and loops.
Data Visualization
Matplotlib
Line graph
import matplotlib.pyplot as plt
Scatter plot
import matplotlib.pyplot as plt
Histogram
- Explore dataset
- Get idea about distribution
import matplotlib.pyplot as plt
# values is a list or array of numbers defined earlier
# bins defaults to 10; the more bins, the finer the detail of the distribution's shape
plt.hist(values, bins=3)
plt.show()
plt.clf()  # clears the figure so that the next plot starts fresh
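To see what the bins argument does without drawing anything, the bin counts can be computed directly. This is a minimal sketch, assuming NumPy is available; the values list is illustrative data, not taken from the notes. np.histogram returns the counts per bin and the bin edges, which is the information a histogram displays.

```python
import numpy as np

# Illustrative data (an assumption, not from the notes)
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]

# Split the range [0, 6] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(counts)  # number of values that fall into each bin
print(edges)   # the 4 edges that delimit the 3 bins
```

With more bins the same data is split into narrower intervals, so the counts describe the distribution in finer detail.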
Customization
import matplotlib.pyplot as plt
Dictionaries
- make it easier to connect two lists without using index as a bridge
Syntax:
{key1: value1, key2: value2, ...}
dict_name[key]
returns the value stored under that key.
world = {'a': 1, 'b': 2, 'c': 3}
world['a']  # returns 1
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}
# Print out the keys
europe.keys()
# Add a new key-value pair to the existing dictionary
world['d'] = 4
'd' in world  # now this returns True
# Update the value of an existing key
world['d'] = 5
# Delete a key-value pair
del world['d']
- Keys have to be unique.
- Keys have to be "immutable" objects. "Immutable" means that an object's content cannot be changed after it is created, e.g. strings, booleans, and integers. Lists are mutable, since their content can be changed after creation, so they cannot be used as keys.
world = {0: 'hello', True: 'dear', 'two': 'world'}  # a valid dictionary
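The two rules for keys can be checked directly. The following is a small sketch in plain Python; the dictionary contents and variable names are illustrative assumptions.

```python
# Strings, integers, and tuples are immutable, so they work as keys
valid = {'two': 'world', 0: 'hello', (1, 2): 'a tuple key'}

# A list is mutable, so using one as a key raises TypeError
try:
    invalid = {['just', 'to', 'test']: 'value'}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# Keys are unique: assigning to an existing key overwrites its value
valid['two'] = 'updated'

print(list_key_allowed)
print(valid['two'], len(valid))
```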
List vs Dictionary
- Similarity: both use square brackets [] to select, update, and remove elements.
- Difference: a list is indexed by a range of numbers, while a dictionary is indexed by unique keys.
- Use a list when you have:
  - a collection of values
  - where the order matters
  - and you want to easily select entire subsets
- Use a dictionary when you need a lookup table with unique keys and want lookups to be fast.
Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
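A nested dictionary is read by chaining square brackets. The sketch below reuses the 'spain' entry from the notes; the other entries ('exampleland', 'placeholderia') are made-up placeholders added only so the structure is complete.

```python
# Only the 'spain' entry comes from the notes; the rest is placeholder data
europe = {
    'spain': {'capital': 'madrid', 'population': 46.77},
    'exampleland': {'capital': 'example city', 'population': 1.0},
}

# Chain square brackets: the first selects the country, the second the field
spain_capital = europe['spain']['capital']

# Add a new country by assigning a sub-dictionary to a new key
europe['placeholderia'] = {'capital': 'placeholder town', 'population': 2.5}

print(spain_capital)
print(sorted(europe.keys()))
```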
Pandas
pandas is a high-level data manipulation tool built on NumPy. It stores tabular data in a DataFrame. A NumPy array is less useful for this purpose, because the columns of a table may have different types while a NumPy array holds values of a single type.
DataFrame from Dictionary
d = {'country':['Brazil', 'Russia', 'India'],
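A complete version of this pattern might look like the sketch below, assuming pandas is installed. The 'country' values come from the notes; the 'capital' column and the row labels are assumptions added for illustration.

```python
import pandas as pd

# Keys become column names, lists become column values
d = {
    'country': ['Brazil', 'Russia', 'India'],
    'capital': ['Brasilia', 'Moscow', 'New Delhi'],  # assumed second column
}
brics = pd.DataFrame(d)

# Replace the default 0, 1, 2 index with row labels (assumed labels)
brics.index = ['BR', 'RU', 'IN']

print(brics)
```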
DataFrame from csv file
import pandas as pd
Index and Select Data
Square brackets: limited functionality
# Column access
brics['country']  # a pandas Series, which can be thought of as a 1D labelled array; a DataFrame is several Series pasted together
brics[['country']]  # this keeps the DataFrame type
brics[['country', 'capital']]
# Row access
brics[1:4]  # select the rows at positions 1 to 3
Advanced methods: loc (label-based), iloc (integer position-based)
loc
# Row access
brics.loc['RU'] #Row as pandas series
brics.loc[['RU']] #row as dataframe
brics.loc[['RU', 'IN']]
#Row & Column
brics.loc[['RU', 'IN'], ['country', 'capital']]
#Column access
brics.loc[:, ['country', 'capital']]
iloc
# Row access
brics.loc[['RU']] #row as dataframe
brics.iloc[[1]]
brics.loc[['RU', 'IN']]
brics.iloc[[1, 2]]
#Row & Column
brics.loc[['RU', 'IN'], ['country', 'capital']]
brics.iloc[[1, 2], [0, 1]]
#Column access
brics.loc[:, ['country', 'capital']]
brics.iloc[:, [0, 1]]
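ix: a way to combine loc and iloc (not covered in the course). Note that ix was deprecated and has been removed from pandas since version 1.0, so use loc and iloc instead.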
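The correspondence between loc and iloc can be verified on a small DataFrame. This is a sketch under assumptions: the three-row brics table below is rebuilt by hand for illustration rather than loaded from the course's CSV file.

```python
import pandas as pd

# Hand-built stand-in for the brics DataFrame (an assumption)
brics = pd.DataFrame(
    {'country': ['Brazil', 'Russia', 'India'],
     'capital': ['Brasilia', 'Moscow', 'New Delhi']},
    index=['BR', 'RU', 'IN'],
)

# Same selection, once by label and once by integer position
by_label = brics.loc[['RU', 'IN'], ['country', 'capital']]
by_position = brics.iloc[[1, 2], [0, 1]]
same_selection = by_label.equals(by_position)

# Single brackets return a Series, double brackets keep a DataFrame
row_as_series = brics.loc['RU']
row_as_frame = brics.loc[['RU']]

print(same_selection)
print(type(row_as_series).__name__, type(row_as_frame).__name__)
```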
Logic, Control Flow and Filtering
Comparison Operators
Comparison operators tell how two Python values relate to each other and return a boolean.
Numeric operators
2 < 3   # True
2 > 3   # False
2 == 3  # False
2 <= 3  # True
2 >= 3  # False
2 != 3  # True
Other comparisons
'carl' < 'chris'  # True: strings are compared alphabetically
Boolean operators: and, or, not
# and
True and True  # returns True
x = 12
x > 5 and x < 15  # returns True
# or
False or True  # returns True
# not
not True  # returns False
When working with NumPy arrays, use np.logical_and(), np.logical_or(), and np.logical_not() instead, since they operate element-wise.
bmi = np.array([21.852, 20.975, 21.75, 24.757, 21.441])
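A short sketch of how these functions are typically used to filter an array, assuming NumPy is installed and that the first BMI value is 21.852. The thresholds 21 and 22 are chosen only for illustration.

```python
import numpy as np

bmi = np.array([21.852, 20.975, 21.75, 24.757, 21.441])

# "bmi > 21 and bmi < 22" fails for arrays, because "and" needs a single boolean;
# np.logical_and compares the two boolean arrays element by element instead
mask = np.logical_and(bmi > 21, bmi < 22)

# A boolean array can be used to select the matching elements
selected = bmi[mask]

print(mask)
print(selected)
```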
Conditional Statements: if, else, elif
z = 4
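A minimal sketch of an if / elif / else chain in plain Python. The divisibility checks and the describe function are illustrative assumptions; only the first condition that is true runs, and else catches everything left over.

```python
def describe(z):
    """Return a message about z using an if / elif / else chain."""
    if z % 2 == 0:
        return 'z is divisible by 2'
    elif z % 3 == 0:
        # only reached when the first condition is False
        return 'z is divisible by 3'
    else:
        return 'z is neither divisible by 2 nor by 3'

z = 4
print(describe(z))
print(describe(3))
print(describe(5))
```

Note that describe(6) takes the first branch: once a condition is true, the remaining branches are skipped even if they would also be true.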
Filtering Pandas DataFrame
brics[brics['area'] > 8]  # select the rows that satisfy the condition
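The filter works in two steps: the comparison produces a boolean Series, and that Series selects the rows. The sketch below assumes pandas and NumPy are installed; the table is hand-built for illustration, with approximate areas in millions of square kilometers.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for brics; areas are approximate (assumed values)
brics = pd.DataFrame(
    {'country': ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
     'area': [8.516, 17.10, 3.286, 9.597, 1.221]},
    index=['BR', 'RU', 'IN', 'CH', 'SA'],
)

# Step 1: the comparison gives one boolean per row
is_large = brics['area'] > 8

# Step 2: the boolean Series keeps only the rows marked True
large = brics[is_large]

# Two conditions are combined element-wise with np.logical_and
mid_sized = brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]

print(large)
print(mid_sized)
```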
Loops
While loop
A while loop works like a repeated if statement: it keeps executing its body as long as the condition is true. This is useful for repeating the same step over and over until a particular condition is met, at which point the loop stops.
Example
- Error starts at 50
- Divide error by 4 on every run
- Continue until error is no longer greater than 1
error = 50.0
while error > 1:
    error = error / 4
    print(error)
For loop
Basics
fam = [1.73, 1.68, 1.71, 1.89]
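A basic for loop over this list could look like the sketch below, assuming the first height is 1.73. The use of enumerate to get each index alongside its value, and the lines list, are illustrative choices rather than something stated in the notes.

```python
fam = [1.73, 1.68, 1.71, 1.89]

# Plain loop: height takes each value of the list in turn
for height in fam:
    print(height)

# enumerate yields (index, value) pairs, so both are available in the loop
lines = []
for index, height in enumerate(fam):
    lines.append('index ' + str(index) + ': ' + str(height))

for line in lines:
    print(line)
```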
Looping Data Structures
- Dictionary : use a method.
for key, val in my_dict.items() :
- Numpy Array: use a function.
for val in np.nditer(my_array) :
- DataFrame: use a method.
for lab, row in my_df.iterrows() :
# Dictionary
world = {'afghanistan': 30.55,
         'albania': 2.77,
         'algeria': 39.21}
# Print out the keys and corresponding values
# The loop variables can have other names, such as k and v, but their position matters:
# the first is the key, the second is the value.
for key, value in world.items():
    # In Python 3.7+ dictionaries keep insertion order; older versions give no ordering guarantee
    print(key + ' -- ' + str(value))
# NumPy arrays
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
for val in bmi:
    print(val)
## 2D NumPy arrays
meas = np.array([np_height, np_weight])
for val in np.nditer(meas):
    print(val)  # prints every element one by one; a plain "for val in meas" would print row by row
# DataFrame
import pandas as pd
brics = pd.read_csv('brics.csv', index_col=0)
for val in brics:
    print(val)  # this only prints the column names
# To loop over each row in a DataFrame, use iterrows()
for lab, row in brics.iterrows():  # lab is the row label, and row is the data of that row as a Series
    print(lab)
    print(row)
## Selective print
for lab, row in brics.iterrows():
    print(lab + ': ' + row['capital'])
## Add a column to the existing DataFrame
for lab, row in brics.iterrows():
    # inefficient when the dataset is large, since it creates a Series in each iteration
    brics.loc[lab, 'name_length'] = len(row['country'])
### Using apply instead of a for loop
brics['name_length'] = brics['country'].apply(len)
### When applying a method instead of a function
cars['COUNTRY'] = cars['country'].apply(str.upper)  # cars is another DataFrame with a 'country' column
Tips:
print(x, end = ' ')
The end argument specifies the string printed after the output (a newline by default), so consecutive printouts can be separated by a space instead of a line break.
Case Study: Hacker Statistics
To generate random numbers
np.random.seed(123)
The same seed generates the same sequence of random numbers, which ensures reproducibility.
np.random.rand()
This generates a random float between 0 and 1 (1 not included).
np.random.randint(0, 2)
This randomly generates 0 or 1, since the upper bound 2 is not included.
# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Starting step
step = 50
# Roll the dice
dice = np.random.randint(1,7)
# Finish the control construct
if dice <= 2:
    step = step - 1
elif dice < 6:
    step = step + 1
else:
    step = step + np.random.randint(1, 7)
# Print out dice and step
print(dice)
print(step)
Simulate random walk
# Import numpy and set seed
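A single random walk can be sketched by repeating the dice rule in a loop and recording every step in a list. This is an illustrative reconstruction, not the course's exact code: it assumes 100 throws, a start at step 0, and that the step can never drop below 0.

```python
import numpy as np

np.random.seed(123)

# The walk starts at step 0; each throw appends the new position
random_walk = [0]

for x in range(100):
    step = random_walk[-1]          # current position
    dice = np.random.randint(1, 7)  # roll a die: 1 to 6

    if dice <= 2:
        step = max(0, step - 1)     # go down one step, but never below 0
    elif dice <= 5:
        step = step + 1             # go up one step
    else:
        step = step + np.random.randint(1, 7)  # roll again and go up that many steps

    random_walk.append(step)

print(random_walk)
```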
Visualize the random walk
# Import matplotlib.pyplot as plt
Distribution
What is the chance of reaching at least step 60 after throwing the dice 100 times?
- Each random walk has an end point
- Simulate this walk 10,000 times to get 10,000 end points, which form a distribution
- Calculate the chance from that distribution
Create a list that contains the paths of 10 random walks and visualize it
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
for i in range(10):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        # Implement clumsiness (there is a 0.1% chance that the climber falls and restarts from step 0)
        if np.random.rand() <= 0.001:
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
np_aw = np.array(all_walks)  # Convert all_walks to a NumPy array: np_aw
plt.plot(np_aw)
plt.show()
plt.clf()  # Clear the figure
# Transpose np_aw: np_aw_t. Each row now contains the 1st, 2nd, 3rd... step of all the random walks,
# so each walk is drawn as one line.
np_aw_t = np.transpose(np_aw)
plt.plot(np_aw_t)
plt.show()
ends = np_aw_t[-1]  # Select the last row from np_aw_t: the end point of every walk
plt.hist(ends)  # Plot a histogram of ends and display the plot
plt.show()
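The remaining step, calculating the chance, is the fraction of end points that reach the target. The sketch below is an assumption-laden illustration: simulate_ends is a hypothetical helper, it runs 500 walks instead of 10,000 to keep the run short, and "reaching 60 steps" is read as ending at step 60 or higher.

```python
import numpy as np

def simulate_ends(n_walks, n_throws=100, seed=123):
    """Return the end point of each simulated walk (hypothetical helper)."""
    np.random.seed(seed)
    ends = []
    for _ in range(n_walks):
        step = 0
        for _ in range(n_throws):
            dice = np.random.randint(1, 7)
            if dice <= 2:
                step = max(0, step - 1)
            elif dice <= 5:
                step = step + 1
            else:
                step = step + np.random.randint(1, 7)
            # clumsiness: 0.1% chance of falling back to step 0
            if np.random.rand() <= 0.001:
                step = 0
        ends.append(step)
    return np.array(ends)

ends = simulate_ends(500)

# ends >= 60 is a boolean array; its mean is the fraction of True values
chance = np.mean(ends >= 60)
print(chance)
```

Increasing n_walks toward 10,000 makes the estimate more stable, at the cost of a longer run.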