DataCamp - Python Data Science Toolbox (Part 2)

Datacamp course notes on iterators, list comprehensions and generators.

Iterators

For loop

We can use a for loop to loop over a list, a string, or over a range object. The reason why we can iterate over these objects is that they are iterables.

Iterables:

  • Examples: Lists, strings, dictionaries, and file connections are all iterables.
  • Definition: An object with an associated iter() method.
  • Applying iter() to an iterable creates an iterator. This is actually what the for loop is doing under the hood.

Iterator:

  • Definition: An object with an associated next() method that produces the consecutive values.

To sum up:

  • an iterable is an object that can return an iterator,
  • while an iterator is an object that keeps state and produces the next value when you call next() on it.
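The "under the hood" behavior of a for loop can be sketched as an explicit loop over iter() and next() (an illustrative sketch, not CPython's exact implementation):

```python
# A for loop is roughly equivalent to this while loop:
word = 'Da'
it = iter(word)  # ask the iterable for an iterator
letters = []
while True:
    try:
        letters.append(next(it))  # produce the next value
    except StopIteration:         # raised when the iterator is exhausted
        break
print(letters)  # ['D', 'a']
```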

To create an iterator from an iterable, all we need to do is to use the function iter() and pass in the iterable.

word = 'Da'     # 'word' is an iterable
it = iter(word) # 'it' is an iterator
next(it)        # returns 'D', the first letter in the string
next(it)        # returns 'a', the second letter in the string
next(it)        # no letters left: raises StopIteration

it = iter(word) # re-create the iterator, since the one above is exhausted
print(*it)      # the splat operator unpacks all remaining elements of an iterator
print(*it)      # prints nothing: the iterator can only be consumed once

To iterate over a dictionary's key-value pairs, for key, value in my_dict.items(): is necessary when writing the for loop.
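A minimal sketch of dictionary iteration with .items() (the dictionary below is a made-up example):

```python
# .items() yields (key, value) pairs, which we unpack in the for statement
pythonistas = {'hugo': 'bowne-anderson', 'francis': 'castro'}
for key, value in pythonistas.items():
    print(key, value)
```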

To iterate over file connections:

file = open('file.txt')
it = iter(file)
print(next(it)) #print the first line of the txt file
print(next(it)) #print the second line of the txt file

Using enumerate()

enumerate() is a function that takes any iterable as an argument, such as a list, and returns a special enumerate object, which consists of pairs containing the elements of the original iterable along with their index within the iterable. We can use the function list() to turn this enumerate object into a list of tuples (index, element), and print it to see what it contains.

avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
e = enumerate(avengers)
print(type(e)) #class 'enumerate'
e_list = list(e)
print(e_list) #[(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]

for index, value in enumerate(avengers, start = 10): # start specifies the index of the first element in the list
    print(index, value)

Using zip()

zip() accepts an arbitrary number of iterables and returns an iterator of tuples (list1_element1, list2_element1, list3_element1).

avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(type(z)) #class 'zip'
z_list = list(z)
print(z_list) #[('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]

for z1, z2 in zip(avengers, names):
    print(z1, z2)

# or simply use the splat operator (re-create z first, since list(z) above exhausted it)
z = zip(avengers, names)
print(*z) #('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')

# 'unzip' with the splat operator
z = zip(avengers, names)
avengers, names = zip(*z) #avengers = ('hawkeye', 'iron man', 'thor', 'quicksilver'), names = ('barton', 'stark', 'odinson', 'maximoff')

Using iterators to load large files

When the file is too large to be held in memory, we can load the data in chunks. We can perform the desired operations on one chunk, store the result, discard the chunk, and then load the next chunk of data. An iterator is helpful in this case.

We use the pandas function read_csv() and specify the chunk size with chunksize.

import pandas as pd

result = []
for chunk in pd.read_csv('data.csv', chunksize = 1000):
    result.append(sum(chunk['x'])) # we want to get the sum of column x
total = sum(result)
print(total)

# Another way
import pandas as pd

total = 0
for chunk in pd.read_csv('data.csv', chunksize = 1000):
    total += sum(chunk['x']) # we want to get the sum of column x
print(total)

Applying the trick to the Twitter case

# Import pandas
import pandas as pd

# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize = c_size):

        # Iterate over the column in the DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')

# Print result_counts
print(result_counts)

List Comprehensions

List comprehension can collapse a for loop for building a list into a single line. It creates lists from other lists, DataFrame columns, etc., and is more concise than a for loop; it is also typically faster, since the looping happens in optimized C code rather than line-by-line Python bytecode.

Required Components:

  1. Iterable
  2. Iterator variable (represents members of the iterable)
  3. Output expression

Suppose we have a list of numbers and we want to create a new list that is the same as the old list except that each number has 1 added to it. Instead of using a for loop spanning multiple lines, we can use a list comprehension to finish this operation in one line as follows:

nums = [12, 8, 21, 3, 16]
new_nums = [num + 1 for num in nums] #[output expression + iterator variable + iterable]

#The above line is the same as:
new_nums = []
for num in nums:
    new_nums.append(num + 1)

List comprehension is not restricted to lists, and can be used on any iterable.

# List comprehension with range()
result = [num for num in range(11)]
print(result) #[0, 1, 2, ..., 10]

We can also replace nested loops with list comprehensions:

pairs_1 = []
for num1 in range(0, 2):
    for num2 in range(6, 8):
        pairs_1.append((num1, num2))
print(pairs_1) #[(0, 6), (0, 7), (1, 6), (1, 7)]

# To perform the above code with a list comprehension
pairs_2 = [(num1, num2) for num1 in range(0, 2)
                        for num2 in range(6, 8)] #less readable
print(pairs_2) #[(0, 6), (0, 7), (1, 6), (1, 7)]

To create a matrix by list comprehension:

# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix
for row in matrix:
    print(row)

Advanced Comprehensions

Conditionals on the iterable:

[num ** 2 for num in range(10) if num % 2 == 0]

Conditionals on the output expression:

[num ** 2 if num % 2 == 0 else 0 for num in range(10)]

Dictionary comprehensions to create dictionaries:

pos_neg = {num: -num for num in range(9)}
print(pos_neg) #{0: 0, 1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 6: -6, 7: -7, 8: -8}

Generators

A generator expression is very similar to a list comprehension, except that it does not construct a list in memory. We can still iterate over the generator to produce the elements as required. It becomes very useful when you don't want to store an entire list in memory.

# list comprehension
[2 * num for num in range(10)]

# generator expression
(2 * num for num in range(10))

result = (num for num in range(6))
for num in result:
    print(num) #print element by element

result = (num for num in range(6)) #re-create: the generator above is exhausted
print(list(result)) #print a list of the elements

result = (num for num in range(6))
print(next(result)) #just like an iterator

Let’s say we want to iterate through a large number of integers from 0 to 10 ** 1000000:

(num for num in range(10 ** 1000000)) #this can be easily done since it has not created an entire list

We can also apply conditions to generators:

even_num = (num for num in range(10) if num % 2 == 0)

Generator Functions

Generator functions are functions that, when called, produce generator objects. A generator function yields a sequence of values instead of returning a single value. It is defined just like other functions, except that it generates values with yield instead of return.

def num_sequence(n):
    """Generate values from 0 to n."""
    i = 0
    while i < n:
        yield i
        i += 1

result = num_sequence(5)
print(type(result)) #class 'generator'
for item in result:
    print(item)

Another example:

# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of each string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)

dict.items() and range() behave lazily as well: rather than building a list, they return view and sequence objects that, like generators, produce values on demand.
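A quick check of this laziness (an illustrative sketch with made-up data):

```python
# Neither dict.items() nor range() returns a list
d = {'a': 1, 'b': 2}
items = d.items()
print(type(items))      # <class 'dict_items'>

r = range(10 ** 12)     # created instantly; no trillion-element list is built
print(type(r))          # <class 'range'>

print(list(d.items()))  # materialize only when needed: [('a', 1), ('b', 2)]
```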

Case Study: World bank data

# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

Turn dictionaries into a DataFrame

# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

Writing a generator to load data line by line

Use a generator to load a file line by line. If the data is streaming, that is, more data is added to the dataset while you are operating on it, the generator will keep reading and processing lines until the file is exhausted.

Context manager: The csv file 'world_dev_ind.csv' is in the current directory for your use. To begin, you need to open a connection to this file using what is known as a context manager. For example, the command with open('datacamp.csv') as datacamp binds the csv file 'datacamp.csv' as datacamp in the context manager. Here, the with statement is the context manager, and its purpose is to ensure that resources are efficiently allocated when opening a connection to a file.
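A minimal context-manager sketch ('datacamp.csv' is the stand-in name from the text; here we create a tiny file first so the example is self-contained):

```python
# Create a small csv file so the example can run anywhere
with open('datacamp.csv', 'w') as f:
    f.write('name,score\nhugo,10\n')

# The with statement binds the open file to the name 'datacamp'
with open('datacamp.csv') as datacamp:
    first_line = datacamp.readline()

print(first_line)       # name,score
print(datacamp.closed)  # True: the file was closed automatically on exit
```

Because the context manager closes the file even when an exception occurs inside the block, it is safer than calling open() and close() manually.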

Rough thoughts:

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names (tell python to read from the next row)
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict:
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)
# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict:
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print
print(counts_dict)

Writing an iterator to load data in chunks

# Import packages
import pandas as pd
import matplotlib.pyplot as plt

# Define plot_pop()
def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty DataFrame: data
    data = pd.DataFrame()

    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb (copy to avoid modifying a view of the chunk)
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code].copy()

        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                   df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1]) for tup in pops_list]

        # Append DataFrame chunk to data (DataFrame.append is deprecated; use pd.concat)
        data = pd.concat([data, df_pop_ceb])

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')