DataCamp course notes on iterators, list comprehensions and generators.
Iterators
For loop
We can use a for loop to loop over a list, a string, or over a range object. The reason why we can iterate over these objects is that they are iterables.
Iterables:
- Examples: Lists, strings, dictionaries, file connections are all iterables.
- Definition: An object with an associated iter() method.
- Applying iter() to an iterable creates an iterator. This is actually what the for loop is doing under the hood.
Iterator:
- Definition: An object with an associated next() method that produces the consecutive values.
To sum up:
- an iterable is an object that can return an iterator,
- while an iterator is an object that keeps state and produces the next value when you call next() on it.
To create an iterator from an iterable, all we need to do is to use the function iter() and pass in the iterable.
word = 'Da'  # 'word' is an iterable
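A minimal sketch of the iter()/next() protocol described above, using the same short string. The variable names `it`, `first`, `second` and `exhausted` are illustrative choices, not part of the original notes.

```python
word = 'Da'          # a string is an iterable
it = iter(word)      # iter() returns an iterator over the string

first = next(it)     # produces 'D'
second = next(it)    # produces 'a'

# A third call has nothing left to produce, so it raises StopIteration.
# This is the signal a for loop relies on to know when to stop.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
```

Once an iterator is exhausted it stays exhausted; to go through the values again, call iter() on the iterable once more.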
To iterate over the key-value pairs of a dictionary, use for key, value in my_dict.items(): in the for loop. Iterating over the dictionary directly yields only its keys.
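A short sketch of dictionary iteration, assuming a small made-up dictionary (the contents of `my_dict` are hypothetical):

```python
# Hypothetical example data
my_dict = {'hawkeye': 'archer', 'thor': 'god of thunder'}

# Iterating over the dictionary itself yields only the keys
keys_only = list(my_dict)

# .items() yields (key, value) tuples, which the for loop can unpack
pairs = list(my_dict.items())

for key, value in my_dict.items():
    print(key, value)
```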
To iterate over file connections:
file = open('file.txt')
Using enumerate()
enumerate() is a function that takes any iterable as an argument, such as a list, and returns a special enumerate object, which consists of pairs containing the elements of the original iterable, along with their index within the iterable. We can use the function list() to turn this enumerate object into a list of tuples (index, element), and print it to see what it contains.
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
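The following sketch continues from the `avengers` list above. The names `e` and `e_list` and the `start=10` value are illustrative assumptions.

```python
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']

e = enumerate(avengers)   # a lazy enumerate object, not yet a list
e_list = list(e)          # materialize it as a list of (index, element) tuples
print(e_list)

# enumerate objects can be unpacked directly in a for loop;
# the optional start argument changes the first index
for index, value in enumerate(avengers, start=10):
    print(index, value)
```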
Using zip()
zip() accepts an arbitrary number of iterables and returns an iterator of tuples (list1_element1, list2_element1, list3_element1).
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
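A sketch of zip() with two lists. The second list, `names`, is an assumption added for illustration; only `avengers` appears in the notes.

```python
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']  # illustrative second list

z = zip(avengers, names)   # a lazy zip object (an iterator of tuples)
z_list = list(z)           # consume it into a list of pairs
print(z_list)

# The * operator unpacks the pairs back into zip(), which "unzips" them
heroes, surnames = zip(*z_list)
print(heroes)
print(surnames)
```

Because a zip object is an iterator, it can be consumed only once: after list(z), a second list(z) would be empty.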
Using iterators to load large files
When a file is too large to be held in memory, we can load the data in chunks. We perform the desired operations on one chunk, store the result, discard the chunk and then load the next chunk of data. An iterator is helpful in this case.
We use the pandas function read_csv() and specify the chunk size with chunksize.
import pandas as pd
Applying the trick to the Twitter case
# Define count_entries()
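The body of count_entries() is not reproduced in these notes. As a stand-in, here is a sketch of the same chunk-and-accumulate idea using only the standard library (csv and itertools.islice) rather than pandas. The function names `read_in_chunks` and `count_column_values`, the column name `lang`, and the sample data are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def read_in_chunks(rows, chunksize):
    """Yield successive lists of at most `chunksize` items from an iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:       # nothing left to read
            return
        yield chunk


def count_column_values(csv_file, chunksize, colname):
    """Count how often each value occurs in one column, one chunk at a time."""
    counts = {}
    reader = csv.DictReader(csv_file)
    for chunk in read_in_chunks(reader, chunksize):
        # Only the current chunk is held in memory; the dict keeps the running totals
        for row in chunk:
            value = row[colname]
            counts[value] = counts.get(value, 0) + 1
    return counts


# Made-up data standing in for a large CSV file
sample = io.StringIO("lang,text\nen,hello\nfr,bonjour\nen,hi\nen,hey\nes,hola\n")
result = count_column_values(sample, chunksize=2, colname='lang')
print(result)
```

With pandas, the chunking step is handled by passing chunksize to read_csv() and looping over the chunks it yields; the accumulation into a dictionary follows the same pattern.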
List Comprehensions
A list comprehension collapses a for loop that builds a list into a single line. It creates lists from other lists, DataFrame columns, etc., and is more concise than the equivalent for loop since it takes only a single line of code.
Required Components:
- Iterable
- Iterator variable (represent members of iterable)
- Output expression
Suppose we have a list of numbers and want to create a new list that is the same as the old one except that each number has 1 added to it. Instead of using a for loop over multiple lines, we can use a list comprehension to do this in one line as follows:
nums = [12, 8, 21, 3, 16]
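A sketch of both versions side by side, starting from the `nums` list above. The names `new_nums_loop` and `new_nums` are illustrative.

```python
nums = [12, 8, 21, 3, 16]

# For loop version: several lines
new_nums_loop = []
for num in nums:
    new_nums_loop.append(num + 1)

# List comprehension version: output expression, iterator variable, iterable
new_nums = [num + 1 for num in nums]

print(new_nums)
```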
List comprehensions are not restricted to lists and can be used on any iterable.
# List comprehension with range()
We can also replace nested loops with list comprehensions:
pairs_1 = []
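Only the first line of the nested-loop example survives here, so the sketch below is an assumed completion: the ranges and the name `pairs_2` are illustrative.

```python
# Nested for loops building a list of pairs
pairs_1 = []
for num1 in range(0, 2):
    for num2 in range(6, 8):
        pairs_1.append((num1, num2))

# Equivalent list comprehension: the for clauses appear in the same order
pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]

print(pairs_2)
```

The single line is shorter, but with more than two for clauses readability tends to suffer.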
To create a matrix by list comprehension:
# Create a 5 x 5 matrix using a list of lists: matrix
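One way this might look, assuming each row should hold the values 0 through 4 (the notes do not show the matrix contents):

```python
# The inner comprehension builds one row; the outer one repeats it 5 times
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix row by row
for row in matrix:
    print(row)
```

Each row is a separate list object, so changing one row does not affect the others.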
Advanced Comprehensions
Conditionals on the iterable:
[num ** 2 for num in range(10) if num % 2 == 0]
Conditionals on the output expression:
[num ** 2 if num % 2 == 0 else 0 for num in range(10)]
Dictionary comprehensions to create dictionaries:
pos_neg = {num: -num for num in range(9)}
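To make the difference between the three forms concrete, the sketch below evaluates each one. The names `filtered` and `replaced` are illustrative.

```python
# Condition on the iterable: odd numbers are skipped entirely
filtered = [num ** 2 for num in range(10) if num % 2 == 0]

# Condition on the output expression: every number yields an element,
# but odd numbers are replaced by 0
replaced = [num ** 2 if num % 2 == 0 else 0 for num in range(10)]

# Dictionary comprehension: curly braces and a key: value output expression
pos_neg = {num: -num for num in range(9)}

print(filtered)
print(replaced)
print(pos_neg)
```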
Generators
A generator expression is very similar to a list comprehension, except that it does not construct a list or store one in memory. We can still iterate over the generator to produce the elements one at a time, as required. This becomes very useful when you don’t want to store the entire list in memory.
# list comprehension
Let’s say we want to iterate through a large number of integers from 0 to 10 ** 1000000:
(num for num in range(10 ** 1000000))  # this can be done easily since it does not create an entire list
We can also apply conditions to generators:
even_num = (num for num in range(10) if num % 2 == 0)
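A sketch of how a generator expression behaves when consumed, plus a rough size comparison with the equivalent list. Everything except `even_num` is an illustrative name, and the 100000-element range is an arbitrary choice.

```python
import sys

even_num = (num for num in range(10) if num % 2 == 0)

first = next(even_num)      # values are produced only on request
second = next(even_num)
rest = list(even_num)       # consume whatever is left
leftover = list(even_num)   # the generator is now exhausted

print(first, second, rest, leftover)

# The list stores every element; the generator stores only its current state
squares_list = [num ** 2 for num in range(100000)]
squares_gen = (num ** 2 for num in range(100000))
list_is_bigger = sys.getsizeof(squares_list) > sys.getsizeof(squares_gen)
print(list_is_bigger)
```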
Generator Functions
Generator functions are functions that, when called, produce generator objects. A generator function yields a sequence of values instead of returning a single value. It is defined just like any other function, except that it produces its values with yield instead of return.
def num_sequence(n):
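Only the signature of num_sequence() survives in these notes. The body below is an assumed completion that yields the integers from 0 up to n - 1; the names `gen` and `values` are illustrative.

```python
import types


def num_sequence(n):
    """Generate the integers 0, 1, ..., n - 1 one at a time."""
    i = 0
    while i < n:
        yield i    # hand back one value, then pause here until the next request
        i += 1


gen = num_sequence(3)
is_generator = isinstance(gen, types.GeneratorType)  # calling the function runs no code yet
values = list(gen)                                   # iterating drives the function body

print(is_generator, values)
```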
Another example:
# Create a list of strings
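.items() and range() are similarly lazy: neither builds a list when called. .items() returns a view object and range() returns a range object, and both produce their values on demand when iterated over (strictly speaking, neither is a generator).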
Case Study: World bank data
# Define lists2dict()
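The body of lists2dict() is not shown in these notes. Judging by the name, it pairs up two lists into a dictionary; the sketch below is written under that assumption, and the sample column names and row values are made up.

```python
def lists2dict(list1, list2):
    """Return a dictionary with keys from list1 and values from list2."""
    # zip() pairs the elements by position; dict() turns the pairs into key: value entries
    return dict(zip(list1, list2))


# Made-up header and row, standing in for one line of the data file
feature_names = ['CountryName', 'CountryCode', 'Year']
row_vals = ['Arab World', 'ARB', '1960']

rs_dict = lists2dict(feature_names, row_vals)
print(rs_dict)
```

Applying such a function to every row, for example in a list comprehension, gives a list of dictionaries, which is a shape that pandas can turn into a DataFrame.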
Turn dictionary into dataframe
# Import the pandas package
Writing a generator to load data line by line
Use a generator to load a file line by line. This also works for streaming data, where more data is added to the dataset while you are processing it: the generator keeps reading and processing the file until all lines are exhausted.
Context manager: The csv file 'world_dev_ind.csv' is in the current directory for your use. To begin, you need to open a connection to this file using what is known as a context manager. For example, the command with open('datacamp.csv') as datacamp binds the csv file 'datacamp.csv' as datacamp in the context manager. Here, the with statement sets up the context manager, whose purpose is to ensure that resources are managed properly: the file is closed automatically when the with block ends, even if an error occurs inside it.
Rough thoughts:
# Open a connection to the file
# Define read_large_file()
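The body of read_large_file() is not shown in these notes. The sketch below is one plausible version: a generator function that reads from an already open file object with readline(). The temporary file and its contents are made up so that the example runs on its own.

```python
import os
import tempfile


def read_large_file(file_object):
    """Yield the lines of an open file one at a time until none remain."""
    while True:
        line = file_object.readline()
        if not line:       # readline() returns '' at the end of the file
            break
        yield line


# Write a small made-up CSV file to stand in for the real data
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("CountryName,Year\nArab World,1960\nArab World,1961\n")
    path = tmp.name

# The context manager closes the file when the block ends
with open(path) as file:
    lines = [line.rstrip('\n') for line in read_large_file(file)]

os.remove(path)
print(lines)
```

A file object is itself an iterator over its lines, so for line in file: gives the same result; writing the generator by hand mainly shows what happens under the hood.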
Writing an iterator to load data in chunks
# Define plot_pop()