DataCamp - Importing Data (Part 1)

DataCamp course notes on importing data from various sources and in various file types.

Importing Text Files

Types of text files:

  1. Plain text files
  2. Table data: files containing records (flat files)

Importing Plain Text Files

To read a plain text file, we can use Python's basic open() function to open a connection to the file:

filename = 'huck_finn.txt'
file = open(filename, mode='r')  # 'r' = read, to make sure we are only reading it, not writing to it
text = file.read()
file.close()  # make sure you close the connection after saving the contents to a variable
print(text)

# Note: if you want to write to a file
filename = 'huck_finn.txt'
file = open(filename, mode='w')  # 'w' = write

To avoid having to close the connection to the file yourself, use a with statement. This creates a context in which you can execute commands with the file open; once out of this context, the file is no longer open. That's why with is called a context manager.

with open('huck_finn.txt', 'r') as file:
    print(file.read())

# Read line by line
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Exploring the Working Directory
In IPython, starting a line with ! gives you complete system shell access:
!ls will display the contents of your current directory. Alternatively, use the os module:

import os
wd = os.getcwd()
os.listdir(wd)

Importing Flat Files

Flat files are basic text files containing records, i.e. table data without structured relationships. This is in contrast to a relational database, in which columns of distinct tables can be related. Each record is a row of fields or attributes; each column is a feature or attribute. Note that whether the file has a header row is important when importing the data.
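
For example, the first lines of a flat file with a header row might look like this (made-up data, purely for illustration):

name,sex,age
Anna,F,23
Ben,M,31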

Common file extensions are:

  • .csv: comma-separated values
  • .txt: text file (fields separated by a character or sequence of characters other than a comma, such as a tab; this separator is called a delimiter)

NumPy

NumPy arrays are the standard for storing numerical data and are essential for other packages like scikit-learn. We can use loadtxt() and genfromtxt() to load data into NumPy.

import numpy as np
filename = 'MNIST_header.txt'
data = np.loadtxt(filename,
                  delimiter=',',   # '\t' for tab-delimited files
                  skiprows=1,      # skip the first row, which is the header
                  usecols=[0, 2])  # only use the first and the third columns

# To import the data as strings
data = np.loadtxt(filename,
                  delimiter=',',
                  dtype=str)  # can also be float

loadtxt() is not that useful when we have mixed datatypes in a dataset. We can use genfromtxt() to handle such structures.

data = np.genfromtxt('titanic.csv',
                     delimiter=',',
                     names=True,  # True = there is a header
                     dtype=None)  # figure out the type of each column automatically

# To access the first row
data[0]

# To access a column by name
data['Survived']

Because the data are of different types, data is an object called a structured array. Because NumPy arrays have to contain elements that are all of the same type, the structured array solves this by being a 1D array in which each element is a row of the imported flat file.

recfromcsv() has the defaults delimiter=',' and names=True in addition to dtype=None.
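
With those defaults, the titanic import above reduces to a one-liner; a minimal sketch assuming the same titanic.csv (note that recfromcsv() is deprecated in recent NumPy releases):

d = np.recfromcsv('titanic.csv')  # same as genfromtxt with the defaults above
print(d[:3])  # first three records of the structured array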

Pandas

To have a two-dimensional labeled data structure with columns of potentially different types, and to be able to manipulate, slice, reshape, group by, join, merge, perform statistics, and work with time series data, we need the DataFrame.

DataFrame = pythonic analog of R’s data frame.

It’s now standard and best practice to use pandas to import flat files.

import pandas as pd
file = 'titanic.csv'
data = pd.read_csv(file,
                   nrows=5,      # only read the first 5 rows
                   header=None)  # no header in this file
data.head()
data_array = data.values  # build a NumPy array from the DataFrame

Customizing your pandas import

# Import pandas and matplotlib.pyplot
import pandas as pd
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file,
                   sep='\t',
                   comment='#',  # character after which comments occur in the file
                   na_values='Nothing')  # string (or list of strings) to recognize as NA/NaN

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

Pickled Files

  • File type native to Python
  • Motivation: many datatypes for which it isn’t obvious how to store them
  • Pickled files are serialized, i.e. the object is converted to a bytestream
import pickle
with open('pickled_fruit.pkl', 'rb') as file:  # 'rb' = read, binary (computer-readable)
    data = pickle.load(file)
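
For the reverse direction, a minimal sketch of writing a pickled file (the fruit dictionary here is a made-up example):

import pickle

fruit = {'apples': 3, 'pears': 5}  # hypothetical object to serialize
with open('pickled_fruit.pkl', 'wb') as file:  # 'wb' = write, binary
    pickle.dump(fruit, file)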

Excel File

import pandas as pd
file = 'urbanpop.xlsx'
data = pd.ExcelFile(file)
print(data.sheet_names)
df1 = data.parse('1960-1966')  # by sheet name
df2 = data.parse(0)            # by sheet index, as an integer
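
Alternatively, pandas can read a single sheet into a DataFrame directly; a sketch assuming the same urbanpop.xlsx:

import pandas as pd
df = pd.read_excel('urbanpop.xlsx', sheet_name=0)  # first sheet as a DataFrame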

Customizing the import:
Note that the arguments all need to be of list type

df1 = data.parse(0, skiprows=[0],
                 names=['Country', 'AAM due to War (2002)'])  # change the column names

SAS File

# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame
print(df_sas.head())

# Plot histogram of DataFrame features (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
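
pandas can also read SAS files directly, without the sas7bdat package; a minimal sketch assuming the same sales.sas7bdat:

import pandas as pd
df_sas = pd.read_sas('sales.sas7bdat')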

Stata File

# Import pandas and matplotlib.pyplot
import pandas as pd
import matplotlib.pyplot as plt

# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')

# Print the head of the DataFrame df
print(df.head())

# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()

HDF5 File

  • Hierarchical Data Format version 5
  • Standard for storing large quantities of numerical data
  • Datasets can be hundreds of gigabytes or terabytes
  • HDF5 can scale to exabytes
# Import packages
import numpy as np
import h5py
import matplotlib.pyplot as plt

# Assign filename: file
file = 'LIGO_data.hdf5'

# Load file: data
data = h5py.File(file, 'r')

# Print the datatype of the loaded file
print(type(data))

# Print the keys of the file
for key in data.keys():
    print(key)

# Get the HDF5 group: group
group = data['strain']

# Check out keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
strain = data['strain']['Strain'][:]  # .value in older versions of h5py

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples)  # (start, stop, step)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
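
As with plain text files, the connection should be closed when you are done, or the file opened inside a with block:

data.close()

# Equivalent, using h5py's context manager:
with h5py.File('LIGO_data.hdf5', 'r') as data:
    print(list(data.keys()))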

MATLAB Files

  • “Matrix Laboratory”
  • Industry standard in engineering and science
  • python keys = MATLAB variable names
  • python values = objects assigned to variables
# Import packages
import scipy.io
import numpy as np
import matplotlib.pyplot as plt

# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype of mat
print(type(mat))  # dict

# Print the keys of the MATLAB dictionary
# (loadmat() also adds metadata keys such as '__header__', '__version__', '__globals__')
print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))  # dimensions

# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()

Relational Databases

Workflow of SQL querying:

  1. Import packages and functions
  2. Create the database engine
  • SQLite database: fast and simple
  • SQLAlchemy: works with many RDBMSs
  3. Connect to the engine
  4. Query the database
  5. Save query results to a DataFrame
  6. Close the connection
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')
table_names = engine.table_names()

con = engine.connect()
rs = con.execute('SELECT * FROM Orders')
df = pd.DataFrame(rs.fetchall())  # store the results in a DataFrame
df.columns = rs.keys()  # set the column names correctly
con.close()

# Or, to save the trouble of closing the connection:
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Orders')
    df = pd.DataFrame(rs.fetchmany(size=5))  # store 5 rows instead of all rows
    df.columns = rs.keys()

We can also query relational databases directly with pandas:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')
df = pd.read_sql_query("SELECT * FROM Orders", engine)
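
The same one-liner works for arbitrary SQL; a sketch with a hypothetical filter and ordering (the OrderId column name is assumed for illustration):

df = pd.read_sql_query(
    'SELECT * FROM Orders WHERE OrderId > 10 ORDER BY OrderId',  # OrderId is an assumed column
    engine)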