Datacamp course notes on importing data from various sources of various types.

Importing Text Files

Types of text files:

Plain text files
Table data: containing records. (flat file)

Importing Plain Text Files

To read a plain text file, we can use Python’s basic open function to open a connection to the file

filename = 'huck_finn.txt'
file = open(filename, mode = 'r') # 'r' = read, to make sure we are only reading it, not writing it or sth else.
text = file.read()
file.close() # make sure you close the connection after saving it to a variable
print(text)

# Note: if you want to write to a file
filename = 'huck_finn.txt'
file = open(filename, mode = 'w') # 'w' = write

To avoid having a closing connection to the file by using a with statement. This will allow you to create a context in which you can execute your command with the file open. Once out of this context, the file is no longer open. That’s also why with is called a Context Manager.

with open('huck_finn.txt', 'r') as file:
	print(file.read())

# Read line by line
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Exploring Working Directory
Starting a line with ! gives you complete system shell access.
!ls will display the contents of your current directory. Or the following:

1
2
3

import os
wd = os.getcwd()
os.listdir(wd)

Importing Flat Files

Flat files are basic text files containing records, which is table data without structured relationships. This in contrast with relational database, in which columns of distinct tables can be related. Each record means a row of fields or attributes. Each column is a feature or attributes. Note whether the file has a header is important when importing the data.

Common file extensions are:

.csv: comma separated values
.txt: text file(separated by characters or sequence of characters other than commas, such as a tab. This separator is called a delimiter)

NumPy

NumPy arrays are standard for storing numerical data, and is essential for other packages like scikit-learn. We can use loadtxt() and genformtxt() to load data into NumPy.

import numpy as np
filename = 'MNIST_header.text'
data = np.loadtxt(filename,
				  delimiter = ',', #'\t' for tab
				  skiprows = 1, # skip the first row, which is the header
				  usecols = [0, 2]) # only use the first and the third columns

# To import data as strings
data = np.loadtxt(filename,
				  delimiter = ',',
				  dtype = str) # can also be float

loadtxt() is not that useful when we have mixed datatype in a dataset. We can use genfromtxt() to handle such structures.

data = np.genfromtxt('titanic.csv',
					 delimiter = ',',
					 names = True, #T = There is a header
					 dtype = None) #figure out what types each column should be by itself
# To access the row
data[0]

# To access the column
data['Survived']

Because the data are of different types, data is an object called a structured array. Because numpy arrays have to contain elements that are all the same type, the structured array solves this by being a 1D array, where each element of the array is a row of the flat file imported.

recfromcsv() has the defaults delimiter = ',' and names = True in addition to dtype = None

Pandas

To have two-dimensional labeled data structure(s) with columns of potentially different types, to be able to manipulate, slice, reshape, groupby, join, merge, perform statistcis, work with time series data…. We need DataFrame.

DataFrame = pythonic analog of R’s data frame.

It’s now standard and best practice to use pandas to import flat files.

import pandas as pd
file = 'titanic.csv'
data = pd.read_csv(file, 
				   nrows = 5, #only need first 4 rows
				   header = None) # no header in this file
data.head()
data_array = data.values # Build a numpy array from the DataFrame

Customizing your pandas import

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, 
				   sep='\t', 
				   comment='#', #takes characters that comments occur after in the file
				   na_values='Nothing') #takes a list of strings to recognize as NA/NaN

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

Pickled Files

File type native to Python
Motivation: many datatypes for which it isn’t obvious how to store them
Pickled files are serialized = convert object to bytestream

1
2
3

import pickel
with open('pickled_fruit.pkl', 'rb') as file: # 'rb' = read, binary (computer readable)
	data = pickle.load(file)

Excel File

import pandas as pd
file = 'urbanpop.xlsx'
data = pd.ExcelFile(file)
print(data.sheet_names)
df1 = data.parse('1960-1966') #sheet name
df2 = data.parse(0) #sheet index, as a float

Customizing the import:
Note that the arguments all need to be of list type

1 2	df1 = xl.parse(0, skiprows = [0], names = ['Country', 'AAM due to War (2002)'])#change the column names

SAS File

# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame
print(df_sas.head())

# Plot histogram of DataFrame features (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()

Stata File

# Import pandas
import pandas as pd

# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')

# Print the head of the DataFrame df
print(df.head())

# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of coutries')
plt.show()

HDF5 File

Hierarchical Data Format version 5
Standard for storing large quantities of numerical data
Datasets can be hundreds of gigabytes or terabytes

HDF5 can scale to exabytes

# Import packages
import numpy as np
import h5py

# Assign filename: file
file = 'LIGO_data.hdf5'

# Load file: data
data = h5py.File(file, 'r')

# Print the datatype of the loaded file
print(type(data))

# Print the keys of the file
for key in data.keys():
    print(key)

# Get the HDF5 group: group
group = data['strain']

# Check out keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
strain = data['strain']['Strain'].value

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples) #(start, end, interval)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()

MATLAB Files

“Matrix Laboratory”
Industry standard in engineering and science
python keys = MATLAB variable names

python values = object assigend to variables

# Import package
import scipy.io

# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype type of mat
print(type(mat)) #dict

import matplotlib.pyplot as plt
import numpy as np
# Print the keys of the MATLAB dictionary
print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt'])) #dimensions

# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()

Relational Databases

Workflow of SQL querying:

Import packages and functions
Create the database engine

SQLite database: fast and simple
SQLAlchemy: works with many RDMS

Connect to the engine
Query the database
Save query results to a DataFrame
Close the connection

from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')
table_names = engine.table_names()

con = engine.connect()
rs = con.execute('SELECT * FROM Orders')
df = pd.DataFrame(rs.fetchall()) #store the results to a df
df.columns = rs.keys() #to make the column name right
con.close()

# Or to save the trouble of closing a connection:
with engine.connect() as con:
	rs = con.execute('SELECT * FROM Orders')
	df = pd.DataFrame(rs.fetchmany(size = 5)) #store 5 rows instead of all rows
	df.columns = rs.keys()

We can also query the relational databases directly with pandas:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')
df = pd.read_sql_query("SELECT * FROM Orders", engine)