Datacamp course notes on importing data from various sources of various types.
Importing Text Files
Types of text files:
- Plain text files
- Table data: containing records. (flat file)
Importing Plain Text Files
To read a plain text file, we can use Python’s basic open function to open a connection to the file1
2
3
4
5
6
7
8
9filename = 'huck_finn.txt'
file = open(filename, mode = 'r') # 'r' = read, to make sure we are only reading it, not writing it or sth else.
text = file.read()
file.close() # make sure you close the connection after saving it to a variable
print(text)
# Note: if you want to write to a file
filename = 'huck_finn.txt'
file = open(filename, mode = 'w') # 'w' = write
To avoid having a closing connection to the file by using a with
statement. This will allow you to create a context in which you can execute your command with the file open. Once out of this context, the file is no longer open. That’s also why with
is called a Context Manager.1
2
3
4
5
6
7
8
9with open('huck_finn.txt', 'r') as file:
print(file.read())
# Read line by line
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
print(file.readline())
print(file.readline())
print(file.readline())
Exploring Working Directory
Starting a line with !
gives you complete system shell access.!ls
will display the contents of your current directory. Or the following:1
2
3import os
wd = os.getcwd()
os.listdir(wd)
Importing Flat Files
Flat files are basic text files containing records, which is table data without structured relationships. This in contrast with relational database, in which columns of distinct tables can be related. Each record means a row of fields or attributes. Each column is a feature or attributes. Note whether the file has a header is important when importing the data.
Common file extensions are:
- .csv: comma separated values
- .txt: text file(separated by characters or sequence of characters other than commas, such as a tab. This separator is called a delimiter)
NumPy
NumPy arrays are standard for storing numerical data, and is essential for other packages like scikit-learn. We can use loadtxt()
and genformtxt()
to load data into NumPy.1
2
3
4
5
6
7
8
9
10
11import numpy as np
filename = 'MNIST_header.text'
data = np.loadtxt(filename,
delimiter = ',', #'\t' for tab
skiprows = 1, # skip the first row, which is the header
usecols = [0, 2]) # only use the first and the third columns
# To import data as strings
data = np.loadtxt(filename,
delimiter = ',',
dtype = str) # can also be float
loadtxt()
is not that useful when we have mixed datatype in a dataset. We can use genfromtxt()
to handle such structures.1
2
3
4
5
6
7
8
9data = np.genfromtxt('titanic.csv',
delimiter = ',',
names = True, #T = There is a header
dtype = None) #figure out what types each column should be by itself
# To access the row
data[0]
# To access the column
data['Survived']
Because the data are of different types, data
is an object called a structured array. Because numpy arrays have to contain elements that are all the same type, the structured array solves this by being a 1D array, where each element of the array is a row of the flat file imported.
recfromcsv()
has the defaults delimiter = ','
and names = True
in addition to dtype = None
Pandas
To have two-dimensional labeled data structure(s) with columns of potentially different types, to be able to manipulate, slice, reshape, groupby, join, merge, perform statistcis, work with time series data…. We need DataFrame.
DataFrame = pythonic analog of R’s data frame.
It’s now standard and best practice to use pandas to import flat files.
1 | import pandas as pd |
Customizing your pandas import1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file,
sep='\t',
comment='#', #takes characters that comments occur after in the file
na_values='Nothing') #takes a list of strings to recognize as NA/NaN
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
Pickled Files
- File type native to Python
- Motivation: many datatypes for which it isn’t obvious how to store them
- Pickled files are serialized = convert object to bytestream
1 | import pickel |
Excel File
1 | import pandas as pd |
Customizing the import:
Note that the arguments all need to be of list
type1
2df1 = xl.parse(0, skiprows = [0],
names = ['Country', 'AAM due to War (2002)'])#change the column names
SAS File
1 | # Import sas7bdat package |
Stata File
1 | # Import pandas |
HDF5 File
- Hierarchical Data Format version 5
- Standard for storing large quantities of numerical data
- Datasets can be hundreds of gigabytes or terabytes
- HDF5 can scale to exabytes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38# Import packages
import numpy as np
import h5py
# Assign filename: file
file = 'LIGO_data.hdf5'
# Load file: data
data = h5py.File(file, 'r')
# Print the datatype of the loaded file
print(type(data))
# Print the keys of the file
for key in data.keys():
print(key)
# Get the HDF5 group: group
group = data['strain']
# Check out keys of group
for key in group.keys():
print(key)
# Set variable equal to time series data: strain
strain = data['strain']['Strain'].value
# Set number of time points to sample: num_samples
num_samples = 10000
# Set time vector
time = np.arange(0, 1, 1/num_samples) #(start, end, interval)
# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
MATLAB Files
- “Matrix Laboratory”
- Industry standard in engineering and science
- python keys = MATLAB variable names
- python values = object assigend to variables
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27# Import package
import scipy.io
# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')
# Print the datatype type of mat
print(type(mat)) #dict
import matplotlib.pyplot as plt
import numpy as np
# Print the keys of the MATLAB dictionary
print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt'])) #dimensions
# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
Relational Databases
Workflow of SQL querying:
- Import packages and functions
- Create the database engine
- SQLite database: fast and simple
- SQLAlchemy: works with many RDMS
- Connect to the engine
- Query the database
- Save query results to a DataFrame
- Close the connection
1 | from sqlalchemy import create_engine |
We can also query the relational databases directly with pandas:1
2
3
4from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')
df = pd.read_sql_query("SELECT * FROM Orders", engine)