DataCamp - Importing Data (Part 2)

DataCamp course notes on web scraping and APIs in Python.

Web Scraping

Importing files from an online website

CSV File

The urllib package provides a high-level interface for fetching data across the web.
urlopen() - accepts URLs instead of file names

# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())

# OR using the online file directly without downloading
df = pd.read_csv(url, sep = ';')
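
The block above downloads the file with urlretrieve() or reads the URL directly with pandas; the urlopen() function mentioned earlier can also fetch the same file. A minimal sketch, continuing from the block above (url and pd are already defined; decoding as UTF-8 is an assumption about the file's encoding):

from io import StringIO
from urllib.request import urlopen

# Open the URL and decode the raw bytes to text (assumes UTF-8)
with urlopen(url) as response:
    csv_text = response.read().decode('utf-8')

# Wrap the text in a file-like object so pandas can parse it
df = pd.read_csv(StringIO(csv_text), sep=';')
print(df.head())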

Excel File

Note that when sheet_name=None is passed, the output of pd.read_excel() is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.

# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheet_name=None)  # set sheet_name=None to load all sheets (older pandas versions called this argument sheetname)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())
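
Because xl is an ordinary dictionary of DataFrames, you can also loop over every sheet instead of picking one by name; a quick sketch continuing from the block above:

# Iterate over the sheet-name/DataFrame pairs and print each sheet's dimensions
for sheet_name, sheet_df in xl.items():
    print(sheet_name, sheet_df.shape)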

HTTP Requests

Going to a website = sending an HTTP GET request. urlretrieve() performs a GET request and saves the HTML (HyperText Markup Language) data locally.

GET requests using urllib

# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)  # this returns an HTTPResponse object

# Print the datatype of response
print(type(response))

# Extract the response: html
html = response.read()  # returns the HTML as bytes

# Print the html
print(html)

# Be polite and close the response!
response.close()

GET requests using requests

# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text)

Web Scraping in Python

HTML data is a mix of unstructured and structured data. Therefore, we need to parse it and extract structured data from it with the Python package BeautifulSoup.

from bs4 import BeautifulSoup
import requests
url = 'https://www.crummy.com/software/BeautifulSoup/'

# Package the request, send the request and catch the response
r = requests.get(url)

# Extract the response as HTML
html_doc = r.text

# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html_doc, 'html.parser')  # specify a parser explicitly to avoid a warning

# Pretty-print the HTML with indentation so it is readable
print(soup.prettify())

#get the title
print(soup.title)

#get the text
print(soup.get_text())

# To extract the URLs of all the hyperlinks in the HTML
for link in soup.find_all('a'):  # find all '<a>' tags, which define hyperlinks
    print(link.get('href'))

Extracting data from APIs

API stands for Application Programming Interface, and is a set of protocols and routines for building and interacting with software applications. Simply put, it’s a bunch of code that allows two software programs to communicate with each other. We need APIs to interact with all kinds of applications like Twitter, Instagram and so on.

Connecting to an API in Python

import requests
url = 'http://www.omdbapi.com/?t=hackers'
r = requests.get(url)
json_data = r.json()
for key, value in json_data.items():
    print(key + ':', value)

The components of the above url:

  • http: making an HTTP request
  • www.omdbapi.com: querying the OMDB API
  • ?t=hackers: a query string that requests data for the movie with title (t) 'Hackers' (requests can also build this query string automatically, as shown below)
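
Rather than concatenating the query string by hand, requests can build it from a dictionary of parameters. A minimal sketch (note that the live OMDb API typically also expects an apikey parameter, which is omitted here just as in the course URL):

import requests

# requests encodes the params dict into the ?t=hackers query string
r = requests.get('http://www.omdbapi.com/', params={'t': 'hackers'})
print(r.url)  # http://www.omdbapi.com/?t=hackers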

Wikipedia API

# Import package
import requests

# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
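
The page ID '24768' is hard-coded above; since the IDs under json_data['query']['pages'] are not known in advance, a more general sketch is to loop over whatever pages come back (this assumes the same request as above, so each page carries an 'extract' field):

# Loop over all returned pages instead of hard-coding the page ID
for page_id, page in json_data['query']['pages'].items():
    print(page_id, page['extract'][:100])  # first 100 characters of each extract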

Twitter API

  • Create a Twitter account
  • Create a Twitter app
  • Go to Keys and Access Tokens and copy the API Key, API Secret, Access Token and Access Token Secret. These are the credentials that grant access to the Twitter API.

Set up authentication credentials

# Import package
import tweepy

# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

Streaming tweets
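
The code below relies on a custom listener class, MyStreamListener, which the course defines separately. A minimal sketch of such a class, assuming the older tweepy 3.x API used throughout these notes (the 100-tweet limit and the 'tweets.txt' file name are assumptions chosen to match the file read later):

import json
import tweepy

class MyStreamListener(tweepy.StreamListener):
    """Write incoming tweets to a file and stop after a fixed number of tweets."""

    def __init__(self, api=None):
        super().__init__(api)
        self.num_tweets = 0
        self.file = open("tweets.txt", "w")  # file name is an assumption, matching tweets_data_path below

    def on_status(self, status):
        # Write the raw JSON of each tweet on its own line
        self.file.write(json.dumps(status._json) + '\n')
        self.num_tweets += 1
        if self.num_tweets >= 100:  # stop streaming after 100 tweets (assumed limit)
            self.file.close()
            return False
        return True

    def on_error(self, status_code):
        print(status_code)
        return False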

# Initialize Stream listener
l = MyStreamListener()

# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)


# Filter Twitter Streams to capture data by the keywords:
stream.filter(track = ['clinton', 'trump', 'sanders', 'cruz'])

# Import package
import json

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open connection to file
tweets_file = open(tweets_data_path, "r")

# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns = ['text', 'lang'])

# Print head of DataFrame
print(df.head())

Twitter text analysis: count how many tweets contain the words 'clinton', 'trump', 'sanders' and 'cruz'.

import re

def word_in_text(word, tweet):
    """Return True if the first argument (a word) occurs within the second argument (a tweet)."""
    word = word.lower()
    text = tweet.lower()
    match = re.search(word, text)  # search the lowercased text so matching is case-insensitive

    if match:
        return True
    return False

# Initialize the tweet counters
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot a bar chart of the counts
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])  # newer seaborn versions require keyword arguments
ax.set(ylabel="count")
plt.show()

A standard form for transferring data through APIs is the JSON file format. JSON is short for JavaScript Object Notation, which arose out of the growing need for real-time server-to-browser communication that wouldn't necessarily rely on Flash or Java. JSON is human readable and is close to the structure of a Python dictionary.
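
To see how closely JSON mirrors a Python dictionary, the json module can serialize a dict to a JSON string and parse it back; a small illustrative sketch (the example dictionary is made up):

import json

# An ordinary Python dictionary (made-up example data)
movie = {'Title': 'Hackers', 'Year': '1995'}

# Serialize to a JSON string, then parse it back into a dict
json_string = json.dumps(movie)
print(json_string)                        # {"Title": "Hackers", "Year": "1995"}
print(json.loads(json_string) == movie)   # True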

import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])