Introduction to Shell for Data Science

Learning shell commands for data science (DataCamp).

Manipulating Files and Directories

  • pwd (short for “print working directory”) shows where you are in the filesystem.
  • ls lists the contents of your current directory.
    • ls /home/repl lists the content of the specified directory.
  • If a path begins with / then it’s absolute; if not, it’s relative to the current directory.
    • e.g. /home/repl/seasonal/winter.csv is absolute, seasonal/winter.csv is relative.
  • cd changes the working directory to the one you specify.
    • cd ~ takes you to your home directory, such as /home/repl, so ~/people/aaa.txt is the same as /home/repl/people/aaa.txt.
    • cd .. moves you up one level; .. means “the directory above the one I’m currently in”. (A single dot on its own, ., always means “the current directory”.)
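The navigation commands above can be tried safely in a scratch directory. The sketch below is only an illustration: the directory made by mktemp and the names seasonal and winter.csv are stand-ins, not the course’s real /home/repl layout.

```shell
# Build a throwaway directory tree so nothing real is touched (hypothetical layout).
workdir=$(mktemp -d)
mkdir "$workdir/seasonal"
touch "$workdir/seasonal/winter.csv"

cd "$workdir"            # absolute path: it starts with /
pwd                      # prints the scratch directory
ls seasonal              # relative path: resolved from the current directory

cd seasonal              # relative move down one level
cd ..                    # back up to the parent directory
here=$(pwd)              # remember where we are now
ls "$here/seasonal"      # the absolute form of the same path lists the same file
```

Both ls commands list winter.csv, because the relative and absolute paths name the same directory.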

Move

  • mv moves files from one directory to another.
    • e.g. mv autumn.csv winter.csv .. moves the files autumn.csv and winter.csv from the current working directory up one level to its parent directory.
    • It can also be used to rename files or directories: mv course.txt old-course.txt changes the name course.txt to old-course.txt, and mv seasonal by-season changes the name of the seasonal directory to by-season.
      • Warning: just like cp, mv will overwrite existing files. If, for example, you already have a file called old-course.txt, then the command shown above will replace it with whatever is in course.txt.
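The renaming and overwriting behavior described above can be checked with a minimal sketch like the following. The file contents are invented for illustration, and everything happens in a temporary directory.

```shell
# Scratch area with hypothetical files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir seasonal
echo "new notes" > course.txt
echo "old notes" > old-course.txt

# Rename: old-course.txt already exists, so its previous contents are lost.
mv course.txt old-course.txt
cat old-course.txt       # prints: new notes

# Rename a directory the same way.
mv seasonal by-season

# The -i flag makes mv ask before overwriting an existing file
# (left commented out because it waits for keyboard input):
# mv -i some-file.txt old-course.txt
```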

Copy

  • cp original.txt duplicate.txt creates a copy of original.txt called duplicate.txt.
    • If there already was a file called duplicate.txt, it is overwritten.
    • If the last parameter to cp is an existing directory, then a command like: cp seasonal/autumn.csv seasonal/winter.csv backup copies all of the files into that directory.
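A small sketch of the two forms of cp described above, assuming a scratch directory and made-up file contents:

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir seasonal backup
echo "autumn data" > seasonal/autumn.csv
echo "winter data" > seasonal/winter.csv

# Second argument is not a directory: make a copy under a new name.
cp seasonal/autumn.csv duplicate.csv

# Last argument is an existing directory: copy every listed file into it.
cp seasonal/autumn.csv seasonal/winter.csv backup
ls backup                # autumn.csv and winter.csv
```

Unlike mv, cp leaves the originals in seasonal untouched.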

Delete

  • rm removes files.
    • e.g. rm thesis.txt backup/thesis-2017-08.txt removes both thesis.txt and backup/thesis-2017-08.txt
  • rmdir removes a directory, but only when the directory is empty, so you must delete the files in a directory before you delete the directory itself.
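The order of operations matters here, as this sketch tries to show. The people directory and its file are hypothetical names used only for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir people
touch people/agarwal.txt thesis.txt

# rmdir refuses to remove a directory that still has files in it.
if ! rmdir people 2>/dev/null; then
    echo "people is not empty yet"
fi

rm people/agarwal.txt thesis.txt   # delete the files first
rmdir people                       # now the empty directory can be removed
```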

Create New Directory

  • mkdir creates a directory: mkdir directory_name
  • /tmp is where people and programs often keep files they only need briefly. (It is immediately below the root directory /, not below your home directory.)

Manipulating Data

View File Contents

  • cat prints the contents of files onto the screen.
    • Its name is short for “concatenate”, meaning “to link things together”, since it will print all the files whose names you give it, one after the other.
  • less prints a file’s contents page by page.
    • When you less a file, one page is displayed at a time; you can press spacebar to page down or type q to quit.
    • If you give less the names of several files,
      • you can type :n to move to the next file,
      • :p to go back to the previous one,
      • or :q to quit.
  • head prints the first few lines of a file (where “a few” means 10). If there aren’t 10 lines in the file, it will display as many lines as there are.
  • tail prints the last few lines of a file.
  • You can use single quotes, ', or double quotes, ", around the file names when there are spaces in the names.
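To see the defaults of head and tail, and quoting for a filename that contains a space, a sketch along these lines may help. The files numbers.txt and “two lines.txt” are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# seq writes the numbers 1 to 12, one per line, giving a 12-line file.
seq 1 12 > numbers.txt
printf 'a\nb\n' > "two lines.txt"

head numbers.txt                  # first 10 lines: 1 through 10
tail numbers.txt                  # last 10 lines: 3 through 12
head "two lines.txt"              # fewer than 10 lines, so both are shown
cat numbers.txt 'two lines.txt'   # both files, one after the other
```

Without the quotes, the shell would treat two and lines.txt as two separate filenames.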

Tricks

  • Tab completion saves typing: press tab after typing part of a file or directory name and the shell fills in the rest.
    • For example, if you type sea and press tab, it will fill in the directory name seasonal/ (with a trailing slash). If you then type a and tab, it will complete the path as seasonal/autumn.csv.
    • If the path is ambiguous, such as seasonal/s, pressing tab a second time will display a list of possibilities. Typing another character or two to make your path more specific and then pressing tab will fill in the rest of the name.
  • Command-line flags (or just “flags”) customize the behavior of commands. A flag doesn’t have to be a - followed by a single letter, but it’s a widely-used convention. It’s considered good style to put all flags before any filenames:
    • -n stands for “number of lines”. head -n 3 seasonal/summer.csv will only display the first three lines of the file.
    • -R means “recursive”. ls -R shows everything underneath a directory, no matter how deeply nested it is. This shows every file and directory in the current level, then everything in each sub-directory, and so on.
    • -F: ls -F prints a / after the name of every directory and a * after the name of every runnable program.
  • To get help for a command, use man (short for “manual”).
    • e.g. man head brings up information about head.
    • man automatically invokes less, so you may need to press spacebar to page through the information and :q to quit.
  • history will print a list of commands you have run recently.
    • Each one is preceded by a serial number to make it easy to re-run particular commands: just type !55 to re-run the 55th command in your history (if you have that many).
    • You can also re-run a command by typing an exclamation mark followed by the command’s name, such as !head or !cut, which will re-run the most recent use of that command.
  • Ctrl+C can end a running program. This is often written ^C in Unix documentation; note that the ‘c’ can be lower-case.
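The three flags described in this list can be compared side by side in a small sandbox. This is a sketch under assumptions: the archive sub-directory and the rows in summer.csv are made up, and mkdir -p (create parent directories as needed) is used only to set things up.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p seasonal/archive
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-18,wisdom\n2017-02-02,molar\n' > seasonal/summer.csv
touch seasonal/archive/old.csv

head -n 3 seasonal/summer.csv   # only the first three lines
ls -F seasonal                  # archive/ is marked as a directory
ls -R                           # seasonal, then seasonal/archive, with their files
```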

Select data

  • cut allows you to select columns.
    • e.g. cut -f 2-5,8 -d , values.csv means “select columns 2 through 5 and column 8, using comma as the separator”. cut uses -f (meaning “fields”) to specify columns and -d (meaning “delimiter”) to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
    • cut is a simple-minded command. In particular, it doesn’t understand quoted strings.
  • paste can combine data files together. But it treats data files as plain text, so be careful.
  • grep selects lines according to what they contain. In its simplest form, grep takes a piece of text followed by one or more filenames and prints all of the lines in those files that contain that text.
    • e.g. grep bicuspid seasonal/winter.csv prints lines from winter.csv that contain “bicuspid”.
    • grep’s common flags:
      • -c: print a count of matching lines rather than the lines themselves
      • -h: do not print the names of files when searching multiple files
      • -i: ignore case (e.g., treat “Regression” and “regression” as matches)
      • -l: print the names of files that contain matches, not the matches
      • -n: print line numbers for matching lines
      • -v: invert the match, i.e., only show lines that don’t match
  • Wildcards
    • * means “match zero or more characters”.
      • e.g. cut -d , -f 1 seasonal/* or cut -d , -f 1 seasonal/*.csv can perform column selection on multiple files at the same time.
    • ? matches a single character, so 201?.txt will match 2017.txt or 2018.txt, but not 2017-01.txt.
    • [...] matches any one of the characters inside the square brackets, so 201[78].txt matches 2017.txt or 2018.txt, but not 2016.txt.
    • {...} matches any of the comma-separated patterns inside the curly brackets, so {*.txt,*.csv} matches any file whose name ends with .txt or .csv, but not files whose names end with .pdf. Do not put a space after the comma, or the shell will not treat the braces as a pattern.
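The selection tools and wildcards in this list can be exercised together. The following is a minimal sketch with invented data; it uses echo to show what each wildcard expands to, and the brace example assumes bash.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir seasonal
printf 'Date,Tooth\n2017-01-25,wisdom\n2017-02-19,canine\n' > seasonal/winter.csv
printf 'Date,Tooth\n2017-03-07,bicuspid\n2017-07-01,Canine\n' > seasonal/spring.csv
touch 2016.txt 2017.txt 2018.txt 2017-01.txt

cut -d , -f 2 seasonal/winter.csv      # second column: Tooth, wisdom, canine
grep -c canine seasonal/*.csv          # per-file count of matching lines
grep -i -h canine seasonal/*.csv       # ignore case, no filename prefixes
grep -n -v Tooth seasonal/winter.csv   # numbered lines that do not contain "Tooth"

echo 201?.txt                          # 2016.txt 2017.txt 2018.txt
echo 201[78].txt                       # 2017.txt 2018.txt
echo {2016,2018}.txt                   # brace alternatives (bash): 2016.txt 2018.txt
```

Note that 2017-01.txt is never matched, since ? and [78] each stand for exactly one character.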

Combining Tools

  • > redirects a command’s output to a file of your choice instead of printing it on the screen.
    • e.g. head -n 5 seasonal/summer.csv > top.csv saves the top 5 lines of summer.csv into top.csv.
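A short sketch of redirection, using a made-up three-line summer.csv in a temporary directory. It also shows that > replaces the target file each time rather than adding to it.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-18,wisdom\n' > summer.csv

head -n 2 summer.csv > top.csv   # nothing appears on screen; output goes to top.csv
cat top.csv                      # Date,Tooth then 2017-01-11,canine

# Using > again replaces the previous contents of top.csv.
head -n 1 summer.csv > top.csv
cat top.csv                      # now just: Date,Tooth
```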

Pipeline

  • | The pipe symbol tells the shell to use the output of the command on the left as the input to the command on the right.
    • e.g. head -n 5 seasonal/summer.csv | tail -n 3 selects lines 3 to 5 of summer.csv.
    • e.g. cut -d , -f 2 seasonal/summer.csv | grep -v Tooth | head -n 1 selects all of the tooth names from column 2 of the comma-delimited file seasonal/summer.csv, pipes the result to grep with an inverted match to exclude the header line containing the word “Tooth”, and then selects the very first tooth name.
  • wc (short for “word count”) prints the number of lines, words, and characters in a file. You can make it print only one of these using -l, -w, or -c respectively. (Strictly speaking, -c counts bytes; -m counts characters.)
    • e.g. grep 2017-07 seasonal/spring.csv | wc -l counts how many records in seasonal/spring.csv have dates in July 2017.
  • sort puts data in order. By default it does this in ascending alphabetical order.
    • -n sorts numerically and -r reverses the order of the output.
    • -b tells it to ignore leading blanks
    • -f tells it to fold case (i.e., be case-insensitive).
    • Pipelines often use grep to get rid of unwanted records and then sort to put the remaining records in order.
  • uniq removes adjacent duplicated lines. It handles only adjacent duplicates because it is built to work with very large files: to remove non-adjacent duplicates, it would have to keep the whole file in memory (or at least all the unique lines seen so far). By only removing adjacent duplicates, it only has to keep the most recent unique line in memory.
    • -c counts the occurrences of each unique item, e.g. cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort | uniq -c
  • To save the output of a pipeline, add > xxx.txt to the end of the pipeline. A redirect applies only to the command it is attached to, so placing it in the middle, or in front of the first command, captures that one command’s output and leaves nothing for the next command to read.
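The pieces above combine naturally. The sketch below uses an invented five-line spring.csv to show why sort usually comes before uniq, and where the redirect goes when saving a pipeline’s result.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'Date,Tooth\n2017-07-10,incisor\n2017-07-10,wisdom\n2017-08-13,canine\n2017-07-25,incisor\n' > spring.csv

# Count the July 2017 records.
grep 2017-07 spring.csv | wc -l

# Tally teeth: drop the header, sort so duplicates are adjacent, then count.
cut -d , -f 2 spring.csv | grep -v Tooth | sort | uniq -c

# Without sort, the two "incisor" lines are not adjacent, so uniq keeps both.
cut -d , -f 2 spring.csv | grep -v Tooth | uniq

# Most common tooth first; the redirect is attached to the last command.
cut -d , -f 2 spring.csv | grep -v Tooth | sort | uniq -c | sort -n -r > counts.txt
```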

Batch Processing

  • The shell stores information in variables. Some of these, called environment variables, are available all the time. Environment variables’ names are conventionally written in upper case, and a few of the more commonly-used ones are shown below. To get a complete list (which is quite long), you can type set in the shell.
    Variable  Purpose                            Value
    HOME      User’s home directory              /home/repl
    PWD       Present working directory          Same as pwd command
    SHELL     Which shell program is being used  /bin/bash
    USER      User’s ID                          repl
  • echo prints its arguments.

    • echo hello DataCamp! prints hello DataCamp!
    • echo USER prints variable name USER
    • echo $USER prints variable’s value repl
  • The other kind of variable is called a shell variable, which is like a local variable in a programming language.

    • To create a shell variable, you simply assign a value to a name: training=seasonal/summer.csv without any spaces before or after the = sign.
    • To access the value in the variable, simply add $ before the variable. head -n 1 $training returns the first line of seasonal/summer.csv, which is defined above.
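The difference between a variable’s name and its value, and the no-spaces rule for assignment, can be sketched as follows. The CSV contents are made up, and quoting the variable as "$training" is an addition here that keeps paths containing spaces intact.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir seasonal
printf 'Date,Tooth\n2017-01-11,canine\n' > seasonal/summer.csv

echo HOME            # prints the word HOME
echo "$HOME"         # prints the value, i.e. your home directory

training=seasonal/summer.csv   # no spaces around =
head -n 1 "$training"          # Date,Tooth

# With spaces, the shell would try to run a command named "training":
# training = seasonal/summer.csv
```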

For

Shell variables are also used in loops, which repeat commands many times. e.g. for filetype in gif jpg png; do echo $filetype; done prints gif, jpg, and png, each on its own line.

  • Notice these things about the loop:
    1. The structure is for …variable… in …list… ; do …body… ; done
    2. The list of things the loop is to process (in our case, the words gif, jpg, and png).
    3. The variable that keeps track of which thing the loop is currently processing (in our case, filetype).
    4. The body of the loop that does the processing (in our case, echo $filetype).
  • for filename in seasonal/*.csv; do echo $filename; done prints the name of every CSV file in the seasonal directory. It is also common to store a wildcard expression in a variable to record a list of filenames: datasets=seasonal/*.csv then for filename in $datasets; do echo $filename; done
    • Common mistake: forgetting the $ before the variable name. files=seasonal/*.csv then for f in files; do echo $f; done prints just the word files, because the loop runs over the literal text rather than the variable’s value.
  • for file in seasonal/*.csv; do grep -h 2017-07 $file; done prints every line containing 2017-07 from all the files, without printing the file names.

A loop can also contain any number of commands. To tell the shell where one ends and the next begins, you must separate them with semi-colons: for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done
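The loop variants discussed in this section, including the missing-$ mistake, can be run against a couple of invented files. This is a sketch only; the data rows are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir seasonal
printf 'Date,Tooth\n2017-07-10,incisor\n' > seasonal/spring.csv
printf 'Date,Tooth\n2017-01-11,canine\n' > seasonal/summer.csv

datasets=seasonal/*.csv

# Correct: $datasets is expanded to the matching filenames.
for filename in $datasets; do echo $filename; done

# Mistake: without the $, the loop runs once over the literal word "datasets".
for filename in datasets; do echo $filename; done

# Two commands in the body: the file name, then its first data record.
for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done
```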

Creating New Tools

Edit a File

  • nano filename will open filename for editing (or create it if it doesn’t already exist). You can move around with the arrow keys, delete characters using backspace, and do other operations with control-key combinations:
    • Ctrl + K: delete a line.
    • Ctrl + U: un-delete a line.
    • Ctrl + O: save the file (‘O’ stands for ‘output’).
    • Ctrl + X: exit the editor.
    • Copy and paste: navigate to the line you want to copy, press Ctrl + K to cut the line, then press Ctrl + U twice to paste two copies of it.

Automate a command

  • To keep a record of the commands you used:

    1. Run history
    2. Pipe its output to tail -n 10 (or however many recent steps you want to save).
    3. Redirect that to a file called something like figure-5.history.
      e.g. history | tail -n 3 > steps.txt
  • We can also store the commands in files for the shell to run over and over again.

    • e.g. store the command head -n 1 seasonal/*.csv in a file called headers.sh, then run bash headers.sh to execute the commands inside it.
  • $@ means “all of the command-line parameters given to the script”

    • e.g. if unique-lines.sh contains this: sort $@ | uniq, then you run: bash unique-lines.sh seasonal/summer.csv, the shell replaces $@ with seasonal/summer.csv and processes one file.
    • if you run this: bash unique-lines.sh seasonal/summer.csv seasonal/autumn.csv, it processes two files, and so on.
  • $1, $2 and so on can also be used to refer to specific command-line parameters. You can use this to write commands that feel simpler or more natural than the shell’s.
    • For example, you can create a script called column.sh that selects a single column from a CSV file when the user provides the filename as the first parameter and the column as the second: cut -d , -f $2 $1 and then run it using: bash column.sh seasonal/autumn.csv 1.
    • Notice how the script uses the two parameters in reverse order.
  • You can also write multiple lines of commands in the file and run all of them at once.
  • It is OK to split loops across lines without semi-colons to make them more readable:

    # Print the first and last data records of each file.
    for filename in $@
    do
        head -n 2 $filename | tail -n 1
        tail -n 1 $filename
    done
    • You don’t have to indent the commands inside the loop, but doing so makes things clearer.
  • A script’s output can be piped to other commands just like any other program’s, by adding them after it. e.g. bash date-range.sh seasonal/*.csv | sort
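As a rough end-to-end sketch of the script ideas above, the following creates unique-lines.sh and column.sh from the command line instead of nano, then runs them on invented data. Single quotes are used when writing the scripts so that $@, $1 and $2 are stored literally rather than expanded straight away.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir seasonal
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-11,canine\n' > seasonal/summer.csv
printf 'Date,Tooth\n2017-08-13,molar\n' > seasonal/autumn.csv

# Create the two scripts (hypothetical contents matching the notes above).
echo 'sort $@ | uniq' > unique-lines.sh
echo 'cut -d , -f $2 $1' > column.sh

bash unique-lines.sh seasonal/summer.csv seasonal/autumn.csv   # duplicates collapsed
bash column.sh seasonal/autumn.csv 2                           # Tooth, then molar
bash column.sh seasonal/autumn.csv 2 | sort                    # script output piped onward
```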