Learning shell commands for data science (DataCamp).
Manipulating Files and Directories
Navigate
- `pwd` shows where you are in the filesystem; it is short for “print working directory”.
- `ls` lists the contents of your current directory.
- `ls /home/repl` lists the contents of the specified directory.
  - If a path begins with `/`, it is absolute; if not, it is relative to the current directory.
    - e.g. `/home/repl/seasonal/winter.csv` is absolute, `seasonal/winter.csv` is relative.
- `cd` changes directory to the specified one.
- `cd ~` takes you to your home directory, such as `/home/repl`, so `~/people/aaa.txt` = `/home/repl/people/aaa.txt`.
- `cd ..` moves you up one level: `..` means “the directory above the one I’m currently in”. (A single dot on its own, `.`, always means “the current directory”.)
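
The navigation commands above can be tried safely in a scratch directory. This is a minimal sketch: the `sandbox` directory and the empty `winter.csv` file are made up for illustration and stand in for `/home/repl`.

```shell
# Build a throwaway directory tree so the example does not depend on /home/repl.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/seasonal"
touch "$sandbox/seasonal/winter.csv"

cd "$sandbox"          # absolute path: mktemp normally returns one starting with /
ls seasonal            # relative path: resolved from the current directory
cd seasonal            # relative move into the sub-directory
inside=$(pwd)          # now ends in /seasonal
cd ..                  # back up to the parent directory
parent=$(pwd)
cd .                   # "." is the current directory, so nothing changes
still_parent=$(pwd)
echo "$inside"
echo "$parent"
```

The two `echo` lines should differ only by the trailing `/seasonal`.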
Move
- `mv` moves files from one directory to another.
  - e.g. `mv autumn.csv winter.csv ..` moves the files `autumn.csv` and `winter.csv` from the current working directory up one level to its parent directory.
  - It can also be used to rename files or directories. `mv course.txt old-course.txt` changes the name `course.txt` to `old-course.txt`, and `mv seasonal by-season` changes the name of the `seasonal` directory to `by-season`.
  - Warning: just like `cp`, `mv` will overwrite existing files. If, for example, you already have a file called `old-course.txt`, then the command shown above will replace it with whatever is in `course.txt`.
Copy
- `cp original.txt duplicate.txt` creates a copy of `original.txt` called `duplicate.txt`.
  - If there already was a file called `duplicate.txt`, it is overwritten.
  - If the last parameter to `cp` is an existing directory, then a command like `cp seasonal/autumn.csv seasonal/winter.csv backup` copies all of the files into that directory.
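
The overwrite behaviour of `mv` and `cp` is easy to see in a scratch directory. The sketch below uses invented file contents; the `[ ! -e ... ]` guard is one possible precaution, not something the notes prescribe.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
echo "new notes" > course.txt
echo "old notes" > old-course.txt

# mv replaces the existing old-course.txt without asking.
mv course.txt old-course.txt
after_mv=$(cat old-course.txt)     # the old contents are gone

# A simple guard: only copy when the destination does not exist yet.
if [ ! -e duplicate.txt ]; then
    cp old-course.txt duplicate.txt
fi

# When the last argument is an existing directory, cp copies the files into it.
mkdir backup
cp old-course.txt duplicate.txt backup
ls backup
```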
Delete
- `rm` removes files.
  - e.g. `rm thesis.txt backup/thesis-2017-08.txt` removes both `thesis.txt` and `backup/thesis-2017-08.txt`.
- `rmdir` removes a directory, but only works when the directory is empty, so you must delete the files in a directory before you delete the directory itself.
Create New Directory
- `mkdir directory_name` creates a directory.
- `/tmp` is where people and programs often keep files they only need briefly. (It is immediately below the root directory `/`, not below your home directory.)
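
A short sketch of creating and deleting directories, assuming a scratch location created with `mktemp -d` (which typically lives under `/tmp`). The `drafts` directory and file names are illustrative only.

```shell
work=$(mktemp -d)
cd "$work"
mkdir drafts
touch drafts/thesis.txt drafts/thesis-2017-08.txt

# rmdir refuses to remove a directory that still contains files.
if rmdir drafts 2>/dev/null; then
    first_try="removed"
else
    first_try="refused"
fi

rm drafts/thesis.txt drafts/thesis-2017-08.txt   # empty the directory first
rmdir drafts                                      # now it succeeds
echo "$first_try"
```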
Manipulating Data
View File Contents
- `cat` prints the contents of files onto the screen.
  - Its name is short for “concatenate”, meaning “to link things together”, since it will print all the files whose names you give it, one after the other.
- `less` prints content page by page.
  - When you `less` a file, one page is displayed at a time; you can press spacebar to page down or type `q` to quit.
  - If you give `less` the names of several files, you can type `:n` to move to the next file, `:p` to go back to the previous one, or `:q` to quit.
- `head` prints the first few lines of a file (where “a few” means 10). If there aren’t 10 lines in the file, it will display as many lines as there are.
- `tail` prints the last few lines of a file.
- You can use single quotes, `'`, or double quotes, `"`, around the file names when there are spaces in the names.
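
To see `head`, `tail`, `cat`, and quoting together, here is a small sketch. The file name `my data.txt` and its twelve numbered lines are invented for the example.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
# Write the numbers 1..12, one per line, into a file whose name contains a space.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "my data.txt"

first=$(head "my data.txt")            # default: the first 10 lines
last=$(tail -n 3 'my data.txt')        # single quotes work too: lines 10-12
both=$(cat "my data.txt" "my data.txt" | wc -l)   # cat joins its inputs end to end
echo "$last"
```

Without the quotes, the shell would pass `my` and `data.txt` as two separate file names.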
Tricks
- Pressing tab after typing part of a file or directory name lets the shell complete it, so you type less.
  - For example, if you type `sea` and press tab, it will fill in the directory name `seasonal/` (with a trailing slash). If you then type `a` and tab, it will complete the path as `seasonal/autumn.csv`.
  - If the path is ambiguous, such as `seasonal/s`, pressing tab a second time will display a list of possibilities. Typing another character or two to make your path more specific and then pressing tab will fill in the rest of the name.
- Command-line flags (or just “flags”) customize the behavior of certain commands. Flags don’t have to be a `-` followed by a single letter, but it’s a widely-used convention. It’s considered good style to put all flags before any filenames:
  - `-n` means “number of lines”. `head -n 3 seasonal/summer.csv` will only display the first three lines of the file.
  - `-R` means “recursive”. `ls -R` shows everything underneath a directory, no matter how deeply nested it is: every file and directory in the current level, then everything in each sub-directory, and so on.
  - `-F`: `ls -F` prints a `/` after the name of every directory and a `*` after the name of every runnable program.
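
The three flags above can be sketched as follows. The directory layout, the CSV rows, and the `run.sh` script are all made up so the example is self-contained.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p seasonal/archive
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-18,wisdom\n2017-02-03,molar\n' > seasonal/summer.csv
touch seasonal/archive/old.csv
printf '#!/bin/sh\necho hi\n' > run.sh
chmod +x run.sh                          # mark run.sh as a runnable program

top3=$(head -n 3 seasonal/summer.csv)    # flag before the filename: first three lines only
tree=$(ls -R)                            # also lists seasonal/ and seasonal/archive/
marked=$(ls -F)                          # run.sh gets a trailing *, seasonal a trailing /
echo "$marked"
```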
- To get help for a command, use `man` (short for “manual”).
  - e.g. `man head` brings up information about `head`.
  - `man` automatically invokes `less`, so you may need to press spacebar to page through the information and `q` (or `:q`) to quit.
- `history` prints a list of commands you have run recently.
  - Each one is preceded by a serial number to make it easy to re-run particular commands: just type `!55` to re-run the 55th command in your history (if you have that many).
  - You can also re-run a command by typing an exclamation mark followed by the command’s name, such as `!head` or `!cut`, which will re-run the most recent use of that command.
- `Ctrl` + `C` ends a running program. This is often written `^C` in Unix documentation; note that the ‘c’ can be lower-case.
Select data
- `cut` selects columns.
  - e.g. `cut -f 2-5,8 -d , values.csv` means “select columns 2 through 5 and column 8, using comma as the separator”.
  - `cut` uses `-f` (meaning “fields”) to specify columns and `-d` (meaning “delimiter”) to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
  - `cut` is a simple-minded command. In particular, it doesn’t understand quoted strings, so a comma inside a quoted field is still treated as a separator.
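
A brief sketch of both points: field selection with `-f` and `-d`, and what happens with quoted strings. The file `everyone.csv` and its rows are hypothetical sample data.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Columns 2 through 5 and column 8 of a single comma-separated line.
picked=$(printf 'a,b,c,d,e,f,g,h\n' | cut -d , -f 2-5,8)

# Hypothetical file in which the quoted names contain commas.
printf 'Name,Age\n"Johel,Ranjit",28\n"Sharma,Rupinder",26\n' > everyone.csv

# cut splits on every comma, so field 2 is the tail of the quoted name, not the age.
second=$(cut -d , -f 2 everyone.csv)
echo "$picked"
echo "$second"
```

For files with quoted fields, a CSV-aware tool is a safer choice than `cut`.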
- `paste` combines data files together. It treats them as plain text, joining them line by line, so be careful.
- `grep` selects lines according to what they contain. In its simplest form, `grep` takes a piece of text followed by one or more filenames and prints all of the lines in those files that contain that text.
  - e.g. `grep bicuspid seasonal/winter.csv` prints lines from `winter.csv` that contain “bicuspid”.
  - `grep`’s common flags:
    - `-c`: print a count of matching lines rather than the lines themselves
    - `-h`: do not print the names of files when searching multiple files
    - `-i`: ignore case (e.g., treat “Regression” and “regression” as matches)
    - `-l`: print the names of files that contain matches, not the matches
    - `-n`: print line numbers for matching lines
    - `-v`: invert the match, i.e., only show lines that don’t match
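
The `grep` flags listed above, sketched on two tiny files. The rows are invented for illustration and are not the real DataCamp data.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal
printf 'Date,Tooth\n2017-01-03,bicuspid\n2017-02-10,Molar\n2017-03-05,bicuspid\n' > seasonal/winter.csv
printf 'Date,Tooth\n2017-07-01,molar\n' > seasonal/summer.csv

count=$(grep -c bicuspid seasonal/winter.csv)       # number of matching lines
numbered=$(grep -n -i molar seasonal/winter.csv)    # line number plus the line, ignoring case
files=$(grep -l molar seasonal/*.csv)               # names of files with a (case-sensitive) match
no_header=$(grep -v Tooth seasonal/summer.csv)      # everything except the header
plain=$(grep -h 2017-07 seasonal/*.csv)             # matches without the "file:" prefix
echo "$count" "$numbered" "$files"
```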
- Wildcard `*` means “match zero or more characters”.
  - e.g. `cut -d , -f 1 seasonal/*` or `cut -d , -f 1 seasonal/*.csv` can perform column selection on multiple files at the same time.
- `?` matches a single character, so `201?.txt` will match `2017.txt` or `2018.txt`, but not `2017-01.txt`.
- `[...]` matches any one of the characters inside the square brackets, so `201[78].txt` matches `2017.txt` or `2018.txt`, but not `2016.txt`.
- `{...}` matches any of the comma-separated patterns inside the curly brackets, so `{*.txt,*.csv}` (with no space after the comma) matches any file whose name ends with `.txt` or `.csv`, but not files whose names end with `.pdf`.
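
Using `echo` is a harmless way to preview what a wildcard expands to before handing it to another command. The sketch below assumes bash (brace expansion is not available in every shell) and uses empty, made-up files. Note that the brace pattern is written without a space after the comma.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch 2016.txt 2017.txt 2018.txt 2017-01.txt notes.csv report.pdf

single=$(echo 201?.txt)         # exactly one character after "201"
either=$(echo 201[78].txt)      # a 7 or an 8 after "201"
braces=$(echo {*.txt,*.csv})    # all .txt names, then all .csv names; report.pdf is left out
echo "$single"
echo "$either"
echo "$braces"
```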
Combining Tools
- `>` saves a command’s output to a file of your choice.
  - e.g. `head -n 5 seasonal/summer.csv > top.csv` saves the top 5 lines of `summer.csv` into `top.csv`.
- Pipeline `|`: the pipe symbol tells the shell to use the output of the command on the left as the input to the command on the right.
  - e.g. `head -n 5 seasonal/summer.csv | tail -n 3` selects lines 3 to 5 of `summer.csv`.
  - e.g. `cut -d , -f 2 seasonal/summer.csv | grep -v Tooth | head -n 1` selects all of the tooth names from column 2 of the comma-delimited file `seasonal/summer.csv`, pipes the result to `grep` with an inverted match to exclude the header line containing the word “Tooth”, and then selects the very first tooth name.
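
The redirection and pipeline examples above can be reproduced on a small stand-in file. The six rows below are invented; only the `Date,Tooth` header layout follows the notes.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-18,wisdom\n2017-02-03,molar\n2017-03-09,incisor\n2017-04-01,bicuspid\n' > seasonal/summer.csv

head -n 5 seasonal/summer.csv > top.csv                # save lines 1-5 in top.csv
middle=$(head -n 5 seasonal/summer.csv | tail -n 3)    # the last 3 of the first 5: lines 3-5
first_tooth=$(cut -d , -f 2 seasonal/summer.csv | grep -v Tooth | head -n 1)
echo "$middle"
echo "$first_tooth"
```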
- `wc` (short for “word count”) prints the number of characters, words, and lines in a file. You can make it print only one of these using `-c`, `-w`, or `-l` respectively. (Strictly, `-c` counts bytes, which is the same as characters for plain ASCII text.)
  - e.g. `grep 2017-07 seasonal/spring.csv | wc -l` counts how many records in `seasonal/spring.csv` have dates in July 2017.
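
A small counting sketch, using a made-up `spring.csv` with a header and three records:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-07-01,molar\n2017-07-15,canine\n2017-08-02,incisor\n' > spring.csv

lines=$(wc -l < spring.csv)                # header included
words=$(wc -w < spring.csv)                # no spaces in the file, so one word per line
july=$(grep 2017-07 spring.csv | wc -l)    # records dated July 2017
echo "$lines" "$words" "$july"
```

Reading the file through `<` keeps the file name out of the output of `wc`, so only the number is printed.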
- `sort` puts data in order. By default it does this in ascending alphabetical order; `-n` and `-r` can be used to sort numerically and reverse the order of its output, `-b` tells it to ignore leading blanks, and `-f` tells it to fold case (i.e., be case-insensitive).
  - Pipelines often use `grep` to get rid of unwanted records and then `sort` to put the remaining records in order.
- `uniq` removes adjacent duplicated lines. It only handles adjacent lines because `uniq` is built to work with very large files: to remove non-adjacent duplicates, it would have to keep the whole file in memory (or at least, all the unique lines seen so far). By only removing adjacent duplicates, it only has to keep the most recent unique line in memory.
  - `-c` can be used to count the occurrences of each unique item, e.g. `cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort | uniq -c`.
- To save the output of a pipeline, add `> xxx.txt` after the last command of the pipeline. A redirection attached to an earlier command captures only that command’s output, leaving nothing for the commands after it. (For a single command the redirection may also be written first, as in `> xxx.txt head -n 3 seasonal/winter.csv`.)
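
The sketch below shows why `sort` usually comes before `uniq`, and how a redirection after the last command saves the whole pipeline's output. The `winter.csv` rows and the `counts.txt` name are invented for the example.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-01-03,molar\n2017-01-05,canine\n2017-02-01,molar\n2017-02-17,incisor\n2017-03-04,molar\n' > winter.csv

# Without sort, the repeated "molar" lines are not adjacent, so uniq keeps all of them.
unsorted=$(cut -d , -f 2 winter.csv | grep -v Tooth | uniq | wc -l)

# Sorting first groups the duplicates; the final redirection saves the pipeline's output.
cut -d , -f 2 winter.csv | grep -v Tooth | sort | uniq -c | sort -n -r > counts.txt
top=$(head -n 1 counts.txt)      # the most frequent tooth with its count
echo "$unsorted"
cat counts.txt
```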
Batch Processing
- The shell stores information in variables. Some of these, called environment variables, are available all the time. Environment variables’ names are conventionally written in upper case, and a few of the more commonly-used ones are shown below. To get a complete list (which is quite long), you can type `set` in the shell.
Variable | Purpose | Value |
---|---|---|
HOME | User’s home directory | /home/repl |
PWD | Present working directory | Same as pwd command |
SHELL | Which shell program is being used | /bin/bash |
USER | User’s ID | repl |
- `echo` prints its arguments.
  - `echo hello DataCamp!` prints `hello DataCamp!`
  - `echo USER` prints the variable’s name, `USER`
  - `echo $USER` prints the variable’s value, `repl`

The other kind of variable is called a shell variable, which is like a local variable in a programming language.
- To create a shell variable, you simply assign a value to a name: `training=seasonal/summer.csv`, without any spaces before or after the `=` sign.
- To access the value in the variable, simply add `$` before the variable’s name. `head -n 1 $training` returns the first line of `seasonal/summer.csv`, which is defined above.
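
A minimal sketch of defining and reading a shell variable. The one-record `summer.csv` is made up, and the double quotes around `"$training"` are a defensive habit added here (they protect paths that contain spaces), not something the notes require.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal
printf 'Date,Tooth\n2017-01-11,canine\n' > seasonal/summer.csv

training=seasonal/summer.csv         # no spaces around "="
header=$(head -n 1 "$training")      # "$training" expands to seasonal/summer.csv
literal=$(echo training)             # without "$", echo prints the name itself
echo "$header"
echo "$literal"
echo "$HOME"                         # environment variables are read the same way
```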
For
Shell variables are also used in loops, which repeat commands many times. e.g. `for filetype in gif jpg png; do echo $filetype; done` prints `gif`, `jpg`, and `png`, one per line.
- Notice these things about the loop:
  - The structure is `for` …variable… `in` …list… `; do` …body… `; done`
  - The list of things the loop is to process (in our case, the words `gif`, `jpg`, and `png`).
  - The variable that keeps track of which thing the loop is currently processing (in our case, `filetype`).
  - The body of the loop that does the processing (in our case, `echo $filetype`).
- `for filename in seasonal/*.csv; do echo $filename; done` prints the names of all CSV files in the `seasonal` directory.
- We often set a variable using a wildcard expression to record a list of filenames: `datasets=seasonal/*.csv`, then `for filename in $datasets; do echo $filename; done`.
  - Common mistake: `files=seasonal/*.csv`, then `for f in files; do echo $f; done`. Without the `$` before `files`, the loop runs once over the literal word `files` instead of the filenames.
- `for file in seasonal/*.csv; do grep -h 2017-07 $file; done` prints all lines containing `2017-07` from all files, without printing the file names.

A loop can also contain any number of commands. To tell the shell where one ends and the next begins, you must separate them with semi-colons: `for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done`
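
The loop variants above, including the missing-`$` mistake, can be compared side by side. This sketch uses two invented one-record files.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal
printf 'Date,Tooth\n2017-07-01,molar\n' > seasonal/summer.csv
printf 'Date,Tooth\n2017-01-03,canine\n' > seasonal/winter.csv

datasets=seasonal/*.csv      # stores the pattern; it expands to file names when used unquoted

right=$(for f in $datasets; do echo $f; done)    # one line per CSV file
wrong=$(for f in datasets; do echo $f; done)     # missing "$": loops once over the word itself

# Two commands in one body, separated by semicolons: the name, then the first data row.
report=$(for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done)
echo "$right"
echo "$wrong"
echo "$report"
```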
Creating New Tools
Edit a File
`nano filename` will open `filename` for editing (or create it if it doesn’t already exist). You can move around with the arrow keys, delete characters using backspace, and do other operations with control-key combinations:
- `Ctrl` + `K`: delete a line.
- `Ctrl` + `U`: un-delete a line.
- `Ctrl` + `O`: save the file (‘O’ stands for ‘output’).
- `Ctrl` + `X`: exit the editor.
- Copy and paste: navigate to the line you want to copy, press `Ctrl` + `K` to cut the line, then press `Ctrl` + `U` twice to paste two copies of it.
Automate a command
To keep a record of the commands you used:
- Run `history`.
- Pipe its output to `tail -n 10` (or however many recent steps you want to save).
- Redirect that to a file called something like `figure-5.history`.

e.g. `history | tail -n 3 > steps.txt`
We can also store commands in files for the shell to run over and over again.
- e.g. We store the command `head -n 1 seasonal/*.csv` in a file called `headers.sh`, then run `bash headers.sh` to execute the commands inside.
- `$@` means “all of the command-line parameters given to the script”.
  - e.g. if `unique-lines.sh` contains `sort $@ | uniq` and you run `bash unique-lines.sh seasonal/summer.csv`, the shell replaces `$@` with `seasonal/summer.csv` and processes one file.
  - If you run `bash unique-lines.sh seasonal/summer.csv seasonal/autumn.csv`, it processes two files, and so on.
- `$1`, `$2`, and so on can also be used to refer to specific command-line parameters. You can use this to write commands that feel simpler or more natural than the shell’s.
  - For example, you can create a script called `column.sh` that selects a single column from a CSV file when the user provides the filename as the first parameter and the column as the second: `cut -d , -f $2 $1`, and then run it using `bash column.sh seasonal/autumn.csv 1`.
  - Notice how the script uses the two parameters in reverse order.
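
Both scripts can be created and run in a scratch directory, as sketched below. The CSV rows are invented, and the scripts quote `"$@"`, `"$1"`, and `"$2"`, a defensive variation on the unquoted forms in the notes that keeps file names with spaces intact.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal
printf 'Date,Tooth\n2017-01-03,canine\n2017-01-03,canine\n' > seasonal/autumn.csv
printf 'Date,Tooth\n2017-07-01,molar\n' > seasonal/summer.csv

# unique-lines.sh: "$@" stands for every file name passed on the command line.
echo 'sort "$@" | uniq' > unique-lines.sh
# column.sh: $1 is the file name, $2 the column number.
echo 'cut -d , -f "$2" "$1"' > column.sh

unique=$(bash unique-lines.sh seasonal/summer.csv seasonal/autumn.csv)
dates=$(bash column.sh seasonal/autumn.csv 1)
echo "$unique"
echo "$dates"
```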
- You can also write multiple lines of commands in the file and run all of them at once.

It is OK to split loops across lines without semi-colons to make them more readable:

    # Print the first and last data records of each file.
    for filename in $@
    do
        head -n 2 $filename | tail -n 1
        tail -n 1 $filename
    done

- You don’t have to indent the commands inside the loop, but doing so makes things clearer.
- You can pipe the output of a script by adding further commands after it, e.g. `bash date-range.sh seasonal/*.csv | sort`
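
As a sketch of a multi-line script whose output is piped onward, the example below writes the loop to a file and sorts what it prints. The script name `first-last.sh` and the two CSV files are hypothetical; this is not the `date-range.sh` mentioned above, whose contents are not given in these notes.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-03-01,molar\n2017-01-05,canine\n' > a.csv
printf 'Date,Tooth\n2017-02-11,incisor\n2017-04-20,wisdom\n' > b.csv

# Write the script with a quoted here-document so "$@" is stored literally.
cat > first-last.sh <<'EOF'
# Print the first and last data records of each file.
for filename in "$@"
do
    head -n 2 "$filename" | tail -n 1
    tail -n 1 "$filename"
done
EOF

# The script's output feeds straight into sort, like any other command.
sorted=$(bash first-last.sh a.csv b.csv | sort)
echo "$sorted"
```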