Learning shell commands for data science (DataCamp).
Manipulating Files and Directories
Navigate
- `pwd` (short for “print working directory”) shows where you are in the filesystem.
- `ls` lists the contents of your current directory; `ls /home/repl` lists the contents of the specified directory.
  - If a path begins with `/` then it’s absolute; if not, it’s relative.
  - e.g. `/home/repl/seasonal/winter.csv` is absolute, `seasonal/winter.csv` is relative.
- `cd` changes directory to the specified one.
  - `cd ~` takes you to your home directory, such as `/home/repl`, so `~/people/aaa.txt` = `/home/repl/people/aaa.txt`.
  - `cd ..` moves you up one level; `..` means “the directory above the one I’m currently in”. (A single dot on its own, `.`, always means “the current directory”.)
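A minimal sketch of the navigation commands above. It assumes a throwaway directory made with `mktemp -d` rather than `/home/repl`, so it can be run anywhere; the `seasonal/winter.csv` layout only mirrors the example paths.

```shell
# Build a small sandbox so the example does not depend on /home/repl existing.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/seasonal"
touch "$sandbox/seasonal/winter.csv"

cd "$sandbox"        # absolute path: it starts with /
ls seasonal          # relative path: resolved from the current directory
cd seasonal
cd ..                # ".." is the parent, so we are back in the sandbox
here=$(pwd)
cd .                 # "." is the current directory, so nothing changes
same=$(pwd)
echo "$here"
```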
Move
- `mv` moves files from one directory to another.
  - e.g. `mv autumn.csv winter.csv ..` moves the files `autumn.csv` and `winter.csv` from the current working directory up one level to its parent directory.
- It can also be used to rename files or directories.
  - e.g. `mv course.txt old-course.txt` changes the name `course.txt` to `old-course.txt`, and `mv seasonal by-season` changes the name of the `seasonal` directory to `by-season`.
  - Warning: just like `cp`, `mv` will overwrite existing files. If, for example, you already have a file called `old-course.txt`, then the command shown above will replace it with whatever is in `course.txt`.
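The overwrite warning is easy to see in a scratch directory. This is a small sketch with made-up file contents, not part of the course material:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
echo "new notes" > course.txt
echo "old notes" > old-course.txt
mkdir seasonal

# Renaming onto an existing name silently replaces the old file.
mv course.txt old-course.txt
cat old-course.txt             # prints: new notes

# The same command renames directories.
mv seasonal by-season
```

If the silent overwrite is a concern, `mv -i` asks for confirmation before replacing an existing file.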
Copy
- `cp original.txt duplicate.txt` creates a copy of `original.txt` called `duplicate.txt`.
  - If there already was a file called `duplicate.txt`, it is overwritten.
  - If the last parameter to `cp` is an existing directory, then a command like `cp seasonal/autumn.csv seasonal/winter.csv backup` copies all of the files into that directory.
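Both forms of `cp` can be tried side by side. The sketch below assumes a scratch directory and placeholder file contents:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal backup
echo "autumn data" > seasonal/autumn.csv
echo "winter data" > seasonal/winter.csv

# Two-name form: the second name is the copy.
cp seasonal/autumn.csv duplicate.csv

# Last argument is an existing directory: every listed file is copied into it.
cp seasonal/autumn.csv seasonal/winter.csv backup
ls backup
```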
Delete
- `rm` removes files.
  - e.g. `rm thesis.txt backup/thesis-2017-08.txt` removes both `thesis.txt` and `backup/thesis-2017-08.txt`.
- `rmdir` removes a directory, but only works when the directory is empty, so you must delete the files in a directory before you delete the directory.
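The "empty directory" rule can be checked directly. A minimal sketch, assuming a scratch directory and a placeholder file:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir people
touch people/aaa.txt

# rmdir refuses to remove a directory that still has files in it.
if rmdir people 2>/dev/null; then
    first_try="removed"
else
    first_try="refused"
fi
echo "first attempt: $first_try"

rm people/aaa.txt              # empty the directory first
rmdir people                   # now it succeeds
```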
Create New Directory
- `mkdir` creates a directory: `mkdir directory_name`.
- `/tmp` is where people and programs often keep files they only need briefly. It is immediately below the root directory `/`, not below your home directory.
Manipulating Data
View File Contents
- `cat` prints the contents of files onto the screen.
  - Its name is short for “concatenate”, meaning “to link things together”, since it will print all the files whose names you give it, one after the other.
- `less` prints content page by page.
  - When you `less` a file, one page is displayed at a time; you can press spacebar to page down or type `q` to quit.
  - If you give `less` the names of several files, you can type `:n` to move to the next file, `:p` to go back to the previous one, or `:q` to quit.
- `head` prints the first few lines of a file (where “a few” means 10). If there aren’t 10 lines in the file, it will display as many lines as there are.
- `tail` prints the last few lines of a file.
- You can use single quotes, `'`, or double quotes, `"`, around the file names when there are spaces in the names.
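A short sketch of `head`, `tail`, and quoting. The two files are generated on the spot and their names are assumptions for illustration:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > numbers.txt
printf 'only\nthree\nlines\n' > "short file.txt"

head numbers.txt               # first 10 lines: 1 through 10
tail numbers.txt               # last 10 lines: 3 through 12
head "short file.txt"          # fewer than 10 lines, so all of them are shown

last_of_head=$(head numbers.txt | tail -n 1)
first_of_tail=$(tail numbers.txt | head -n 1)
short_last=$(head 'short file.txt' | tail -n 1)
```

Without the quotes, the shell would treat `short` and `file.txt` as two separate file names.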
Tricks
- Pressing tab after typing part of a file or directory name lets the shell complete it, so you type less.
  - For example, if you type `sea` and press tab, it will fill in the directory name `seasonal/` (with a trailing slash). If you then type `a` and tab, it will complete the path as `seasonal/autumn.csv`.
  - If the path is ambiguous, such as `seasonal/s`, pressing tab a second time will display a list of possibilities. Typing another character or two to make your path more specific and then pressing tab will fill in the rest of the name.
- Command-line flags (or just “flags”) customize the behavior of commands. Flags don’t have to be a `-` followed by a single letter, but it’s a widely used convention. It’s considered good style to put all flags before any filenames:
  - `-n` means “number of lines”. `head -n 3 seasonal/summer.csv` will only display the first three lines of the file.
  - `-R` means “recursive”. `ls -R` shows everything underneath a directory, no matter how deeply nested it is: every file and directory in the current level, then everything in each sub-directory, and so on.
  - `-F`: `ls -F` prints a `/` after the name of every directory and a `*` after the name of every runnable program.
- To get help for a command, we use `man` (short for “manual”).
  - e.g. `man head` brings up information about `head`.
  - `man` automatically invokes `less`, so you may need to press spacebar to page through the information and `:q` to quit.
- `history` will print a list of commands you have run recently.
  - Each one is preceded by a serial number to make it easy to re-run particular commands: just type `!55` to re-run the 55th command in your history (if you have that many).
  - You can also re-run a command by typing an exclamation mark followed by the command’s name, such as `!head` or `!cut`, which will re-run the most recent use of that command.
- `Ctrl+C` ends a running program. This is often written `^C` in Unix documentation; note that the ‘c’ can be lower-case.
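The three flags mentioned above can be tried together. This sketch uses a scratch directory and invented sample rows; only the flag behavior is the point:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p seasonal/archive
# Made-up sample rows, used only for illustration.
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-18,wisdom\n2017-02-02,molar\n' > seasonal/summer.csv
touch seasonal/archive/old.csv

head -n 3 seasonal/summer.csv  # only the first three lines
ls -R                          # everything below the current directory
ls -F                          # directories are marked with a trailing /

top=$(head -n 3 seasonal/summer.csv)
listing=$(ls -F)
```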
Select data
- `cut` allows you to select columns.
  - e.g. `cut -f 2-5,8 -d , values.csv` means “select columns 2 through 5 and column 8, using comma as the separator”. `cut` uses `-f` (meaning “fields”) to specify columns and `-d` (meaning “delimiter”) to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
  - `cut` is a simple-minded command. In particular, it doesn’t understand quoted strings, so a comma inside quotes is still treated as a separator.
- `paste` can combine data files together. But it treats data files as plain text, so be careful.
- `grep` selects lines according to what they contain. In its simplest form, `grep` takes a piece of text followed by one or more filenames and prints all of the lines in those files that contain that text.
  - e.g. `grep bicuspid seasonal/winter.csv` prints lines from `winter.csv` that contain “bicuspid”.
  - `grep`’s common flags:
    - `-c`: print a count of matching lines rather than the lines themselves
    - `-h`: do not print the names of files when searching multiple files
    - `-i`: ignore case (e.g., treat “Regression” and “regression” as matches)
    - `-l`: print the names of files that contain matches, not the matches
    - `-n`: print line numbers for matching lines
    - `-v`: invert the match, i.e., only show lines that don’t match
- Wildcards:
  - `*` means “match zero or more characters”. e.g. `cut -d , -f 1 seasonal/*` or `cut -d , -f 1 seasonal/*.csv` can perform column selection on multiple files at the same time.
  - `?` matches a single character, so `201?.txt` will match `2017.txt` or `2018.txt`, but not `2017-01.txt`.
  - `[...]` matches any one of the characters inside the square brackets, so `201[78].txt` matches `2017.txt` or `2018.txt`, but not `2016.txt`.
  - `{...}` matches any of the comma-separated patterns inside the curly brackets, so `{*.txt,*.csv}` (no space after the comma) matches any file whose name ends with `.txt` or `.csv`, but not files whose names end with `.pdf`.
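The following sketch exercises `cut`, a few `grep` flags, and the `*`, `?`, and `[...]` wildcards on two small CSV files. The rows are invented for illustration and are not the course data:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal
printf 'Date,Tooth\n2017-01-25,wisdom\n2017-02-19,canine\n' > seasonal/winter.csv
printf 'Date,Tooth\n2017-07-10,incisor\n2017-08-04,canine\n' > seasonal/summer.csv

# Column 2 of every CSV in the directory, in one command.
cut -d , -f 2 seasonal/*.csv

# grep flags: -c counts matching lines, -n adds line numbers.
canines=$(grep -c canine seasonal/winter.csv)
numbered=$(grep -n wisdom seasonal/winter.csv)

# Wildcards are expanded by the shell before the command runs.
one=$(echo seasonal/?inter.csv)      # ? matches exactly one character
both=$(echo seasonal/[sw]*.csv)      # [sw] matches either s or w

# cut splits on every comma, even one inside a quoted string.
name=$(echo 'first,"Smith, J.",last' | cut -d , -f 2)
echo "$name"                         # prints: "Smith
```

The last line shows why `cut` is a poor fit for CSV files with quoted fields: the second field is cut off at the comma inside the quotes.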
Combining Tools
- `>` redirects a command’s output to a file of your choice instead of the screen.
  - e.g. `head -n 5 seasonal/summer.csv > top.csv` saves the top 5 lines of `summer.csv` into `top.csv`.
Pipeline
- `|`, the pipe symbol, tells the shell to use the output of the command on the left as the input to the command on the right.
  - e.g. `head -n 5 seasonal/summer.csv | tail -n 3` selects lines 3-5 from `summer.csv`.
  - e.g. `cut -d , -f 2 seasonal/summer.csv | grep -v Tooth | head -n 1` selects all of the tooth names from column 2 of the comma-delimited file `seasonal/summer.csv`, pipes the result to `grep` with an inverted match to exclude the header line containing the word “Tooth”, and then selects the very first tooth name.
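Both pipelines can be checked on a six-line file. The rows below are made up so the result is easy to verify by eye:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-18,wisdom\n2017-02-02,molar\n2017-02-27,incisor\n2017-03-07,bicuspid\n' > summer.csv

# Lines 3-5: take the first five lines, then the last three of those.
middle=$(head -n 5 summer.csv | tail -n 3)
echo "$middle"

# First tooth name: column 2, minus the header, first remaining line.
first_tooth=$(cut -d , -f 2 summer.csv | grep -v Tooth | head -n 1)
echo "$first_tooth"            # prints: canine
```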
- `wc` (short for “word count”) prints the number of characters, words, and lines in a file. You can make it print only one of these using `-c`, `-w`, or `-l` respectively.
  - e.g. `grep 2017-07 seasonal/spring.csv | wc -l` counts how many records in `seasonal/spring.csv` have dates in July 2017.
- `sort` puts data in order. By default it does this in ascending alphabetical order; `-n` sorts numerically, `-r` reverses the order of its output, `-b` tells it to ignore leading blanks, and `-f` tells it to fold case (i.e., be case-insensitive).
  - Pipelines often use `grep` to get rid of unwanted records and then `sort` to put the remaining records in order.
- `uniq` removes adjacent duplicated lines only. The reason is that `uniq` is built to work with very large files. In order to remove non-adjacent duplicates from a file, it would have to keep the whole file in memory (or at least, all the unique lines seen so far). By only removing adjacent duplicates, it only has to keep the most recent unique line in memory.
  - `-c` can be used to count the occurrences of each unique item, e.g. `cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort | uniq -c`.
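- To save the output of a pipeline, add `> xxx.txt` at the end of the pipeline, after the last command. A redirect attached to an earlier command captures that command’s output, so nothing is left for the later stages to read.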
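Why `sort` usually comes before `uniq`, and where the redirect goes, can be seen in a few lines. The data here is invented so that duplicates are never adjacent in the raw file:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-01-03,canine\n2017-01-05,molar\n2017-01-21,canine\n2017-02-05,molar\n2017-02-17,canine\n' > winter.csv

# Without sort, uniq only collapses neighbours, so all five lines survive.
unsorted=$(cut -d , -f 2 winter.csv | grep -v Tooth | uniq | wc -l | tr -d ' ')
echo "$unsorted"

# Sorting first makes duplicates adjacent; the redirect sits on the last command.
cut -d , -f 2 winter.csv | grep -v Tooth | sort | uniq -c > counts.txt
cat counts.txt
```

`counts.txt` ends up with one line per tooth name, each prefixed by its count.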
Batch Processing
- The shell stores information in variables. Some of these, called environment variables, are available all the time. Environment variables’ names are conventionally written in upper case, and a few of the more commonly used ones are shown below. To get a complete list (which is quite long), you can type `set` in the shell.
| Variable | Purpose | Value |
|---|---|---|
| HOME | User’s home directory | /home/repl |
| PWD | Present working directory | Same as pwd command |
| SHELL | Which shell program is being used | /bin/bash |
| USER | User’s ID | repl |
- `echo` prints its arguments.
  - `echo hello DataCamp!` prints `hello DataCamp!`
  - `echo USER` prints the variable name, `USER`
  - `echo $USER` prints the variable’s value, `repl`
The other kind of variable is called a shell variable, which is like a local variable in a programming language.
- To create a shell variable, you simply assign a value to a name: `training=seasonal/summer.csv`, without any spaces before or after the `=` sign.
- To access the value in the variable, simply add `$` before the variable name. `head -n 1 $training` returns the first line of `seasonal/summer.csv`, which is defined above.
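A small sketch of the difference between a variable's name and its value, and of the no-spaces rule. It assumes there is no command called `training` on the system:

```shell
# Environment variables such as HOME are set for you.
echo "$HOME"

# A shell variable: no spaces around the = sign.
training=seasonal/summer.csv
literal=$(echo training)       # the word itself: training
value=$(echo $training)        # the stored value: seasonal/summer.csv
echo "$literal"
echo "$value"

# With spaces, the shell tries to run a command named "training" instead.
if training = seasonal/summer.csv 2>/dev/null; then
    spaced="accepted"
else
    spaced="rejected"
fi
echo "$spaced"
```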
For
Shell variables are also used in loops, which repeat commands many times. e.g. `for filetype in gif jpg png; do echo $filetype; done` prints `gif`, `jpg`, and `png`, one per line.
- Notice these things about the loop:
  - The structure is `for` …variable… `in` …list… `; do` …body… `; done`
  - The list of things the loop is to process (in our case, the words `gif`, `jpg`, and `png`).
  - The variable that keeps track of which thing the loop is currently processing (in our case, `filetype`).
  - The body of the loop that does the processing (in our case, `echo $filetype`).
- `for filename in seasonal/*.csv; do echo $filename; done` prints the name of every CSV file in the `seasonal` directory. We often set a variable using a wildcard expression to record a list of filenames: `datasets=seasonal/*.csv`, then `for filename in $datasets; do echo $filename; done`.
  - Common mistake: forgetting the `$` before the variable name. With `files=seasonal/*.csv`, the loop `for f in files; do echo $f; done` runs once and prints the word `files`.
- `for file in seasonal/*.csv; do grep -h 2017-07 $file; done` prints every line containing `2017-07` from all of the files, without printing the file names.
A loop can also contain any number of commands. To tell the shell where one ends and the next begins, you must separate them with semi-colons: `for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done`
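The wildcard-in-a-variable pattern and the missing-`$` mistake can be compared directly. A minimal sketch, assuming a scratch directory with two empty CSV files:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir seasonal
touch seasonal/autumn.csv seasonal/spring.csv

# Correct: $files is replaced by the wildcard, which then expands to the file names.
files=seasonal/*.csv
right=$(for f in $files; do echo $f; done)

# Mistake: without the $, the loop runs once over the literal word "files".
wrong=$(for f in files; do echo $f; done)

echo "$right"
echo "$wrong"                  # prints: files
```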
Creating New Tools
Edit a File
- `nano filename` will open `filename` for editing (or create it if it doesn’t already exist). You can move around with the arrow keys, delete characters using backspace, and do other operations with control-key combinations:
  - `Ctrl+K`: delete a line.
  - `Ctrl+U`: un-delete a line.
  - `Ctrl+O`: save the file (‘O’ stands for ‘output’).
  - `Ctrl+X`: exit the editor.
  - Copy and paste: navigate to the line you want to copy, press `Ctrl+K` to cut the line, then press `Ctrl+U` twice to paste two copies of it.
Automate a command
To keep a record of the commands you used:
- Run `history`.
- Pipe its output to `tail -n 10` (or however many recent steps you want to save).
- Redirect that to a file called something like `figure-5.history`.

e.g. `history | tail -n 3 > steps.txt`
We can also store commands in files for the shell to run over and over again.
- e.g. We store the command `head -n 1 seasonal/*.csv` in a file called `headers.sh`, then run `bash headers.sh` to execute the commands inside it.
- `$@` means “all of the command-line parameters given to the script”.
  - e.g. if `unique-lines.sh` contains `sort $@ | uniq` and you run `bash unique-lines.sh seasonal/summer.csv`, the shell replaces `$@` with `seasonal/summer.csv` and processes one file.
  - If you run `bash unique-lines.sh seasonal/summer.csv seasonal/autumn.csv`, it processes two files, and so on.
- `$1`, `$2`, and so on can also be used to refer to specific command-line parameters. You can use this to write commands that feel simpler or more natural than the shell’s.
  - For example, you can create a script called `column.sh` that selects a single column from a CSV file when the user provides the filename as the first parameter and the column as the second: `cut -d , -f $2 $1`, and then run it using `bash column.sh seasonal/autumn.csv 1`.
  - Notice how the script uses the two parameters in reverse order.
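Both scripts can be created and run in a scratch directory. This is a sketch with invented rows, and it writes the script files with `echo` only to keep the example self-contained:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-01-05,canine\n2017-01-17,wisdom\n' > autumn.csv

# column.sh: the filename is $1 and the column number is $2.
echo 'cut -d , -f $2 $1' > column.sh

# unique-lines.sh: $@ stands for every filename passed to the script.
echo 'sort $@ | uniq' > unique-lines.sh

dates=$(bash column.sh autumn.csv 1)
merged=$(bash unique-lines.sh autumn.csv autumn.csv)
echo "$dates"
echo "$merged"
```

Passing the same file twice to `unique-lines.sh` still yields three lines, because `sort` puts the duplicates next to each other and `uniq` collapses them.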
- You can also write multiple lines of commands in the file and run all of them at once.
It is OK to split loops across lines without semi-colons to make them more readable:

```shell
# Print the first and last data records of each file.
for filename in $@
do
    head -n 2 $filename | tail -n 1
    tail -n 1 $filename
done
```

- You don’t have to indent the commands inside the loop, but doing so makes things clearer.
- You can pipe the output of a script by adding commands after it, e.g. `bash date-range.sh seasonal/*.csv | sort`
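As a sketch of piping a script's output, the loop script above can be saved and fed into `sort`. The file name `first-last.sh` and the sample rows are assumptions made for this example; `date-range.sh` itself is not shown in these notes:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-03-01,canine\n2017-05-09,molar\n2017-08-16,wisdom\n' > spring.csv
printf 'Date,Tooth\n2017-01-11,incisor\n2017-04-30,canine\n2017-07-21,molar\n' > summer.csv

# first-last.sh is a hypothetical name for the multi-line loop script.
cat > first-last.sh <<'EOF'
# Print the first and last data records of each file.
for filename in $@
do
    head -n 2 $filename | tail -n 1
    tail -n 1 $filename
done
EOF

# The script's output is ordinary text, so it can feed another command.
sorted=$(bash first-last.sh spring.csv summer.csv | sort)
echo "$sorted"
```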