ITIC: Working in the shell - Introduction to Bash

From Juneday education
A final note on the output from <code>ls</code>. You might have noticed "total 20" before each printout. That's how many "blocks" the files take up on the disk. Disks use blocks of a certain size for files. A file might be slightly smaller than a block but would still occupy a full block. This is not important in the context of this course.
 
===Permissions===

There's a little more to be said about file permissions, so let's break it down in this section.

On your system, you have users. Some users are system users (not actual persons), and some users are actual persons (like yourself).

All actual person users have a home directory, as explained above. And all users belong to at least one ''user group'' (group). By default, you belong to a group with the same name as your username, a group with only one member (you).

Having groups allows for more fine-grained permission control. For instance, you can create a group for you and your colleagues, e.g. ''team01''. File permissions work on three levels:

* user (owner of the file or directory)
* group (members of the group that owns the file or directory)
* other (all other users on the computer)

Having a group team01 allows you to assign group ownership of a directory to team01. If the directory has full permissions for both the owner and the group owner, then all members of team01 can cd to the directory, list files and write new files there.

For files, it's the same thing. A shared file with group ownership team01 can have, e.g., read and write permissions for both the normal owner of the file and the group owner team01. Now all members of team01 can read and write the file, while others can perhaps only read it.
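The sharing setup described above could be sketched like this (a sketch only: it assumes a group team01 already exists, that you are a member of it, and that there is a file named documentation.txt):

```shell
# Assumes the group team01 exists and documentation.txt is your file
chgrp team01 documentation.txt         # give group ownership to team01
chmod u=rw,g=rw,o=r documentation.txt  # owner/group: read+write, others: read only
ls -l documentation.txt                # the first column should now read -rw-rw-r--
```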
<pre>
rikard@newdelli:~/shared-files$ ls -l documentation.txt
-rw-rw-r-- 1 rikard team01 0 aug 15 13:10 documentation.txt
</pre>
As mentioned above, the permissions listed by <code>ls</code> come in three groups of symbols. The example listing of <code>documentation.txt</code> has the following permissions:
<pre>
-rw-rw-r--

User (rikard):  rw-
Group (team01): rw-
Other:          r--
</pre>
The leading dash symbolizes that the file is a normal file (and not, e.g., a directory - directories have a leading "d" instead).
So, the three chunks of symbols, <code>rw-</code>, <code>rw-</code> and <code>r--</code>, represent, in order: owner, group and others. And the symbols represent the actual permissions:

* r - read
* w - write
* x - execute

A dash in place of any of rwx means that the permission is missing. <code>rw-</code> means that the execute permission is missing. <code>r--</code> means that both write and execute are missing, etc.
Finally, we'll mention here that changing the permissions can be done with the <code>chmod</code> command, using letters: <code>chmod u+x somefile</code> means that we add the execute permission for the user (owner). Similarly, <code>chmod u-x somefile</code> means that we revoke the execute permission for the user (owner). The three levels of ownership are represented by:

* u - user (owner)
* g - group owner
* o - others
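As a small sketch of the letter syntax (we use a temporary file so the example is safe to run anywhere; any file you own works the same way):

```shell
f=$(mktemp)     # create an empty temporary file to experiment on
chmod u+x "$f"  # add execute permission for the user (owner)
chmod go-w "$f" # levels can be combined: remove write for group and others
ls -l "$f"      # the first column shows the resulting permissions
rm "$f"         # clean up
```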
[[File:9123705003 d006eb13da o.jpg|200px|thumb|right|777 means rwxrwxrwx - which may be more than you want]]
In fact, the permission for each level is stored as a number (since we are dealing with computers). We can use numbers in place of letters when changing the permissions. This is a simplified explanation of what numbers to use. For each level (u, g, o) we use one single number. To calculate what number to give, we use a sum of the permission values, where:

* 4 is the value of the read permission
* 2 is the value of the write permission
* 1 is the value of the execute permission
This means that to give the user (owner) rw permission, we use 4+2=6. And the same with each ownership level. An example:
<pre>
-rwxrw-r--  (user: rwx, group: rw-, other: r--)
  7  6  4

So we could do:

chmod 764 somefile
</pre>

Note: Using 777 as permission for a file or directory gives full permissions to everyone, which should typically be avoided for security reasons!
As far as the scope of this course material goes, it's totally fine if you only learn about the letter version of permissions (rwx) and forget about the numeric version. We just wanted to mention it briefly, in case you run into some example somewhere using numbers for permissions.
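If you want to check the numeric form of a file's permissions yourself, a small experiment could look like this (it assumes GNU coreutils, where <code>stat -c %a</code> prints the mode in octal):

```shell
f=$(mktemp)      # temporary file to experiment on
chmod 764 "$f"   # 7 = rwx for user, 6 = rw- for group, 4 = r-- for other
ls -l "$f"       # the first column shows -rwxrw-r--
stat -c %a "$f"  # prints 764 (GNU stat only; macOS uses stat -f %A)
rm "$f"          # clean up
```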
===Arguments to commands - open ''the fridge''===
 
After looking at the ''how'' using flags or options, it's time to look at the ''what''. Some commands are meaningless without an argument. If you think of commands as verbs or imperatives, you will realize that our language works this way too. Without a context, telling someone to "open" or "put" without specifying what will not work. The same goes for most commands.

Revision as of 13:34, 15 August 2019


Reading instructions

Before this module, you should read the Swedish compendium, chapters Filer, kataloger och sökvägar and Terminalmiljö.

Introduction to the command line and shell

If you are an IT student, or aim to work with IT in the future (or perhaps already work with IT), then it is a great help to know how to work in a command-line interface (CLI), i.e. in text mode, in the shell.

A terminal running Bash

The shell is an interactive program that allows you to issue commands from a terminal emulator (a terminal). You have probably seen people working in a terminal, only using the keyboard and the text output from the commands entered. Another way of describing a shell is "command interpreter".

In this course material, we are focusing on the Bash shell. The shell runs in a "terminal", a window only displaying text.

Knowing how to get around in the terminal and using the most common tools and commands is extra important if you intend to take our programming courses, since we do programming in the terminal in our introductory programming course materials. The reason for using the terminal and a simple editor for the introductory programming courses is described elsewhere.

But regardless of whether you plan to do programming or something else in the IT field, knowledge of working in a capable shell like Bash will be of great help to you. We hope that this will become clear throughout the rest of this module.

File system

The file system is the part of the operating system that lets us use abstractions for the organization of files on our hard drive, in terms of a single-rooted tree with directories (sometimes "folders") containing files and other directories. When you work in a shell, you have the concept of "working directory" (or "current directory"), meaning that the shell has the concept of you, the user, "being in a directory" when issuing commands.

The Nautilus file browser

The file system, and how it is organized, is a central part of being capable of working in the shell, so we'll stay on the topic for a while. You have probably used a file browser like Explorer (on Windows) or Finder (on macOS). There, you have probably noticed that folders are organized in a hierarchical manner, with a single root (top directory) - at least per device - which contains files and folders.

On Windows, the default top folder is called C:\, or simply \ (backslash). On macOS and UNIX/Linux systems, the top folder is called / (slash, or "forward slash"). In this material, we are using Ubuntu GNU/Linux as the reference operating system (sometimes we might show examples from other GNU/Linux distributions as well, or even macOS and Windows as a contrast). So, from now on, we'll refer to the root directory as / ("slash").

Since you will be working in text mode, or "command line interface" in this module, it is important to learn the concept of paths. This is because, as we said, the shell has the concept of "current directory", so when you refer to files or directories in the command line, you will use two kinds of paths, described in more detail below, absolute paths and relative paths.

A path is a canonical (unique, unambiguous) way of referring to a file or directory. An absolute path is the full list of intermediate directories from the root to a certain file or directory, so it starts with a slash. Each intermediate directory is separated by additional slashes.

A relative path originates from the current directory instead, and also uses slashes between all intermediate directories to the file or directory you want to identify.

This all works, because the directory tree abstraction starts from a root folder at the top (slash in our examples in Bash) and the root can contain files or other directories. Each directory "below" the root can, in turn, contain new directories and files. This means that a file or directory can only exist in one single directory (its "parent directory"). Thus, we can describe a unique path from the root to any file or directory in the tree.

Absolute path

We'll start talking about the absolute paths because they might be easier to understand at this early stage. An absolute path works from any directory, because it is the canonical path to any file or directory in the file system, since it starts from the root (slash).

Here's an example of an absolute path on a Ubuntu system to one of the system log files:

/var/log/kern.log

As you can see, the path starts with slash (hence it is per definition an absolute path) followed by var, followed by /, followed by log, followed by / followed by kern.log.

This is the file with messages from the Linux kernel. We'll look at the parts again:

/var/log/kern.log
| | | | |   `--- kern.log - a file in /var/log/
| | | |  `------ directories are separated by slashes
| | |  `-------- log - a directory in /var
| |  `---------- directories are separated by slashes
|  `------------ var - a directory in /
 `-------------- the root, "slash"

This means that we have a way to communicate this exact file to the shell by giving the absolute path to it. As an example, here's how we can print the first line of the file from the command line, by using the command head with the flag -1 and the argument /var/log/kern.log:

$ head -1 /var/log/kern.log
Jul 22 09:13:51 newdelli NetworkManager[919]: <info>  [1563779631.5211]   address 10.0.116.35

We'll talk about flags and arguments below, so don't worry about these new concepts. The point was that we can ask head to print the first line of this exact file, by supplying the absolute path to the file as extra information for the command head. And this works regardless of what current directory we are in, since the absolute path is canonical and works from any current directory. Next we'll talk about relative paths (paths that start from the current directory rather than from the root).

Relative path

It might seem a bit too much to have to remember the long absolute path all the way from the top (/) down to each file we want to reference in connection with a command in the shell. For instance, I might be in the directory designated by the absolute path /home/rikard/opt/progund/datorkunskap-kompendium/school and the directory tree from my current directory looks like this:

.
└── courses
    ├── command_line
    │   └── meals.txt
    └── programming
        └── Inheritance_-_problems_with.pdf

By the way, the dot in the diagram above signifies "current directory". If I wanted to investigate the file Inheritance_-_problems_with.pdf which is in a "subfolder" to the courses directory called programming, then the absolute path to this file would be /home/rikard/opt/progund/datorkunskap-kompendium/school/courses/programming/Inheritance_-_problems_with.pdf. A nifty program is the file program, which lets you investigate the file type of any file (the command will do its best to guess the file type!). I could use the absolute path as the extra information to the file command, since an absolute path works from any directory:

$ file /home/rikard/opt/progund/datorkunskap-kompendium/school/courses/programming/Inheritance_-_problems_with.pdf
/home/rikard/opt/progund/datorkunskap-kompendium/school/courses/programming/Inheritance_-_problems_with.pdf: PDF document, version 1.4

But, wait! The file is only two directories away! Isn't there a shorter way to tell file what file I want to examine? There is! You can use the relative path from the current directory. The current directory was, as we said above, /home/rikard/opt/progund/datorkunskap-kompendium/school. And the file we wanted to examine was in the directory programming, which was in the directory courses, which was in the current directory. Luckily for us, we are allowed to use relative paths, which start from the current directory. In this case, the relative path would be courses/programming/Inheritance_-_problems_with.pdf . Note that there is no leading slash in this path (that's what makes it a relative path). For this path to be legal, we must have a current directory which has a directory called courses, which in turn has a directory called programming, which in turn has the file Inheritance_-_problems_with.pdf. Let's see if it works:

$ file courses/programming/Inheritance_-_problems_with.pdf
courses/programming/Inheritance_-_problems_with.pdf: PDF document, version 1.4

By the way, did you notice the dollar sign at the start of the command lines in the examples? That's not part of the command we enter. It is called a "prompt", which we will discuss in more detail below.

Predefined aliases . .. ~

When using relative paths, it would be great if we also could refer to files and directories that are "above" the current directory. In the example above, we were investigating a file "below" the current directory. As it turns out, we can. There are a few aliases we can use to signify certain directories. The parent directory (that is, the directory that the current directory is in) is signified by .. (two dots). In fact, every directory has a link to its parent directory by this name - .. - which will turn out to be very useful. The only exception is the root directory /, which also has this link, but there it points to the same directory. In other words, you cannot go to the parent directory of / because it has no parent directory. The path /.. goes to / itself.
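A minimal experiment with <code>..</code>, using a scratch directory created with mktemp so the example can be run anywhere:

```shell
d=$(mktemp -d)                   # a scratch directory, so the example works anywhere
mkdir -p "$d/courses/programming"
cd "$d/courses/programming"
cd ..                            # up one level: we are now in courses
pwd                              # prints the path ending in /courses
cd /                             # the root is its own parent ...
cd ..                            # ... so this stays in /
pwd                              # prints /
```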

Now, let's say we have the following directory structure somewhere in our filesystem:

school
└── courses
    ├── command_line
    │   └── meals.txt
    └── programming
        └── Inheritance_-_problems_with.pdf

If we were in the programming directory, i.e. had programming as our current directory, then how could we refer to the file meals.txt which is in our sibling directory command_line? By sibling directory, we mean that both our current directory (programming) and the directory command_line share the same parent directory (courses) as you can see in the tree above. Individuals that share a parent are siblings, right?

Back to the question, how could we, from programming refer to the meals.txt file in our sibling directory command_line?

Easy! We use the link called .. to refer to our common parent directory! The relative path from programming to meals.txt now becomes:

../command_line/meals.txt

In our current directory, the link .. goes to our parent directory courses. And from there, the relative path to the file is command_line/meals.txt. We just put a slash between .. and command_line as we would between any two directories in a path, so the full relative path becomes ../command_line/meals.txt . Make sure you understand this and ask for help from a colleague, supervisor or teacher if you don't. We'll be using relative paths a lot in this and coming courses (if you intend to use more of our course materials).

If we wanted to print the contents of the meals.txt file, we could use the command cat (explained below in a later section) to do so:

$ cat ../command_line/meals.txt 
Breakfast: Egg and tea
Lunch: Fish and chips
Snack: Sandwich and juice
Dinner: Stake and sallad
Breakfast: Egg and coffee
Lunch: Hamburger and coke
Snack: Peanuts and beer
Dinner: Pizza
Breakfast: Sandwich and milk
Lunch: Fish and potato
Snack: Apple
Dinner: Pasta and wine

Another link which is present in every directory is simply called . ("dot"). That links to the very same directory, so we can call it "current directory" or even "here". We'll show you later how you can use the dot directory as a shortcut when e.g. copying a file from somewhere else to the current directory. For those really interested already, this is how we would copy the meals.txt to the current directory if we still were in the programming directory:

$ cp ../command_line/meals.txt .

The cp command means "copy". It needs at least two pieces of information in order to know what to do:

  • What file to copy
  • What directory to copy that file to

We gave two such "arguments": first ../command_line/meals.txt, which was the relative path to the file to copy, and then ., which was the directory to which we wanted the file copied. We used the shortcut . to signify "here".

Tilde Grillbar, Sapporo, Japan
Photo: blondinrikard, Flickr, CC-BY

Finally, there's one more alias we can use as part of a path. That's ~ ("tilde"). This is an alias for your "home directory". The home directory is explained in more detail below, but in short, it is your default directory when you open a shell. Every user on the system has its own home directory.

If you want to copy a file to your home directory, you can use ~ as the destination directory provided as the last argument to the cp command: cp path/to/some/file ~ . If you have more than one user on the system, you can refer to any such user's home directory by appending the username to the tilde:

  • ~ - your own home
  • ~user2 - the home of user2
  • ~user3 - the home of user3 etc
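You can see what the tilde expands to with echo (the cp line is commented out, since the source path is hypothetical):

```shell
echo ~      # prints the absolute path of your own home directory
echo ~root  # prints the home directory of the user root (typically /root)
# copy a (hypothetical) file to your home directory:
# cp path/to/some/file ~
```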

The prompt

You have already seen some example commands above. As we mentioned, there was something in the very beginning of those command lines, that wasn't a part of what you should type in your terminal. We used a single dollar sign as what is called a "prompt".

A prompt is like a label, urging you to type something. It usually also contains a lot of information about who you are, on what computer and what the current directory is. Usually it ends with a special character (in Bash, typically a dollar sign) and a space. After that, the cursor might be blinking to really try and get your attention and urging you to please enter some commands.

This is what the prompt looks like on Rikard's work laptop:

rikard@newdelli:~$ 

The prompt ends with a space (which doesn't show above) which means that if you were to issue a command line, what you type would show up not directly after the dollar sign, but one space after it:

rikard@newdelli:~$ echo "Hello Bash"
Hello Bash
rikard@newdelli:~$ 

In the example above, we issued the command echo "Hello Bash" and pressed Enter. Bash replied with two things:

  • Hello Bash - the echo command is used for printing text to the terminal
  • rikard@newdelli:~$ - a new prompt, asking for more commands

We needed to press enter in order for Bash to care about what we were doing. This is because Bash is line-based. That means that Bash operates on lines of text. Only when we press Enter, a line is created for Bash to interpret. If it is a complete command, Bash will execute that command (if it can), or issue an error message (if it can't):

rikard@newdelli:~$ ehco "Hello Spelling!"
No command 'ehco' found, did you mean:
 Command 'echo' from package 'coreutils' (main)
ehco: command not found
rikard@newdelli:~$

If your command isn't complete yet (according to Bash's way of trying to understand what you are typing) when you press Enter, it will give you a "secondary prompt", urging you to finish what you started. Let's take an example where we want to use echo again, but we don't close the double quotes around the text we want to print before we press Enter. Bash will be patient enough to give us a secondary prompt, so that we can finish and "close" the string with the final double quote:

rikard@newdelli:~$ echo "Hello
> many lines"
Hello
many lines
rikard@newdelli:~$

As you see, the secondary prompt on Rikard's computer was a simple > (a greater-than sign followed by a space). When we closed the string with the second double quote and pressed Enter again, Bash could execute the now complete command and have echo print the two lines of text.

Let's look at the prompt on Rikard's computer and see what information there is:

rikard@newdelli:~$ 
  |   |   |    ||`-- indicating Bash at the shell (mnemonic: $ as in "BA$H")
  |   |   |    |`--- indicating current directory is ~ (rikard's home)
  |   |   |    `---- just a separator
  |   |   `--------- the computer is "newdelli" (Rikard's new Dell laptop)
  |   `------------- at the computer...
  `----------------- user is "rikard"

Read from the bottom up, to see what each part of the prompt symbolizes.

Now, if rikard uses cd to change directory to the relative path opt/progund/datorkunskap-kompendium, see how the prompt changes:

rikard@newdelli:~$ cd opt/progund/datorkunskap-kompendium
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$

The new prompt is much longer, since the current directory part now contains ~/opt/progund/datorkunskap-kompendium .

The prompt is particularly informative when you have many terminals running and you are logged into different computers in each terminal. When you do this, please keep an extra eye on the prompt to make sure that you are on the correct machine when you e.g. delete a file or do something else potentially embarrassing to get wrong. It is not uncommon to use different color backgrounds etc. in the terminal programs to make it extra clear that they are logged into different computers. Of course, hard core hackers don't do this. They never make mistakes. Just kidding. Just a tip for those of you who are nervous about forgetting where you have logged in: you can make this obvious in other ways than just relying on the prompt. Colors are one way.

Current directory

Current directory (sometimes "working directory") is Bash's way to use the abstraction that you actually "are" in a particular directory when you issue your commands, like listing the files or something else. Revisit the sections on absolute and relative paths (again, these are very central concepts when working in the command line!), and refresh your knowledge about relative paths. Relative paths originate from the current directory. Therefore it is important to know where you are if you want to use a relative path. Looking at the prompt might help you, if your prompt contains that information (often this is the default).

If you want to get a printout of the current directory, you may issue the command pwd (pwd stands for "print name of current/working directory"):

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ pwd
/home/rikard/opt/progund/datorkunskap-kompendium
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$

As you see, the current directory is printed in the form of an absolute path.

Just like we mentioned above, there is a nickname for current directory that works in any directory. That was . (a single dot). This is very useful when moving or copying things from some place to the current directory. Rather than printing the whole path to the current directory as the destination, a single dot suffices:

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ cp path/to/some/file .

The cp (stands for "copy") command above took two arguments (two pieces of extra information to instruct it what to do):

  • path/to/some/file
  • .

The "syntax" or "synopsis" for the cp command is described as this:

cp [OPTION]... SOURCE... DIRECTORY

This means, to use the cp (copy) command, you should start with the name of the command (cp) and optionally some "options" (more about that later) followed by some source files followed by a single target directory.

It might be good to get used to this way of describing the use of commands (more about that later). The square brackets around "OPTION" mean that the options are optional and can be left out. The ellipsis (...) means "one or more". There are no dots after the last part, DIRECTORY, because (of course) you can only have one target directory.

In the examples above, we have used the simpler form: "cp SOURCE DIRECTORY" where SOURCE is one single file, and DIRECTORY is one single directory (the destination for the copy, where the copied file should end up).
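The fuller SOURCE... DIRECTORY form can be sketched like this, with made-up file names in a scratch directory (the names notes.txt, meals.txt and backup are just examples):

```shell
d=$(mktemp -d) && cd "$d"  # scratch directory, so nothing real is touched
touch notes.txt meals.txt  # two empty example files
mkdir backup               # the single target DIRECTORY
cp notes.txt meals.txt backup  # SOURCE... DIRECTORY: several sources, one target
ls backup                  # lists both copies
```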

If you want a full list of what options you can use, do the following:

$ cp --help
Usage: cp [OPTION]... [-T] SOURCE DEST
  or:  cp [OPTION]... SOURCE... DIRECTORY
  or:  cp [OPTION]... -t DIRECTORY SOURCE...
Copy SOURCE to DEST, or multiple SOURCE(s) to DIRECTORY.

Mandatory arguments to long options are mandatory for short options too.
  -a, --archive                same as -dR --preserve=all
      --attributes-only        don't copy the file data, just the attributes
      --backup[=CONTROL]       make a backup of each existing destination file
  -b                           like --backup but does not accept an argument
      --copy-contents          copy contents of special files when recursive
  -d                           same as --no-dereference --preserve=links
  -f, --force                  if an existing destination file cannot be
                                 opened, remove it and try again (this option
                                 is ignored when the -n option is also used)
  -i, --interactive            prompt before overwrite (overrides a previous -n
                                  option)
  -H                           follow command-line symbolic links in SOURCE
  -l, --link                   hard link files instead of copying
  -L, --dereference            always follow symbolic links in SOURCE
  -n, --no-clobber             do not overwrite an existing file (overrides
                                 a previous -i option)
  -P, --no-dereference         never follow symbolic links in SOURCE
  -p                           same as --preserve=mode,ownership,timestamps
      --preserve[=ATTR_LIST]   preserve the specified attributes (default:
                                 mode,ownership,timestamps), if possible
                                 additional attributes: context, links, xattr,
                                 all
      --no-preserve=ATTR_LIST  don't preserve the specified attributes
      --parents                use full source file name under DIRECTORY
  -R, -r, --recursive          copy directories recursively
      --reflink[=WHEN]         control clone/CoW copies. See below
      --remove-destination     remove each existing destination file before
                                 attempting to open it (contrast with --force)
      --sparse=WHEN            control creation of sparse files. See below
      --strip-trailing-slashes  remove any trailing slashes from each SOURCE
                                 argument
  -s, --symbolic-link          make symbolic links instead of copying
  -S, --suffix=SUFFIX          override the usual backup suffix
  -t, --target-directory=DIRECTORY  copy all SOURCE arguments into DIRECTORY
  -T, --no-target-directory    treat DEST as a normal file
  -u, --update                 copy only when the SOURCE file is newer
                                 than the destination file or when the
                                 destination file is missing
  -v, --verbose                explain what is being done
  -x, --one-file-system        stay on this file system
  -Z                           set SELinux security context of destination
                                 file to default type
      --context[=CTX]          like -Z, or if CTX is specified then set the
                                 SELinux or SMACK security context to CTX
      --help     display this help and exit
      --version  output version information and exit

By default, sparse SOURCE files are detected by a crude heuristic and the
corresponding DEST file is made sparse as well.  That is the behavior
selected by --sparse=auto.  Specify --sparse=always to create a sparse DEST
file whenever the SOURCE file contains a long enough sequence of zero bytes.
Use --sparse=never to inhibit creation of sparse files.

When --reflink[=always] is specified, perform a lightweight copy, where the
data blocks are copied only when modified.  If this is not possible the copy
fails, or if --reflink=auto is specified, fall back to a standard copy.

The backup suffix is '~', unless set with --suffix or SIMPLE_BACKUP_SUFFIX.
The version control method may be selected via the --backup option or through
the VERSION_CONTROL environment variable.  Here are the values:

  none, off       never make backups (even if --backup is given)
  numbered, t     make numbered backups
  existing, nil   numbered if numbered backups exist, simple otherwise
  simple, never   always make simple backups

As a special case, cp makes a backup of SOURCE when the force and backup
options are given and SOURCE and DEST are the same name for an existing,
regular file.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/cp>
or available locally via: info '(coreutils) cp invocation'

The current directory is stored by Bash in an "environment variable" called $PWD. More about variables later.
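For example, a quick check that $PWD and the pwd command agree:

```shell
cd /tmp
echo "$PWD"  # prints /tmp - the variable holds the current directory
pwd          # prints the same path
```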

Home directory

It is not uncommon to have multiple users on a system. If you have a computer running GNU/Linux and you use it as a file server and a web server, for instance, you might have created user accounts for you and your colleagues so that each one of you can login to the server and access your files. It is also common (and recommended) that the web server application has its own user which is responsible only for running the web server program (but not meant for logging into the computer). Having multiple users on a system allows for separation of files and responsibilities.

Let's say your username is sara and your colleague's name is bob. Then you can both be logged into the system at the same time, and have your files and folders to yourself. Usually, there's also a "superuser" (or administrator) called root.

Now, it would be impractical if all of you logged into the computer and ended up in the same directory. Everyone's files would be in the way of each other (by the way, you cannot have two files with the same name in the same folder - how would we distinguish between the two when using names or paths?). For this reason, most operating systems have arranged their file systems in a way that each user has its own "home directory". On Ubuntu (and many other GNU/Linux distributions) the default path to the home of, say, user sara is /home/sara . This means that when sara logs into the computer, she will end up in her own home directory in /home/sara and when bob logs in (perhaps at the same time), he ends up in his home directory in /home/bob (which is a sibling directory to that of sara) .

To make things more secure and private, the system ensures that sara only has permission to create files in the directory tree under her own home directory. Files and folders she creates there will be accessible only by her (by default). The same goes for bob. This effectively prevents bob from overwriting files in sara's directory tree and vice versa.

As mentioned above, you may use the nickname ~ (tilde) to refer to a home directory. A single tilde means the user's own home, and a tilde followed by a username means that user's home. So, sara can refer to her own home by simply using ~ and to bob's home by using ~bob. It is allowed to append your own username to the tilde when referring to your own home, but that's redundant. For sara, using ~ means the same as using ~sara, so she'll be using the former, because she doesn't like typing things that are not required.
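A tiny demonstration of tilde expansion (a sketch; Bash also keeps your home path in the HOME environment variable, so the two lines print the same thing):

```shell
# A single tilde expands to your own home directory -- the same
# value Bash keeps in the HOME environment variable.
echo ~          # e.g. /home/sara
echo "$HOME"    # the same path
# ~username expands to that user's home; ~root is usually /root
```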

The command cd (change directory) takes you to your home directory if you issue it without any arguments (simply typing cd and hitting return will take you home).

One word of advice here. When choosing your username, use something short but clear so that it is easy to type. Use only lowercase letters (so you don't have to press Shift when typing your name) and do not use spaces in the username. Unfortunately, using spaces in a username may cause trouble for some applications and commands. This is because Bash sees spaces as separators between commands, flags and arguments. If you want to treat a word containing spaces as "one thing", you need to put quotes around it. Therefore, our general advice is to avoid spaces in names for users, files and directories. It will save you a lot of trouble, in particular when you are new to the terminal and shell. If you want to signify a space as part of a name, use an underscore instead. In general, names in the Unix and GNU/Linux world tend to be short and concise (and therefore sometimes also cryptic). You would typically never see a folder named Program Files on a Unix or GNU/Linux system. It would be called something short instead (and most certainly something without spaces and uppercase letters), like bin or apps etc.
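The quoting advice above can be tried safely in a throwaway directory (a sketch; mktemp -d just gives us a fresh scratch directory to experiment in):

```shell
# Names with spaces must be quoted; underscore names need no quoting.
cd "$(mktemp -d)"
mkdir "my project"     # quotes needed: one directory named "my project"
mkdir my_project       # no quotes needed
ls -1                  # lists both directories, one per line
```

Without the quotes, mkdir my project would create two directories, my and project.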

Issuing commands

The whole point of working in the command line is, of course, to issue command (lines). As you may have noticed (or read above), Bash is interactive and line-based. Nothing happens until you press Enter, so in itself, Bash running in a terminal is not so exciting. What is exciting, however, are the many things you can do with commands in Bash. In particular - as we will see later - the almost magical things you can do when you combine many commands on the same command line.

In fact, Bash is a complete programming language and can be used to write small (or even large) programs. It has many built-in features and constructs that allow you to use it like any programming language, either directly in the terminal, or by saving clever command lines in a text file and reusing it many times as a "script".

But the main thing we will use Bash for in this material is to issue commands. Issuing a command means using Bash to launch a small application (program), often itself called a "command". We've already seen some small examples in illustrations and examples above:

  • echo - print something to "the terminal" (actually to "standard out")
  • ls - list the files in current (or some named) directory
  • cd - change the current directory
  • cp - copy file(s) to some directory
  • pwd - print the current directory's absolute path
  • head - print the first (n) lines of a file
  • file - print some information about a file's filetype

In order to learn some commands, we first need to learn how to formulate a command. Sometimes it is very simple: pwd, for instance, doesn't need any additional information in order to know what to do. Other times, we need to provide some information to the command so that it knows what we mean. For instance, it doesn't really make sense to say "copy!" to the computer. The command cp would surely complain about us giving too little information for it to know what to copy and where:

$ cp
cp: missing file operand
Try 'cp --help' for more information.

As you saw, cp didn't like it when we didn't provide enough information. Information is provided on the command line after the actual command. There are roughly two types of extra information. First we have "options" (sometimes: "flags"). These usually start with a dash (a minus sign, -) followed by a letter. Options are typically used to convey how we would like something to be done, rather than what is to be done.

The other type is "arguments". Arguments typically convey the what, as with cp: what file(s) to copy where.

Options and arguments are often combined.
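Here is a small sketch combining an option with arguments (the file and directory names are made up for the example; -v is cp's verbose option):

```shell
# cp with both an option (-v, the "how") and arguments (the "what"):
cd "$(mktemp -d)"          # scratch directory for the experiment
echo "hello" > notes.txt   # create a small file to copy
mkdir backup
cp -v notes.txt backup/    # -v makes cp report what it copied
```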

Options or flags

Let's take the command ls as an example. If we simply type ls and nothing else, it assumes that we simply want to list the contents of the current directory in as simple a form as possible:

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls
globbing  README.md  school  script  text
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$

If we want ls to print the information differently, we can tell it how by providing a flag (a.k.a. "option"):

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -1
globbing
README.md
school
script
text

The flag -1 (dash-one) tells ls how to list the files: the "one" character (1) means "one column". By default, the command lists as many columns as will fit in your terminal, as above.

We can also instruct ls to do a longer listing with more information about each file:

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -l
total 20
drwxrwxr-x 3 rikard rikard 4096 sep 21  2018 globbing
-rw-rw-r-- 1 rikard rikard   83 sep 19  2018 README.md
drwxrwxr-x 3 rikard rikard 4096 sep 19  2018 school
drwxrwxr-x 2 rikard rikard 4096 sep 20  2018 script
drwxrwxr-x 2 rikard rikard 4096 sep 19  2018 text
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$

Wow, that's a lot of information! We will only summarize it here, without going into every detail. The first column shows the "permissions". It tells whether the file is a regular file (starts with -) or, for instance, a directory (starts with d), and what the owner of the file, the group the file belongs to, and everyone else may do with it. In short, skipping the first character (above either "d" or "-"), there are three permissions each for, in turn, the user (owner), the group and others:

  • r - read
  • w - write
  • x - execute (for directories, execute means "cd to this directory")

So looking at this line:

-rw-rw-r-- 1 rikard rikard 83 sep 19 2018 README.md

The first column starts with a dash, so it's not a directory but a normal file. Then we have rw-, meaning that the owner, rikard, has read and write permissions but not execute. Next comes rw- again, meaning that the group the file belongs to (also called rikard) has the same permissions. Finally we have r--, meaning that everyone else has only read permission. People who are not rikard and not in the group called rikard cannot write to the file, but they can read it. Nobody can execute the file as if it were a command.

The next column contains a single 1, meaning "there's only one link to this file". We'll skip the meaning of this in this context. Just bear in mind that you can have "links" to a file on the hard drive. Next comes "rikard", the username of the owner of the file, then "rikard" again, the name of the group the file belongs to (in Ubuntu and many other distributions, when you create a user, a group with the same name is automatically created and the user automatically belongs to that group). Next is "83", the size of the file in bytes. Next is "sep 19 2018", when the file was last modified, and finally the filename.

Another useful flag for ls is -a, which stands for "all". In Bash, files whose first character is a dot are not shown by the ls command or by file browsers by default. This lets you hide certain files that are less interesting or seldom used by naming them accordingly. If you really want to see them, use ls -a (or tell your file browser to "show hidden files"):

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -a
.  ..  .git  globbing  README.md  school  script  text
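You can see this hiding in action in a throwaway directory (a sketch; the file names are made up for the example):

```shell
# Files whose names start with a dot are hidden from a plain ls.
cd "$(mktemp -d)"
touch visible.txt .hidden.txt
ls          # shows only visible.txt
ls -a       # also shows .  ..  and .hidden.txt
```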

You can combine flags in various ways. The following three commands have the same meaning:

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -a -l
total 32
drwxrwxr-x  7 rikard rikard 4096 sep 21  2018 .
drwxrwxr-x 29 rikard rikard 4096 maj  8 12:07 ..
drwxrwxr-x  8 rikard rikard 4096 jul 22 14:26 .git
drwxrwxr-x  3 rikard rikard 4096 sep 21  2018 globbing
-rw-rw-r--  1 rikard rikard   83 sep 19  2018 README.md
drwxrwxr-x  3 rikard rikard 4096 sep 19  2018 school
drwxrwxr-x  2 rikard rikard 4096 sep 20  2018 script
drwxrwxr-x  2 rikard rikard 4096 sep 19  2018 text
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -al
total 32
drwxrwxr-x  7 rikard rikard 4096 sep 21  2018 .
drwxrwxr-x 29 rikard rikard 4096 maj  8 12:07 ..
drwxrwxr-x  8 rikard rikard 4096 jul 22 14:26 .git
drwxrwxr-x  3 rikard rikard 4096 sep 21  2018 globbing
-rw-rw-r--  1 rikard rikard   83 sep 19  2018 README.md
drwxrwxr-x  3 rikard rikard 4096 sep 19  2018 school
drwxrwxr-x  2 rikard rikard 4096 sep 20  2018 script
drwxrwxr-x  2 rikard rikard 4096 sep 19  2018 text
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -la
total 32
drwxrwxr-x  7 rikard rikard 4096 sep 21  2018 .
drwxrwxr-x 29 rikard rikard 4096 maj  8 12:07 ..
drwxrwxr-x  8 rikard rikard 4096 jul 22 14:26 .git
drwxrwxr-x  3 rikard rikard 4096 sep 21  2018 globbing
-rw-rw-r--  1 rikard rikard   83 sep 19  2018 README.md
drwxrwxr-x  3 rikard rikard 4096 sep 19  2018 school
drwxrwxr-x  2 rikard rikard 4096 sep 20  2018 script
drwxrwxr-x  2 rikard rikard 4096 sep 19  2018 text
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$

You can use -t to sort the files by modification date, newest first, or -tr for the reverse order:

rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -lt
total 20
drwxrwxr-x 3 rikard rikard 4096 sep 21  2018 globbing
drwxrwxr-x 2 rikard rikard 4096 sep 20  2018 script
drwxrwxr-x 2 rikard rikard 4096 sep 19  2018 text
-rw-rw-r-- 1 rikard rikard   83 sep 19  2018 README.md
drwxrwxr-x 3 rikard rikard 4096 sep 19  2018 school
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls -ltr
total 20
drwxrwxr-x 3 rikard rikard 4096 sep 19  2018 school
-rw-rw-r-- 1 rikard rikard   83 sep 19  2018 README.md
drwxrwxr-x 2 rikard rikard 4096 sep 19  2018 text
drwxrwxr-x 2 rikard rikard 4096 sep 20  2018 script
drwxrwxr-x 3 rikard rikard 4096 sep 21  2018 globbing
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$

A final note on the output from ls. You might have noticed "total 20" before each printout. That's how many "blocks" the files take up on the disk. Disks use blocks of a certain size for files. A file might be slightly smaller than a block but would still occupy a full block. This is not important in the context of this course.

Permissions

There's a little more that can be said about file permissions. So let's break it down a little in this section.

On your system, you have users. Some users are system users (not actual persons), and some users are actual persons (like yourself).

All actual person users have a home directory as explained above. And all users belong to at least one user group (group). By default, you belong to a group with the same name as your username, a group with only one member (you).

Having groups allows for a more fine-grained permission control. For instance, you can create a group for you and your colleagues, e.g. team01. File permissions work in three levels:

  • user (owner of file or directory)
  • group (members of the group that owns the file or directory)
  • other (all other users on the computer)

With a group team01, you can assign the group ownership of a directory to team01. If the directory has full permissions for both the owner and the group, then all members of team01 can cd to the directory, list files and write new files there.

For files, it's the same thing. A shared file with group ownership team01 can have, for instance, read and write permissions for both the normal owner of the file and the group owner team01. Now all members of team01 can read and write the file, while others can perhaps only read it.

rikard@newdelli:~/shared-files$ ls -l documentation.txt
-rw-rw-r-- 1 rikard team01 0 aug 15 13:10 documentation.txt

As mentioned above, the permissions listed by ls come in three groups of symbols. The example listing of documentation.txt has the following permissions:

-rw-rw-r--
 
User (rikard):  rw-
Group (team01): rw-
Other:          r--

The leading dash, symbolizes that the file is a normal file (and not, e.g. a directory - directories have a leading "d" instead).

So, the three chunks of symbols, rw-, rw- and r-- represent in order: owner, group and others. And the symbols represent the actual permissions:

  • r - read
  • w - write
  • x - execute

A dash in place of any of rwx means that the permission is missing: rw- means that execute is missing, r-- means that both write and execute are missing, etc.

Finally, we'll mention here that permissions can be changed with the chmod command using letters: chmod u+x somefile adds the execute permission for the user (owner). Similarly, chmod u-x somefile revokes the execute permission for the user (owner). The three levels of ownership are represented by:

  • u - user (owner)
  • g - group owner
  • o - others
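The letter syntax can be tried safely on a scratch file (a sketch; "somefile" is just an example name):

```shell
# Granting and revoking execute permission with chmod letter syntax.
cd "$(mktemp -d)"
touch somefile
chmod u+x somefile   # add execute for the user (owner)
ls -l somefile       # the first column now starts with -rwx
chmod u-x somefile   # revoke it again
```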

In fact, the permission for each level is stored as a number (since we are dealing with computers). We can use numbers in place of letters when changing the permissions. This is a simplified explanation of what numbers to use. For each level (u, g, o) we use one single number. To calculate what number to give, we use a sum of the permissions where:

  • 4 is the value of read permission
  • 2 is the value of write permission
  • 1 is the value of execute permission

This means that to give the user (owner) rw permission, we use 4+2=6. And the same with each ownership level. An example:

-rwxrw-r--  (user: rwx, group rw, other r)
   7  6  4

So we could do:
chmod 764 somefile

Note: Using 777 as permission for a file or directory gives full permissions to everyone, which should typically be avoided for security reasons!

As far as the scope of this course material goes, it's totally fine if you only learn about the letter version of permissions (rwx) and forget about the numeric version. We just wanted to mention it briefly, in case you run into some example somewhere using numbers for permissions.
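To connect the two notations, here is a sketch of the numeric form (assuming GNU coreutils, as on Ubuntu, for stat -c; "somefile" is an example name):

```shell
# Numeric permissions: 7 = rwx (4+2+1), 6 = rw- (4+2), 4 = r--.
cd "$(mktemp -d)"
touch somefile
chmod 764 somefile        # user rwx, group rw-, other r--
ls -l somefile            # shows -rwxrw-r--
stat -c '%a' somefile     # prints the octal permissions: 764
```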

Arguments to commands - open the fridge

After looking at the how (flags or options), it's time to look at the what. Some commands are meaningless without an argument. If you think of commands as verbs or imperatives, you will realize that our language works this way too. Without a context, telling someone to "open" or "put" without specifying what will not work. The same goes for most commands.

A few commands work from the context (usually the current directory, current user etc.) and don't require arguments. Even fewer can't take arguments at all (but can take options), like pwd (what to do is part of the command name - print working directory) and uname (print system information).

Even if ls works without an argument (without it, it assumes current directory) it is quite common to give an argument to this command. The argument or arguments are what directories to list.

The command cd also works without an argument (as a quick way to get to your home directory) but is often used to change directory to one specified directory (the what). Let's let user rikard cd to opt/progund (that's a relative path!) and then back to the home directory again in various ways. Observe how the prompt changes to indicate the current directory after each command:

rikard@newdelli:~$ cd opt
rikard@newdelli:~/opt$ cd progund
rikard@newdelli:~/opt/progund$ cd
rikard@newdelli:~$ cd opt/progund
rikard@newdelli:~/opt/progund$ cd ../../
rikard@newdelli:~$

Here's another journey through rikard's file system:

rikard@newdelli:~$ cd programming/java
rikard@newdelli:~/programming/java$ pwd
/home/rikard/programming/java
rikard@newdelli:~/programming/java$ cd ..
rikard@newdelli:~/programming$ pwd
/home/rikard/programming
rikard@newdelli:~/programming$ cd ..
rikard@newdelli:~$ pwd
/home/rikard
rikard@newdelli:~$ cd ..
rikard@newdelli:/home$ pwd
/home
rikard@newdelli:/home$ cd ..
rikard@newdelli:/$ pwd
/
rikard@newdelli:/$ cd
rikard@newdelli:~$ pwd
/home/rikard
rikard@newdelli:~$

Moving around and knowing where you are - cd and pwd

As you have learnt by now, you can refer to the prompt to see where you are or use the pwd command to print the working directory (a.k.a. "current directory"). Do you remember that we said that Bash keeps the value of the current directory in an "environment variable" called $PWD? Without yet exactly knowing what an environment variable is, let's just think of it as a named memory where you can store and retrieve information.

There is a sister to $PWD called OLDPWD. What do you think Bash stores in that variable?

Let's assume that you are doing some work in two distant directories and you frequently use cd to alternate back and forth between them. You could use a feature called Bash history and the arrow keys to retrieve the correct argument to cd every time you want to go back and forth between the two (more about the history below). But there is an easier way! Let's first alternate between two directories and print out $PWD and $OLDPWD to see what Bash puts in these two:

rikard@newdelli:~$ echo "PWD: $PWD OLDPWD: $OLDPWD"
PWD: /home/rikard OLDPWD: /home/rikard
rikard@newdelli:~$ cd opt/progund/datorkunskap-kompendium/
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ echo "PWD: $PWD OLDPWD: $OLDPWD"
PWD: /home/rikard/opt/progund/datorkunskap-kompendium OLDPWD: /home/rikard
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ cd
rikard@newdelli:~$ echo "PWD: $PWD OLDPWD: $OLDPWD"
PWD: /home/rikard OLDPWD: /home/rikard/opt/progund/datorkunskap-kompendium
rikard@newdelli:~$ cd opt/progund/datorkunskap-kompendium/
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ echo "PWD: $PWD OLDPWD: $OLDPWD"
PWD: /home/rikard/opt/progund/datorkunskap-kompendium OLDPWD: /home/rikard
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ cd
rikard@newdelli:~$ echo "PWD: $PWD OLDPWD: $OLDPWD"
PWD: /home/rikard OLDPWD: /home/rikard/opt/progund/datorkunskap-kompendium
rikard@newdelli:~$

As you see, not only does Bash keep track of the current directory, it also keeps track of the previous one!

This allows us to use a trick with cd. There is a special way to instruct cd to go back to the previous directory (making the current one the new previous directory!). You give cd a single dash as the only argument to get back to the previous directory. This means that you can alternate between two directories by continuously issuing cd - when you want to go to the other directory and back:

rikard@newdelli:~$ cd opt/progund/datorkunskap-kompendium/
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ ls
globbing  README.md  school  script  text
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ cd -
/home/rikard
rikard@newdelli:~$ echo "I'm back"
I'm back
rikard@newdelli:~$ cd -
/home/rikard/opt/progund/datorkunskap-kompendium
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ echo "Now I'm here!"
Now I'm here!
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ cd -
/home/rikard
rikard@newdelli:~$ echo "I'm back"
I'm back
rikard@newdelli:~$ cd -
/home/rikard/opt/progund/datorkunskap-kompendium
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ echo "Now I'm here!"
Now I'm here!
rikard@newdelli:~/opt/progund/datorkunskap-kompendium$ cd -
/home/rikard
rikard@newdelli:~$ echo "etc etc etc..."
etc etc etc...
rikard@newdelli:~$

As you see, the place you're going to is echoed on a new line, to remind you where you went.

Again, please review the sections about paths and make sure you now understand how to navigate your file system using cd and relative as well as absolute paths.

Creating directories - mkdir

Knowing how to move between directories is of course central to getting work done on the command line. But where do directories come from?

In fact, on most systems, when your user is created, not only the home directory is created for you. A set of standard directories is often created as well. These typically include:

  • Desktop
  • Documents
  • Downloads
  • Music
  • Pictures

These standard directories (if they are created for you) will typically be created directly in your home directory. On Rikard's computer that means they end up in:

  • /home/rikard/Desktop
  • /home/rikard/Documents
  • /home/rikard/Downloads
  • /home/rikard/Music
  • /home/rikard/Pictures

The Desktop directory is also what you see as your screen's background. Try copying some file to ~/Desktop/ and watch it show up on the graphical representation of your desktop.

It is common practice to save files from the Internet to the Desktop directory, in the hope that they will be easy to find later. This strategy might work until your desktop is overpopulated with hundreds of files and folders. In general, it doesn't make sense to use the Desktop directory as storage (although the authors admit to committing this sin repeatedly). If you look at the list of typically pre-installed directories, you get a hint of what the designers had in mind for you. Stuff you download will typically end up in Downloads initially (to be moved to some better location later). Your music (if you have any) can be placed in the Music directory, your pictures in Pictures, and your "documents" (whatever that means) in Documents.

But it doesn't have to stop here. If you, like the authors, have quite a lot of music files (legally acquired, of course) on your computer, you probably would want to make a hierarchy of directories also under Music. If you remember binary search, using a hierarchy will significantly decrease the time you will spend looking for a certain kind of music file.

As an example, we could imagine a directory structure as follows under Music/:

.
├── classical
│   ├── classicism
│   │   └── Beethoven_sym_9.mp3
│   ├── modernism
│   │   └── Stravinsky_Rite_of_spring.mp3
│   ├── modernist
│   │   └── Schoenberg.mp3
│   └── renaissance
│       └── Carlo_Gesualdo_-_Madrigaux.mp3
├── jazz
│   ├── bebop
│   │   └── Dizzy_Gillespie_-_Salty_Peanuts.mp3
│   ├── free_jazz
│   │   └── Ornette_Coleman_-_Free_jazz_pt_1.mp3
│   └── fusion
│       └── Miles_Davis_-_Bitches_Brew.mp3
└── rock
    ├── hard_rock
    │   └── Deep_Purple_-_Highway_Star.mp3
    ├── metal
    │   └── Iron_Maiden_-_Prowler.mp3
    └── rockabilly
        └── The_Rock_and_Roll_Trio_-_The_train_kept_a-rollin.mp3

13 directories, 10 files

To find Iron_Maiden_-_Prowler.mp3, you would intuitively look only in Music/rock/metal, rather than browsing through a great many files directly under Music/.

So, how do we create such a hierarchy of directories? We use the command mkdir (stands for "make directory"). This is, of course, one of those commands that require at least one argument: what directory to create. Note that the directory argument is a path, whether you like it or not! So you'd better have reviewed the sections on paths by now!

If we stand in the home directory, and want to create the directory classical under the Music directory, the new directory will have the relative path Music/classical which will also be the argument to mkdir.

But if you want to create the directory classicism under the directory classical, you cannot do that until you have created the classical directory.

Before you have created the Music/classical directory, you cannot tell mkdir to create Music/classical/classicism; it will complain, since it can't create a directory under a non-existent directory. That may seem like a disappointment, but there's an option that lets you create both directories at the same time: -p (stands for "parents").

So, either you have to first create the Music/classical directory and only then create the Music/classical/classicism directory. Or you can create both at the same time: mkdir -p Music/classical/classicism .
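A sketch of both behaviours, run from a scratch directory so nothing in your real home is touched:

```shell
# Plain mkdir fails on a missing parent; -p creates the whole chain.
cd "$(mktemp -d)"
mkdir Music/classical/classicism 2>/dev/null \
  || echo "plain mkdir refuses: parent directory is missing"
mkdir -p Music/classical/classicism   # creates all three levels at once
```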

Later, we'll see how we can use a feature called brace expansions to create the whole directory tree with just one command line.

What about removing directories? You can use the same command as for removing files, rm (stands for "remove") to remove a single directory or a whole tree and all its files, if you use the flag -r (stands for "recursive"). This is something you should use with care, however. There is no "undelete" or "Trash" in Ubuntu. The command rm removes the entry from the file system. Period. The actual data will still remain on the disk and could be recovered using special recovery software, but that's usually a procedure you want to avoid. So be careful and think before you use rm -r.

If you want to remove a single empty directory, you can instead use the command rmdir (it works with multiple arguments too, but each directory must be empty in order to be removed).
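The difference between rmdir and rm -r, sketched in a scratch directory (the "project" tree is made up for the example):

```shell
# rmdir only removes empty directories; rm -r removes a whole tree.
cd "$(mktemp -d)"
mkdir -p project/src
touch project/src/main.c
rmdir project 2>/dev/null || echo "rmdir refuses: project is not empty"
rm -r project          # removes the directory and everything in it
```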

Lastly, we'll talk a little about renaming as well. Both files and directories are renamed using the same command as for moving a file or directory. You can think of renaming as moving something from one place to the same place, but with a new name. The command is mv (stands for "move"). To rename the directory clasical to classical, you could use mv clasical classical if you are standing in clasical's parent directory. Remember, file and directory arguments are always paths. In this case, clasical is a relative path (it doesn't start with a slash), and the only place that relative path works is in the directory that contains clasical (i.e. clasical's parent directory).
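The rename from the paragraph above, sketched in a scratch directory:

```shell
# Renaming is moving: mv gives the misspelled directory its new name.
cd "$(mktemp -d)"
mkdir clasical               # oops, a typo
mv clasical classical        # "move" it to the corrected name
ls -1                        # only classical remains
```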

You cannot remove your current directory with rm -r . - rm refuses to remove "." outright. And removing the directory you are standing in rarely makes sense anyway: if it succeeded, where would you end up? So cd out of a directory before you remove it.

Comments

You can add comments at the end of lines by starting the comment with the hash character, #. Any text after the hash will be ignored by Bash:

$ echo "Hey, there!" # this is a comment
Hey, there!
$ # comments are ignored by Bash
$

What's the use of comments? Well, all commands are stored in the Bash history, so that you can find old command lines again. If you want (or need) to, you can add a comment at the end of a complicated command line explaining what it does and what it's for.

Mostly, we use comments in scripts, which we will learn about in a later module of this course material. Here's a small Bash script with comments, just to make a point:

#!/bin/bash

echo "Hello $USER"   # $USER is an environment variable with the user name
echo "Tomorrow is $(date -d tomorrow +%Y-%m-%d)" # using command substitution
                                                 # and date formatting

You'll learn what the script does in the module about Bash scripting.

Creating files

Now we understand how the file system is organized as a tree of directories. The reason for this organization into directories in a hierarchy, is of course to organize all the files in our computer. The operating system comes with thousands of files with settings and applications. But we also want to produce or download files ourselves. In this section, we'll look into how to create or obtain files.

Text files

A large portion of our files will be text files. By "text file" we mean a file with plain text. That is, only text, without any formatting like fonts, font faces, font sizes or illustrations. Just plain text.

So, what is a text file? Actually, to the computer and hard drive, there is nothing special about text files. It is still just a binary representation of data, just as with any other kind of file: a string of ones and zeros, that is, binary numbers. What makes a text file a text file is that those binary numbers represent entries in some character table. The most common text representation is to encode the characters according to the ASCII table, where each character (letters, digits, special characters and white space) is assigned a number. Using these numbers, we can represent text as a string of binary numbers. We'll look at some examples so that you get the picture.

Here's a part of the ASCII table:

 Number  Character      Number  Character
   33    !                97    a
   34    "                98    b
   35    #                99    c
   36    $               100    d
   37    %               101    e
   38    &               102    f
   39    '               103    g
   40    (               104    h
   41    )               105    i
   42    *               106    j
   43    +               107    k
   44    ,               108    l
   45    -               109    m
   46    .               110    n
   47    /               111    o
   48    0               112    p
   49    1               113    q
   50    2               114    r
   51    3               115    s
   52    4               116    t
   53    5               117    u
   54    6               118    v
   55    7               119    w
   56    8               120    x
   57    9               121    y
   58    :               122    z
   59    ;               123    {
   60    <               124    |
   61    =               125    }
   62    >               126    ~
   63    ?               127    DEL

Please revisit the module about binary representation, to refresh what you learned about binary representation of text.
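You can actually watch these numbers with the od (octal dump) command from coreutils, a sketch using its decimal output mode (-tu1 prints each byte as an unsigned decimal number, -An suppresses the address column):

```shell
# The bytes of text really are ASCII numbers: "abc" is 97 98 99.
printf 'abc' | od -An -tu1
```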

A great portion of the files on your computer are text files. This is because a lot of the configuration and settings are stored in plain text, as well as documentation (help files and manuals) and many scripts (small programs written in some interpreted language such as Bash, Perl or Python). In many of the courses found on this wiki, you also learn how to program in some programming language. Source code in a programming language is also stored in plain text files.

Many of the settings files are stored in a directory called etc. On Windows, this directory is often found in C:\Windows\System32\drivers\etc\ and on GNU/Linux and UNIX systems, the directory is /etc (a directory called etc under the root directory /).

In Windows, you can find for instance the following text files in the etc directory:

  • hosts
  • lmhosts.sam
  • networks
  • protocol
  • services

On Rikard's Ubuntu laptop, the following files (and many more) are found in /etc:

  • passwd
  • bash_completion
  • bash.bashrc
  • magic
  • sudoers
  • hosts
  • fstab
  • group
  • nsswitch.conf
  • crontab
  • hostname
  • timezone
  • issue
  • networks
  • services

You can probably guess what kind of settings many of them contain. An easy way to investigate such text files is to use cat to print them to the standard output stream (which by default is connected to your terminal window).

rikard@newdelli:~$ cat /etc/timezone 
Europe/Stockholm

Another common type of text file is HTML files. HTML is the markup language used to write web pages. A small web page example is given below:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

A particular type of text file that we are going to look at during this module is the Bash script. A Bash script is a text file containing Bash commands that would work equally well typed directly in the terminal, but that perform some task you would otherwise do many times. Small utility programs can be written as Bash scripts, and we'll soon look at some small examples of that.

Small script example

Let's say that you have friends or family living in Tehran, Iran, which has a different timezone from that of the country you are living in. You want to call them but can't remember what the time difference is, so you want to use Bash to figure out what the local time in Tehran is right now.

After reading the documentation for the date command, you find out that you can issue the following command line to get the local Tehran time:

rikard@newdelli:~$ TZ='Asia/Tehran' date +%T
14:37:42
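A couple of related invocations you can try yourself (TZ is an environment variable set just for that one command, and %T and %F are format codes understood by GNU date):

```shell
TZ='Asia/Tehran' date +%T   # the time right now in Tehran (HH:MM:SS)
TZ='UTC' date +%T           # the same moment, expressed in UTC
date +%F                    # today's date as YYYY-MM-DD
```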

The command line was TZ='Asia/Tehran' date +%T, but you realize that you won't be able to remember that gibberish either. So you decide to write a script with a descriptive name (that you will remember) effectively creating a new command on your computer. The script will be called tehran_time.sh and will have the following content:

#!/bin/bash

echo -n "Time in Tehran is now "
TZ='Asia/Tehran' date +%T

Before you can run the script as a command, you need to change the permissions of the file tehran_time.sh so that your user has the execute permission on the file and can run it. There is one more small thing that you need to know about commands before we continue. All commands on your computer, be they compiled executable programs or executable scripts, are placed in one of the directories listed in a special Bash environment variable called $PATH. This keeps things organized and secure. If you have a program or script in some other directory (like your home directory, which is not part of the list in the $PATH variable), you cannot simply type the name of the script to execute it.

Think about it. When you type date, and press Enter, Bash executes the date command and it prints the date and time. Try it!

rikard@newdelli:~$ date
ons 24 jul 2019 12:16:48 CEST

Rikard's computer has the timezone Europe/Stockholm, and the time locale set to Swedish, so the date is printed in Swedish and in Swedish local time.

But how is it that Bash could run the date command? How did it find it? Let's figure out where the actual compiled date program is installed on Rikard's computer. There's a command for that!

rikard@newdelli:~$ which date
/bin/date

OK, so the date program is located in /bin/date. What type of file is it?

rikard@newdelli:~$ file /bin/date
/bin/date: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/l, for GNU/Linux 2.6.32, BuildID[sha1]=9464cd77e7229648a7ba1368b6c848d62767dc76, stripped

OK, perhaps you couldn't make much out of that, but we can tell you that file reported that it is a compiled program that can be run on GNU/Linux.

But back to the question: how could Bash find the executable in the /bin directory? The answer is the $PATH variable. It has a list of directories for Bash to look in for any command that you try to run. On Rikard's computer, the variable contains the following list of directories (separated by colons):

rikard@newdelli:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/rikard/bin/

The list is traversed from left to right, so if there had been an executable named date in /usr/local/sbin, it would have been executed when Rikard issued the date command. There wasn't, so Bash continued down the list until it came to /bin, where it actually found date, and so it executed that one. Note that at the end of the list there's /home/rikard/bin. That's a directory that Rikard added to his PATH variable in one of his settings files for Bash, so that scripts he puts in his own bin directory in his home directory can be found by Bash.
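If you want Bash to find your own scripts by name, you can do what Rikard did and append a directory to the PATH variable yourself. A minimal sketch (to make the change permanent you would put the export line in one of your Bash settings files, e.g. ~/.bashrc on Ubuntu - the exact file can vary between systems):

```shell
mkdir -p "$HOME/bin"                       # a personal directory for scripts
export PATH="$PATH:$HOME/bin"              # append it to the search list
echo "$PATH" | tr ':' '\n' | tail -n 1     # the last directory is now ours
```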

Now, what if we create the tehran_time.sh script and put it in our home directory (which is not in the list of directories in the PATH variable), how can we run it anyway?

There are a few solutions to this. First, we change the permissions of the file so that the user has execute permission on it:

rikard@newdelli:~$ ls -l tehran_time.sh 
-rw-rw-r-- 1 rikard rikard 72 jul 24 12:30 tehran_time.sh
rikard@newdelli:~$ chmod u+x tehran_time.sh
rikard@newdelli:~$ ls -l tehran_time.sh 
-rwxrw-r-- 1 rikard rikard 72 jul 24 12:30 tehran_time.sh
rikard@newdelli:~$

Notice the difference in the permissions reported by ls -l before and after. The command used for changing permissions is called chmod (stands for "change file mode bits") and the arguments were u+x tehran_time.sh . The u+x argument means "add execute permissions for the user (owner)" and the last argument was the filename itself.

If you remember, the permissions reported by ls -l had four parts. The first character is a dash for normal files and a "d" for directories. Then there are three groups of three letters, for user, group and other, listing the permissions for each of those respectively. Before the change, the user's permissions were rw-, meaning read and write but not execute. After the change they became rwx. So, execute permission was indeed added for the owner (rikard) but not for anyone else.
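You can also set an exact permission string for all three levels at once, using chmod's symbolic syntax. A small sketch (demo.txt is a hypothetical file created just for this example):

```shell
touch demo.txt                 # create an empty file
chmod u=rwx,g=r,o= demo.txt    # user: rwx, group: r only, other: nothing
ls -l demo.txt                 # the permission string is now -rwxr-----
```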

Now the file is executable for user rikard. But since the file isn't in one of the directories in the PATH variable, we can't execute it simply by typing its name and then Enter:

rikard@newdelli:~$ tehran_time.sh
tehran_time.sh: command not found

What that error message actually means is "command not found in any of the directories listed in the PATH variable!". But we can get around this "problem" by giving a path to the command ourselves. We'll first use a relative path, then an absolute path, so that you get to refresh those concepts too. If you are in the home directory and want to give a relative path to a file which is also in the home directory, you can use the special name . (dot), meaning "this directory", followed by the slash separator and the script name:

rikard@newdelli:~$ ./tehran_time.sh 
Time in Tehran is now 15:10:36
rikard@newdelli:~$

If we want to, we can also use the absolute path from slash to the file:

rikard@newdelli:~$ /home/rikard/tehran_time.sh 
Time in Tehran is now 15:11:20
rikard@newdelli:~$

But simply giving the script name doesn't work, because Bash interprets that as the name of an executable and only looks for it in the list given in the PATH variable:

rikard@newdelli:~$ tehran_time.sh
tehran_time.sh: command not found
rikard@newdelli:~$

Another way to run a Bash script is to tell Bash to interpret and execute it. Bash is a program like any other, and its executable is called bash (all lower case). We'll type bash tehran_time.sh; our Bash running in the terminal will then start a new instance of the executable bash, and giving it a filename argument means "run the commands in this file, please". We'll see how that works, and where the bash executable is, below:

rikard@newdelli:~$ bash tehran_time.sh
Time in Tehran is now 15:15:46
rikard@newdelli:~$ which bash
/bin/bash
rikard@newdelli:~$ file /bin/bash
/bin/bash: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/l, for GNU/Linux 2.6.32, BuildID[sha1]=6f072e70e3e49380ff4d43cdde8178c24cf73daa, stripped
rikard@newdelli:~$

We tell you the whole story here, so that you will gradually get used to what programs are. So, the terminal was running one instance of bash, which is the whole idea of the terminal: we want to have a conversation with Bash interactively. When we issued the command line bash tehran_time.sh, we told our interactive session with bash in the terminal to start yet another instance of bash, just to let the new instance run our script line-by-line.

What happened was that the new instance of bash started and executed all lines in our script, then exited (terminated), and we were back in our interactive session with bash in the terminal. This is no different from when we run e.g. the date command. Our interactive session with bash runs the date command, which prints the date and time and then exits (dies or terminates).

When a program is running, something called a process is created by the operating system. The process consists of a context with some metadata (like who executed the program, from what current directory the program was executed, etc.). The operating system then executes the compiled code in the program inside this process.

Just like there were two instances (processes) of Bash running for a short while, you can run any program more than once at the same time. You can, for instance, start two terminals. The same executable is loaded into two processes which are separated from each other so that they don't interfere. You can do this with any program. Try starting two instances of your browser, for instance. They are then two separate processes running the same code from the same executable.

Rikard is running twelve terminals as he is writing this. That means that he is also running twelve instances of bash, because each terminal runs Bash interactively. We can verify this using the ps command, which lists processes. The flags ax make ps list all processes on the system, which we then filter down to the ones containing "bash" (in this case started by rikard):

rikard@newdelli:~$ ps ax | grep bash
 2911 pts/20   Ss     0:00 bash
 6656 pts/4    Ss+    0:00 bash
 8721 pts/30   S+     0:00 grep --color=auto bash
11296 pts/30   Ss     0:00 bash
11939 pts/25   Ss+    0:03 bash
17168 pts/27   Ss     0:00 bash
18330 pts/23   Ss+    0:00 bash
20676 pts/22   Ss+    0:00 bash
21531 pts/21   Ss+    0:00 bash
24930 pts/26   Ss     0:01 bash
25894 pts/24   Ss+    0:02 bash
28194 pts/18   Ss     0:00 bash
31732 pts/19   Ss     0:00 bash

Above, Rikard was also using grep to keep only the lines from the output which contained the word "bash". As a funny side effect, you can see that the line with the grep command itself was also listed, because it too contained "bash".

The list contains the process id (called "pid") for each process, what terminal it is connected to, the cumulative CPU time the process has consumed, and the state of the process. We will not go into these details within the scope of this course material, but we mention them anyway, in case someone wondered what the output meant.
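As a side note, a common trick to keep grep from listing its own process is to wrap one letter of the pattern in a character class, like this (a sketch):

```shell
# The pattern [b]ash still matches the text "bash", but grep's own
# command line contains "[b]ash", which the pattern does NOT match,
# so grep no longer lists itself in the output.
ps ax | grep '[b]ash'
```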

Before we go on, we'd also like to say something about the script. The first line of the script said #!/bin/bash . That line is often called the "shebang" (from the "#" as in "hash" - referring to "SHell" - and the "!" - bang). When you run a script, the operating system can't execute it directly, because it is not a compiled program. Instead, it needs to invoke an interpreter. After the #! you list the path to the interpreter of the language used in the script. In our case, it is a Bash script, so the interpreter needs to be /bin/bash.

If you write a Python script, the shebang will look like #!/usr/bin/env python3, which is slightly different from the shebang of our small Bash script. The env command looks up python3 in the PATH and runs it, which makes the script work even if Python is installed in different places on different systems (we won't go into the details here, but again, someone might be curious and we encourage her to use a search engine to find out more). The python3 part is the name of the Python version 3 executable installed on Rikard's laptop.

Here's a small Python script:

#!/usr/bin/env python3

print("Hello from python")

You can add executable permissions to the file (let's assume the file is called hello.py) and then run it like this:

rikard@newdelli:~$ chmod u+x hello.py
rikard@newdelli:~$ ./hello.py
Hello from python
rikard@newdelli:~$

Using an editor

Some of you might have been frustrated by the fact that even though you wanted to try writing the scripts above, you had no idea how to do it. How do we create a plain text file?

You use an application called an "editor" to author text files. The authors of this wiki use an editor called emacs, which you can install and try (ask a search engine, a colleague, a supervisor or a teacher for how to install emacs), but it takes a while to learn how to use it. Luckily, there are simpler editors available for beginners. Here we'll show you a small editor which you can run inside the terminal (without starting a new window application for the editor). It is rather convenient to run an editor inside the terminal - you have one less window to manage on your desktop, and the file you are creating is often going to be used in the terminal anyway, so why ever leave the terminal?

Nano running in the terminal

The editor we'll look at first is called nano and runs in the terminal window. Starting nano with a filename as its argument will open the file for editing if it exists, or give you the option of saving the file under that name at a later stage if it doesn't exist.

Type nano todo-list.txt and hit Enter. You have now started the nano editor inside your terminal. Write some lines of text and then press Ctrl-O (control and "o"). Nano suggests writing the file "todo-list.txt" in the current directory. Press Enter to accept. When you are finished and want to close nano, to get back to Bash in the terminal, press Ctrl-X (control and "x"). If you have unsaved edits, nano will ask you Save modified buffer (ANSWERING "No" WILL DESTROY CHANGES) ?, to which you answer "y", then accept the filename suggestion by pressing Enter, and you are back in the terminal again.

You have now created your first (?) text file using an editor!

The gedit editor

If you are more comfortable with window applications (applications that run in their own window) you may try gedit (in Ubuntu). If you want a fancier editor, you may try Atom or some other editor. Find out how to install a new editor by asking for help or searching the internet. The gedit editor is installed on Ubuntu by default and can be launched from the terminal with the filename as an argument, just as with any decent editor. Note that when you start gedit from the terminal, the terminal will be unusable until you close gedit, because you opened it "in the foreground". All Bash is doing in the terminal now is running gedit and waiting for it to finish. If you want to retain your interactive session with Bash in the terminal while gedit runs, you must start it "in the background". You can do that by adding an ampersand at the end of the command line:

rikard@newdelli:~$ gedit todo-list.txt &
[1] 10971
rikard@newdelli:~$

What you see above is gedit (which is a window application running in its own window) being started in the background. Bash reports that one process is running in the background and that it got the pid (process id) 10971 (you will get some other number, we guarantee you). Note: do not start nano in the background! Since nano runs in the terminal itself, you need it to be in the foreground so that it accepts your input ("has focus"). If you happened to start nano in the background (just because you didn't listen to us), that's no disaster. You can actually "send a job to the foreground". Just type fg (meaning "foreground") and your background process will get focus and occupy the terminal. If you for some reason have managed to start more than one process in the background, you must list the jobs by typing jobs. If the process you want to send to the foreground has number 2, you can type fg %2 .

To view the contents of your text file, you can use the cat command in the terminal, or open the file again with an editor.

We strongly recommend that whatever editor you choose to use, you start it from the terminal! This greatly helps prevent you from opening the wrong file if you happen to save two files with the same name in two different directories. We kid you not: one of the most common problems our students have in our programming classes is that they think that they are editing the program which they then compile and run, but their changes to the source code "didn't take effect". Of course changes to the source code take effect when you re-compile your program and run it. What seems to happen, almost magically, to a great number of our students is that they for some reason have managed to save the source code to a different directory. Their editor shows the code and the changes, but the file they compile, and the resulting program they run, come from an old file, not the one being edited. We are willing to bet a few Swedish kronor that this will happen to some of you too, if you take one of our programming classes!

Here's how starting the editor from the command line helps. Let's say we want to edit the tehran_time.sh script and add a second printout of "Bye!" at the end. Rather than first starting some editor and trying to open the script source code via the "File->Open" menu (with the risk of actually opening a copy of the script from some other directory), we open the script from the command line. We are standing in the home directory. We want to edit the file there called "tehran_time.sh" (and no other file). We later want to run the script to verify that our changes were to our satisfaction. Why ever leave the terminal and fiddle around with an editor and go looking for the file? We just give the file as an argument to the editor. Let's use gedit in this example.

So we are in the home directory, and that's also where the script file is. Here comes the beautiful part. If we only have one file that begins with "t", or perhaps "te", we only need to type gedit te and then press TAB. The tabulator key instructs Bash to "fill out the rest". If we only have one file whose name begins with "te", then Bash will fill out the rest of the file name. Now, this only works if there really is a file whose name starts with "te" in the home directory. So if Bash fills out the rest, we know that we have opened the right file in the correct directory. We add our changes and save the file. Still in the home directory, we now type ./tehran_time.sh and review our changes.

Trust us on this one. When you work with computers and programming, it is extremely common to edit a file and then use that file in the next command (like running the script, compiling a program etc). So try to get into the habit of never leaving the terminal, and use TAB completion. This way, you will not as easily end up with two copies of the same file in two directories, and in particular you will not as easily edit one file and then use another file which you believe is the one you were editing.

We'll talk more about some tricks on the command line, like tab completion etc later on.

Downloading files from the web

Another source of files that end up on your computer is of course files that you download from the Internet. Any file on the public web can be downloaded directly to your directory from your shell. Let's say you want to download a file from our github file repository online and use that file from the terminal. We'll use an actual file in this example, so that you can try this yourself.

The file you want to download is located on the public web here: https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt . This file will be used later in some examples. It's a small text file with some lines of text describing different food menus.

So, you want to download a file which you want to use from the command line. Of course, you can go about that task like this:

  • Open a browser
  • Paste or type in the URL to the file
  • Press Ctrl-S (or use the File->Save menu option)
    • The file now probably ends up in ~/Downloads
  • Move the file from ~/Downloads to the directory where you will use it
    • What was the name of the file, again? Is it the correct file you are moving?
  • Use the file in the directory where you moved it

Or, you can do like the authors of this wiki would do. You probably already have the terminal open, if you are anything like us. And you need a file in the current directory. So, copy the URL, and use it as an argument to some web download command, and the file will end up right before your feet in the correct place right away.

So copy the URL above. In your terminal, use wget to download the file:

rikard@newdelli:~$ wget "https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt"
--2019-07-24 14:34:27--  https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.84.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.84.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 274 [text/plain]
Saving to: ‘meals.txt’

meals.txt           100%[===================>]     274  --.-KB/s    in 0s      

2019-07-24 14:34:28 (18,4 MB/s) - ‘meals.txt’ saved [274/274]

The important part above is wget "https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt". Please note the quotes around the URL. For reasons explained later, it is a good habit to always put quotes around the URL, so that Bash treats the whole URL like one string.

What happens when you press Enter is that the wget command acts like a browser and downloads the file to the current directory. Since the URL ends with a filename (meals.txt), wget will use that name for the new file. Note that if you download it again, wget will notice that there is already a file by that name, and will name the new file meals.txt.1 instead.

That's it. You wanted a file from a URL, so you downloaded it directly to the place you wanted the file. If the URL doesn't end with a file name for some reason, or if you want to save it with a different name, you can use the flag -O (capital "o") and after that the file name: wget "https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt" -O new_name.txt .

Another program you can use for this, is curl (written by Swedish hacker Daniel Stenberg). With curl, we need some extra flags:

curl -JO "https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt"

Here's an explanation of the curl syntax:

curl -JO "url-to-the-file"
 ^    ^   ^
 |    |   |
 |    |   +--> The URL goes here (use the same as above, but it's too long to fit here)
 |    +------> flags telling curl how it should work (download only, use suggested filename)
 +-----------> the command "curl"

Flags:

  • -J - Use the filename that the web server suggests
  • -O - (capital "o") Save the file (as opposed to print its contents in the terminal)

The default behavior of curl is a little different than that of wget. The reason is that curl is a full-fledged HTTP program which can do everything (and probably a lot more) that a browser can do. Downloading files is a special case, so we need the -O flag. If you are unsure, we suggest the following flags for downloading: -LJO for, in order of appearance, "foLlow redirects", "use the server's suggested filename", "dOwnload only".

We'll say a few words next about what the web actually is and why we need special programs (like a browser, or wget and curl) to use it for downloading files. The web consists of computers on the Internet that run so-called "web server" programs. A server program is a program meant to be running all the time, serving "clients" that "request" some resources. When a browser (or another program like wget or curl) "requests" a file, the server "responds" and sends the file.

The web uses a "protocol" called HTTP (short for "HyperText Transfer Protocol"). A protocol is a specified way for computers to communicate. A browser is called a "client". The client uses HTTP to tell the "server" what it wants. If the client wants a file from the server, in HTTP that is expressed as GET /path-to-the-file. We call that "using the GET method over HTTP". There are other "methods" in the HTTP protocol, but we'll skip those here.

The client states what method it wants to use (like GET) and also sends along a lot more information to the server, to make it easier for the server software to know how to respond to the request. The server parses (interprets) the request, and sees that the client requests a certain file at a certain path. In our example, the path is /progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt . This means that on the computer where the web server is running, there is a directory structure for public files, whose root (top level directory from the web server's point of view) directory contains a directory progund with a directory datorkunskap-kompendium with a directory master with a directory school with a directory courses with a directory command_line with a file meals.txt. The file is the resource that the client requested. The whole URL describes the location of the resource:

  • https - use the HTTPS protocol (like HTTP but also encrypted and more secure)
  • :// - end of the protocol name
  • raw.githubusercontent.com - domain name which locates the server on the internet
  • /progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt - resource on that server
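A GET request for that resource is just a few lines of plain text sent to the server. As an illustration of its shape (simplified - real clients send more headers than this), we can print such a request without sending it anywhere:

```shell
# Print (not send!) a minimal HTTP GET request, just to show its shape.
# Each header line ends with CRLF, and an empty line ends the headers.
printf 'GET /progund/datorkunskap-kompendium/master/school/courses/command_line/meals.txt HTTP/1.1\r\nHost: raw.githubusercontent.com\r\nConnection: close\r\n\r\n'
```

Programs like wget and curl build a request much like this one for you, send it over the network, and then read the server's response.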

Note that IT lingo has a tendency to use the same word for different meanings, as well as using different words for the same meaning other times. A "server" can mean "a powerful computer connected on a network" as well as "a program running all the time, serving clients". A server can run one or more server programs. In this case, raw.githubusercontent.com is a server (server computer) on the internet, running a web server application.

When the server sees that the client requests that particular file, it tries to find it in its directory tree (often called "web root"). If the file is there, it sends some metadata and then the bytes that the file consists of, back to the client. The metadata can look like this (shortened to be more understandable):

200 OK
Cache-Control: max-age=300
Connection: close
Date: Wed, 24 Jul 2019 13:10:34 GMT
Via: 1.1 varnish
Accept-Ranges: bytes
Vary: Authorization,Accept-Encoding
Content-Length: 274
Content-Type: text/plain; charset=utf-8
Expires: Wed, 24 Jul 2019 13:15:34 GMT
Client-Date: Wed, 24 Jul 2019 13:10:34 GMT
Client-Peer: 151.101.84.133:443

The client (in our case, say, wget) inspects the metadata (the "headers" of the "response") and first checks that the headers start with 200 OK, which means that the server accepted the request and that there indeed was a file at the requested path. It also looks at some of the other headers, such as how the server manages caching, how large the file is (274 bytes) and what the encoding of the file is (UTF-8 - that is, how those bytes should be interpreted). After the headers come the actual bytes with the file contents. So, the server reads the bytes from its local file and sends them to the client, which saves them to the hard drive. Thus, downloading a file means getting a byte-by-byte copy of a file stored on another computer.
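Headers like these are just lines of text, so tools you already know can pick them apart. A small sketch (we save a few of the example headers to a hypothetical file headers.txt first):

```shell
# Save a few of the example headers to a file, then pull out one field
printf '200 OK\nContent-Length: 274\nContent-Type: text/plain; charset=utf-8\n' > headers.txt
grep '^Content-Length:' headers.txt    # prints: Content-Length: 274
```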

So what makes wget and curl special, is that they "speak HTTP" just like any browser. This is convenient because it means that we can use them in the terminal to download files, so that we don't have to use a whole browser for that (that's overkill!).

After downloading the file, we can use it. Print it to the standard output stream (which is shown in our terminal), for instance:

$ cat meals.txt
Breakfast: Egg and tea
Lunch: Fish and chips
Snack: Sandwich and juice
Dinner: Stake and sallad
Breakfast: Egg and coffee
Lunch: Hamburger and coke
Snack: Peanuts and beer
Dinner: Pizza
Breakfast: Sandwich and milk
Lunch: Fish and potato
Snack: Apple
Dinner: Pasta and wine

Creating files from the output of other commands

After learning that our computer comes with a lot of files, many of which are text files, that we can use an editor to create our own text files, and that we can download files (any file on the web, not only text files), we will now investigate another way of generating files in your file system.

Bash has a very powerful feature called redirection. To understand what that is, and how it works, we need first to learn about how data flows to and from applications.

A typical application can read data, process it and output new data. That's what most programs are about, actually. Your browser can read data from web servers and display it on your screen, and it can send data (requests) to servers asking for resources. The ls command can read the file system and output lists of files and directories for you to see on your screen.

These two basic flows of data, in and out, are called streams. The stream into a program is called standard in(put). The stream from a program is called standard out(put). Then there's a third stream, also going out from programs, called standard err(or), which is used for outputting error messages. There's a good reason for not sending error messages to standard out as well, as we will see soon. All of these streams can be redirected from their default source or destination.
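You can see that standard out and standard err really are two different streams by sending one of them elsewhere. A small taste of output redirection (exists.txt is a hypothetical file we create first; 2> redirects standard err):

```shell
touch exists.txt                        # a file that does exist
ls exists.txt no_such_file              # listing on stdout, error on stderr
ls exists.txt no_such_file 2>/dev/null  # stderr discarded; only the listing remains
```

Because the error message travels on its own stream, we could hide it without losing the normal output. That is the good reason for keeping the two streams separate.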

When a program like grep is run from the terminal, by default the standard in is connected to that terminal, so that the user can input text via the keyboard. And by default, both standard out and standard err are also connected to the same terminal, so that the user can see the output there.

Many programs can also read from files, provided as arguments, as we have seen with both cat and grep. Let's take the meals.txt file as an example (the file is in a directory called tempo, in case you wonder why the prompt looks different from before):

rikard@newdelli:~/tempo$ cat meals.txt
Breakfast: Egg and tea
Lunch: Fish and chips
Snack: Sandwich and juice
Dinner: Stake and sallad
Breakfast: Egg and coffee
Lunch: Hamburger and coke
Snack: Peanuts and beer
Dinner: Pizza
Breakfast: Sandwich and milk
Lunch: Fish and potato
Snack: Apple
Dinner: Pasta and wine

rikard@newdelli:~/tempo$ grep ch meals.txt
Lunch: Fish and chips
Snack: Sandwich and juice
Lunch: Hamburger and coke
Breakfast: Sandwich and milk
Lunch: Fish and potato
rikard@newdelli:~/tempo$

What you didn't know is that both cat and grep can also be run interactively (just like bash). If you don't provide a filename argument, the commands will instead read their input from standard in, which is connected to the same terminal as the one you started the commands from. Below, I will run cat interactively. Most interactive programs are line-based, meaning that they read their input line-by-line. When interactive, we can enter text via the keyboard, and it will be processed by the command when we press Enter. The cat command is not really exciting: it reads lines and prints them back, one-by-one. Below (try it yourself!), I will start cat without any arguments, and enter three lines (each terminated by Enter - newline): "one", "two", "three". After each press of Enter, cat will echo the line back. After the last line, I will press Ctrl-D (control and "d") to send the "end of transmission" signal to the application so that it knows I don't want to play anymore, and it will terminate:

rikard@newdelli:~/tempo$ cat
one
one       <- output from cat
two
two       <- output from cat
three
three     <- output from cat
rikard@newdelli:~/tempo$

As you can see, cat echoes every line back, just the way I typed it. When we give cat a filename argument, it opens the file and reads it line-by-line and prints every line out after reading it. No difference there.

Next, we will start a conversation with grep interactively. The grep command usually first takes an argument saying what to search for (the pattern to match against its input lines) and then a filename. It reads the lines one by one from the file; if a line matches, it prints it, otherwise it continues with the next line. If I give grep only one argument (the pattern, no filename), it will instead read line by line from standard in, which is the terminal and my keyboard. Let's tell grep to match (search for lines containing) "bingo". I will enter lines with "lingo", "dingo", "zingo" and "bingo", then press Ctrl-D (end-of-transmission - I don't wanna play any more). If grep does its job, only the last line will be echoed back. Let's try:

rikard@newdelli:~/tempo$ grep bingo
lingo
dingo
zingo
bingo
bingo       <- output from grep
rikard@newdelli:~/tempo$

Try it yourself!

Now, we'll redirect the standard in stream to come from a file rather than from the keyboard/terminal. To redirect the input, use the < (less than) character and then the file. Let's tell cat to read its standard input from the file meals.txt instead of from the terminal:

rikard@newdelli:~/tempo$ cat < meals.txt
Breakfast: Egg and tea
Lunch: Fish and chips
Snack: Sandwich and juice
Dinner: Stake and sallad
Breakfast: Egg and coffee
Lunch: Hamburger and coke
Snack: Peanuts and beer
Dinner: Pizza
Breakfast: Sandwich and milk
Lunch: Fish and potato
Snack: Apple
Dinner: Pasta and wine
rikard@newdelli:~/tempo$

And now, let's redirect grep's input to come from the same file, rather than the terminal/keyboard:

rikard@newdelli:~/tempo$ grep ch < meals.txt 
Lunch: Fish and chips
Snack: Sandwich and juice
Lunch: Hamburger and coke
Breakfast: Sandwich and milk
Lunch: Fish and potato
rikard@newdelli:~/tempo$

The effect is the same as if we'd given the filename as an argument, but technically speaking, it's a little different. The standard input really is redirected to come from the file when we use <.
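
There is one observable difference you can try yourself: with a filename argument, the command knows the file's name, but with redirection it only sees an anonymous stream. A small sketch using wc (a counting command covered later in this chapter) and a made-up example file:

```shell
# Create a small example file with three lines:
printf 'one\ntwo\nthree\n' > sample.txt

# With a filename argument, wc knows and prints the file's name:
wc -l sample.txt          # prints: 3 sample.txt

# With input redirection, wc only sees an anonymous stream
# and cannot print any filename:
wc -l < sample.txt        # prints: 3
```

The filename sample.txt is just an example; the point is the missing filename in the second output.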

What does all this have to do with creating new files, you may very well wonder. Well, perhaps you already guessed it! Standard out can also be redirected. We can redirect a program's standard out to go to a file (which will be created or overwritten) rather than to the terminal. Let's try this with grep. For some reason, we are interested in only the lunch meals, and want to save those in a new file lunches.txt . We know exactly who is right for the job, don't we? So we'll instruct grep to output all the lines that match the pattern "Lunch", but we'll also redirect the standard output stream to the file lunches.txt (which will be created or overwritten). We'll use greater than (>) and the filename lunches.txt for the redirection part:

rikard@newdelli:~/tempo$ grep Lunch meals.txt > lunches.txt
rikard@newdelli:~/tempo$

Wait! What happened? Nothing was output to the terminal? That's right. I forgot. We redirected all output to the file lunches.txt. Let's see what's in there:

rikard@newdelli:~/tempo$ cat lunches.txt
Lunch: Fish and chips
Lunch: Hamburger and coke
Lunch: Fish and potato
rikard@newdelli:~/tempo$

It worked! Of course it worked. From grep's perspective, not much changed. It read all the lines and only output the lines that matched the pattern "Lunch". The output went to the file instead of the terminal, because we used Bash to redirect grep's standard output to that file. We have created a new file without using an editor or downloading!

Why is this useful? Because it is very fast and convenient. As an exercise, create a text file with the content of the output from ls -l. Save the file as files_and_directories.txt. How long did it take you?

Hint: You can highlight the output with the mouse. It will then also be copied to one of the clipboards. To paste it into the editor, you can use the middle button of your mouse (if you have one). If this doesn't work for you, you can use another clipboard by highlighting the output with the mouse and pressing Ctrl-Insert. To paste from this clipboard, you press Shift-Insert. You cannot use Ctrl-C for "copy" inside the terminal, because the terminal uses this key combination for sending the interrupt signal to the application in the foreground (terminating the application). You can try out Ctrl-C to see what it does by opening gedit in the foreground (without the trailing ampersand) and pressing Ctrl-C in the terminal. As you can see, gedit is terminated by Ctrl-C. A useful application of Ctrl-C in the terminal is to abort and discard a command line. If you change your mind about a command line before you press Enter, pressing Ctrl-C will cause Bash to discard the command line and give you a new, fresh one below. Try it!

Now, delete the file (rm files_and_directories.txt) and create it again using redirection. Which was faster?

Hint: Use ls -l as usual but put the greater than character and then the filename at the end of the command line.

What if you have a file, and want to redirect more text to it, but rather than overwriting it, you want to append the new text to the end of the file? To do that, you use two greater than characters before the filename:

rikard@newdelli:~/tempo$ grep Lunch meals.txt > lunches_and_dinners.txt
rikard@newdelli:~/tempo$ grep Dinner meals.txt >> lunches_and_dinners.txt
rikard@newdelli:~/tempo$ cat lunches_and_dinners.txt
Lunch: Fish and chips
Lunch: Hamburger and coke
Lunch: Fish and potato
Dinner: Stake and sallad
Dinner: Pizza
Dinner: Pasta and wine

Puzzle: What would happen if you instead did this: egrep 'Lunch|Dinner' meals.txt > lunches_and_dinners.txt (meaning match either Lunch or Dinner)? Try to answer before you try it. The egrep command is a little more fancy than grep. In particular, it has a simple syntax for stating that you want to find lines matching any of a list of patterns: egrep 'cat|dog|horse' animals.txt. The patterns are separated by vertical bars (pipes), which you will find on a normal keyboard by pressing AltGr-< (the AltGr key and the key with less than/greater than). On Mac keyboards, you probably press some weird key and the "7" key (maybe Alt-7).

Now, we'll see how we can also redirect the error stream (standard err). Why would we want to do that? Before we answer that, we'll have to see that errors are not redirected when we use a single or double greater than character. Why? Because, normally, we are only interested in the expected output from a program when we redirect. If there's an error, we still want to see that in the terminal.

Let's try to create a file breakfasts.txt by using grep and the pattern Breakfast, but we misspell the file meals.txt:

rikard@newdelli:~/tempo$ grep Breakfast meels.txt > breakfasts.txt
grep: meels.txt: No such file or directory
rikard@newdelli:~/tempo$ ls
breakfasts.txt  lunches_and_dinners.txt  lunches.txt  meals.txt
rikard@newdelli:~/tempo$

Note two things above. First, we got the error message in the terminal even though we tried to redirect the result to the breakfasts.txt file. Second, the file was still created. But what's inside? It's empty. The reason is that when redirecting to a file, the file is first created (so that Bash has somewhere to redirect the stream). An empty file is always created before the writing to the file can begin.

It is a feature that we could see the error message. Otherwise two bad things would have happened (if also standard error was redirected). First, we'd have missed that there was an error. Second, the file wouldn't contain breakfast meals, it would have contained an error message - not what we wanted!

This is the reason that the concept of a standard out stream for expected output, and a separate standard error stream, is a good thing. We want two separate streams because normal, expected output is one thing; error messages are a completely different thing.

So when would we want to redirect the standard err stream? When we are doing many things that could go wrong, and we want to save all errors (if any) in a log file for review later (in this case, we'd also use append, to get all errors and not only the last one).

To redirect standard error, we use the following syntax command-that-could-go-wrong 2> error-file.txt. Why the "2"? Because the streams are numbered:

  • 0 - standard in
  • 1 - standard out
  • 2 - standard err

To append, you use this: command-that-could-go-wrong 2>> error-file.txt.
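
Since the streams are numbered, the redirection operators can also be written with explicit numbers: a plain > is shorthand for 1>, and < is shorthand for 0<. A small sketch (the filenames are made up for the demonstration; the || true at the end just ignores grep's failure exit status and is not important here):

```shell
# Create a small example file:
printf 'Lunch: Fish\nDinner: Pasta\n' > sample.txt

# A plain > is shorthand for 1> (redirect stream 1, standard out).
# These two lines do exactly the same thing:
grep Lunch sample.txt 1> lunches.txt
grep Lunch sample.txt > lunches.txt

# And < is shorthand for 0< (redirect stream 0, standard in):
grep Lunch 0< sample.txt > lunches.txt

# Standard err (stream 2) must always be written with its number.
# Here we use a filename that doesn't exist, and save the error:
grep Lunch no-such-file.txt 2> errors.txt || true
cat errors.txt     # prints: grep: no-such-file.txt: No such file or directory
```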

It is common to redirect both standard out (to one file) and standard error (to another file). This is how you can do that:

$ ls pictures/ 2>> errors.txt >> picture-and-movies.txt
$ ls movies/ 2>> errors.txt >> picture-and-movies.txt 
$ ls images/ 2>> errors.txt >> picture-and-movies.txt 
$ ls films/ 2>> errors.txt >> picture-and-movies.txt

$ cat errors.txt 
ls: cannot access 'images/': No such file or directory
ls: cannot access 'films/': No such file or directory
$ cat picture-and-movies.txt 
img1.png
img2.png
img3.png
img4.png
img5.png
img6.png
img7.png
img8.png
img9.png
terror-on-elm-st1.avi
terror-on-elm-st2.avi
terror-on-elm-st3.avi
terror-on-elm-st4.avi
terror-on-elm-st5.avi
terror-on-elm-st6.avi
terror-on-elm-st7.avi
terror-on-elm-st8.avi
terror-on-elm-st9.avi

Please read the above example and try to understand how it works. The syntax ls pictures/ 2>> errors.txt >> picture-and-movies.txt means: do a listing of "pictures"; if it produces an error, append that to the errors.txt file, and if it goes well, append the standard out to the file picture-and-movies.txt .
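
If you instead want both streams in the same file (for instance one combined log), there is a shorthand for that too: 2>&1 means "send stream 2 to wherever stream 1 currently goes". A small sketch with made-up directory names (the || true just ignores the failure exit status of ls; it is not important here):

```shell
# Create one directory that exists; films/ does not exist:
mkdir -p pictures
touch pictures/img1.png

# Note the order: first point stream 1 at the file, then
# point stream 2 at wherever stream 1 goes:
ls pictures/ >> everything.txt 2>&1
ls films/ >> everything.txt 2>&1 || true

cat everything.txt
# img1.png
# ls: cannot access 'films/': No such file or directory
```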

Finally, we think it is time to say a few things about filenames. You may have noticed that we have chosen filenames with suffixes (like a last name: ".txt", ".sh", ".py" etc.). This is only a convention. A file can be named anything (on a sound operating system). Bash scripts are usually named something.sh so that we can assume it is a script just by looking at the name. Python scripts are usually called something.py and text files usually something.txt .

But the name is just a name. It has no impact on the contents of the file. I could name an MP3 file warlords.txt and still play it using an MP3 player application. The contents decide what you can do with the file, not the name. And I could name a text file README.mp3 and still be able to type it to the terminal using cat. It would be stupid to do so, because I and anyone who finds this file would be confused. You can always use file to investigate the actual file type of any file. The file command is very good at identifying (sometimes guessing a little) the type of a file.

Files of special types usually have a few bytes at the beginning of the file which indicate the file type according to the standard describing that type. The file command has access to a list of such bytes (sometimes called "magic numbers") and can use that information to guess the file type. There are other ways to identify file types too; for text files, for instance, seeing that the file contains only bytes that are valid numbers in the ASCII table.

This is useful, because when you download some file and it seems that it is broken, you can use file to confirm that the file is at least of the expected type. Sometimes you will find that it isn't. Then you must investigate what was wrong. Was it the link? Is the file corrupted? Something else? Using file is one of the important skills when you work in the terminal a lot.
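
A quick sketch of the point above: the name doesn't fool file, which looks at the contents. The filename here is made up for the demonstration, and the exact wording of file's report may vary between systems:

```shell
# Create a plain text file, but give it a misleading name:
echo "Just some text" > README.mp3

# file ignores the name and inspects the contents:
file README.mp3     # prints something like: README.mp3: ASCII text
```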

Working with text files

Now we have learned the basics of working in the terminal with Bash. We've seen a few basic commands, how to move around in directories, how to create new directories, creating and downloading files and more. With these skills, we are now going to focus a little on text files, and how we can manipulate, investigate and analyze them.

Text files are, as mentioned earlier, a common file type of a rather simple format. Therefore, it is rather simple to do a lot of work with text files, if we learn to use some more basic tools. A great part of the commands you'll use when working in Bash are dedicated to text manipulation. This is because many commands both accept text input and produce text output. When working with programming and databases, you will use text files a lot. If you work with the web and produce or manage HTML files and CSS (style sheets for web pages), those are stored in text files too.

Additionally, a lot of systems administration involves working with text files. Both configuration and logging typically involves text files, so this section is in our opinion, highly relevant, regardless of what your future studies or career within IT will be.

Analyzing text files

As mentioned many times before, text to a computer is just a series of binary numbers where each number can be interpreted to represent one of the many characters in some character table (typically the ASCII table or the Unicode character set, which is a superset of the ASCII table). But if we think about text, what does it consist of at a higher level of abstraction? How is text typically organized?

Imagine if we got an email with the following body:

helloihopeyouhadagreatsummerandwelcomebacktoworkwewillhaveakickoffnextmondayandwehopetoseeyouthentheplacewillbetheteachersloungeandthetimewillbearound10amlookingforwardtoseeingyouthenbestregardstheboss

We humans have a hard time parsing (making sense of) the above. We like to organize text in words, sentences and sections. Compare it with this version, which is much easier to read:

Hello,
I hope you had a great summer, and welcome back to work.

We will have a kickoff next Monday and we hope to see you then!

The place will be the teachers' lounge, and the time will be around 10 AM.

Looking forward to seeing you then!

Best regards
/The Boss

Not that the authors of this wiki are great fans of social gatherings, but the latter version was much easier to read for us and you probably agree. But what made the latter version easier to read?

The text was divided into words, sentences and sections (paragraphs). Even if it was still plain text (no fancy layout or styling), the text made more sense this way. So how do we form words in a plain text file? We use "whitespace". Whitespace means the spacing characters in a text: space (a blank between words), tab (a longer, fixed-width horizontal space) and newline (which makes the text continue on the next line when printed or displayed). All of these whitespace characters are actually just characters like any other, and have numbers in the ASCII table like any other character. To form a section, we finish one sentence with a newline and then add another newline just after that (making the text appear to have an empty line between the sections). With sentences, we also use case - sentences and names start with a capital letter - and we use punctuation between sentences and parts of sentences, like . (dot), , (comma), : (colon), ; (semicolon), ? (question mark) and ! (exclamation mark, or "bang"). These too are just characters like any other, all assigned a number in the ASCII table (and similar character tables).
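
You can see for yourself that whitespace is just characters with codes. A small sketch using printf and od, two commands not otherwise covered here: \t and \n are "escape codes" for tab and newline (more on escaping soon), and od -c dumps every character in a file one by one:

```shell
# Write a short text containing a tab and two newlines:
printf 'a\tb\nc\n' > tiny.txt

# od -c prints each character, showing whitespace as escape
# codes (\t for the tab character, \n for the newline character):
od -c tiny.txt
```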

Here's part of the (extended) ASCII table again, with names for white spaces (except "space" which is left blank) from the output of the command ascii:

    0 NUL    16 DLE    32      48 0    64 @    80 P    96 `   112 p 
    1 SOH    17 DC1    33 !    49 1    65 A    81 Q    97 a   113 q 
    2 STX    18 DC2    34 "    50 2    66 B    82 R    98 b   114 r 
    3 ETX    19 DC3    35 #    51 3    67 C    83 S    99 c   115 s 
    4 EOT    20 DC4    36 $    52 4    68 D    84 T   100 d   116 t 
    5 ENQ    21 NAK    37 %    53 5    69 E    85 U   101 e   117 u 
    6 ACK    22 SYN    38 &    54 6    70 F    86 V   102 f   118 v 
    7 BEL    23 ETB    39 '    55 7    71 G    87 W   103 g   119 w 
    8 BS     24 CAN    40 (    56 8    72 H    88 X   104 h   120 x 
    9 HT     25 EM     41 )    57 9    73 I    89 Y   105 i   121 y 
   10 LF     26 SUB    42 *    58 :    74 J    90 Z   106 j   122 z 
   11 VT     27 ESC    43 +    59 ;    75 K    91 [   107 k   123 { 
   12 FF     28 FS     44 ,    60 <    76 L    92 \   108 l   124 | 
   13 CR     29 GS     45 -    61 =    77 M    93 ]   109 m   125 } 
   14 SO     30 RS     46 .    62 >    78 N    94 ^   110 n   126 ~ 
   15 SI     31 US     47 /    63 ?    79 O    95 _   111 o   127 DEL

Note that even the digits are treated like any other character and assigned a code in the table.

Some of the characters relevant to the introduction above are:

    9 HT (horizontal tab)
   10 LF (linefeed)
   13 CR (carriage return)
   32    (space)
   33 !
   34 "  
   35 #  
   36 $  
   37 %  
   38 &  
   39 '  
   40 (  
   41 )  
   42 *
   43 +  
   44 ,  
   45 -  
   46 .  
   47 /
   58 :   
   59 ;   
   60 <    
   61 =   
   62 >  
   63 ?    
   92 \  
   94 ^  
  124 | 
  126 ~ 

Newlines merit an extra comment here. On typical UNIX and GNU/Linux systems, the LF character signifies a newline. But on e.g. Windows, a combination is used: CRLF (first carriage return, then linefeed, like an old typewriter). We'll assume for the rest of this discussion that we're on a GNU/Linux system, where line breaks are encoded as just the LF character.
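
You can observe the difference yourself. A small sketch that creates a file with Windows-style line endings using printf (\r is the escaped carriage return, just like \n is the escaped newline) and then asks file about it; the exact wording of file's report may vary:

```shell
# Create a two-line file with Windows-style CRLF line endings:
printf 'one\r\ntwo\r\n' > dos.txt

# file notices and reports the CRLF line terminators:
file dos.txt   # prints something like: dos.txt: ASCII text, with CRLF line terminators
```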

Let's take a small text, and analyze it from a computer's point of view:

These are four words.

The text clearly contains four words and three blanks. But how many characters is that? That's actually hard to answer just by looking at the text. Remember that whitespace characters are invisible. Maybe there's a space or a tab after the dot?

If we ask ourselves "How many lines is the above text?", things get even more complicated. What is a line? It's a string of characters ending with a newline character (which makes any text after that appear on the next line when printed or displayed). We can't tell whether there's a newline at the end, since there's no (printable) text coming after the dot.

The computer would know how many characters the above text has, if it was stored in a file or if it appeared in a stream, e.g. on standard out. There are commands for counting and analyzing text that we could use for that.

When we want to express a newline from a script, command line or program, we normally use something called escaping. Try the following:

$ echo -e "one\ntwo\nthree"

The flag -e instructs echo to interpret escaped characters, and the \n is normally how we express an escaped newline. By the way, what do you think \t means?

If we had a file with the example text above, it could be represented as the following number sequence: 84 104 101 115 101 32 97 114 101 32 102 111 117 114 32 119 111 114 100 115 46. On a computer, that would be binary numbers, of course, but we use decimal here to make things more clear. Here's what the numbers represent using the ASCII table:

 84 (T)
104 (h)
101 (e)
115 (s)
101 (e)
 32 (blank)
 97 (a)
114 (r)
101 (e)
 32 (blank)
102 (f)
111 (o)
117 (u)
114 (r)
 32 (blank)
119 (w)
111 (o)
114 (r)
100 (d)
115 (s)
 46 (.)

That would make the answer to the question about the number of characters 21. But the file could also have contained the following sequence:

 84 (T)
104 (h)
101 (e)
115 (s)
101 (e)
 32 (blank)
 97 (a)
114 (r)
101 (e)
 32 (blank)
102 (f)
111 (o)
117 (u)
114 (r)
 32 (blank)
119 (w)
111 (o)
114 (r)
100 (d)
115 (s)
 46 (.)
 10 (LF - newline)

That would make the answers 22 characters and one line. So, to the computer, a "line" is a sequence of characters where the last one is a newline.

A section to the human eye is therefore encoded as some characters followed by two newline characters followed by more characters.

Next, we'll look at a command for counting lines, words, characters and bytes in text. The command is wc (stands for "word count") and is a very useful tool. We'll see some uses soon.
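
As a first taste, we can verify the character counts from the discussion above. A small sketch using printf, which (unlike echo) adds no trailing newline unless we ask for one with \n; -m counts characters and -l counts lines:

```shell
# Without a trailing newline: 21 characters, 0 lines.
printf 'These are four words.' > no-newline.txt
wc -l -m no-newline.txt      # 0 lines, 21 characters

# With a trailing newline: 22 characters, 1 line.
printf 'These are four words.\n' > newline.txt
wc -l -m newline.txt         # 1 line, 22 characters
```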

Let's use a file with some Latin written by Cicero circa 45 BC. The text was written by him - not the file, that is. The file was actually written by Rikard after stealing and copying the text from the Internet (Rikard is pretty sure that the copyright has expired, so he doesn't fear the copyright police). You can download the file from here, if you want to play along:

https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/text/lorem.txt

You should know by now, how to download a file to the current directory. We recommend that you first create a directory for this exercise, so that you don't start to pollute your home directory with a lot of unrelated files.

~$ mkdir -p bash-intro/text-files/
~$ cd bash-intro/text-files
~/bash-intro/text-files$ wget ".....(url here)"

Here's the content of this file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. Maecenas elementum lacinia congue. Suspendisse venenatis tortor dolor, sit amet congue elit lobortis sed. Duis ut augue eget augue congue vehicula. Vivamus gravida commodo placerat. Sed at libero at dolor ultricies finibus. Integer vel tempus lectus. Ut tristique tempus tempor. In et consequat elit. Vivamus efficitur, diam vel molestie tristique, dui eros dignissim eros, id rhoncus nunc leo at felis. Suspendisse at enim vel nibh dapibus maximus in ac nibh. Integer viverra magna at sapien volutpat lobortis. Praesent ornare posuere lectus imperdiet ullamcorper. Sed quis ex dignissim, rutrum tortor id, egestas lacus. Sed imperdiet dolor augue, sit amet consequat odio pellentesque sed.

Etiam fermentum finibus elit, in blandit neque hendrerit non. Vivamus venenatis lobortis nunc, ac cursus magna feugiat id. Proin vehicula ultricies dolor eu tempor. Vivamus cursus, tellus vel tincidunt bibendum, dui justo lacinia diam, sed aliquam erat purus ac eros. Pellentesque justo leo, tincidunt quis dui sed, auctor ullamcorper nisi. In ornare urna justo, sed posuere dui convallis vitae. Nullam cursus euismod commodo.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque rhoncus enim ut libero dapibus interdum. Quisque eu neque in dolor dignissim condimentum eu vel ex. Quisque a purus laoreet, pulvinar mi vitae, auctor risus. Donec sapien ligula, iaculis sollicitudin nisl fringilla, semper aliquam lectus. Donec a iaculis dolor, in placerat metus. Donec malesuada lorem a nisl pretium, mattis efficitur erat suscipit.

Pellentesque nisi tortor, rutrum et dictum rutrum, ornare bibendum purus. Nam at suscipit justo. Fusce eget faucibus eros, id volutpat diam. Morbi eget est nunc. Maecenas nibh odio, aliquam consectetur ligula convallis, luctus varius est. Sed tempor aliquam maximus. Morbi interdum purus vitae enim vehicula pharetra.

Sed ac placerat risus. Vestibulum fringilla ante in placerat dignissim. Donec odio libero, varius quis erat vitae, convallis tempor sem. Duis erat orci, pulvinar ac ante ac, vestibulum porta mauris. Praesent interdum quam quis lorem mollis, id sollicitudin ligula consectetur. Aenean ut elit sed orci consequat feugiat id eget ligula. Etiam quis tortor pulvinar risus vulputate tristique vel at lacus. Donec eleifend arcu neque, nec laoreet tortor ultricies gravida. Praesent vehicula a ipsum at pharetra.

Let's use wc to count the number of lines in the file. The flag for that is -l (dash lowercase L):

rikard@newdelli:~/bash-intro/text-files$ wc -l lorem.txt
9 lorem.txt
rikard@newdelli:~/bash-intro/text-files$
The text in a narrow terminal window

Open the file in an editor. Does it look like nine lines to you? Use cat to print the file to the terminal. Does it look like nine lines? Remember, to a computer program the number of "lines" is the number of newline characters.

The text has five sections. That's eight newlines (because there are two newlines between each section). And then the file ends with a newline. That's pretty common. When you use cat to print the file, you want your prompt to appear on a new line under the text. If the file didn't end with a newline character, the prompt would actually have appeared after the last section's final dot!

The text in a wide terminal window

Take some time to think about why the text looks like much more than nine lines in the terminal and in your editor. Try to resize your terminal and see what happens. Do the same thing with your editor window. The number of apparent lines seems to change when we resize the window. But the number of sections remains the same. This is called text wrapping. Most applications rearrange the text so that as many characters as possible are visible on each line, depending on the size of the window (the width measured in characters).

When you work with plain text, use newlines mainly to create sections and let the text wrap, if you don't want surprises for people with a window width very different from yours. If you want to use a fixed line width, we recommend 80 characters (that looks good in most applications - it is rare that a terminal is narrower than that). You would have to hit Enter after at most 80 characters on each line to create a fixed line width of 80. There are of course applications to help you with that, if it sounds like a drag.

Next, let's find out the number of words in the file:

rikard@newdelli:~/bash-intro/text-files$ wc -w lorem.txt
359 lorem.txt
rikard@newdelli:~/bash-intro/text-files$

The flag -w (stands for "words") is used and 359 words were reported. Was it correct? Just kidding. You can trust that wc knows what it's doing. But seriously, how would you have figured out the number of words before you knew about wc?

We can figure out the number of characters in the file by using the flag -m (the mnemonic is unclear; it probably comes from "multibyte characters"):

rikard@newdelli:~/bash-intro/text-files$ wc -m lorem.txt 
2438 lorem.txt

There's a flag which you shouldn't confuse with the above. It is -c and it is used to report the number of bytes in the file. For English text (ASCII table characters), the number of bytes is the same as the number of characters. But if you have e.g. Swedish characters, the file is probably encoded with e.g. the UTF-8 encoding scheme, in which Swedish characters occupy two bytes each:

rikard@newdelli:~/bash-intro/text-files$ cat swe.txt 
Å
rikard@newdelli:~/bash-intro/text-files$ file swe.txt 
swe.txt: UTF-8 Unicode text
rikard@newdelli:~/bash-intro/text-files$ wc -m swe.txt 
2 swe.txt
rikard@newdelli:~/bash-intro/text-files$ wc -c swe.txt
3 swe.txt
rikard@newdelli:~/bash-intro/text-files$

You can also get wc to report the number of characters of the longest line in a file:

rikard@newdelli:~/bash-intro/text-files$ wc -L lorem.txt 
766 lorem.txt
rikard@newdelli:~/bash-intro/text-files$

To demonstrate that a line needs to end with a newline character to count as a line, we could imagine a file with the text "These are four words." which doesn't end with a newline character. Let's see some effects of such a file:

rikard@newdelli:~/bash-intro/text-files$ cat four.txt 
These are four words.rikard@newdelli:~/bash-intro/text-files$ wc -l four.txt
0 four.txt
rikard@newdelli:~/bash-intro/text-files$ wc four.txt 
 0  4 21 four.txt
rikard@newdelli:~/bash-intro/text-files$

Note where the prompt appeared after using cat to print the file. Also note that wc didn't report any lines in the file. So it is safe to conclude that the -l flag actually counts newline characters. A "line" is a human construct, alien to the way computers "think".

Searching for text patterns in text files using grep

We've already peeked at the grep command. It is really one of the most powerful commands for analyzing text when it comes to searching for patterns. The grep command, like most (all?) commands for text processing, is line-based. It consumes lines, applies a pattern provided as an argument, and prints the matching lines.

Let's search for some patterns in the lines of the Latin text file. But, wait! It had only nine lines. That's no fun. Let's reformat the file to a fixed width of 80 characters. Open an editor and... no, wait. There's a command for that. Let's use that instead, and redirect the result to a new file:

rikard@newdelli:~/bash-intro/text-files$ fold -s --width=80 lorem.txt > lorem80.txt
rikard@newdelli:~/bash-intro/text-files$

Did it work? Let's ask wc to report the length of the widest line:

rikard@newdelli:~/bash-intro/text-files$ wc -L lorem80.txt
80 lorem80.txt
rikard@newdelli:~/bash-intro/text-files$

It worked! Lucky for us! Now, we have a lot more lines to search, which makes this exercise a lot more interesting. We'll use the new file lorem80.txt from now on.

The grep command takes a pattern as argument. This pattern is actually an expression in a language called regular expressions. This is a whole course in itself (there are thick books written about regular expressions), but we'll learn a few basics here. Let's start with something as simple as plain strings of text. What lines contain the string et anywhere in the line (even as part of a word)?

rikard@newdelli:~/bash-intro/text-files$ grep et lorem80.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 
Maecenas elementum lacinia congue. Suspendisse venenatis tortor dolor, sit amet 
congue elit lobortis sed. Duis ut augue eget augue congue vehicula. Vivamus 
tempus lectus. Ut tristique tempus tempor. In et consequat elit. Vivamus 
lectus imperdiet ullamcorper. Sed quis ex dignissim, rutrum tortor id, egestas 
lacus. Sed imperdiet dolor augue, sit amet consequat odio pellentesque sed.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque rhoncus enim 
vel ex. Quisque a purus laoreet, pulvinar mi vitae, auctor risus. Donec sapien 
iaculis dolor, in placerat metus. Donec malesuada lorem a nisl pretium, mattis 
Pellentesque nisi tortor, rutrum et dictum rutrum, ornare bibendum purus. Nam 
at suscipit justo. Fusce eget faucibus eros, id volutpat diam. Morbi eget est 
nunc. Maecenas nibh odio, aliquam consectetur ligula convallis, luctus varius 
pharetra.
mollis, id sollicitudin ligula consectetur. Aenean ut elit sed orci consequat 
feugiat id eget ligula. Etiam quis tortor pulvinar risus vulputate tristique 
vel at lacus. Donec eleifend arcu neque, nec laoreet tortor ultricies gravida. 
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

That's quite a handful. What if we are only interested in lines with the word et (Latin for "and")? Here's how to do that:

rikard@newdelli:~/bash-intro/text-files$ grep -w et lorem80.txt
tempus lectus. Ut tristique tempus tempor. In et consequat elit. Vivamus 
Pellentesque nisi tortor, rutrum et dictum rutrum, ornare bibendum purus. Nam 
rikard@newdelli:~/bash-intro/text-files$

What about the word ut?

rikard@newdelli:~/bash-intro/text-files$ grep -w ut lorem80.txt
congue elit lobortis sed. Duis ut augue eget augue congue vehicula. Vivamus 
ut libero dapibus interdum. Quisque eu neque in dolor dignissim condimentum eu 
mollis, id sollicitudin ligula consectetur. Aenean ut elit sed orci consequat 
rikard@newdelli:~/bash-intro/text-files$

But what happens if a sentence on a line begins with Ut (having capital U)? Then it is not reported. We need to add the flag -i (meaning "case Insensitive"):

rikard@newdelli:~/bash-intro/text-files$ grep -iw Ut lorem80.txt
congue elit lobortis sed. Duis ut augue eget augue congue vehicula. Vivamus 
tempus lectus. Ut tristique tempus tempor. In et consequat elit. Vivamus 
ut libero dapibus interdum. Quisque eu neque in dolor dignissim condimentum eu 
mollis, id sollicitudin ligula consectetur. Aenean ut elit sed orci consequat 
rikard@newdelli:~/bash-intro/text-files$

Flags that don't take an argument (independent flags) can be written together or separately. These flags to grep are synonymous: -i -w, -iw, and -wi.

A common word in Latin is sed (which also happens to be the name of one of the authors' favorite commands):

rikard@newdelli:~/bash-intro/text-files$ grep -wi sed lorem80.txt 
congue elit lobortis sed. Duis ut augue eget augue congue vehicula. Vivamus 
gravida commodo placerat. Sed at libero at dolor ultricies finibus. Integer vel 
lectus imperdiet ullamcorper. Sed quis ex dignissim, rutrum tortor id, egestas 
lacus. Sed imperdiet dolor augue, sit amet consequat odio pellentesque sed.
sed aliquam erat purus ac eros. Pellentesque justo leo, tincidunt quis dui sed, 
auctor ullamcorper nisi. In ornare urna justo, sed posuere dui convallis vitae. 
est. Sed tempor aliquam maximus. Morbi interdum purus vitae enim vehicula 
Sed ac placerat risus. Vestibulum fringilla ante in placerat dignissim. Donec 
mollis, id sollicitudin ligula consectetur. Aenean ut elit sed orci consequat 
rikard@newdelli:~/bash-intro/text-files$

But what if we want to do the inverse and report all lines that don't contain "Sed" or "sed"? The inverse is produced using the flag -v (stands for "invert match"):

rikard@newdelli:~/bash-intro/text-files$ grep -iwv sed lorem80.txt 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 
Maecenas elementum lacinia congue. Suspendisse venenatis tortor dolor, sit amet 
tempus lectus. Ut tristique tempus tempor. In et consequat elit. Vivamus 
efficitur, diam vel molestie tristique, dui eros dignissim eros, id rhoncus 
nunc leo at felis. Suspendisse at enim vel nibh dapibus maximus in ac nibh. 
Integer viverra magna at sapien volutpat lobortis. Praesent ornare posuere 

Etiam fermentum finibus elit, in blandit neque hendrerit non. Vivamus venenatis 
lobortis nunc, ac cursus magna feugiat id. Proin vehicula ultricies dolor eu 
tempor. Vivamus cursus, tellus vel tincidunt bibendum, dui justo lacinia diam, 
Nullam cursus euismod commodo.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque rhoncus enim 
ut libero dapibus interdum. Quisque eu neque in dolor dignissim condimentum eu 
vel ex. Quisque a purus laoreet, pulvinar mi vitae, auctor risus. Donec sapien 
ligula, iaculis sollicitudin nisl fringilla, semper aliquam lectus. Donec a 
iaculis dolor, in placerat metus. Donec malesuada lorem a nisl pretium, mattis 
efficitur erat suscipit.

Pellentesque nisi tortor, rutrum et dictum rutrum, ornare bibendum purus. Nam 
at suscipit justo. Fusce eget faucibus eros, id volutpat diam. Morbi eget est 
nunc. Maecenas nibh odio, aliquam consectetur ligula convallis, luctus varius 
pharetra.

odio libero, varius quis erat vitae, convallis tempor sem. Duis erat orci, 
pulvinar ac ante ac, vestibulum porta mauris. Praesent interdum quam quis lorem 
feugiat id eget ligula. Etiam quis tortor pulvinar risus vulputate tristique 
vel at lacus. Donec eleifend arcu neque, nec laoreet tortor ultricies gravida. 
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

On a side note, the blank lines were reported too (correctly so: they don't contain the word "sed" - they just contain a single newline).
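A minimal sketch of the same idea, on a tiny made-up file: keep only the lines lacking a given word:

```shell
# Three hypothetical lines; only the middle one lacks the word "sed".
printf 'sed one\ntwo\nSed three\n' > /tmp/invdemo.txt
grep -iwv sed /tmp/invdemo.txt   # prints only: two
```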

One powerful application of grep is to use it to recursively search all text files in a whole directory tree. The flag for that is -r (stands for "recursive"). Rikard has a lot of Java source code files on his computer. Let's see which ones in a certain directory tree contain the word "enum":

rikard@newdelli:~/bash-intro/text-files$ grep -rw enum ~/opt/progund/java-extra-lectures/enums
/home/rikard/opt/progund/java-extra-lectures/enums/exercises/suggested-solutions/Weekday.java:public enum Weekday {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/WeekDay.java:public enum WeekDay {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/EnumDemo.java:  enum Repeat {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/EnumDemo.java:  enum Status {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/EnumDemo.java:  enum StringFrom {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/school/Degree.java:  enum Level {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/school/Degree.java~:  enum Level {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/Bill.java:public enum Bill {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/Delivery.java:  public enum Type {
rikard@newdelli:~/bash-intro/text-files$

The final argument was now a directory to recursively search, rather than a file. There was a backup file among the list of matching files, /home/rikard/opt/progund/java-extra-lectures/enums/enumexample/school/Degree.java~. If we wanted to only include files that end in exactly .java in the search, this is how to do it:

rikard@newdelli:~/bash-intro/text-files$ grep -rw enum --include=*.java ~/opt/progund/java-extra-lectures/enums
/home/rikard/opt/progund/java-extra-lectures/enums/exercises/suggested-solutions/Weekday.java:public enum Weekday {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/WeekDay.java:public enum WeekDay {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/EnumDemo.java:  enum Repeat {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/EnumDemo.java:  enum Status {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/EnumDemo.java:  enum StringFrom {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/school/Degree.java:  enum Level {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/Bill.java:public enum Bill {
/home/rikard/opt/progund/java-extra-lectures/enums/enumexample/Delivery.java:  public enum Type {
rikard@newdelli:~/bash-intro/text-files$

Tiny introduction to regular expressions

We've seen that normal strings of letters work as we expect. The string "ut" is just that, those two letters in that order. But there are some special characters in the regular expression language. We'll first look at . (dot), which means "any one character". If you want to match an actual dot, you need to escape it. Here we look for lines containing "a" followed by an actual dot (sentences ending with "a"):

rikard@newdelli:~/bash-intro/text-files$ grep 'a\.' lorem80.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 
congue elit lobortis sed. Duis ut augue eget augue congue vehicula. Vivamus 
pharetra.
feugiat id eget ligula. Etiam quis tortor pulvinar risus vulputate tristique 
vel at lacus. Donec eleifend arcu neque, nec laoreet tortor ultricies gravida. 
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

Use single quotes around the regular expression, to make Bash treat it as exactly what you type (rather than trying to interpret any special characters before sending the string to grep).

Another special character in regular expressions is the $ which means "end of line". To find all lines that end with an actual dot (that is, after we fixed the line width to be exactly 80 characters, was there by chance any line that happened to end with a dot?):

rikard@newdelli:~/bash-intro/text-files$ grep '\.$' lorem80.txt
lacus. Sed imperdiet dolor augue, sit amet consequat odio pellentesque sed.
Nullam cursus euismod commodo.
efficitur erat suscipit.
pharetra.
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

If there is a special character for "end of line", there must be one for "beginning of line". That one is ^. Let's look for lines that start with a capital letter. At least the first line of each section should match, and perhaps some more lines. So we also need to express "capital letter". We can do that with a character class. Using square brackets creates a class of characters. Since the letters of the alphabet come in order in the ASCII table, we can use an interval like [A-Z] to signify the class of all upper case letters. Thus, our expression for "lines that start with an upper case letter" becomes ^[A-Z]. Note that [A-Z] is a class and represents one character belonging to that class:

rikard@newdelli:~/bash-intro/text-files$ grep '^[A-Z]'  lorem80.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 
Maecenas elementum lacinia congue. Suspendisse venenatis tortor dolor, sit amet 
Integer viverra magna at sapien volutpat lobortis. Praesent ornare posuere 
Etiam fermentum finibus elit, in blandit neque hendrerit non. Vivamus venenatis 
Nullam cursus euismod commodo.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque rhoncus enim 
Pellentesque nisi tortor, rutrum et dictum rutrum, ornare bibendum purus. Nam 
Sed ac placerat risus. Vestibulum fringilla ante in placerat dignissim. Donec 
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

Next, we'll look at some special characters for quantifiers. We can express that a character should occur zero or more times using *, and one or more times using +, and zero or one time using ?. In basic regular expressions (the default for grep) + and ? need to be escaped: \+ and \? respectively. If you don't want to have to escape them, use the flag -E (meaning "extended regular expressions").
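Here's a small hypothetical demonstration of the quantifiers (the file name and its contents are made up for the example):

```shell
printf 'ab\naxb\naxxb\n' > /tmp/quant.txt

grep 'ax*b' /tmp/quant.txt     # zero or more x: matches ab, axb and axxb
grep 'ax\+b' /tmp/quant.txt    # one or more x (escaped, basic regex): axb, axxb
grep -E 'ax+b' /tmp/quant.txt  # same thing, using extended regular expressions
grep 'ax\?b' /tmp/quant.txt    # zero or one x: ab, axb
```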

Let's see if we can come up with an expression for finding lines that both start with a capital letter and end with a dot (lines that were kind of a perfect match for 80 characters). To express that, we need to figure out how to say "starts with an upper case letter, followed by zero or more characters of any kind, followed by an actual dot which ends the line":

rikard@newdelli:~/bash-intro/text-files$ grep '^[A-Z].*\.$'  lorem80.txt
Nullam cursus euismod commodo.
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

Let's break it down:

  • "starts with an upper case letter" - ^[A-Z]
  • "followed by zero or more characters of any kind" - .*
  • "followed by an actual dot which ends the line" - \.$

But what about this line:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 

Why didn't that qualify? It turns out that fold, which we used to fix the 80-character line width, kept a trailing whitespace (a blank) at the end of the line. So the line didn't end with an actual dot, it ended with a dot followed by a space. We can fix that too, using quantifiers! Don't worry if this starts to get too complicated for you. We are including this as a bonus demonstration. Regular expressions are not easy to learn.

We are now looking for any line that "starts with an upper case letter, followed by zero or more characters of any kind, followed by an actual dot - or an actual dot and zero or more spaces - which ends the line". How do we say "zero or more spaces"? We use *:

rikard@newdelli:~/bash-intro/text-files$ grep "^[A-Z].*\.[ ]*$"  lorem80.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 
Nullam cursus euismod commodo.
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

An alternative to that last character class containing only a space, is to escape the space itself: \ (a backslash followed by a space):

rikard@newdelli:~/bash-intro/text-files$ grep "^[A-Z].*\.\ *$"  lorem80.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 
Nullam cursus euismod commodo.
Praesent vehicula a ipsum at pharetra.
rikard@newdelli:~/bash-intro/text-files$

Don't forget to quote your expression. An unquoted * will be expanded by Bash itself, via so-called globbing (explained below), before the argument is passed to grep.
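To see the safe, quoted form in action on a tiny made-up file (the quoted regex reaches grep exactly as typed):

```shell
printf 'ab\naxb\n' > /tmp/glob_demo.txt
grep 'ax*b' /tmp/glob_demo.txt   # quoted: grep receives the regex ax*b and matches both lines
# Unquoted, Bash would first try to expand ax*b as a filename pattern (globbing),
# so what grep receives would depend on which files happen to exist in the
# current directory - rarely what you intended.
```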

Printing parts of files etc

We'll look at some commands for printing parts of files next, and for reversing files in different ways.

As you might have suspected by now, the authors of this wiki like to work in the terminal and with Bash. They seldom leave the command line for doing stuff that could just as well be done with some command instead. Heck, they even use editors that run inside the terminal, so that they don't have to bother with "yet another window".

This includes getting a quick look at text files, even if they are very long text files. On Rikard's computer, there's a very long text file with dictionary words:

rikard@newdelli:~/bash-intro/text-files$ wc -l /etc/dictionaries-common/words
99171 /etc/dictionaries-common/words
rikard@newdelli:~/bash-intro/text-files$

That's a bit too much to print in its entirety using cat. Almost a hundred thousand lines! Sometimes, we only need to look at the first few lines of a file to get the idea of what's inside. Let's look at the first ten lines of that long file, using head:

rikard@newdelli:~/bash-intro/text-files$ head /etc/dictionaries-common/words
A
A's
AA's
AB's
ABM's
AC's
ACTH's
AI's
AIDS's
AM's
rikard@newdelli:~/bash-intro/text-files$

Without any flags, the head command prints the first ten lines. To print only the first five, do this:

rikard@newdelli:~/bash-intro/text-files$ head -5 /etc/dictionaries-common/words
A
A's
AA's
AB's
ABM's
rikard@newdelli:~/bash-intro/text-files$

There's a sister command to head called tail. Guess what that does? Correct! It prints the last lines of a file (by default 10, but it also accepts a flag with the desired number of last lines):

rikard@newdelli:~/bash-intro/text-files$ tail -15 /etc/dictionaries-common/words
éclair's
éclairs
éclat
éclat's
élan
élan's
émigré
émigré's
émigrés
épée
épée's
épées
étude
étude's
études
rikard@newdelli:~/bash-intro/text-files$

By the way, if this is a dictionary file, why doesn't z come at the end? Because accented letters have higher character codes than the plain letters (both in the extended ASCII table and in the Unicode character set used on modern computers), so they sort after "z", the "last" plain letter judging by its ASCII code.

You can also reverse the order of the lines in a file. It turns out that cat has a sister command humorously named tac which does that job for us:

rikard@newdelli:~/bash-intro/text-files$ cat small_text.txt
This is one line.
This is the next line.
And this is the third line.
rikard@newdelli:~/bash-intro/text-files$ tac small_text.txt
And this is the third line.
This is the next line.
This is one line.
rikard@newdelli:~/bash-intro/text-files$

There's even a command that reverses the text on each line of a file, called rev:

rikard@newdelli:~/bash-intro/text-files$ rev small_text.txt
.enil eno si sihT
.enil txen eht si sihT
.enil driht eht si siht dnA
rikard@newdelli:~/bash-intro/text-files$
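The two commands can be combined, too: running a file through both tac and rev turns it completely back to front. A tiny sketch with made-up input (using temporary files, since we haven't introduced pipes yet):

```shell
printf 'one\ntwo\n' > /tmp/rev_demo.txt
tac /tmp/rev_demo.txt > /tmp/upside_down.txt   # last line first: two, one
rev /tmp/upside_down.txt                       # then mirror each line: owt, eno
```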

Next, download the following file:

https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/text/a_few_urls.txt

Investigate the file using cat.

rikard@newdelli:~/bash-intro/text-files$ cat a_few_urls.txt 
http://www.gu.se/bazinga
https://ait.gu.se/forskning/journalister
http://www.elsewhere.org/pomo/
http://snarxiv.org/vs-arxiv/
http://www.physics.nyu.edu/faculty/sokal/afterword_v1a/afterword_v1a_singlefile.html

rikard@newdelli:~/bash-intro/text-files$

What if we wanted to use the contents of this file, but without the leading http[s]://? We will use the command cut for this job. The cut command can treat lines as a list of strings delimited by some character that the user chooses. To get rid of the http[s]:// part, we could treat the lines as strings delimited by slashes. That would allow us to keep only the fields from the third one to the end, effectively removing the protocol part from the URLs. Let's explain:

h t t p : / / w w w . g u . s e / b a z i n g a
\________/\/\_________________/  \____________/
^          ^            ^               ^
|          |            |               |
`-field 1  `-field 2    `-field 3       `-field 4

  • Field 1 contains "http:"
  • Field 2 contains "" (nothing)
  • Field 3 contains the URL "www.gu.se"
  • Field 4 contains the path at the end, "bazinga"

Using cut with the delimiter / and keeping the fields from field number 3 onwards is done like this:

rikard@newdelli:~/bash-intro/text-files$ cut -d '/' -f3- a_few_urls.txt
www.gu.se/bazinga
ait.gu.se/forskning/journalister
www.elsewhere.org/pomo/
snarxiv.org/vs-arxiv/
www.physics.nyu.edu/faculty/sokal/afterword_v1a/afterword_v1a_singlefile.html

The flag -d '/' means "use slash as the delimiter" (put the character to use in single quotes to make Bash treat it verbatim). The flag -f3- means "keep field 3 and so on". Note the trailing dash after the "3" - it means "and the rest from here".
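The field counting can be tried on a single made-up line as well (a sketch using a hypothetical temporary file):

```shell
echo 'http://www.gu.se/bazinga' > /tmp/one_url.txt
cut -d '/' -f3- /tmp/one_url.txt   # prints: www.gu.se/bazinga
cut -d '/' -f1 /tmp/one_url.txt    # prints: http:  (field 1, before the first slash)
```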

What if we want to keep only the domain name, and skip the protocol and paths? Then we keep only the third field:

rikard@newdelli:~/bash-intro/text-files$ cut -d '/' -f3 a_few_urls.txt
www.gu.se
ait.gu.se
www.elsewhere.org
snarxiv.org
www.physics.nyu.edu

rikard@newdelli:~/bash-intro/text-files$

There's more we can do to manipulate text. We can use the tr command (stands for "translate characters") to replace some characters with others. Download the following file and investigate its contents:

https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/text/replaceme.txt

rikard@newdelli:~/bash-intro/text-files$ cat replaceme.txt 
haj
att
travligt
axampel
aller hur?
rikard@newdelli:~/bash-intro/text-files$

We apologize for the text being in Swedish. The language isn't important, however, for making the point of how tr works.

Let's replace all letters 'a' with the letter 'e'. First we learn how to use tr interactively. As previously, we'll mark out with arrows what tr replies, and we quit playing by pressing Ctrl-D:

rikard@newdelli:~/bash-intro/text-files$ tr 'a' 'e'
avil
evil        <- tr replies
past
pest        <- tr replies
bad cover
bed cover   <- tr replies
(user presses Ctrl-D)
rikard@newdelli:~/bash-intro/text-files$

The tr command doesn't accept a file name argument. It only reads from standard in, and writes its replies to standard out. Luckily for us, we know how to make Bash redirect standard in to come from a file rather than interactively reading from the terminal and keyboard:

rikard@newdelli:~/bash-intro/text-files$ tr 'a' 'e' < replaceme.txt
hej
ett
trevligt
exempel
eller hur?
rikard@newdelli:~/bash-intro/text-files$

All 'a's in the file were replaced by 'e's.

We can also use tr to delete characters, using the flag -d (stands for "delete"). Here's an interactive proof of concept:

rikard@newdelli:~/bash-intro/text-files$ tr -d 's'
son
on        <- tr replies
bossy
boy       <- tr replies
stombossy
tomboy    <- tr replies
sweden
weden     <- tr replies
score
core      <- tr replies
scones
cone      <- tr replies
(user presses Ctrl-D)
rikard@newdelli:~/bash-intro/text-files$

What makes tr really powerful, is that you can replace more than one character with more than one character at the same time.

rikard@newdelli:~/bash-intro/text-files$ tr 'ng' 'ss'
ping
piss      <- tr replies
bling
bliss     <- tr replies
ang
ass       <- tr replies
bong
boss      <- tr replies
(user presses Ctrl-D)
rikard@newdelli:~/bash-intro/text-files$

You can use character classes too, which is perhaps even more powerful. The two character classes are ordered lists: each character at a position in the first class will be replaced by the character at the corresponding position in the second class:

rikard@newdelli:~/bash-intro/text-files$ tr '[a-z]' '[A-Z]'
hello
HELLO      <- tr replies
there
THERE      <- tr replies
ok
OK         <- tr replies
i
I          <- tr replies
get
GET        <- tr replies
it
IT         <- tr replies
(user presses Ctrl-D)
rikard@newdelli:~/bash-intro/text-files$

You can even create your own secret language using only tr:

rikard@newdelli:~/bash-intro/text-files$ tr 'a-zA-Z' 'n-za-mN-ZA-M'
Top secret message!
Gbc frperg zrffntr!       <- tr replies
Gbc frperg zrffntr!
Top secret message!       <- tr replies
(user presses Ctrl-D)
rikard@newdelli:~/bash-intro/text-files$
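This particular translation is the classic ROT13 encoding: each letter is shifted thirteen positions in the alphabet. Since the English alphabet has 26 letters, applying it twice brings back the original, which is why tr could decode its own output above. A sketch using redirection and hypothetical /tmp files:

```shell
echo 'Top secret message!' > /tmp/secret.txt
tr 'a-zA-Z' 'n-za-mN-ZA-M' < /tmp/secret.txt > /tmp/encoded.txt
cat /tmp/encoded.txt                            # prints: Gbc frperg zrffntr!
tr 'a-zA-Z' 'n-za-mN-ZA-M' < /tmp/encoded.txt   # prints the original text again
```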

Sorting text

What else can we do with text and a computer? One very common task is to sort text lexicographically (according to some kind of alphabet, like the ASCII table for instance).

A text file is an ordered list of lines. If we want or need to, we can change the order of the lines by sorting it according to some criteria. A sorted list is easier for humans to handle and read, because we are used to alphabetically ordered texts, for instance. A sorted list also makes it a lot easier for humans to spot duplicates, because they will come in sequence.

We can also use the computer to remove duplicates (both with sort and with the dedicated command uniq). The sort command combined with the uniq command can remove duplicates and report the number of occurrences, which turns out to be of great help if we want to make a frequency table of the words in a text. That is, a top list with the most frequent words at the top (along with their counts), in descending order.

We have prepared for you a file with every word from the Latin text file used previously. You can download it here:

https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/text/latin_words.txt

From this file, we want to create a frequency table. In order to do that, we must be able to count duplicate words. The easiest way to do that, is to sort the file and save the result in a new file:

rikard@newdelli:~/bash-intro/text-files$ sort latin_words.txt > latin_words_sorted.txt
rikard@newdelli:~/bash-intro/text-files$

We used sort without arguments to simply sort all lines in the file and redirected standard out to a new file, latin_words_sorted.txt.

Next, we want to translate all uppercase letters to lowercase letters, so that "ut" is treated as "Ut", for instance, when we count the number of duplicates in the sorted file:

rikard@newdelli:~/bash-intro/text-files$ tr 'A-Z' 'a-z' < latin_words_sorted.txt > latin_words_sorted_lower_case.txt
rikard@newdelli:~/bash-intro/text-files$

We used tr with character classes to translate everything to lowercase by redirecting standard in from latin_words_sorted.txt and redirected standard out to the new file latin_words_sorted_lower_case.txt .

Let's look at the first ten words:

rikard@newdelli:~/bash-intro/text-files$ head latin_words_sorted_lower_case.txt
a
a
a
a
ac
ac
ac
ac
ac
ac
rikard@newdelli:~/bash-intro/text-files$

Next, we'll use uniq to remove the duplicate lines and report the count. The uniq command can be used by giving two filenames as arguments: the first file will be used as input, and the result will be written to the second file. The flag -c tells uniq to also report the count of the duplicates:

rikard@newdelli:~/bash-intro/text-files$ uniq -c latin_words_sorted_lower_case.txt latin_uniq_frequencies.txt
rikard@newdelli:~/bash-intro/text-files$ head latin_uniq_frequencies.txt
      4 a
      6 ac
      2 adipiscing
      1 aenean
      4 aliquam
      4 amet
      2 ante
      1 arcu
      8 at
      2 auctor
rikard@newdelli:~/bash-intro/text-files$

You can see the first ten lines of the result above. Hmm. We wanted a top list. The result file isn't sorted. Wasn't there a command for sorting files?

The sort command can be told to sort by some column, and also to treat the contents as numeric values rather than text. Sorting numbers as text doesn't work as you might expect: lexicographically, "9" comes after "1000", for instance. If we want to sort by numeric value, we need to instruct sort to do so. So, to tell sort to sort the file latin_uniq_frequencies.txt on the first column, in descending order, treating the column numerically, we'll use the flags -k1rn (meaning k: "key", 1: "column one", r: "reverse/descending", n: "numerically"):
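The difference between text sorting and numeric sorting is easy to demonstrate on a two-line made-up file:

```shell
printf '9\n1000\n' > /tmp/nums.txt
sort /tmp/nums.txt     # lexicographic: 1000 comes first (the character "1" sorts before "9")
sort -n /tmp/nums.txt  # numeric: 9 comes first
```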

rikard@newdelli:~/bash-intro/text-files$ sort -k1rn latin_uniq_frequencies.txt > frequency_table.txt
rikard@newdelli:~/bash-intro/text-files$ head frequency_table.txt
     11 sed
      8 at
      8 dolor
      8 in
      6 ac
      6 elit
      6 id
      6 vel
      5 donec
      5 quis
rikard@newdelli:~/bash-intro/text-files$

So the top 10 most common words in the Latin text were those listed above. Actually "tortor" (meaning "torturer" or "tormentor") also had a frequency count of 5. Makes you wonder what that text really is about.

Combining files

Since many of you will later take a database course, we thought it might be interesting to see how you can actually combine two files of data to get some correlations and links between data in two files. You will be subjected to similar exercises in the database course, so it doesn't hurt to get the general idea about combining data from two files. This is a little advanced and extra, so please don't panic if you don't follow or even see the point. This section is for the most interested students!

CSV (stands for "Comma-Separated Values") files are text files where the data fields are separated by commas. It is a very common file format for simple plain text data. Let's pretend that we have two files:

https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/text/group_album.txt

https://raw.githubusercontent.com/progund/datorkunskap-kompendium/master/text/group_genre.txt

The two files contain data about music groups (or artists). The former has information about groups and their album names. The latter has information about groups and what genre they generally play. Our task is to use both files and combine them so we can get both genre and album names for each group.

This is the contents of the files:

rikard@newdelli:~/bash-intro/text-files$ cat group_album.txt 
ABBA,Arrival
Bach,Cantatas
Bowie,Low
Bowie,Station to station
INXS,Kick
Kiss,Destroyer
KSMB,Rika barn leka bäst
Styx,Crystal ball

rikard@newdelli:~/bash-intro/text-files$ cat group_genre.txt 
ABBA,Pop
Bach,Classical
Kiss,Rock
KSMB,Punk
Styx,Metal
rikard@newdelli:~/bash-intro/text-files$

We'll use the command join to join (or combine) these files, using the group/artist name as the key. That is, we want to produce a result that shows us that ABBA does Pop and has the album Arrival. So we'll treat lines in both files as linked if the artist ABBA occurs in the first column in both files.

Note that the files are sorted. This is a requirement for join to work. The join command works through both files in parallel when looking for matches; if the files weren't sorted, it would have to go through them many times to find each match. Also note that Bowie occurs in the group_album.txt file but not in the group_genre.txt file, so we'll have to handle such cases in some way too. What should we print as "genre" for the Bowie albums? That information is missing.
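A minimal sketch of join on two tiny made-up files first (by default, join prints the key, then the remaining fields of the first file, then those of the second):

```shell
printf 'ABBA,Pop\nKiss,Rock\n' > /tmp/genre_demo.txt
printf 'ABBA,Arrival\nKiss,Destroyer\n' > /tmp/album_demo.txt
join -t , /tmp/genre_demo.txt /tmp/album_demo.txt
# ABBA,Pop,Arrival
# Kiss,Rock,Destroyer
```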

Let's start with a command that ignores Bowie, since he's not present in both files.

rikard@newdelli:~/bash-intro/text-files$ join -t , -o 2.1,2.2,1.2 group_genre.txt group_album.txt
ABBA,Arrival,Pop
Bach,Cantatas,Classical
Kiss,Destroyer,Rock
KSMB,Rika barn leka bäst,Punk
Styx,Crystal ball,Metal

rikard@newdelli:~/bash-intro/text-files$

This is a break-down of the command:

  • -t - separator token is...
  • , - ... a comma
  • -o - output...
  • 2.1 - the first column of the second file (group_album.txt)
  • 2.2 - the second column of the second file (group_album.txt)
  • 1.2 - the second column of the first file (group_genre.txt)
  • group_genre.txt - first file
  • group_album.txt - second file

Seems to have worked, no?

Now, let's handle the missing data about Bowie and his genre (Bowie was a genre-spanner, so we can understand that it was hard for the author to add this information). INXS is also missing from the genre file.

We can provide some additional flags to handle this. The flag -a 2 means that lines from the second file should be printed even if no match can be found in the first file. In other words: print information about Bowie, even though he can't be found in the genre file. And the flag -e NULL means: for missing data (like Bowie's genre), print "NULL" instead. The rest is the same:

rikard@newdelli:~/bash-intro/text-files$ join -a 2 -e NULL -t , -o 2.1,2.2,1.2 group_genre.txt group_album.txt
ABBA,Arrival,Pop
Bach,Cantatas,Classical
Bowie,Low,NULL
Bowie,Station to station,NULL
INXS,Kick,NULL
Kiss,Destroyer,Rock
KSMB,Rika barn leka bäst,Punk
Styx,Crystal ball,Metal

rikard@newdelli:~/bash-intro/text-files$

This kind of exercise is quite common. Journalists who use data to investigate something often get data from two sources and in slightly different formats. Let's pretend some journalists are investigating a bank with potentially suspicious transactions. Could there be money laundering going on here? The problem is that they have two files with data. One with account numbers and the company names of the holders - and they know that some of those companies are quite shady. The other file contains bank accounts and transactions between them. They want to combine these files to see if any of the transactions involve any of the shady companies. Lucky for them, they know about join!

As a cliffhanger for the next section, this is how we can combine many commands on the same line, to extract only the artist names of those artists that are missing from the genre file:

rikard@newdelli:~/bash-intro/text-files$ join -a 2 -e NULL -t , -o 2.1,2.2,1.2 group_genre.txt group_album.txt | grep NULL | cut -d ',' -f 1|sort -u
Bowie
INXS
rikard@newdelli:~/bash-intro/text-files$

Combining commands with pipes

We've already seen that there are three streams and that we can redirect them to and from files. What makes the Bash command line extremely powerful is the possibility to also connect the output from one command to the input of another. This is called using pipes (from the fact that the pipe character, | (a.k.a. vertical bar), is used). Think of it as: you run a command and all its standard output goes into a pipe. This pipe is connected to the next command, which will use the data in the pipe as its standard input.

Using pipes eliminates the use of temporary files. Rather than running a command and redirecting its output to a file that we only will use with another command, we run the first command and pipe its output directly to the second command.
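As a hypothetical minimal sketch, both variants below print the alphabetically first line of the input; the pipe version just skips the intermediate file:

```shell
printf 'banana\napple\ncherry\n' > /tmp/fruit.txt

# With a temporary file:
sort /tmp/fruit.txt > /tmp/fruit_sorted.txt
head -1 /tmp/fruit_sorted.txt    # prints: apple

# With a pipe - no temporary file needed:
sort /tmp/fruit.txt | head -1    # prints: apple
```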

Some examples are probably needed to see how this works. Remember the file latin_words.txt? That file has all the words from the latin text file, in mixed case. Above, we created a frequency table by performing many steps and using some temporary files. We'll see here how we could use pipes instead, eventually ending up with a single command line for creating the frequency table.

We'll rehearse each step, so that you can follow what's going on. We'll start by sorting latin_words.txt and printing the first seven words of the sorted result:

rikard@newdelli:~/bash-intro/text-files$ cat latin_words.txt | sort | head -7
a
a
a
a
ac
ac
ac
rikard@newdelli:~/bash-intro/text-files$

No temporary file was created, and the file wasn't changed. We used cat to create a stream of the lines of the file, piped it to sort which sorted the stream, and piped the sorted result to head -7 which consumed the first seven lines and printed them to standard out. We could have used a shorter command line:

rikard@newdelli:~/bash-intro/text-files$ sort latin_words.txt | head -7
a
a
a
a
ac
ac
ac
rikard@newdelli:~/bash-intro/text-files$

The use of cat in the first example wasn't necessary since sort can take a file argument and read from that. But the first example showed that it is possible to pipe more than two commands together.

If we were interested in the seven last lines, we could pipe the sorted stream to tail -7 instead (showing both with and without cat):

rikard@newdelli:~/bash-intro/text-files$ cat latin_words.txt | sort | tail -7
Vivamus
Vivamus
Vivamus
viverra
volutpat
volutpat
vulputate
rikard@newdelli:~/bash-intro/text-files$ sort latin_words.txt | tail -7
Vivamus
Vivamus
Vivamus
viverra
volutpat
volutpat
vulputate
rikard@newdelli:~/bash-intro/text-files$

If you remember, the "problem" with the word file was that the words were not all in the same case. Some words existed with both an uppercase and a lowercase first letter, which prevented uniq from viewing them as duplicates. What we did was to create a temporary file, latin_words_sorted_lower_case.txt, only so that uniq could remove and count duplicates. As we said, pipes remove the need for temporary files like that.

Here's how we could change all words to lower case "on the fly" using pipes (again, showing both with and without cat):

rikard@newdelli:~/bash-intro/text-files$ cat latin_words.txt | sort | tr 'A-Z' 'a-z' | tail -7
vivamus
vivamus
vivamus
viverra
volutpat
volutpat
vulputate
rikard@newdelli:~/bash-intro/text-files$ sort latin_words.txt | tr 'A-Z' 'a-z' | tail -7
vivamus
vivamus
vivamus
viverra
volutpat
volutpat
vulputate
rikard@newdelli:~/bash-intro/text-files$

We can now simply add a pipe to uniq -c at the end of the command line to get frequencies (showing the first five):

rikard@newdelli:~/bash-intro/text-files$ sort latin_words.txt | tr 'A-Z' 'a-z' | uniq -c | head -5
      4 a
      6 ac
      2 adipiscing
      1 aenean
      4 aliquam
rikard@newdelli:~/bash-intro/text-files$

To get the top list, we should now append a pipe to sort again at the end, instructing it to read from the pipe and sort the first column numerically descending:

rikard@newdelli:~/bash-intro/text-files$ sort latin_words.txt | tr 'A-Z' 'a-z' | uniq -c | sort -rnk1 | head -11
     11 sed
      8 in
      8 dolor
      8 at
      6 vel
      6 id
      6 elit
      6 ac
      5 tortor
      5 quis
      5 donec
rikard@newdelli:~/bash-intro/text-files$

We put a pipe to head -11 at the very end, so that we could see the eleven most common words of the file.

So, what we learned here is that we could create a frequency table as one single command line. Note that when using pipes like this, each command will run in its own process. So to the CPU and memory, this isn't "faster" or "less resource consuming" than using temporary files. But it saves a lot of time to only type in one single command line.
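To convince yourself that the pipeline really produces the same result as the temporary-file approach, here is a small self-contained sketch (the file names are made up for the illustration):

```shell
# A tiny stand-in for latin_words.txt:
printf 'b\na\nB\na\n' > words.txt

# Temporary-file approach: one file per intermediate step.
sort words.txt > step1.txt
tr 'A-Z' 'a-z' < step1.txt > step2.txt
uniq -c step2.txt > result_files.txt
rm step1.txt step2.txt

# Pipe approach: the same steps, with no files in between.
sort words.txt | tr 'A-Z' 'a-z' | uniq -c > result_pipes.txt

# The two results are identical:
diff result_files.txt result_pipes.txt && echo "same result"
```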

Actually, we didn't tell you before that uniq has a -i flag which instructs it to ignore case. So we can shorten the command line even further:

rikard@newdelli:~/bash-intro/text-files$ sort -i latin_words.txt | uniq -ic | sort -rnk1 | head -11
     11 sed
      8 in
      8 dolor
      8 at
      6 vel
      6 id
      6 elit
      6 ac
      5 tortor
      5 quis
      5 Donec
rikard@newdelli:~/bash-intro/text-files$

The output still has case differences, but the counting is correct. If we can live with this, it's a rather nice little command line for creating a frequency table, don't you think?

But we still needed the latin_words.txt file. That's actually cheating. We'd want a command line that gives us the frequency table for any text file, wouldn't we?

So let's think about how to achieve this. What we want is to convert any text file into one word per line (because only then can we use uniq -c to count the duplicate words once the lines are sorted).

To get one word on each line, we can use tr to translate each blank (space) to a newline. Think about it. If you translate the spaces in the following example to newlines, the result looks promising:

rikard@newdelli:~/bash-intro/text-files$ echo "testing a theory" | tr ' ' '\n' 
testing
a
theory
rikard@newdelli:~/bash-intro/text-files$

The newline is expressed as \n in Bash and many other languages and tools.

Now we have a strategy. Let's see if we can create a command line that takes the whole lorem80.txt file and creates a frequency table from it.

One idea would be to first use tr to get one word on every line, pipe the result to sort etc as above. Let's try:

rikard@newdelli:~/bash-intro/text-files$ tr ' ' '\n' < lorem80.txt | sort -i | uniq -ic | sort -rnk1 | head -11
     33 
      8 in
      8 at
      6 vel
      6 dolor
      5 Sed
      5 quis
      5 Donec
      5 ac
      4 Vivamus
      4 ut
rikard@newdelli:~/bash-intro/text-files$

That didn't work so well. The numbers don't look the same and there's a count of 33 something (newlines, perhaps?).

Let's investigate why the count for sed isn't correct. We can use grep to find all lines that match sed case insensitively, to see what's going on:

rikard@newdelli:~/bash-intro/text-files$ tr ' ' '\n' < lorem80.txt | grep -i sed
sed.
Sed
Sed
Sed
sed.
sed
sed,
sed
Sed
Sed
sed
rikard@newdelli:~/bash-intro/text-files$

Aha! There are words with punctuation after them. Those look different from the rest, and are not counted as the same word. So we have to figure out how to remove all punctuation. But, wait! Wasn't there a way to use tr to remove characters?

rikard@newdelli:~/bash-intro/text-files$ tr ' ' '\n' < lorem80.txt | grep -i sed | tr -d '[.,;?!]'
sed
Sed
Sed
Sed
sed
sed
sed
sed
Sed
Sed
sed
rikard@newdelli:~/bash-intro/text-files$

Now we're getting somewhere! Let's use this newfound knowledge in our strategy:

rikard@newdelli:~/bash-intro/text-files$ tr ' ' '\n' < lorem80.txt | tr -d '[.,:;!?]'| sort -i | uniq -ic | sort -rnk1 | head -11
     33 
     11 sed
      8 in
      8 dolor
      8 at
      6 vel
      6 id
      6 elit
      6 ac
      5 tortor
      5 quis
rikard@newdelli:~/bash-intro/text-files$

Better! But there's still the matter of the 33 newlines. They probably come from two sources: lines that end with one or more spaces before the newline (those spaces are converted to newlines too) and the empty lines between sections (paragraphs).

So, in our stream of one-word-per-line, there are a lot of empty lines. We can use grep to ignore empty lines. The expression for an empty line is ^$ (the start of the line immediately followed by the end of the line, with nothing in between).
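As a quick sanity check of the ^$ pattern, here is a minimal sketch run on a small stream containing empty lines:

```shell
# Three words separated by empty lines; grep -v '^$' drops the empty ones:
printf 'one\n\ntwo\n\nthree\n' | grep -v '^$'
# one
# two
# three
```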

rikard@newdelli:~/bash-intro/text-files$ tr ' ' '\n' < lorem80.txt | tr -d '[.,:;!?]'| grep -v '^$'| sort -i | uniq -ic | sort -rnk1 | head -11
     11 sed
      8 in
      8 dolor
      8 at
      6 vel
      6 id
      6 elit
      6 ac
      5 tortor
      5 quis
      5 Donec
rikard@newdelli:~/bash-intro/text-files$

And we're there! We added a pipe to grep -v '^$' right after removing the punctuation, and boom, the empty lines were gone.

Again, don't panic if you didn't follow every step. We just want you to realize that you can do rather complicated things with one single command line using pipes. Try to read the steps again, and try each step out on your own computer. Don't be afraid to come up with your own exercises to see if you can figure out how to use pipes for a different problem.
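If you want to experiment without the lorem80.txt file, here is a self-contained sketch of the whole pipeline run on a tiny made-up text (the file name sample.txt is invented for the illustration; we use sort -f here, which explicitly folds case while sorting):

```shell
# A tiny text with mixed case and punctuation:
printf 'Apple pear apple. Pear apple!\n' > sample.txt

# One word per line, strip punctuation, drop empty lines,
# sort case-insensitively, count duplicates, sort by count:
tr ' ' '\n' < sample.txt \
  | tr -d '[.,:;!?]' \
  | grep -v '^$' \
  | sort -f \
  | uniq -ic \
  | sort -rnk1
```

The counts should show apple three times and pear twice, regardless of case.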

Editing the command line and some other tricks

Since we work a lot in the command line, it is useful to know some tricks for editing the command line, reusing previous command lines and more. Knowing these kinds of tricks makes working in the command line much less painful and much quicker.

Bash history

Every command you issue in the command line is saved in a history. This is a good thing: it greatly helps when you want to do the same thing again. The easiest way to reuse an old command is to use the arrow-up and arrow-down keys.

Try this out. On an empty command line, press arrow-up a few times until you find the old command you want to issue again. The arrow-up key goes back in the history and the arrow-down key goes forward (if you happened to go too far back). Try it! It is not easy to illustrate in text!

So for instance, type echo "hello" and Enter. Then press arrow-up once, and Enter again. The same command was issued again. Doing the same thing twice is actually so common that there's a shortcut for repeating the previous command. Issue the command echo "again". Then issue the command !! (two bangs). See what happened?

rikard@newdelli:~/bash-intro/text-files$ echo "again"
again
rikard@newdelli:~/bash-intro/text-files$ !!
echo "again"
again
rikard@newdelli:~/bash-intro/text-files$

Now, the history gets quite long after a while. In Rikard's current terminal it's 1001 commands long. How did he know that?

rikard@newdelli:~/bash-intro/text-files$ history | wc -l
1001
rikard@newdelli:~/bash-intro/text-files$

The history command prints out the whole command line history. Rikard wasn't interested in counting the lines by hand, nor in seeing them, so he used a pipe to wc -l. See? Using pipes is quite handy.

Sometimes you just know that you have a command line in the history but you don't have the time to press arrow-up a hundred times or more to find it. Then you can actually search the command line. Press Ctrl-R (Control and "r") and enter some part of the command line you were looking for. If it's there in the history, you will find it (or keep pressing Ctrl-R until you find it if there are many commands with the same pattern). When the correct command line is displayed, you can simply press Enter to execute it. This is what it looks like when Rikard searches for "again":

(reverse-i-search)`again': echo "again"

The text between ` and ' is what you have typed as the search pattern. The text after : is the command line that was found.

Moving the cursor around on the command line

It is very common that you want to reuse a command line but change a small part of it. So you should learn a few ways to edit and move around on the command line.

Let's say we want to create the directory apa and immediately after that cd into it. We could then reuse the first command line and change it to use cd instead of mkdir. Issue:

$ mkdir apa

Then press arrow-up, so that the same command line shows again. In a history command line, you always get the cursor at the end of the line. Now, press Ctrl-<- (Control and arrow-left). This jumps one word to the left, so that you end up just after the word mkdir, on the first letter of "apa". Next, press Ctrl-W (stands for "wipe out the word to the left") to get rid of mkdir. Type cd and a space, then Enter. Did it work? Ask a colleague, friend, supervisor or teacher if you have trouble following this.

So, you can move around one word at the time using Ctrl-leftarrow to go left and Ctrl-rightarrow to go right. To go to the start of the line, you press Ctrl-A (Control and "a"), to go to the end of the line you press Ctrl-E (Control and "e").

Cut and paste

You can cut a word to the left of the cursor by Ctrl-W and you can cut the rest of the line from the cursor using Ctrl-K. To paste the thing you cut last back to the command line, you use Ctrl-Y.

Changing stuff in place

You can swap two letters immediately to the left of the cursor by pressing Ctrl-T. Let's say you wanted to write cd apa but you wrote cd aap . Press arrow-up to get the faulty line. Your cursor is now after the "p". Press Ctrl-T and watch how the last two letters change place with each other, so that the line becomes cd apa. Not the most commonly used trick, but you might use it to impress your colleagues at some point. Just wait for the right moment.

You can also change the case of a word. Put the cursor to the left of the word you want to change (first letter also works). Press Alt-U to change all letters of the current word (from the cursor position and forward) to Uppercase. Press Alt-L to change to Lowercase instead.

Pasting the previous command line's last argument

This might sound weird, but it is very common to reuse the previous command's last argument. The reason is that we work a lot with files, and the filename is often the last argument of a command. For instance, we might investigate the first lines of a file using head and realize this is the file we want to edit. The next command line then invokes your editor with the same filename as the last argument. Pressing Esc . (Escape and then dot), or alternatively Alt-. (the Alt key and dot at the same time), pastes the last command's last argument.

Exit status

Exit status (sometimes: exit code) tells us what the previous command reports about the success or failure of its execution. As mentioned earlier, a program is executed by the operating system in what is called a process. Now, a process can terminate in more than one way. What we hope for is that it terminates in a good, expected and successful way. But sometimes something goes wrong, and the process exits in an unexpected (or unwanted) way.

How the last program exited is communicated from the operating system to the shell by setting the value of a special variable. The variable is called $? (dollar-question mark). If all went well, the value of the variable is zero. You can try to remember that as "no problem (zero problems)" if you like. But if something goes wrong (according to the program) so that the program couldn't complete its task, then the value is some other positive numeric value (like one or two or something else).

This is quite useful, because what you want is a way to notice whether a program exited successfully. But often there's more than one way a program or command can fail, and allowing more than one value for indicating failure gives us the opportunity to get feedback about how the program failed. Often, however, there is only the value one, signifying a general failure. But some programs (it's completely up to the programmer to decide) allow for more error codes (a.k.a. exit codes or exit statuses).
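A quick way to play with exit statuses, without needing a program that happens to fail, is to set one yourself in a subshell (a minimal sketch):

```shell
(exit 0)    # a subshell that terminates successfully
echo $?     # prints 0
(exit 3)    # a subshell that terminates with status 3
echo $?     # prints 3
```

Note that $? is overwritten by every command, so read it immediately or save it in a variable of your own.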

You can read the manual for a program to learn what exit codes other than zero and one exist (if any). Type man ls to read the manual for ls. You can scroll the manual using the arrow keys, and space scrolls down a whole page. To quit the manual, just press q (stands for "quit").

What are the various status codes for ls? Other than zero, we mean. Zero always means "success". What "success" means is defined in the manual. The grep command, for instance, defines zero to mean "a line was selected", i.e. that a match was found. A value of one means that no match was found (not exactly an error). A value of two means some more serious problem like an error.

rikard@newdelli:~/bash-intro/text-files$ ls jehova.txt
ls: cannot access 'jehova.txt': No such file or directory
rikard@newdelli:~/bash-intro/text-files$ echo $?
2
rikard@newdelli:~/bash-intro/text-files$ ls /root
ls: cannot open directory '/root': Permission denied
rikard@newdelli:~/bash-intro/text-files$ echo $?
2
rikard@newdelli:~/bash-intro/text-files$ grep bengt lorem.txt 
rikard@newdelli:~/bash-intro/text-files$ echo $?
1
rikard@newdelli:~/bash-intro/text-files$ grep bengt apa.txt
grep: apa.txt: No such file or directory
rikard@newdelli:~/bash-intro/text-files$ echo $?
2
rikard@newdelli:~/bash-intro/text-files$ grep Lorem lorem80.txt 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras in nisi urna. 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque rhoncus enim 
rikard@newdelli:~/bash-intro/text-files$ echo $?
0
rikard@newdelli:~/bash-intro/text-files$ ls lorem80.txt 
lorem80.txt
rikard@newdelli:~/bash-intro/text-files$ echo $?
0
rikard@newdelli:~/bash-intro/text-files$

One use of exit codes is to use them implicitly as logical conditions for deciding what to do depending on the success of a command. Bash has an IF-statement for conditional execution. The if command takes a command as its first argument. It's easier to show the if-statement with code than to explain it here.

Let's say that we want to echo some text if grep is able to find some string in a file, and a different string if it isn't:

rikard@newdelli:~/bash-intro/text-files$ if grep bengt lorem80.txt
> then
> echo "bengt found in the file!"
> else
> echo "no bengt in the file!"
> fi
no bengt in the file!
rikard@newdelli:~/bash-intro/text-files$

Note the secondary prompts that appear since the if-statement isn't complete until we write fi.

Another, shorter way of conditional execution, is to use logical && (meaning AND) and || (meaning OR) between two commands. The command after && is only executed if the first command has an exit status of zero. The command after || is only executed if the first command has an exit status of something other than zero:

rikard@newdelli:~/bash-intro/text-files$ grep -q bengt lorem80.txt || echo "Nope"
Nope

rikard@newdelli:~/bash-intro/text-files$ grep -q Lorem lorem80.txt && echo "Yep"
Yep

rikard@newdelli:~/bash-intro/text-files$ grep -q Lorem lorem80.txt && echo "Yep" || echo "Nope"
Yep

rikard@newdelli:~/bash-intro/text-files$ grep -q bengt lorem80.txt && echo "Yep" || echo "Nope"
Nope
rikard@newdelli:~/bash-intro/text-files$

When you learn how to program you will find this extremely useful. Because you will often try to compile your source code and then immediately afterwards run the resulting program. But if the compilation fails, you will certainly not want to run the program (since you couldn't compile). So using && between compilation and running means that the program will only run if the compilation was successful.

We used the -q (stands for "quiet") flag to grep in order not to see any matched lines, since we only wanted to know if the pattern was found or not.
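The compile-and-run idiom from above looks like this (a sketch; the file and program names are made up, and we assume a C compiler such as gcc is installed). The same principle can be tried out with true and false, two commands whose only job is to succeed and to fail:

```shell
# Only run the program if the compilation succeeded:
#   gcc main.c -o main && ./main

# The principle with plain commands:
false && echo "never printed"
true && echo "printed"
```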

Sometimes we don't want to see the error message of a failed command. Let's say that we want to count the number of lines of a file, but only if the file exists. Otherwise, we want to do nothing. We can then use find lorem80.txt && wc -l lorem80.txt. But if the file doesn't exist, we want to ignore it, so we don't want find to display its error message (if any). We can then redirect standard err to a special file called /dev/null, which is like a black hole that silently consumes any data sent to it. We probably don't want to see the standard out of the command either, so we can send both streams to /dev/null. The find command does what its name says: it finds files for you.

We'll show you examples both with and without the redirection of the streams for found files and missing files:

rikard@newdelli:~/bash-intro/text-files$ find lorem80.txt && wc -l lorem80.txt 
lorem80.txt
38 lorem80.txt

rikard@newdelli:~/bash-intro/text-files$ find lorem90.txt && wc -l lorem90.txt 
find: ‘lorem90.txt’: No such file or directory

rikard@newdelli:~/bash-intro/text-files$ find lorem80.txt > /dev/null && wc -l lorem80.txt 
38 lorem80.txt

rikard@newdelli:~/bash-intro/text-files$ find lorem90.txt &> /dev/null && wc -l lorem90.txt 

rikard@newdelli:~/bash-intro/text-files$

The file lorem90.txt didn't exist, and above we showed how to silence both the error message and the output for found files. The new thing here was &>. It is the syntax for redirecting both standard out and standard err.

Globbing and expansion

The following is an inclusion of the page Bash:Bash-Globbing found elsewhere on this wiki! We recommend that you also read the Swedish compendium (if you know Swedish).

Introduction to globbing

This short chapter will introduce the concept of Globbing (wikipedia).

Globbing is about expressing simple patterns for matching file and directory names. We'll show you by example, so please fire up a terminal and follow the instructions below.

The operators [], [!], ?, * and {}

We will create nine files in your test directory. The files will be empty but created with one single command:

$ touch file_{1..9}.txt   # {} is specific to bash and actually not part of globbing!
$ ls
file_1.txt  file_3.txt  file_5.txt  file_7.txt  file_9.txt
file_2.txt  file_4.txt  file_6.txt  file_8.txt

Now, let's learn about the globbing symbol *. A single * means "any number of any characters". The expression [0-9] means "one single character which is a digit between 0 and 9".

Let's use a combination of * and [0-9]:

$ ls *_[0-9].txt
#all files that have *_ followed by one digit followed by .txt:
file_1.txt  file_3.txt  file_5.txt  file_7.txt  file_9.txt
file_2.txt  file_4.txt  file_6.txt  file_8.txt

Next, create a new empty file:

$ touch file_99.txt

Let's list all the files which have only one number character after the underscore:

$ ls *_[0-9].txt  #the file_99.txt file will not be listed!
file_1.txt  file_3.txt  file_5.txt  file_7.txt  file_9.txt
file_2.txt  file_4.txt  file_6.txt  file_8.txt

As you see, the new file file_99.txt didn't match the globbing expression *_[0-9].txt, since it had more than one number between the underscore and the .txt part.

Now, let's use a globbing expression which will match file_99.txt but not the other files:

$ ls *_[0-9][0-9].txt
file_99.txt
# The file was listed, because it, and only it,
# had _ followed by two digits followed by .txt
# When we are done playing, remove the newly created text files.

There is one more globbing operator we'd like to show you. It is the ? operator, which matches one single character.

$ touch a.txt ab.txt abc.txt
$ ls ?.txt
a.txt

?.txt matches one single character followed by ".txt", so it matches a.txt but not ab.txt or abc.txt.

Finally, we'd like to show you inverted lists. Let's say we have the files file_1.txt file_3.txt file_5.txt file_7.txt file_9.txt file_2.txt file_4.txt file_6.txt file_8.txt, and we want to match all files in that list, except file_4.txt and file_5.txt.

What we want to express then, is all files beginning with "file_" and a character (but not "4" or "5") followed by ".txt".

This is what it looks like in globbing (works on Unix-like systems): file_[!4-5].txt:

$ ls file_[!4-5].txt
file_1.txt  file_3.txt  file_7.txt  file_9.txt
file_2.txt  file_6.txt  file_8.txt


--End inclusion of Bash:Bash-Globbing--

The following is an inclusion of Bash:Bash-Shell-Expansion found elsewhere on this wiki! We recommend that you also read the Swedish compendium (if you know Swedish). Work in progress.

Introduction to shell expansion

This short chapter introduces bash shell expansion. This is a very powerful way of expressing strings in bash. Using certain operators and symbols, we can make bash expand an expression before using it, for instance as an argument to a command. All expansion occurs before the resulting strings are used.

Brace expansion

Brace expansion uses curly braces, { and } to produce combinations of strings. We'll explain by example in this small introduction.

Let's say we want to produce the following strings:

  • SVT1
  • SVT2
  • SVT24

We can do this by observing that they follow a pattern. They start with "SVT" and then one of "1", "2", and "24".

This is how you can produce those combinations, and echo the result to the terminal:

$ echo SVT{1,2,24}
SVT1 SVT2 SVT24
$

Brace expansion works by taking the surrounding string literal(s) and making combinations with a comma-separated list within braces.

We can even use intervals:

$ echo SVT{1..24}
SVT1 SVT2 SVT3 SVT4 SVT5 SVT6 SVT7 SVT8 SVT9 SVT10 SVT11 SVT12 SVT13 SVT14 SVT15 SVT16 SVT17 SVT18 SVT19 SVT20 SVT21 SVT22 SVT23 SVT24
$

We can nest brace expressions inside one another. Let's say we want to produce: SVT1 SVT2 SR1 SR2 SR3. The pattern here is a little more complicated. We want all combinations of SVT followed by 1, 2, and SR followed by 1, 2, 3:

$ echo {SVT{1,2},SR{1..3}}
SVT1 SVT2 SR1 SR2 SR3
$

Or, simply:

$ echo SVT{1,2} SR{1..3}
SVT1 SVT2 SR1 SR2 SR3

Nesting brace expressions is really powerful when creating directory trees. Let's pretend we want to create the following tree:

music/
├── classical
│   ├── classicism
│   ├── modernism
│   ├── modernist
│   └── renaissance
├── jazz
│   ├── bebop
│   ├── free_jazz
│   └── fusion
└── rock
    ├── hard_rock
    ├── metal
    └── rockabilly

This can be achieved with one single command line:

$ mkdir -p music/{classical/{modernist,renaissance,classicism,modernism},rock/{hard_rock,metal,rockabilly},jazz/{bebop,free_jazz,fusion}}

Why does it work?

Let's have echo printing the expanded string out:

$ echo music/{classical/{modernist,renaissance,classicism,modernism},rock/{hard_rock,metal,rockabilly},jazz/{bebop,free_jazz,fusion}}
music/classical/modernist music/classical/renaissance music/classical/classicism music/classical/modernism music/rock/hard_rock music/rock/metal music/rock/rockabilly music/jazz/bebop music/jazz/free_jazz music/jazz/fusion

Intervals in brace expressions can have a step part. Let's say we want to print out every fourth number between 1 and 20 inclusive:

$ echo {1..20..4}
1 5 9 13 17

Starting at 0, we get:

$ echo {0..20..4}
0 4 8 12 16 20

And it works for letters too!

$ echo {a..m}
a b c d e f g h i j k l m

And, even with steps!

$ echo {a..m..3}
a d g j m

Also, intervals can go down:

$ echo {20..0}
20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

$ echo {20..0..4}
20 16 12 8 4 0

Expansion of tilde ~

The character ~ (called "tilde") represents the home directory. When standing alone, it means the home directory of the user issuing the command line where tilde occurs. If it is appended with a username, e.g. ~rikard, then it means the home directory of that user (in the example, user rikard's home directory).

How to type the tilde character differs between keyboards. It is often a "dead key" on Swedish keyboard layouts, meaning that you have to compose it with another key for it to appear (composing it with "n", for instance, gives the Spanish letter ñ). To get a stand-alone tilde (which is what we want), you compose it with a space: on Rikard's keyboard (your mileage may vary) you press AltGr-~ followed by SPACE. On a Macbook with a Swedish keyboard, you press Alt-^ followed by SPACE.

When used unquoted, Bash will expand the tilde expression to the absolute path of the home directory:

$ echo ~
/home/rikard
$ echo ~root
/root
$

Expansion of variables

Bash is a complete programming language and can be used to write scripts (small programs written in Bash). An important concept in programming languages is the variable. You can think of a variable as a named place in memory where you can store and later retrieve data. Variables have a name and can have a value, and the value can be changed later.

Environment variables

A special class of variables in the shell are called environment variables. When a program is executed, the operating system creates a process. The process is the universe of the program: it contains information that the program might need. Some of that information is stored in predefined variables called environment variables. Environment variables are inherited by child processes, so every program you start from the shell gets a copy of the shell's environment. The environment variables are initialized when the shell starts.
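A variable only becomes part of the environment, and hence visible to child processes, once it is exported. A minimal sketch (the variable name MYVAR is made up):

```shell
MYVAR="hello"                          # a plain shell variable
bash -c 'echo "child sees: $MYVAR"'    # the child does not see it
export MYVAR                           # now it is an environment variable
bash -c 'echo "child sees: $MYVAR"'    # prints: child sees: hello
```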

Here are some common environment variables:

  • HOME - your home directory
  • PATH - where to look for commands - a list of directories separated by colons
  • LOGNAME - the name you used to login to the shell or computer
  • USER - your username (usually same as LOGNAME)
  • SHELL - what shell you are running
  • TERM - what type of terminal you are running
  • EDITOR - your default editor (if any)

You can investigate what values they have on your computer by using echo and giving the variable as an argument, prepending the variable name with a dollar sign:

$ echo $HOME
/home/rikard
$ echo $SHELL
/bin/bash
$ echo $LOGNAME
rikard
$ echo $USER
rikard
$ echo $TERM
xterm-256color

Arguments to scripts

When you write a script, it is possible to call the script and provide arguments to it. The arguments will end up (in order of appearance) in the variables:

  • $1 first argument
  • $2 second argument
  • etc

Then, there's the special variable $0, which is the name of the script itself.

The variables with arguments are only available to the shell running the script (they live only as long as the script is running, and they are only available to the code in the script).
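A minimal sketch of a script using these variables (the script name greet.sh is made up):

```shell
# Create a small script that prints its own name and its first two arguments:
cat > greet.sh <<'EOF'
#!/bin/bash
echo "Script: $0"
echo "First:  $1"
echo "Second: $2"
EOF

bash greet.sh Hello World
# Script: greet.sh
# First:  Hello
# Second: World
```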

Creating your own variables

You can also create your own variables, both in the shell running in the terminal and in scripts you write (anything a script can do can be done directly in the shell in the terminal - there's no difference). If you want to work with a file with a long name, you can create a variable and assign it the filename as its value. After that, you can use the variable (with the dollar sign) instead of typing the long filename:

rikard@newdelli:~/bash-intro/text-files$ fname="latin_words_sorted_lower_case.txt"
rikard@newdelli:~/bash-intro/text-files$ wc -l $fname
359 latin_words_sorted_lower_case.txt
rikard@newdelli:~/bash-intro/text-files$ tail -2 $fname
volutpat
vulputate
rikard@newdelli:~/bash-intro/text-files$

You use the equal sign as the assignment operator, and there can't be any spaces between the variable name, the assignment operator and the value.
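This is a common beginner mistake, so here is a minimal sketch of what goes wrong (the variable name greeting is made up):

```shell
greeting="hello"     # correct: no spaces around =
echo $greeting       # prints hello

# greeting = "hello" would be wrong: bash would try to run a command
# called "greeting" with the two arguments = and "hello", and fail.
```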

When using text values (strings of text), it is good practice to put quotes around the string; that allows for spaces. But if you are using a string variable that can contain spaces, you should put quotes around it where you use it too. Otherwise Bash will expand the variable and treat the value as more than one string:

rikard@newdelli:~/bash-intro/text-files$ name="Rikard Fröberg"
rikard@newdelli:~/bash-intro/text-files$ mkdir $name
rikard@newdelli:~/bash-intro/text-files$ ls
a_few_urls.txt       group_genre.txt                    lorem.txt
apa                  latin_uniq_frequencies.txt         replaceme.txt
four.txt             latin_words_sorted_lower_case.txt  Rikard
frequency_table.txt  latin_words_sorted.txt             small_text.txt
Fröberg              latin_words.txt                    swe.txt
group_album.txt      lorem80.txt
rikard@newdelli:~/bash-intro/text-files$ ls -ltr
total 68
-rw-rw-r-- 1 rikard rikard 2438 jul 25 10:22 lorem.txt
-rw-rw-r-- 1 rikard rikard    3 jul 25 10:54 swe.txt
-rw-rw-r-- 1 rikard rikard   21 jul 25 10:59 four.txt
-rw-rw-r-- 1 rikard rikard 2467 jul 25 11:06 lorem80.txt
-rw-rw-r-- 1 rikard rikard   69 jul 25 13:30 small_text.txt
-rw-rw-r-- 1 rikard rikard  212 jul 25 13:35 a_few_urls.txt
-rw-rw-r-- 1 rikard rikard   36 jul 25 13:47 replaceme.txt
-rw-rw-r-- 1 rikard rikard 2354 jul 25 14:19 latin_words.txt
-rw-rw-r-- 1 rikard rikard 2354 jul 25 14:19 latin_words_sorted.txt
-rw-rw-r-- 1 rikard rikard 2354 jul 25 14:22 latin_words_sorted_lower_case.txt
-rw-rw-r-- 1 rikard rikard 2073 jul 25 14:27 latin_uniq_frequencies.txt
-rw-rw-r-- 1 rikard rikard 2073 jul 25 14:42 frequency_table.txt
-rw-rw-r-- 1 rikard rikard  131 jul 25 14:53 group_album.txt
-rw-rw-r-- 1 rikard rikard   55 jul 25 14:53 group_genre.txt
drwxrwxr-x 2 rikard rikard 4096 jul 26 11:37 apa
drwxrwxr-x 2 rikard rikard 4096 jul 29 10:02 Rikard
drwxrwxr-x 2 rikard rikard 4096 jul 29 10:02 Fröberg

In the example above, the user forgot to put quotes around the variable when using it as an argument to mkdir, so after Bash expanded the variable to its value, it became two strings, Rikard and Fröberg, and mkdir created two directories.

Some special variables

There are quite a few special variables in Bash:

  • PWD - current directory
  • OLDPWD - previous directory
  • _ (underscore) - the last argument to the previous command
  • ? (questionmark) - the exit status of the previous command
  • ! (bang) - the process id of the last command running in the background
$ ls /root
ls: cannot open directory '/root': Permission denied
$ echo "Last arg: $_ Exit: $?"
Last arg: /root Exit: 2
$
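PWD and OLDPWD can be tried out directly (a sketch; cd - is a handy companion that takes you back to the previous directory):

```shell
cd /tmp
echo $PWD       # /tmp
cd /
echo $OLDPWD    # /tmp - the directory we came from
cd -            # back to /tmp (cd - also prints where it went)
```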

A longer list can be found in the Bash manual.

--End inclusion of Bash:Bash-Shell-Expansion--

Links

Summary lecture slides

Video lecture and slides

01 - Introduction - key concepts

02 - File system

03 - Issuing commands

04 - Moving around

05 - Directories

06 - Text Files

07 - PATH and permissions

08 - Using an editor

09 - Downloading files

10 - Redirecting streams

11 - More on text files

12 - Text processing commands

13 - Pipelines

14 - Advanced topics

All slides

Further reading

Further reading on this wiki

  • Bash-introduction All chapters in this material is recommended reading (including videos)

Where to go next

The next page is Working_in_the_shell_-_Introduction_to_Bash_-_Exercises.
