Database:Exercise - Scraping data from the web

From Juneday education
Jump to: navigation, search

Purpose of this exercise

  • To learn how to use common bash and gnu/linux tools to facilitate working with databases and more.
  • To combine tools to form a chain of filters.
  • To use tools to create SQL files for use with databases.
  • To show the student an easy way to scrape data from a web page, manipulate it to extract relevant data, turn the data into SQL INSERT statements, so that the student can create a database with "live" data from the web

This exercise (for teachers: it is possible to use this as a workshop where the students copy every step you do), shows you where data in a database could come from. Let's say you want to create a database from a TV guide online. You have no urge to enter the whole TV guide's data manually into the database. Why not fetch the data from the online TV guide?

In order to do so, you fetch the web page, parse the web page and extract relevant data from it, and turn everything into SQL INSERT statements, which you can pipe to your database.

This exercise is rather focused on scraping and manipulating data from the web using various Bash techniques. It is rather complicated so we walk you through every step in the process. This exercise is thus mostly for those who wants to learn some basic Bash tricks to get data from the web and mangle it into SQL INSERT statements for storage in a database.

We actually hesitated over whether this exercise would better fit a Bash book or a networking book, but decided it belonged in a database book after all. You do need to know SQL in order to make sense of why we are creating INSERT statements of the scraped data. But you may regard this exercise as an advanced topic not at all essential to the knowledge required for using databases. We hope you find it fun and interesting though!

TV guide scraping

Getting information off of the net

Let’s say that we’re interested in today’s TV shows for SVT1 and we know that tv.nu has that information. We may then fetch that information (as an HTML page) from their server and parse the result to our liking.

Open cygwin or your bash terminal (if you are not using Windows) and create a directory Ex09:

$ mkdir Ex09

Next cd to that directory (type cd followed by ‘Escape-.’):

$ cd Ex09

In this directory, download the web page for today’s SVT1 shows:

$ wget 'http://www.tv.nu/kanal/svt1'

This saves the whole page in the file “svt1”. In case you encounter problems when downloading the file you can use a cached one: svt1.zip or svt1

Let’s analyse and parse the information initially

After eying through the HTML code, we find that a few strings are interesting to look closer at. Those are:

data-title
data-start
data-end
data-text="

We want to print out all lines containing either of those strings, since on those lines are the information about TV shows! For instance we have:

	data-title="Rapport"
	data-start-time="2016-01-08T06:00:00+01:00"
	data-end-time="2016-01-08T06:05:00+01:00"
					<p class="normal mobile-content-hide js-broadcast__ellipsis" data-text="Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.">

In order to get a grip of all those lines, we use the command egrep (because it allows us to express strings to look for in terms of str1|str2|str3|... meaning “str1 OR str2...”):

The command becomes:

$ egrep 'data-title|data-start|data-end|data-text="' svt1

Now, we want to break up all matching line in turn, and parse out only the interesting part of the text containing the data we are interesting in (for instance “Rapport”, “2016-01-08T06:00:00+01:00” etc). We see that all interesting text starts with an equals sign and a quote. So what if we split all lines on the equal sign, and print only what comes after?

That we can do using the command cut with the arguments -d ‘=’ -f2- which means, cut the line on all the = and give me all parts from the second one:

$ egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-

The first data we get from that is:

"Rapport"
"2016-01-08T06:00:00+01:00"
"2016-01-08T06:05:00+01:00"
"normal mobile-content-hide js-broadcast__ellipsis" data-text="Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.">

Almost there! The last line still has the interesting stuff after an ‘=’. Let’s fixt that using the same trick again, we’ll add another cut on ‘=’ and keep only the second part:

$ egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-|cut -d '=' -f2

This time we get this (as the first part of the output):

"Rapport"
"2016-01-08T06:00:00+01:00"
"2016-01-08T06:05:00+01:00"
"Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.">

Getting closer :) We now notice, that we only want what’s between the double quotes. So what if we cut up each line, and only keep the second field separated by quotes? Let’s give it a shot:

$ egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-|cut -d '=' -f2|cut -d '"' -f2

This is what we get:

Rapport
2016-01-08T06:00:00+01:00
2016-01-08T06:05:00+01:00
Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.
Gomorron Sverige sammandrag
2016-01-08T06:05:00+01:00
2016-01-08T06:25:00+01:00
Sammandrag av morgonens Gomorron Sveriges sändning.
Gomorron Sverige
2016-01-08T06:25:00+01:00
2016-01-08T10:00:00+01:00
Dagsaktuella gäster i intervjuer, debatter och diskussioner i soffan varje kvart. Tips om mat och resor och recensioner av film och böcker m.m. Rapport sänder nyheter varje halvtimme med start 06.30. Vädret varje halvtimme med start 06.37. Sportnytt sänds kl 07.05, 08.05 och 09.05. Regionala nyheter sänds varje halvtimme med start 07.10. Läs text-tv sid 679.
Strömsö
2016-01-08T10:00:00+01:00
2016-01-08T10:40:00+01:00
Söndagssteken är en kär favorit för många, och här får den en ny utformning när Paul serverar den lager-på-lager. Lee låter stramaljbroderier få nytt liv som klädesplagg och Jim ägnar sig åt brutal-snickeri.
Stoerre Vaerie - Norra Storfjället
2016-01-08T10:40:00+01:00
2016-01-08T10:55:00+01:00
Elle har tagit avstånd från sina samiska rötter. För första gången på många år återvänder hon till sin hembygd tillsammans med sin son och sitt barnbarn för att närvara på sin systers begravning.
Inför Idrottsgalan 2016
2016-01-08T10:55:00+01:00
2016-01-08T11:00:00+01:00
De nominerade till Jerringpriset presenteras och tittarna får ringa och rösta på sina favoriter.
Vinterstudion
2016-01-08T11:00:00+01:00
2016-01-08T11:30:00+01:00
Program som analyserar och summerar det mesta från vintersporterna.
FIS Längdskidåkning
2016-01-08T11:30:00+01:00
2016-01-08T12:30:00+01:00
Viessmann världscupen: Tour de Ski, Toblach, herrar, 10 km, fristil - Viessmann världscupen: Tour de Ski, Toblach, herrar, 10 km, fristil - Från Toblach, Italien.
Vinterstudion
2016-01-08T12:30:00+01:00
2016-01-08T13:30:00+01:00
Program som analyserar och summerar det mesta från vintersporterna.
FIS Längdskidåkning
2016-01-08T13:30:00+01:00
2016-01-08T14:15:00+01:00
Viessmann världscupen: Tour de Ski, Toblach, damer, 5 km, fristil - Viessmann världscupen: Tour de Ski, Toblach, damer, 5 km, fristil - Från Toblach, Italien.
Vinterstudion
2016-01-08T14:15:00+01:00
2016-01-08T15:00:00+01:00
Program som analyserar och summerar det mesta från vintersporterna.
Inför Idrottsgalan 2016
2016-01-08T15:00:00+01:00
2016-01-08T15:10:00+01:00
De nominerade till Jerringpriset presenteras och tittarna får ringa och rösta på sina favoriter.
Anslagstavlan
2016-01-08T15:10:00+01:00
2016-01-08T15:15:00+01:00
Samhällsinformation, på uppdrag av staten.
Året var 1966
2016-01-08T15:15:00+01:00
2016-01-08T16:15:00+01:00
Bob Dylan besöker Sverige och statsminister Erlander har det knepigt med frågor om bostäder. En gränslös kväll på operan arrangeras med Harry Belafonte och Martin Luther King. Esrange invigs och i Kina rullar kulturrevolutionen igång.
Gomorron Sverige sammandrag
2016-01-08T16:15:00+01:00
2016-01-08T16:35:00+01:00
Sammandrag av morgonens Gomorron Sveriges sändning.
En bit av 50-talet
2016-01-08T16:35:00+01:00
2016-01-08T17:20:00+01:00
Dagens unga har upptäckt 1950-talet och går all in. Vi följer här Camilla som sjunger och spelar 50-talslåtar i bandet Pat C Miller and the Tailshakers. En film om människorna som bygger hela sin livsstil runt femtiotalet.
Jakten på norrsken
2016-01-08T17:20:00+01:00
2016-01-08T18:00:00+01:00
Ole Christian Salomonsen är professionell fotograf och har specialiserat sig på att fotografera norrsken. Ole Christian är hela tiden på jakt efter den ultimata norrskensbilden, men utmaningarna att fånga något så storslaget på bild är många.
Rapport
2016-01-08T18:00:00+01:00
2016-01-08T18:13:00+01:00
Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.
Kulturnyheterna
2016-01-08T18:13:00+01:00
2016-01-08T18:25:00+01:00
Kultur- och nyhetsprogram.
Sportnytt
2016-01-08T18:25:00+01:00
2016-01-08T18:30:00+01:00
Senaste sportnyheterna.
Regionala nyheter
2016-01-08T18:30:00+01:00
2016-01-08T18:45:00+01:00
Regionala nyhetsprogram.
Black Jack från film till verklighet
2016-01-08T18:45:00+01:00
2016-01-08T19:30:00+01:00
Programmet innehåller allt om turnélivet, kärleken till musiken och de trogna fansen till Tony och hans dansband. Vi får även se hur dansbandssverige har förändrats där onsdagsdansen i folkparken har bytts ut mot dansbandskryssningar på Östersjön.
Rapport
2016-01-08T19:30:00+01:00
2016-01-08T19:55:00+01:00
Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.
Regionala nyheter
2016-01-08T19:55:00+01:00
2016-01-08T20:00:00+01:00
Regionala nyhetsprogram.
På spåret
2016-01-08T20:00:00+01:00
2016-01-08T21:00:00+01:00
Nykomlingarna Viveca Sten och David Lagercrantz möter veteranerna Filip Hammar och Fredrik Wikingsson. Augustifamiljen står för musiken, i kväll tillsammans med Ramy Essam.
Skavlan
2016-01-08T21:00:00+01:00
2016-01-08T22:00:00+01:00
Musikern och författaren Björn Ulveaus kommer till Skavlan. På plats är även Norges nyutnämnda invandrings- och integrationsminister Sylvi Listhaug (FrP) och Marit Strømøy som är den första kvinnan i världen att vinna ett Formel 1-race.
Adele: Live in London
2016-01-08T22:00:00+01:00
2016-01-08T23:05:00+01:00
När Adele släpper nytt, gör hon det med storslagna toner och svårslagna rekord. Här möter hon Graham Norton för en exklusiv helkväll där hon sjunger, berättar, skrattar och bjuder på sig själv. Hon framför bland annat sin nya hit Hello.
Rapport
2016-01-08T23:05:00+01:00
2016-01-08T23:10:00+01:00
Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.
Katy Perry: The Prismatic World Tour
2016-01-08T23:10:00+01:00
2016-01-09T00:10:00+01:00
Katy Perry bjuder upp till färgsprakande show, på ett fullsatt Allphones Arena i Sidney. The Prismatic World Tour är fullmatad med hits, genomarbetad koreografi och en Katy i högform.
Deportees i Musikguiden i P3: Session
2016-01-09T00:10:00+01:00
2016-01-09T00:40:00+01:00
Konsert med det svenska bandet Deportees. Konserten är inspelad i samband med P3 session i Umeå november 2015.
Ikväll är vi kungar
2016-01-09T00:40:00+01:00
2016-01-09T01:30:00+01:00
Turnéavslutningen av Ikväll är vi kungar, från Cirkus i Stockholm. Vi ser och hör Petter, Ison & Fille, Lilla Namo med flera, som framför egna och varandras låtar inför en intensiv publik.
Black Jack från film till verklighet
2016-01-09T01:30:00+01:00
2016-01-09T02:15:00+01:00
Programmet innehåller allt om turnélivet, kärleken till musiken och de trogna fansen till Tony och hans dansband. Vi får även se hur dansbandssverige har förändrats där onsdagsdansen i folkparken har bytts ut mot dansbandskryssningar på Östersjön.
Verklighetens Downton Abbey
2016-01-09T02:15:00+01:00
2016-01-09T03:05:00+01:00
Vissa fann sann kärlek över Atlanten, men Consuelo Vanderbilts liv kantades av skandaler, ensamhet och lidande.
Inför Idrottsgalan 2016
2016-01-09T03:05:00+01:00
2016-01-09T03:15:00+01:00
De nominerade till Jerringpriset presenteras och tittarna får ringa och rösta på sina favoriter.
Hundra procent bonde
2016-01-09T03:15:00+01:00
2016-01-09T03:45:00+01:00
Frank ser tillbaka på året som gått och presenterar några av de viktigaste händelserna från livet på Kastanjegården. Vi ser bland annat bilder från husbygget och hur man grävde om åns lopp över åkermarkerna.
Leif GW Persson - min klassresa
2016-01-09T04:00:00+01:00
2016-01-09T05:00:00+01:00

Leif GW Persson - min klassresa
2016-01-09T05:00:00+01:00
2016-01-09T06:00:00+01:00


Note that when the long description is missing, it is represented by an empty line.

This was a step-by-step exercise on how to build up a parser using filters piped together, to increasingly get closer to what we wanted to filter out.

Let’s save each data field in a variable

We now know how to parse out all information to the screen. But if we want to do something intelligent with it (such as store it in a database), we’d like to know what each line represents. Looking at the output, we’ll see that there are four fields repeating: Title, Start, End, Description. So, what if we parse the output line by line, and store every sequence of four lines as exactly that, first title, then start, then end and finally description? How could we do that?

One way is to have a counter variable that we increase in each iteration, and check it in modulo 4. If counter%4 is one, we have the title. If counter%4 is 2, we have the starttime, if counter%4 is 3, the endtime, and finally if counter%4 is 0 we have the description (if any).

That would look like this on a one-liner:

count=0;egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-|cut -d '=' -f2|cut -d '"' -f2|while read line;do if [ $((++count%4)) -eq 1 ] ;then echo "title: $line";elif [ $((count%4)) -eq 2 ];then echo "start: $line";elif [ $((count%4)) -eq 3 ];then echo "end: $line";elif [ $((count%4)) -eq 0 ];then echo "desc: $line";fi;done

Yikes. It’s starting to look ugly now, isn’t it? Let’s put all that in a script file instead, so that we can have some indentation.

Making the one-liner into a script

#!/bin/bash

count=0
egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-|cut -d '=' -f2|cut -d '"' -f2|while read line
do
 if [ $((++count%4)) -eq 1 ]
 then
     echo "title: $line"
 elif [ $((count%4)) -eq 2 ]
 then
     echo "start: $line"
 elif [ $((count%4)) -eq 3 ]
 then
     echo "end: $line"
 elif [ $((count%4)) -eq 0 ]
 then
     echo "desc: $line"
 fi
done

You don’t need to know (or even understand all of) the above. We will not examine this on the test, this exercise is more to show you what could be done using bash, and one way to do it. So don’t worry if you feel uncertain about all of this!

I decided to save the script in a file called tv-parser.sh. To run the script, you’d first set the executable flag on it and then call it like this:

$ chmod u+x tv-parser.sh
$ ./tv-parser.sh

Now that we know that we can access every line and know what it represents, we could actually build up INSERT statements from it, and store it in a table later on.

For this, I’ll modify the script to save the line in a corresponding variable. Since I know when the line represents title, for instance, I can save the line in a variable called title and so on for the other representations of lines:

#!/bin/bash

count=0
title=""
start=""
end=""
desc=""
egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-|cut -d '=' -f2|cut -d '"' -f2|while read line
do
 if [ $((++count%4)) -eq 1 ]
 then
     title="$line"
 elif [ $((count%4)) -eq 2 ]
 then
     start="$line"
 elif [ $((count%4)) -eq 3 ]
 then
     end="$line"
 elif [ $((count%4)) -eq 0 ]
 then
     desc="$line"
     echo "Title: $title start: $start end: $end desc: $desc"
     echo "---------------------"
 fi
done

Sample output from the script is:

Title: Rapport start: 2016-01-08T06:00:00+01:00 end: 2016-01-08T06:05:00+01:00 desc: Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.
---------------------
Title: Gomorron Sverige sammandrag start: 2016-01-08T06:05:00+01:00 end: 2016-01-08T06:25:00+01:00 desc: Sammandrag av morgonens Gomorron Sveriges sändning.
---------------------

Why did we do the printout in the last if branch?

Changing the output to SQL statements

From this knowledge, that we have parsed the data into variables, it is trivial to use the variables to create INSERT statements for a table storing the data. The INSERT we want to produce could look like this:

INSERT INTO tv_schedule(channel, title, start, end, description)
     VALUES('SVT1',
            'Rapport',
            '2016-01-08T06:00:00+01:00',
            '2016-01-08T06:05:00+01:00',
            'Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.');

The corresponding echo statement in the script would be:

echo "INSERT INTO tv_schedule(channel, title, start, end, description) VALUES('SVT1','"$title"','"$start"','"$end"','"$desc"');"

Sample output would now be:

INSERT INTO tv_schedule(channel, title, start, end, description) VALUES('SVT1','Rapport','2016-01-08T06:00:00+01:00','2016-01-08T06:05:00+01:00','Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.');
INSERT INTO tv_schedule(channel, title, start, end, description) VALUES('SVT1','Gomorron Sverige sammandrag','2016-01-08T06:05:00+01:00','2016-01-08T06:25:00+01:00','Sammandrag av morgonens Gomorron Sveriges sändning.');

(lines wrap on this page, but it would be one insert statement per line)

Changing the date

If you don't like the date format from tv.nu, you can format the date yourself:

$ date -d 2016-01-08T06:05:00+01:00 +"%Y-%m-%d %H:%M:%S"
2016-01-08 06:05:00

On MacOS, you can use gdate instead of date. This is how you install coreutils:

$ brew install coreutils

So, if you are a Mac-head, use gdate instead of date, now that you have installed it.

In a script that would need to be worked on a variable using expansion:

#!/bin/bash
count=0
egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-|cut -d '=' -f2|cut -d '"' -f2|while read line
do
  if [ $((++count%4)) -eq 1 ]
  then
    echo "title: $line"
  elif [ $((count%4)) -eq 2 ]
  then
    echo "start: $(date -d $line +"%Y-%m-%d %H:%M:%S")"
  elif [ $((count%4)) -eq 3 ]
  then
    echo "end: $(date -d $line +"%Y-%m-%d %H:%M:%S")"
  elif [ $((count%4)) -eq 0 ]
  then
    echo "desc: $line"
  fi
done

The thing called expansion takes place here:

echo "end: $(date -d $line +"%Y-%m-%d %H:%M:%S")"

In bash, when you want to use the value of a command, you can use $(command). In the example above, we used expansion on the command date. We gave date two arguments. First the -d $line flag, then the +"%Y-%m-%d %H:%M:%S" formatting directive (YYYY-MM-DD HH:MM:SS is what the format string means). That makes date parse the string from the output into a date with a different format.

Note that in your script, the assignment of e.g. start will now become start=$(date -d $line +"%Y-%m-%d %H:%M:%S").

Putting the data into a database

Create a database called tv.db with one single table tv_schedule in the same directory as your script.

$ sqlite3 tv.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> CREATE TABLE tv_schedule(channel TEXT, title TEXT, start DATETIME, end DATETIME, description TEXT);

Exit the database's interactive shell and now run your script, using a pipe to send the output from the script to the database:

$ ./tv-parser.sh | sqlite3 tv.db

If you get no errors, you can see the data using SELECT as normal.

If you did get errors, fix the output from your script so that it creates correct SQL.

  • If the database schema is wrong - DROP TABLE and start over
  • If you got no errors but the wrong data ended up in the wrong column, DELETE FROM and start over
    • You didn't check your scripts output carefully enough, before sending it to the database! ;-)

One way of retrieving all rows of tv shows, is to send a SELECT to the database from the command line:

$ sqlite3 tv.db 'SELECT * FROM tv_schedule;'

If you get all data back from the database and it looks right, you are done!

Note: The table is simplified to make this exercise simpler. It lacks a primary key and constraints. For the ambitious student, we urge you to create constraints for the text fields so that they don't accept NULL, and constraints for the dates so that they accept only correctly formatted date strings. You will need to force the date values from the script to use the correct date format. See above for how to do that using date (or on Mac OS: gdate).

Challenge version

For the students who never get enough, we challenge you do fetch more than one channel.

Suggestion: Make a script which loops through a list of channels corresponding to the URL of tv.nu's different channel guides (start with svt1 svt2 to see if it works). The script should call your original script, passing the channel as argument.

Your script will get the argument in the special variable $1. Use this variable to create the correct INSERT statements (the first column is the channel in the INSERT statement, so you cannot hard code it to SVT1 any more).

Change your original script so that it does the downloading of the file to parse. Use $1 (the channel name) as part of the URL (at the end).

Hints:

Creating a loop over a list of strings:

$ for chan in svt1 svt2; do echo running tv_parser.sh $chan; done
running tv_parser.sh svt1
running tv_parser.sh svt2

Creating a dynamic URL in a script, using the argument in $1:

channel="$1"

# fetch the page for this channel
wget "http://www.tv.nu/kanal/$channel"

What have we learned and used?

Fetching a web page from the command line

We have learned that we can use wget to fetch a web page:

$ wget 'http://www.tv.nu/kanal/svt1'

Filtering only interesting lines

We have learned that we can use egrep to filter out only interesting lines, giving it an argument with pipe separated search strings:

$ egrep 'data-title|data-start|data-end|data-text="' svt1

Cutting up lines on a delimiter character and select fields

We have learned that we can use the cut command to cut up a line by giving it the arguments of what delimiter to use and what fields to print:

$ egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-

The part that is interesting above is the cut -d '=' -f2- which means: Cut input line up in fields delimited by = and print all fields starting with the second one.

Some more examples of cut:

$ echo "apa|groda|kamel|spindel"|cut -d '|' -f 3
kamel

That means: “Cut up the input line in fields delimited by | (pipe) and print only the third field”.

$ echo "apa,groda|kamel,spindel,varg"|cut -d ',' -f 2
groda|kamel

That means: “Cut up the input line in fields delimited by , (comma) and print only the second field.

We can chain filters together

We can take the output from one filter and use it as input to the next by using the | (pipe):

$ egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-

A simpler example: How would we filter out only “groda” from this string?

"apa,groda|kamel,spindel,varg"

We first use cut to filter on comma, then we use cut again and filter on pipe:

$ echo "apa,groda|kamel,spindel,varg"|cut -d ',' -f 2|cut -d '|' -f1
groda

When we had this text:

"Rapport"
"2016-01-08T06:00:00+01:00"
"2016-01-08T06:05:00+01:00"
"Rapport sänder aktuella nyheter varje dag under hela året på SVT1, SVT2 och SVT24.">

And wanted to keep only the stuff between the quotes, we sent it to cut using double quote as delimiter and printed the second field:

$ egrep 'data-title|data-start|data-end|data-text="' svt1 | cut -d '=' -f2-|cut -d '"' -f2

The relevant part is the last pipe to cut:

 cut -d '"' -f2

An example of that strategy could be:

$ echo '<b>"hej"</b>'|cut -d '"' -f2
hej

There are many alternative ways though, that we could have used. Here’s one:

$ echo '<b>"hej"</b>'|sed -e 's/[^"]*"\([^"]*\).*/\1/'
hej

But who can understand that? ;-)

Note that in order to pass double quotes to echo, the string argument to echo was using single quotes, by the way.

Reading one whole line at the time into a variable

We saw that we could read one whole line into a variable, by piping the output to a while loop that used read line as the loop condition. Read is a function that reads a line, and “line” was just a variable called “line”:

$ cat lines.txt
En rad. Med text.
En rad till!
Och så en rad till.
Ytterligare en rad.
Tjatig fil det här.

In order to read one line at the time, we can cat the file and pipe it to such a while loop:

$ cat lines.txt | while read line; do echo "Rad: $line";done
Rad: En rad. Med text.
Rad: En rad till!
Rad: Och så en rad till.
Rad: Ytterligare en rad.
Rad: Tjatig fil det här.

Using % to keep track of repetitions of lines

In our tv-guide example, we knew that the output was a repetition of four lines where in each repetition line 1 was title, line 2 was start time, line 3 was end time and line 4 was description.

We used this knowledge to check what type (1, 2, 3, or 4) of line we were currently handling inside the loop, by using a counter variable that kept track of the line number. We checked if it was iteration 1, 2, 3, or 4 using an if statement and calculating the line counter % 4.

This works, be cause we know that we would get the following values:

1%4: 1
2%4: 2
3%4: 3
4%4: 0
5%4: 1
6%4: 2
7%4: 3
8%4: 0
9%4: 1
10%4: 2
11%4: 3
12%4: 0
13%4: 1
14%4: 2
15%4: 3
16%4: 0
17%4: 1
18%4: 2
19%4: 3
20%4: 0

Instead of just a line counter, we get a repetition of 1, 2, 3, 0, 1, 2, 3, 0, ... etc.

If we calculated the value 1, we were on a title line. 2 meant start time, 3 meant end time and 0 meant description. The if statements were used to figure out what the line represented!

We could have used another strategy for those who don’t like to count in modulo. We could have reset the counter to one as soon as it became 4. Here are both strategies for analysing a file that repeats four meanings of a file. The file repeats: Breakfast, lunch, snack, dinner:

rikard@T530:~/iths/tv-nu$ cat meals.txt
Egg and tea
Fish and chips
Sandwich and juice
Stake and sallad
Egg and coffee
Hamburger and coke
Peanuts and beer
Pizza
Sandwich and milk
Fish and potato
Apple
Pasta and wine

Let’s use % to find out what type of meal we’re at. Remember, the order is breakfast, lunch, snack, dinner.

$ c=0;cat meals.txt|while read line;do if [ $((++c%4)) -eq 1 ];then echo "Breakfast: $line";elif [ $((c%4)) -eq 2 ]; then echo "Lunch: $line";elif [ $((c%4)) -eq 3 ];then echo "Snack: $line";elif [ $((c%4)) -eq 0 ];then echo "Dinner: $line";fi;done
Breakfast: Egg and tea
Lunch: Fish and chips
Snack: Sandwich and juice
Dinner: Stake and sallad
Breakfast: Egg and coffee
Lunch: Hamburger and coke
Snack: Peanuts and beer
Dinner: Pizza
Breakfast: Sandwich and milk
Lunch: Fish and potato
Snack: Apple
Dinner: Pasta and wine

Now, we saw that such a one-liner was hard to read, so we could make it into a script instead. The script would look like this:

#!/bin/bash

c=0
cat meals.txt|while read line
do
 if [ $((++c%4)) -eq 1 ]
 then
  echo "Breakfast: $line"
 elif [ $((c%4)) -eq 2 ];
 then
  echo "Lunch: $line"
 elif [ $((c%4)) -eq 3 ]
 then
  echo "Snack: $line"
 elif [ $((c%4)) -eq 0 ]
 then
  echo "Dinner: $line"
 fi
done

The second strategy was to reset the counter (c) to 1 when we hit 4. That would look like this on the command line:

$ c=0;cat meals.txt|while read line;do if [ $((++c)) -eq 1 ];then echo "Breakfast: $line";elif [ $((c)) -eq 2 ]; then echo "Lunch: $line";elif [ $((c)) -eq 3 ];then echo "Snack: $line";elif [ $((c)) -eq 4 ];then echo "Dinner: $line";c=0;fi;done

And the script would look like this:

#!/bin/bash
c=0
cat meals.txt|while read line
do
 if [ $((++c)) -eq 1 ]
 then
  echo "Breakfast: $line"
 elif [ $((c)) -eq 2 ]
 then
  echo "Lunch: $line"
 elif [ $((c)) -eq 3 ]
 then
  echo "Snack: $line"
 elif [ $((c)) -eq 4 ]
 then
  echo "Dinner: $line"
  c=0
 fi
done

Links

Github sources

  • Scripts and examples here

Where to go next

This is the last theoretical chapter of the database book. There is, however a second assignment in the next chapter.

« PreviousBook TOCNext »