Database:Scraping data from the web

From Juneday education

Meta

Description

This topic is slightly off-topic compared to the rest of the database material in that it shows techniques in Bash for scraping data off the web for storage in a database. You may skip this lecture if the core focus of your course is SQL and database technology.

We think, however, that this chapter works well for showing those interested some ways to gather large quantities of data from a web page. If you find data you need for some application on a web page, the problem is that it is encoded in HTML. So you need a way to get the data down from the web page and translate it into a format suitable for storage in a database. Knowing a little about this prepares the student for doing some actual work, and the techniques taught in this lecture are applicable to many situations outside the scope of databases and the web.

Mangling text using bash is a skill we consider useful for any student who wants to pursue a career as a database technician, programmer or systems administrator. And it's fun and useful knowledge for everyone else as well!

The idea behind this lecture is to inspire the students to experiment on their own: starting with some data they need on the web, getting that web page down to their computer, parsing out the interesting data from the web page, and creating a script (or command line) which creates INSERT statements so that they can get the data into tables in their database. Once the data is in a database, they can do statistical analysis of it using SQL.
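
To make the end goal concrete, here is a minimal sketch (not taken from the slides) of that last step: turning lines of already-parsed data into INSERT statements. The file name shows.csv, the field separator and the table shows(start_time, title) are made up for the illustration:

# Minimal sketch: read "start;title" lines and emit one INSERT statement per line.
while IFS=';' read -r start title; do
    # Double any single quotes so the title is safe to embed in SQL
    safe_title=$(echo "$title" | sed -e "s/'/''/g")
    echo "INSERT INTO shows(start_time, title) VALUES('$start', '$safe_title');"
done < shows.csv > inserts.sql

The resulting inserts.sql file can then be fed straight to the database client.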

We've found that many database courses have more or less interesting datasets in the examples. However, not many courses show a strategy for how to actually get the data into the database from scratch. Web scraping is one way, and we show it here.

Prepare

This is a long lecture - the Swedish video lecture using these slides is divided into four parts. Expect it to take between 30 and 45 minutes. As usual, we recommend watching the lecture and reviewing the exercises. Since the exercises are about scraping data off a web site, make sure that the examples still work (the web page tv.nu may of course have changed, breaking the scripts and command lines used in the exercise).

If your students aren't comfortable with bash and the command line, you can do the exercises live on the projector and solve them together with the class. We think it is interesting for students to at least have seen what can be done in bash, even if they won't learn it in the course you are giving.

Heads-up

Think about:

  • Scraping data requires a set of tools like curl or wget, but also a working internet connection in the classroom. Parsing the data out of the HTML on the web page can also be tricky - and remember that web pages tend to change over time: what worked yesterday may not work today if the page has been redesigned!
  • The focus of this lecture isn't the bash one-liners or the bash tools themselves, but rather to show the students that they can work with “real data” which they might scrape off the web. Knowing some tricks on the command line is, however, very useful - also when you work with databases.
  • The bash used in the exercises requires a basic understanding of the standard streams, redirection and pipes.
  • Tools/commands and techniques used include (a small sketch combining several of these follows right after the list):
    • wget
    • egrep and grep
    • cut
    • cat
    • echo
    • chmod
    • date
    • read
    • while do done
    • if then elif then fi
    • redirection using < > and also pipes
    • bash variables
    • shell expansion using $(some commands)
    • shell expansion using $((some expression))
    • the % (remainder/modulo) operator
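
To give a feeling for how several of these pieces fit together, here is a small stand-alone sketch (not taken from the exercises) that uses a while/read loop, if/then/elif, arithmetic expansion with $(( )) and the % operator to label the lines of its input:

#!/bin/bash
# Sketch only: label the lines read from a file (or from standard input).
count=0
while read -r line; do
    count=$((count + 1))
    if [ "$count" -eq 1 ]; then
        echo "first line : $line"
    elif [ $((count % 2)) -eq 0 ]; then
        echo "even line $count: $line"
    else
        echo "odd line  $count: $line"
    fi
done < "${1:-/dev/stdin}"

Save it as, for instance, label-lines.sh, run chmod +x label-lines.sh, and give it any text file as its first argument (or pipe text into it).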

Full frontal - Code examples up-front!

$ # Let's see what Västtrafik has broken just now:
$ w3m -dump 'https://m.vasttrafik.se/reseinformation/trafikstorningar-mobil/?mid=46' |fmt -w 180 | grep Linje | sed -e 's/\(^\ *\)\(.*\)/\2/'
Linje 2, 19 och 158 stannar inte vid Domkyrkan och Grönsakstorget i båda riktningar.
Linje 2, 19 och 158 stannar inte vid Domkyrkan och Grönsakstorget i båda riktningar.
Linje 2, 19 och 158 stannar inte vid Domkyrkan och Grönsakstorget i båda riktningar.

$ # Let's see how many PDF files are available on this Wiki:
$ w3m -dump 'http://wiki.juneday.se/mediawiki/index.php?title=Special:MIMESearch/application/pdf&limit=500&offset=0&mime=application%2Fpdf' | egrep '^[0-9]+\.'|tail -1|cut -d '.' -f1
240
$ # 240 PDFs! Not bad! We all know what to read tonight, then!

$ # What about PNG images?
$ w3m -dump 'http://wiki.juneday.se/mediawiki/index.php?title=Special:MIMESearch&limit=250&offset=0&mime=image%2Fpng' | egrep '^[0-9]+\.'|tail -1|cut -d '.' -f1
78
$ # Only 78! That's an outrage! You guys need more images, you text freaks!

$ # How many videos have the fools uploaded to Vimeo?
$ w3m -dump 'https://vimeo.com/user52531669' | egrep '[0-9]+\ Videos' | cut -d ' ' -f4
674

$ # What year was Urban's latest publication?
$ GET 'https://ait.gu.se/om-institutionen/avdelningen-for-informatik/personal?userId=xnulur#tabContentAnchor2'|egrep '<h3>[0-9]{4}</h3>'|head -1 | sed -e 's/.*\([0-9]\{4\}\).*/\1/g'
2016

$ # In fact, when did anyone at Informatics last publish?
$ for user in $(GET 'https://ait.gu.se/om-institutionen/avdelningen-for-informatik/?selectedTab=2&itemsPerPage=-1'|grep userId|tr '?' '\n'|tr '&' '\n'|grep userId|cut -d '=' -f2); do name=$(GET "https://ait.gu.se/om-institutionen/avdelningen-for-informatik/personal?userId=$user"|grep '<h1>'|sed -e 's/.*<h1>\([&;A-Za-z ]\+\)<\/h1>.*/\1/g');name=$(echo "$name"|w3m -dump -T text/html);echo -n "$name ";GET "https://ait.gu.se/om-institutionen/avdelningen-for-informatik/personal?userId=$user#tabContentAnchor2"|egrep '<h3>[0-9]{4}</h3>'|head -1 | sed -e 's/.*\([0-9]\{4\}\).*/\1/g';echo;done
Urban Ask 2018

Alan B Carlson 
Ioanna Constantiou 2017

Håkan Enquist 2012

Rikard Fröberg 2018

Mikael Gustavsson 
Aida Hadzic Zukic 2015

Ola Henfridsson 
Frida Ivarsson 
Dina Koutsikouri 2018

Rikard Lindgren 2018

Juho Lindman 2019

Juho Lindman 2019

Jan Ljungberg 2018

Johan Magnusson 2019

Urban Nuldén 2016

Kalevi Pessi 2017

Lidija Polutnik 2018

Agneta Ranerup 2019

Fatemeh Saadatmand 2018

Ted Saarikko 2019

Taline Sandberg Jadaan 2010

Henrik Sandklef 2018

Lisen Selander 2016

Dick Stenmark 2018

Fredrik Svahn 2017

Fahd Omair Zaffar 2017

Yixin Zhang

Introduction

This chapter is slightly off-topic compared to the rest of the database course material in that it shows techniques in Bash for scraping data off the web for storage in a database. You may skip this chapter if your core interest is SQL and database technology.

We think, however, that this chapter works well for showing those interested some ways to gather large quantities of data from a web page. If you find data you need for some application on a web page, the problem is that it is encoded in HTML. So you need a way to get the data down from the web page and translate it into a format suitable for storage in a database.

This chapter (the PDF - we aim to record a video lecture for it too in the future!) and the companion exercise chapter show you a couple of techniques for doing just that.

The chapter requires some patience on your part, and a good knowledge of Bash. If you don't have a good knowledge of Bash yet, you will learn a lot from following the instructions in the lecture and the exercise in the next chapter. Enjoy!

Data sources

It is actually quite common to "scrape" data from web pages in order to feed some application or program with data. Over the years, we've helped our students scrape data from a variety of web sites: from lunch menus for an app which showed the day's lunch offers at our local restaurants, to an app showing the TV schedule (which is also the topic of the exercises for this lecture), to information about chemicals from a Swedish authority, and many other sources.

Ideally, the data you need is readily available in some standard data format such as JSON or XML, or at least CSV (comma separated values). Other times we're not so lucky. But often, the data we need is at least available on some web site, encoded in HTML as part of the web page's overall content. In such situations, it is good to know some tools and techniques for getting the data down to your computer and for extracting only the data you need from the web page's HTML code.

That's what we want to show you in this lecture.
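
The basic pattern almost always looks something like the sketch below: download the page, keep only the interesting lines, and strip away the HTML around the values. The URL, the class name "title" and the file names are placeholders here, not a real example from the exercises:

$ # Download the page to a local file (placeholder URL)
$ wget -q -O page.html 'https://www.example.com/schedule'
$ # Keep the lines we care about and strip away the surrounding HTML tag
$ grep '<h3 class="title">' page.html | sed -e 's/.*>\(.*\)<.*/\1/g' > titles.txt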

Some more examples

To find out what kind of publications the staff at Applied IT at Göteborg University are publishing, we could get the full list of publications (we have limited ourselves to 500 publications) as a web page (HTML), then parse out the HTML tag containing the publication type:

$ GET 'https://ait.gu.se/om-institutionen/?selectedTab=3&userName=&siteNodeId=638014&languageId=100000&itemsPerPage=-1&contentId=-1&originalRequestURI=%2Fom-institutionen%2F&publicationsPerPage=500'|grep '="publicationGroup'|sed -e 's/.*>\(.*\)<.*/\1/g'|sort|uniq -c|sort -rnk1|sed -e 's/^\ *//g'
190 Paper i proceeding
114 Artikel i vetenskaplig tidskrift
76 Konferensbidrag (offentliggjort, men ej förlagsutgivet)
44 Kapitel i bok
16 Doktorsavhandling
11 Konstnärligt arbete
10 Övrigt
9 Artikel i övriga tidskrifter
8 Poster (konferens)
7 Rapport
6 Samlingsverk
4 Lärobok
4 Bok
3 Licentiatsavhandling
3 Bidrag till encyklopedi
1 Special / temanummer av tidskrift
1 Recension
1 Proceeding
1 Inledande text i tidskrift
1 Artikel i dagstidning

We can now see that the most common type of publication is "Paper i proceeding" (paper in proceeding) with 190 publications. In second place, we find "Artikel i vetenskaplig tidskrift" (paper in scientific journal) with 114 publications. After that, we have "Konferensbidrag (offentliggjort, men ej förlagsutgivet)" (conference/other - not published by a publisher) as a fairly common type of publication with 76 publications. If we add up "Doktorsavhandling" (doctoral thesis) and "Licentiatsavhandling" (Licentiate thesis), we get 19 publications of those. And if we consider everything else as "Other", we get 187 publications of type Other, about the same as "Paper i proceeding".

How did we do that? Let's break down the command-line into parts:

GET 'https://ait.gu.se/om-institutionen/?selectedTab=3&userName=&siteNodeId=638014&languageId=100000&itemsPerPage=-1&contentId=-1&originalRequestURI=%2Fom-institutionen%2F&publicationsPerPage=500' |  # Get the HTML page
  grep '="publicationGroup' |      # Filter on lines containing ="publicationGroup
  sed -e 's/.*>\(.*\)<.*/\1/g' |   # Extract the type text from the HTML tag surrounding it
  sort |                           # Sort the lines lexicographically
  uniq -c |                        # Count the number of occurrences, producing the count and the unique text
  sort -rnk1 |                     # Sort the lines descending on the numeric key (occurrences)
  sed -e 's/^\ *//g'               # Remove leading blanks

Another, slightly more advanced example is available in our github repo for this lecture. It is a script (click on "Raw" before you download it from github, or clone the whole repo) that only counts a publication for an author if that author is the first (primary) author of the article. Otherwise we would count the same article four times if four researchers at our department have all contributed to it and listed it on their home pages.

The script produces output for each author (staff at our department who have listed something on their home page under "Publications") and a summary for the whole department. As the script is running, you can see it analysing each publication, printing a '.' for each publication where the author is a co-author, an 'F' if the author is the first (primary) author, and a 'V' if the author is the first (primary) author and the publication appeared in a scientific journal (which, we believe, is the strongest currency in circles obsessed with publications).

The publications are investigated in reverse chronological order, from the latest publication and back in time to the first one. This way, you can also study how the publishing frequency has changed over the years.

Only the first-name publications (in journals and elsewhere) are counted towards the department total.
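
The full script is in the repo, but the heart of the per-publication marker logic can be sketched roughly like this (first_author and in_journal are hypothetical variables standing in for whatever the real script parses out of each publication entry):

# Rough sketch only - not the actual script from the repo.
if [ "$first_author" = "yes" ] && [ "$in_journal" = "yes" ]; then
    echo -n "V"    # first (primary) author, published in a scientific journal
elif [ "$first_author" = "yes" ]; then
    echo -n "F"    # first (primary) author
else
    echo -n "."    # co-author
fi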

Analysing papers by Albert Einstein from 2017 .......F...F..F....V..FF.....F......... 1991.
 Done analysing papers.
Albert Einstein: 0 single author publications, 7 first-name publications, and 32 articles as co-author
First publication published 1991, last publication published 2017
Years active publishing: 28
Number of publications: 39.
Average publications per year: 1.39
Publication types:
17 Artikel i vetenskaplig tidskrift
16 Paper i proceeding
2 Kapitel i bok
1 Poster (konferens)
1 Licentiatsavhandling
1 Konferensbidrag (offentliggjort, men ej förlagsutgivet)
1 Doktorsavhandling
Number of scientific journal publications as first name: 1

... the same for each of the staff

==========Summary for the department===============
533 (first-name) publications by 22 authors of which 105 were in sci. journals
The ratio (scientific journal publication/publication) is 0.20 .
An average of 24.23 publications per employee (counting only emp. who publishes).
The total years of publishing (summing every authors active years publishing) is 280
In average, each publishing empl has published 4.77 scientific papers in a sci. journal.
In average, each publishing empl has published for 12.73 years.
In average, each publishing empl has published 0.37 sci. papers in a sci. journal per year.

Files from the lecture

Here's the file kommuner-raw which we used to extract the municipality URLs.
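
We don't reproduce the file contents here, but extracting the URLs from a raw HTML dump like that typically boils down to something like the line below (assuming the file contains ordinary href attributes - adjust the pattern to whatever the file actually contains):

$ grep -o 'href="[^"]*"' kommuner-raw | cut -d '"' -f2 | sort -u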
