C:Robust code

From Juneday education
Jump to: navigation, search

Why robust

Have you ever seen the program ls crash? What are your expectations in the program find program? What would you say if gcc crashed when using it? We, the authors of this material, would stop using such a program. So what expectations do you have on your programs?

It's about time we talk about robust code. We'll begin this by looking at two examples of source code. The first one is from a book and the second one is from a student.

Code examples

It's hard to talk about robust code and not give examples of the opposite. Giving an example of a program crashing is not enough. We will show you two examples of programs that:

  • compiles - albeit with some warnings if descent flags are used
  • runs fine - with correct input data

So by showing you crap code that kind of works we hope to stimulate some thoughts on robustness. Why it is important that you are the toughest tester on the planet. After all, you don't want a report saying your program sucks, do you?

Reading data from a user

As a starting point for a discussion on robust we'll start with a discussion on code that:

  • compiles
  • works (given expected input data)

Let's use some code from a C book (no, we will not disclose the author):

char *read_input(void) {
  char temp[1000];
  int c,  n=0;
  while (( c=getchar()) != '\n') 
    temp[n++] = c ;
  temp[n]='\0';
  char * res = malloc(n+1);
  strcpy(res, temp);
  return res;
}

We've written a small program that uses this function to read input from stdin and simply prints the data out on stdout - Nuthin' fancy but serves well as a starting point for a discussion on robustness. Look at the code for some seconds and see if you can find some problems. You can find the source code here: https://github.com/progund/programming-with-c/blob/master/robust/read_input/cfb.c

Here are some problems we found by simply looking at the code:

  • what if there's no input at all, but an EOF (end of file)?
  • char temp[1000]; - maximum 1000 bytes can be read. Why 1000? What happens if the user supplies more than 1000 characters?
  • char * res = malloc(n+1);
    strcpy(res, temp); char temp[1000]; - what if malloc fails?

Given the three above problems one wonders if it is tested at all, but since the buffer size is 1000 the author have most likely stumbled upon some problems and had to increase the size.

What happens if we compile this code with no flags?

$ gcc cfb.c

Works fine. How about adding some flags to gcc:

$ gcc -g -pedantic -Wconversion -Wall -Werror -Wextra -Wstrict-prototypes -fstack-protector-strong cfb.c
cfb.c: In function ‘read_input’:
cfb.c:13:17: error: conversion to ‘char’ from ‘int’ may alter its value [-Werror=conversion]
     temp[n++] = c ;
                 ^
cfb.c:15:23: error: conversion to ‘size_t {aka long unsigned int}’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
   char * res = malloc(n+1);
                       ^
cc1: all warnings being treated as errors

Uh oh. Not so nice. Why publish code that produces warning (which we prefer to treat as errors to make sure we fix them).

We could go on and discuss "minor" things with the code, such as:

  • no curly braces in the while loop
  • declare multiple integers on the same line
  • only checks if c == '\n', should check if temp[n++] writes outside the (stack) memory or if c==EOF

but the above three are way more important.

Testing with nice input

Let's write some test code the above code:

$ gcc cfb.c -o cfb
$ echo hi there | ./cfb
hi there

Seems to work with valid input data. Let's try some more interesting things. Let's create 500 characters input data:

$ cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 500 | head -1 | ./cfb
9fFKFtw1tlmtaThbOLRo4l8YpEGLoES4whu4dTi4D7HaxiFaeVzFtF4zxGfojFOdXEVCWeDbKRX2jhPXlts29YuzTyEc5g71tMbZq25XRdELYU4jogyOfs5Fxp5Yf3JMulFKayRwqnHGoAUHlkjYRxdX9xUj3XKHNyVzKIwd0UUuo3aUVxL9YyHWycGxez5TXRhDqbrckcNzn9C5PCnr7147iCEqvgEP1wH2XYKwdA2Qm7y17BSCnWfLfsQQFeYHLulRC3zLd85Cjt5o9unEDtM7ohRlqYQwKjkj1bi5XuIVAGJ8fXeAiAb7fUqRZuEhLRUXI8SkvVpQr1QPlraVGaBPsXqbPHRVKvsVXHcJhGPaXonTIax20AtaCRpJ58A77nkRG7cmc0Z1bszAADHhjoqpeL5FC5iAT4iHEraClSe7muKRCYtDi0PEtgFzfS3VMm8eLE3bM5bHRl2nAWfNGnU1MWYNUbTFEqc9U9xBNdHUSXkYnhzc

Note: the command above uses a so called pseudo random device to create random data. As such you will most likely NOT get the same characters as we did when writing this.

what if there's no input at all?

How do we test this? Easy, we use echo like this:

$ echo -n | ./cfb
Segmentation fault (core dumped)

Wow, the program crashes. Actually it is the function that causes the crash.

If you wonder what on earth we're doing when we write echo -n we're asking echoto print nothing and add no trailing nweline. In other words, print nuthin and that's about it. If we count the number of bytes with wc we can make sure that this results in no data:

$ echo -n | wc -c
0

.... so 0 (zero) bytes are written with the above command. You could also do sleep 0|./cfb but the echo version seems to us to be more natural.

more than 1000 characters

Let's compile with a nice gcc command line option: -fstack-protector-strong. The manual says:

       -fstack-protector
           Emit extra code to check for buffer overflows, such as stack smashing attacks.  This is done by adding a guard variable to
           functions with vulnerable objects.  This includes functions that call "alloca", and functions with buffers larger than 8 bytes.
           The guards are initialized when a function is entered and then checked when the function exits.  If a guard check fails, an error
           message is printed and the program exits.
       -fstack-protector-strong
           Like -fstack-protector but includes additional functions to be protected --- those that have local array definitions, or have
           references to local frame addresses.
$ gcc -fstack-protector-strong -g cfb.c -o cfb

Note: -g adds debug information to the program/object files

$ cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 10000 | head -1 |   ./cfb 
Segmentation fault (core dumped)
$ cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 10000 | head -1 |   ./cfb 
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)

So the program crashes. Not so great is it?

Note: the command above creates a string with 10 000 (ten thousand) of characters.

what if malloc fails?

How the %¤# can we test this? We can (at least) do this in two ways:

  • replace malloc in the source code with a macro
  • use a self-written malloc when running the program

We're going to use the first one right now.

#ifdef FAKED_ALLOC_FAILURE
#define malloc(n) NULL
#endif

If FAKED_ALLOC_FAILURE is defined we're replacing malloc with NULL. Nice trick ;)

$ gcc -DFAKED_ALLOC_FAILURE -fstack-protector-strong -g cfb.c -o cfb 
$ echo hej | ./cfb 
Segmentation fault (core dumped)

So it crashses if memory allocation fails. Not so very good.

Trying to sum things up

The above code* is rather bad when it comes to dealing with unexpected input data. But again, we expect programs to work - not to crash. So we need to write robust code. How we do this is discussed below. But we would like to make you start thinking about your code as something that should work - and not crash. We (the authors) like to think about how we can break our code, make it NOT work by passing bad input data. Feeding your program only valid, typical or expected data is not to test your program. It's merely confirming that your program works when the input is exactly that; valid, typical or expected.

*) How this code can end up in a book is worth thinking about. An additional funny thing is that the book discusses common mistakes when it comes to pointers just after presenting the code above.

In defense of the author of the code

Perhaps the author wanted to spare the reader from details on memory allocation, only EOF as input data etc. This might be the case, but then:

  • the name of the function should reflect that, how about naively_read_input or dangerously_read_input
  • you should make a comment stating that
  • you should not tip readers about common mistakes some 20 lines after making them yourself

Reading data from a file

We got a request from a student to review some code the student had written. We said yes and gave some 54 comments back and had a discussion about the code, our comments. The discussion went fine and we realised, after a while, that these comments probably are worth discussing on this wiki, so let's do that.

Source code

books.h

typedef struct book_ {
  char name [100];
  char author[100];
  long isbn;
  int year;
  int nr_of_pages;
  double price;
}  book;
typedef book *bookptr;

Some comments:

  • why only 100 characters?
  • the file contains the struct book (singular) - why on earth is the file called books.h (plural form of books)

books.c

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "books.h"

book books[100]; /* global array of books */

int main (void) {
    
  char *oneline, *tok;
  char a_book[256];
  char delim[] = ",";
  int nr_records = 0; 
  bookptr b = books;

  FILE *fp;
  if ((fp = fopen("books.csv", "r")) == NULL) {
    fprintf(stderr, "The file books.csv couldn't be opened\n");
    exit(-1);
  }

  while (fgets(a_book, 255, fp) != NULL) {
    a_book[strlen(a_book) - 1] = '\0'; /* Get ride of return character */
    oneline = strdup(a_book);   /* copy line just in case... */
    tok = strtok(oneline, delim);

    nr_records++; /* we read one line/record at a time*/
    strncpy (b->name,tok,99); /* copy name */
    tok = strtok(NULL, delim); 
    strncpy (b->author,tok,99); /* copy author*/
    tok = strtok(NULL, delim);
    b->isbn = atol(tok); /* convert isbn */
    tok = strtok(NULL, delim);
    b->year = atoi(tok); /* convert year */
    tok = strtok(NULL, delim);
    b->nr_of_pages = atoi(tok); /* convert nr_of_pages */
    tok = strtok(NULL, delim);
    b->price=atof(tok);

    
    tok = strtok(NULL, delim); /* reads end of string */
    b++;    

    free(oneline); free(tok);
    printf("Number of lines read %d\n",nr_records); 
  }
  int i = 0;
   b=books;
  while(i<nr_records) {
    printf("Name of book %s\n",b->name);
    printf("Price: %f\n\n",b->price);
    i++;
    b++;
  }

  fclose(fp);
  
}

What? Huah.... Ok, where to begin?

crucial comments:

  • ato* - dangerous function, what if there is no integer/double/float there?
  • 1 memory allocation (strdup) and 2 calls to free - any bell ringing?
  • strtok is not wise to use if the csv file contain empty strings between delimiters (e.g ,,)

and some more:

  • indentation is inconsistent
  • let's discard the following facts
    • that the program always read books from the same file (books.csv) - make the program more generic
    • the program reads four books, prints four books - does this qualify as a program? - make the program more generic
  • global array - why?
  • all code in main - why?
  • only 1000 books? - what happens aftwards? crash?
  • only lines less than 255 bytes? UTF8? - what happens with longer lines?
  • 256 bytes are allocated - but only 254 + 1 ('\0') are used. Why?
  • what if author's name is more than 100 characters?
  • what if books's name is more than 100 characters?
  • a_book is not a book, but a string. Misleading name
  • Why copy the line? - explain if you really need to do that
    • /* copy line just in case... */ - just in case WHAT
  • missing exit or return in main
  • books.c and books.c seem to belong together since they share the same name (excluding suffix) but books.c seems to be only a main function. - Why not rename books.c to main.c
  • oneline - is there a two or second line? Or a variable containing two lines?
  • nr_records++; /* we read one line/record at a time*/. - what does this mean?
  • tok = strtok(NULL, delim); /* reads end of string */ - misleading comment
    • the above code does nothing - unless tok is not NULL when things will go horribly bad. free(NULL) is ignored
  • free(oneline); free(tok); - two statements on one line
  • different use of space (strcpy( resp strcpy ()) - but hey, who does not fail to be perfect
  • hard coded 255 and 256 - why not macros?
  • the size of the autor's name is defined in books.h - perhaps provide a macro for the users of books.h?
  • /* reads end of string */ - why is this needed?
  • a_book[strlen(a_book) - 1] = '\0' - what if the a_book is longer than 255 charcters?
  • if the file is in DOS format (i e \r\n) then the replacing of \n will keep \r

books.csv

The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07
Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

Testing with nice input

Let's test the code with csv files:

$ gcc books.c -o books 
$ ./books 
book: The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07

Number of lines read 1
book: Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

Number of lines read 2
Name of book The Name of the Rose
Price: 9.070000

Name of book Foucault's Pendulum
Price: 10.870000

... and we even can test with valgrind. Let's add -g as option to gcc to make valgrind be able to point out where in the code something goes wrong (if it does).

$ gcc -g books.c -o books 
[hesa@bartok books]$ valgrind --leak-check=full ./books 
==1337== Memcheck, a memory error detector
==1337== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1337== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==1337== Command: ./books
==1337== 
book: The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07

Number of lines read 1
book: Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

Number of lines read 2
Name of book The Name of the Rose
Price: 9.070000

Name of book Foucault's Pendulum
Price: 10.870000

==1337== 
==1337== HEAP SUMMARY:
==1337==     in use at exit: 0 bytes in 0 blocks
==1337==   total heap usage: 5 allocs, 5 frees, 5,789 bytes allocated
==1337== 
==1337== All heap blocks were freed -- no leaks are possible
==1337== 
==1337== For counts of detected and suppressed errors, rerun with: -v
==1337== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Ok, seems to work fine.

more than 255 characters

Let's add a line that is more than 255 characters, what happends then?

This is the new books.cvs:

The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07
Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87
Foucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

so let's run the program again:

$ ./books 
book: The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07

Number of lines read 1
book: Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

Number of lines read 2
book: Foucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's PendulumFoucault's Pendulum,Umberto Eco, 015603297X,2
Segmentation fault (core dumped)

Huah, not much of a program.

missing author's name

Let's remove the name of one author, what happens then? We do this since the code uses strtok which explicitly says in the manual it will continue to next delimiter if it finds two "in a row".

This is the new books.cvs:

The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07
Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87
Foucault's Pendulum,,015603297X,2007,640,10.87

so let's run the program again:

$ ./books 
book: The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07

Number of lines read 1
book: Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

Number of lines read 2
book: Foucault's Pendulum,,015603297X,2007,640,10.87

Segmentation fault (core dumped)

letters instead of int

The code uses atoi and atof to scan the year, number of pages, ISBN and price. the ato* are plain silly to use in this situation since you can't see if they succeeded or not. Let's replace the year with a letter.

This is the new books.cvs:

The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07
Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87
Foucault's Pendulum,Umberto Eco, 015603297X,string-instead-of-int,640,10.87

so let's run the program again:

$ ./books 
book: The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07

Number of lines read 1
book: Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

Number of lines read 2
book: Foucault's Pendulum,Umberto Eco, 015603297X,string-instead-of-int,640,10.87

Number of lines read 3
book: 

Segmentation fault (core dumped)

add some text at the end of a line

Let's add some text at the end of one line. What happends then? We do this since the code searches for a final string and deallocates that memory (using free). This will trigger two calls to free with actual memory (not NULL). The double deallocation was never triggered before since the input data (the csv file) was perfect match for the program.

This is the new books.cvs:

The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07
Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87
Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87,some,extra,data

so let's run the program again:

$ ./books 
$ ./books 
book: The Name of the Rose,Umberto Eco,0544176561,2014,592,9.07

Number of lines read 1
book: Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87

Number of lines read 2
book: Foucault's Pendulum,Umberto Eco, 015603297X,2007,640,10.87,some,extra,data

free(): invalid pointer
Aborted (core dumped)

Trying to sum things up

Again, the code we've showed is actually working fine. If perfect indata that is. And again, this will not do. We don't expect programs to crash on us. We don't want gcc to crash, do we? firefox? Expect nothing less from your code. Expect way less "suckiness" from your teachers and from authors - that is self proclaimed experts.

In defense of the author of the code

The code as probably written to solve a specific problem with specific input data and as such did not need to take into consideration the above stuff. So the code probably did what it is supposed to do. But the student asked for input and got it.

Are we not mean?

Are we not men?. Who are we to accuse an author/student like this?

  • we feel strongly about pedagogy and believe that the possible pros of the code (easier to read, less code) in no way outweighs the cons (code crashing) since we know that readers/students use teachers'/authors' code as some kind of blueprint for how code should be written. So let's do things correctly.
  • the student ok:ed to publish the comments.
  • we publish all our code - apart from a very limited amount of secret stuff with solutions to hand in assignments etc. We strongly recommend all our students to send our code (and all other material) for review.

But, yes, we're also assholes in the sense that we don't prefer being polite over being true to the ones meaning the most for us - our students and readers. And we believe that the student took the discussion in the right way - not us telling the student that he/she is bad but the code need to be made more robust.

Conclusion

TBD:

We've seen

  • compilable code
  • executable code

suck horribly

Let's spend time discussin'

  • write code - break it
  • be careful
  • be accurate
  • spend time wrinting code - not bug fixing it