Chapter:C extra csv

From Juneday education
Jump to: navigation, search

CSV and C

We've gotten some questions from students how to parse a CSV file in (both from studetns attending a C course as well as from students attending a Java course. So, let's give an example which we also can use in a lab we've written Exposing data over http for an advanced Java course. In this lab the students get (from us) a CSV parser written in Java and they should write code to expose and consume the data using JSON.

The idea behind the original lab/assignment is about fixing the so called "Open API" offered by Systembolaget, a government-owned chain of liquor stores in Sweden. The "Open API" is actually a static XML file. Ok, it's updated daily but it's not an API. The students should download the XML file, put the products in a database and expose the products using JSON.

In this directory you'll find a solution on how to:

  • download XLS file - using a script
  • convert XLS to CSV - using a script
  • read the CSV (product-csv.c) and store the products (product.c) in a list
  • print SQL statements (product-db.c) given the list above
  • pipe the SQL printouts to SQLIte to store the products in a database

About the solution

C code

Missing hash map in C

The XML file has, per elelement, a group. The group could be "Öl" which translates to "Beer" in english. This string will be stored tons of times in the database so we decided to normalise the data. One choice is to use a map of some kind. We thought about using the great stuff over att Glib but decided to go for a plain C solution. We've made an ugly (no, let's say hackish) way to solve this. We keep a (dynamic) list of unique group name and use the index for each of these. Have a look at the function int group_to_int(product_list* list, char *group) in the file product.c

No static sizes

Every string is stored using dynamic memory. It's easier and more flexible so why not? ;) The memory is checked using Valgrind (see below for a link).

strsep - since strtok is quite useless

From the strok manual:

  From  the  above description, it follows that a sequence of two or more contiguous delimiter bytes in the parsed string is considered
  to be a single delimiter, and that delimiter bytes at the start or end of the string  are  ignored.   Put  another  way:  the  tokens
  returned by strtok() are always nonempty strings.  Thus, for example, given the string "aaa;;bbb,", successive calls to strtok() that
  specify the delimiter string ";," would return the strings "aaa" and "bbb", and then a null pointer.

Parsing a csv file with one field missing, e g two delimiters after each other like this ,some data,,, will be impossible. So let's not use strtok.

We're using strsep instead.

Explaining the code (functions and files)

Our solution is pretty straight forward:

  • read a csv file and store the products
  • loop throught the products and print them out

src/product.c and .h

In order to get there we need some data structures to keep our products. Apart from the products we need to keep a list of product groups (e. g. Öl/beer). Since we're using dynamix memory (we do not know how many products there are or how long the name of a products is etc etc)). So we've come up with the following structures defined in product.h:

typedef struct group_table_
{
  char** names;
  unsigned int size;
} group_table;

This structure (group_table) is used to 1) store the names of the products groups and how many names have been stored.

typedef struct _product
{
  char  *name;
  unsigned int nr;
  float price;
  float volume;
  int   group;
  char  *type;
  char  *style;
  char  *package;
  char  *country;
  char  *producer;
  float alcohol;
} product;

This structure (product) is used to store information about one product.

typedef struct _product_list
{
  product*      products;
  unsigned int  size;
  group_table   groups;
} product_list;

This structure (product_list) is used to store a list of products and information about all groups.

There are some functions worth a mention:

product_list*
new_product_list(void);

Creates a new product_list and returns a pointer to it. NULL on failure.

void
free_product_list(product_list* list);

Frees all memory for a product_list.

int
group_to_int(product_list* list, char *group);

Given a product group this functions return a number representing this group.

product*
add_product(product_list* list);

This function creates new memory for a product in the product list and returns a pointer to the new memory. Use this memory to store the information about the new product.

unsigned int
product_list_size(product_list* list);

Returns the size of the product list.

void
print_product(product* p);

Prints one product (stdout).

void
print_product_list(product_list* list);

Prints the products (stdout) in the product list.


src/main.c

This is a short file using the functions in the other files.

A simplified version of the main function looks like this:

1   product_list* list = new_product_list();
2   csv_to_product_list(list, argv[1]);
3   print_product_list_db(list);
4   free_product_list(list);
  1. Creates a new product_list struct
  2. Reads a csv file, stores the products in the struct
  3. prints the product in a format usable by SQLite
  4. Frees all memory for the struct

src/product-csv.c and .h

There's only one function defined in the header filer:

product_list*
csv_to_product_list(product_list* list, char *file_name);

This function reads the csv files and stores the products in the list passed as a parameter. This function uses other (static) functions in product-csv.c as well functions defined in product.h.

static char *read_input(FILE* stream)

This function, which in two variants:

  1. one using GNU's getline (add -DGETLINE to the compiler directives to use this implementation)
  2. one using fgets

The function (both implementations) allocates memory dynamically containing one line in the csv file.

static void
parse_product_line(product_list* list, char *line)

This function parses one line, typically as read by read_input, and stores a product in the product_list.

src/product-db.c and .h

void
print_product_db(product* p)

Prints a product to stdout. The format of the printout is usable by SQLite (or another DBMS).

void
print_product_list_db(product_list* list)

Prints all products in a prouct_list to stdout, including statements to create the database tables. The format of the printout is usable by SQLite (or another DBMS).

Data

The data from Systembolaget is downloaded using either a script:

$ ./bin/download.sh

or using the Makefile

$ make download

which invokes the script above. The script uses curl to download.

Converting to CSV

We're using a program called ssconvert (part of Gnumeric). If it is available under Windows or MacOS we don't know. But we have provided the data files (XML, XLS and CSV) for you so there should be no need if you're ok with old data.

First line of the CSV file:

nr,Artikelid,Varnummer,Namn,Namn2,Prisinklmoms,Pant,Volymiml,PrisPerLiter,Saljstart,Utgått,Varugrupp,Typ,Stil,Forpackning,Forslutning,Ursprung,Ursprunglandnamn,Producent,Leverantor,Argang,Provadargang,Alkoholhalt,Sortiment,SortimentText,Ekologisk,Etiskt,EtisktEtikett,Koscher,RavarorBeskrivning

Example of line from the CSV file:

101,1,1,Renat,,204.00,,700.00,291.43,1993/10/01,0,Okryddad sprit,,,Flaska,,,Sverige,Pernod Ricard,Pernod Ricard Sweden AB,,,37.50%,FS,Ordinarie sortiment,0,0,,0,Säd.

We will only work with the following columns (English translation):

  • nr
  • Namn (name)
  • Prisinklmoms (price including VAT)
  • Volyiml (volume in ml)
  • Varugrupp (group, e g Öl (Beer)))
  • Typ (type)
  • Stil (style)
  • Alkoholhalt (alcohol percentage)

... although we might add code to parse some more out later on. These are the column used in the database.

dealing with '

The ' in the csv file are replaced by '' (two single quotues). This is done using sed.

dealing with "

The " in the csv file are removed . This is done using sed.

Software requirements

  • curl - if you want to download the XML, XLS files instead of using the provided
  • SQLite - if you want to store the data in a database
  • Valgrind - if you want to run the memory checks

Makefile

The following targets are useful in the Makefile:

  • download - downloads the XML, XLS and creates the CSV file
  • db - to store the small csv file to a database file (files/products.db)
  • big-db - to store the big csv file (products.csv) to a database file (files/products.db)
  • mem-check - check memory with the small csv file
  • big-mem-check - check memory with the big csv file
  • clean - clean the generated files
  • cleanup - as clean but also removes the XLS, XML and csv files
  • run - (compile) and run the program