4.1 Comma-Separated Values
*Comma-separated values*, or CSV, is the term for a natural and widely used representation for tabular data.
Data formats like CSV work best if there are convenient libraries for converting to and from the format, perhaps allied with some auxiliary processing such as numerical conversions. But we do not know of an existing public library to handle CSV, so we will write one ourselves. In the next few sections, we will build three versions of a library to read CSV data and convert it into an internal representation.
4.2 A Prototype Library
Our first version will ignore many of the difficulties of a thoroughly engineered library, but it will be complete enough to be useful and to let us gain some familiarity with the problem.
Our starting point is a function csvgetline that reads one line of CSV data from a file into a buffer, splits it into fields in an array, removes quotes, and returns the number of fields.
Here is a **prototype version in C**.
#include <stdio.h>
#include <string.h>

char *unquote(char *p);

char buf[200];      // input line buffer
char *field[20];    // fields

// csvgetline: read and parse line, return field count
// sample input: "LU",86.25,"11/4/1998","2:19PM",+4.0625
int csvgetline(FILE *fin)
{
    int nfield;
    char *p, *q;

    if (fgets(buf, sizeof(buf), fin) == NULL)
        return -1;
    nfield = 0;
    for (q = buf; (p = strtok(q, ",\n\r")) != NULL; q = NULL)
        field[nfield++] = unquote(p);
    return nfield;
}
The CSV format is too complicated to be parsed easily by scanf, so we use the C standard library function strtok. Each call of strtok(p, s) returns a pointer to the first token within p consisting of characters not in s; strtok terminates the token by overwriting the following character of the original string with a null byte. On the first call, strtok's first argument is the string to scan; subsequent calls use NULL to indicate that scanning should resume where it left off in the previous call. This is a poor interface. Because strtok stores a variable in a secret place between calls, only one sequence of calls may be active at one time; unrelated interleaved calls will interfere with each other.
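The behavior described above is easy to observe directly. The following is a minimal sketch, not part of the prototype; the helper name count_tokens is hypothetical and exists only to show what strtok does to its argument.

```c
#include <string.h>

// count_tokens: count the tokens strtok finds in s (hypothetical helper).
// s must be writable: strtok overwrites separators with '\0' as it goes.
int count_tokens(char *s, const char *sep)
{
    int n = 0;
    char *p;

    // first call passes the string, later calls pass NULL to resume
    for (p = strtok(s, sep); p != NULL; p = strtok(NULL, sep))
        n++;
    return n;
}
```

Called on a writable copy of `x,,y` with separator `,`, this returns 2 rather than 3: strtok treats a run of separators as one, so the empty field is silently skipped. Afterward the buffer reads as just `x`, because the first comma has been overwritten with a null byte.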
Our function unquote removes the leading and trailing quotes that appear in the sample input above, but it doesn't handle nested quotes. However, it's enough for a prototype:
// unquote: remove leading and trailing quote
char *unquote(char *p)
{
    if (p[0] == '"')
    {
        if (p[strlen(p)-1] == '"')
        {
            p[strlen(p)-1] = '\0';
            p++;
        }
    }
    return p;
}
A simple test program helps verify that csvgetline works:
// csvtest main: test csvgetline function
int main(void)
{
    int i, nf;

    while ((nf = csvgetline(stdin)) != -1)
        for (i = 0; i < nf; i++)
            printf("field[%d] = '%s'\n", i, field[i]);
    return 0;
}
Problems in our prototype:
- The prototype doesn't handle long input lines or lots of fields. It can give wrong answers or crash because it doesn't even check for overflows, let alone return sensible values in case of error.
- The input is assumed to consist of lines terminated by newlines.
- Fields are separated by commas and surrounding quotes are removed. There is no provision for embedded quotes or commas.
- The input line is not preserved; it is overwritten by the process of creating fields.
- No data is saved from one input line to the next; if something is to be remembered, a copy must be made.
- Access to the fields is through a global variable, the field array, which is shared by csvgetline and the functions that call it; there is no control over access to the field contents or the pointers. There is also no attempt to prevent access beyond the last field.
- The global variables make the design unsuitable for a multi-threaded environment or even for two sequences of interleaved calls.
- The caller must open and close files explicitly; csvgetline reads only from open files.
- Input and splitting are inextricably linked: each call reads a line and splits it into fields, regardless of whether the application needs that service.
- The return value is the number of fields on the line; each line must be split to compute this value. There is also no way to distinguish errors from end of file.
- There is no way to change any of these properties without changing the code.
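Two of the problems listed above, overflow of the field array and the silent loss of empty fields, can be illustrated with a small sketch. This is not the library's eventual split function; split_checked is a hypothetical helper, and it still ignores quoting entirely.

```c
#include <string.h>

// split_checked: split line at commas into field[0..maxfield-1].
// Returns the field count, or -1 if the line holds more than maxfield
// fields. Hypothetical helper: modifies line in place, keeps empty
// fields, and does not understand quotes.
int split_checked(char *line, char **field, int maxfield)
{
    int n = 0;
    char *p = line;
    char *comma;

    line[strcspn(line, "\r\n")] = '\0';   // drop the line terminator
    for (;;) {
        if (n >= maxfield)
            return -1;                    // report overflow, don't write past the array
        field[n++] = p;
        comma = strchr(p, ',');
        if (comma == NULL)
            break;
        *comma = '\0';                    // terminate this field
        p = comma + 1;
    }
    return n;
}
```

Because the caller supplies both the array and its size, the function can refuse to overflow and say so through its return value, which the prototype cannot do.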
4.3 A Library for Others
** Interface. **
We decide on three basic operations:
char *csvgetline(FILE *f): read a new CSV line
char *csvfield(int n): return the n-th field of the current line
int csvnfield(void): return the number of fields on the current line
We decide that csvgetline will return a pointer to the original line of input, or NULL if end of file has been reached.
** Information hiding. **
We will have to grow memory as longer lines or more fields arrive. Details of how that is done are hidden in the csv functions; no other part of the program knows how this works, and the interface doesn't reveal when memory is freed.
** Resource management. **
Whoever opens an input file should do the corresponding close: matching tasks should be done at the same level or place.
** Error handling. **
As a principle, library routines should not just die when an error occurs; error status should be returned to the caller for appropriate action. Nor should they print messages or pop up dialog boxes, since they may be running in an environment where a message would interfere with something else.
** Specification **
The best approach is to write the specification early and revise it as we learn from the ongoing implementation.
The rest of this section contains a new implementation of csvgetline that matches the specification. The library is broken into two files: a header csv.h that contains the function declarations that represent the public part of the interface, and an implementation file csv.c that contains the code.
Here is the header file:
// csv.h: interface for csv library
extern char *csvgetline(FILE *f);  // read next input line
extern char *csvfield(int n);      // return field n
extern int csvnfield(void);        // return number of fields
The internal variables that store text and the internal functions like split are declared static so they are visible only within the file that contains them. This is the simplest way to hide information in a C program.
enum { NOMEM = -2 };            // out of memory signal
static char *line = NULL;       // input chars
static char *sline = NULL;      // line copy used by split
static int maxline = 0;         // size of line[] and sline[]
static char **field = NULL;     // field pointers
static int maxfield = 0;        // size of field[]
static int nfield = 0;          // number of fields in field[]
static char fieldsep[] = ",";   // field separator chars
The variables are initialized statically as well. These initial values are used to test whether to create or grow arrays.
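One way such a test might look is sketched below. This is an assumption about the shape of the code, not the library's actual implementation: grow_fields is a hypothetical helper that takes the array and its size as arguments (so the sketch stands alone) and treats NULL with size 0 as "not yet created".

```c
#include <stdlib.h>

// grow_fields: make sure *fieldp has room for at least need pointers.
// *fieldp == NULL with *maxp == 0 means the array was never created.
// Returns 0 on success, -1 if memory runs out (old array is untouched).
// Hypothetical helper, shown only to illustrate create-or-grow.
int grow_fields(char ***fieldp, int *maxp, int need)
{
    int newmax;
    char **newf;

    if (need <= *maxp)
        return 0;                          // already big enough
    newmax = (*maxp == 0) ? 1 : *maxp;
    while (newmax < need)
        newmax *= 2;                       // double to keep reallocations rare
    // realloc(NULL, n) behaves like malloc(n), so one call covers both cases
    newf = realloc(*fieldp, newmax * sizeof(newf[0]));
    if (newf == NULL)
        return -1;
    *fieldp = newf;
    *maxp = newmax;
    return 0;
}
```

Returning a status instead of exiting leaves the decision about an out-of-memory condition to the caller, in keeping with the error-handling principle stated earlier.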
4.5 Interface Principles
To prosper, an interface must be well suited for its task (simple, general, regular, predictable, robust) and it must adapt gracefully as its users and its implementation change. Good interfaces follow a set of principles. These are not independent or even consistent.
** Hide implementation details **
The implementation behind the interface should be hidden from the rest of the program so it can be changed without affecting or breaking anything.
If the header file does not include the actual structure declaration, just the name of the structure, the result is sometimes called an opaque type, since its properties are not visible and all operations take place through a pointer to whatever real object lurks behind.
Avoid global variables; wherever possible it is better to pass references to all data through function arguments.
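An opaque type and the advice about passing data through arguments fit together naturally in C. The sketch below uses a made-up Counter type, not anything from the CSV library; the two halves would normally live in a header and an implementation file, and are shown in one block only so the example is complete.

```c
#include <stdlib.h>

// ---- what the header would expose: a name and operations, no layout ----
typedef struct Counter Counter;          // opaque: clients see only the name
Counter *counter_new(void);
void counter_add(Counter *c, int n);
int counter_value(const Counter *c);
void counter_free(Counter *c);

// ---- what the implementation file would keep to itself ----
struct Counter {
    int total;                           // free to change without breaking clients
};

Counter *counter_new(void)
{
    Counter *c = malloc(sizeof(*c));

    if (c != NULL)
        c->total = 0;
    return c;                            // NULL tells the caller allocation failed
}

void counter_add(Counter *c, int n)
{
    c->total += n;
}

int counter_value(const Counter *c)
{
    return c->total;
}

void counter_free(Counter *c)
{
    free(c);
}
```

All state travels through the Counter pointer, so there are no globals, and two counters can be used side by side without interfering.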
** Choose a small orthogonal set of primitives **
An interface should provide as much functionality as necessary but no more.
A larger interface is harder to write and maintain.
** Don't reach behind the user's back. **
A library function should not write secret files and variables or change global data, and it should be circumspect about modifying data in its caller.
The use of one interface should not demand another one just for the convenience of the interface designer or implementer. Instead, make the interface self-contained, or failing that, be explicit about what external services are required. Otherwise, you place a maintenance burden on the client. An obvious example is the pain of managing huge lists of header files in C and C++ source; header files can be thousands of lines long and include dozens of other headers.
** Do the same thing the same way everywhere **
Consistency and regularity are important. Related things should be achieved by related means.
No matter what, there is a limit to how well we can do in designing an interface. Even the best interface of today may eventually become the problem of tomorrow, but good design can push tomorrow off a while longer.
4.6 Resource Management
Roughly, the issues fall into the categories of initialization, maintaining state, sharing and copying, and cleaning up.
In C++ and Java, properly defined constructors ensure that all data members are initialized and that there is no way to create an uninitialized class object.
Java uses references to refer to objects, that is, any entity other than one of the basic types like int. This is more efficient than making a copy, but one can be fooled into thinking that a reference is a copy; and this issue is a perennial source of bugs involving strings in C. Clone methods provide a way to make a copy when necessary.
A program that fails to recover unused memory will eventually run out. Much modern software is embarrassingly prone to this fault. Related problems occur when open files are to be closed: if data is being buffered, the buffer may have to be flushed (and its memory reclaimed).
** Free a resource in the same layer that allocated it. **
One way to control resource allocation and reclamation is to have the same library, package, or interface that allocates a resource be responsible for freeing it. The allocation state of a resource should not change across the interface.
Java has built-in garbage collection. As a program runs, it allocates new objects. There is no way to deallocate them explicitly, but the run-time system keeps track of which objects are still in use and which are not, and periodically returns unused ones to the available memory pool.
There are a variety of techniques for garbage collection:
Some schemes keep track of the number of uses of each object, its reference count, and free an object when its reference count goes to zero. This technique can be used explicitly in C and C++ to manage shared objects.
Other algorithms periodically follow a trail from the allocation pool to all referenced objects. Objects that are found this way are still in use; objects that are not referred to by any other object are not in use and can be reclaimed.
Nor is garbage collection free: there is overhead to maintain information and to reclaim unused memory, and collection may happen at unpredictable times.
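The reference-counting scheme mentioned above can be applied by hand in C. What follows is a minimal sketch under simple assumptions: the Shared type and its three functions are hypothetical, the count is not protected against concurrent use, and callers are trusted to pair every retain with a release.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Shared {
    int refs;        // number of current users
    char *text;      // the shared data
} Shared;

// shared_new: create an object holding a copy of s, with one user
Shared *shared_new(const char *s)
{
    Shared *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    p->text = malloc(strlen(s) + 1);
    if (p->text == NULL) {
        free(p);
        return NULL;
    }
    strcpy(p->text, s);
    p->refs = 1;
    return p;
}

// shared_retain: record one more user of p
Shared *shared_retain(Shared *p)
{
    p->refs++;
    return p;
}

// shared_release: drop one user; free the object when none remain.
// Returns the number of users left.
int shared_release(Shared *p)
{
    if (--p->refs > 0)
        return p->refs;
    free(p->text);
    free(p);
    return 0;
}
```

The object is freed by the same small interface that allocated it, which is the layering rule stated earlier in this section.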
All of these problems become more complicated if a library is to be used in an environment where more than one thread of control can be executing its routines at the same time, as in a multi-threaded Java program. To avoid problems, it is necessary to write code that is reentrant, which means that it works regardless of the number of simultaneous executions. Reentrant code will avoid global variables, static local variables, and any other variable that could be modified while another thread is using it. The key to good multi-thread design is to separate the components so they share nothing except through well-defined interfaces. Libraries that inadvertently expose variables to sharing destroy the model.
If variables might be shared, they must be protected by some kind of locking mechanism to ensure that only one thread at a time accesses them. Classes are a big help here because they provide a focus for discussing sharing and locking models. Synchronized methods in Java provide a way for one thread to lock an entire class or instance of a class against simultaneous modification by some other thread; synchronized blocks permit only one thread at a time to execute a section of code.
4.7 Abort, Retry, Fail?
What should a function do if an unrecoverable error occurs? The functions we wrote earlier in this chapter display a message and die. This is acceptable behavior for many programs, especially small stand-alone tools and applications. For other programs, however, quitting is wrong since it prevents the rest of the program from attempting any recovery. A useful alternative is to record diagnostic output in an explicit "log file," where it can be monitored independently.
** Detect errors at a low level, handle them at a high level **
In IEEE floating point, a special value called NaN ("not a number") indicates an error and can be returned as an error signal.
Some languages, such as Perl and Tcl, provide a low-cost way to group two or more values into a tuple. In such languages, a function value and any error state can be easily returned together. The C++ STL provides a pair data type that can also be used in this way.
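C has no built-in tuple, but a small struct returned by value can play the same role. The sketch below is illustrative only; the Result type and safe_divide are hypothetical names.

```c
// Result: a value paired with a flag saying whether the value is valid
typedef struct Result {
    int ok;          // 1 if value is meaningful, 0 on error
    double value;
} Result;

// safe_divide: return a / b together with an error indication
Result safe_divide(double a, double b)
{
    Result r = { 0, 0.0 };

    if (b == 0.0)
        return r;    // error: ok stays 0, caller decides what to do
    r.ok = 1;
    r.value = a / b;
    return r;
}
```

The caller must test the ok member before using value; the function itself neither prints nor exits.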
If the various exceptional values can't readily be separated, another option is to return a single "exception" value and provide another function that returns more detail about the last error.
This is the approach used in Unix and in the C standard library, where many system calls and library functions return -1 but also set a global variable called errno that encodes the specific error; strerror returns a string associated with the error number. On our system, this program:
#include<stdio.h>
#include<string.h>
#include<errno.h>
#include<math.h>
// errno main: test errno
int main(void)
{
double f;
errno=0; //clear error state
f= log(-1.23);
printf("%f %d %s \n",f, errno, strerror(errno));
return 0;
}
prints
nan0x10000000 33 Domain error
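The same convention can be followed in one's own code: detect the problem where it happens, return a single failure value, and leave the detail in errno for the caller to inspect. The sketch below wraps the standard function strtol; parse_long is a hypothetical helper, and EINVAL is a POSIX error code rather than one guaranteed by ISO C.

```c
#include <errno.h>
#include <stdlib.h>

// parse_long: convert s to a long in base 10.
// Returns 0 and stores the result in *out on success.
// Returns -1 on failure with errno set: ERANGE (from strtol) if the
// value does not fit, EINVAL (set here) if s is not entirely a number.
int parse_long(const char *s, long *out)
{
    char *end;
    long v;

    errno = 0;                       // clear error state before the call
    v = strtol(s, &end, 10);
    if (errno == ERANGE)
        return -1;                   // overflow or underflow
    if (end == s || *end != '\0') {
        errno = EINVAL;              // no digits, or trailing junk
        return -1;
    }
    *out = v;
    return 0;
}
```

A caller that gets -1 can pass errno to strerror for a message, retry with different input, or give up; that choice is made at the higher level.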
** Use exceptions only for exceptional situations **
Exceptions are often overused. Because they distort the flow of control, they can lead to convoluted constructions that are prone to bugs. It is hardly exceptional to fail to open a file; generating an exception in this case strikes us as over-engineering. Exceptions are best reserved for truly unexpected events, such as file systems filling up or floating-point errors.
A common source of bugs is trying to use a pointer that points to freed storage. If error-handling code sets pointers to zero after freeing what they point to, this won't go undetected. In general, aim to keep the library usable after an error has occurred.
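A minimal sketch of that habit follows; the Buffer type and buffer_reset are hypothetical names chosen for illustration.

```c
#include <stdlib.h>

typedef struct Buffer {
    char *data;      // allocated storage, or NULL if none
    size_t size;     // bytes available at data
} Buffer;

// buffer_reset: release storage and leave buf in a safe, reusable state
void buffer_reset(Buffer *buf)
{
    free(buf->data);     // free(NULL) is harmless, so a second reset is safe
    buf->data = NULL;    // a stale use now meets NULL, not freed memory
    buf->size = 0;
}
```

After the reset, the structure describes an empty buffer rather than dangling storage, so later code can test for NULL and allocate again.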
4.8 User Interfaces
A diagnostic should not say
estrdup failed
when it could say:
markov: estrdup("Derrida") failed: Memory limit reached
Programs should display information about proper usage when an error is made, as shown in functions like:
// usage: print usage message and exit
void usage(void)
{
fprintf(stderr, "usage: %s [-d] [-n nwords]"
    " [-s seed] [files ...]\n", progname());
exit(2);
}
The text of error messages, prompts, and dialog boxes should state the form of valid input. Don't say that a parameter is too large; report the valid range of values. When possible, the text should be valid input itself, such as the full command line with the parameter set properly. In addition to steering users toward proper use, such output can be captured in a file or by a mouse sweep and then used to run some further process. This points out a weakness of dialog boxes: their contents are hard to grab for later use.
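As a small illustration of reporting the valid range, the sketch below checks the nwords parameter from the usage message. The limits MINWORDS and MAXWORDS and the function check_nwords are assumptions made up for this example; the real program's limits are not given here.

```c
#include <stdio.h>

// assumed limits, for illustration only
enum { MINWORDS = 1, MAXWORDS = 100000 };

// check_nwords: return 0 if n is acceptable; otherwise write a message
// naming the valid range into msg (at most len bytes) and return -1.
int check_nwords(int n, char *msg, size_t len)
{
    if (n >= MINWORDS && n <= MAXWORDS)
        return 0;
    snprintf(msg, len, "nwords %d out of range: must be between %d and %d",
             n, MINWORDS, MAXWORDS);
    return -1;
}
```

The function only builds the text; whether to print it, log it, or show it in a dialog is left to the caller.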
One effective way to create a good user interface for input is by designing a specialized language for setting parameters and controlling actions.
Defensive programming means making sure that a program is invulnerable to bad input.
Object-oriented programming excels at graphical user interfaces, since it provides a way to encapsulate all the state and behaviors of windows, using inheritance to combine similarities in base classes while separating differences in derived classes.