Tuesday, April 17, 2012

The Standard C Library for Linux, Part Seven: String Handling


I finally have time (after a few years) to give back a little more to the Linux community with the next in my series of articles on the Standard C Library.  I hope that you enjoy it.



The last article was on <assert.h>, diagnostics for programmers.  This article is on <string.h>, string handling.  C is not much better at handling strings than machine code, so machine language programmers will feel quite at home in this section.  There are many limitations and problems with <string.h> that will be addressed in the appropriate function descriptions.

I am assuming a knowledge of C programming on the part of the reader.  There is no guarantee of accuracy in any of this information nor suitability for any purpose.
The example is rogers_example07.c.  This is a basic example that will demonstrate each of the string functions.  If you compile it and run it you will be able to see the output.  Compare the output to the code and enjoy.

As always, if you see an error in my documentation please tell me and I will correct myself in a later document.  See corrections at end of the document to review corrections to the previous articles.

WARNING:  Copying strings in C is the most dangerous part of programming in C.  C itself doesn't perform bounds checking, so it is very easy to overwrite the end of a string and actually overwrite other variables or even to crash the program.  Crackers use this weakness in C and inexpert coding practices to perform controlled overflows to force programs into giving them a shell to the account that the program is running under.  This is usually root for most servers.

C doesn't really have strings.  I know that is a strange thing to say in a document talking about string handling in C, but it is true.  What C does have is an array of characters.  To make space for a string you can ask the compiler to reserve room for that string.  The most common way is with a simple character array:

char string[17];
 
This reserves room for 16 characters and an end of string marker.

strcpy ( string, "This is a string" );
 
This will copy the static string "This is a string" into the space that we allocated.  The static string is composed of 16 characters followed by the ASCII NUL character, so the variable called string has exactly enough room to hold it.  NUL is typically written as the number zero, as the character '\0', or as the character '\000'.

Surprisingly, the following will sometimes appear to work as well, even though more than 17 characters are copied into the char array:

strcpy ( string, "This is a long string" );
 
There is never any bounds checking when you are copying strings.  So even though you went past the end of string and wrote to memory in an unexpected way, most of the time you can get away with it.  Of course your program can also unexpectedly crash at any time, sometimes in a place far away from the place where you made your error.

Crackers can get a shell from the computer by overwriting the end of a buffer in such a way that the program executes a shell.  This is one of the reasons that you should really not use strcpy.  Use strncpy instead:

#define  MAX_STRING_LENGTH  17
char string[MAX_STRING_LENGTH];
strncpy ( string, "This is a long string", MAX_STRING_LENGTH );
string[MAX_STRING_LENGTH-1] = '\000';
 
The reason that I used a macro for the string length is that I use this length in many places in my program.  If I ever decided to change the size of the variable string, I would otherwise have to find every place where I used the number 17 and fix each one.  Sometimes you may use the same number in different places to mean different things, so even if you only use a literal number in a few places, a macro makes the meaning of that number stand out, and in this case it makes it trivial to change the size of the string buffer.

The reason that I put the last line there is that if the source string is as long as or longer than the buffer that we are copying into, strncpy doesn't put the end of string marker into place.  If you don't set the final character to '\0' yourself, most of the time you will be fine, but every once in a while your program will crash and you will wonder why.
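The copy-then-terminate pattern above can be wrapped in a small function so that the last line is never forgotten.  This is only a sketch: copy_string is a name I made up for illustration, it is not part of <string.h>.

```c
#include <stddef.h>
#include <string.h>

/* copy_string is a hypothetical helper, not a library function.
   It copies at most size-1 characters of src into dest and always
   writes an end of string marker.
   Returns 1 if all of src fit, 0 if it had to be truncated. */
int copy_string(char *dest, size_t size, const char *src)
{
    if (size == 0) {
        return 0;               /* no room even for the marker */
    }
    strncpy(dest, src, size - 1);
    dest[size - 1] = '\0';      /* strncpy may not have written one */
    return strlen(src) < size;
}
```

Called as copy_string(string, MAX_STRING_LENGTH, "This is a long string"), it leaves a truncated but properly terminated string in the buffer and returns 0 so the caller can tell that something was cut off.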

There is also another way to make space for a string, and that is with malloc, realloc and calloc.  These functions work by requesting the memory that you need at runtime.  This is the most complicated way but also the most flexible and powerful.

#define STATIC_STRING "This is a long string that will be copied into a location during runtime"
char *string;
size_t string_length;
string_length = strlen(STATIC_STRING);
/* ask for one extra byte to hold the end of string marker */
if (!(string = (char *) malloc ( string_length + 1 ))){
   /* no memory left, die */
   exit (1);
}
strncpy( string,  STATIC_STRING, string_length);
string[string_length] = '\000';
/* do something with the string */
free(string);
 

One of the dangers of this method is that you have to clean up after yourself, using the free function.  If you don't free everything when you are done with it then you will be leaking memory, and a long-running program can eventually run out of memory and crash.



The <string.h> library has numerous problems.

The biggest problem is that the library was never designed to be complete and consistent.  <string.h> really is a collection of functions written by various people, assembled into a library and given to the world.  And now we are stuck with it.

Many of the functions, the search functions in particular, return either NULL or a pointer into a string.  You must always check every return value that can be NULL and decide what to do when you get one.  If you attempt to treat a NULL return value as a pointer to a string, you will quickly crash your program.
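Here is a small sketch of what checking for NULL looks like in practice.  The function extension_of is a name I made up for this example; it uses strrchr, which returns NULL when the character is not found.

```c
#include <stddef.h>
#include <string.h>

/* extension_of is a hypothetical helper, not a library function.
   It returns a pointer to the text after the last '.' in filename,
   or an empty string if there is no '.' at all. */
const char *extension_of(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    if (dot == NULL) {
        return "";          /* handle the NULL instead of using it */
    }
    return dot + 1;
}
```

Without the NULL check, a filename such as "Makefile" would make the function compute NULL + 1 and hand back a pointer that crashes the first time it is used.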



I have arbitrarily divided the functions into sections that clump together the functions that act alike.  This grouping could have been done along string/memory function lines, but since there really isn't that much difference between the two sets of functions, I decided that it is easier to see how similar functions work, and their slight differences, if they are right next to each other.




Copying
    #include <string.h>
    void *memcpy(void *dest, const void *src, size_t n);
    void *memmove(void *dest, const void *src, size_t n);
    char *strncpy(char *dest, const char *src, size_t n);
    char *strcpy(char *dest, const char *src);
     
void *dest is a pointer to the array which will receive the copy.
char *dest is a pointer to the string which will receive the copy.
const void *src  is a pointer to the array from which the copy will be made.
const char *src is a pointer to the string from which the copy will be made.
size_t n is the number of characters to be copied.

These functions all return a pointer to dest, which is strange because you already have a pointer to dest.

memcpy copies n characters from the location pointed at by src to the location pointed at by dest.  Don't copy areas that overlap; the result is undefined, and your program may corrupt its data or crash.

memmove also copies n characters from the location pointed at by src to the location pointed at by dest.  It behaves as if the characters were first copied to a temporary location and then into the final location, so this is the function to use if you are copying overlapping areas of memory.

strncpy copies no more than n characters from the string pointed at by src to the location pointed at by dest.  Copying stops after the first null character in src; if that comes before n characters have been copied, the rest of dest is filled with null characters up to n.  If n characters are copied and no null is found, no null is written.  This is a great way to leave the end of a string open, so you should always explicitly write zero to the last position of the destination.

strcpy copies the string pointed at by src to the location pointed at by dest, including the ending null character.  Warning!  Never use this function for data that comes from the real world!  The biggest danger of using this function is that if there is no null character in src you will happily go copying through memory until you randomly find a null or you access memory that doesn't belong to your process and the process is killed with a segmentation fault (SIGSEGV).  Even when src is properly terminated, nothing stops the copy from running past the end of dest.  Programs can catch this signal and shut down, but at this point you are so hosed that it is best just to let the program core dump.

I have already given a few examples of how to use strcpy and strncpy.  memcpy and memmove are called much like strncpy, but they copy exactly n bytes without stopping at a null character, so they can copy arbitrary blocks of bytes, not just strings.
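As a sketch of when memmove is the right choice, here is a function that deletes the first few characters of a string in place.  The source and destination overlap, so memcpy would not be safe.  The name drop_prefix is my own invention for this example.

```c
#include <stddef.h>
#include <string.h>

/* drop_prefix is a hypothetical helper, not a library function.
   It removes the first count characters of s by sliding the rest
   of the string down to the front of the same buffer. */
void drop_prefix(char *s, size_t count)
{
    size_t len = strlen(s);
    if (count > len) {
        count = len;        /* never read past the end of the string */
    }
    /* the +1 moves the '\0' along with the characters */
    memmove(s, s + count, len - count + 1);
}
```

With char buf[] = "This is a string", calling drop_prefix(buf, 5) leaves "is a string" in buf.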




Concatenation
    #include <string.h>
    char *strcat(char *dest, const char *src);
    char *strncat(char *dest, const char *src, size_t n);
char *dest  is a pointer to the string which will receive the copy.
const char *src  is a pointer to the string from which the copy will be made.
size_t n  is the maximum number of characters to be copied from src.

strcat appends the source string, including the final '\0', onto the end of the destination string.  It overwrites the trailing '\0' on the end of the destination string.

strncat does the same, except that it will copy at most n characters from the source string, and it always appends a '\0' after them, so the destination needs room for n+1 more characters.

Both strcat and strncat return a pointer to the destination string.  Again, there is no bounds checking on the resulting string, so make sure that the string you create isn't too long to fit in the memory you have allocated for it.
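Since strncat limits what is copied from the source and knows nothing about the size of the destination, you have to work out the remaining room yourself.  The following is a sketch of one way to do that; append_string is a made-up name, and it assumes dest already holds a properly terminated string.

```c
#include <stddef.h>
#include <string.h>

/* append_string is a hypothetical helper, not a library function.
   size is the total size of the dest buffer in bytes.
   Returns 1 if all of src was appended, 0 if it was truncated. */
int append_string(char *dest, size_t size, const char *src)
{
    size_t used = strlen(dest);
    size_t room;

    if (used >= size) {
        return 0;               /* no room left, not even for '\0' */
    }
    room = size - used - 1;     /* keep one byte for the '\0' */
    strncat(dest, src, room);   /* strncat writes the '\0' itself */
    return strlen(src) <= room;
}
```

Starting with char buf[12] = "Hello", appending ", " succeeds, and then appending "world!" leaves "Hello, worl" in buf and returns 0.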



Comparison
    #include <string.h>
    int memcmp(const void *s1, const void *s2, size_t n);
    int strcmp(const char *s1, const char *s2);
    int strncmp(const char *s1, const char *s2, size_t n);
    int strcoll(const char *s1, const char *s2);
    size_t strxfrm(char *s1, const char *s2, size_t n);
const char *s1 is a pointer to the first string.
const void *s1  is a pointer to the first memory array.
char *s1  (strxfrm only) is a pointer to the array that receives the transformed string.
const char *s2  is a pointer to the second string.
const void *s2  is a pointer to the second memory array.
size_t n  is the number of characters to be compared or, for strxfrm, the size of the receiving array.

memcmp compares the first n bytes of the two memory arrays.  If s1 is less than s2, it returns a value less than zero.  If s1 is equal to s2, it returns zero.  If s1 is greater than s2, it returns a value greater than zero.  The comparison is based on the values of the bytes in the memory arrays, treated as unsigned characters.

strcmp compares the two strings s1 and s2.  A string is a null terminated array of characters.  If s1 is less than s2, it returns a value less than zero.  If s1 is equal to s2, it returns zero.  If s1 is greater than s2, it returns a value greater than zero.  The comparison is based on the values of the characters in the two strings, treated as unsigned characters.

strncmp is very similar to memcmp, except that it compares the two strings up to the length given by n.  If a string is shorter than n, then the memory locations following its null terminator are not compared.  If s1 is less than s2, it returns a value less than zero.  If s1 is equal to s2, it returns zero.  If s1 is greater than s2, it returns a value greater than zero.
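The most common mistake with these functions is forgetting that zero means "equal".  Here is a short sketch that hides the comparison against zero behind two functions with clearer names.  Both names are made up for this example.

```c
#include <string.h>

/* strings_equal is a hypothetical helper, not a library function.
   Returns 1 if the two strings are identical, 0 otherwise. */
int strings_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* has_prefix is a hypothetical helper, not a library function.
   Returns 1 if s starts with prefix, 0 otherwise.  strncmp stops
   at the end of s if s is shorter than prefix. */
int has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}
```

For example, has_prefix("strncmp", "str") is true and has_prefix("memcmp", "str") is false.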

strcoll compares the two strings s1 and s2.   If  s1 is less than s2, return a value less than zero.  If s1 is equal to s2, return zero.  If s1 is greater than s2, return a value greater than zero.  The comparison is based on the locale that is set with the setlocale() function in the <locale.h> library.  I will cover this library in a later article.

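strxfrm transforms string s2 according to the LC_COLLATE category of the current locale, so that comparing two transformed strings with strcmp gives the same result as comparing the originals with strcoll.  It places at most n bytes, including the terminating null, into the array s1.  It returns the length of the whole transformed string, not counting the null.  If the return value is greater than or equal to n, the transformed string did not fit and the contents of s1 should not be used.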
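Because strxfrm reports the length that it needs, a common pattern is to call it twice: once with n set to zero to find the length, and once to do the work.  This is a sketch of that pattern; make_sort_key is a name I made up, and the caller is responsible for freeing the result.

```c
#include <stdlib.h>
#include <string.h>

/* make_sort_key is a hypothetical helper, not a library function.
   It returns a malloc'd copy of s transformed for the current
   locale, or NULL if there is no memory.  Two keys can then be
   compared with plain strcmp. */
char *make_sort_key(const char *s)
{
    /* with n == 0 the destination is allowed to be NULL */
    size_t needed = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(needed);

    if (key == NULL) {
        return NULL;
    }
    strxfrm(key, s, needed);
    return key;
}
```

This is worth the trouble when you are sorting many strings, since each string is transformed once and every comparison after that is a cheap strcmp.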



Search
    #include <string.h>
    void *memchr(const void *s, int c, size_t n);
    char *strchr(const char *s, int c);
    size_t strcspn(const char *s, const char *reject);
    size_t strspn(const char *s, const char *accept);
    char *strpbrk(const char *s, const char *accept);
    char *strrchr(const char *s, int c);
    char *strstr(const char *s, const char *substring);
    char *strtok(char *s, const char *delim);
const void *s is the pointer to the memory array to be searched.
const char *s is the pointer to the string to be searched.
int c is the character to search for.
const char *reject  is a string of characters that must not appear.
const char *accept  is a string of characters to look for.
const char *substring  is the string to search for.
const char *delim  is a string of characters that separate tokens.
size_t n  is the number of characters to be searched.

memchr will search the memory array pointed to by s for character c, up to n characters, returning a pointer to the first location, or NULL if the character is not found in the memory array.

strcspn returns the length of the beginning of the string s that contains no characters in the reject string.

strspn returns the length of the beginning of the string s that contains only characters in the accept string.

strpbrk returns a pointer to the location of the first character in string s that matches any character in the accept string, or a NULL if none of the characters in accept is found in string s.

strchr will search the string pointed to by s for character c, returning a pointer to the first location, or NULL if the character is not found in the string.

strrchr returns a pointer to the location of the last character in string s that matches the character represented by integer c, or a NULL if c is not found in s.

strstr returns a pointer to the location of string substring in string s, or a NULL if the substring is not found in s.
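Here is a short sketch of a few of the search functions at work.  The three function names are my own; they just wrap strchr, strspn and strcspn so that you can see what each one returns.

```c
#include <stddef.h>
#include <string.h>

/* count_char is a hypothetical helper, not a library function.
   It counts how many times c appears in s by calling strchr
   again from just past each match. */
size_t count_char(const char *s, int c)
{
    size_t count = 0;
    const char *p;

    if (c == '\0') {
        return 0;       /* strchr would match the terminator itself */
    }
    p = strchr(s, c);
    while (p != NULL) {
        count++;
        p = strchr(p + 1, c);
    }
    return count;
}

/* leading_digits (hypothetical): length of the run of digits at
   the start of s. */
size_t leading_digits(const char *s)
{
    return strspn(s, "0123456789");
}

/* field_length (hypothetical): number of characters before the
   first ':' in s, or the whole length if there is no ':'. */
size_t field_length(const char *s)
{
    return strcspn(s, ":");
}
```

For the string "This is a string", count_char finds three copies of 's'.  For "root:x:0", field_length returns 4.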

The strtok man page says that there are a lot of problems with this function and says to never use the function.  strtok takes a string and divides it up into tokens.  The first call to the function has string s as its first argument and returns the first token.  After the first call the function is called with NULL as the first argument and the function continues to return each token in turn until a NULL is returned when there are no more tokens.  The delimiter can be changed with each call, or can be kept the same through all the calls.  The limitations of this function are many: the function modifies the original string s, it keeps its place in hidden static storage so you cannot work on two strings at once, the value of the delimiter isn't retained between calls, and the function won't work with constant strings.
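If you do use strtok, the calling pattern looks like the following sketch.  The function split_words is a name I made up; it breaks a line into words in place and stores a pointer to each word.  Note that the line must be a writable array, not a string constant.

```c
#include <stddef.h>
#include <string.h>

/* split_words is a hypothetical helper, not a library function.
   It splits line in place on spaces, tabs and newlines, stores up
   to max token pointers in words[], and returns how many it stored.
   strtok overwrites each delimiter it finds with '\0'. */
size_t split_words(char *line, char *words[], size_t max)
{
    size_t count = 0;
    char *token = strtok(line, " \t\n");    /* first call: the string */

    while (token != NULL && count < max) {
        words[count++] = token;
        token = strtok(NULL, " \t\n");      /* later calls: NULL */
    }
    return count;
}
```

Given char line[] = "copy  the\tstring", the function stores pointers to "copy", "the" and "string" and returns 3.  Runs of delimiters are treated as a single separator, so strtok never returns an empty token.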



Miscellaneous
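    #include <string.h>
    void *memset(void *s, int c, size_t n);
    char *strerror(int errnum);
    size_t strlen(const char *s);

void *s  is a pointer to the memory array to be filled.
int c  is the value to fill the array with.
size_t n  is the number of bytes to fill.
int errnum  is an error number, such as a value of errno.
const char *s  is a pointer to the string to be measured.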

memset fills the first n bytes of memory array s with the value in c, converted to an unsigned char, and returns a pointer to memory array s.

strerror returns a pointer to the string that describes the errnum passed as an argument, or an unknown error string if the errnum isn't known.  This works with various other error related functions in the <stdio.h> and <errno.h> libraries that a future article will have to cover in great depth.

strlen returns the number of characters in string s, not including the '\0' string terminator.



Non Portable Functions
 
The GNU string library has many functions that the Standard C Library doesn't.  The descriptions are cut and pasted from the man pages.  If you want your code to work on any Unix box then don't use these functions.  However, they are a good guide for implementing a function in your own code that is portable.

int strcasecmp(const char *s1, const char *s2);
 
strcasecmp compares the two strings s1 and s2, ignoring the case of the characters.  It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.

int strncasecmp(const char *s1, const char *s2, size_t n);
 
strncasecmp is similar, except it only compares the first n characters of s1.

strcasecmp and strncasecmp return an integer less than, equal to, or greater than zero if s1 (or the first n bytes thereof) is found, respectively, to be less than, to match, or be greater than s2.
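If you need a case-insensitive compare on a system that lacks strcasecmp, it is not hard to write one using tolower from <ctype.h>.  This is a sketch only, under the name my_strcasecmp so that it doesn't clash with the real one.  Like tolower, it follows the current locale.

```c
#include <ctype.h>

/* my_strcasecmp is a hypothetical portable stand-in for the
   non-standard strcasecmp.  The unsigned char pointers keep
   characters above 127 from being passed to tolower as negative
   numbers. */
int my_strcasecmp(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *) s1;
    const unsigned char *b = (const unsigned char *) s2;

    while (*a != '\0' && tolower(*a) == tolower(*b)) {
        a++;
        b++;
    }
    return tolower(*a) - tolower(*b);
}
```

It returns zero for "Linux" and "LINUX", and a negative number for "apple" and "Banana".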

char *strdup(const char *s);
 
I have implemented this function all on my own without knowing about this function!  I learn something new about Linux everyday.

strdup returns a pointer to a new string which is a duplicate of the string s.  Memory for the new string is obtained with malloc(3), and can be freed with free(3).

strdup returns a pointer to the  duplicated string, or NULL if insufficient memory was available.
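A portable version of strdup takes only a few lines.  This is a sketch of one way to write it, under the name my_strdup so that it doesn't clash with the real one.

```c
#include <stdlib.h>
#include <string.h>

/* my_strdup is a hypothetical portable stand-in for the
   non-standard strdup.  It returns a malloc'd copy of s, or NULL
   if there is no memory.  The caller must free the result. */
char *my_strdup(const char *s)
{
    size_t size = strlen(s) + 1;    /* +1 for the '\0' */
    char *copy = malloc(size);

    if (copy != NULL) {
        memcpy(copy, s, size);      /* the length is already known */
    }
    return copy;
}
```

I used memcpy rather than strcpy because the length has already been measured, so there is no need to look for the null a second time.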

char *strfry(char *string);
 
strfry randomizes the contents of string by using rand(3) to randomly swap characters in the string.  The result is an anagram of string.

strfry returns a pointer to the randomized string.

char *strsep(char **stringp, const char *delim);
 
strsep returns the next token from the string *stringp, which is delimited by any of the characters in delim.  The token is terminated with a '\0' character and *stringp is updated to point past the token.  It is similar to the strtok() function, but is non-portable.

strsep returns a pointer to the token.  If no delimiter is found, the rest of the string is the token and *stringp is set to NULL; strsep returns NULL once *stringp is NULL.

char *index(const char *s, int c);
 
index returns a pointer to the first occurrence of the character c in the string s.  You should probably just use the strchr() function; it does the same job in a portable manner.

char *rindex(const char *s, int c);
 
rindex returns a pointer to the last occurrence of the character c in the string s.  The terminating '\0' character is considered to be a part of the string.  Please use the Standard C Library function strrchr(); it does exactly the same job in a portable manner.

index and rindex return a pointer to the matched character or NULL if the character is not found.



Corrections to previous articles:

That's right!  I have finally gotten around to publishing  all the accumulated corrections to my previous articles.  Just look at all the mistakes that I have made!   My thanks to those who took the time to e-mail me after noticing a mistake in my articles.
  Subject:      The Standard C Library for Linux, Part Two
  Date:       Wed, 12 Aug 1998 11:27:08 +0200
  From:       Lars Hesdorf <hesdorf@ibm.net>
Hej James M. Rogers
You wrote somewhere in "The Standard C Library for Linux, Part Two"
"putchar writes a character to standard out.  putchar(x) is the same as
fputc(x, STDIN)"
You probably meant "...fputc(x, STDOUT)
Lars Hesdorf
HESDORF@IBM.NET
Reply:
  Actually I think that I even got the capitalization wrong, I believe that it should be "fputc(x, stdout)"  The example program is correct because I compiled and tested that for correctness.
 
Subject:          The Standard C Library for Linux, Part Two
      Date:          Wed, 04 Aug 1999 21:00:59 +1000
     From:          32000151 <32000151@snetmp.cpg.com.au>
 Organization:   Student of Computer Power Institute
 
Dear Sir,
in The Standard C Library for Linux, Part Two you wrote
"   char *fgets(char *s, int n, FILE *stream);
char *s the string that will hold the result.
int n the maximum number of characters to read.
FILE *stream is an already existing stream.
.
.
.
fgets reads at most n characters from the stream into the string.
    char s[1024];
    FILE *stream;
    if((stream = fopen ("filename", "r")) != (FILE *)0) {
       while((fgets(s, 1023, stream)) != (char *)0 ) {
         <process each line>
       }
    } else {
        <do fopen error handling>
    } "
but fgets() actually reads up to n-1 characters, so it always has room
for the \0 (if n is set to the array size).
Tim McCormack
32000151@bran.snetmp.cpg.com.au
Reply:
   Thanks, I am going to have to make sure that I used this function correctly in my example program.
 
  Subject:  snprintf in Article C Library for Linux?
  Date:  Tue, 01 Sep 1998 17:53:19 +0200
  From: Renaud Hebert <hebert@bcv01y01.vz.cit.alcatel.fr>
I didn't know snprintf, but I think that it is a clever thing to
do to avoid overflowing the string buffer (much better than the evil
sprintf).
But that the first time I see it in a C library, so is-it a Linux only
function or is-it a "new" standard function which wasn't included in
HP-UX for example.
Maybe you could distinguish in your article, the standard library
function and those Linux only.
Anyway this snprintf function is "A good Thing" TM.
Thanks for your articles, they are very well-written and very
informative.
--
__________________________________________________________________
Renaud HEBERT                   CR2A-DI
Software Developer
Reply:
  I thought that it was a GNU only thing, but snprintf has since been added to the C standard (C99), so you only need to avoid it on systems with an older library.  I found a bunch of very useful GNU only string functions and will be taking your advice on pointing out those functions that are only found in Linux.
 
Subject:          Standard C Programming Library Part 3
     Date:          Sun, 20 Sep 1998 09:52:29 -0400
    From:          Laurin Killian <lek@uconect.net>
   Organization:          Streamlined Development
 
Since you ask for corrections....
There are a couple of typos in your examples:
------------you wrote:
float x=99.1234;
sprintf(string, "%d", x)
------------should be...
sprintf(string, "%f", x);
                          ^
------------you wrote:
float x=99.1234;
returnValue=sprintf(string, 4, "%d", x)
------------should be...
returnValue=snprintf(string, 5, "%f", x);
                      ^                    ^        ^
(to get the desired result of "99.1" - you need space for the null char)
All the "scanf" type functions should have ampersands (&):
scanf("%f%2d%d", &float1, &int1, &int2);
Hope this helps
-Laurin
Reply:
  Helps a lot, thank you!
 
Subject:      character handling program
  Date:       Mon, 15 Mar 1999 13:31:41 +0100
  From:       jorgen.tegner@sundsdefibrator.com
Hi,
your code in Linux gazette is missing the setlocale() function call at the
beginning. That´s why you don´t get any
useful results for characters above 127 as programs start out in the C locale by
default. Also, isalpha(), toupper()
and tolower() are not restricted to the A-Za-z range.
Regards,
Jörgen Tegnér
Reply:
  Absolutely right, I am saving setlocale() for when I cover <locale.h>.  :)


Bibliography:

The C Programming Language, Second Edition, Brian W. Kernighan, Dennis M. Ritchie, Prentice Hall Software Series, 1988
The Standard C Library, P. J. Plauger, Prentice Hall P T R, 1992
The Standard C Library, Parts 1, 2, and 3, Chuck Allison, C/C++ Users Journal, January, February, March 1995
STRING(3), BSD MANPAGE, Linux Programmer's Manual
