Saturday, May 31, 2008

How to code

Well, I think this post might start off my tech-blogging spree. That’s not to say much about the original spree. Three posts in two years’ time doesn’t quite fit the definition of one.

This one goes out to all of you out there who have just done 12th standard or typical first year B Tech C or C++ and, in my opinion, have a lot of things to unlearn. Some of you may find the content in this post shocking, some even offensive, but hey, I present facts as they are, and I didn’t write the standards documents.

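Yes, there are standards documents for C and C++: very detailed specifications of what the languages guarantee, and the basis for writing clean, portable C and C++ code. The final ISO documents are sold, but the committees’ working drafts can be downloaded for free from http://www.open-std.org/.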

Before starting, I’d also like to make a few disclaimers:

  • I’ve tried to keep the content of this post as technically accurate and unambiguous as I could. Wherever appropriate, I’ve cited references, and most of the things I mention here are explained in more detail there.
  • I’m a C++ programmer myself, so some of the content in this post might be C++ specific. Though the languages share many things in common, there are many subtle differences. I’ve made efforts to highlight these differences wherever necessary. On the other hand, having used C++ most of my life, when it comes to C, there might actually be differences that I don’t know about. It’s up to the reader to verify such content in this post. (http://www.c-faq.com/ might come in handy.)
  • If you find some errors in this post, please post them as comments. I’ll do my best to verify and post the corrections here as early as possible.

The order in which topics follow is quite random. Most of them are unrelated, and you can probably read the ones that intrigue you the most.

Important terms:

I’m copy-pasting a few definitions with examples from the standards documents here.

Undefined Behavior:

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

For example, signed integer overflow and constructs like a = a++; both have undefined behavior. (Unsigned arithmetic, in contrast, is defined to wrap around.)

Undefined behavior doesn’t mean that your code wouldn’t compile. It simply means that the standards specify no requirements on the behavior of your code. For all you know, your program might start baking cookies.

This is probably a good time to point out that just because your code compiles without errors, it doesn’t mean the code is absolutely right. Going a step further, protests like, “a = a++; seems to be working fine for me,” are meaningless, and are equivalent to saying, “I was playing soccer the other day, and threw the ball into the goal post with my hands, and it worked fine!”. That’s just not the way the game is played.

Try this page, for more information: http://www.eskimo.com/~scs/readings/undef.950311.html
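
Since signed overflow is undefined, the check has to happen before the arithmetic, not after it. Here is a minimal sketch of that idea in C++. The names add_would_overflow and checked_add are made up for this illustration; they aren't from any standard library.

```cpp
#include <climits>

// Hypothetical helper: true if a + b would not fit in an int.
// It never performs the overflowing addition itself.
bool add_would_overflow(int a, int b) {
    if (b > 0) return a > INT_MAX - b;   // INT_MAX - b cannot overflow when b > 0
    if (b < 0) return a < INT_MIN - b;   // INT_MIN - b cannot overflow when b < 0
    return false;                        // adding 0 is always fine
}

// Hypothetical helper: stores a + b in result and returns true,
// or leaves result untouched and returns false if the sum would overflow.
bool checked_add(int a, int b, int& result) {
    if (add_would_overflow(a, b)) return false;
    result = a + b;
    return true;
}
```

Testing the result after the addition (something like a + b < a) is too late: by then the undefined behavior has already happened.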

Unspecified Behavior:

use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance.

For example, the order in which arguments are evaluated in a function call.
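
A small sketch that makes the order visible. The functions first, second, sum and run_demo, and the call_log vector, are all invented for this example.

```cpp
#include <string>
#include <vector>

std::vector<std::string> call_log;   // records which function ran when

int first()  { call_log.push_back("first");  return 1; }
int second() { call_log.push_back("second"); return 2; }
int sum(int a, int b) { return a + b; }

int run_demo() {
    call_log.clear();
    // Both arguments are fully evaluated before sum is entered,
    // but which of first() and second() runs earlier is up to the compiler.
    return sum(first(), second());
}
```

The result is 3 either way, and call_log always ends up with two entries. What you can't rely on is which entry comes first, so code whose correctness depends on that order is broken even if it happens to work with your compiler.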

Implementation Defined Behavior:

unspecified behavior where each implementation documents how the choice is made

An example would be the propagation of the high-order (most significant) bit when a signed integer is shifted right.

Locale-specific Behavior:

behavior that depends on local conventions of nationality, culture, and language that each implementation documents

For example, whether islower() returns true for characters other than a-z is locale-specific.

Further, as far as C or C++ is concerned, the terms byte and char are, for the most part, interchangeable. Consider the following definitions as per the standards:

  • sizeof returns the size of an object or a type, in bytes.
  • A char is a single-byte character. This means that sizeof(char) is 1 by definition.
  • A byte is an addressable unit of data storage large enough to hold any member of the basic character set of the execution environment.

Putting these pieces together tells you that a byte needn’t be exactly 8 bits; the standards only require at least 8. In fact, if I made a standards-compliant C compiler that works with 16-bit Unicode code units, rather than ASCII, my chars would need 16 bits. But sizeof(char) is 1, and sizeof returns size in bytes. That means, for my implementation, a byte is 16 bits of data, not 8 bits. The number of bits in a byte is specified by the macro CHAR_BIT found in limits.h (climits, in C++).

Good code should try not to rely on the number of bits in a byte being 8. In most cases, you should be able to do without such assumptions. Use the CHAR_BIT macro if absolutely necessary.
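
As a quick sketch of what using CHAR_BIT looks like, here is a way to count the bits in a type without assuming 8-bit bytes. The function name bits_in is my own invention.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in the object representation of T (any padding bits included).
template <typename T>
std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;   // bytes times bits-per-byte, no magic 8
}
```

So bits_in<char>() is CHAR_BIT by definition, and bits_in<int>() is whatever your implementation makes it.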

Let's begin, then.

Why main shouldn’t be void:

This is definitely going to be new to somebody who hasn’t seen books other than Kanetkar’s or Sumita Arora’s. void main() is wrong!

The explanation given in these books for such usage is something like, “we do not wish to return any values from main, so we mark it as void.” But main() isn’t any other function, now, is it?

As far as the standards go, main must return int, and only int: (this is with reference to a hosted environment: your C or C++ program runs with the help of an operating system)

5.1.2.2.1, C

The function called at program startup is named main. The implementation declares no prototype for this function. It shall be defined with a return type of int and with no parameters:

 int main(void) { /* ... */ }

or with two parameters (referred to here as argc and argv, though any names may be used, as they are local to the function in which they are declared):

 int main(int argc, char *argv[]) { /* ... */ }

or equivalent; or in some other implementation-defined manner.

The standards clearly specify two valid ways to define main, and they both return int. (And or equivalent simply means that int may be substituted by some other name typedefed to int, or that char *argv[] may be replaced by char **argv and so on.) In fact, in C, a program that defines main in a way that is neither equivalent to one of these forms nor documented by the implementation invokes Undefined Behavior. C++ is stricter still: there, a main that doesn’t return int is ill-formed.

The question is, where does this return value go? Well, the value returned by main goes to the calling system, something that invoked your application. In many cases, this might be the operating system. It may also be some applications written by other programmers like you. The return value of main is a handy way to test whether your program executed correctly, a lot handier than having to look at the error stream of your application, and parsing it to figure out if something went wrong. Generally, a return value of 0 indicates success, and a non-zero value would stand for different error codes.

As an example, let’s say I’m making an installer, and you’ve already made an application to unpack archives. I can simply run your application with necessary arguments, and check the return value to see if your application extracted my archive correctly.
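
Here is a rough sketch of the unpacker’s side of that bargain. Everything in it is hypothetical: unpack_status only checks that the archive can be opened, and unpacker_main stands in for what would be the program’s real main.

```cpp
#include <cstdlib>
#include <fstream>

// Hypothetical stand-in for the real work: it only checks that the archive opens.
int unpack_status(const char* archive_path) {
    std::ifstream archive(archive_path, std::ios::binary);
    if (!archive) {
        return EXIT_FAILURE;   // non-zero: tell the caller something went wrong
    }
    // ... actual unpacking would go here ...
    return EXIT_SUCCESS;       // zero: everything went fine
}

// In the real program this would be int main(int argc, char *argv[]).
int unpacker_main(int argc, char* argv[]) {
    if (argc != 2) {
        return EXIT_FAILURE;   // wrong usage is an error too
    }
    return unpack_status(argv[1]);
}
```

EXIT_SUCCESS and EXIT_FAILURE come from cstdlib (stdlib.h in C) and are the portable way to spell “it worked” and “it didn’t”.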

Sizes of structs:

Consider a struct defined as below:

 struct MyStruct {
     int a;
     char ch;
 };

If you’ve learnt that sizeof(MyStruct) yields 3, that’s wrong. For one thing, the size of an int is specified to simply be the natural size suggested by the architecture of the execution environment. This means that an int can be 2 bytes or 4 bytes, or however many bytes the implementation sees fit.

Now, is sizeof(MyStruct) == sizeof(int) + 1?

The answer is still no, thanks to something called structure padding. Compilers are free to pad structures with extra bytes so that each member sits at an address suited to its type. Generally, structure padding is done in a way as to align the objects with words of the system. On a typical 32 bit system with 4 byte ints, for example, this kind of padding would leave the size of the above struct a multiple of four bytes. Hence, it can very well weigh in at 8 bytes.

See the wikipedia page that deals with this: http://en.wikipedia.org/wiki/Data_structure_alignment

http://www.goingware.com/tips/getting-started/alignment.html might prove to be a good read too.
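
Rather than guessing, you can ask the compiler. A small sketch, where the constant names member_bytes, padding_bytes and ch_offset are mine:

```cpp
#include <cstddef>

struct MyStruct {
    int a;
    char ch;
};

// Bytes occupied by the members themselves.
const std::size_t member_bytes = sizeof(int) + sizeof(char);

// Bytes the compiler added as padding (implementation-specific, possibly 0).
const std::size_t padding_bytes = sizeof(MyStruct) - member_bytes;

// Where ch actually starts inside the struct.
const std::size_t ch_offset = offsetof(MyStruct, ch);
```

The only portable facts here are that sizeof(MyStruct) is at least member_bytes and that ch starts at or after the end of a. The exact amount of padding is precisely the kind of number you shouldn’t hard-code.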

No more conio.h:

There is no such thing as a conio.h as far as standard C or C++ goes. Or a graphics.h. DOS mode graphics are obsolete, and ought to be done away with.

No more clrscr(), either, though system("cls"); on Windows and system("clear"); on Unix-like systems may prove to be alternatives.

But do you really want to clear the screen? On a terminal like Windows’, clearing the screen practically wipes out everything that the user had on his terminal. There is no way to retrieve the information (as far as I know). What if he’d spent the last few decades calculating the first billion gazillion digits of PI? (in which case, he ought to have redirected the output to a file, but, hey, what the heck? This is just an example. Besides, I doubt if a Windows machine could have such uptimes.) Would you wash it all away without even warning the chap? I wouldn’t.

Short circuit evaluation:

The logical and (&&) and or (||) operators operate by what’s known as short circuit evaluation. (The last time I checked, Balagurusamy didn’t know about this.)

What this means is that they guarantee left to right evaluation of operands, plus:

  • The && operator evaluates the expression on the right hand side only if the left hand side evaluated to true.
  • The || operator evaluates the expression on the right hand side only if the left hand side evaluated to false.

Think of it as a kind of optimization. If the left hand side of an && is false, the result is going to be false, so there’s no point in evaluating the right hand side (who knows how long a function call might take, eh?). Same kinda thing goes for ||.

Besides possibly saving some time, there are a few notable consequences to this.

For example, thanks to short circuit evaluation, expressions like:

 b != 0 && a / b < 100
 p != NULL && p->value == 50

and so on become inherently safe.
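
Wrapped into functions, the two guards look like this. Node, quotient_below_100 and holds_50 are names I made up for this sketch.

```cpp
#include <cstddef>

struct Node {          // hypothetical type with a value member
    int value;
};

bool quotient_below_100(int a, int b) {
    return b != 0 && a / b < 100;        // a / b is never evaluated when b == 0
}

bool holds_50(const Node* p) {
    return p != NULL && p->value == 50;  // p is never dereferenced when it is null
}
```

Swap the operands around (a / b < 100 && b != 0) and the protection is gone, because the division would then be evaluated first.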

You can also use handy expressions like:

 strcmp(str1, str2) || cout << "Strings are equal!" << endl;

(I find this a little difficult to read, especially for people who don’t know what short circuit evaluation is, and I avoid it as far as possible.)

As another example, try:

 int i = -1, j = 5;
 int k = ++i && ++j;

Does j get incremented? No.
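
If you’d rather see it than take my word for it, here is the same expression wrapped in a function so that different starting values can be tried. Outcome and evaluate are names invented for this sketch.

```cpp
struct Outcome {
    int i;
    int j;
    int k;
};

Outcome evaluate(int i, int j) {
    int k = ++i && ++j;   // ++j runs only if ++i produced a non-zero value
    Outcome result = { i, j, k };
    return result;
}
```

evaluate(-1, 5) leaves j at 5, because ++i yields 0 and the right hand side is skipped. evaluate(0, 5) bumps j to 6, because ++i yields 1 and the right hand side has to be evaluated.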

(C++ only) structs can have member functions:

As far as C++ goes, structs and classes are the exact same thing, with one difference: members and base classes of a struct are public by default, while those of a class are private. Other than this, possibly small, difference, you can do anything with structs that you can do with classes. Add member functions, inherit from them, anything!
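
A short sketch, with Shape and Rectangle as made-up example types, showing a struct with virtual functions, a constructor, inheritance and even a private section:

```cpp
struct Shape {                       // members are public by default
    virtual ~Shape() {}
    virtual double area() const = 0;
};

struct Rectangle : Shape {           // inheritance is public by default, too
    Rectangle(double w, double h) : width(w), height(h) {}
    double area() const { return width * height; }
private:                             // a struct can still have private members
    double width;
    double height;
};
```

Replace struct with class in both places and you’d have to write public: and : public Shape yourself. That’s the whole difference.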

Why a = a++ is undefined:

This was an example I used earlier in the context of Undefined Behavior. Perhaps I can elaborate a bit more on why this construct invokes Undefined Behavior in this section.

The standards clearly say:

Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be read only to determine the value to be stored.

This pretty much gets all different combinations like

 a = a++;
 a = a++ * ++a;
 A[a] = a++;

and so on out of the way for good.

To understand the statement, however, we need to know about side effects and sequence points.

Side effects are basically changes of the state of the execution environment, like, for example, modifying an object, modifying a file, etc., or calling functions or using operators that involve these kinds of operations. For example, in a simple a++, the increment is a side effect.

Sequence points are points in the execution sequence where all side effects of previous evaluations have taken place, and no side effects of subsequent evaluations will have taken place. The end of a full expression, for example, denotes a sequence point.

In a = a++; consider the two consecutive sequence points, one immediately before and one immediately after the statement. You’re modifying a with an increment, as well as trying to assign a new value to a using the = operator. Two modifications between two consecutive sequence points, undefined behavior.

To be a bit plainer, just know that the exact point of time when the ++ increments a is not specified. It is guaranteed that the increment will occur before the next statement, and that’s about it. This leaves several possible ways in which the expression can be evaluated, all of which are completely valid, as far as the standards are concerned, two of which might be:

  1. Perform the assignment, and then the increment. In this case, the value of a gets incremented by 1.
  2. Store the original value of a in a temporary variable, perform the increment, and then assign the value in the temporary variable back into a. In this case, the value of a remains unchanged after the expression.

This kind of a discussion itself is pointless, since undefined behavior allows any outcome at all, but I’m including it here for those of you who need concrete examples.
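
The cure is simply to say what you mean, one modification per statement. A sketch of two likely intentions (the function names are mine, and I’m guessing at what the original expressions were supposed to do):

```cpp
// Intent behind "a = a++": a should end up one larger.
int incremented(int a) {
    ++a;               // a single modification, finished at the ';'
    return a;
}

// Intent behind "A[a] = a++": store the old a at index a, then advance a.
void store_and_advance(int* A, int& a) {
    A[a] = a;          // read a, write the array element
    ++a;               // modify a in a separate statement
}
```

Each statement now modifies any given object at most once, so there is nothing left for the compiler to pick an order for.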

Why fflush shouldn’t be used on stdin:

 int fflush(FILE *stream);

If stream points to an output stream or an update stream in which the most recent operation was not input, the fflush function causes any unwritten data for that stream to be delivered to the host environment to be written to the file; otherwise, the behavior is undefined.

This simply rules out any possibility of using fflush on input streams. And for good reason. What does flushing an input stream mean, anyway? Does it make sense? Flushing refers to writing out all the left over contents in a buffer. Why would you ‘write out’ contents in an input stream?

To clear the input stream, use a function similar to the one shown below:

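 void clear_stdin() {
     int ch;  /* int, not char: getchar() returns an int so that EOF stays distinguishable */
     while ((ch = getchar()) != '\n' && ch != EOF)
         ;    /* discard everything up to and including the newline */
 }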

More intelligent use of scanf might save you the hassle of having to clear the input stream at all. See http://www.cplusplus.com/reference/clibrary/cstdio/scanf.html and pay close attention to the section named 'Whitespace character'.
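
To show what that whitespace rule buys you, here is a sketch that uses sscanf on a string instead of scanf on stdin, purely so it can be tried without typing anything. The format string rules are the same. The helper name parse_number_and_letter is made up.

```cpp
#include <cstdio>

// Hypothetical helper: reads a number and then a single letter from text such as "42\nY".
bool parse_number_and_letter(const char* input, int& number, char& letter) {
    // The space before %c matches any run of whitespace, newlines included,
    // so %c receives the next non-whitespace character instead of the leftover '\n'.
    return std::sscanf(input, "%d %c", &number, &letter) == 2;
}
```

Without that space, "%d%c" would hand the newline left behind by the number straight to %c, which is usually the very problem people try to fix with fflush(stdin).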

I can access private variables with pointers!:

Such a statement simply betrays all the misconceptions you have about 'data hiding'.

Firstly, an object is a physical entity that resides in your computer's memory, so yes, a little bit of low level code can tell you what's stored in it. So you're working no magic here.

Secondly, when we say 'data hiding', we simply mean that keeping data members in private sections of your classes will save them from being altered accidentally. Keeping data private gives your class a certain level of confidence about the different states it might find the data in at any point of time, because after all, it's just the class that can meddle with the data, right? One needs to understand that such 'data hiding' is simply a kind of contract that you make, which details which ways of working with a class are right, and which aren't. It's up to the users of the class to follow these guidelines while working with objects of the class.
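
A tiny sketch of such a contract. Percentage is a class I made up for this example; its promise is that the stored value always lies between 0 and 100.

```cpp
class Percentage {
public:
    Percentage() : value_(0) {}

    // The only sanctioned way to change the value: out-of-range requests are refused.
    bool set(int value) {
        if (value < 0 || value > 100) return false;
        value_ = value;
        return true;
    }

    int get() const { return value_; }

private:
    int value_;   // invariant: always within 0..100
};
```

As long as everyone goes through set(), the class can count on value_ being sane. Poke at the bytes through a pointer and you haven't beaten anything, you've only broken a promise that the rest of the code relies on.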

What are string literals?

A double quoted string lying around in the code is a string literal.

For example, in cout << "ruggedrat" << endl; "ruggedrat" is a string literal.

What are the consequences?

Try

 char *str = "ruggedrat";

Now, str is a pointer to a literal. This means that any attempt to modify ruggedrat to ruggedbat, like, say, str[6] = 'b'; invokes undefined behavior.

str really points to constant data (in C++, the literal has type const char[10]), and you'd do well to declare it as a const char * too. To create a mutable string, use one of the following:

 char str[10] = "ruggedrat";
 char str[] = "ruggedrat";
 char str[10]; strcpy(str, "ruggedrat");

or some equivalent.
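
Put together as a small sketch (the names name and make_ruggedbat are mine):

```cpp
#include <string>

// Read-only view of the literal: the compiler now rejects name[6] = 'b'.
const char* const name = "ruggedrat";

std::string make_ruggedbat() {
    char str[] = "ruggedrat";   // the array holds its own copy of the characters
    str[6] = 'b';               // perfectly fine: we are modifying the copy
    return std::string(str);
}
```

The array version costs you ten bytes of your own storage, and in exchange the modification is well-defined.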


I think that should do for now. There are a lot of other issues I could have taken up in this post, but my aim is neither to create a complete C++ reference here, nor to break the longest blog post record. This post is just aimed at directing people to good C and C++ coding practices, and showing where to find more information on them.

I'd conclude by saying that the compiler is just an executable. It's not a foe you're supposed to overcome. In fact, you'd do well at programming once you realize it's quite the other way around. Efforts at cheating a compiler defeat the purpose of having one, and better programmers would see them merely as a vulgar display of ineptitude at using the language.

Throw away outdated books, and delete antique compilers. Get the standards documents, and start writing clean applications.

- A Rugged Rat exhausted from all the typing.


P.S. A few neat FAQs to learn from:

1 comment:

Ankit said...

hmm.. that was a really nice post...
and for a change i almost knew everything! :)
except for the stdin thing which actually never thought about.. anyway try using a stringstream.. its helpful when you are working with a lot of data and the data is created during the compile time and then the whole stream is either flushed to stdout/stderr or a file..
and a suggestion abt ur next post: write about Good coding Practices!
thats something a good programmer should noe about..