Sunday, December 31, 2006

Gray Areas In Java

In case you ever wondered what weak and soft references are in Java, here are couple good articles explain what they are and when you might use them:

Tuesday, December 19, 2006

Python vs. Alligator

It sounds like a crazy Sci-Fi Channel B-movie. CNN has an interesting article about pythons invading the Florida Everglades. The middle of the article has the stories about pythons attacking alligators.

Sunday, December 17, 2006

The Denial Machine

The CBC's Fifth Estate has a great documentary, The Denial Machine, available online (in its entirety without commercials) about how Big Tobacco's PR firms and junk scientists from the 1990's are now on the payroll for Exxon Mobil and other oil and coal companies. They are being paid to deny Global Warming and are busy partnering with the Bush White House to censor science and lie to and manipulate the public.

Linear Algebra And Google's PageRank

The American Mathematical Society has a good article, How Google Finds Your Needle in the Web's Haystack, which explains how Google ranks the importance of each web page. There's lots of linear algebra, but its fairly accesible.

[Via Slashdot].

Tuesday, December 05, 2006

Forward Declarations And Faster Compilations

In C++, a Forward Declaration is a declaration of a type that occurs before the definition of that type. For example, suppose you write the small program:
#include <iostream>

int main(void) {
  doSomething();
  return 0;
}

void doSomething(void) {
  std::cout << "Hello World!" << std::endl;
}


This program, won't compile - you'll get an error like dosomething.cc:4: error: `doSomething' undeclared (first use this function) because doSomething() is called in the source code (line 4) before it is visible (line 8 onwards).

The simplest way to fix this is to simply switch the order that main() and doSomething() appear in the file. i.e.
#include <iostream>

void doSomething(void) {
  std::cout << "Hello World!" << std::endl;
}


int main(void) {
  doSomething();
  return 0;
}


Another way to correct the file is to use a forward declaration of doSomething(). i.e.
#include <iostream>

// forward declaration
void doSomething(void);


int main(void) {
  doSomething();
  return 0;
}

void doSomething(void) {
  std::cout << "Hello World!" << std::endl;
}

The (potential) advantage of this approach is that it gives you the ability to layout the order of the file as you see fit (e.g. to make it more readable) while satisfying the compiler (language requirements).

Now consider the following larger example with three classes:
Name.h
#include <string>

class Name {
  public:
    Name(const std::string & firstName, const std::string & lastName);

    std::string getFirstName() const;
    std::string getLastName() const;

  private:
    std::string _firstName;
    std::string _lastName;
};


EmailAddress.h
#include <string>

class EmailAddress {
  public:
    EmailAddress(const std::string & emailAddress);

    std::string getEmailAddress() const;
    std::string getDomain() const;

  private:
    std::string _userName;   // e.g. "JohnSmith" part of "JohnSmith@gmail.com"
    std::string _domain;    // e.g. "gmail.com" part of "JohnSmith@gmail.com"
};


Person.h
#include "EmailAddress.h"
#include "Name.h"

class Person {
  public:
    Person(const Name & name, const EmailAddress & emailAddress);

    const Name & getName() const;
    const EmailAddress & getEmailAddress() const;

  private:
    const Name & _name;
    const EmailAddress & _emailAddress;
};


Person.cpp
#include "Person.h"

Person::Person(const Name & name, const EmailAddress & emailAddress)
  : _name(name), _emailAddress(emailAddress)
{
  // empty
}

// etc.


NameFunctions.cpp
#include <iostream>
#include <vector>
#include "Person.h"

void printFirstNames(const std::vector< Person > & persons) {
  for( std::vector< Person >::const_iterator iter = persons.begin();
      iter != persons.end(); ++iter ) {
    std::cout << (*iter).getName().getFirstName() << std::endl;
  }
}


Notice how Person.h includes both Name.h and EmailAddress.h. Consider what happens when NameFunctions.cpp is compiled. The preprocessor must replace the include of Person.h with the contents of that file. To do this, the preprocessor first recursively includes the contents of Name.h and EmailAddress.h in Person.h since Person.h includes these two files. Therefore, to build NameFunctions.cpp, four files must be loaded and processed - NameFunctions.cpp as well as its includes (Person.h) and the includes' includes (Name.h and EmailAddress.h).

Similarly, consider what happens when you change EmailAddress.h and then run a program like make to recompile any impacted source code. Make will consider all the files that could be affected by this change, of which NameFunctions.cpp is one of those files! (It consumes EmailAddress.h indirectly, as shown above). Likewise, when you are computing the files that depend upon EmailAddress.h, the calculation is non-trivial since you need to traverse from EmailAddress.h to Person.h to NameFunctions.cpp.

So what do all these includes mean for your compilation times?

  • Compiling a source file involves loading a lot of header files into memory (and not just all the ones listed in the source file; but the includes' includes, the includes' includes' includes, etc.), which can be many more files than you would suspect. In a large project, you are going to have a lot of these files, and they will be scattered across your disk and each will be used only some of the time. i.e. The limited locality and large input set won't be favourable to caching and the hard drive could be hit often (which is slow).
  • Dynamically figuring out dependencies is costly, since the graph of related files has a lot of edges. (And this can, again, involve lots of disk accesses).
  • Dynamically figuring out what dependencies have changed since you last ran make is expensive because the dependencies are expansive. (Again, lots of disk accesses).
  • When you change a header file, there will be a cascading effect of rebuilding a lot of source files.

Now, is all this work really necessary? First notice, that NameFunctions.cpp technically depends on EmailAddress.h (since Person.h does), but NameFunctions.cpp only really uses the Person and Name classes (but not the EmailAddress class). Next, notice that the definition of Person in Person.h only uses references to Name and EmailAddress. Hence, the compiler only needs to know that Name and EmailAddress are classes when it processes Person.h, but not the substance of those classes. (As the references are really just memory addresses, i.e. 1 word, the memory layout of Person is independent of the contents and interface of these two classes). Therefore, we can replace the two lines in Person.h:
  #include "Name.h"
  #include "EmailAddress.h"

with the lines:
  // forward declarations
  class Name;
  class EmailAddress;
.
And the start of NameFunctions.cpp must now be changed from
  #include "Person.h" to
  #include "Person.h"
  #include "Name.h"

(as the function in this file invokes Name::getFirstName(), a forward declartion does not suffice, it needs to know the definition of Name).

What are the consequences of these changes? The code still compiles, but faster:
  • Compiling NameFunctions.cpp no longer involves loading EmailAddress.h.
  • Figuring out the dependencies of NameFunctions.cpp is simpler because there are less dependencies (and they are likely more stable).
  • Figuring out what files need to be recompiled is less costly because there are less dependencies to examine.
  • Rebuilding the affected source code after a change to EmailAddress.h is less work because the (falsely) dependent file NameFunctions.cpp no longer needs to be built.

In addition to slower compiler times, using includes in headers (instead of forward declartions) can cause other problems. e.g. Suppose A.cpp uses the class C, but gets it indirectly via including B.h instead of C.h
  • What happens if B.h is changed to no longer include C.h? Then A.cpp no longer compiles, an unfortunate side effect. Hence, the code (without forward declarations) is brittle and not self-documenting (as A.cpp's dependency on C.h is not explicit).
  • Also, dependencies are more likely to be properly calculated. If your dependency calculation is shallow (i.e. it determines that A.cpp depends upon B.h, but not that A.cpp depends upon the dependencies of B.h, such as C.h) What happens if the definition of C changes in C.h? A.cpp will not be recompiled and the resulting program execution is likely to exhibit bizarre behaviour (i.e. if C's v-table was altered).
  • Searching for dependent files by using a command like grep "#include \"C.h\"" will be miss finding A.cpp.