Blog Splash

CSS Minify Algorithm Updates

by Kerido Sunday, November 21, 2010 8:25 AM

Today we have updated our Ultra Fast CSS Minify Algorithm. The updates include:

  • Fixed bugs causing whitespace removed near the percent sign % and both parentheses ().
  • Improved handling of both single-quoted and double-quoted strings.

We're finally committing the changes to the BlogEngine.NET repository and, hopefully, they will be available in some major product release.

An Ultra Fast CSS Minify Algorithm

by Kerido Saturday, January 30, 2010 6:10 AM

Introduction

You've probably thought a lot about ways you can optimize your Web site. Since the release of this new site, I've been thinking about it a lot. In this post I would like to introduce an advanced version of the CSS minify algorithm. There are several existing implementations available, but I took the one integrated into BlogEngine.NET, as a reference. Compared to it, my algorithm offers several benefits:

Concept

As opposed to the built-in CSS minify algorithm which uses regular expressions, the current algorithm is more like a state machine. This is why you might think it's harder to read and debug. This is why I am writing this post.

First, here are valid CSS states that the algorithm uses:

enum CssState
  { Punctuation, Token, StringD, StringS }

The CSS code is always considered Punctuation if a different state does not apply. Curly braces, square brackets, parentheses – this is all Punctuation. The Token state includes all tag names, element IDs and classes, properties and values – anything that consists of alpha-numeric characters plus a limited number of auxiliary characters – .#_-@*%(). And finally, there are two string states. StringD represents a double-quoted string, StringS – single-quoted. I'll cover more about the reason why the latter two are required, in the Handling Strings section.

The algorithm reads the input character by character and determines if a sequence of whitespace characters can be removed from it. It also handles comments pretty much the same way. Here is the code illustrating the idea:

public static string Minify(string theCss)
{
  // Assume that the length of the output string
  // will be at most 75% the length of the incoming one to avoid
  // additional StringBuilder reallocations.
  StringBuilder aRet = new StringBuilder(theCss.Length * 3 / 4);

  int aNumChars = theCss.Length;

  CssState aPrevState = CssState.Punctuation;
  int aPrevPos = 0;

  int i = 0;

  while(i < aNumChars)
  {
    CssState aCurState = GetState(theCss, ref i, aPrevState);

    if(i > aPrevPos + 1)
    {
      // If whitespace is found between two tokens, keep it compact
      if (aPrevState != CssState.Punctuation && aCurState != CssState.Punctuation)
        aRet.Append(' ');

      // Otherwise, no whitespace is needed, skip everything between aPrevPos and i
    }

    aPrevPos = i;
    aPrevState = aCurState;
    aRet.Append(theCss[i++]);
  }

  return aRet.ToString();
}

As you can see, whitespace must not be removed is when it delimits token or string characters. Examples of this case are:

border-bottom: solid 2px #9f4c1f;
background-position: center top;
quotes: '« ' ' »';
quotes: "»" "«" "\2039" "\203A";

In all other cases, the whitespace can be trimmed. Again, there is a special case when whitespace is found inside a string, but we'll cover it in the next section.

The key of the algorithm is the GetState method. It skips any comments and whitespace characters found in the incoming string, updates the position variable i to a value represented by a meaningful state:

static CssState GetState(string theCss, ref int thePos, CssState theCurState)
{
  int aLen = theCss.Length;
  int i = thePos;

  if (theCurState == CssState.StringD)
  {
    //REMOVED FOR COMPACTNESS
  }
  else if (theCurState == CssState.StringS)
  {
    //REMOVED FOR COMPACTNESS
  }


  bool aSkip = true;

  while(aSkip)
  {
    /////////////////////////////////////////
    // Skip whitespace
    while(aSkip = (i < aLen - 1 && IsWhitespaceChar(theCss[i]) ) )
      i++;

    /////////////////////////////////////////
    // Skip comments
    if(i < aLen - 1)
      if (theCss[i] == '/' && theCss[i+1] == '*') // comment opening
      {
        aSkip = true;

        while(i < aLen - 1)
        {
          i++;

          if(theCss[i-1] == '*' && theCss[i] == '/') // comment closing
          {
            i++;
            break;
          }
        }
      }
  }

  thePos = i;
  if ( IsTokenChar( theCss[i] ) )
    return CssState.Token;

  else if ( theCss[i] == '\"' )
    return CssState.StringD;

  else if(theCss[i] == '\'')
    return CssState.StringS;

  else
    return CssState.Punctuation;
}

For for the sake of saving space I have removed several lines, so let's move on to see what they are meant for.

Handling Strings

According to the CSS specification, two types of strings are supported: single-quoted and double-quoted. My algorithm leaves the string intact, even if whitespace characters are found inside it. Of course, for the HTML language it may not matter at all since all whitespace characters will be merged by the browser. But I just think that keeping the string unmodified is more intuitive and professional. In order achieve that, support needs to be added for escaped quote characters. An escaped single quote character inside a single-quoted string does not close the string. Similarly, an escaped double quote character does not close a double quoted string:

if (theCurState == CssState.StringD)
{
  if(theCss[i] == '\"')
  {
    // Make sure the double quote character is not escaped
    if(thePos > 0)
      if(theCss[i-1] == '\\')
        return CssState.StringD;

    // Enforce a whitespace afterwards
    return CssState.Token;
  }
  else
    return CssState.StringD;
}
else if (theCurState == CssState.StringS)
{
  if(theCss[i] == '\'')
  {
    // Make sure the single quote character is not escaped
    if(thePos > 0)
      if(theCss[i-1] == '\\')
        return CssState.StringS;

    // Enforce a whitespace afterwards
    return CssState.Token;
  }
  else
    return CssState.StringS;
}

There is a special example attribute in the reference CSS file that illustrates the differences in the way the two algorithms work:

angledouble: "Angle=             00deg00'00\"      ";
anglesingle: 'Angle=             00deg00\'00"      ';

Although these attributes make no sense to a browser, they represent valid CSS syntax. My version produces the same string while the BlogEngine.NET version trims several spaces from it. I consider this a flaw even though string length gets reduced.

Performance Testing

The suggested algorithm provides both speed and size wins. The code below lists two methods I used to measure output length and processing time of the two algorithms:

[TestMethod]
public void CssMinifyTest_Length()
{
  string aCss = Properties.Resources.style;

  string aMin1 = CssMinifier.Minify(aCss);
  string aMin2 = CssHandler.StripWhitespace(aCss);

  Console.WriteLine(
    string.Format("KO Software output ({0} bytes):", aMin1.Length) );
  Console.WriteLine(aMin1);

  Console.WriteLine("---------------------------------");

  Console.WriteLine(
    string.Format("BlogEngine.NET output ({0} bytes):", aMin2.Length) );
  Console.WriteLine(aMin2);

  Assert.IsTrue(aMin1.Length <= aMin2.Length);
}

[TestMethod]
public void CssMinifyTest_Speed()
{
  int aNumCycles = 50000;
  string aCss = Properties.Resources.style;

  DateTime aStart = DateTime.Now;
  for(int i = 0; i < aNumCycles; i++)
  {
    string aMin = CssMinifier.Minify(aCss);
  }
  DateTime aEnd = DateTime.Now;

  TimeSpan aTime_AspxGear = aEnd - aStart;
  Console.WriteLine(aTime_AspxGear.ToString());


  aStart = DateTime.Now;
  for(int i = 0; i < aNumCycles; i++)
  {
    string aMin = CssHandler.StripWhitespace(aCss);
  }
  aEnd = DateTime.Now;


  TimeSpan aTime_BE = aEnd - aStart;
  Console.WriteLine(aTime_BE.ToString());


  Assert.IsTrue(aTime_AspxGear <= aTime_BE);
}

The results are illustrated in the table below:

KO Software CSS Minifier BlogEngine.NET CSS Minifier
Output size, bytes 6,888 7,274
50,000 iteration processing time (Debug build) 33 seconds 1 minute and 25 seconds
50,000 iteration processing time (Release build) 15 seconds 1 minute and 20 seconds

Conclusion

The suggested algorithm produces output which is smaller by almost 400 bytes. And it runs more than 5 times faster! Interestingly, the Release build is more than two times faster than the Debug one while the BlogEngine.NET version yields only subtle performance boost. I suspect, this is because the latter uses regular expressions which are already built with most possible optimizations, regardless of our build type.

I am going to contact the BlogEngine.NET team and e-mail my code as a patch to their source code repository. I think, if this code is properly tested in various environments, it can surely improve web site stability and reduce traffic. As an old school programmer, I am still interested in ways of improving the code. So please, download the source code and leave comments with bugs and improvements.