6

Suppose I have very long strings and I want to see if a column is allLower, allUpper, or mixedCase. For example with the following column

text
"hello"
"New"
"items"
"iTem12"
"-3nXy"

The text would be mixedCase. A naive algorithm to determine this might be:

int is_mixed_case, is_all_lower, is_all_upper;
int has_lower = 0;
int has_upper = 0;
// for each row...for each column...
for (int i = 0; (c=s[i]) != '\0'; i++) {
    if (c >='a' && c <= 'z') {
        has_lower = 1;
        if (has_upper) break;
    }
    else if (c >='A' && c <= 'Z') {
        has_upper = 1;
        if (has_lower) break;
    }
}

is_all_lower = has_lower && !has_upper;
is_all_upper = has_upper && !has_lower;
is_mixed_case = has_lower && has_upper;

I'm sure there would be a more performant way to do this, however. What might be the most efficient way to do this algorithm/calculation?

  • 6
    It's mixed case as soon as you've found one character in each case, which could be as few as two characters in total. So if you iterate over the whole string, you may be wasting a lot of time. – jonrsharpe Aug 27 '19 at 21:41
  • @jonrsharpe -- thanks for the suggestion, I've updated it with the break. –  Aug 27 '19 at 21:42
  • 1
    Get the case of the first character. Then use `strspn()` to search for a character in the opposite case. – Barmar Aug 27 '19 at 21:48
  • If your strings had explicit length, this would be very easy to vectorize, which in turn will cause the code to max out cache/memory bandwidth on reasonable machines. – EOF Aug 27 '19 at 21:49
  • `strspn()` should be optimized as well as possible. – Barmar Aug 27 '19 at 21:50
  • Can't you just compare the string to lowercase and uppercase versions of itself, and if neither matches then it's mixed case. – David Aug 27 '19 at 21:50
  • 1
    `(c >='a' && c <= 'z')` is not portable. Use [`isupper()`](https://port70.net/~nsz/c/c11/n1570.html#7.4.1.11) and [`islower()`](https://port70.net/~nsz/c/c11/n1570.html#7.4.1.7). – Andrew Henle Aug 27 '19 at 21:50
  • @David how would that be faster? – EOF Aug 27 '19 at 21:52
  • 2
    Also... What [character set](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) are we talking about? Things get reeeeally hairy with Unicode... ;) – Vilx- Aug 27 '19 at 22:11
  • If you know you're dealing only with ASCII letters, the fastest way to check case is just `& 0x20`, and you can trivially vectorize that using larger-size integers (`memcpy` will prevent UB while generating the same code). Note also that you can just check whether the parity is even/odd like the low bit of the length. (*Should* you do this? That's a whole different question). – o11c Aug 27 '19 at 22:15
  • possible duplicate: [C programming: How to check whether the input string contains combination of uppercase and lowercase](https://stackoverflow.com/q/15171740/995714) – phuclv Nov 29 '19 at 16:17

6 Answers6

5

If you know the character encoding that's going to be used (I've used ISO/IEC 8859-15 in the code example), a look-up table may be the fastest solution. This also allows you to decide which characters from the extended character set, such as µ or ß, you'll count as upper case, lower case or non-alphabetical.

char test_case(const char *s) {
    static const char alphabet[] = {
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  //  ABCDEFGHIJKLMNO
        1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,  // PQRSTUVWXYZ
        0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  //  abcdefghijklmno
        2,2,2,2,2,2,2,2,2,2,2,0,0,0,0,0,  // pqrstuvwxyz
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,1,0,2,0,2,0,0,0,0,  //        Š š ª
        0,0,0,0,0,1,2,0,0,2,0,2,0,1,2,1,  //      ޵  ž º ŒœŸ
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  // ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
        1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,  // ÐÑÒÓÔÕÖ ØÙÚÛÜÝÞß
        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  // àáâãäåæçèéêëìíîï
        2,2,2,2,2,2,2,0,2,2,2,2,2,2,2,2}; // ðñòóôõö øùúûüýþÿ
    char cases = 0;
    while (*s && cases != 3) {
        cases |= alphabet[(unsigned char) *s++];
    }
    return cases; // 0 = none, 1 = upper, 2 = lower, 3 = mixed
}

As suggested in a comment by chux, you can set the value of alphabet[0] to 4, and then you need only one condition cases < 3 in the while loop.

m69 is disappointed in SE
  • 11,464
  • 3
  • 30
  • 52
  • could you please explain a bit what's going on in the code above? I don't really understand how it works, etc. –  Aug 28 '19 at 01:37
  • @Shared For every character (I've assumed 8-bit extended ASCII here) you store a value of 1 if it is uppercase, 2 if it is lowercase, or 0 if it is a non-alphabetical character. Then, to check a string, you look up the value for each character, and OR it with the variable `cases`, which was set to 0 at the start. When you encounter the first letter, `cases` will be set to 1 or 2, depending on the case. If you later encounter a letter of the other case, `cases` will be set to 3, and you know the string has mixed case. – m69 is disappointed in SE Aug 28 '19 at 02:17
  • @mr69 -- wow that's a really cool approach. I'd be interested to see if this would outperform using the standard lower/upper on tons of strings of text input. –  Aug 28 '19 at 02:21
  • 1
    @Shared Often the performance of look-up tables depends a lot on their size. You may find strange jumps in timing if you add or remove a handful of elements. If you don't care about extended characters and reduce the table to 128 characters, or skip the first 32 control characters and use `*s++ -32` that may improve the speed. It's all about low-level optimizations that the compiler can do. – m69 is disappointed in SE Aug 28 '19 at 02:26
  • 4
    This is a good idea, yet can be improved. Change `alphabet[0] = 4;` then `(*s && cases != 3)` to `(cases < 3)`. This eliminates the repeated `*s` test. Then `return cases & 3;`. – chux - Reinstate Monica Aug 28 '19 at 13:31
  • @chux Good call. Every instruction is one too many when micro-optimizing :-) – m69 is disappointed in SE Aug 28 '19 at 16:46
2

This should be fairly efficient - it checks the minimum number of characters necessary. This assumes a bias towards lower-case characters, so checking for lower-case first should be slightly more efficient:

#include <ctype.h>

int ismixed( const unsigned char *str )
{
    int hasUpper = 0;
    int hasLower = 0;

    while ( *str )
    {
        // can't be both upper and lower case
        // but it can be neither
        if ( islower( *str ) )
        {
            hasLower = 1;
        }
        else if ( isupper( *str ) )
        {
            hasUpper = 1;
        }

        // return true as soon as we hit
        // both upper and lower case
        if ( hasLower && hasUpper )
        {
            return( 1 );
        }

        str++;
    }

    return( 0 );
}

Depending on whether your input is biased to lower or upper case, checking isupper() first might be better.

Andrew Henle
  • 1
  • 3
  • 24
  • 53
  • Shame there isn't a `charclassify()` like there is a `fpclassify()`, because then you could do `cc = charclassfy(*str); hasX |= (ISX & cc); hasY |= (ISY & cc);`. – EOF Aug 27 '19 at 22:12
  • @Andrew -- thanks for this answer. From a comment above (`If you know you're dealing only with ASCII letters, the fastest way to check case is just & 0x20`), does the `islower` / `isupper` use the bitwise operation? I guess my question is how is the `islower` better than the naive implementation I've used in the question above? –  Aug 27 '19 at 22:31
  • @Shared The C standard does not specify ASCII encoding. The functions defined by `ctypes.h` will be properly portable to however the characters are encoded. – Christian Gibbons Aug 27 '19 at 22:36
  • @Shared - in general you will find that most compiler implementations of any given function will be far more optimized and far less error prone than what you come up with on the fly. (especially for long strings where many library functions cast to the dereferenced string to unsigned in order to be able to check 4-bytes per-iteration rather than one). A good test is to dump to assembly and then compare from an instruction standpoint. – David C. Rankin Aug 27 '19 at 22:52
  • You could improve the micro-efficiency of the code on mixed alpha and non-alpha strings by testing, for example, `if (hasUpper) return 1;` in the `if (islower((unsigned char)*str)` code block, before the assignment to `hasLower`. This saves unnecessary testing when the character is non-alpha. Also, the question asks for 'all upper' vs 'all lower' vs 'mixed' (and implicitly 'no alpha'). – Jonathan Leffler Aug 28 '19 at 07:08
2

If we assume ASCII

If we assume all alpha,

Then code only needs to count the "case" bits. Is the sum 0, same as string length or otherwise?

void test_case(const char *s) {
  const char *start = s;
  size_t sum = 0;
  size_t mask = 'A' ^ 'a';
  while (*s) {
    sum += *s++ & mask;
  }
  ptrdiff_t len = s - start;
  sum /= mask;
  if (len == 0) puts("Empty string");
  else if (sum == 0) puts("All UC");   
  else if (sum == len) puts("All LC");
  else puts("Mixed");
}

Note: with slight mods, will work for EBCIDIC too.

chux - Reinstate Monica
  • 127,356
  • 13
  • 118
  • 231
1

Is said string guaranteed to only contain letters? If so, could check to see if any two consecutive characters are different cases.

#include <ctype.h>
#include <errno.h>
int mixed_case(const char *str) {
   if(!str){
      // sanity check
      errno = EINVAL;
      return -1;
   }

   // can't be mixed-case without more than one letter
   if(str[0] == '\0' || str[1] == '\0'){
      return 0;
   }

   for(int i = 1; str[i] != '\0' ; ++i) {
      if (!islower(str[i]) ^ !islower(str[i-1])) {
         // if two letter next to each other are not the same case, it's mixed case
         return 1;
      }
   }
   // didn't find any mismatches, so not mixed case
   return 0;
}

Taking a similar approach, but instead of checking consecutive characters, it will find the first alphabetical character and check it against any other alphabetical characters found. This should be able to handle strings with non-alphabetical characters.

int mixed_case(const char *str) {
   if(!str){
      // sanity check
      errno = EINVAL;
      return -1;
   }

   // can't be mixed-case without more than one letter
   if(str[0] == '\0' || str[1] == '\0'){
      return 0;
   }

   // find the first alphabetical character and store its index at 'i'
   int i = 0;
   for(;!isalpha(str[i]) || str[i] == '\0'; ++i);

   if(str[i] == '\0') {
      // no alphabetical characters means you can't have mixed cases
      return 0;
   }

   // See if any of the other alphabetical characters differ from the case of the first one
   for(int j = i+1; str[j] != '\0' ; ++j) {
      if(isalpha(str[j]) && (!islower(str[i]) ^ !islower(str[j]))) {
         return 1;
      }
   }
   // didn't find any mismatches, so not mixed case
   return 0;
}
Christian Gibbons
  • 4,199
  • 1
  • 14
  • 29
  • that's an interesting approach, but yes I suppose a file could be named something like FileVersion-1.txt so it could have non-letters (added an example to the question). –  Aug 27 '19 at 22:40
  • @Shared Updated with another version that should handle non-alphabetical strings. Not sure how it would stack up efficiency-wise versus a more traditional approach. – Christian Gibbons Aug 27 '19 at 22:59
  • 1
    `islower(str[i]) ^ islower(str[j])` is not certainly correct. `islower()` results in an `int`, non-zero value when lower case. Yet the value returned by `islower(str[i])` may differ from `islower(str[j])` even though both are lower case. `^` does a bit-wise exclusive OR here and what is needed is a logical exclusive OR. Suggest `!islower(str[i]) ^ !islower(str[j])` to effect a logical exclusive OR. – chux - Reinstate Monica Aug 28 '19 at 00:10
  • @chux Good catch. Updated. – Christian Gibbons Aug 28 '19 at 14:48
0

Another approach that does not assume ASCII nor all alpha.

Assess the first char and then perform one of 2 optimized loops.

This quits the loops on the first mis-match. Since the while() loops are only doing a single test, this leads to optimal performance.

#include <ctype.h>

void case_test(const char *s) {
  if (*s == '\0') {
    puts("Empty string");
    return;
  }

  unsigned char *us = (unsigned char *)s; // use unsigned char with is***() functions.
  if (islower(*us)) {
    while (islower(*us)) {
      us++;
    }
    if (*us) {
      puts("Mixed or not alpha");
    } else {
      puts("All lower");
    }
  } else if (isupper(*us)) {
    while (isupper(*us)) {
      us++;
    }
    if (*us) {
      puts("Mixed case or not alpha");
    } else {
      puts("All upper");
    }
  } else {
    puts("Not alpha");
  }
}

OP added cases including non-alpha. The below promptly handles that.

void case_test_with_non_letters(const char *s) {
  unsigned char *us = (unsigned char *)s; // use unsigned char with is***() functions.

  //  Find first alpha or null character
  while (!isalpha(*us) && *us) {
    us++;
  }

  if (*us == '\0') {
    puts("Empty string");
    return;
  }

  if (islower(*us)) {
    while (!isupper(*us) && *us) {
      us++;
    }
    if (isupper(*us)) {
      puts("Mixed");
    } else {
      puts("All letters lower");
    }
  } else if (isupper(*us)) {
    while (!islower(*us) && *us) {
      us++;
    }
    if (*us) {
      puts("Mixed case");
    } else {
      puts("All letters upper");
    }
  } else {
    puts("Not alpha");
  }
}
chux - Reinstate Monica
  • 127,356
  • 13
  • 118
  • 231
-1

97 = a = 1100001 65 = A = 0100001

You have just to test the bit number 6.