8

I need to scrub a column of names in Excel to eliminate all non-Alpha characters including periods, commas, spaces, hyphens and apostrophes.

EXAMPLE: Change O'Malley-Smith, Tom, Jr. to OMALLEYSMITHTOMJR

The client requires this to be an Excel function, otherwise I'd make it easy with a quick Java program similar to replaceAll("[^a-zA-Z]", "").toUpperCase(). I cannot seem to find anything that looks like an off-the-shelf function to do this outside of a whole mess of SUBSTITUTE functions - which only seem to be available one-per-cell.

I'm not terribly fluent with developing custom macros if that's what I need.

Community
  • 1
  • 1
dwwilson66
  • 6,576
  • 27
  • 69
  • 114

7 Answers7

10

Indeed a mess of SUBSTITUTEs but within a single cell is possible, eg:

=UPPER(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1," ",""),",",""),"'",""),".",""),"-",""))   

Of course may need to be 'extended' to cover other non-alpha characters.

pnuts
  • 56,678
  • 9
  • 81
  • 133
  • Clever formulaic approach. – brettdj Mar 20 '15 at 03:42
  • 1
    kind of you, but formulas aren't really my thing. Whereas I don't think Barry has any peers, he is simply the best fullstop. – brettdj Mar 21 '15 at 03:37
  • Great solution for removing limited unique non-numeric strings. Worked for me when I had to convert keys such as "14D", "8M", "9hr", and "23min" to numeric values, by simply using "D", "M", "hr", "min" as the subs. – e_i_pi Apr 20 '20 at 06:31
  • Excellent solution for eliminating few characters. Thanks – YosiN Jul 31 '21 at 15:29
10

I had a similar need sometime ago and found something that worked great.

Press Alt+F11 to open the Visual Basic editor. Insert a new Module and paste the following code.

Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long

    For n = 1 To Len(Rng)
        Select Case Asc(Mid(UCase(Rng), n, 1))
            Case 48 To 57, 65 To 90
                strTemp = strTemp & Mid(UCase(Rng), n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function

CleanCode now is new function and you can use it as a formula.

So next to the cell with the string you want to manipulate just copy =CleanCode(yourcell)

aurezio
  • 152
  • 1
  • 1
  • 9
8

If you wanted to go down the VBA route - you couldn't use a User-Defined Function (UDF) to change the value of the cell you're entering the function into - but you could use a simple macro and take advantage of Microsoft's VBScript RegEx Engine:

Sub SO()

Dim searchRange     As Excel.Range
Dim cell            As Variant
Dim RegEx           As Object

Set RegEx = CreateObject("VBScript.RegExp")

With RegEx
    .Pattern = "[^a-zA-Z]"
    .Global = True
    .MultiLine = True
End With

Set searchRange = ActiveSheet.Range("A1:D5") '// Change as required

    For Each cell In searchRange.Cells
        If RegEx.test(cell) Then cell.Value = RegEx.Replace(cell.Value, vbNullString)
        cell.Value = UCase(cell.Value)
    Next cell

Set searchRange = Nothing
Set RegEx = Nothing

End Sub
SierraOscar
  • 17,307
  • 5
  • 38
  • 67
  • 1
    Regexp is most efficient VBA method. Though would use this in a variant array rather than range – brettdj Mar 20 '15 at 03:42
4

This can be done with a single formula in Excel 2016 onwards.

While pnuts' solution lists explicit characters to strip, this solution lists explicitly valid characters.

Assume your dirty data is in column A. Assume you want your clean data in column B.

Use the very last formula below, in cell B1. To enter the formula into cell B1, do the following:

  • Click on cell B1
  • Click into the formula bar
  • Paste the formula
  • Press CTRL+Shift+Enter <-- important step

Copy cell B1 and paste it down column B as far as you need.

First, here is a short example to explain what's going on:

=TEXTJOIN("",TRUE,

IFs(
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "t", "t",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "e", "e",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "s", "s",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "T", "T",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "E", "E",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "S", "S",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "2", "2",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = " ", " ",
    true, ""
  )

)

In this case I have specified the following characters as valid: t, e, s, T, E, S, 2 and the space character.

Obviously you need to extend the formula out to the full alphabet, listing each upper and lowercase character separately. Likewise, extend it out to include numbers if you want to keep those. Note that numbers are coded as strings.

How this works:

In a nutshell we are splitting the source string into an array of separate characters, then for each character, checking whether it is in our set of valid characters and including it in the result if it is, otherwise replacing it with an empty string if it is not.

The 'IFS' function goes through its arguments a pair at a time. If the first argument evaluates as true, then the second argument is returned. If not, it proceeds with the next pair of arguments - this is why you see the letters listed twice on each row. The last pair of values in the IFS function is the set 'true' and the empty string. This is saying if we get to the end of the set of valid values (ie and haven't matched a valid value) then return the empty string.

More background on why this works:

This is a variation on a solution given at ExcelJet. In that solution the TEXTJOIN function is used (to concatenate the values of an array) with the INDIRECT function (which splits the string into an array) together with a mathematical operator (the plus symbol) to force the evaluation of a calculation between every character in the string with a numerical value. Numerical characters in the string will return numerical values while other characters will return an error. That solution uses the function ISERR to check for an error to decide whether or not to include a given character in the final output. A similar article exists there to work the other way around - to exclude the numbers and keep the letters.

The problem I wanted solved is for the coder to decide which values are valid and which are not. I went through trying to incorporate VLOOKUP and INDEX functions with the INDIRECT function but they will only work on the first character in the string. The trick is that not all functions will act on the output of INDIRECT in such a way as to evaluate every element in the array (ie. every character in the string). The secret was that ExcelJet used a mathematical operator. If you check Microsoft's full function reference, IFS is categorised as a "logic" function. I suspect logic functions can be used with INDIRECT in this way.

(Note: I also attempted to use AND, and OR in various combinations. However, the INDIRECT evaluates all characters in the string. So for example, using the CODE function to gain each character's ASCII value and asserting all characters must have values between 65 - 90 (uppercase) or 97 - 122 (lowercase letters) will work only if all characters in the string are uppercase, or all are lowercase, but not if there is a mixture.)

I don't know how the performance of this solution compares with the earlier suggestion using SUBSTITUTE. If you only want to strip out a few characters, I recommend the SUBSTITUTE solution. If you want to specify explicitly the valid characters to keep (which was the original question), use this one.

Finally, here is the exact answer you need, including the conversion to uppercase which you didn't note in the question, but displayed in your example. (For anyone else not wanting the uppercase conversion, remove the instances of 'UPPER' from this example, then add the alphabet again to the list, in lowercase, and be sure to leave the 'true'/empty string pair as the last entry in the list.)

=TEXTJOIN("",TRUE,
IFs(
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "A", "A",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "B", "B",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "C", "C",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "D", "D",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "E", "E",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "F", "F",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "G", "G",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "H", "H",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "I", "I",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "J", "J",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "K", "K",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "L", "L",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "M", "M",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "N", "N",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "O", "O",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "P", "P",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "Q", "Q",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "R", "R",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "S", "S",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "T", "T",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "U", "U",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "V", "V",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "W", "W",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "X", "X",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "Y", "Y",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "Z", "Z",
    true, ""
)
)

With the original question being "eliminate all but alpha characters" - this answer does the trick in a formula without the need for VBA.

2

Another VBA solution

Sub RemoveCrap()
    Dim varRange As Range
    Dim varWorkRange As Range

    Set varWorkRange = Range("A1:A10")

    For Each varRange In varWorkRange
        varVal = ""
        For i = 1 To Len(varRange.Value)
            varTemp = Mid(varRange.Value, i, 1)
            If Not (varTemp Like "[a-z]" Or varTemp Like "[A-Z]") Then
                varStr = ""
            Else
                varStr = UCase(varTemp)
            End If
            varVal = varVal & varStr
        Next i
        varRange.Value = varVal
    Next
End Sub
Pankaj Jaju
  • 5,203
  • 2
  • 24
  • 39
  • @user3415869 - The concept will work but not the code because Google Sheets uses Google Apps Script whereas MS Excel uses VBA – Pankaj Jaju Nov 22 '19 at 04:05
2

Can't respond directly to aurezio as just joined and not enough reputation. I really like this solution and found it very efficient and short/simple for my requirements -- which is to clean up text fields to output a filtered character set to columns for filename and web coding output needs.

Although not directly answering the question, this page was the closest I found to the solution I needed so I wanted to give others more context and expand on aurezio's solution for a more generic character selection - in case anyone is interested.

Row 7 relates to the character number or number series to allow. Use excel function CODE() to identify a char's code and CHAR() to post the char code in.

My criteria (as shown below) was to filter all but 45 (dash), 48 To 57 (numeric series), 65 To 90 (alphanumeric uppercase series), 97 To 122 (alphanumeric lowercase series). Also, not requiring uppercase means UCase() can be removed.

'based off aurezio's solution
Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long

    For n = 1 To Len(Rng)
        Select Case Asc(Mid(Rng, n, 1))
            Case 45, 48 To 57, 65 To 90, 97 To 122
                strTemp = strTemp & Mid(Rng, n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function

Eventually I'd like to improve it to make it dynamic eventually and allow input arguments of the filter(s) required. eg ClearCode(Range,"45", "48-57", "65-90", "97-122")

dsdrk
  • 21
  • 2
1

further to youcantryreachingme's answer above, and consolidating it with the advice from https://exceljet.net/formula/strip-numeric-characters-from-cell using SEQUENCE & LET functions (available in the more recent editions of Excel), then further experimenting, I've been able to get the function down to...

=LET(character,MID(a1,SEQUENCE(LEN(a1)),1),TEXTJOIN("",TRUE,
 IFS(
     character=" "," ",
     CODE(UPPER(character))>90,"",
     CODE(UPPER(character))<65,"",
     TRUE,character)
 ))

As IFS seems to only execute the 1st statement matched, it is important to place any exceptions you want to include that are outside the range excluded (>90, <65), before that range is declared. It seems more intuitive to use an 'AND' statement to combine the range limits, but this doesn't work for the reasons stated by youcantryreachingme above.

Phil
  • 11
  • 1