69

I'm getting strange characters when pulling data from a website:

Â

How can I remove anything that isn't a non-extended ASCII character?


A more appropriate question can be found here: PHP - replace all non-alphanumeric chars for all languages supported

John
LordZardeck

8 Answers

115

A regex replace would be the best option. Take $str as an example string and match it against [:print:], which is a POSIX character class:

$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

[:print:] matches all printable characters; its negation, [:^print:], matches all non-printable characters. Any character that is not printable in the current character set is removed.

Note: Before using this method, you must ensure that your current character set is ASCII. POSIX Character Classes support both ASCII and Unicode and will match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
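If the goal is simply to drop everything outside 7-bit ASCII regardless of locale or charset settings, an explicit byte range avoids the dependency described in the note. This is a minimal sketch, not part of the original answer; `strip_non_ascii` is a hypothetical helper name, and the pattern deliberately omits the `/u` modifier so it works byte by byte:

```php
<?php
// Hypothetical helper: removes every byte outside the 7-bit ASCII range.
// No /u modifier is used, so the pattern is applied byte by byte and does
// not depend on the locale or on default_charset.
function strip_non_ascii(string $str): string
{
    return preg_replace('/[^\x00-\x7F]/', '', $str);
}

// "aAÂ" encoded as UTF-8: the two bytes of "Â" (0xC3 0x82) are removed.
echo strip_non_ascii("aA\xC3\x82"), "\n"; // prints "aA"

// For an array of strings, apply the same helper to each element.
$cleaned = array_map('strip_non_ascii', ["caf\xC3\xA9", 'plain']);
print_r($cleaned); // elements: "caf" and "plain"
```

Unlike `[:^print:]`, this keeps ASCII control characters such as tabs and newlines; narrow the range if those should go as well.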

Chris Bornhoft
  • 4
    This solution is not working for me. :( I am getting aAÂ. php 5.3.0. (windows) – DamirR Jan 08 '12 at 23:12
  • this solution is dependent on the localisation of the perl regex library... in particular it seems to require a broken version – Jasen Aug 12 '14 at 00:03
  • @Jasen They're known as [POSIX Character Classes](http://www.regular-expressions.info/posixbrackets.html). They work with any version, but require ASCII to be the selected character set within PHP, since Character Classes also support Unicode fully. I've updated my answer accordingly. – Chris Bornhoft Aug 12 '14 at 16:16
  • 1
    How do you make ASCII the selected character set via code? – vcardillo Oct 17 '14 at 19:29
  • This is a solution for PHP string variable and not for PHP array variable. **What is the solution for PHP array variable containing these htmlentitycodes  = `Â` which is a-circumflex?** – Neocortex Dec 03 '14 at 07:14
  • @BannedfromSO Take a look at the [`array_map`](http://php.net/manual/en/function.array-map.php) function. – Chris Bornhoft Dec 03 '14 at 16:59
  • @ChrisBornhoft - Yes I did this `$a = array_map('trim',$array);` – Neocortex Dec 04 '14 at 03:50
  • Any ideas why this allows any [UTF8 character](https://apps.timwhitlock.info/emoji/tables/unicode) even when PHP has been setup to use Windows-1252 with `ini_set('default_charset', 'windows-1252');`? I want to get rid of all those Unicode characters and allow only characters from the [Windows-1252 codepage](http://www.kostis.net/charsets/cp1252.htm). – andreszs Feb 07 '18 at 01:59
  • if you use `[:print:]` some characters may be changed to `?`, see here for more info on a workaround: https://alvinalexander.com/php/how-to-remove-non-printable-characters-in-string-regex – degenerate May 17 '18 at 15:34
  • 2
    Yes, this answer only works on misconfigured systems. 'Â' is clearly a printing character (it is both inked and consumes space). Use `'/[[:^ascii:]]/'` instead of `'/[[:^print:]]/'` to strip non-ASCII. – Jasen Sep 08 '19 at 22:18
  • Jasen, your correction was the right solution for me at least. – Hobbes Dec 15 '20 at 04:26
  • @Jasen your answer is the correct one. Thanks – Plugie May 07 '21 at 09:26
  • Didn't work for me: it turned, for example, `Anton Dovečer` into `Anton Doveer`, but I'd expect it to produce `Boris Dovecer` – Kaspar L. Palgi Oct 23 '21 at 18:38
  • @KasparL.Palgi that is *exactly* what the original question asked to accomplish: remove the characters completely. To replace with a non-accented character, you would need to create a custom mapping of the characters you'd like to replace first. – Chris Bornhoft Oct 24 '21 at 19:22
46

You want only ASCII printable characters?

use this:

<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/','', $str);
echo "($str)($res)";
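Note that the range `\x20-\x7E` also removes tabs and line breaks, since they are control characters below `\x20`. If multi-line input should keep its layout, one possible variant (a sketch; `keep_printable_ascii` is a hypothetical name) adds them back to the allowed set:

```php
<?php
// Hypothetical variant of the pattern above that also keeps tab, CR and LF.
function keep_printable_ascii(string $str): string
{
    return preg_replace('/[^\x20-\x7E\t\r\n]/', '', $str);
}

// "š" (0xC5 0xA1) and the control byte 0x01 are removed; tab and newline stay.
$in = "line1\tA\nline2 \xC5\xA1\x01B";
echo keep_printable_ascii($in);
```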

Or even better, convert your input to utf8 and use phputf8 lib to translate 'not normal' characters into their ascii representation:

require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

if(!utf8_is_valid($str))
{
  $str=utf8_bad_strip($str);
}

$str = utf8_to_ascii($str, '' );
DamirR
36

$clearstring=filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
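`FILTER_SANITIZE_STRING` is deprecated as of PHP 8.1 and also strips HTML tags. A possible alternative, sketched here under the assumption that only the high bytes should go, is to combine the same flag with `FILTER_UNSAFE_RAW`, which leaves the rest of the string untouched:

```php
<?php
// FILTER_UNSAFE_RAW leaves the string as-is apart from the flags given;
// FILTER_FLAG_STRIP_HIGH removes bytes with a value above 127.
$rawstring = "caf\xC3\xA9 <b>bold</b>"; // "café <b>bold</b>" in UTF-8
$clearstring = filter_var($rawstring, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
echo $clearstring; // caf <b>bold</b>
```

Tags survive with this filter, so run `strip_tags()` separately if they should be removed too.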

Rusty Fausak
Utopia
  • Seems perfect for PHP >= 5.2 – user414873 Oct 22 '15 at 13:33
  • This seems to also strip tags. For me it was removing See [PHP Sanitize filters](http://php.net/manual/en/filter.filters.sanitize.php) – ds00424 Sep 03 '16 at 16:41
  • Heads up: if you [go to functions-online.com to test this](https://ru.functions-online.com/filter_var.html?command={%22variable%22:%22\uf8ff%22,%22filter%22:%22FILTER_SANITIZE_STRING%22,%22options%22:%22FILTER_FLAG_STRIP_HIGH%22}), it will put single quotes around `FILTER_FLAG_STRIP_HIGH` which stops it from working – mehov Feb 03 '20 at 12:26
  • This was helpful. Though I used FILTER_FLAG_ENCODE_HIGH instead of FILTER_FLAG_STRIP_HIGH – bhar1red Apr 11 '22 at 21:38
24

Kind of related: we had a web application that had to send data to a legacy system that could only handle 7-bit ASCII (128 characters).

The solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.

Normally I would do something like this:

<?php
// transliterate
if (function_exists('iconv')) {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>

... but that replaces everything that can't be translated with a question mark (?).

So we ended up doing the following. See the end of the function for a (commented-out) PHP regex that simply strips non-ASCII characters.

<?php
// Originally a class method; declared as a plain function so the snippet parses on its own.
function cleanNonAsciiCharactersInString($orig_text) {

    $text = $orig_text;

    // Single letters
    $text = preg_replace("/[∂άαáàâãªä]/u",      "a", $text);
    $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u",     "A", $text);
    $text = preg_replace("/[ЂЪЬБъь]/u",           "b", $text);
    $text = preg_replace("/[βвВ]/u",            "B", $text);
    $text = preg_replace("/[çς©с]/u",            "c", $text);
    $text = preg_replace("/[ÇС]/u",              "C", $text);        
    $text = preg_replace("/[δ]/u",             "d", $text);
    $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
    $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u",     "E", $text);
    $text = preg_replace("/[₣]/u",               "F", $text);
    $text = preg_replace("/[НнЊњ]/u",           "H", $text);
    $text = preg_replace("/[ђћЋ]/u",            "h", $text);
    $text = preg_replace("/[ÍÌÎÏ]/u",           "I", $text);
    $text = preg_replace("/[íìîïιίϊі]/u",       "i", $text);
    $text = preg_replace("/[Јј]/u",             "j", $text);
    $text = preg_replace("/[ΚЌК]/u",            'K', $text);
    $text = preg_replace("/[ќк]/u",             'k', $text);
    $text = preg_replace("/[ℓ∟]/u",             'l', $text);
    $text = preg_replace("/[Мм]/u",             "M", $text);
    $text = preg_replace("/[ñηήηπⁿ]/u",            "n", $text);
    $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u",       "N", $text);
    $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
    $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u",     "O", $text);
    $text = preg_replace("/[ρφрРф]/u",          "p", $text);
    $text = preg_replace("/[®яЯ]/u",              "R", $text); 
    $text = preg_replace("/[ГЃгѓ]/u",              "r", $text); 
    $text = preg_replace("/[Ѕ]/u",              "S", $text);
    $text = preg_replace("/[ѕ]/u",              "s", $text);
    $text = preg_replace("/[Тт]/u",              "T", $text);
    $text = preg_replace("/[τ†‡]/u",              "t", $text);
    $text = preg_replace("/[úùûüџμΰµυϋύ]/u",     "u", $text);
    $text = preg_replace("/[√]/u",               "v", $text);
    $text = preg_replace("/[ÚÙÛÜЏЦц]/u",         "U", $text);
    $text = preg_replace("/[Ψψωώẅẃẁщш]/u",      "w", $text);
    $text = preg_replace("/[ẀẄẂШЩ]/u",          "W", $text);
    $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
    $text = preg_replace("/[ỲΫ¥]/u",           "Y", $text);
    $text = preg_replace("/[ỳγўЎУуч]/u",       "y", $text);
    $text = preg_replace("/[ζ]/u",              "Z", $text);

    // Punctuation
    $text = preg_replace("/[‚‚]/u", ",", $text);        
    $text = preg_replace("/[`‛′’‘]/u", "'", $text);
    $text = preg_replace("/[″“”«»„]/u", '"', $text);
    $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
    $text = preg_replace("/[  ]/u", ' ', $text);

    $text = str_replace("…", "...", $text);
    $text = str_replace("≠", "!=", $text);
    $text = str_replace("≤", "<=", $text);
    $text = str_replace("≥", ">=", $text);
    $text = preg_replace("/[‗≈≡]/u", "=", $text);


    // Exciting combinations    
    $text = preg_replace("/[ыЫ]/u", "bl", $text);
    $text = str_replace("℅", "c/o", $text);
    $text = str_replace("₧", "Pts", $text);
    $text = str_replace("™", "tm", $text);
    $text = str_replace("№", "No", $text);        
    $text = str_replace("Ч", "4", $text);                
    $text = str_replace("‰", "%", $text);
    $text = preg_replace("/[∙•]/u", "*", $text);
    $text = str_replace("‹", "<", $text);
    $text = str_replace("›", ">", $text);
    $text = str_replace("‼", "!!", $text);
    $text = str_replace("⁄", "/", $text);
    $text = str_replace("∕", "/", $text);
    $text = str_replace("⅞", "7/8", $text);
    $text = str_replace("⅝", "5/8", $text);
    $text = str_replace("⅜", "3/8", $text);
    $text = str_replace("⅛", "1/8", $text);        
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[Љљ]/u", "Ab", $text);
    $text = preg_replace("/[Юю]/u", "IO", $text);
    $text = preg_replace("/[fifl]/u", "fi", $text);
    $text = preg_replace("/[зЗ]/u", "3", $text); 
    $text = str_replace("£", "(pounds)", $text);
    $text = str_replace("₤", "(lira)", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
    $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);


    //2) Translation CP1252.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
    $trans['-'] = array(
        '&hellip;',     // Horizontal Ellipsis
        '&tilde;',      // Small Tilde
        '&ndash;'       // Dash
        );
    $trans["+"] = '&dagger;';    // Dagger
    $trans['#'] = '&Dagger;';    // Double Dagger         
    $trans['M'] = '&permil;';    // Per Mille Sign
    $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron        
    $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
    $trans["'"] = array(
        '&lsquo;',  // Left Single Quotation Mark
        '&rsquo;',  // Right Single Quotation Mark
        '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        '&sbquo;',  // Single Low-9 Quotation Mark
        '&circ;',   // Modifier Letter Circumflex Accent
        '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );

    $trans['"'] = array(
        '&ldquo;',  // Left Double Quotation Mark
        '&rdquo;',  // Right Double Quotation Mark
        '&bdquo;',  // Double Low-9 Quotation Mark
        );

    $trans['*'] = '&bull;';    // Bullet
    $trans['n'] = '&ndash;';    // En Dash
    $trans['m'] = '&mdash;';    // Em Dash        
    $trans['tm'] = '&trade;';    // Trade Mark Sign
    $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
    $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
    $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    $trans['euro'] = '&euro;';    // euro currency symbol
    ksort($trans);

    foreach ($trans as $k => $v) {
        $text = str_replace($v, $k, $text);
    }

    // 3) remove <p>, <br/> ...
    $text = strip_tags($text);

    // 4) &amp; => & &quot; => '
    $text = html_entity_decode($text);


    // transliterate
    // if (function_exists('iconv')) {
    // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    // }

    // remove non ascii characters
    // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);      

    return $text;
}

?>
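The long run of single-character replacements above can also be written as one lookup table passed to `strtr()`, which makes a single pass over the string and leaves unmapped characters alone. This is only a sketch of that design choice, not the author's code; `transliterate_with_map` and the small `$map` are hypothetical, and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper: applies a character => ASCII lookup table in one pass.
// With an array argument, strtr() tries the longest keys first and never
// re-scans text it has already replaced.
function transliterate_with_map(string $text, array $map): string
{
    return strtr($text, $map);
}

// Illustrative subset of the mappings used in the function above.
$map = [
    'á' => 'a',
    'é' => 'e',
    '…' => '...',
    '≤' => '<=',
    '™' => 'tm',
];

echo transliterate_with_map("café ≤ 5™…", $map); // cafe <= 5tm...
```

Anything missing from the table passes through unchanged, which matches the "leave it alone" requirement; a final strip of the remaining non-ASCII bytes can be added if the target system cannot accept them.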
Silas Palmer
  • According to http://php.net/manual/en/function.iconv.php#74101 , that should only be an issue if you do not select a proper locale (other than C or POSIX) – MauganRa Dec 16 '14 at 09:39
  • there are only 128 characters in the ascii character set. – Jasen Sep 08 '19 at 22:21
2

I also think that the best solution might be to use a regular expression.

Here's my suggestion:

function convert_to_normal_text($text) {

    $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
    $normal_text = preg_replace("/[^$normal_characters]/", '', $text);

    return $normal_text;
}

Then you can use it like this:

$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;

Displays:

Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .
simhumileco
1

I just had to add the header

header('Content-Type: text/html; charset=UTF-8');
nhahtdh
ALHaines
  • 1
    that will fix the case where UTF8 is being interpreted as WIN-1252 which is the default encoding for HTML, however it will not remove any characters from a string. – Jasen Aug 12 '14 at 00:10
0

This should be pretty straightforward, with no need for the iconv function:

// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-9_\s]+!', '', strtolower($string));
// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);
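The two steps above can be wrapped into a small helper. This is a sketch only: the name `slugify`, the `$separator` parameter, and the final `trim()` are assumptions added for illustration. Accented letters are removed rather than transliterated.

```php
<?php
// Hypothetical helper wrapping the two preg_replace() steps.
function slugify(string $string, string $separator = '-'): string
{
    // Escape the separator for use inside a character class with "!" delimiters.
    $sep = preg_quote($separator, '!');

    // Step 1: drop everything that is not the separator, a-z, 0-9, "_" or whitespace.
    $string = preg_replace('![^' . $sep . 'a-z0-9_\s]+!', '', strtolower($string));

    // Step 2: collapse runs of separators and whitespace into a single separator.
    $string = preg_replace('![' . $sep . '\s]+!', $separator, $string);

    // Assumption: leading and trailing separators are not wanted.
    return trim($string, $separator);
}

echo slugify('Hello,  World! 2024'); // hello-world-2024
```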
Goran Jakovljevic
-1

I think the best way to do something like this is to use the ord() function. That way you can keep characters written in any language; just remember to check your text's ord() results first. This will not work on Unicode (multibyte) text, because ord() looks at one byte at a time.

$name="βγδεζηΘKgfgebhjrf!@#$%^&";    
//this function will clear all non greek and english characters on greek-iso charset        
function replace_characters($string)    
{    
   $str_length=strlen($string);    
   for ($x=0,$new_string='';$x<$str_length;$x++) // $new_string starts empty so it is always defined
      {    
          $character=$string[$x];    
          if ((ord($character)>64 && ord($character)<91) || (ord($character)>96 && ord($character)<123) || (ord($character)>192 && ord($character)<210) || (ord($character)>210 && ord($character)<218) || (ord($character)>219 && ord($character)<250) || ord($character)==252 || ord($character)==254)    
             {    
                 $new_string=$new_string.$character;     
             }    
      }    
      return $new_string;    
}    
//end function    

$name=replace_characters($name);    

echo $name;    
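The same byte-by-byte idea can be sketched with the allowed ranges passed in as data, so that ord() is called once per byte and the ranges are easy to adjust. `keep_bytes_in_ranges` is a hypothetical name, and the example keeps only ASCII letters rather than the Greek ISO ranges used above; like the original, it assumes a single-byte encoding.

```php
<?php
// Hypothetical helper: keeps only bytes whose code falls inside one of the
// inclusive [low, high] ranges. Intended for single-byte encodings.
function keep_bytes_in_ranges(string $input, array $ranges): string
{
    $out = '';
    $len = strlen($input);
    for ($i = 0; $i < $len; $i++) {
        $code = ord($input[$i]); // computed once per byte
        foreach ($ranges as [$low, $high]) {
            if ($code >= $low && $code <= $high) {
                $out .= $input[$i];
                break;
            }
        }
    }
    return $out;
}

// Keep A-Z (65-90) and a-z (97-122) only.
echo keep_bytes_in_ranges('Kgf!@#1', [[65, 90], [97, 122]]); // Kgf
```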
  • 1
    Heavy-handed but tweakable... I like it. – Kristen Waite Oct 07 '15 at 13:23
  • You're doing ord() on the same character over and over again just for different comparisons (line 9). That's extremely inefficient. You should save result of ord() in variable and then reuse it in conditional. Also, consider using === instead of == as use of == is discouraged. Although I don't blame you for this, ironically PHP manual for ord() shows using == in examples. – xZero Nov 25 '17 at 13:50