Sunday, February 16, 2014

Perl: Matching non-ASCII characters in a converted RTF

I have a data file that was converted from RTF to TXT. When I started trying to parse it with Perl, my regular expressions couldn't split lines that looked like they had whitespace delimiters. They just ignored the whitespace.

After my initial confusion, I figured that the whitespace must be something other than an ASCII space, tab, etc. With some experimentation, I noticed that each piece of that "whitespace" was actually made up of several bytes.

To figure out what the bytes/characters were, I wrote a little Perl snippet that looked like:

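# Find each run of characters outside the expected set, dump its bytes as hex,
# then replace it with a placeholder so the loop moves on to the next oddball.
# Note: this modifies $filecontents, so run it on a copy of the data.
while ($filecontents =~ /([^\w\s.:;&,\-()]+)/) {
    my $d = $1;
    (my $f = $d) =~ s/(.)/sprintf("%x ", ord($1))/eg;
    print "f is $f\n";
    $filecontents =~ s/\Q$d\E/zzz/g;   # \Q...\E matches $d literally, not as a regex
}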


Basically, the code goes through the file, finds oddball characters, and prints their bytes in hex. When I ran it, it produced the following:

   f is e2 80 83
   f is e2 80 a8
   f is e2 81 84
   f is c2 b0 

 
Note that each of those looks like a multi-byte UTF-8 sequence, but which characters are they?

Well, I do love the internet. I cut and pasted e2 80 83 into Google and found that it was an "em space", aka Unicode character U+2003. The same kind of lookup shows that e2 80 a8 is the Unicode line separator, U+2028.

Once I knew which Unicode character I was dealing with, I could just replace all of the em spaces with a regular space, and the rest of my program worked as designed. Same idea with the other special characters. Two of those characters were not whitespace at all, just non-ASCII characters: the fraction slash (e2 81 84, U+2044) and the degree sign (c2 b0, U+00B0).
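The lookup doesn't have to go through a search engine; Perl can do it too. Here's a minimal sketch, assuming the core Encode and charnames modules are available (they ship with Perl). The helper name describe_bytes is made up for illustration; it takes one of the hex dumps from the output above and reports the code point and official Unicode name.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();   # loads the name tables used by charnames::viacode

# Hypothetical helper: turn a hex dump such as "e2 80 83" into
# a description such as "U+2003 EM SPACE".
sub describe_bytes {
    my ($hex) = @_;

    # Rebuild the raw byte string from the space-separated hex values
    my $bytes = join '', map { chr(hex($_)) } split ' ', $hex;

    # Decode the UTF-8 bytes into characters; FB_CROAK makes decode
    # die on malformed input instead of silently substituting
    my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);

    return join ', ', map {
        my $cp = ord($_);
        sprintf('U+%04X %s', $cp, charnames::viacode($cp) // '(unnamed)');
    } split //, $chars;
}

print describe_bytes($_), "\n" for 'e2 80 83', 'e2 80 a8', 'e2 81 84', 'c2 b0';
```

For the four sequences from my file, this should print EM SPACE, LINE SEPARATOR, FRACTION SLASH, and DEGREE SIGN, each with its U+ code point.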

Note that, at least in my case, I had to match on the hex bytes rather than on the Unicode code point. In other words, for the em space:

    $filecontents =~ s/\xe2\x80\x83/ /g;

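I'm assuming this is because I read the file in as raw bytes. The data is UTF-8 encoded, and Perl doesn't decode it into characters unless asked, so a code point pattern like \x{2003} has nothing to match against; the string only contains the three separate bytes e2, 80 and 83. For next time, I should see if decoding the file as UTF-8 when I read it lets me match on the code point instead. Maybe it would be easier :)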
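To illustrate the difference between matching bytes and matching characters, here's a rough sketch of the decode-first approach, assuming the data really is UTF-8. The sub name normalize_whitespace and the sample string are made up for illustration, and turning the line separator into a newline is just my choice here; a plain space would work too.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw UTF-8 bytes into characters, then
# replace the odd whitespace by code point instead of by byte sequence.
sub normalize_whitespace {
    my ($raw_bytes) = @_;

    # After decoding, each multi-byte sequence becomes a single character
    my $text = decode('UTF-8', $raw_bytes);

    $text =~ s/\x{2003}/ /g;    # EM SPACE -> ordinary space
    $text =~ s/\x{2028}/\n/g;   # LINE SEPARATOR -> newline (an assumption)
    return $text;
}

# Stand-in for file contents read without any decoding layer
my $raw   = "a\xe2\x80\x83b\xe2\x80\xa8c 45\xc2\xb0";
my $clean = normalize_whitespace($raw);

# Encode back to UTF-8 bytes before printing
print encode('UTF-8', $clean), "\n";
```

When reading from a real file, opening it with a '<:encoding(UTF-8)' layer should give the same decoded characters without the explicit decode call.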
