Wednesday, January 11, 2012

Perl, UTF-8, and TextEdit character encoding hell

I'm working on a project where I need to dynamically read in content from a flat file. The flat file contains some template information, and I'm also pasting in (using TextEdit) the contents of a file that was generated by a Perl script reading from a database.

The problem was that when I read it into a string using stringWithContentsOfFile and NSUTF8StringEncoding, it would blow up. I could use NSASCIIStringEncoding, but then some of the characters (e.g. em dashes, single and double quotes) were translated incorrectly. If I brought this file up in TextEdit or Dashcode, everything looked great. In vi or at the command line, it did not.
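The two symptoms above (a strict UTF-8 read failing, and a more permissive read mangling the punctuation) can be reproduced outside Cocoa. Here is a minimal Python sketch, not the code from my project. The sample string is made up, and Mac OS Roman is only a guess at the kind of legacy single-byte encoding that could end up in the file; I never confirmed what was actually written.

```python
# Hypothetical sample text: curly quotes, an em dash, and a curly apostrophe.
sample = "\u201cquoted\u201d \u2014 it\u2019s"

utf8_bytes = sample.encode("utf-8")
# Assumption: a legacy single-byte encoding. Mac OS Roman is only a guess.
legacy_bytes = sample.encode("mac_roman")


def try_utf8(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Real UTF-8 bytes decode cleanly.
utf8_ok = try_utf8(utf8_bytes)

# Legacy single-byte data is rejected by a strict UTF-8 decoder.
legacy_as_utf8 = try_utf8(legacy_bytes)

# A single-byte decoder never fails on UTF-8 bytes, but it turns each
# multi-byte character into several wrong characters.
garbled = utf8_bytes.decode("latin-1")
```

In this sketch `legacy_as_utf8` comes back as `None`, which is the "blow up" case, while `garbled` decodes without complaint but no longer matches the original text, which is the "translated incorrectly" case.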

When I ran file -I foo.txt, it reported the file type as "unknown", although the Perl-generated file was UTF-8.
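Since file -I only says "unknown", a more direct check is to decode the raw bytes strictly and see where they stop being valid UTF-8. A rough Python sketch follows; the function names are my own invention, and foo.txt is just the placeholder name from above.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the whole buffer is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first undecodable byte.
        return err.start
    return None


def check_file(path: str) -> Optional[int]:
    """Read a file as raw bytes and run the same check on its contents."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())
```

Calling check_file("foo.txt") would return None for a clean UTF-8 file, or the byte offset of the first offending character, which can then be inspected with a hex dump.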

I traced the problem down to the TextEdit "Plain Text File Encoding" preferences. Both "Opening Files" and "Saving Files" were set to UTF-8. This was helpful for reading the data, but somehow when saving the pasted-in content, it caused the file type to get hosed such that certain tools (e.g. my terminal window, which is set to UTF-8) could no longer properly read the characters.

Once I set the preferences back to Automatic, the saved file was UTF-8 again, but the special characters no longer displayed correctly in TextEdit. The file also loaded into an NSString with NSUTF8StringEncoding.

Weird.

That's three hours of my life I'll never get back.
