Fix File Encoding Using Go

01 Nov 2019, 21:56

go / file encoding

This is just a short entry on how to fix file encodings using Go. The reason for it is the following: I am currently developing a small test automation augmentation for my customer, and it uses a third-party tool. For various reasons this third-party tool needs to touch the test data files, and in doing so it completely garbles them. After some analysis it turned out that the UTF-8 bytes of the files are interpreted as Windows-1252 on reading, and the result is then written out again as UTF-8. This leads to a very weird situation where my umlauts are converted into gibberish (for example: ü -> Ã¼). Fortunately, I already perform some additional postprocessing and touch every line in each of the files in question, so I just needed to figure out how to fix the situation.

A little bit of digging turned up an interesting package: "golang.org/x/text/encoding". Together with its subpackages (such as "golang.org/x/text/encoding/charmap", used below) it not only provides a large number of encodings, but also methods to encode and decode strings using them.

So in order to understand the problem - the most important part of trying to solve it - I came up with the following explanation:

1. The input file is written properly in UTF-8.
2. The file is read in by the third-party program, but somehow the UTF-8 encoding is ignored, and instead every single byte is interpreted as a Windows-1252 character.
3. After the third-party program is done with the conversion, it writes the resulting characters out as UTF-8. The file is therefore still considered to be UTF-8, but every byte of the original text has become a character of its own and is written using a different set of bytes.
4. Another program in the chain opens the output file, identifies it as UTF-8, and decodes the characters (code points, or runes in Go terms), but they no longer match the original text, because the bytes were interpreted with a different character map along the way.
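
To make the chain above concrete, here is a minimal sketch that reproduces the garbling. It is only an illustration, not what the third-party tool actually runs: the function name garble is made up, and Latin-1 (byte value equals code point) stands in for Windows-1252 so that the example needs nothing outside the standard library. The two character maps agree for the bytes 0xA0-0xFF, which covers ü, but differ in the range 0x80-0x9F.

```go
package main

import (
	"fmt"
	"strings"
)

// garble mimics the faulty tool: it treats every byte of the UTF-8 input as
// one single-byte character and writes the result out as UTF-8 again.
// Assumption: Latin-1 stands in for Windows-1252 here; both agree for the
// bytes 0xA0-0xFF but differ in the range 0x80-0x9F.
func garble(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		sb.WriteRune(rune(s[i])) // reuse the byte value as a code point
	}
	return sb.String()
}

func main() {
	in := "ü"
	out := garble(in)
	fmt.Printf("% x -> % x\n", in, out) // c3 bc -> c3 83 c2 bc
	fmt.Println(out)                    // Ã¼
}
```

The two bytes of ü (c3 bc) turn into two separate characters, Ã and ¼, each of which needs two bytes of its own in UTF-8.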

The fix then is pretty simple:

1. read the text back as UTF-8
2. convert each character back to the single byte it stands for in Windows-1252
3. write down the whole text (all bytes) again, but this time using the proper character map (UTF-8).

Here’s the full solution and usage:

package postprocessing

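// FixEncoding reads in a string that is invalidly represented and outputs a valid UTF-8 string.
//
// The problem here is that python sometimes gets a UTF-8 string and outputs the corresponding bytes as Windows-1252
// characters, while telling other programs that the text is still UTF-8. As a result, the character ü gets transformed
// to Ã¼.
//
// Runes that coder cannot encode are silently dropped.
func FixEncoding(coder Encoder, input string) string {
	fixed := make([]byte, 0, len(input))
	// Ranging over a string decodes it as UTF-8 and yields one rune at a time.
	for _, r := range input {
		if b, ok := coder.EncodeRune(r); ok {
			fixed = append(fixed, b)
		}
	}
	return string(fixed)
}

// Encoder maps a rune to the single byte that represents it in some character map.
// charmap.Windows1252 from golang.org/x/text/encoding/charmap satisfies it.
type Encoder interface {
	EncodeRune(r rune) (b byte, ok bool)
}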

package postprocessing_test

import (
	"golang.org/x/text/encoding/charmap"
	"./postprocessing"
	"testing"
)

func TestFixEncoding(t *testing.T) {
	input := "Ã¼"
	expected := "ü"

	if out := postprocessing.FixEncoding(charmap.Windows1252, input); out != expected {
		t.Errorf("For input string %q expected output %q, but got %q", input, expected, out)
	}
}
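
Since the fix is applied to every line of a file, one thing to watch out for is a line that was never garbled: running it through the conversion turns a correct ü into the single byte 0xFC, which is not valid UTF-8. The following is a rough sketch of a more defensive per-line variant, not part of the original solution. The names fixLine and latin1 are hypothetical, and latin1 is only a stand-in that keeps the example free of external dependencies; in the real postprocessing one would pass charmap.Windows1252 instead.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"unicode/utf8"
)

// Encoder is the same interface as in the post.
type Encoder interface {
	EncodeRune(r rune) (b byte, ok bool)
}

// latin1 is a hypothetical stand-in for charmap.Windows1252. It agrees with
// Windows-1252 for the bytes 0xA0-0xFF, which is enough for this example.
type latin1 struct{}

func (latin1) EncodeRune(r rune) (byte, bool) {
	if r >= 0 && r <= 0xFF {
		return byte(r), true
	}
	return 0, false
}

// fixLine (hypothetical helper) returns the repaired line, or the line
// unchanged when it does not look garbled.
func fixLine(coder Encoder, line string) string {
	fixed := make([]byte, 0, len(line))
	for _, r := range line {
		b, ok := coder.EncodeRune(r)
		if !ok {
			return line // rune outside the character map: not this kind of damage
		}
		fixed = append(fixed, b)
	}
	if !utf8.Valid(fixed) {
		return line // result is not UTF-8, so the line was most likely fine already
	}
	return string(fixed)
}

func main() {
	// The first line is garbled, the second one is already correct.
	input := "MÃ¼nchen\nKöln\n"

	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		fmt.Println(fixLine(latin1{}, scanner.Text()))
	}
	if err := scanner.Err(); err != nil {
		fmt.Println("read error:", err)
	}
}
```

The check is a heuristic: a line that legitimately contains the text Ã¼ would still be converted, so it only makes sense when the input is known to come from the faulty tool.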