Adam Hepner

Fix File Encoding Using Go

This is just a short entry on how to fix file encodings using Go. The reason for it is the following: I am currently developing a small test automation augmentation for my customer, and it uses a third-party tool. For various reasons this third-party tool needs to touch the test data files, and in doing so, it completely garbles them. After some analysis it turned out that the tool reads the UTF-8 bytes of each file, interprets them as Windows-1252 characters, and then writes those characters out again as UTF-8. This leads to a very weird situation where my umlauts are converted into gibberish (like: ü -> Ã¼). Fortunately, I also perform some additional postprocessing and touch every line in each of the files in question, so I just needed to figure out how to repair the lines while I was at it.

A little bit of digging turned up an interesting package: "golang.org/x/text/encoding", which not only covers a large number of encodings (the Windows-1252 one lives in its "charmap" subpackage), but also provides methods to encode and decode strings using those encodings.

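So in order to understand the problem, which is the most important part of trying to solve it, I came up with the following explanation: the tool takes the two UTF-8 bytes of ü (0xC3 0xBC), treats each byte as a separate Windows-1252 character (Ã and ¼), and writes those two characters out as UTF-8, which yields the four bytes 0xC3 0x83 0xC2 0xBC.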
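The garbling can be replayed in a few lines of Go. This is only a sketch, not the tool's actual code: the hypothetical helper `garble` treats every byte as the code point of the same value, which matches Windows-1252 only for ASCII and for the bytes 0xA0 to 0xFF. That is enough for ü, but not for characters such as Ü or ß, whose UTF-8 encoding contains a byte in the 0x80-0x9F range where Windows-1252 has its own special characters.

```go
package main

import "fmt"

// garble mimics the faulty round trip: every UTF-8 byte of s is misread as
// one character, and the resulting characters are written out as UTF-8.
// Assumption: bytes are mapped to the code point of the same value, which
// agrees with Windows-1252 only outside the 0x80-0x9F range.
func garble(s string) string {
	runes := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		// Indexing a string yields raw bytes, not decoded characters.
		runes = append(runes, rune(s[i]))
	}
	// Converting runes to a string encodes them as UTF-8 again.
	return string(runes)
}

func main() {
	in := "\u00fc" // ü
	out := garble(in)
	fmt.Printf("% x -> % x\n", in, out) // c3 bc -> c3 83 c2 bc
	fmt.Println(out)                    // Ã¼
}
```

Two bytes go in and four come out, which is why the damage cannot be undone by simply re-reading the file with a different encoding setting.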

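The fix then is pretty simple: take every character of the garbled string, encode it back into its single Windows-1252 byte, and read the resulting byte sequence as UTF-8.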
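Stripped of the x/text dependency, the idea might look like the sketch below. The names `ungarble` and `fixLine` are made up for illustration, and the sketch assumes that only characters up to U+00FF occur, i.e. the part of Windows-1252 that coincides with Unicode code points. It also guards against one pitfall: a line that was never garbled would turn into invalid UTF-8 when "fixed", so such lines are returned unchanged.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ungarble maps every rune back to one byte and reads the bytes as UTF-8.
// Assumption: only runes up to U+00FF are handled; the Windows-1252 specials
// in the 0x80-0x9F range (such as the euro sign) are rejected.
func ungarble(s string) (string, bool) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return "", false // not representable as one byte in this sketch
		}
		out = append(out, byte(r))
	}
	if !utf8.Valid(out) {
		return "", false // the input was probably not garbled at all
	}
	return string(out), true
}

// fixLine returns the repaired line, or the line unchanged if the repair
// is not possible.
func fixLine(line string) string {
	if fixed, ok := ungarble(line); ok {
		return fixed
	}
	return line
}

func main() {
	fmt.Println(fixLine("M\u00c3\u00bcller")) // garbled input, gets repaired
	fmt.Println(fixLine("M\u00fcller"))       // already correct, left as is
}
```

Both calls print Müller. A plain ASCII line passes through untouched as well, since its bytes mean the same thing in both encodings.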

Here’s the full solution and usage:

```go
package postprocessing

// FixEncoding takes a string that was garbled by a wrong encoding round trip
// and returns the valid UTF-8 string it was meant to be.
//
// The problem here is that Python sometimes gets a UTF-8 string and interprets
// its bytes as Windows-1252 characters, while telling other programs that the
// text is still UTF-8. As a result, the character ü gets transformed to Ã¼.
//
// Runes that the coder cannot encode are silently dropped.
func FixEncoding(coder Encoder, input string) string {
	out := make([]byte, 0, len(input))
	for _, r := range input {
		// Map each rune back to the single byte it was misread from.
		if b, ok := coder.EncodeRune(r); ok {
			out = append(out, b)
		}
	}
	// The collected bytes are the original UTF-8 sequence.
	return string(out)
}

// Encoder is the one method of charmap.Windows1252 that FixEncoding needs.
type Encoder interface {
	EncodeRune(r rune) (b byte, ok bool)
}
```
And the test, which doubles as a usage example:

```go
// The test lives in the package itself, so it needs no import path for it.
package postprocessing

import (
	"testing"

	"golang.org/x/text/encoding/charmap"
)

func TestFixEncoding(t *testing.T) {
	input := "Ã¼"
	expected := "ü"

	if out := FixEncoding(charmap.Windows1252, input); out != expected {
		t.Errorf("for input string %q expected output %q, but got %q", input, expected, out)
	}
}
```