Question

1 Approved Answer

Posted on Sep 26, 2024

UTF8 encoding is a variable length encoding. This means that sometimes it uses 1 byte, sometimes it uses 2, 3, or 4 bytes to represent

image text in transcribed

UTF8 encoding is a variable length encoding. This means that sometimes it uses 1 byte, sometimes it uses 2, 3, or 4 bytes to represent a character. How does it know that a character is 1 or 2 or 3 or 4 bytes? UTF-8 distinguishes this using a unary encoding. That is, if the bits of a character start with 0, then there is only 1 byte being read for this character. Oxxxxxxx represents a 1-byte character (basically the first 127 ascii characters). If the first bits of the character start with 110xxxxx, then there will be two bytes, 1110xxxx means 3 bytes, and 11110xxx means 4 bytes. See this table for example. Bytes 1st byte2nd byte 3rd byte 4th byte 110xxxxx10xxxxxx 1110xxxx10xx 11110xxx 10xxxxxx 1 0xxxxxx 10xxxxxx xxxx 10xxxxxx Hence, the capital A, which is 41 is represented as 00101001 encoded as A' in UTF8. The winky emoji: is 128,52 l in decimal which is 00000001 1 1 1 1 01 10 00001001 in binary, which is encoded into UTF8 as 11 110000 10011111 10011000 10001001 (with only 21 significant bits) Encoding errors happen when a file encoded in one format is decoded in another format. Let's say that the string 'Hi' was encoded with ASCII where H is represented in decimal as 72 and where i is represented in decimal as 105 How would it appear if decoded in UTF8! Let's say that the string(which is Chinese 'ni hao, which translates to hello in English) is encoded with UTF-8: whereis represented in decimal as 20320 and whereis represented in decimal as 22909 How would it appear if decoded with ASCII (show your work)? (Hint - for this you only need the 3-byte version of utf8.) UTF8 encoding is a variable length encoding. This means that sometimes it uses 1 byte, sometimes it uses 2, 3, or 4 bytes to represent a character. How does it know that a character is 1 or 2 or 3 or 4 bytes? UTF-8 distinguishes this using a unary encoding. That is, if the bits of a character start with 0, then there is only 1 byte being read for this character. Oxxxxxxx represents a 1-byte character (basically the first 127 ascii characters). If the first bits of the character start with 110xxxxx, then there will be two bytes, 1110xxxx means 3 bytes, and 11110xxx means 4 bytes. See this table for example. Bytes 1st byte2nd byte 3rd byte 4th byte 110xxxxx10xxxxxx 1110xxxx10xx 11110xxx 10xxxxxx 1 0xxxxxx 10xxxxxx xxxx 10xxxxxx Hence, the capital A, which is 41 is represented as 00101001 encoded as A' in UTF8. The winky emoji: is 128,52 l in decimal which is 00000001 1 1 1 1 01 10 00001001 in binary, which is encoded into UTF8 as 11 110000 10011111 10011000 10001001 (with only 21 significant bits) Encoding errors happen when a file encoded in one format is decoded in another format. Let's say that the string 'Hi' was encoded with ASCII where H is represented in decimal as 72 and where i is represented in decimal as 105 How would it appear if decoded in UTF8! Let's say that the string(which is Chinese 'ni hao, which translates to hello in English) is encoded with UTF-8: whereis represented in decimal as 20320 and whereis represented in decimal as 22909 How would it appear if decoded with ASCII (show your work)? (Hint - for this you only need the 3-byte version of utf8.)