Question

Posted on Sep 25, 2024

In Python 3+:

A. Implement a function wordcount() that takes a string and returns the frequency of each word as a dictionary. Notes: 1. Ignore case: for example, 'Unicode' and 'unicode' count as the same word. 2. Punctuation and digits should NOT be counted. You only need to handle these punctuation characters: (),./[]".
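One possible reading of part A, as a minimal sketch rather than a definitive answer. It assumes the listed punctuation and the digits act as separators (so 'ISO/IEC' yields 'iso' and 'iec'), and that any other character, such as a hyphen or apostrophe, stays part of the word. The names IGNORED and _TO_SPACE are illustrative only.

```python
import string

# Characters the question says not to count: the listed punctuation plus digits.
IGNORED = '(),./[]"' + string.digits

# Assumption: treat every ignored character as a separator by mapping it to a space.
_TO_SPACE = str.maketrans(IGNORED, " " * len(IGNORED))


def wordcount(text):
    """Return a dict mapping each lowercase word in `text` to its frequency."""
    counts = {}
    for word in text.lower().translate(_TO_SPACE).split():
        counts[word] = counts.get(word, 0) + 1
    return counts


print(wordcount("Unicode is (unicode) 13.0, is."))
# {'unicode': 2, 'is': 2}
```

If the assignment instead expects ignored characters to be deleted outright (so 'ISO/IEC' becomes 'isoiec'), `str.maketrans("", "", IGNORED)` would do that. The question does not say which behaviour is intended.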

B. Load the file 'unicode.txt' (attached), then print the frequency of each word in it. Output format: the word occupies 20 characters (left-aligned) and the frequency occupies 4 characters (right-aligned). For example:

unicode               14
is                    14
an                     3
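The column layout in part B can be sketched with f-string format specifiers: `<20` left-aligns in 20 characters and `>4` right-aligns in 4. The helper names format_counts and print_counts are hypothetical. The sample dictionary reuses the three example rows from the question; for the real task it would presumably be the result of wordcount() applied to the contents of 'unicode.txt' (for instance read with `open('unicode.txt', encoding='utf-8')`).

```python
def format_counts(counts):
    """Return one 24-character line per word: word left-aligned in 20, count right-aligned in 4."""
    return [f"{word:<20}{count:>4}" for word, count in counts.items()]


def print_counts(counts):
    """Print the formatted lines to the screen."""
    for line in format_counts(counts):
        print(line)


# Sample data copied from the example rows in the question.
print_counts({"unicode": 14, "is": 14, "an": 3})
```

Words longer than 20 characters are not truncated by this format, so such a line would simply run wider than 24 characters.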

C. (Please help with this part.) Repeat Question 3B, but instead of printing the result to the screen, write it to a file named 'unicode_count.txt'.
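For part C, the same 20/4 column format can be written to a file instead of printed. This is a sketch under assumptions: write_counts is a hypothetical helper, it takes an already computed word-frequency dictionary, and it writes UTF-8 because the question does not specify an encoding. The demo writes into a temporary directory so that no existing file is overwritten.

```python
import tempfile
from pathlib import Path


def write_counts(counts, path="unicode_count.txt"):
    """Write one line per word to `path`: word left-aligned in 20, count right-aligned in 4."""
    with open(path, "w", encoding="utf-8") as out:
        for word, count in counts.items():
            out.write(f"{word:<20}{count:>4}\n")


if __name__ == "__main__":
    # Demo with the sample rows from the question, written to a throwaway location.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "unicode_count.txt"
        write_counts({"unicode": 14, "is": 14, "an": 3}, target)
        print(target.read_text(encoding="utf-8"), end="")
```

Calling `write_counts(counts)` with no second argument would create 'unicode_count.txt' in the current working directory, which is what the question asks for.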

Text of the file 'unicode.txt':

Unicode is an information technology (IT) standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of March 2020, there is a total of 143,859 characters, with Unicode 13.0 (these characters consist of 143,696 graphic characters and 163 format characters) covering 154 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical. The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional text display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).[1] Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework. Unicode can be implemented by different character encodings. The Unicode standard defines Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings. The most commonly used encodings are UTF-8, UTF-16, and UCS-2 (a precursor of UTF-16 without full support for Unicode); GB18030 is standardized in China and implements Unicode fully, while not an official Unicode standard. 
UTF-8, the dominant encoding on the World Wide Web (used in over 95% of websites as of 2020, and up to 100% for some languages)[2] uses one byte[note 1] for the first 128 code points, and up to 4 bytes for other characters.[3] The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also a UTF-8 text. UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called Basic Multilingual Plane (BMP). With 1,112,064 possible Unicode code points corresponding to characters (see below) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 is only able to represent less than half of all encoded Unicode characters. Therefore, UCS-2 is outdated, though still widely used in software. UTF-16 extends UCS-2, by using the same 16-bit encoding as UCS-2 for the Basic Multilingual Plane, and a 4-byte encoding for the other planes. As long as it contains no code points in the reserved range U+D800U+DFFF,[clarification needed] a UCS-2 text is valid UTF-16 text. UTF-32 (also referred to as UCS-4) uses four bytes to encode any given codepoint, but not necessarily any given user-perceived character (loosely speaking, a grapheme), since a user-perceived character may be represented by a grapheme cluster (a sequence of multiple codepoints).[4] Like UCS-2, the number of bytes per codepoint is fixed, facilitating character indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. However, because each character uses four bytes, UTF-32 takes significantly more space than other encodings, and is not widely used. Examples of UTF-32 also being variable-length (as all the other encodings), while in a different sense include: "Devanagari kshi is encoded by 4 code points [..] 
Flag emojis are also grapheme clusters and composed of two code point characters for example, the flag of Japan"[5] and all "combining character sequences are graphemes, but there are other sequences of code points that are as well; for example is one."[6][7][8][9]