Question

1 Approved Answer

Posted on Aug 24, 2024

Please finish four scripts described in the homework, Thank you! Homework: Find improperly marked files Warning: it will be difficult to do this homework without

image text in transcribed

Please finish four scripts described in the homework, Thank you!

Homework: Find improperly marked files Warning: it will be difficult to do this homework without attending the lab session for hints You're working in a project that has lots of text files. Some of them are in plain ASCII; others are in UTF-8, which is a superset of ASCII. None of the files are supposed to contain NUL bytes (all bits zero). The UTF-8 files that contain non-ASCII characters are supposed to have a first line containing the string"- coding: utf- " (without the quotation marks) POSIX systems typically support various locales to let applications operate well in various countries, languages, and character encodings. The shell command locale -a outputs all the locales available on your system, and should list among others the c, es_Mx.utt8, and ja_JP.utt8 locales. You can select a locale by setting the Lc_ALL the following: with a shell command like export LC ALL-en us.utfB This setting takes effect for all programs later invoked. You can see your current locale settings by running the locale command without any arguments. You can see how the locale affects the output of some standard ies by running the date and 1s -1 commands, using the abovementioned locales Associated with each locale is a character encoding that specifies how characters in that locale are represented as bytes. In the c locale on GNU/Linux systems, each character represents a single byte using the ASCII encoding for bytes whose values are in the range 0-127; bytes outside of that range do not represent any characters, and are considered to be encoding errors. The en_US.utfs locale is like the C locale, except that some (but not all) short sequences of bytes in the range 128-255 represent non-ASCII characters; bytes that are not in such a sequence are considered to be encoding errors 1. Write a shell script find-ascii-text that accepts one or more arguments, and outputs a line for each argument that names an ASCII text ilee., an ASCII file containing no NUL bytes). If an argument names a directory, your script should recursively look at all files under that directory or its subdirectories 2. Write a similar shell script tind-utf-8-text that works like tind-ascii-text, except that it outputs a line only for UTF-8 text files (ie., UTF-8 files containing no NUL bytes) that are not ASCII text files 3. Write a shell script tind-missing-utt-8-header that accepts one or more arguments, and outputs a line for each argument naming a UTF-8 text file that lacks the "-*- coding: utf-s -*- " string in the first line. If an argument names a directory, the command should search the directory recursively 4. Write a shell script find-extra-utf-8-header that is like find-missing-utf-8-header except it outputs a line for each ASCII text file that has the "-*- coding: utf-8 -*" string in the first line You need not worry about the cases where your scripts are given no arguments. However, be prepared to handle files whose names contain special characters like spaces,"", and leading" ". You need not worry about file names containing newlines Your script should be runnable as an ordinary user, and should be portable to any system that supports GNUUgrep along with the other standard POSIX shell and utilities: please see its list of utilities for the commands that your script may use. (Hint: see the find, head and tr utilities.) With GNU grep, the pattern (period) matches only individual characters in the current locale; it does not match encoding errors. GNU grep has special treatment of files with encoding errors, or files containing NUL characters; see its binary-files option. You may also want to look at GNU grep's-H. -m. -n.-1. -L, and-v options When testing your script, it is a good idea to do the testing in a subdirectory devoted just to testing. This will reduce the likelihood of messing up your home directory or main development directory if your script goes haywire Homework: Find improperly marked files Warning: it will be difficult to do this homework without attending the lab session for hints You're working in a project that has lots of text files. Some of them are in plain ASCII; others are in UTF-8, which is a superset of ASCII. None of the files are supposed to contain NUL bytes (all bits zero). The UTF-8 files that contain non-ASCII characters are supposed to have a first line containing the string"- coding: utf- " (without the quotation marks) POSIX systems typically support various locales to let applications operate well in various countries, languages, and character encodings. The shell command locale -a outputs all the locales available on your system, and should list among others the c, es_Mx.utt8, and ja_JP.utt8 locales. You can select a locale by setting the Lc_ALL the following: with a shell command like export LC ALL-en us.utfB This setting takes effect for all programs later invoked. You can see your current locale settings by running the locale command without any arguments. You can see how the locale affects the output of some standard ies by running the date and 1s -1 commands, using the abovementioned locales Associated with each locale is a character encoding that specifies how characters in that locale are represented as bytes. In the c locale on GNU/Linux systems, each character represents a single byte using the ASCII encoding for bytes whose values are in the range 0-127; bytes outside of that range do not represent any characters, and are considered to be encoding errors. The en_US.utfs locale is like the C locale, except that some (but not all) short sequences of bytes in the range 128-255 represent non-ASCII characters; bytes that are not in such a sequence are considered to be encoding errors 1. Write a shell script find-ascii-text that accepts one or more arguments, and outputs a line for each argument that names an ASCII text ilee., an ASCII file containing no NUL bytes). If an argument names a directory, your script should recursively look at all files under that directory or its subdirectories 2. Write a similar shell script tind-utf-8-text that works like tind-ascii-text, except that it outputs a line only for UTF-8 text files (ie., UTF-8 files containing no NUL bytes) that are not ASCII text files 3. Write a shell script tind-missing-utt-8-header that accepts one or more arguments, and outputs a line for each argument naming a UTF-8 text file that lacks the "-*- coding: utf-s -*- " string in the first line. If an argument names a directory, the command should search the directory recursively 4. Write a shell script find-extra-utf-8-header that is like find-missing-utf-8-header except it outputs a line for each ASCII text file that has the "-*- coding: utf-8 -*" string in the first line You need not worry about the cases where your scripts are given no arguments. However, be prepared to handle files whose names contain special characters like spaces,"", and leading" ". You need not worry about file names containing newlines Your script should be runnable as an ordinary user, and should be portable to any system that supports GNUUgrep along with the other standard POSIX shell and utilities: please see its list of utilities for the commands that your script may use. (Hint: see the find, head and tr utilities.) With GNU grep, the pattern (period) matches only individual characters in the current locale; it does not match encoding errors. GNU grep has special treatment of files with encoding errors, or files containing NUL characters; see its binary-files option. You may also want to look at GNU grep's-H. -m. -n.-1. -L, and-v options When testing your script, it is a good idea to do the testing in a subdirectory devoted just to testing. This will reduce the likelihood of messing up your home directory or main development directory if your script goes haywire