Question
The Python code below does not satisfy all of the following requirements:
- Remove all punctuation and non-English words, then count the remaining words in the file.
- Use the words left after step 1 to build a word dictionary in which every word is unique (e.g. "But" and "but" should be treated as the same word).
- Count the number of distinct words in the dictionary.
- Display the words in the dictionary in alphabetical order.
- Select three sentences from the file, then use any POS tagging tool to identify the POS tags in the selected sentences.
Code:
import string
import nltk
from collections import OrderedDict

# Download necessary NLTK data
# (newer NLTK releases may instead need 'averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')

# Define function to check if a word is English
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

def is_english(word):
    return word.lower() in english_vocab

try:
    with open(r'file_path', 'r') as f:
        text = f.read()

    # Preprocess: Remove punctuation and non-English words
    exclude = set(string.punctuation)
    text = ''.join(ch for ch in text if ch not in exclude and ch.isascii())
    words = text.split()
    words = [word for word in words if is_english(word)]

    # Count processed words and print
    total_processed_words = len(words)
    print(f"Total processed words: {total_processed_words}")

    # Build dictionary of unique words (lowercased, so "But" and "but" merge)
    word_dict = OrderedDict()
    for word in words:
        word_lower = word.lower()
        if word_lower not in word_dict:
            word_dict[word_lower] = 1
        else:
            word_dict[word_lower] += 1

    # Count distinct words and print
    distinct_word_count = len(word_dict)
    print(f"Number of distinct words: {distinct_word_count}")

    # Print words in alphabetical order
    print("\nWords in alphabetical order:")
    for word in sorted(word_dict):
        print(word)

    # Select sentences and POS tag
    # Replace these sentences with ones from your file if necessary
    sentences = [
        "from fairest creatures we desire increase that thereby beautys rose might never die",
        "when forty winters shall besiege thy brow and dig deep trenches in thy beautys field",
        "for where is she so fair whose uneared womb disdains the tillage of thy husbandry"
    ]
    for sentence in sentences:
        pos_tags = nltk.pos_tag(sentence.split())
        print("\nSentence:", sentence)
        print("POS Tags:", pos_tags)
except Exception as e:
    print("An error occurred:", e)
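The first four requirements could also be handled by one small helper, sketched below. This is only an illustration, not a definitive fix: the function name `build_word_dictionary` and the tiny `sample_vocab` set are made up for the example, and in practice the vocabulary would more likely come from `nltk.corpus.words.words()` as in the code above. Unlike the original, this sketch replaces punctuation with spaces rather than deleting it, so tokens joined only by punctuation do not fuse into a single non-word.

```python
import string
from collections import Counter


def build_word_dictionary(text, vocabulary):
    """Return (total_word_count, Counter of lowercased vocabulary words).

    `vocabulary` is any set of lowercase words treated as "English".
    (Hypothetical helper written for this example.)
    """
    # Replace each punctuation character with a space so that tokens such as
    # "rose,might" or "well-known" split apart instead of fusing together.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = text.translate(table).lower().split()

    # Keep only tokens found in the supplied vocabulary.
    kept = [tok for tok in tokens if tok in vocabulary]
    return len(kept), Counter(kept)


# Tiny hand-made vocabulary, for illustration only.
sample_vocab = {"but", "the", "rose", "never", "die", "might"}
total, counts = build_word_dictionary(
    "But the rose, but never... DIE! Zzyzx", sample_vocab
)

print("Total processed words:", total)
print("Number of distinct words:", len(counts))
for word in sorted(counts):
    print(word, counts[word])
```

Lowercasing before the lookup is what makes "But" and "but" count as one entry. One trade-off to be aware of: replacing the apostrophe with a space splits "beauty's" into "beauty" and "s", which may or may not be what the assignment intends.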
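For the last requirement, the three sentences could be taken from the file's own text instead of being typed in by hand. The sketch below is a rough illustration under stated assumptions: `select_sentences` and `tag_sentences` are hypothetical helper names, the sentence split is a naive regular expression rather than a real sentence tokenizer, and the tagger is passed in as a callable so that `nltk.pos_tag` (or any other tool) can be plugged in.

```python
import re
import string


def select_sentences(text, count=3):
    """Return the first `count` sentences found in `text` (naive split)."""
    # A sentence is assumed to end at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p][:count]


def tag_sentences(sentences, tagger):
    """Apply `tagger` (a callable such as nltk.pos_tag) to each sentence."""
    results = []
    for sentence in sentences:
        # Strip leading/trailing punctuation from each whitespace-separated token.
        tokens = [t.strip(string.punctuation) for t in sentence.split()]
        tokens = [t for t in tokens if t]
        results.append((sentence, tagger(tokens)))
    return results


sample_text = "Roses are red. Violets are blue! Is sugar sweet? So are you."
chosen = select_sentences(sample_text)
print(chosen)

# With NLTK installed and its tagger data downloaded, this could be:
#     import nltk
#     for sentence, tags in tag_sentences(chosen, nltk.pos_tag):
#         print(sentence, tags)
```

Selecting sentences has to happen on the raw text, before punctuation is removed, because the sentence boundaries are lost once the periods are stripped.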