Question

Problem 3: Tokenization

If your regular expression is right, all the tests defined below should pass:

init_test()

TEST_EXAMPLES = (
    ('This is a test!', ['This', 'is', 'a', 'test', '!']),
    ('Is this a test?', ['Is', 'this', 'a', 'test', '?']),
    ("I don't think this is a test", ['I', 'do', "n't", 'think', 'this', 'is', 'a', 'test']),
    ("Thy phi c ca ti ly ca ln",
     ['Thy', 'phi', 'c','ca','ti','l','y','','ca','ln']),
    ("Is it legal to shout ``Fire!'' in a crowded theater?",
     ['Is', 'it', 'legal', 'to', 'shout', '``', 'Fire', '!', "''", 'in', 'a', 'crowded', 'theater', '?']),
    ("The word 'very' is very over-used",
     ['The', 'word', "'", 'very', "'", 'is', 'very', 'over', '-', 'used']),
    ("I don't think we'll've been there yet",
     ['I', 'do', "n't", 'think', 'we', "'ll", "'ve", 'been', 'there', 'yet']),
    ("Give me 12 apples, please", ['Give', 'me', '12', 'apples', ',', 'please']),
    ("A 20% tip on a $30 tab is 6 dollars",
     ['A', '20%', 'tip', 'on', 'a', '$30', 'tab', 'is', '6', 'dollars']),
    ("They're going to pay us 10% of $120,000 by Jun 4, 2021",
     ['They', "'re", 'going', 'to', 'pay', 'us', '10%', 'of', '$120,000', 'by', 'Jun', '4', ',', '2021']),
)

@pytest.mark.parametrize('text, toks', TEST_EXAMPLES)
def test_tokenizer(text, toks):
    assert tokenize(text) == toks

run_test()
Modify this expression so that it meets the following additional requirements:
the punctuation marks `` and '' (left double apostrophe and right double apostrophe) should be single tokens
like n't, the contractions 've, 'll, 're, and 's should be separate tokens
numbers should be separate tokens, where:
a number may start with $ or end with %
a number may start with or contain a comma but may not end with one (technically, number tokens shouldn't start with a comma but it's okay if your transducer allows it)
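
The number requirements above can be pinned down with an ordinary regular expression before encoding them as a transducer. The sketch below uses Python's standard `re` module rather than the FST formalism the problem asks for, and the names `NUMBER` and `find_numbers` are hypothetical. It assumes the stricter reading in which a number neither starts nor ends with a comma.

```python
import re

# Hypothetical helper, not part of the original problem.
# Optional leading $, one or more digit groups joined by single commas,
# optional trailing %. A comma is consumed only when digits follow it,
# so a number can never end with a comma.
NUMBER = re.compile(r"\$?\d+(?:,\d+)*%?")


def find_numbers(text):
    """Return every substring of text that matches the number pattern."""
    return NUMBER.findall(text)


if __name__ == "__main__":
    print(find_numbers("They're going to pay us 10% of $120,000 by Jun 4, 2021"))
```

In the last test sentence the comma in `$120,000` stays inside the number, while the comma after `4` is left out because a space follows it rather than a digit.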
from pyfoma import FST

tok_patterns = {}
# insert spaces before and after punctuation
tok_patterns['punct'] = FST.re(r"$^rewrite('':' ' [!?,] '':' ')")
# insert space before n't
tok_patterns['contract'] = FST.re(r"$^rewrite('':' ' n\'t)")
tokenizer = FST.re("$punct @ $contract", tok_patterns)

def tokenize(s):
    # collect every output string the transducer produces for s
    s = list(tokenizer.generate(s))
    # accept only an unambiguous result, then split on the inserted spaces
    if len(s) == 1:
        return s[0].split()
    else:
        return None
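
The starter code follows an "insert spaces, then split" strategy: each rewrite rule adds spaces around the material that should become its own token, and `tokenize` splits the result on whitespace. A minimal sketch of that same strategy with Python's `re.sub` is shown below. It is only an analogy, assuming the two starter rules behave as their comments describe, and `tokenize_by_spacing` is a hypothetical name.

```python
import re


def tokenize_by_spacing(s):
    """Imitate the two starter rules with re.sub, then split on whitespace."""
    # Mirrors the 'punct' rule: put a space on both sides of ! ? ,
    s = re.sub(r"([!?,])", r" \1 ", s)
    # Mirrors the 'contract' rule: put a space before n't
    s = re.sub(r"n't", " n't", s)
    # str.split() with no arguments collapses runs of whitespace
    return s.split()


if __name__ == "__main__":
    print(tokenize_by_spacing("I don't think this is a test"))
    print(tokenize_by_spacing("pay $120,000"))
```

With only these two rules, `pay $120,000` comes out as `['pay', '$120', ',', '000']`, because the comma rule fires inside the number. That is the behavior the additional number requirements are meant to rule out.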