Question
1. Build a 4-gram language model

Each student needs to collect an Arabic corpus of at least 100,000 words; the larger the corpus, the better. A bonus will be given if the corpus contains Arabic dialects. Students cannot use the same corpus, fully or partially. Write a program to tokenize the corpus into tokens/words, then build a 4-gram model for this corpus. That is, your language model is a table that contains the token, the token count, and the token probability. The language model should be saved in CSV format.

2. Develop a plagiarism detection interface

Develop a program (in Java) that uses your language model to compute a plagiarism score for a given sentence. In other words, the user can write a sentence in Arabic, and when the user clicks "Go", the program computes the probability of this sentence using the language model. This probability should be tuned to reflect a plagiarism score: the more similar a given sentence is (fully or partially) to sentences in the corpus, the higher the plagiarism score.

Example:

Submission: the corpus, the language model (.csv), the source code, and all files used to run the project. During the discussion, students will also be asked theoretical questions related to NLP.
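The two tasks above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the expected solution: the class name `FourGramModel` and all method names are hypothetical, the "token probability" is assumed to be the maximum-likelihood estimate count(w1 w2 w3 w4) / count(w1 w2 w3), and the plagiarism score is assumed to be the share of the sentence's 4-grams that occur in the model, which is a simple stand-in for a tuned sentence probability. The tokenizer splits on anything that is not a Unicode letter, mark, or digit, so Arabic words stay together with their diacritics. The "Go" button interface is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; names and scoring choices are assumptions.
class FourGramModel {
    static final int N = 4;

    /** Splits text on runs of characters that are not letters, marks, or digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{M}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t); // a leading separator yields an empty string
        }
        return tokens;
    }

    /** Counts every contiguous n-token window, keyed by the space-joined tokens. */
    static Map<String, Integer> countNgrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    /** Assumed estimate: P(w4 | w1 w2 w3) = count(w1..w4) / count(w1..w3); 0 if unseen. */
    static double probability(String fourGram, Map<String, Integer> fourGrams,
                              Map<String, Integer> triGrams) {
        int count = fourGrams.getOrDefault(fourGram, 0);
        if (count == 0) return 0.0;
        String history = fourGram.substring(0, fourGram.lastIndexOf(' '));
        return (double) count / triGrams.get(history);
    }

    /** Renders the model as CSV text with one row per distinct 4-gram. */
    static String toCsv(List<String> tokens) {
        Map<String, Integer> four = countNgrams(tokens, N);
        Map<String, Integer> tri = countNgrams(tokens, N - 1);
        StringBuilder sb = new StringBuilder("ngram,count,probability\n");
        for (Map.Entry<String, Integer> e : four.entrySet()) {
            sb.append(String.format(Locale.ROOT, "%s,%d,%.6f\n",
                    e.getKey(), e.getValue(), probability(e.getKey(), four, tri)));
        }
        return sb.toString();
    }

    /** Assumed score: fraction of the sentence's 4-grams that appear in the model. */
    static double plagiarismScore(String sentence, Map<String, Integer> fourGrams) {
        int total = 0;
        int seen = 0;
        for (Map.Entry<String, Integer> e : countNgrams(tokenize(sentence), N).entrySet()) {
            total += e.getValue();
            if (fourGrams.containsKey(e.getKey())) seen += e.getValue();
        }
        return total == 0 ? 0.0 : (double) seen / total; // fewer than 4 tokens scores 0
    }

    public static void main(String[] args) {
        // Latin letters stand in for the Arabic corpus text here.
        List<String> tokens = tokenize("a b c d e a b c d f");
        System.out.print(toCsv(tokens));
        System.out.println(plagiarismScore("a b c d x", countNgrams(tokens, N)));
    }
}
```

In a full program the CSV text would be written to disk in UTF-8 (for example with `java.nio.file.Files.writeString`, available since Java 11), and `plagiarismScore` would be called from the "Go" button handler. Because the tokenizer strips punctuation, including the Arabic comma, the n-gram keys never contain a comma and the rows need no CSV quoting.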