Question
Task 1: Parallel Corpora
Parallel corpora contain a collection of texts in a given language and their translation to one or more other
languages. In this task, you will build a small parallel corpus using data from OpenSubtitles.org, a
database that allows you to search and download subtitles for various languages. It was previously used to
build the OpenSubtitles corpus, which consists of around billion sentences and covers languages.
Search for the film Monty Python and the Holy Grail on OpenSubtitles.org and download subtitles
for English, German, and a third language of your choosing. Open the files using a text editor (e.g. VS
Code) and familiarise yourself with the format. Your corpus will include sentences from a famous scene
that starts at :: (first English sentence: "Quiet! There are ways of telling whether she is a witch.")
and ends at :: (last English sentence: "knight of the Round Table."). Your goal is to clean up the
data, match subtitles in different languages and put the lines together, transforming them into the following
format:
line in English
line in German
line in chosen language
line in English
line in German
line in chosen language
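As a minimal sketch of the target layout, assuming the matched lines have already been collected as (English, German, third-language) tuples, the following Python helper emits them in the order shown above. The function name `format_corpus` is hypothetical and not part of the assignment. The sketch also assumes that triples follow one another with no separator line, as in the format listing.

```python
from typing import Iterable, Tuple


def format_corpus(triples: Iterable[Tuple[str, str, str]]) -> str:
    """Return corpus text with one line per language for each aligned triple.

    Hypothetical helper: assumes the alignment was already done by hand.
    """
    lines = []
    for english, german, other in triples:
        for text in (english, german, other):
            # Collapse line breaks and repeated spaces so that every
            # subtitle occupies exactly one line in the output.
            lines.append(" ".join(text.split()))
    return "\n".join(lines) + "\n"
```

The returned string can be written to disk with `open("grailcorpus.txt", "w", encoding="utf-8")`. Specifying UTF-8 explicitly helps keep German umlauts and other non-ASCII characters intact.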
You will see that this manual process is not feasible for greater amounts of data, and you will learn how to
automate a process like this later on in the course.
Save the created corpus as grailcorpus.txt and submit the file together with the assignment.
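Although the task is meant to be done by hand, a first automation step could look like the sketch below. It assumes the downloaded files use the common SubRip (.srt) layout: a cue number, a timing line such as `00:00:01,000 --> 00:00:02,500`, one or more text lines, then a blank line. The function names `parse_srt` and `cues_between` are hypothetical, and the scene boundaries are passed in as seconds because the exact timestamps depend on the subtitle file you download.

```python
import re
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)

# Matches an SRT timing line, e.g. "00:00:01,000 --> 00:00:02,500".
TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text: str) -> List[Cue]:
    """Parse SRT text into cues, joining multi-line subtitles into one line."""
    cues: List[Cue] = []
    normalised = text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIME_RE.search(line)
            if match is None:
                continue
            h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
            start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
            end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
            body = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
            # Drop simple markup such as <i>...</i> used for italics.
            body = re.sub(r"<[^>]+>", "", body)
            cues.append((start, end, body))
            break  # only one timing line per cue
    return cues


def cues_between(cues: List[Cue], start: float, end: float) -> List[Cue]:
    """Keep cues whose start time lies within [start, end] (in seconds)."""
    return [cue for cue in cues if start <= cue[0] <= end]
```

Timings and line splits usually differ between subtitle files, so selecting cues by time only narrows down the scene in each language. Matching the English, German, and third-language lines to each other still needs a manual check.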