In natural language processing, it is important to identify phrases from text By considering phrases as word sequences of fixed order that are frequent in the corpus, one can apply the sequential pattern mining algorithm GSP, Prefix Span or SPADE to solve the problem In this assignment, you will be given raw text sentences and you need to implement GSP or other equivalent sequential pattern mining algorithm to find frequent phrases from the text You can assume that the input is in English only As the first step, please remove less important words (stop words) from the sentences You can use the english stopwords list from below a an are as at by be for from has he in is it its of on that the to was were will with Then, convert all words to lower case to remove ambiguity Now, start building the sequence database Specifically, process conjunction joiner ( and ) by grouping joined words into itemsets That is, for example, make A, B and C into (A, B, C) in sequence transaction Input Format The input dataset is raw text sentences The first line of the input corresponds to the minimum support Each following line of the input corresponds to one sentence Words in each transaction are seperated by a space Please refer to the sample input below In sample input 0, the minimum support is 3, and the dataset contains 3 sentences and 5 words (a, b, c, d, e, f and g) Constraints NA Output Format The output are the longest length frequent phrases you mined out from the input dataset This means that you have to report the phrase with the longest length which satisfies the minimum support criterion Each line of the output should be in the format Support frequent phrase Support frequent phrase The frequent phrases should be ordered according to their support from largest to smallest Ties should be resolved by ordering the frequent phrases according to the alphabetical order Please refer to the sample output below In sample output 1, the four phrases are the sequential frequent phrases calcualted after converting b and c into (bc) Sample Input 0 3 b and c d and e f b c d e f b and c d b e g f Sample Output 0 3 b d f 3 b e f 3 c d f 3 c e f Sample Input 1 4 Clustering and classification are important problems in machine learning There are many machine learning algorithms for classification and clustering problems Classification problems require training data Most clustering problems require user specified group number SVM, LogisticRegression and NaiveBayes are machine learning algorithms for classification problems k means, AGNES and DBSCAN are clustering algorithms Dimension reduction methods such as PCA are also learning algorithms for clustering problems Sample Output 1 4 classification problems 4 clustering problems

The Answer is in the image, click to view ...

Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Oct 07, 2024

In natural language processing, it is important to identify phrases from text. By considering phrases as word sequences of fixed order that are frequent in

In natural language processing, it is important to identify "phrases" from text. By considering phrases as word sequences of fixed order that are frequent in the corpus, one can apply the sequential pattern mining algorithm GSP, Prefix-Span or SPADE to solve the problem.

In this assignment, you will be given raw text sentences and you need to implement GSP or other equivalent sequential pattern mining algorithm to find frequent phrases from the text.

You can assume that the input is in English only. As the first step, please remove less important words (stop words) from the sentences. You can use the english stopwords list from below.

a an are as at by be for from has he in is it its of on that the to was were will with

Then, convert all words to lower case to remove ambiguity. Now, start building the sequence database. Specifically, process conjunction joiner ("and") by grouping joined words into itemsets. That is, for example, make "A, B and C" into (A, B, C) in sequence transaction.

Input Format

The input dataset is raw text sentences.

The first line of the input corresponds to the minimum support.

Each following line of the input corresponds to one sentence. Words in each transaction are seperated by a space.

Please refer to the sample input below. In sample input 0, the minimum support is 3, and the dataset contains 3 sentences and 5 words (a, b, c, d, e, f and g).

Constraints

Output Format

The output are the longest length frequent phrases you mined out from the input dataset. This means that you have to report the phrase with the longest length which satisfies the minimum support criterion.

Each line of the output should be in the format:

Support [frequent phrase] Support [frequent phrase] ......

The frequent phrases should be ordered according to their support from largest to smallest. Ties should be resolved by ordering the frequent phrases according to the alphabetical order.

Please refer to the sample output below. In sample output 1, the four phrases are the sequential frequent phrases calcualted after converting b and c into (bc).

Sample Input 0

3 b and c d and e f b c d e f b and c d b e g f

Sample Output 0

3 [b d f] 3 [b e f] 3 [c d f] 3 [c e f]

Sample Input 1

4 Clustering and classification are important problems in machine learning. There are many machine learning algorithms for classification and clustering problems. Classification problems require training data. Most clustering problems require user-specified group number. SVM, LogisticRegression and NaiveBayes are machine learning algorithms for classification problems. k-means, AGNES and DBSCAN are clustering algorithms. Dimension reduction methods such as PCA are also learning algorithms for clustering problems.

Sample Output 1

4 [classification problems] 4 [clustering problems]

Step by Step Solution

There are 3 Steps involved in it

Step: 1

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

Step: 3

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Concepts of Database Management

Authors: Philip J. Pratt, Mary Z. Last

8th edition

How do contemporary North African writers address the themes of identity and migration in their works, and what impact do these themes have on the portrayal of North African societies?

Answered: 1 week ago

Previous Question Next Question