After reviewing Ch. 5, Predictive Analytics II: Text, Web, and Social Media Analytics, Application Case 5.7, "Understanding Why Customers Abandon Shopping Carts Results in a $10 Million Sales Increase": What are your main takeaways from this reading?

5.7 Web Mining Overview

The Internet has changed the landscape for conducting business forever. Because of the highly connected, flattened world and broadened competition field, today's companies are increasingly facing greater opportunities (being able to reach customers and markets that they may have never thought possible) and bigger challenges (a globalized and ever-changing competitive marketplace). Those with the vision and capabilities to deal with such a volatile environment are greatly benefiting from it, whereas others who resist adapting are having a hard time surviving. Having an engaged presence on the Internet is not a choice anymore; it is a business requirement. Customers are expecting companies to offer their products and/or services over the Internet. They are not only buying products and services but also talking about companies and sharing their transactional and usage experiences with others over the Internet.

The growth of the Internet and its enabling technologies has made data creation, data collection, and data/information/opinion exchange easier. Delays in service, manufacturing, shipping, delivery, and customer inquiries are no longer private incidents that are accepted as necessary evils. Now, thanks to social media tools and technologies on the Internet, everybody knows everything. Successful companies are the ones who embrace these Internet technologies and use them for the betterment of their business processes so that they can better communicate with their customers, understand their needs and wants, and serve them thoroughly and expeditiously. Being customer focused and keeping customers happy have never been as important for businesses as they are now, in this age of the Internet and social media.
The World Wide Web (or, for short, the Web) serves as an enormous repository of data and information on virtually everything one can conceive: business, personal, you name it; an abundant amount of it is there. The Web is perhaps the world's largest data and text repository, and the amount of information on the Web is growing rapidly. A lot of interesting information can be found online: whose home page is linked to which other pages, how many people have links to a specific Web page, and how a particular site is organized. In addition, each visitor to a Web site, each search on a search engine, each click on a link, and each transaction on an e-commerce site creates additional data. Although unstructured textual data in the form of Web pages coded in HTML or XML is the dominant content of the Web, the Web infrastructure also contains hyperlink information (connections to other Web pages) and usage information (logs of visitors' interactions with Web sites), all of which provide rich data for knowledge discovery. Analysis of this information can help us make better use of Web sites and also aid us in enhancing relationships and value for the visitors to our own Web sites.

Because of its sheer size and complexity, mining the Web is not an easy undertaking by any means. The Web also poses great challenges for effective and efficient knowledge discovery (Han & Kamber, 2006):

The Web is too big for effective data mining. The Web is so large and growing so rapidly that it is difficult to even quantify its size. Because of the sheer size of the Web, it is not feasible to set up a data warehouse to replicate, store, and integrate all of the data on the Web, making data collection and integration a challenge.

The Web is too complex. The complexity of a Web page is far greater than that of a page in a traditional text document collection. Web pages lack a unified structure. They contain far more authoring style and content variation than any set of books, articles, or other traditional text-based
documents.

The Web is too dynamic. The Web is a highly dynamic information source. Not only does the Web grow rapidly, but also its content is constantly being updated. Blogs, news stories, stock market results, weather reports, sports scores, prices, company advertisements, and numerous other types of information are updated regularly on the Web.

The Web is not specific to a domain. The Web serves a broad diversity of communities and connects billions of workstations. Web users have very different backgrounds, interests, and usage purposes. Most users may not have good knowledge of the structure of the information network and may not be aware of the heavy cost of a particular search that they perform.

The Web has everything. Only a small portion of the information on the Web is truly relevant or useful to someone (or some task). It is said that 99% of the information on the Web is useless to 99% of Web users. Although this may not seem obvious, it is true that a particular person is generally interested in only a tiny portion of the Web, whereas the rest of the Web contains information that is uninteresting to the user and may swamp desired results. Finding the portion of the Web that is truly relevant to a person and the task being performed is a prominent issue in Web-related research.

These challenges have prompted many research efforts to enhance the effectiveness and efficiency of discovering and using data assets on the Web. A number of index-based Web search engines constantly search the Web and index Web pages under certain keywords. Using these search engines, an experienced user may be able to locate documents by providing a set of tightly constrained keywords or phrases. However, a simple keyword-based search engine suffers from several deficiencies. First, a topic of any breadth can easily contain hundreds or thousands of documents. This can lead to a large number of document entries returned by the search engine, many of which are marginally relevant to the topic. Second,
many documents that are highly relevant to a topic may not contain the exact keywords defining them. As we will cover in more detail later in this chapter, compared to keyword-based Web search, Web mining is a prominent (and more challenging) approach that can be used to substantially enhance the power of Web search engines because Web mining can identify authoritative Web pages, classify Web documents, and resolve many ambiguities and subtleties raised in keyword-based Web search engines.

Web mining (or Web data mining) is the process of discovering intrinsic relationships (i.e., interesting and useful information) from Web data, which are expressed in the form of textual, linkage, or usage information. The term Web mining was first used by Etzioni (1996); today, many conferences, journals, and books focus on Web data mining. It is a continually evolving area of technology and business practice. Web mining is essentially the same as data mining that uses data generated over the Web. The goal is to turn vast repositories of business transactions, customer interactions, and Web site usage data into actionable information (i.e., knowledge) to promote better decision making throughout the enterprise.

Because of the increased popularity of the term analytics, nowadays many have started to refer to Web mining as Web analytics. However, these two terms are not the same. Whereas Web analytics is primarily focused on Web site usage data, Web mining is inclusive of all data generated via the Internet, including transaction, social, and usage data. Where Web analytics aims to describe what has happened on the Web site (employing a predefined, metrics-driven descriptive analytics methodology), Web mining aims to discover previously unknown patterns and relationships (employing a novel predictive or prescriptive analytics methodology). From a big-picture perspective, Web analytics can be considered to be a part of Web mining.

Figure 5.11 presents a simple taxonomy of Web mining, where it is
divided into three main areas: Web content mining, Web structure mining, and Web usage mining. In the figure, the data sources used in these three main areas are also specified. Although these three areas are shown separately, as you will see in the following section, they are often used collectively and synergistically to address business problems and opportunities.

Figure 5.11 A Simple Taxonomy of Web Mining

As Figure 5.11 indicates, Web mining relies heavily on data mining and text mining and their enabling tools and techniques, which we have covered in detail earlier in this chapter and in the previous chapter (Chapter 4). The figure also indicates that these three generic areas are further extended into several very well-known application areas. Some of these areas were explained in the previous chapters, and some of the others will be covered in detail in this chapter.

Web Content and Web Structure Mining

Web content mining refers to the extraction of useful information from Web pages. The documents may be extracted in some machine-readable format so that automated techniques can extract some information from these Web pages. Web crawlers (also called spiders) are used to read through the content of a Web site automatically. The information gathered may include document characteristics similar to what is used in text mining, but it may also include additional concepts, such as the document hierarchy. Such an automated (or semiautomated) process of collecting and mining of Web content can be used for competitive intelligence (collecting intelligence about competitors' products, services, and customers). It can also be used for information/news/opinion collection and summarization, sentiment analysis, and automated data collection and structuring for predictive modeling.

As an illustrative example of using Web content mining as an automated data collection tool, consider the following. For more than 10 years now, two of the three authors of this
book (Drs. Sharda and Delen) have been developing models to predict the financial success of Hollywood movies before their theatrical release. The data that they use for training of the models come from several Web sites, each having a different hierarchical page structure. Collecting a large set of variables on thousands of movies (from the past several years) from these Web sites is a time-demanding, error-prone process. Therefore, they use Web content mining and spiders as an enabling technology to automatically collect, verify, validate (if the specific data item is available on more than one Web site, then the values are validated against each other and anomalies are captured and recorded), and store these values in a relational database. That way, they ensure the quality of the data while saving valuable time (days or weeks) in the process.

In addition to text, Web pages also contain hyperlinks pointing one page to another. Hyperlinks contain a significant amount of hidden human annotation that can potentially help to automatically infer the notion of centrality or authority. When a Web page developer includes a link pointing to another Web page, this may be regarded as the developer's endorsement of the other page. The collective endorsement of a given page by different developers on the Web may indicate the importance of the page and may naturally lead to the discovery of authoritative Web pages (Miller, 2005). Therefore, the vast amount of Web linkage information provides a rich collection of information about the relevance, quality, and structure of the Web's contents, and thus is a rich source for Web mining.

Web content mining can also be used to enhance the results produced by search engines. In fact, search is perhaps the most prevailing application of Web content mining and Web structure mining. A search on the Web to obtain information on a specific topic (presented as a collection of keywords or a sentence) usually returns a few relevant, high-quality Web pages
and a larger number of unusable Web pages. Use of a relevance index based on keywords and authoritative pages (or some measure of it) improves the search results and ranking of relevant pages. The idea of authority (or authoritative pages) stems from earlier information retrieval work using citations among journal articles to evaluate the impact of research papers (Miller, 2005). Though that was the origination of the idea, there are significant differences between the citations in research articles and hyperlinks on Web pages. First, not every hyperlink represents an endorsement (some links are created for navigation purposes and some are for paid advertisements). Although this is true, if the majority of the hyperlinks are of the endorsement type, then the collective opinion will still prevail. Second, for commercial and competitive interests, one authority will rarely have its Web page point to rival authorities in the same domain. For example, Microsoft may prefer not to include links on its Web pages to Apple's Web sites because this may be regarded as an endorsement of its competitor's authority. Third, authoritative pages are seldom particularly descriptive. For example, the main Web page of Yahoo! may not contain the explicit self-description that it is in fact a Web search engine.

The structure of Web hyperlinks has led to another important category of Web pages called a hub. A hub is one or more Web pages that provide a collection of links to authoritative pages. Hub pages may not be prominent, and only a few links may point to them; however, they provide links to a collection of prominent sites on a specific topic of interest. A hub could be a list of recommended links on an individual's home page, recommended reference sites on a course Web page, or a professionally assembled resource list on a specific topic. Hub pages play the role of implicitly conferring the authorities on a narrow field. In essence, a close symbiotic relationship exists between good hubs and
authoritative pages: a good hub is good because it points to many good authorities, and a good authority is good because it is being pointed to by many good hubs. Such relationships between hubs and authorities make it possible to automatically retrieve high-quality content from the Web.

The most popular publicly known and referenced algorithm used to calculate hubs and authorities is hyperlink-induced topic search (HITS). It was originally developed by Kleinberg (1999) and has since been improved on by many researchers. HITS is a link analysis algorithm that rates Web pages using the hyperlink information contained within them. In the context of Web search, the HITS algorithm collects a base document set for a specific query. It then recursively calculates the hub and authority values for each document. To gather the base document set, a root set that matches the query is fetched from a search engine. For each document retrieved, a set of documents that points to the original document and another set of documents that is pointed to by the original document are added to the set as the original document's neighborhood. A recursive process of document identification and link analysis continues until the hub and authority values converge. These values are then used to index and prioritize the document collection generated for a specific query.

Web structure mining is the process of extracting useful information from the links embedded in Web documents. It is used to identify authoritative pages and hubs, which are the cornerstones of the contemporary page rank algorithms that are central to popular search engines such as Google and Yahoo!. Just as links going to a Web page may indicate a site's popularity (or authority), links within the Web page (or the complete Web site) may indicate the depth of coverage of a specific topic. Analysis of links is very important in understanding the interrelationships among large numbers of Web pages, leading to a better understanding of a specific
Web community, clan, or clique.
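
The hub and authority calculation described in the reading can be illustrated with a small sketch. This is not the textbook's or Kleinberg's exact formulation: the function name hits, the links dictionary, and the sample page names are all hypothetical, and the step of gathering a root set and its neighborhood from a search engine is skipped by assuming the base document set is already available as a small link graph. The loop simply repeats the two updates (authority from incoming hubs, hub from outgoing authorities) until the normalized scores stop changing.

```python
import math


def hits(links, max_iter=100, tol=1e-8):
    """Iteratively compute hub and authority scores.

    links: dict mapping a page name to the list of pages it links to
    (a hypothetical, already-collected base document set).
    Returns (hub, auth), two dicts of L2-normalized scores.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    if not pages:
        return {}, {}

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(max_iter):
        # Authority update: sum the hub scores of the pages that link to p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum the authority scores of the pages that p links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}

        # Normalize so the scores stay bounded from one pass to the next.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {p: v / a_norm for p, v in new_auth.items()}
        new_hub = {p: v / h_norm for p, v in new_hub.items()}

        # Stop once neither score vector changes by more than tol.
        delta = max(
            max(abs(new_auth[p] - auth[p]) for p in pages),
            max(abs(new_hub[p] - hub[p]) for p in pages),
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break

    return hub, auth


# Hypothetical four-page graph: two list-style pages pointing at two content pages.
example_links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA"],
    "authA": [],
    "authB": [],
}
hub_scores, auth_scores = hits(example_links)
for page in sorted(auth_scores, key=auth_scores.get, reverse=True):
    print(page, round(auth_scores[page], 3), round(hub_scores[page], 3))
```

In this toy graph the page that both hubs point to ends up with the highest authority score, and the hub that points to both authorities ends up with the highest hub score, which mirrors the mutual reinforcement between good hubs and good authorities described in the reading.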

