Question

1 Approved Answer

Posted on Aug 10, 2024

13-31. Consider the SUPPLIER, PART, and SHIPMENT rela- tions and distributed database mentioned in the section on query optimization in this chapter. a. Write a

image text in transcribed

13-31. Consider the SUPPLIER, PART, and SHIPMENT rela- tions and distributed database mentioned in the section on query optimization in this chapter. a. Write a global SQL query (submitted in Columbus) to display the Part Number and Color for every part that is not supplied by a supplier in Chicago. b. Design three alternative query-processing strategies for your answer to part a. c. Develop a table similar to Table 13-2 to compare the processing times for these three strategies. d. Which of your three strategies was best and why? Query Optimization With distributed databases, the response to a query may require the DBMS to assemble data from several different sites (although with location transparency, the user is unaware of this need). A major decision for the DBMS is how to process a query, which is affected by both the way a user formulates a query and the intelligence of the distributed DBMS to develop a sensible plan for processing. Date (2003) provides an excellent yet simple example of this problem. Consider the following situation adapted from Date. A simplified procurement database has the following three tables: Supplier_T(SupplierNumber,City) Part_T(PartNumber, Color) Shipment_T(Supplier Number PartNumber) 10,000 records, stored in Detroit 100,000 records, stored in Chicago 1,000,000 records, stored in Detroit A query, written in SQL, is made to list the supplier numbers for Cleveland sup- pliers of red parts: SELECT FROM Supplier_T.SupplierNumber Supplier_T, Shipment_T, Part_T Supplier_T.City = 'Cleveland' Shipment_T.PartNumber = Part_T.PartNumber Part_T. Color = 'Red'; AND AND Each record in each relation is 100 characters long, and there are 10 red parts, a history of 100,000 shipments from Cleveland, and a negligible query computation time compared with communication time. Also, there is a communication system with a data transmission rate of 10,000 characters per second and 1-second access delay to send a message from one node to another. These data rates and times are quite slow compared to the modern standards, but they are still useful for illustrating the drastic differences between different query processing strategies. Date (2003) identifies six plausible strategies for this situation and develops the associated communication times; these strategies and times are summarized in Table 13-2. Depending on the choice of strategy, the time required to satisfy the query ranges from 1 second to 2.3 days! Although the last strategy is best, the fourth strategy is also acceptable. The technology described in Date's article is somewhat dated, but the strategies and the relative times are still valid In general, this example indicates that it is often advisable to break a query in a distributed database environment into components that are isolated at different sites, determine which site has the potential to yield the fewest qualified records, and then move this result to another site where additional work is performed. Obviously, more than two sites require even more complex analyses and more complicated heuristics to guide query processing. A distributed DBMS typically uses the following three steps to develop a query processing plan (zsu and Valduriez, 1992): 1. Query decomposition in this step, the query is simplified and rewritten into a structured, relational algebra form. 2. Data localization Here, the query is transformed from a query referencing data across the network as if the database were in one location into one or more frag- ments that each explicitly reference data at only one site. 3. Global optimization In this final step, decisions are made about the order in which to execute query fragments, where to move data between sites, and where parts of the query will be executed. TABLE 13-2 Query-Processing Strategies in a Distributed Database Environment Method Time Move PART relation to Detroit and process whole query at Detroit computer. 18.7 minutes Move SUPPLIER and SHIPMENT relations to Chicago and process whole 28 hours query at Chicago computer. Join SUPPLIER and SHIPMENT at the Detroit computer, PROJECT these 2.3 days down to only tuples for Cleveland suppliers, and then for each of these check at the Chicago computer to determine if the associated PART is red. PROJECT PART at the Chicago computer down to just the red items, and 20 seconds for each check at the Detroit computer to see if there is some SHIPMENT involving that PART and a Cleveland SUPPLIER. JOIN SUPPLIER and SHIPMENT at the Detroit computer, PROJECT just 16.7 minutes SupplierNumber and PartNumber for only Cleveland SUPPLIERs, and move this qualified projection to Chicago for matching with red PARTS. Select just red PARTs at the Chicago computer and move the result 1 second to Detroit for matching with Cleveland SUPPLIERS. Source: Based on Date (2003). FIGURE 13-11 Distributed database, with one table at each of two sites Site 1 Site 2 Customer_T table CustNo 10 bytes CustName 50 bytes ZipCode 10 bytes SIC 5 bytes 10,000 rows Order_T table OrderNo CustNo Order Date OrderAmount 400,000 rows 10 bytes 10 bytes 4 bytes 6 bytes Certainly, the design of the database interacts with the sophistication of the distributed DBMS to yield the performance for queries. A distributed database will be designed based on the best possible understanding of how and where the data will be used. Given the database design (which allocates data partitions to one or more sites), however, all queries, whether anticipated or not, must be processed as efficiently as possible. One technique used to make processing a distributed query more efficient is to use a semijoin operation (Elmasri and Navathe, 2006). In a semijoin, only the joining attribute is sent from one site to another, and then only the required rows are returned. If only a small percentage of the rows participate in the join, the amount of data being transferred is minimal. For example, consider the distributed database in Figure 13-11. Suppose that a query at site 1 asks to display the CustName, SIC, and Order Date for all customers in a particular ZipCode range and an Order Amount above a specified limit. Assume that 10 percent of the customers fall in the ZipCode range and 2 percent of the orders are above the amount limit. A semijoin would work as follows: Semijoin A joining operation used with distributed databases in which only the joining attribute from one site is transmitted to the other site, rather than all the selected attributes from every qualified row. 1. A query is executed at site 1 to create a list of the CustNo values in the desired Zip- Code range. So 10 percent of 10,000 customers-1,000 rows satisfy the ZipCode qualification. Thus, 1,000 rows of 10 bytes each for the CustNo attribute (the join- ing attribute), or 10,000 bytes, will be sent to site 2. 2. A query is executed at site 2 to create a list of the CustNo and Order Date values to be sent back to site 1 to compose the final result. If you assume roughly the same number of orders for each customer, then 40,000 rows of the Order table will match with the customer numbers sent from site 1. If you assume that any customer order is equally likely to be above the limit, then 800 (2 percent of 40,000) of the Order table rows are relevant to this query. For each row, the CustNo and Order Date need to be sent to site 1, or 14 bytes x 800 rows, thus 11,200 bytes. The total data transferred is only 21,200 bytes, using the semijoin just described. Com- pare this total to simply sending the subset of each table needed at one site to the other site: To send data from site 1 to site 2 would require sending the CustNo, CustName, and SIC (65 bytes) for 1,000 rows of the Customer table (65,000 bytes) to site 2. To send data from site 2 to site 1 would require sending CustNo and Order Date (14 bytes) for 8,000 rows of the Order table (112,000 bytes). Clearly, the semijoin approach saves network traffic, which can be a major contributing factor to the overall time to respond to a user's query. A distributed DBMS uses a cost model to predict the execution time (for data processing and transmission) of alternative execution plans. The cost model is performed before the query is executed based on general network conditions, consequently, the actual cost may be more or less, depending on the actual network and node loads, database reorganizations, and other dynamic factors. Thus, the parameters of the cost model should be periodically updated as general conditions change in the network (e.g., as local databases are redesigned, network paths are changed, and DBMSs at local sites are replaced). 13-31. Consider the SUPPLIER, PART, and SHIPMENT rela- tions and distributed database mentioned in the section on query optimization in this chapter. a. Write a global SQL query (submitted in Columbus) to display the Part Number and Color for every part that is not supplied by a supplier in Chicago. b. Design three alternative query-processing strategies for your answer to part a. c. Develop a table similar to Table 13-2 to compare the processing times for these three strategies. d. Which of your three strategies was best and why? Query Optimization With distributed databases, the response to a query may require the DBMS to assemble data from several different sites (although with location transparency, the user is unaware of this need). A major decision for the DBMS is how to process a query, which is affected by both the way a user formulates a query and the intelligence of the distributed DBMS to develop a sensible plan for processing. Date (2003) provides an excellent yet simple example of this problem. Consider the following situation adapted from Date. A simplified procurement database has the following three tables: Supplier_T(SupplierNumber,City) Part_T(PartNumber, Color) Shipment_T(Supplier Number PartNumber) 10,000 records, stored in Detroit 100,000 records, stored in Chicago 1,000,000 records, stored in Detroit A query, written in SQL, is made to list the supplier numbers for Cleveland sup- pliers of red parts: SELECT FROM Supplier_T.SupplierNumber Supplier_T, Shipment_T, Part_T Supplier_T.City = 'Cleveland' Shipment_T.PartNumber = Part_T.PartNumber Part_T. Color = 'Red'; AND AND Each record in each relation is 100 characters long, and there are 10 red parts, a history of 100,000 shipments from Cleveland, and a negligible query computation time compared with communication time. Also, there is a communication system with a data transmission rate of 10,000 characters per second and 1-second access delay to send a message from one node to another. These data rates and times are quite slow compared to the modern standards, but they are still useful for illustrating the drastic differences between different query processing strategies. Date (2003) identifies six plausible strategies for this situation and develops the associated communication times; these strategies and times are summarized in Table 13-2. Depending on the choice of strategy, the time required to satisfy the query ranges from 1 second to 2.3 days! Although the last strategy is best, the fourth strategy is also acceptable. The technology described in Date's article is somewhat dated, but the strategies and the relative times are still valid In general, this example indicates that it is often advisable to break a query in a distributed database environment into components that are isolated at different sites, determine which site has the potential to yield the fewest qualified records, and then move this result to another site where additional work is performed. Obviously, more than two sites require even more complex analyses and more complicated heuristics to guide query processing. A distributed DBMS typically uses the following three steps to develop a query processing plan (zsu and Valduriez, 1992): 1. Query decomposition in this step, the query is simplified and rewritten into a structured, relational algebra form. 2. Data localization Here, the query is transformed from a query referencing data across the network as if the database were in one location into one or more frag- ments that each explicitly reference data at only one site. 3. Global optimization In this final step, decisions are made about the order in which to execute query fragments, where to move data between sites, and where parts of the query will be executed. TABLE 13-2 Query-Processing Strategies in a Distributed Database Environment Method Time Move PART relation to Detroit and process whole query at Detroit computer. 18.7 minutes Move SUPPLIER and SHIPMENT relations to Chicago and process whole 28 hours query at Chicago computer. Join SUPPLIER and SHIPMENT at the Detroit computer, PROJECT these 2.3 days down to only tuples for Cleveland suppliers, and then for each of these check at the Chicago computer to determine if the associated PART is red. PROJECT PART at the Chicago computer down to just the red items, and 20 seconds for each check at the Detroit computer to see if there is some SHIPMENT involving that PART and a Cleveland SUPPLIER. JOIN SUPPLIER and SHIPMENT at the Detroit computer, PROJECT just 16.7 minutes SupplierNumber and PartNumber for only Cleveland SUPPLIERs, and move this qualified projection to Chicago for matching with red PARTS. Select just red PARTs at the Chicago computer and move the result 1 second to Detroit for matching with Cleveland SUPPLIERS. Source: Based on Date (2003). FIGURE 13-11 Distributed database, with one table at each of two sites Site 1 Site 2 Customer_T table CustNo 10 bytes CustName 50 bytes ZipCode 10 bytes SIC 5 bytes 10,000 rows Order_T table OrderNo CustNo Order Date OrderAmount 400,000 rows 10 bytes 10 bytes 4 bytes 6 bytes Certainly, the design of the database interacts with the sophistication of the distributed DBMS to yield the performance for queries. A distributed database will be designed based on the best possible understanding of how and where the data will be used. Given the database design (which allocates data partitions to one or more sites), however, all queries, whether anticipated or not, must be processed as efficiently as possible. One technique used to make processing a distributed query more efficient is to use a semijoin operation (Elmasri and Navathe, 2006). In a semijoin, only the joining attribute is sent from one site to another, and then only the required rows are returned. If only a small percentage of the rows participate in the join, the amount of data being transferred is minimal. For example, consider the distributed database in Figure 13-11. Suppose that a query at site 1 asks to display the CustName, SIC, and Order Date for all customers in a particular ZipCode range and an Order Amount above a specified limit. Assume that 10 percent of the customers fall in the ZipCode range and 2 percent of the orders are above the amount limit. A semijoin would work as follows: Semijoin A joining operation used with distributed databases in which only the joining attribute from one site is transmitted to the other site, rather than all the selected attributes from every qualified row. 1. A query is executed at site 1 to create a list of the CustNo values in the desired Zip- Code range. So 10 percent of 10,000 customers-1,000 rows satisfy the ZipCode qualification. Thus, 1,000 rows of 10 bytes each for the CustNo attribute (the join- ing attribute), or 10,000 bytes, will be sent to site 2. 2. A query is executed at site 2 to create a list of the CustNo and Order Date values to be sent back to site 1 to compose the final result. If you assume roughly the same number of orders for each customer, then 40,000 rows of the Order table will match with the customer numbers sent from site 1. If you assume that any customer order is equally likely to be above the limit, then 800 (2 percent of 40,000) of the Order table rows are relevant to this query. For each row, the CustNo and Order Date need to be sent to site 1, or 14 bytes x 800 rows, thus 11,200 bytes. The total data transferred is only 21,200 bytes, using the semijoin just described. Com- pare this total to simply sending the subset of each table needed at one site to the other site: To send data from site 1 to site 2 would require sending the CustNo, CustName, and SIC (65 bytes) for 1,000 rows of the Customer table (65,000 bytes) to site 2. To send data from site 2 to site 1 would require sending CustNo and Order Date (14 bytes) for 8,000 rows of the Order table (112,000 bytes). Clearly, the semijoin approach saves network traffic, which can be a major contributing factor to the overall time to respond to a user's query. A distributed DBMS uses a cost model to predict the execution time (for data processing and transmission) of alternative execution plans. The cost model is performed before the query is executed based on general network conditions, consequently, the actual cost may be more or less, depending on the actual network and node loads, database reorganizations, and other dynamic factors. Thus, the parameters of the cost model should be periodically updated as general conditions change in the network (e.g., as local databases are redesigned, network paths are changed, and DBMSs at local sites are replaced)