Question

SSE/AVX/AVX512 programming: write assembly statements for overlapping two images of equal size to produce a new image.

See the attached figure, which shows four steps that overlap two images (B and E) to produce G. Given two images, B and E, you are asked to produce image G with no blue background. Assume each image is 1K x 1K pixels, or 1 MB, where each pixel is one byte representing one of 256 colors. For a 1024 * 1024 (1 MB) image, a simple-minded loop would require about 1 million iterations (1K * 1K), with each iteration working on one pixel. For large images, this simple-minded loop is prohibitively expensive. Besides CUDA and OpenCL for GPGPU programming, one solution to this expensive computation is the SSE/AVX/AVX512 instructions, which allow one instruction to operate on 16, 32, or 64 bytes at the same time. With SSE/AVX/AVX512, a 1024-byte row of the 1024 * 1024 (1 MB) image can be processed in 64, 32, or 16 iterations, respectively. The entire image therefore requires only 64K, 32K, or 16K iterations instead of 1 million.
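The iteration counts quoted above follow from dividing the image size by the number of bytes handled per instruction. A minimal sketch of that arithmetic in C follows; the function name `iterations_needed` is hypothetical and not part of the original question.

```c
#include <stddef.h>

/* Hypothetical helper: number of loop iterations needed to cover
   total_bytes when each iteration handles bytes_per_iter bytes.
   Rounds up so that a partial final chunk still counts as one iteration. */
size_t iterations_needed(size_t total_bytes, size_t bytes_per_iter)
{
    return (total_bytes + bytes_per_iter - 1) / bytes_per_iter;
}
```

For a 1 MB image (1024 * 1024 bytes), passing 1, 16, 32, and 64 as `bytes_per_iter` gives 1048576, 65536, 32768, and 16384 iterations, which matches the 1 million, 64K, 32K, and 16K figures in the question.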

The assembly instructions required for this problem are variations of PCMPEQ, ANDP, ANDNP, and POR, which are documented in the Intel manual (https://software.intel.com/en-us/articles/intel-sdm#combined). For example, the AVX512 instructions to overlap the two images are variations of the following:

 VPCMPEQD zmm1, zmm2, zmm3/m512 -> cmp
 VANDPS zmm1, zmm2, zmm3/m512 -> and
 VANDNPS zmm1, zmm2, zmm3/m512 -> !and
 VPORQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst -> or
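To make the role of the four operations concrete, here is a portable, one-pixel model in C of a compare / and / and-not / or sequence. It is a sketch under assumptions, since the figure is not available here: it assumes the goal is to replace every blue pixel of B with the corresponding pixel of E, and the names `cmpeq_byte`, `overlay_pixel`, and `key` are hypothetical. A SIMD instruction would apply the same logic to 16, 32, or 64 bytes at once.

```c
#include <stdint.h>

/* Per-byte model of a packed byte compare-equal:
   all ones (0xFF) where the bytes match, all zeros otherwise. */
uint8_t cmpeq_byte(uint8_t a, uint8_t b)
{
    return (a == b) ? 0xFF : 0x00;
}

/* Hypothetical one-pixel model of the four steps.
   b   : pixel from image B (may be the blue background)
   e   : pixel from image E
   key : background color to remove (assumed to be blue) */
uint8_t overlay_pixel(uint8_t b, uint8_t e, uint8_t key)
{
    uint8_t mask   = cmpeq_byte(b, key);      /* cmp : mark background pixels   */
    uint8_t from_e = mask & e;                /* and : keep E where B is blue   */
    uint8_t from_b = (uint8_t)(~mask) & b;    /* !and: keep B where B not blue  */
    return (uint8_t)(from_e | from_b);        /* or  : merge the two selections */
}
```

With the color map from the question, `overlay_pixel(0x03, 0xE0, 0x03)` yields 0xE0 (the blue background is replaced by the red pixel from E), while `overlay_pixel(0xFC, 0xE0, 0x03)` yields 0xFC (the yellow pixel from B is kept).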

Assuming

image A is loaded in xmm/ymm/zmm1

image B is loaded in xmm/ymm/zmm3

image E is loaded in xmm/ymm/zmm4

and the 8-bit, 256-color map below:

red is 0xE0,

green is 0x1C,

blue is 0x03,

yellow is 0xFC,

white is 0xFF,

black is 0x00,

write four assembly instructions for

128-bit SSE using xmm registers

256-bit AVX using ymm registers

512-bit AVX512 using zmm registers
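The color values listed above are consistent with a 3-3-2 bit layout (three bits of red, three of green, two of blue). The question does not state this, so treat it as an assumption. A minimal C sketch with a hypothetical helper, `pack_rgb332`, reproduces the listed values:

```c
#include <stdint.h>

/* Hypothetical helper, assuming a 3-3-2 layout: RRRGGGBB.
   r and g are 3-bit intensities (0..7); b is a 2-bit intensity (0..3). */
uint8_t pack_rgb332(unsigned r, unsigned g, unsigned b)
{
    return (uint8_t)(((r & 7u) << 5) | ((g & 7u) << 2) | (b & 3u));
}
```

Under this assumption, full red `pack_rgb332(7, 0, 0)` is 0xE0, full green `pack_rgb332(0, 7, 0)` is 0x1C, full blue `pack_rgb332(0, 0, 3)` is 0x03, and yellow (red plus green) `pack_rgb332(7, 7, 0)` is 0xFC.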

The four assembly instructions form the body of a loop that iterates 64K, 32K, or 16K times, depending on whether SSE, AVX, or AVX512 is used. You may use register xmm/ymm/zmm2 to hold a temporary value.
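As an illustration of how such a loop body sits inside the surrounding loop, here is a sketch in C using SSE2 intrinsics, each of which corresponds to one 128-bit instruction. It is not the expected assembly answer. It assumes the blue pixels of B are to be replaced by the matching pixels of E, that the buffer length is a multiple of 16 (true for a 1 MB image), and that the code is built for an x86 target with SSE2. The function name `overlay_sse2` and its parameters are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: g[i] = (b[i] == key) ? e[i] : b[i], 16 pixels per
   iteration. Assumes n is a multiple of 16; any remainder is not processed. */
void overlay_sse2(const uint8_t *b, const uint8_t *e, uint8_t *g,
                  size_t n, uint8_t key)
{
    /* Broadcast the background color into all 16 byte lanes. */
    const __m128i vkey = _mm_set1_epi8((char)key);

    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i ve = _mm_loadu_si128((const __m128i *)(e + i));

        __m128i mask   = _mm_cmpeq_epi8(vb, vkey);    /* PCMPEQB: 0xFF where B is background */
        __m128i from_e = _mm_and_si128(mask, ve);     /* PAND   : E pixels under the mask    */
        __m128i from_b = _mm_andnot_si128(mask, vb);  /* PANDN  : (~mask) & B                */
        __m128i vg     = _mm_or_si128(from_e, from_b);/* POR    : merge                      */

        _mm_storeu_si128((__m128i *)(g + i), vg);
    }
}
```

For a 1 MB image this loop runs 64K times. The same structure would apply with 256-bit or 512-bit registers, which handle 32 or 64 pixels per iteration and reduce the count to 32K or 16K.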
