Question
SSE/AVX/AVX512 programming: write assembly statements for overlapping two images of equal size to produce a new image.
See the attached figure, which shows the four steps that overlap two images (B and E) to produce G. Given two images, B and E, you are asked to produce image G with no blue background. Assume each image is 1K x 1K pixels, or 1 MB, where each pixel is one byte representing one of 256 colors. For a 1024 * 1024 (1 MB) image, a simple-minded loop would require about 1 million iterations (1K * 1K), each processing one pixel. For large images, this simple-minded loop is prohibitively expensive. Besides CUDA and OpenCL for GPGPU programming, one solution to this expensive computation is the SSE/AVX/AVX512 instructions, which allow one instruction to operate on 16, 32, or 64 bytes at a time. With SSE/AVX/AVX512, a 1024-byte row of the 1024 * 1024 (1 MB) image can be processed in 64, 32, or 16 iterations, respectively. The entire image therefore requires only 64K, 32K, or 16K iterations instead of 1 million.
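The iteration counts quoted above follow from dividing the image size by the register width in bytes. A minimal sketch of that arithmetic, where the helper name `iterations_needed` is hypothetical and the size is assumed to divide evenly:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of loop iterations needed to cover
   total_bytes when each iteration handles bytes_per_iter bytes
   (1 = scalar loop, 16 = SSE, 32 = AVX, 64 = AVX512).
   Assumes the size divides evenly, as it does for a 1 MB image. */
static size_t iterations_needed(size_t total_bytes, size_t bytes_per_iter)
{
    assert(bytes_per_iter > 0 && total_bytes % bytes_per_iter == 0);
    return total_bytes / bytes_per_iter;
}
```

For a 1024 * 1024 byte image this gives 1,048,576 iterations for the byte-at-a-time loop and 65,536 (64K), 32,768 (32K), or 16,384 (16K) for 16-, 32-, or 64-byte registers.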
The assembly instructions required for this problem are variations of PCMPEQ, ANDP, ANDNP, and POR, which are documented in the Intel manual (https://software.intel.com/en-us/articles/intel-sdm#combined). For example, the AVX512 instructions to overlap the two images are variations of the following:
VPCMPEQD zmm1, zmm2, zmm3/m256 -> cmp
VANDPS zmm1, zmm2, zmm3/m256 -> and
VANDNPS zmm1, zmm2, zmm3/m256 -> !and
VPORQ zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst -> or
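To make the cmp / and / !and / or notation concrete, here is a scalar model of what each operation does to a single byte lane. The function names are hypothetical. The model assumes a byte-granular compare (PCMPEQB style), since each pixel is one byte, and follows Intel's definition of the and-not operation, in which the NOT applies to the first source operand only.

```c
#include <stdint.h>

/* Scalar model of one byte lane; a SIMD instruction applies the same
   operation to all 16, 32, or 64 lanes at once. Names are illustrative. */

/* cmp: an equal lane becomes all ones (0xFF), otherwise all zeros. */
static uint8_t lane_cmpeq(uint8_t a, uint8_t b)
{
    return (a == b) ? 0xFF : 0x00;
}

/* and: bitwise AND of the two lanes. */
static uint8_t lane_and(uint8_t a, uint8_t b)
{
    return (uint8_t)(a & b);
}

/* !and: (NOT a) AND b; only the first operand is inverted. */
static uint8_t lane_andn(uint8_t a, uint8_t b)
{
    return (uint8_t)(~a & b);
}

/* or: bitwise OR of the two lanes. */
static uint8_t lane_or(uint8_t a, uint8_t b)
{
    return (uint8_t)(a | b);
}
```

Because the compare result is either 0xFF or 0x00, it can serve as a per-pixel mask: `lane_and(mask, x)` keeps `x` where the mask is set, and `lane_andn(mask, y)` keeps `y` where it is clear.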
Assuming
image A is loaded in xmm/ymm/zmm1
image B is loaded in xmm/ymm/zmm3
image E is loaded in xmm/ymm/zmm4
and 8-bit 256-color map below:
red is 0xE0,
green is 0x1C,
blue is 0x03,
yellow is 0xFC,
white is 0xFF,
black is 0x00,
write four assembly instructions for each of the following:
128-bit SSE using xmm registers
256-bit AVX using ymm registers
512-bit AVX512 using zmm registers
The four assembly instructions form the body of a loop that iterates 64K, 32K, or 16K times, depending on whether SSE, AVX, or AVX512 is used. You may use register xmm/ymm/zmm2 to hold a temporary value.
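As a rough illustration of the loop body for the 128-bit case, the sketch below uses C with SSE2 intrinsics rather than raw assembly, with each intrinsic annotated with the instruction it typically compiles to. The figure is not reproduced here, so two things are assumptions: that the register described as image A holds the blue value 0x03 in every byte, and that the goal is to take E wherever B is blue and B everywhere else. The function name `overlay_no_blue` is hypothetical, and a scalar fallback is included for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define BLUE 0x03 /* blue entry of the 8-bit color map */

/* Hypothetical helper: g[i] = (b[i] == BLUE) ? e[i] : b[i].
   n must be a multiple of 16 (one XMM register per iteration). */
static void overlay_no_blue(const uint8_t *b, const uint8_t *e,
                            uint8_t *g, size_t n)
{
    assert(n % 16 == 0);
#if defined(__SSE2__)
    /* Assumed role of image A: every byte is blue. */
    const __m128i blue = _mm_set1_epi8((char)BLUE);
    for (size_t i = 0; i < n; i += 16) {
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i ve = _mm_loadu_si128((const __m128i *)(e + i));
        __m128i mask   = _mm_cmpeq_epi8(vb, blue);     /* PCMPEQB: 0xFF where B is blue */
        __m128i from_e = _mm_and_si128(mask, ve);      /* PAND:   E where B is blue     */
        __m128i from_b = _mm_andnot_si128(mask, vb);   /* PANDN:  B where B is not blue */
        __m128i vg     = _mm_or_si128(from_e, from_b); /* POR:    merge the two halves  */
        _mm_storeu_si128((__m128i *)(g + i), vg);
    }
#else
    /* Scalar fallback with the same result, one pixel per iteration. */
    for (size_t i = 0; i < n; i++) {
        uint8_t mask = (b[i] == BLUE) ? 0xFF : 0x00;
        g[i] = (uint8_t)((mask & e[i]) | (~mask & b[i]));
    }
#endif
}
```

The same compare, and, and-not, or pattern carries over to wider registers, with two caveats worth checking against the Intel manual: byte-wise integer operations on ymm registers need AVX2 rather than plain AVX, and AVX512 byte compares write their result to a mask register (k0 to k7) instead of a vector register, so the 512-bit version may look somewhat different.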