Compared to classic machine learning, Deep Learning often requires _____?
little training data and few computations
little training data and lots of computations
lots of training data and few computations
lots of training data and lots of computations

lots of training data and lots of computations

Which claim is correct when choosing a learning rate for training a neural network?
The learning rate should be adjusted randomly.
Initially, we can use a small learning rate, and then increase it gradually.
It is better to keep a constant learning rate.
Initially, we can use a large learning rate, and then decrease it gradually.

Initially, we can use a large learning rate, and then decrease it gradually.
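
One common way to realize "start large, then decrease" is an exponential decay schedule. The sketch below is only an illustration: the function name `exponential_decay` and the values 0.1 and 0.5 are assumptions chosen for the example, not something the deck specifies.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Return the learning rate to use at a given epoch (0-based).

    The rate starts at initial_lr and is multiplied by decay_rate
    once per epoch, so it shrinks gradually when decay_rate < 1.
    """
    return initial_lr * (decay_rate ** epoch)


# Hypothetical settings: start at 0.1 and halve the rate every epoch.
schedule = [exponential_decay(0.1, 0.5, epoch) for epoch in range(4)]
print(schedule)
```

Large early steps move quickly toward a good region, and the smaller later steps reduce oscillation around the minimum.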

Which claim is correct regarding the choice of learning rate based on the gradient's absolute value?
If the absolute value of the gradient is large, we should choose a large learning rate, too.
A large gradient always requires a large learning rate.
The learning rate should be independent of the gradient's absolute value.
If the absolute value of the gradient is large, we should choose a small learning rate.

If the absolute value of the gradient is large, we should choose a small learning rate.
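
A toy quadratic shows why steep gradients call for small steps. The sketch assumes the loss f(x) = 0.5 * curvature * x**2, which is a hypothetical example rather than anything from the deck; its gradient is curvature * x, so a larger curvature means a larger gradient at the same point.

```python
def run_gradient_descent(curvature, lr, steps=20, x0=1.0):
    """Minimize f(x) = 0.5 * curvature * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = curvature * x  # f'(x); its magnitude grows with the curvature
        x = x - lr * grad     # plain gradient descent update
    return x


# Same steep loss, two learning rates (values are illustrative assumptions).
steep_large_lr = run_gradient_descent(curvature=10.0, lr=0.3)
steep_small_lr = run_gradient_descent(curvature=10.0, lr=0.05)
print(abs(steep_large_lr), abs(steep_small_lr))
```

Each update multiplies x by (1 - lr * curvature), so the iteration only shrinks toward the minimum when lr is below 2 / curvature. With the larger rate the steps overshoot and grow; with the smaller rate they contract.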

Which claim is WRONG about the use of labeled data in different learning paradigms?
Unsupervised learning usually uses unlabeled data for training.
Semi-supervised learning uses both labeled and unlabeled data.
Reinforcement learning usually uses labeled data for training.
Supervised learning usually uses labeled data for training.

Reinforcement learning usually uses labeled data for training.

Which claim is WRONG about the branches of deep learning?
Deep learning is one branch of data mining.
Machine learning is one branch of artificial intelligence (AI).
Deep learning is one branch of artificial intelligence (AI).
Deep learning is one branch of machine learning.

Deep learning is one branch of data mining.

When updating parameters using gradient descent, which way of calculating loss works better for efficiency and robustness?
calculate loss for a mini-batch of data examples in every iteration
calculate loss for the entire set of data examples in every iteration
calculate loss for a single data example in every iteration
calculate loss for random data examples in every iteration

calculate loss for a mini-batch of data examples in every iteration

In mini-batch SGD training, why is it important to shuffle the training data before every epoch?
It helps run the program on GPU in parallel.
It helps the training converge fast and prevents bias.
It reduces the memory usage during training.
It helps calculate the loss faster.

It helps the training converge fast and prevents bias.
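
The two cards above (mini-batch loss and per-epoch shuffling) can be sketched together. This is a minimal illustration under stated assumptions: the function `minibatch_sgd`, the linear model y = w * x + b, the mean squared error loss, and the noise-free toy data are all hypothetical choices made for the example.

```python
import random


def minibatch_sgd(data, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new example order before every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient of (w*x + b - y)**2 over the mini-batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data generated from y = 3x + 1 (hypothetical and noise-free).
points = [(x / 10, 3 * (x / 10) + 1) for x in range(-10, 11)]
w, b = minibatch_sgd(points)
print(round(w, 3), round(b, 3))
```

Averaging over a small batch gives a less noisy gradient than a single example while costing far less than a full pass over the data. Reshuffling changes which examples share a batch, so no fixed ordering can systematically bias the updates.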

Logistic Regression is widely used to solve which type of problem?
a clustering problem with grouping similar data points.
a classification problem with predicting probabilities of discrete (or categorical) values.
a regression problem with predicting continuous values.
a dimensionality reduction problem.

a classification problem with predicting probabilities of discrete (or categorical) values.
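
As a rough sketch of how logistic regression turns a linear score into a class probability, the example below applies the sigmoid function to a weighted sum. The helper names and the weight, bias, and feature values are assumptions made up for illustration; in practice the weights would be learned from data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical, hand-picked parameters (not learned from data).
weights, bias = [1.5, -2.0], 0.25
p = predict_proba([2.0, 0.5], weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(p, label)
```

The model outputs a probability, and the discrete class comes from thresholding it, which is why logistic regression counts as a classification method despite its name.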

Comprehensive Deep Learning and Neural Networks Concepts and Applications

Computer Science - Artificial Intelligence

Created by user_hodr 7 months ago

Cards in this deck (46)

Compared to classic machine learning, Deep Learning often requires _____?

In order to reduce loss step by step, what direction does the gradient descent algorithm take a step in every iteration?

Which claim is correct when choosing a learning rate for training a neural network?

Which claim is correct regarding the choice of learning rate based on the gradient's absolute value?

Which claim is WRONG about the use of labeled data in different learning paradigms?

Which claim is WRONG about the branches of deep learning?

When updating parameters using gradient descent, which way of calculating loss works better for efficiency and robustness?

In mini-batch SGD training, why is it important to shuffle the training data before every epoch?

Logistic Regression is widely used to solve which type of problem?

In information theory, which event includes more information?

Which statement is true about activation functions in neural networks?

Which case is an example of overfitting in a machine learning model?

What approach could be used to handle overfitting in a machine learning model?

All regularizations (e.g., L1 norm, L2 norm) penalize larger parameters. Is this statement true or false?

Besides penalizing larger parameters, which regularization makes parameters more sparse?

In Backpropagation, which claim is true about the use of information in forward and backward passes?

As an activation function, does tanh avoid the vanishing gradient problem?

As an activation function, does ReLU solve the vanishing gradient problem?

About SGD optimization, which statement is NOT correct?

Which statement about the learning rate in Stochastic Gradient Descent (SGD) optimization is correct?

Does MaxPooling preserve detected features and downsample the feature map (image)?

If the input volume of an image is 227x227x3, and we apply 96 11x11 filters with stride 4, how many parameters are there?

In CNN, can two convolutional layers be connected directly without a pooling layer in the middle?

In the design of CNN, does the fully connected layer usually contain more parameters than convolutional layers?

What is the purpose of the ReLU activation function in a CNN?

Which statement is true about the convolution layer in neural networks?

What is the main advantage of using dropout in a CNN?

Given two stacked dilated convolution layers with kernel size 3x3, stride 1, and dilation 2, what is the size of the receptive field?

Given a convolution layer with input channels 3, output channel 64, kernel size 4x4, stride 2, dilation 3, and padding 1, what is the parameter size?

In PyTorch, which layer configuration downsamples the input to half its size?

In the design of an auto-encoder, should the encoder and decoder follow the exact same structure?

How can you identify activation in relation to function and gradient?

Which way do we usually use to train an autoencoder model?

Which claim is true about attention and self-attention in neural networks?

What's the major purpose of multi-head attention in neural networks?

In the transformer neural network architecture, do the encoder blocks usually use the identical neural network structure?

In the transformer neural network architecture, the output of the final encoder block will go to _____

In the autoregressive model, does the output variable at the current step depend only on the hidden states at all previous steps?

In Transformer, how does the decoder use the information (features) from the encoder?

In the policy gradient approach for reinforcement learning, the reward R(τ^n) is considered based on _____

In the two major approaches of reinforcement learning, which is usually more sample-efficient?

In Q-Learning, which method is more scalable for predicting a Q-value for a pair of (state, action)?

Describe the process of training in relation to model and layer?

In the two major approaches of reinforcement learning, which one is on-policy training?

For Discrete-event modeling, what approach do we often use?

Why is using stochastic gradient descent to train a generator based on the following loss function inefficient?
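
Two cards in the list above ask for the parameter count of a convolution layer. The sketch below shows the usual arithmetic, assuming a standard 2D convolution with one bias term per output filter; the function name `conv2d_param_count` is hypothetical, and the deck itself does not say whether biases are counted.

```python
def conv2d_param_count(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Count learnable parameters of a standard 2D convolution layer.

    Each output filter has in_channels * kernel_h * kernel_w weights,
    plus one bias when bias=True. Stride, padding, dilation, and the
    spatial size of the input do not change this count.
    """
    weights = in_channels * out_channels * kernel_h * kernel_w
    return weights + (out_channels if bias else 0)


# 227x227x3 input, 96 filters of size 11x11 (stride 4 is irrelevant here).
first = conv2d_param_count(3, 96, 11, 11)
# 3 input channels, 64 output channels, 4x4 kernel.
second = conv2d_param_count(3, 64, 4, 4)
print(first, second)
```

Passing `bias=False` gives the weight-only count, which is the figure to use if the intended answer excludes biases.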
