@ZHAOZHIHAO
Last active November 16, 2018 04:27
Short notes on papers
1. Mobilenet
Thanks to:
https://blog.csdn.net/t800ghb/article/details/78879612
https://blog.csdn.net/wfei101/article/details/78310226
(Mobilenet V2)https://blog.csdn.net/u011995719/article/details/79135818
Suppose there are M input channels and N kernels; then there are N output channels.
Padding keeps the output spatial size the same as the input.
Standard convolution:
The number of multiplications is
(Dk * Dk) * (Dc * Dc) * N * M (1)
where Dk is the kernel size and Dc is the spatial size of the input/output feature map.
Convolution in Mobilenet (depthwise separable):
First, apply a single Dk * Dk kernel to each of the M input channels (depthwise), giving M output channels.
The number of multiplications is
(Dk * Dk) * (Dc * Dc) * M (2)
Second, apply N kernels of size 1 * 1 * M (pointwise) to the M channels from the first step,
to get the same output shape as standard convolution.
The number of multiplications is
N * M * (Dc * Dc) (3)
The ratio ((2) + (3)) / (1) is (1/N) + (1/(Dk*Dk)), usually around 1/8 to 1/9 as said in the paper.
V2 adapts Mobilenet to use residual connections as in ResNet. See the previous link.
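The cost comparison above can be checked numerically; this is a small sketch of the two multiplication counts (the function and variable names are my own, not from the paper):

```python
def standard_conv_cost(Dk, Dc, M, N):
    # Formula (1): a Dk x Dk x M kernel per output channel, at every spatial position.
    return (Dk * Dk) * (Dc * Dc) * N * M

def depthwise_separable_cost(Dk, Dc, M, N):
    depthwise = (Dk * Dk) * (Dc * Dc) * M  # formula (2): one Dk x Dk filter per input channel
    pointwise = N * M * (Dc * Dc)          # formula (3): N 1x1xM filters at every position
    return depthwise + pointwise

# Ratio ((2)+(3))/(1) = 1/N + 1/(Dk*Dk); for Dk = 3 and large N this approaches 1/9.
Dk, Dc, M, N = 3, 112, 64, 128
ratio = depthwise_separable_cost(Dk, Dc, M, N) / standard_conv_cost(Dk, Dc, M, N)
```

With Dk = 3 and N = 128 the ratio is 1/128 + 1/9, i.e. roughly 1/8, matching the paper's claim.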
2. Soccer on Your Tabletop
Watch the demo to see what they have done: https://www.youtube.com/watch?v=eRGAB4QBS6U
Input image -> camera calibration -> player detection -> pose estimation -> tracking -> player segmentation
-> player depth estimation -> mesh generation -> scene reconstruction
(1). camera calibration:
using the sidelines and the penalty box around the goal, solve for the camera parameters w
(focal length, rotation, and translation) that align rendered synthetic field lines with the extracted edge points,
i.e., minimize distance(extracted edges, rendered synthetic field lines) w.r.t. the parameters w.
(2). player detection:
using Faster R-CNN.
(3). pose estimation:
detect person keypoints with "Convolutional Pose Machines".
Use the keypoints to refine the bounding boxes from (2).
(4). tracking:
tracking via the bounding boxes from (3)
(5). player segmentation:
For every tracked player we need to estimate its segmentation mask to be used in the depth estimation network.
A straightforward approach is to apply at each frame a person segmentation method [52], refined with a dense CRF[25]
as we did for training.
[52]F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016
[25]P.Krahenbuhl and V.Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011
(6). player depth estimation
training:
using the FIFA video game to obtain (depth, image, estimated segmentation mask) triples to train a network.
Note that we use a player-centric depth estimation because
we get more training data by breaking down each frame into 10-20 players,
and it is easier for the network to learn individual player’s configuration rather than whole-scene arrangements.
inference:
input: an image with a person's segmentation mask (I think the players are processed one by one)
output: depth at every pixel
(7). mesh generation
The depth map is then unprojected to world coordinates using the camera parameters, generating the player’s point-cloud
in 3D. Each pixel corresponds to a 3D point and we use pixel connectivity to establish faces. We texture-map the mesh
with the input image.
In short: convert the depth map to world coordinates, build meshes from these points, and finally texture-map the meshes.
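The unprojection step can be sketched with a pinhole camera model; this is a minimal illustration, assuming hypothetical intrinsics (fx, fy, cx, cy) and a 4x4 camera-to-world matrix, not the paper's actual calibration code:

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy, cam_to_world):
    """Lift an (H, W) depth map to 3D world points via the pinhole model."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project each pixel into camera coordinates using its depth.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # homogeneous
    # Move from camera to world coordinates; drop the homogeneous coordinate.
    return (pts_cam.reshape(-1, 4) @ cam_to_world.T)[:, :3]

# A flat depth map at 2 m with an identity extrinsic stays at z = 2 in world space.
pts = unproject(np.full((4, 4), 2.0), fx=1.0, fy=1.0, cx=2.0, cy=2.0,
                cam_to_world=np.eye(4))
```

Each pixel then becomes one 3D point, and neighboring pixels can be connected into mesh faces as the paper describes.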
3. Simple Baselines for Human Pose Estimation and Tracking
This paper uses a simple method but still achieves state-of-the-art results, which I think is because of the FlowNet.
The code will be available later.
Pipeline:
1. detect person
using Faster R-CNN. If no person is detected, use FlowNet and the pose keypoints from the previous frame to predict the keypoints
in the current frame, then estimate the bounding box from the predicted keypoints.
2. predict pose keypoint
estimate the human pose with a CNN (nothing special about this CNN, see Fig. 1), where the input is the bounding box from step 1.
3. association
the distance is calculated as distance(actually detected pose, pose predicted from the previous pose via FlowNet).
I think the distance here is just pixel distance.
The association is very simple: each time, take the pair with the smallest distance and pop it from the pool. Any remaining
unmatched pose (bounding box) is treated as a new target.
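The greedy association above can be sketched in a few lines; this is my own toy version, where `dist` stands in for whatever pose distance is used (here plain pixel distance, as I suspect):

```python
def greedy_associate(tracked, detected, dist):
    """Greedy matching: repeatedly take the closest (track, detection) pair."""
    # Enumerate all pairs, sorted by distance, smallest first.
    pairs = sorted(
        ((dist(t, d), i, j) for i, t in enumerate(tracked)
                            for j, d in enumerate(detected)),
        key=lambda x: x[0],
    )
    matches, used_t, used_d = [], set(), set()
    for _, i, j in pairs:
        if i not in used_t and j not in used_d:   # both sides still unmatched
            matches.append((i, j))
            used_t.add(i)
            used_d.add(j)
    # Detections left over are treated as new targets.
    new_targets = [j for j in range(len(detected)) if j not in used_d]
    return matches, new_targets

# Toy 1-D "poses": track 0 matches detection 1, track 1 matches detection 0,
# and detection 2 becomes a new target.
matches, new = greedy_associate([0.0, 10.0], [9.0, 0.5, 30.0],
                                dist=lambda a, b: abs(a - b))
```

The sorted-pairs pass is a simple alternative to optimal (Hungarian) assignment, which matches the paper's "very simple" framing.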
4. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
C++/TensorFlow/PyTorch/... versions of the code are available.
As said in the paper, "Our method has achieved the speed of 8.8 fps for a video with 19 people". The C++ version should
be faster.
A nice comment(maybe by the author): http://image-net.org/challenges/talks/2016/Multi-person%20pose%20estimation-CMU.pdf
pipeline:
First, all keypoints are detected, without grouping them into individual persons.
Then these keypoints are connected/grouped into different people by greedy matching. The authors use Part Affinity Fields
to score the connection between two keypoints.
For Part Affinity Fields, see pages 38, 46, 47, and 48 of the comment PDF.
The Part Affinity Fields branch and the keypoint detection branch are learned and run jointly, so it's fast.
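The connection score is, roughly, a line integral of the 2-channel PAF along the candidate limb; this is a simplified sketch of that idea (my own function, not the authors' code, and it ignores their integral-of-alignment subtleties like non-maximum handling):

```python
import numpy as np

def paf_score(paf, p1, p2, n_samples=10):
    """Score a candidate limb p1 -> p2 by averaging the dot product of the
    (H, W, 2) field with the limb's unit direction at sampled points."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    u = v / norm                                  # unit vector along the limb
    total = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):    # sample points along the segment
        x, y = (p1 + t * v).round().astype(int)
        total += paf[y, x] @ u                    # alignment of field with direction
    return total / n_samples

# A field pointing uniformly in +x fully supports a horizontal limb (score 1.0)
# and gives no support to a vertical one (score 0.0).
field = np.zeros((8, 8, 2))
field[..., 0] = 1.0
horizontal = paf_score(field, (1, 3), (6, 3))
vertical = paf_score(field, (3, 1), (3, 6))
```

High-scoring candidate connections are then kept greedily, which is what groups keypoints into people.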
5. Depth-aware CNN for RGB-D Segmentation
https://arxiv.org/pdf/1803.06791.pdf
Equations (2) (3) (4) (5) are all theoretical content of this paper.
6. Relational inductive biases, deep learning, and graph networks
GN := graph network
(i). I think the authors "propose" GN for tasks where the structural prior of CNNs doesn't fit, such as relation/interaction
reasoning. For example,
"Graph Networks as Learnable Physics Engines for Inference and Control" on ICLR 2018
video, https://m.facebook.com/story.php?story_fbid=429607650887089&id=118896271958230&_rdr
"Neural Message Passing for Quantum Chemistry" on ICML 2017,
video, https://vimeo.com/238221090
I think the background for proposing GN holds true if we treat deep learning (CNN/RNN/autoencoder ...) as a new basic "atom" and
use it in more sophisticated cases, as these two sources suggest:
https://twitter.com/goodfellow_ian/status/1042246801376436224
https://medium.com/@karpathy/software-2-0-a64152b37c35
(ii). There are many reference papers presented as applications of GN, but in fact only a few of them are direct applications
of this structural form. The authors argue that the indirect applications can be transformed into GN form, which I think is
intuitively true. An indirect application, for example:
"Non-local neural networks", CVPR 2018,
where Figure 2 and equations (1), (6) are the core of the theoretical part.
7. Dynamic Routing Between Capsules
reference: [1]. Neural Network Encapsulation
The inputs are, for example, many feature-map channels v_i, from which we compute prediction capsules \hat{v}_{j|i}.
From the capsules \hat{v}_{j|i}, the outputs are predictions, for example 10 vectors for the MNIST problem.
The output is a vector because it encodes object information beyond the category, such as rotation and scale. The length of
this vector is the probability.
The output is $s_j=\sum_i c_{ij}\hat{v}_{j|i}$, where j indexes the classes (e.g., 10 for MNIST) and s_j is the
vector for each class.
What we want are good c_{ij}, because the c_{ij} let us give an interpretation of the output s_j. And in my opinion,
the c_{ij} are why they chose the name "routing". To get good c_{ij}, we use EM. The variables in EM include c_{ij},
$p(\hat{v}_{j|i} \mid \mu, \sigma)$, s_j, \mu, and \sigma. Here \mu and \sigma are involved because Gaussian clusters are used; they are the
parameters of the Gaussian model. For the rough EM steps, see equations (2), (3) in [1]; note that in my notation a_j is
replaced by s_j.
nice words:
The objective of the EM (Expectation Maximization) routing is to group capsules to form a part-whole relationship using a
clustering technique (EM).
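The simpler routing-by-agreement from the "Dynamic Routing Between Capsules" paper (the EM variant adds the Gaussian \mu and \sigma on top of this) can be sketched in numpy; shapes and iteration count here are illustrative assumptions:

```python
import numpy as np

def squash(s, axis=-1):
    # Squash non-linearity: keeps direction, maps the length into [0, 1).
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def route(v_hat, n_iters=3):
    """Routing by agreement over predictions v_hat of shape (num_in, num_out, dim)."""
    b = np.zeros(v_hat.shape[:2])                 # routing logits b_ij
    for _ in range(n_iters):
        # Coupling coefficients c_ij: softmax of b over the output capsules j.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * v_hat).sum(axis=0)    # s_j = sum_i c_ij * v_hat_{j|i}
        out = squash(s)
        # Agreement (dot product) between predictions and outputs updates the logits.
        b = b + (v_hat * out[None]).sum(axis=-1)
    return out, c

rng = np.random.default_rng(0)
out, c = route(rng.normal(size=(6, 10, 16)))      # 6 input capsules, 10 classes, dim 16
```

The output lengths stay below 1 (so they can act as probabilities), and each input capsule's couplings c_ij sum to 1 across classes, which is the "routing" interpretation above.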
8. CNN visualization
based on the code https://github.com/utkuozbulak/pytorch-cnn-visualizations
(1) Generate an image that maximizes a specific filter's output in a specific layer
A random or zero image is generated
Then input to the network
Then get the desired filter's output
Then, because the goal is maximization, treat the negative of that filter's output as the loss
Then backpropagate the loss to the input image
Then adjust the input image according to the gradient
Iterate several times
(2) Generate an image that maximizes a class probability
This is essentially the same. One detail: in the code, the score used is not the post-softmax probability in [0, 1] but the
pre-softmax logit. Some papers report that the pre-softmax score gives better results than the post-softmax one.
(3) Vanilla backprop
For all classes right before softmax, set the gradient of the desired class to 1 and all the rest to 0.
Backprop this gradient vector (effectively a scalar 1, since the 0 elements don't contribute) to the input image.
This gives a gradient (image) for the input; then equalize and stretch it to [0, 255] for better viewing.
(4) gradcam.py (I don't know the term for the operation described below)
To visualize a specific layer's output, aggregate all channels (convolution outputs) in that layer into a single map by a
weighted sum, where the weight for a channel is the mean of the gradient on that channel (a scalar mean over a 2D feature map).
Then normalize the result to [0, 255].
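The channel aggregation described above is a few lines of numpy; this is a minimal sketch assuming the layer's activations and gradients have already been captured (e.g. via hooks), with an added ReLU as in the usual Grad-CAM recipe:

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (C, H, W) arrays from the chosen layer."""
    weights = gradients.mean(axis=(1, 2))             # one scalar weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0)                          # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return (cam * 255).astype(np.uint8)               # rescale to [0, 255]

# Toy example: channel 0 is active with positive mean gradient, channel 1 is silent.
acts = np.stack([np.ones((4, 4)), np.zeros((4, 4))])
grads = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -1.0)])
heat = grad_cam(acts, grads)
```

The result is a single heat map the size of the feature map, usually upsampled and overlaid on the input image.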
(5) Guided backprop
This modifies Vanilla backprop slightly.
For all classes right before softmax, set the gradient of the desired class to 1 and all the rest to 0.
In a network, a --> b, where a is the sum of the elementwise multiplication of a filter and an image patch, --> is ReLU,
and b is a after -->. During backprop, the gradient on b is set to 0 wherever it is negative.
(6) smooth_grad.py
Generate a noise image with 0 mean and modest variance.
Add this noise image to the original image.
Do vanilla backprop or guided backprop on the new image to get the (un-post-processed) gradient image.
Repeat the above process several times, and average the resulting images.
Post-process the averaged image for a better look.
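The averaging loop above is easy to sketch; here `grad_fn` stands in for whichever backprop routine is used (vanilla or guided), and the sample count and noise level are illustrative:

```python
import numpy as np

def smooth_grad(image, grad_fn, n=25, sigma=0.15, seed=0):
    """Average the gradient over n noisy copies of the image.
    grad_fn: any routine returning a gradient with the same shape as its input."""
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(image, dtype=float)
    for _ in range(n):
        noisy = image + rng.normal(0.0, sigma, size=image.shape)  # perturb the input
        acc += grad_fn(noisy)                                     # accumulate gradients
    return acc / n                                                # average

# With a linear "network" x -> w . x the gradient is constant, so averaging recovers w.
w = np.array([1.0, -2.0, 3.0])
g = smooth_grad(np.zeros(3), grad_fn=lambda x: w)
```

For a real network the per-sample gradients are noisy, and it is exactly this averaging that smooths the saliency map.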
(7) guided_gradcam.py
This is just pointwise multiplication of gradcam.py mask and guided backprop mask.
(8) inverted_representation.py (Understanding Deep Image Representations by Inverting Them)
For an input image, for example a cat image, we take its feature maps at a specific layer.
Then we generate a random image and take its feature maps at the same layer.
Then we calculate the L2 distance (loss) between the two sets of feature maps, plus some regularization terms (for a smooth result).
There are two kinds of regularizer in the paper. The first is the alpha-norm (alpha = 6 in the paper); I don't know what the
effect of this regularizer is. The second is the first-order derivative along the x- and y-axes, which encourages the
result image to have constant regions.
Then we use the loss to update the generated image, and iterate several times.
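The second regularizer (a total-variation-style penalty) is simple to write down; this is a toy sketch of the idea, not the paper's exact formulation (which raises the differences to a power):

```python
import numpy as np

def tv_loss(img):
    """Sum of absolute first-order x/y differences of a 2D image.
    Penalizing this encourages piecewise-constant regions."""
    dx = np.abs(np.diff(img, axis=1)).sum()  # horizontal differences
    dy = np.abs(np.diff(img, axis=0)).sum()  # vertical differences
    return dx + dy

flat = np.full((5, 5), 3.0)      # a constant image has zero penalty
step = np.zeros((5, 5))
step[:, 2:] = 1.0                # one vertical edge: 5 unit jumps
```

Minimizing this term alongside the feature-map L2 distance is what makes the reconstructed image look smooth.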
9. 15 Logical Fallacies You Should Know Before Getting Into a Debate
https://thebestschools.org/magazine/15-logical-fallacies-know/
(1) Ad Hominem Fallacy
Instead of addressing the candidate’s stance on the issues, or addressing his or her effectiveness as a statesman or
stateswoman, ad hominems focus on personality issues, speech patterns, wardrobe, style, and other things that affect
popularity but have no bearing on their competence.
(2) Straw Man
In the straw man fallacy, someone attacks a position the opponent doesn’t really hold. Instead of contending with the
actual argument, he or she instead attacks the equivalent of a lifeless bundle of straw, an easily defeated effigy.
Straw man fallacies are a cheap and easy way to make one’s position look stronger than it is. Often the straw man
fallacy is accidental, because one doesn’t realize he or she is oversimplifying a nuanced position, or
misrepresenting a narrow, cautious claim as if it were broad and foolhardy.
(3) Appeal to Ignorance (argumentum ad ignorantiam)
Consider the following two claims: “No one has ever been able to prove definitively that extra-terrestrials exist,
so they must not be real.” “No one has ever been able to prove definitively that extra-terrestrials do not exist, so
they must be real.” If we don’t know whether they exist, then we don’t know that they do exist or that they don’t
exist. Ignorance doesn’t prove any claim to knowledge.
(4) False Dilemma/False Dichotomy
False Dilemma fails by limiting the options to two when there are in fact more options to choose from. For example,
there are only two kinds of people in the world, people who love Led Zeppelin, and people who hate music.
It’s not a fallacy if there really are only two options. For example, “either Led Zeppelin is the greatest band of
all time, or they are not.” That’s a true dilemma, since there really are only two options there.
(5) Slippery Slope
The slippery slope fallacy suggests that unlikely or ridiculous outcomes are likely when there’s just not enough
evidence to think so. You may have used this fallacy on your parents as a teenager: “But, you have to let me go to
the party! If I don’t go to the party, I’ll be a loser with no friends. Next thing you know I’ll end up alone and
jobless living in your basement when I’m 30!”
(6) Circular Argument (petitio principii)
When a person’s argument is just repeating what they already assumed beforehand, it’s not arriving at any new
conclusion. We call this a circular argument or circular reasoning. Another way to explain circular arguments is
that they start where they finish, and finish where they started.
(7) Hasty Generalization
Hasty generalizations are general statements without sufficient evidence to support them. They are general claims
too hastily made, hence they commit some sort of illicit assumption, stereotyping, unwarranted conclusion,
overstatement, or exaggeration. Is one example enough to prove the claim that "Apple computers are the most
expensive computer brand?" What about 12 examples? What about if 37 out of 50 apple computers were more expensive
than comparable models from other brands? A simple way to avoid hasty generalizations is to add qualifiers like
“sometimes,” "maybe," "often," or "it seems to be the case that ... ".
(8) Red Herring (ignoratio elenchi)
A “red herring” is a distraction from the argument typically with some sentiment that seems to be relevant but
isn’t really on-topic. This tactic is common when someone doesn’t like the current topic and wants to detour into
something else instead, something easier or safer to address.
(9) Tu Quoque Fallacy
The “tu quoque,” Latin for “you too,” is also called the “appeal to hypocrisy” because it distracts from the argument
by pointing out hypocrisy in the opponent. If Jack says, “Maybe I committed a little adultery, but so did you Jason!”
Jack is trying to diminish his responsibility or defend his actions by distributing blame to other people. But no one
else’s guilt excuses his own guilt. No matter who else is guilty, Jack is still an adulterer.
(10) Causal Fallacy
The Causal Fallacy is any logical breakdown when identifying a cause. You can think of the Causal Fallacy as a parent
category for several different fallacies about unproven causes.
i). One causal fallacy is the False Cause or non causa pro causa ("not the-cause for a cause") fallacy, which is
when you conclude about a cause without enough evidence to do so. Consider, for example, “Since your parents
named you ‘Harvest,’ they must be farmers.”
ii). Another causal fallacy is the Post Hoc fallacy. This fallacy happens when you mistake something for the
cause just because it came first. “Yesterday, I walked under a ladder with an open umbrella indoors while
spilling salt in front of a black cat. And I forgot to knock on wood with my lucky dice. That must be why I’m
having such a bad day today. It’s bad luck.”
iii). Another kind of causal fallacy is the correlational fallacy. This fallacy happens when you mistakenly
interpret two things found together as being causally related. Two things may correlate without a causal relation,
or they may have some third factor causing both of them to occur. Or perhaps both things just, coincidentally,
happened together. Consider for example, “Every time Joe goes swimming he is wearing his Speedos. Something
about wearing that Speedo must make him want to go swimming.”
(11) Fallacy of Sunk Costs
Sometimes we invest ourselves so thoroughly in a project that we’re reluctant to ever abandon it, even when it turns
out to be fruitless and futile. It’s natural, and usually not a fallacy to want to carry on with something we find
important, not least because of all the resources we’ve put into it. However, this kind of thinking becomes a fallacy
when we start to think that we should continue with a task or project because of all that we’ve put into it, without
considering the future costs we’re likely to incur by doing so. There may be a sense of accomplishment when finishing,
and the project might have other values, but it’s not enough to justify the cost invested in it.
(12) Appeal to Authority (argumentum ad verecundiam)
This fallacy happens when we misuse an authority. This misuse of authority can occur in a number of ways. We can cite
only authorities — steering conveniently away from other testable and concrete evidence as if expert opinion is always
correct. Or we can cite irrelevant authorities, poor authorities, or false authorities. Suppose someone says, “I buy
Fruit of the Loom™ underwear because Michael Jordan says it’s the best.” But Michael Jordan isn’t a relevant authority
when it comes to underwear. This is a fallacy of irrelevant authority. There’s another problem with relying too
heavily on authorities. Even the authorities can be wrong sometimes.
(13) Equivocation (ambiguity)
Equivocation happens when a word, phrase, or sentence is used deliberately to confuse, deceive, or mislead by
sounding like it’s saying one thing but actually saying something else. For example, a euphemism might be replacing
"lying" with the phrase "creative license," or replacing my "criminal background" with my "youthful indiscretions,"
or replacing "fired from my job" with "early retirement."
(14) Appeal to Pity (argumentum ad misericordiam)
It is a fallacy of relevance. Personal attacks, and emotional appeals, aren’t strictly relevant to whether something
is true or false. In this case, the fallacy appeals to the compassion and emotional sensitivity of others when these
factors are not strictly relevant to the argument. Appeals to pity often appear as emotional manipulation. For example,
“How can you eat that innocent little carrot? He was plucked from his home in the ground at a young age, and violently
skinned, chemically treated, and packaged, and shipped to your local grocer and now you are going to eat him into
oblivion when he did nothing to you. You really should reconsider what you put into your body.”
To be fair, emotions can sometimes be relevant. Often, the emotional aspect is a key insight into whether something
is morally repugnant or praiseworthy, or whether a governmental policy will be winsome or repulsive. People’s feelings
about something can be critically important data when planning a campaign, advertising a product, or rallying a group
together for a charitable cause. But it becomes a fallacious appeal to pity when the emotions are used in
substitution for facts or as a distraction from the facts of the matter.
(15) Bandwagon Fallacy
The bandwagon fallacy assumes something is true (or right, or good) because other people agree with it. The form of
this argument often looks like this: “Many people do or think X, so you ought to do or think X too.” One problem with
this kind of reasoning is that the broad acceptance of some claim or action is not always a good indication that the
acceptance is justified. People can be mistaken, confused, deceived, or even willfully irrational. And when people
act together, sometimes they become even more foolish — i.e., “mob mentality.”