I used TensorFlow to implement the neural network and achieved a DEVTEST accuracy of 0.825224.
Details:
- optimizer: Adam
- initial learning rate: 0.5
- learning rate decayed by a factor of 0.5 every 256 epochs
- maximum number of iterations: 1024
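For concreteness, a minimal sketch of this setup, assuming tf.keras and one parameter update per epoch (full-batch training, so steps == epochs); the original code's exact API and batching are assumptions on my part:

```python
import tensorflow as tf

# Minimal sketch of the optimizer setup above, assuming tf.keras and one
# update per epoch (full-batch training), so optimizer steps == epochs.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.5,  # initial learning rate
    decay_steps=256,            # decay every 256 steps
    decay_rate=0.5,             # multiply the learning rate by 0.5 each time
    staircase=True,             # discrete, step-wise decay
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# Training then runs for at most 1024 iterations with this optimizer.
```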
I chose to test different context window sizes w.
w | DEVTEST accuracy |
---|---|
2 | 0.794463 |
1 | 0.825224 |
0 | 0.841024 |
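For reference, I read a window size of w as including the w tokens on each side of the target word (so w = 0 means the word alone). A minimal sketch of building such windows; the helper name and padding token are hypothetical:

```python
def build_windows(tokens, w, pad="<PAD>"):
    """For each position i, return the tokens in [i - w, i + w],
    padded at the sentence boundaries. Hypothetical helper."""
    padded = [pad] * w + list(tokens) + [pad] * w
    return [padded[i:i + 2 * w + 1] for i in range(len(tokens))]

# Example: w = 1 yields a three-token window around each word.
print(build_windows(["#nba", "game", "tonight"], w=1))
# [['<PAD>', '#nba', 'game'], ['#nba', 'game', 'tonight'], ['game', 'tonight', '<PAD>']]
```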
For w = 2, I got a DEVTEST accuracy of 0.600112 with the same values for the other training hyper-parameters. One problem seems to be that the sparsity of the hidden layer reaches 0.99 very early, which prevents gradients from flowing across the ReLU. So I halved the initial learning rate, which saved some units from being zeroed (0.97 sparsity) and improved the DEVTEST accuracy to 0.794463, still not as good as w = 1.
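Here sparsity means the fraction of hidden units whose ReLU output is exactly zero; a sketch of how it can be measured (the activations tensor is a placeholder):

```python
import tensorflow as tf

def relu_sparsity(activations):
    """Fraction of activations that are exactly zero after the ReLU.
    A value near 1.0 means most units are dead and pass no gradient."""
    zeros = tf.cast(tf.equal(activations, 0.0), tf.float32)
    return tf.reduce_mean(zeros)

# Placeholder hidden-layer output: three of four units are zeroed.
h = tf.nn.relu(tf.constant([[-1.0, 0.5], [-2.0, -3.0]]))
print(float(relu_sparsity(h)))  # 0.75
```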
For w = 0, I got a DEVTEST accuracy of 0.841024, which is better than w = 1. This result is somewhat surprising to me, but it can be explained by the fact that many words have one dominant POS, especially the special symbols in tweets such as hashtags, mentions, and emoji. So considering context might act as a distractor.
I then explored the effect of regularization.
For L2 regularization, I tried lambda values of 0.1, 0.01, and 0.001. Lambda = 0.001 produced the highest DEV accuracy, leading to a DEVTEST accuracy of 0.845078, which is better than the original performance.
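A minimal sketch of attaching the L2 penalty with lambda = 0.001 in tf.keras; the layer sizes and output dimension are placeholders, not the original architecture:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # L2 penalty (lambda = 0.001) on the hidden layer's weights.
    tf.keras.layers.Dense(
        128, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dense(25, activation="softmax"),  # one unit per POS tag (size assumed)
])
```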
For dropout regularization, I tried dropout rates of 0.2, 0.5, and 0.9. The rate 0.2 produced the highest DEV accuracy, leading to a DEVTEST accuracy of 0.729866, which is much worse than not using dropout.
I also tried dropout on both the input and the hidden layers. With a dropout rate of 0.2, I got a better DEV accuracy and a DEVTEST accuracy of 0.842142, which is better than not using dropout.
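A sketch of this better-performing configuration: dropout with rate 0.2 on both the input and the hidden layer (dropping the first Dropout layer recovers the hidden-only variant above; sizes are again placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dropout(0.2),                     # dropout on the input
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),                     # dropout on the hidden layer
    tf.keras.layers.Dense(25, activation="softmax"),  # one unit per POS tag (size assumed)
])
# Dropout is active only during training, e.g. model(x, training=True) or model.fit(...).
```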