sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/postgresql/ gitlabhq_production
SQL query to fetch discussions:
SELECT n.id,
mr.iid AS merge_request_iid,
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.translate.AmazonTranslate;
import com.amazonaws.services.translate.AmazonTranslateClientBuilder;
import com.amazonaws.services.translate.model.TranslateTextRequest;
import com.amazonaws.services.translate.model.TranslateTextResult;

public class AmazonTranslateUtil {
    // Sketch only: the original listing was truncated after the class
    // declaration; the method below and the region choice are assumptions.
    public static String translate(AWSCredentials creds, String text, String from, String to) {
        AmazonTranslate client = AmazonTranslateClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(creds))
                .withRegion(Regions.US_EAST_1).build();
        TranslateTextRequest req = new TranslateTextRequest().withText(text)
                .withSourceLanguageCode(from).withTargetLanguageCode(to);
        return client.translateText(req).getTranslatedText();
    }
}
The idea is to use a uniformly distributed numeric column as the split column: prefer the primary key, then any other numeric column, and avoid splitting on a text column.
Spark is a unified analytics engine for large-scale data processing (mostly in-memory). Here we fetch data via JDBC using PySpark:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
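The partitioned-read idea above can be sketched in PySpark. The table name (posthistory), split column (id), bounds, and connection URL are assumptions for illustration, not values from the original notes:

```python
# Sketch of a partitioned JDBC read with PySpark; Spark issues numPartitions
# parallel queries, each covering a slice of [lowerBound, upperBound] on the
# split column, which is why a uniformly distributed numeric column matters.

def jdbc_options(url, table, column, lower, upper, num_partitions):
    """Build the option dict for a partitioned JDBC read."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,      # must be numeric, date, or timestamp
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

def read_post_history(spark):
    """Read the (assumed) posthistory table in 8 parallel partitions."""
    opts = jdbc_options("jdbc:postgresql://localhost:5432/stackexchange",
                        "posthistory", "id", 1, 1_000_000, 8)
    return (spark.read.format("jdbc")
                 .options(**opts)
                 .option("driver", "org.postgresql.Driver")
                 .load())
```

Called with an active session, e.g. `df = read_post_history(SparkSession.builder.getOrCreate())`, each of the 8 partitions becomes one concurrent query against PostgreSQL.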
Data used: academia.stackexchange.com.7z from https://archive.org/details/stackexchange. The academia.stackexchange.com folder contains data from multiple tables; I used PostHistory.xml.
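In the Stack Exchange data dumps, each file such as PostHistory.xml is a single root element whose children are `<row>` elements with the record fields stored as attributes. A minimal sketch of streaming those rows out without loading the whole file (the tiny inline sample stands in for the real dump):

```python
# Stream <row> records out of a Stack Exchange dump file with iterparse,
# so large files never have to fit in memory at once.
import xml.etree.ElementTree as ET
from io import BytesIO

def iter_rows(source):
    """Yield each <row> element's attributes as a dict."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release the element while streaming

# Tiny inline sample standing in for PostHistory.xml
sample = b'<posthistory><row Id="1" PostId="7"/><row Id="2" PostId="9"/></posthistory>'
rows = list(iter_rows(BytesIO(sample)))
```

In practice you would pass a path (or open file object) for the extracted PostHistory.xml instead of the inline bytes.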
package com.dev.util.aws;

import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.auth.AWSCredentials;