sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/postgresql/ gitlabhq_production
SQL query to fetch discussions:
SELECT n.id,
mr.iid AS merge_request_iid,
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.translate.AmazonTranslate;
import com.amazonaws.services.translate.AmazonTranslateClientBuilder;
import com.amazonaws.services.translate.model.TranslateTextRequest;
import com.amazonaws.services.translate.model.TranslateTextResult;

public class AmazonTranslateUtil {
    // Sketch only: the original listing was truncated after the class
    // declaration; the method below and the region choice are assumptions.
    public static String translate(AWSCredentials creds, String text, String from, String to) {
        AmazonTranslate client = AmazonTranslateClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(creds))
                .withRegion(Regions.US_EAST_1).build();
        TranslateTextRequest req = new TranslateTextRequest().withText(text)
                .withSourceLanguageCode(from).withTargetLanguageCode(to);
        return client.translateText(req).getTranslatedText();
    }
}
The idea is to use a uniformly distributed numeric column as the split column: prefer the primary key, then any other numeric column, and avoid splitting on a text column.
Spark is a unified analytics engine for large-scale data processing (mostly in-memory). Here we fetch data via JDBC using PySpark:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
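The partitioned-read idea above can be sketched in PySpark. The table name (posthistory), split column (id), bounds, and connection URL are assumptions for illustration, not values from the original notes:

```python
# Sketch of a partitioned JDBC read with PySpark; Spark issues numPartitions
# parallel queries, each covering a slice of [lowerBound, upperBound] on the
# split column, which is why a uniformly distributed numeric column matters.

def jdbc_options(url, table, column, lower, upper, num_partitions):
    """Build the option dict for a partitioned JDBC read."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,      # must be numeric, date, or timestamp
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

def read_post_history(spark):
    """Read the (assumed) posthistory table in 8 parallel partitions."""
    opts = jdbc_options("jdbc:postgresql://localhost:5432/stackexchange",
                        "posthistory", "id", 1, 1_000_000, 8)
    return (spark.read.format("jdbc")
                 .options(**opts)
                 .option("driver", "org.postgresql.Driver")
                 .load())
```

Called with an active session, e.g. `df = read_post_history(SparkSession.builder.getOrCreate())`, each of the 8 partitions becomes one concurrent query against PostgreSQL.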
Data used: academia.stackexchange.com.7z from https://archive.org/details/stackexchange. The academia.stackexchange.com folder contains data from multiple tables; I used PostHistory.xml.
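In the Stack Exchange data dumps, each file such as PostHistory.xml is a single root element whose children are `<row>` elements with the record fields stored as attributes. A minimal sketch of streaming those rows out without loading the whole file (the tiny inline sample stands in for the real dump):

```python
# Stream <row> records out of a Stack Exchange dump file with iterparse,
# so large files never have to fit in memory at once.
import xml.etree.ElementTree as ET
from io import BytesIO

def iter_rows(source):
    """Yield each <row> element's attributes as a dict."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release the element while streaming

# Tiny inline sample standing in for PostHistory.xml
sample = b'<posthistory><row Id="1" PostId="7"/><row Id="2" PostId="9"/></posthistory>'
rows = list(iter_rows(BytesIO(sample)))
```

In practice you would pass a path (or open file object) for the extracted PostHistory.xml instead of the inline bytes.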
package com.dev.util.aws;

import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.auth.AWSCredentials;