geosmart's gists
@geosmart
geosmart / testRegex.js
Created September 13, 2024 01:30 — forked from hanxiao/testRegex.js
Regex for chunking by using all semantic cues
// Updated: Aug. 20, 2024
// Run: node testRegex.js whatever.txt
// Live demo: https://jina.ai/tokenizer
// LICENSE: Apache-2.0 (https://www.apache.org/licenses/LICENSE-2.0)
// COPYRIGHT: Jina AI
const fs = require('fs');
const util = require('util');
// Define variables for magic numbers
const MAX_HEADING_LENGTH = 7;
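(The preview cuts off at the first constant; judging by the comment above it, the full gist presumably defines more per-cue length caps and then the regex itself. As a hedged illustration of the idea only, not hanxiao's actual regex, here is a minimal Python sketch of length-capped chunking on semantic boundaries:)

import re

MAX_CHUNK_LENGTH = 200  # illustrative cap, standing in for the gist's per-cue limits

def chunk(text):
    # Split at sentence ends and blank lines, then greedily pack the
    # pieces into chunks no longer than MAX_CHUNK_LENGTH.
    pieces = re.split(r"(?<=[.!?])\s+|\n{2,}", text)
    chunks, buf = [], ""
    for piece in pieces:
        if buf and len(buf) + len(piece) + 1 > MAX_CHUNK_LENGTH:
            chunks.append(buf)
            buf = ""
        buf = (buf + " " + piece).strip()
    if buf:
        chunks.append(buf)
    return chunks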
@geosmart
geosmart / snowflake.py
Created July 19, 2022 03:48
Snowflake for python3
import time
import logging
# bit widths
WORKER_ID_BITS = 5
DATA_CENTER_ID_BITS = 5
SEQUENCE_BITS = 12
TIMESTAMP_EPOCH = 1288834974657
# 0-31
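(The preview stops at the worker-id range comment. For orientation, a hedged sketch of the standard snowflake bit layout these constants imply, using the constants defined above; the names below are illustrative and not necessarily the gist's:)

# 64-bit id = (timestamp - epoch) << 22 | data_center_id << 17 | worker_id << 12 | sequence
MAX_WORKER_ID = -1 ^ (-1 << WORKER_ID_BITS)            # 0-31
MAX_DATA_CENTER_ID = -1 ^ (-1 << DATA_CENTER_ID_BITS)  # 0-31
WORKER_ID_SHIFT = SEQUENCE_BITS
DATA_CENTER_ID_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS + DATA_CENTER_ID_BITS

def make_id(timestamp_ms, data_center_id, worker_id, sequence):
    # compose the four fields into a single 64-bit integer
    return ((timestamp_ms - TIMESTAMP_EPOCH) << TIMESTAMP_SHIFT) \
        | (data_center_id << DATA_CENTER_ID_SHIFT) \
        | (worker_id << WORKER_ID_SHIFT) \
        | sequence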
@geosmart
geosmart / log.java
Created October 9, 2021 09:14
flinkx on k8s error log
2021-10-09 09:06:10.974 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver - Recovered 0 pods from previous attempts, current attempt id is 1.
2021-10-09 09:06:10.975 [flink-akka.actor.default-dispatcher-4] INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Recovered 0 workers from previous attempt.
2021-10-09 09:06:10.977 [flink-akka.actor.default-dispatcher-4] INFO o.a.f.runtime.resourcemanager.active.ActiveResourceManager - ResourceManager akka.tcp://[email protected]:6123/user/rpc/resourcemanager_0 was granted leadership with fencing token 00000000000000000000000000000000
2021-10-09 09:06:10.981 [flink-akka.actor.default-dispatcher-4] INFO o.a.f.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager.
2021-10-09 09:06:11.181 [flink-akka.actor.default-dispatcher-5] INFO com.dtstack.flinkx.util.DataSyncFactoryUtil - load flinkx plugin hdfsreader:com.dtstack.flinkx.connector.hd
@geosmart
geosmart / R_Demo.r
Created September 30, 2021 07:08
read/write csv in R via command-line args
library(optparse)
# Rscript test.r --in.csv1 data/mock.csv --in.csv2 data/mock.csv --out.csv1 data/out.csv
# read param
option_list <- list(
  make_option(c("-i", "--in.csv1"), type = "character", default = "", action = "store", help = "first input csv"),
  make_option(c("-f", "--in.csv2"), type = "character", default = "", action = "store", help = "second input csv"),
  make_option(c("-t", "--out.csv1"), type = "character", default = "", action = "store", help = "output csv")
)
# parse args, then the read/write round trip described above (completion sketch)
opt <- parse_args(OptionParser(option_list = option_list))
df <- read.csv(opt$`in.csv1`)
write.csv(df, opt$`out.csv1`, row.names = FALSE)
print("do klist")
os.system("klist")
krb5 = os.getenv("KRB5_CONFIG")
print("get krb5: {}".format(krb5))
if os.getenv("KRB5_CONFIG") is not None:
keytab = os.getenv("KEYTAB")
principal = os.getenv("PRINCIPAL")
kinit_cmd = "env KRB5_CONFIG={} kinit -kt {} {}".format(krb5, keytab, principal)
print("do kinit: {}".format(kinit_cmd))
os.system(kinit_cmd)
@geosmart
geosmart / check_partition.md
Last active August 23, 2021 09:26
dolphinscheduler data processing: partition data validation

dolphinscheduler can use a shell task node to check whether the data meets expectations

Variable definitions

PT_DATE=${system.biz.date}
PT_PATH=/user/hive/warehouse/default.db/test/pt_d=${PT_DATE}

Check whether the HDFS partition exists
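(The shell check itself is truncated out of this preview. As a hedged sketch, the same existence test expressed in Python; `hdfs dfs -test -e` exits with code 0 when the path exists:)

import subprocess
import sys

PT_DATE = "20210823"  # illustrative stand-in for ${system.biz.date}
PT_PATH = "/user/hive/warehouse/default.db/test/pt_d={}".format(PT_DATE)

# hdfs dfs -test -e returns exit code 0 iff the path exists
if subprocess.call(["hdfs", "dfs", "-test", "-e", PT_PATH]) != 0:
    print("partition not found: {}".format(PT_PATH))
    sys.exit(1)
print("partition exists: {}".format(PT_PATH))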

Flinkx 1.10.2 data sync resume-from-breakpoint code reading

isRestore logic

Whether to enable resume-from-breakpoint; it applies to file and JDBC data sources, and is not supported for streaming sources such as Kafka.

input plugin

BaseRichInputFormat
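(To make the note above concrete: a conceptual Python sketch of resume-from-breakpoint, not Flinkx's actual implementation. The idea is to persist the largest value of a monotonic restore column and filter on it at restart; Kafka-style streams lack such a replayable column, which is why they are excluded.)

import json
import os

STATE_FILE = "sync_state.json"  # illustrative; Flinkx keeps this in its checkpoint state

def load_offset():
    # last successfully synced value of the restore column, if any
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f).get("offset")
    return None

def save_offset(offset):
    with open(STATE_FILE, "w") as f:
        json.dump({"offset": offset}, f)

def build_query(offset):
    # on restart, skip rows already written in the previous attempt
    sql = "SELECT id, payload FROM source_table ORDER BY id"
    if offset is not None:
        sql = "SELECT id, payload FROM source_table WHERE id > {} ORDER BY id".format(offset)
    return sql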

@geosmart
geosmart / test_spark_udtf.py
Created July 13, 2021 02:55
use pyspark to implement a UDTF-style transform
def _test_spark_udtf(self):
    """
    # source
    root
    |-- id: long (nullable = true)
    |-- title: string (nullable = true)
    |-- abstract: string (nullable = true)
    |-- content: string (nullable = true)
    |-- else: string (nullable = true)
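(The preview ends inside the docstring. As a hedged sketch of the UDTF idea against that schema, illustrative and not the gist's actual test body, one input row fans out to one (field, value) row per text column:)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").appName("udtf_demo").getOrCreate()
df = spark.createDataFrame(
    [(1, "a title", "an abstract", "some content", "etc")],
    ["id", "title", "abstract", "content", "else"],
)
# Stack the text columns into a map, then explode: UDTF-style one-to-many.
exploded = df.select(
    "id",
    F.explode(
        F.create_map(
            F.lit("title"), F.col("title"),
            F.lit("abstract"), F.col("abstract"),
            F.lit("content"), F.col("content"),
        )
    ).alias("field", "value"),
)
exploded.show()
spark.stop()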
package cn.lite.flow.executor.plugin.sql.hive;
import org.apache.hive.jdbc.HiveStatement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
/**
* @description: hive log collection
package cn.lite.flow.executor.plugin.sql.hive;
import cn.lite.flow.common.model.consts.CommonConstants;
import cn.lite.flow.executor.plugin.sql.base.SQLHandler;
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.lang3.StringUtils;
import org.apache.hive.jdbc.HiveStatement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;