Dani El-Ayyass (dayyass)

🚀
Rocket Science
View GitHub Profile
@dayyass
dayyass / muse_tokenize.ipynb
Last active September 5, 2023 08:19
How to get and use tokenizer from "universal-sentence-encoder-multilingual".
@dayyass
dayyass / sklearn_tokenizer.py
Created June 17, 2021 13:54
sklearn tokenizer used in HashingVectorizer, CountVectorizer and TfidfVectorizer.
import re
# Method build_tokenizer from _VectorizerMixin mixin from which classes HashingVectorizer, CountVectorizer and
# TfidfVectorizer (through CountVectorizer) are partially inherited.
# It is used to split a string into a sequence of tokens (only if analyzer == 'word').
def build_tokenizer(token_pattern: str = r"(?u)\b\w\w+\b"):
    """
    Return a function that splits a string into a sequence of tokens.
    """
    # mirrors sklearn's _VectorizerMixin.build_tokenizer
    return re.compile(token_pattern).findall
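The default `token_pattern` matches runs of two or more word characters, so single-character tokens and punctuation are dropped. A minimal sketch of the pattern in action, using only the regex itself:

```python
import re

# sklearn's default word-tokenizer regex: runs of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

print(token_pattern.findall("Is this the first document?"))
# ['Is', 'this', 'the', 'first', 'document']
```

Note that "a" or "I" would be silently discarded, which is usually what you want for bag-of-words features.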
@dayyass
dayyass / matrix_to_dict.py
Created June 17, 2021 18:29
Convert matrix into a dictionary whose keys are the row and column indices of the matrix and values correspond to the matrix values for given key indices.
import numpy as np
from tqdm import trange
from collections import defaultdict
from typing import Dict, Tuple, DefaultDict
def get_matrix_idx_to_value_dict(
matrix: np.ndarray,
verbose: bool = True,
) -> DefaultDict[Tuple[int, int], int]:
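The preview cuts off at the signature. A minimal standalone sketch of the idea from the description — mapping each `(row, col)` index pair to the matrix value — using a plain dict comprehension over a nested list (in place of the gist's `np.ndarray` and `tqdm` progress bar):

```python
# 2x3 "matrix" as a nested list, standing in for np.ndarray
matrix = [[0, 1, 2],
          [3, 4, 5]]

# keys are (row, col) index tuples, values are the matrix entries
idx_to_value = {
    (i, j): value
    for i, row in enumerate(matrix)
    for j, value in enumerate(row)
}

print(idx_to_value[(1, 2)])  # 5
```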
- repo: local
  hooks:
    - id: unittest
      name: unittest
      entry: python -m unittest discover
      language: python
      always_run: true
      pass_filenames: false
@dayyass
dayyass / Dockerfile
Last active July 19, 2021 10:06
jupyter-cuda10.1-tf2.2.0-docker-mlspace
FROM cr.msk.sbercloud.ru/aicloud-jupyter/jupyter-cuda10.1-tf2.2.0-mlspace:latest
MAINTAINER Dani El-Ayyass <[email protected]>
USER root
# Docker
# Set up the repository
RUN apt-get update && \
    apt-get install -y apt-transport-https ca-certificates curl gnupg lsb-release
@dayyass
dayyass / humanize_bytes.py
Created July 25, 2021 08:25
Convert bytes to human readable format.
def humanize_bytes(bytes: int, suffix: str = "B") -> str:
"""
Convert bytes to human readable format.
:param int bytes: number of bytes.
:param str suffix: bytes suffix.
:return: human readable size.
:rtype: str
"""
@dayyass
dayyass / lemmatized.py
Last active May 26, 2022 11:26
Pymorphy2 lemmatizer class.
import pymorphy2
class Lemmatizer:
"""
Pymorphy2 lemmatizer class.
"""
    def __init__(self):
        self.morph = pymorphy2.MorphAnalyzer()
@dayyass
dayyass / tfidf_token2idf.py
Last active September 29, 2021 12:25
Extract token2idf mapper from TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
# data
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
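One way to get the token-to-idf mapping without touching `TfidfVectorizer` internals is to recompute the smoothed idf that sklearn uses by default, `idf = ln((1 + n) / (1 + df)) + 1`. A sketch over the corpus above, tokenizing with sklearn's default lowercasing and token pattern (an illustration of the formula, not the gist's code):

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# lowercase + sklearn's default token pattern (2+ word characters)
docs = [set(re.findall(r"(?u)\b\w\w+\b", doc.lower())) for doc in corpus]
n = len(docs)

# smoothed idf, matching TfidfVectorizer(smooth_idf=True) defaults
token2idf = {
    token: math.log((1 + n) / (1 + sum(token in doc for doc in docs))) + 1
    for token in sorted(set().union(*docs))
}

print(round(token2idf["document"], 4))  # df=3 -> ln(5/4) + 1 = 1.2231
```

A token present in every document ("this") gets idf exactly 1.0, the floor of the smoothed formula.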
@dayyass
dayyass / tfidf_lemmatization.py
Created September 29, 2021 09:20
How to use sklearn TfidfVectorizer with lemmatizer.
from sklearn.feature_extraction.text import TfidfVectorizer
# pymorphy2 lemmatizer
import pymorphy2
class Lemmatizer:
    def __init__(self):
        self.morph = pymorphy2.MorphAnalyzer()
    def __call__(self, x: str) -> str:
        # body reconstructed: lemmatize each whitespace-separated token
        return " ".join(self.morph.parse(token)[0].normal_form for token in x.split())
@dayyass
dayyass / permutation_accuracy.py
Created October 9, 2021 14:01
Find a labels mapper with the highest accuracy.
from itertools import permutations
import numpy as np
from sklearn.metrics import accuracy_score
np.random.seed(42)
y_true = np.random.randint(low=0, high=3, size=100)
noize_mapper = {0: 1, 1: 2, 2: 0}
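The idea is a brute-force search over all label permutations, keeping the one that maximizes accuracy. A minimal pure-Python sketch (hypothetical helper name, without the gist's numpy/sklearn setup):

```python
from itertools import permutations

def best_label_mapping(y_true, y_pred, labels=(0, 1, 2)):
    # try every permutation of the label set as a pred -> true remapping
    best_acc, best_map = -1.0, None
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        acc = sum(t == mapping[p] for t, p in zip(y_true, y_pred)) / len(y_true)
        if acc > best_acc:
            best_acc, best_map = acc, mapping
    return best_map, best_acc

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [1, 2, 0, 1, 2, 0]  # labels shifted by the mapper {0: 1, 1: 2, 2: 0}

mapping, acc = best_label_mapping(y_true, y_pred)
print(mapping, acc)  # {0: 2, 1: 0, 2: 1} 1.0
```

Brute force is fine for a handful of classes; the search space grows as `k!`, so for many labels the Hungarian algorithm over the confusion matrix is the standard alternative.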