SergioEanX / weaviate_example_referenced_collection.ipynb
Last active November 26, 2024 10:14
Quick overview of Weaviate with referenced collections
SergioEanX / connect_weaviate_embedded
Created November 23, 2024 18:29
How to connect to embedded Weaviate: reconnect if an instance is already running, otherwise instantiate a new one
import weaviate

def get_weaviate():
    try:
        # Try to connect to an existing Weaviate instance
        client = weaviate.connect_to_local(port=8079, grpc_port=50060)
        if client.is_ready():
            print("Connected to an already running Weaviate instance.")
            return client
    except Exception:
        pass
    # No running instance found: start a new embedded one on the same ports
    print("Starting a new embedded Weaviate instance.")
    return weaviate.connect_to_embedded(port=8079, grpc_port=50060)
SergioEanX / Inference.py
Created November 8, 2024 07:24
Inference function for a trained LLM and tokenizer, fixing the error: "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:0 for open-end generation."
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    """
    Generates a continuation of the input text using the provided model and tokenizer.
    Args:
        text (str): The input text prompt.
        model: The pre-trained language model for generation.
        tokenizer: The tokenizer corresponding to the model.
        max_input_tokens (int, optional): Maximum number of tokens for the input. Defaults to 1000.
        max_output_tokens (int, optional): Maximum number of tokens to generate. Defaults to 100.
    """
    # Tokenize with an explicit attention mask to silence the pad-token warning
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input_tokens)
    generated = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"],
                               max_new_tokens=max_output_tokens, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
SergioEanX / lamini_foundation_vs_finetuned.ipynb
Created November 7, 2024 18:08
Foundation vs. fine-tuned LLM responses (using Lamini.ai)
SergioEanX / Fine-tuning Llama 3.2 Using Unsloth.ipynb
Last active November 7, 2024 08:40
Fine-tune Llama 3.2-3B-Instruct on the Mental Health Conversations dataset (from the original Kaggle blog by Abid Ali Awan)
SergioEanX / langchain_intro.py
Last active September 23, 2024 12:01
Example of how to use a LangChain pipeline with a model from Hugging Face (e.g. "google/gemma-2-2b-it")
"""
This script loads the Hugging Face "google/gemma-2-2b-it" model using 8-bit quantization for optimized inference.
It integrates with LangChain using a prompt template to simulate a travel agent AI, which breaks down travel requests into start, pitstops, and end locations.
"""
# Inspired by "Get Started with LangChain: Your Key to Mastering LLM Pipelines"
# https://medium.com/the-ai-espresso/get-started-with-langchain-your-key-to-mastering-llm-pipelines-b25a1728e8f3/
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from langchain.prompts import PromptTemplate
from langchain_huggingface.llms import HuggingFacePipeline
SergioEanX / intro_to_embeddings.ipynb
Created September 21, 2024 17:08
Simple introduction to embeddings (derived from a Cohere blog post)
SergioEanX / keyword_vs_semantic.py
Created September 17, 2024 07:58
Comparing keyword search to semantic search
from pymongo import MongoClient
import weaviate
from weaviate.classes.query import MetadataQuery
import os
from faker import Faker
from dotenv import load_dotenv
from pathlib import Path
from weaviate.classes.config import Configure, Property, DataType
import random
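The full gist pulls in Weaviate, MongoDB, and Faker, which all need live services or generated data. As a self-contained illustration of the same comparison, here is a toy sketch with hand-invented documents and 3-d "embeddings" (all names and vectors are made up for the example, not taken from the gist):

```python
import math

# Toy corpus with hand-made 3-d "embeddings" (invented for illustration)
docs = {
    "cheap flights to Rome": [0.9, 0.1, 0.0],
    "budget airfare to Italy": [0.85, 0.2, 0.05],
    "pasta recipes": [0.0, 0.1, 0.95],
}

def keyword_search(query, corpus):
    # Return docs sharing at least one literal token with the query
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.lower().split())]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_search(query_vec, corpus):
    # Rank every doc by cosine similarity to the query vector
    return sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]), reverse=True)

# "inexpensive plane tickets" shares no literal keywords with any doc ...
print(keyword_search("inexpensive plane tickets", docs))  # []
# ... but its (invented) query embedding still ranks the travel docs first
print(semantic_search([0.88, 0.15, 0.02], docs))
```

The point the gist makes at scale is the same one this sketch makes in miniature: keyword search misses paraphrases entirely, while vector similarity still surfaces the related documents.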
SergioEanX / parse_pdf_unstructured.py
Created September 16, 2024 09:06
Parsing a PDF using the Unstructured API
"""
This module processes a PDF file using the Unstructured API to extract text, tables, and images.
The extracted data is saved in a specified output directory. The module also provides functions
to convert HTML tables to pandas DataFrames and HTML content to Markdown.
Functions:
- html_table_to_dataframe: Convert an HTML table to a pandas DataFrame.
- html_to_markdown: Convert HTML content to Markdown.
- process_pdf: Process a PDF file using the Unstructured API and save the extracted data.
Images are saved as PNG files, tables as HTML files, and text as JSON.
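The preview cuts off before the helper bodies. As a stdlib-only sketch of what an html_to_markdown helper can do (handling only bold, italic, and paragraph tags; this is an illustrative assumption, not the gist's actual implementation, which works on Unstructured API output):

```python
from html.parser import HTMLParser

class SimpleMarkdown(HTMLParser):
    """Very small HTML-to-Markdown converter: handles <b>/<strong>, <i>/<em>, <p>."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.out.append("**")
        elif tag in ("i", "em"):
            self.out.append("*")

    def handle_endtag(self, tag):
        if tag in ("b", "strong"):
            self.out.append("**")
        elif tag in ("i", "em"):
            self.out.append("*")
        elif tag == "p":
            self.out.append("\n\n")  # blank line between paragraphs

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html):
    parser = SimpleMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

print(html_to_markdown("<p>A <b>bold</b> and <i>italic</i> claim.</p>"))
# A **bold** and *italic* claim.
```

A production version would also need tables, links, and lists; for tables the gist's html_table_to_dataframe route through pandas is the more robust choice.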
SergioEanX / foundation_vs_instruct_datasets.py
Created September 10, 2024 17:18
Example of how to download datasets from Hugging Face
import pandas as pd
import os
from dotenv import load_dotenv
from datasets import load_dataset
from itertools import islice

def load_and_preview_dataset(dataset_name, config=None, split="train"):
    print(f"Loading {dataset_name} dataset...")
    try:
        # Stream the dataset so only the previewed rows are downloaded
        dataset = load_dataset(dataset_name, config, split=split, streaming=True)
        for row in islice(dataset, 5):
            print(row)
    except Exception as e:
        print(f"Failed to load {dataset_name}: {e}")