SergioEanX / weaviate_example_referenced_collection.ipynb
Last active November 26, 2024 10:14
Quick overview of Weaviate with referenced collections
SergioEanX / connect_weaviate_embedded
Created November 23, 2024 18:29
How to connect to embedded Weaviate: reconnect if an instance is already running, otherwise instantiate a new one
import weaviate

def get_weaviate():
    try:
        # Try to connect to an existing Weaviate instance
        client = weaviate.connect_to_local(port=8079, grpc_port=50060)
        if client.is_ready():
            print("Connected to an already running Weaviate instance.")
            return client
    except Exception:
        pass
    # No running instance found: start a new embedded one on the same ports
    print("Starting a new embedded Weaviate instance.")
    return weaviate.connect_to_embedded(port=8079, grpc_port=50060)
SergioEanX / Inference.py
Created November 8, 2024 07:24
Inference function for a trained LLM and tokenizer, fixing the error: "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:0 for open-end generation."
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    """
    Generates a continuation of the input text using the provided model and tokenizer.
    Args:
        text (str): The input text prompt.
        model: The pre-trained language model for generation.
        tokenizer: The tokenizer corresponding to the model.
        max_input_tokens (int, optional): Maximum number of tokens for the input. Defaults to 1000.
        max_output_tokens (int, optional): Maximum number of tokens to generate. Defaults to 100.
    """
    # Tokenize with an explicit attention mask to silence the pad-token warning
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input_tokens)
    generated = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"],
                               max_new_tokens=max_output_tokens, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
SergioEanX / lamini_foundation_vs_finetuned.ipynb
Created November 7, 2024 18:08
Foundation vs. fine-tuned LLM responses (using Lamini.ai)
SergioEanX / Fine-tuning Llama 3.2 Using Unsloth.ipynb
Last active November 7, 2024 08:40
Fine-tune Llama 3.2-3B-Instruct on the Mental Health Conversations dataset (from the original Kaggle blog by Abid Ali Awan)
SergioEanX / langchain_intro.py
Last active September 23, 2024 12:01
Example of how to use a LangChain pipeline with a model from Hugging Face (e.g. "google/gemma-2-2b-it")
"""
This script loads the Hugging Face "google/gemma-2-2b-it" model using 8-bit quantization for optimized inference.
It integrates with LangChain using a prompt template to simulate a travel agent AI, which breaks down travel requests into start, pitstops, and end locations.
"""
# Inspired by "Get Started with LangChain: Your Key to Mastering LLM Pipelines"
# https://medium.com/the-ai-espresso/get-started-with-langchain-your-key-to-mastering-llm-pipelines-b25a1728e8f3/
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from langchain.prompts import PromptTemplate
from langchain_huggingface.llms import HuggingFacePipeline
SergioEanX / intro_to_embeddings.ipynb
Created September 21, 2024 17:08
Simple introduction to embeddings (derived from a Cohere blog post)
SergioEanX / keyword_vs_semantic.py
Created September 17, 2024 07:58
Comparing keyword search to semantic search
from pymongo import MongoClient
import weaviate
from weaviate.classes.query import MetadataQuery
import os
from faker import Faker
from dotenv import load_dotenv
from pathlib import Path
from weaviate.classes.config import Configure, Property, DataType
import random
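The full gist pulls in Weaviate, MongoDB, and Faker, which all need live services or generated data. As a self-contained illustration of the same comparison, here is a toy sketch with hand-invented documents and 3-d "embeddings" (all names and vectors are made up for the example, not taken from the gist):

```python
import math

# Toy corpus with hand-made 3-d "embeddings" (invented for illustration)
docs = {
    "cheap flights to Rome": [0.9, 0.1, 0.0],
    "budget airfare to Italy": [0.85, 0.2, 0.05],
    "pasta recipes": [0.0, 0.1, 0.95],
}

def keyword_search(query, corpus):
    # Return docs sharing at least one literal token with the query
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.lower().split())]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_search(query_vec, corpus):
    # Rank every doc by cosine similarity to the query vector
    return sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]), reverse=True)

# "inexpensive plane tickets" shares no literal keywords with any doc ...
print(keyword_search("inexpensive plane tickets", docs))  # []
# ... but its (invented) query embedding still ranks the travel docs first
print(semantic_search([0.88, 0.15, 0.02], docs))
```

The point the gist makes at scale is the same one this sketch makes in miniature: keyword search misses paraphrases entirely, while vector similarity still surfaces the related documents.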
SergioEanX / parse_pdf_unstructured.py
Created September 16, 2024 09:06
Parsing a PDF using the Unstructured API
"""
This module processes a PDF file using the Unstructured API to extract text, tables, and images.
The extracted data is saved in a specified output directory. The module also provides functions
to convert HTML tables to pandas DataFrames and HTML content to Markdown.
Functions:
- html_table_to_dataframe: Convert an HTML table to a pandas DataFrame.
- html_to_markdown: Convert HTML content to Markdown.
- process_pdf: Process a PDF file using the Unstructured API and save the extracted data.
Images are saved as PNG files, tables as HTML files, and text as JSON.
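The preview cuts off before the helper bodies. As a stdlib-only sketch of what an html_to_markdown helper can do (handling only bold, italic, and paragraph tags; this is an illustrative assumption, not the gist's actual implementation, which works on Unstructured API output):

```python
from html.parser import HTMLParser

class SimpleMarkdown(HTMLParser):
    """Very small HTML-to-Markdown converter: handles <b>/<strong>, <i>/<em>, <p>."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.out.append("**")
        elif tag in ("i", "em"):
            self.out.append("*")

    def handle_endtag(self, tag):
        if tag in ("b", "strong"):
            self.out.append("**")
        elif tag in ("i", "em"):
            self.out.append("*")
        elif tag == "p":
            self.out.append("\n\n")  # blank line between paragraphs

    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html):
    parser = SimpleMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

print(html_to_markdown("<p>A <b>bold</b> and <i>italic</i> claim.</p>"))
# A **bold** and *italic* claim.
```

A production version would also need tables, links, and lists; for tables the gist's html_table_to_dataframe route through pandas is the more robust choice.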
SergioEanX / foundation_vs_instruct_datasets.py
Created September 10, 2024 17:18
Example of how to download datasets from Hugging Face
import pandas as pd
import os
from dotenv import load_dotenv
from datasets import load_dataset
from itertools import islice

def load_and_preview_dataset(dataset_name, config=None, split="train"):
    print(f"Loading {dataset_name} dataset...")
    try:
        # Stream the dataset so only the previewed rows are downloaded
        dataset = load_dataset(dataset_name, config, split=split, streaming=True)
        for row in islice(dataset, 5):
            print(row)
    except Exception as e:
        print(f"Failed to load {dataset_name}: {e}")