AI ML Learnings
Main Components of LLM training:
architecture, training algorithm/loss, data, evaluation, and systems
AI systems:
Majority completion strategy (often called self-consistency): for a hard reasoning task, the model can sample several different reasoning paths, and the final answer is the majority output across those paths (see the sketch below).
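A minimal sketch of that idea; `ask_model` is a hypothetical stand-in for any LLM call with sampling enabled (temperature > 0), not a real API:

```python
from collections import Counter

def majority_answer(ask_model, question, n_paths=5):
    """Sample several reasoning paths and return the majority final answer."""
    # Each call should sample a fresh reasoning path (temperature > 0)
    answers = [ask_model(question) for _ in range(n_paths)]
    # most_common(1) returns [(answer, count)] for the top answer
    return Counter(answers).most_common(1)[0][0]
```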
DSPy is a declarative programming library for building and optimizing AI-powered applications, especially those leveraging large language models (LLMs). It provides a structured, modular approach to defining tasks, training models, and integrating AI capabilities.
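A minimal usage sketch based on DSPy's documented Signature/Module pattern; the exact API varies by version, and the model string is just an example:

```python
import dspy

# Point DSPy at an LM backend (model name is an example placeholder)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A signature declares the task's inputs/outputs instead of a raw prompt;
# ChainOfThought adds an intermediate reasoning step automatically
qa = dspy.ChainOfThought("question -> answer")
result = qa(question="What is the capital of France?")
print(result.answer)
```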
Books:
LLM Engineer's Handbook
AI Product Manager:
IBM AI Product Manager Professional Certificate (Coursera)
https://drive.google.com/drive/folders/1NQqtEYqdo3jAbzQGtvrdsSrIrA-045N-
As of this writing, Gemini models have the longest context windows among the major frontier models (on the order of 1M+ tokens).
Common ML Mistakes:
1. Not Cleaning Data Properly
- Dirty data (duplicates, outliers, mixed formats) leads to biased and unreliable models.
- Data cleaning is essential—"Garbage in, garbage out."
2. Skipping Feature Scaling
- Normalization/standardization is crucial when features have different scales.
- Without scaling, gradient descent struggles, leading to poor model performance (see the scaling sketch after this list).
3. Data Leakage
- If test/validation data influences training, the model performs unrealistically well.
- Always split the data first, then fit preprocessing on the training set only and apply the fitted transform to the test set (see the scaling sketch after this list).
4. Ignoring Class Imbalance
- Heavily skewed data (e.g., fraud detection) results in misleading accuracy.
- Use oversampling, undersampling, or SMOTE to balance classes (see the SMOTE sketch after this list).
5. Incorrect Handling of Missing Values
- Simply deleting missing values loses information, while naive imputation may mislead.
- Treat missing values contextually; sometimes they hold important signals.
6. Choosing the Wrong Evaluation Metric
- Accuracy is misleading for imbalanced datasets.
- Use precision, recall, F1-score, or problem-specific metrics (see the metrics sketch after this list).
7. Overfitting to Training Data
- A model that performs well only on training data has learned noise, not patterns.
- Use regularization, cross-validation, and early stopping to control overfitting.
8. Underfitting the Data
- Too simple a model (e.g., linear regression for non-linear data) fails to capture patterns.
- Increase model complexity or use feature engineering to improve learning.
9. Using an Incorrect Learning Rate
- A high learning rate causes instability; a low one makes training slow.
- Use learning rate schedules or tuning to find a balance (see the scheduler sketch after this list).
10. Poor Hyperparameter Choices
- Randomly selecting batch sizes, network architectures, or other parameters hurts performance.
- Use grid search, random search, or Bayesian optimization for systematic tuning (see the search sketch after this list).
11. Not Using Cross-Validation
- Training on a single split can be misleading.
- K-fold cross-validation provides a more reliable performance estimate (also shown in the search sketch after this list).
12. Train-Test Contamination
- If test data is used for hyperparameter tuning or feature selection, the test set becomes invalid.
- Keep test data completely separate and only use it for final evaluation.
13. Using the Wrong Loss Function
- Classification typically calls for cross-entropy; regression for MSE/MAE.
- Choosing the wrong loss function confuses model optimization.
14. Incorrect Feature Encoding
- Label encoding for categorical variables can introduce false ordinal relationships.
- Use one-hot encoding or embeddings where necessary (see the encoding sketch after this list).
15. Not Shuffling Training Data
- Ordered datasets (e.g., sorted by label or collected over time) produce biased batches.
- Shuffle training data before batching, unless the task is inherently sequential (e.g., time-series forecasting).
16. Memory Management Issues
- Loading too much data at once can crash the system.
- Use batch processing, clear GPU memory, and monitor resource usage.
17. Ignoring Model Bias and Fairness
- Models may unintentionally discriminate based on gender, ethnicity, or other factors.
- Check model performance across different subgroups and document biases.
18. Ignoring Model Assumptions
- Many algorithms have assumptions (e.g., linear regression assumes linear relationships).
- Validate data distributions and relationships before selecting models.
19. Starting with Complex Models Too Early
- Deep learning isn’t always necessary; simpler models often work better.
- Start with logistic regression, decision trees, or random forests before increasing complexity.
20. Ignoring Domain Knowledge
- ML isn’t just about data—understanding the business or scientific context is crucial.
- Experts can help identify meaningful features and ensure the model aligns with real-world needs.
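Hedged sketches for several of the points above follow; all are minimal scikit-learn/PyTorch illustrations, not production code. First, leakage-safe scaling (mistakes 2 and 3): split first, fit the scaler on the training set only, and reuse its statistics on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=42)

# Split FIRST, so the test set never influences preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics
```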
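For class imbalance (mistake 4), a SMOTE sketch with imbalanced-learn:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Roughly 95/5 imbalance, as in fraud-style problems
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print(Counter(y))  # majority class dominates

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes now balanced via synthetic minority samples
```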
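For evaluation metrics (mistake 6), a toy case where accuracy looks fine but minority-class recall is poor:

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Accuracy is 0.9 here, but recall on class 1 is only 0.5
print(classification_report(y_true, y_pred))
```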
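For learning rates (mistake 9), a PyTorch scheduler sketch that halves the rate every 10 epochs; the model and loop body are placeholders:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one training epoch: forward pass, loss, backward, optimizer.step() ...
    scheduler.step()  # decay the learning rate once per epoch
```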
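For hyperparameter tuning and cross-validation (mistakes 10 and 11), a combined search sketch: cross_val_score for a reliable estimate, GridSearchCV for systematic search.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

# K-fold CV: five estimates instead of one potentially lucky split
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())

# Systematic search over a small grid, evaluated with the same CV
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```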
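For feature encoding (mistake 14), one-hot encoding avoids the false ordering that label encoding implies:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["NY", "SF", "NY", "LA"]})

# Label encoding would map LA=0, NY=1, SF=2, implying LA < NY < SF
enc = OneHotEncoder(sparse_output=False)  # sparse_output needs sklearn >= 1.2
one_hot = enc.fit_transform(df[["city"]])
print(enc.get_feature_names_out())  # ['city_LA' 'city_NY' 'city_SF']
print(one_hot)
```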
Encoder-only models can't be used for chatbots because of a temporal paradox: their attention is bidirectional, so the model would need to know the entire future sequence before generating the first token. Autoregressive generation requires causal (decoder-style) attention, as the masks below illustrate.
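A small PyTorch sketch of the two attention masks (True = position may be attended to):

```python
import torch

T = 5  # sequence length
# Causal (decoder) mask: position t attends only to positions <= t,
# so tokens can be generated left to right
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
# Encoder mask: every position attends to every other, including
# "future" tokens - which don't exist yet during generation
bidirectional = torch.ones(T, T, dtype=torch.bool)
print(causal)
```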
DSPy offers supervision for intermediate layers: the framework provides mechanisms to apply supervision and constraints to intermediate module outputs within multi-stage pipelines, not just to the final output. This is done through Assertions.
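A sketch of that idea based on the DSPy Assertions interface from the project's paper and older docs; the assertions API has changed across releases (some versions require activating assertions on the module first), so treat this as illustrative:

```python
import dspy

class CitedQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        pred = self.generate(question=question)
        # Constrain an intermediate module output; on failure DSPy
        # backtracks and retries the step with this message as feedback
        dspy.Assert("[" in pred.answer, "Answer must include a [citation].")
        return pred
```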
BehaviorGPT is Unbox AI's behavior-to-behavior foundation model that generates human actions and transactions. Unlike traditional Large Language Models (LLMs), which work in what the company calls a "think and talk space," Large Behavior Models (LBMs) operate directly in the "real world of actions."
The key distinction is the training data: while LLMs are trained on text from the internet, BehaviorGPT is trained on over 1 trillion real-world human actions - behaviors like clicks, scrolls, purchases, form interactions, investments, and movements - data that flows in volumes 100× larger than all the content we post each day.
CTGAN and SMOTENC can be used to balance classes in tabular data that mixes numerical and categorical features (plain SMOTE handles only numerical features).
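A minimal SMOTENC sketch with imbalanced-learn; the toy data and column choice are assumptions for illustration. (CTGAN, from the ctgan package, instead learns a generative model of the whole table and can sample extra minority rows.)

```python
import numpy as np
from imblearn.over_sampling import SMOTENC

# Toy data: column 0 is numeric, column 1 is a categorical code
X = np.array([[1.2, 0], [0.8, 1], [1.1, 0], [3.5, 2],
              [1.0, 1], [0.9, 0], [1.3, 2], [3.2, 1]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # 6 vs 2: imbalanced

# categorical_features tells SMOTENC which columns NOT to interpolate;
# categories of synthetic samples are chosen by neighbor majority vote.
# k_neighbors=1 because the minority class has only 2 samples.
smote_nc = SMOTENC(categorical_features=[1], k_neighbors=1, random_state=0)
X_res, y_res = smote_nc.fit_resample(X, y)
print(np.bincount(y_res))  # [6 6]
```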