Dan Ofer ddofer

@yoavg
yoavg / structured-cot.md
Created November 25, 2024 23:13
Structured-chain-of-thought breaks some basic language-use principles

Are OpenAI training models in a way that encourages security risks?

Today's topic is structured outputs: how to produce them, their interplay with chain-of-thought, and a potential security risk this opens up.

Structured Outputs

When using an LLM programmatically as part of a larger system or process, it is useful to have the model produce outputs in a structured format that is easy to parse programmatically. Formatting the output as a JSON structure makes a lot of sense in this regard, and the commercial LLMs are trained to produce JSON outputs according to your specification. So, for example, instead of asking the model to produce a plain list of 10 items, which may be tricky to parse, I could ask it to return the answer as a JSON list of 10 strings.
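As a rough illustration of the parsing difference, here is a minimal Python sketch; call_llm below is a stand-in that returns a canned reply, not any particular vendor's API:

import json

def call_llm(prompt: str) -> str:
    """Stand-in for an actual LLM call; returns a canned JSON reply here."""
    return '["cat", "dog", "fox", "owl", "bee", "ant", "elk", "bat", "cod", "ram"]'

# Asking for a JSON list of 10 strings: the reply parses with a single json.loads call,
# with no need to strip numbering, bullets, or surrounding chatter.
reply = call_llm("Return 10 one-word animal names as a JSON list of 10 strings.")
items = json.loads(reply)
assert isinstance(items, list) and len(items) == 10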

@Hellisotherpeople
Hellisotherpeople / blog.md
Last active March 2, 2025 22:14
You probably don't know how to do Prompt Engineering, let me educate you.

You probably don't know how to do Prompt Engineering

(This post could also be titled "Features missing from most LLM front-ends that should exist")

Apologies for the snarky title, but there has been a huge amount of discussion around so-called "Prompt Engineering" these past few months on all kinds of platforms. Much of it comes from individuals who are peddling an awful lot of "Prompting" and very little "Engineering".

Most of these discussions are little more than users finding that writing more creative and complicated prompts can help them solve a task that a simpler prompt could not. I claim this is not Prompt Engineering. This is not to say that crafting good prompts is easy, but it does not involve any kind of sophisticated modification to the general "template" of a prompt.

Others, who I think do deserve to call themselves "Prompt Engineers" (and an awful lot more than that), have been writing about and utilizing the rich new ecosystem

@jackd
jackd / README.md
Last active August 2, 2022 12:41
tensorflow graphics keras port for PR #155

Get my forked tensorflow graphics repo and switch to the appropriate branch:

git clone https://github.com/jackd/graphics.git
cd graphics
git checkout sparse-feastnet
pip install -e .
cd ..

Get this gist:

@aditya-malte
aditya-malte / smallberta_pretraining.ipynb
Created February 22, 2020 13:41
smallBERTa_Pretraining.ipynb
@dkapitan
dkapitan / describe_robust.py
Last active November 1, 2021 08:54
Monkey-patch for pd.DataFrame.describe() with robust statistics
def describe_robust(self, percentiles=None, include=None, exclude=None, trim=0.2):
"""
Monkey-patch for pd.DataFrame.describe based on robust statistics.
Calculates the trimmed mean and winsorized standard deviation with default trim 0.2.
Uses scipy.stats.mstats (trimmed_mean, winsorize) and numpy.std.
See e.g. http://www.uh.edu/~ttian/ES.pdf for methodological background.
BSD 3-Clause License
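The preview cuts off before the function body; a minimal sketch of how such a monkey-patch could be completed (my assumption, not the gist's actual implementation):

import numpy as np
import pandas as pd
from scipy.stats import mstats

def describe_robust(self, percentiles=None, include=None, exclude=None, trim=0.2):
    """Sketch: describe() extended with a trimmed mean and a winsorized std."""
    desc = self.describe(percentiles=percentiles, include=include, exclude=exclude)
    num = self.select_dtypes(include=np.number).to_numpy()
    desc.loc["trimmed mean"] = mstats.trimmed_mean(num, limits=(trim, trim), axis=0)
    desc.loc["winsorized std"] = np.std(mstats.winsorize(num, limits=(trim, trim), axis=0), axis=0)
    return desc

pd.DataFrame.describe_robust = describe_robust  # monkey-patch onto DataFrame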
@GaelVaroquaux
GaelVaroquaux / impact_encoding.py
Created October 29, 2018 14:19
Target encoding (or impact encoding)
# How to use: df should be the dataframe restricted to the categorical columns to impact-encode,
# and target should be the pd.Series of target values.
# Use fit, transform, etc.
# Three types: binary, multiple, continuous.
# For now m is a param <===== but what should we put here? I guess some function of the total shape,
# i.e. what value of m would we want for 0.5?
import pandas as pd
import numpy as np
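For reference, a minimal sketch of smoothed target encoding with such an m parameter (my own illustration of the idea, not the gist's fit/transform implementation):

import pandas as pd

def target_encode(column: pd.Series, target: pd.Series, m: float = 10.0) -> pd.Series:
    """Replace each category with a smoothed mean of the target.
    m controls how strongly small categories are pulled toward the global mean."""
    global_mean = target.mean()
    stats = target.groupby(column).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return column.map(smoothed)

# Toy usage: encode one categorical column against a binary target.
df = pd.DataFrame({"city": ["a", "a", "b", "b", "b"]})
y = pd.Series([1, 0, 1, 1, 0])
df["city_encoded"] = target_encode(df["city"], y, m=2.0)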
@GaelVaroquaux
GaelVaroquaux / deconfound.py
Last active July 18, 2021 12:35
Linear deconfounding in a fit-transform API
"""
A scikit-learn-like transformer to remove a confounding effect on X.
"""
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LinearRegression
import numpy as np
class DeConfounder(BaseEstimator, TransformerMixin):
""" A transformer removing the effect of y on X.
@fomightez
fomightez / useful_FASTA_handling.py
Last active July 3, 2024 21:19
snippets for dealing with FASTA
# This is not presently all-encompassing, as it was started well after my sequence work repo
# at https://github.com/fomightez/sequencework , where much of the related code is.
# For making FASTA files/entries out of dataframes, see 'specific dataframe contents saved as formatted text file example'
# in my useful pandas snippets gist https://gist.github.com/fomightez/ef57387b5d23106fabd4e02dab6819b4
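As a small companion example (the column names 'id' and 'sequence' are my assumption, not taken from the linked snippets), writing FASTA entries straight from a dataframe can look like:

import pandas as pd

def dataframe_to_fasta(df: pd.DataFrame, path: str, line_width: int = 60) -> None:
    """Write each dataframe row as a FASTA entry, wrapping sequences at line_width."""
    with open(path, "w") as handle:
        for _, row in df.iterrows():
            handle.write(f">{row['id']}\n")
            seq = str(row["sequence"])
            for start in range(0, len(seq), line_width):
                handle.write(seq[start:start + line_width] + "\n")

# Toy usage:
df = pd.DataFrame({"id": ["seq1", "seq2"], "sequence": ["ATGCATGCATGC", "GGGCCCAAATTT"]})
dataframe_to_fasta(df, "example.fasta")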
1. Download the model:
if [[ ! -e 'numberbatch-17.06.txt' ]]; then
    wget https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.06.txt.gz
    gunzip numberbatch-17.06.txt.gz
fi
sudo pip install wordfreq
sudo pip install gensim
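Once downloaded, the Numberbatch vectors are in word2vec text format and can be loaded with gensim; a minimal sketch (in the multilingual file, terms are keyed as '/c/<lang>/<term>'):

from gensim.models import KeyedVectors

# Loading the full multilingual file takes a while and several GB of RAM.
vectors = KeyedVectors.load_word2vec_format("numberbatch-17.06.txt", binary=False)
print(vectors.most_similar("/c/en/cat", topn=5))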