Skip to content

Instantly share code, notes, and snippets.

@srikumarks
Last active June 23, 2022 10:21
Show Gist options
  • Save srikumarks/36b20a4a4d7d66ff457ffa2813b24992 to your computer and use it in GitHub Desktop.
Save srikumarks/36b20a4a4d7d66ff457ffa2813b24992 to your computer and use it in GitHub Desktop.
Miniature date language data generator
:- module(qd, [gendata/0, nth/5, weekday/2, direction/2, dayprop/2, today/1, gendates/4, leapyear/1, nonleapyear/1, validate/1, cardinal/3]).
:- use_module(library(clpfd)).
% This module defines a small query language for (approximate) dates according
% to the gregorian calendar. You can run this program using --
% swipl -t gendata datelang.pl
% Abstract
% ========
%
% Large language models (LLMs) like GPT3 and PaLM have garnered interest due to
% their showing some signs of reasoning ability. While the deep learning
% faithful went forth and asserted that what remains now is to just scale the
% models, another group of researchers advocate for adding symbolic AI to the
% mix without which, they claim, it will not be possible for current
% architectures to achieve anywhere near the level of flexible reasoning humans
% engage in. Part of the difficulty is that at the level of the LLMs, pretty
% much nobody has the infrastructure to test such statements on them. The
% purpose of this module is to serve as a generator for a small dataset focused
% on calculations about dates. Date calculations are modest, in that they
% involve basic arithmetic and modulus calculations, but are complex enough
% when considered alongside what humans are typically interested about dates.
% If a "pure" deep learning network can be trained on data generated by this
% program and tested on held out data also generated by this program and shows
% 100% accuracy, it would be fair to claim that deep learning architectures can
% learn such logic. If the current architectures are unable to do that, it
% would at least serve as a small playground to discover and test potential
% architecture that can go on to do that and we'd have gained some
% understanding of the gaps that the current architectures have. Either way, we
% win!
%
% The language
% ============
%
% This module defines a number of basic queries on dates. Time is not of concern
% here. There are 3 aspects to these queries -
%
% a) A query term describing what information is being sought. Currently the
% language supports date(Y,M,D), weekday(WeekDayName), weekend, long_weekend,
% holiday and day.
% b) A "generator" term which describes the direction relative to a date the
% query is being performed. Currently two generators are supported - before
% and after.
% c) A desired response term that describes what aspect of the queried date is
% wanted as a response. Currently supported responses are -- date(Y,M,D),
% year(Y), month(MonthName), weekday(WeekDayName).
%
% In addition to the three, an "nth" such day meeting the query can be requested,
% where the N is given either as a numerical value > 0 or as a cardinal term of
% the form (for example) cardinal(fifth).
%
% Characteristics
% ===============
%
% The language features simple words that stand for numeric concepts and the
% model would be expected to discover these relationships through training.
% The model could have a much easier job if it didn't have to discover these
% associations, which mimics what we can expect in real world language usage.
%
% The word queries relate different dates together, which is also expected to
% be discovered by the model. These relationships, especially equivalences,
% are also characteristic of normal language use.
%
% Numerical date presentations are also included in both the query and the
% response. So identifying and calculating with digits is also part of the
% model's requirements. The place value system is one of the simplest forms
% of "language" we use on a daily basis.
%
% There are also specific "special" dates -- holidays -- which a model will
% simply have to memorize. Ideally, it should be possible to just feed these
% holidays as data to the model without having to train it, but that's left
% to the model builder to consider.
%
% Hypotheses targeted by this dataset
% ===================================
%
% *Note*: In the following, I refer to NNs, but what I really mean is models
% featuring continuous parameter spaces where parameters are learnt using
% stochastic gradient descent.
%
% 1. Are "learning logical reasoning" and "learning to imitate logical reasoning"
% empirically distinguishable? (Is this a stupid question?) If pure SGD
% architectures fail on this task even when given a large parameter space,
% it might support the argument that they are different and evidence of
% imitating logical reasoning should not count towards trusting architectures
% to perform actual logical reasoning. OTOH, if they do succeed, especially
% when success is not taken by merely a high percentage like 99.9% but as a
% perfect score of 100% on generated held out examples, it gives some
% credence to the claim that symbolic capability perhaps does not need to
% be added as subsystems to otherwise pure SGD based models.
% 2. Accurate logical reasoning can be achieved through current neural network
% architectures through SGD training on such a dataset, without assistance from
% symbolic subsystems. The problem domain is small enough that extremely high
% (even 100%?) reliability on held out (freshly generated) data is not an
% unreasonable ask, especially given the claims of "we only need scale now".
% 3. Architecture complexity depends on complexity of dataset. In this case, the
% dataset's complexity is low (short sentences about dates). However, I'm not
% sure small-parameter-count models with known architectures can perform well
% on this. (Plan to study.)
% 4. NN architectures can produce reliable indicators of truth of statements.
% If a full predicate expression is given, the model must be able to compute
% a truth probability for it that is in line with what would be expected of
% purely symbolic systems - perhaps as a softmax pair.
% 5. Can we evaluate the complexity of information flow in a network for
% "obvious" relationships like "the third weekend after 2022-06-20 falls on a saturday"
% relative to statements that call for more compute? (in this dataset, weekends
% always fall on saturdays and long_weekends always fall on fridays). (This should
% be student homework perhaps.)
% 6. Is approach for 4 usable to characterize work done to compare dates? .. i.e.
% finding whether one date is before or after another given date. If this turns out
% this is low, then unintelligent search used by gendates for this will stand
% superceded, chalking a point up for NNs.
%
% Possible revisions
% ==================
%
% It could be valuable to treat the nth/5 predicate with an extra argument that
% declares the rest of the arguments as true or false, rather than have only
% true statements included in the dataset. I'm not sure we should give that kind
% of consideration for the models though, since we can't expect LLMs to be given
% a set of known-to-be-false associations to train on (apart from negative sampling
% derived from the truth dataset) in addition to going self-supervised on large
% language productions from the web. They should be able to gauge truth from
% falsity without being told (how? .. beats me right now!)
%
% The entry point predicate
% =========================
%
% nth/5 is the main query predicate which has the form -
%
% nth(N, Query, Generator, date(Y,M,D), Response).
%
% where we can use cardinal(fifth) for N as well as natural numbers.
%
% List of names of week days.
% Used for query and printing.
weekday(1, monday).
weekday(2, tuesday).
weekday(3, wednesday).
weekday(4, thursday).
weekday(5, friday).
weekday(6, saturday).
weekday(7, sunday).
% List of month names and their common abbreviated forms.
monthname(1, january, jan).
monthname(2, february, feb).
monthname(3, march, mar).
monthname(4, april, apr).
monthname(5, may, may).
monthname(6, june, jun).
monthname(7, july, jul).
monthname(8, august, aug).
monthname(9, september, sep).
monthname(10, october, oct).
monthname(11, november, nov).
monthname(12, december, dec).
% How many days are there in a given month (by number).
daysinmonth(_, 1, 31).
daysinmonth(Y, 2, 28) :- nonleapyear(Y).
daysinmonth(Y, 2, 29) :- leapyear(Y).
daysinmonth(_, 3, 31).
daysinmonth(_, 4, 30).
daysinmonth(_, 5, 31).
daysinmonth(_, 6, 30).
daysinmonth(_, 7, 31).
daysinmonth(_, 8, 31).
daysinmonth(_, 9, 30).
daysinmonth(_, 10, 31).
daysinmonth(_, 11, 30).
daysinmonth(_, 12, 31).
% Our two "generators".
direction(before, -1).
direction(after, 1).
% Validates components of a date. We only consider a 10000 year
% range.
validate(date(_Y,2,30)) :- false.
validate(date(_Y,2,31)) :- false.
validate(date(Y,2,29)) :- Y #> 0, Y #< 10000, leapyear(Y).
validate(date(Y,M,D)) :-
Y #> 0, Y #< 10000,
M #> 0, M #=< 12,
daysinmonth(Y, M, ND),
D #> 0, D #=< ND.
% Asserts properties of a given date. You can use this
% to check or extract aspects like weekday name, whether
% it is a weekend or a holiday or a long_weekend or a
% leap year (Jan 1st of leap years, that is).
dayprop(date(Y,M,D), date(Y,M,D)) :- validate(date(Y,M,D)).
dayprop(date(Y,M,D), weekday(WDName)) :-
validate(date(Y,M,D)),
label([Y,M,D]),
day_of_the_week(date(Y,M,D), WDNum),
weekday(WDNum, WDName).
dayprop(date(Y,M,D), weekend) :-
dayprop(date(Y,M,D), weekday(saturday)).
% Concept of business day is any week day (monday through friday)
% that isn't a known holiday.
dayprop(date(Y,M,D), business_day) :-
validate(date(Y,M,D)),
day_of_the_week(date(Y,M,D), WDNum),
WDNum #>= 1, WDNum #=< 5,
\+ dayprop(date(Y,M,D), holiday).
% Every valid date is a "day". This trivial thing is useful
% to count "n days after" and such.
dayprop(date(Y,M,D), day) :- validate(date(Y,M,D)).
% Some "random" holidays. New year, pongal, India's independence day, republic
% day, may day, USA's independence day, Gandhi jayanthi, and such.
% A model for this language should be able to (ideally) accept these as data
% at prediction time rather than have them be inferred from training data.
dayprop(date(Y,1,1), holiday) :- validate(date(Y,1,1)).
dayprop(date(Y,1,15), holiday) :- validate(date(Y,1,15)).
dayprop(date(Y,5,1), holiday) :- validate(date(Y,5,1)).
dayprop(date(Y,7,4), holiday) :- validate(date(Y,7,4)).
dayprop(date(Y,12,25), holiday) :- validate(date(Y,12,25)).
dayprop(date(Y,8,15), holiday) :- validate(date(Y,8,15)), Y #>= 1947.
dayprop(date(Y,4,14), holiday) :- validate(date(Y,4,14)), Y #>= 1947.
dayprop(date(Y,1,26), holiday) :- validate(date(Y,1,26)), Y #>= 1949.
dayprop(date(Y,10,2), holiday) :- validate(date(Y,10,2)), Y #>= 1947.
% For now, keep it simple by not considering thursday+friday holidays.
dayprop(date(Y,M,D), long_weekend) :-
dayprop(date(Y,M,D), holiday),
dayprop(date(Y,M,D), weekday(friday)).
dayprop(date(Y,1,1), leapyear) :-
validate(date(Y,1,1)),
leapyear(Y).
% You can query for properties of adjacent dates as well
% like dayprop(date(Y,M,D), day(after, 2, date(Y2,M2,D2))).
% which will give the Y/M/D two days after the given date.
dayprop(date(Y,M,D), day(Gen, N, Prop)) :-
date_time_stamp(date(Y,M,D,0,0,0,0,-,-), TS),
direction(Gen, Dir),
TSnext is TS + Dir * N * 24 * 3600,
stamp_date_time(TSnext, date(YY,MM,DD,_,_,_,_,_,_), 'UTC'),
dayprop(date(YY,MM,DD), Prop).
% Asserts property for today's date. All dates are UTC,
% for simplicity.
today(Prop) :-
get_time(TS),
stamp_date_time(TS, date(Y,M,D,_,_,_,_,_,_), 'UTC'),
dayprop(date(Y,M,D), Prop).
% Concept of leap year is kept simple - year number
% divisible by 4 but not by 100 are leap years.
leapyear(Y) :-
R #= Y rem 4,
Rc #= Y rem 100,
R #= 0, Rc #> 0.
nonleapyear(Y) :-
R #= Y rem 4,
Rc #= Y rem 100,
(R #> 0; Rc #= 0).
% The core work predicate which discovers dates
% that meet the desired property given the generator relative
% to a given date.
gendates(Prop, Gen, date(Y,M,D), date(YY,MM,DD)) :-
validate(date(Y,M,D)),
label([Y,M,D]),
date_time_stamp(date(Y,M,D,0,0,0,0,-,-), TS),
direction(Gen, Dir),
N #> 0, N #< 365000,
label([N]),
TS2 is TS + Dir * N * 24 * 3600,
stamp_date_time(TS2, date(YY,MM,DD,_,_,_,_,_,_), 'UTC'),
dayprop(date(YY,MM,DD), Prop).
% Utility to get last element of a list.
last([X],X).
last([_|Y], Z) :- last(Y, Z).
% Table of cardinal numbers and their suffix forms.
cardinal(first, 1, st).
cardinal(second, 2, nd).
cardinal(third, 3, rd).
cardinal(fourth, 4, th).
cardinal(fifth, 5, th).
cardinal(sixth, 6, th).
cardinal(seventh, 7, th).
cardinal(eighth, 8, th).
cardinal(ninth, 9, th).
cardinal(tenth, 10, th).
cardinal(eleventh, 11, th).
cardinal(twelfth, 12, th).
% nth/5 is the main query predicate.
nth(0, day, before, date(Y,M,D), Out) :-
dayprop(date(Y,M,D), Out).
nth(cardinal(Name), Prop, Gen, date(Y,M,D), Out) :-
cardinal(Name, N, _),
nth(N, Prop, Gen, date(Y,M,D), Out).
nth(N, Prop, Gen, date(Y,M,D), year(YY)) :-
nth(N, Prop, Gen, date(Y,M,D), date(YY,_MM,_DD)).
nth(N, Prop, Gen, date(Y,M,D), month(MName)) :-
nth(N, Prop, Gen, date(Y,M,D), date(_YY,MM,_DD)),
monthname(MM, MName).
nth(N, Prop, Gen, date(Y,M,D), weekday(WDName)) :-
N #> 0,
nth(N, Prop, Gen, date(Y,M,D), date(YY,MM,DD)),
dayprop(date(YY,MM,DD), weekday(WDName)).
nth(N, Prop, Gen, date(Y,M,D), date(YY,MM,DD)) :-
N #> 0,
findnsols(N, date(Y2,M2,D2), gendates(Prop, Gen, date(Y,M,D), date(Y2,M2,D2)), Ds),
length(Ds, DL),
DL #>= N,
last(Ds, date(YY,MM,DD)), !.
%% Data generator. We have to enumerate some possibilities for the terms.
%% TODO: Currently the term enumerator is depth first recursive, which is
%% not a great way to do it from a data perspective. Better if random,
%% but random_labeling appears to be only available for CLPB and not for
%% CLPFD. Need to figure that out.
propgen(1, weekday(monday)).
propgen(2, weekday(tuesday)).
propgen(3, weekday(wednesday)).
propgen(4, weekday(thursday)).
propgen(5, weekday(friday)).
propgen(6, weekday(saturday)).
propgen(7, weekday(sunday)).
propgen(8, weekend).
propgen(9, long_weekend).
propgen(10, holiday).
propgen(11, day).
ngen(N, cardinal(C)) :- cardinal(C, N, _).
ngen(N, N) :- N #> 12, N #< 1000, label([N]).
gengen(1, before).
gengen(2, after).
yeargen(Y) :- Y #>= 2000, Y #< 2050, label([Y]).
monthgen(M) :- M #> 0, M #=< 12, label([M]).
daygen(D) :- D #> 0, D #=< 31, label([D]).
responsegen(1, date(_,_,_)).
responsegen(2, weekday(_)).
responsegen(3, weekend).
responsegen(4, long_weekend).
responsegen(5, holiday).
showprop(S, weekday(S)).
showprop(weekend, weekend).
showprop('long weekend', long_weekend).
showprop(holiday, holiday).
showprop(day, day).
showdate(S, date(Y,M,D)) :-
format(string(S), "~a-~a-~a", [Y,M,D]).
showresult(S, date(Y,M,D)) :-
format(string(S), "is on ~a-~a-~a", [Y,M,D]).
showresult(S, weekday(Name)) :-
format(string(S), "is on a ~a", [Name]).
showresult("is a weekend", weekend).
showresult("is a long weekend", long_weekend).
showresult("is a holiday", holiday).
display(nth(0, day, before, date(Y,M,D), date(Y,M,D))) :- true.
display(nth(0, day, before, RefDate, Out)) :-
RefDate \= Out,
showdate(DS, RefDate),
showresult(DSOut, Out),
format("~a ~a.~n", [DS, DSOut]).
display(nth(N, Prop, Gen, RefDate, Out)) :-
Prop \= Out,
cardinal(C, N, _),
showprop(S, Prop),
showdate(DS, RefDate),
showresult(DSOut, Out),
format("The ~a ~a ~a ~a ~a.~n", [C, S, Gen, DS, DSOut]).
gendata :-
N #>= 0, N #< 5, label([N]),
yeargen(Y),
monthgen(M),
daygen(D),
propgen(_, Prop),
gengen(_, Gen),
responsegen(_, R),
dif(Prop, R),
T = nth(N, Prop, Gen, date(Y,M,D), R),
call(T),
display(T),
fail.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment