- Image aesthetic analysis has undergone significant change from the 2010s to the present.
- Approaches to aesthetic analysis evolve over time, following the trends of each era.
- Data scarcity, the lack of objective metrics, and innate variance in taste make the problem difficult.
- Anime aesthetics is a specialized branch of general aesthetics.
Early methods rely on manually designed features (e.g. color, texture, composition):
- photographic rules? (e.g. rule of thirds) -> Lo et al. (2013), SVM, 89% acc on CUHK
- statistical methods? (foreground-background, color harmony?) -> Yang et al. (2015), SVR, 84.83% acc
- local and global features? -> Gao et al. (2015), Bayesian Network, 72.7% acc on memorability dataset
- content-based methods? -> Nishiyama et al. (2012), color harmony patterns, 77.6% acc on DPChallenge
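These early pipelines share the same shape: compute a fixed feature vector, then fit a shallow classifier. A minimal sketch of that pattern (the color statistics, rule-of-thirds proxy, and scikit-learn SVM below are illustrative, not any specific paper's recipe):

```python
# Illustrative hand-crafted-feature pipeline: simple color/composition statistics
# fed into an SVM, mirroring the era's approach (not any specific paper's features).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def handcrafted_features(img: np.ndarray) -> np.ndarray:
    """img: HxWx3 uint8 RGB. Returns a small fixed-length feature vector."""
    img = img.astype(np.float32) / 255.0
    h, w, _ = img.shape
    # Global color statistics (mean/std per channel) -- a crude color-harmony proxy.
    color_stats = np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])
    # Rule-of-thirds proxy: brightness at the four third-line intersections,
    # relative to overall brightness.
    gray = img.mean(axis=2)
    thirds = [gray[h // 3, w // 3], gray[h // 3, 2 * w // 3],
              gray[2 * h // 3, w // 3], gray[2 * h // 3, 2 * w // 3]]
    composition = np.array(thirds) - gray.mean()
    return np.concatenate([color_stats, composition])

# Hypothetical training data: random images and binary aesthetic labels.
images = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(20)]
labels = np.random.randint(0, 2, size=20)

X = np.stack([handcrafted_features(im) for im in images])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
print(clf.predict(X[:3]))
```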
Findings from these?
- Some human intuitions about aesthetics can be intrinsically modelled (using statistical or interpretable policies).
- Hand-crafted features have limited expressiveness (think hand-crafted features vs. LeNet on MNIST).
- Manually selected features don't cover the last bits of human taste variance.
- Eval datasets are messy; there is no de-facto evaluation standard like ImageNet.
- OA (overall accuracy) is the most widely used metric, despite being skewed by class imbalance.
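A quick toy example of the imbalance problem noted above: on a skewed eval set, a degenerate predictor scores well on overall accuracy while being useless (the labels below are made up):

```python
# Toy illustration: on an imbalanced eval set, a trivial "always high-quality"
# predictor gets high overall accuracy but chance-level balanced accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([1] * 90 + [0] * 10)   # 90% "high aesthetic", 10% "low"
y_pred = np.ones_like(y_true)            # degenerate predictor: always says "high"

print(accuracy_score(y_true, y_pred))            # 0.90 -- looks strong
print(balanced_accuracy_score(y_true, y_pred))   # 0.50 -- actually chance level
```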
More interesting methods started developing after the introduction of CNNs and more powerful compute:
- basic deep network? -> Zhou et al. (2015), deep network on 56 features, 82.1% acc on 28k images
- multi-task learning? -> Kao et al. (2016), joint training on aesthetic and semantic prediction, MTCNN, 91.3% acc on AVA
- content-adaptive? -> Kong et al. (2016), AlexNet co-conditioned with attributes, 77.33% acc on AADB
emerging (but not yet there):
- multimodal inputs (MSCAN: self attention on visual features, cross attention between image <--> additional attributes) ^ the attention mechanism is a good idea
- semi-supervised / active learning (SDAL: training with partially labelled data; active learning via iterative labelling; eye-tracking for data collection; probabilistic model with expert ratings mixed in) ^ issue 1: still data-capped -> contrastive pre-training ^ issue 2: CNNs don't generalize well -> universal representation learning
Early trends of popular methods emerge, but they are more task-specific and less widely adopted. Deep learning-based methods achieve better accuracy than hand-crafted ones.
transformer:
- Although attention already existed in the CNN/RNN era, it was mostly auxiliary
- transformers made it the primary mechanism, replacing recurrence and convolutions
- better efficiency with scaled dot-product attention / multi-head attention
- designed to be fully parallelizable, so it is better utilized by modern hardware (i.e. GPUs) ^ (no local connectivity; uses global self-attention instead)
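A minimal sketch of the scaled dot-product attention mentioned above, in plain NumPy (shapes are illustrative):

```python
# Minimal scaled dot-product attention, as in "Attention Is All You Need":
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)        # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V                                    # (..., seq_q, d_v)

# Every query attends to every key (global connectivity), and the matrix products
# parallelize well on GPUs -- the two properties noted above.
Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 16)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```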
CLIP (Contrastive Language-Image Pre-training)
- embeds text and images into a shared embedding space.
- text encoder is a transformer -> replaces RNNs
- image encoder is a vision transformer -> replaces CNNs
- contrastive: addresses data scarcity by finding ways to learn from noisy data, basically training on everything off the web -> data collection is now scalable
- language-image: extends the idea of multimodal learning through direct text-image alignment (task-specific -> generalized zero-shot)
- pre-training: offers base models with better transferability to unseen data than previous architectures like CNNs / RNNs -> e.g. anime
key things to consider:
- model arch: greatly improved over the years, offers better generalizability and accuracy
- compute: increased over time, and is better utilized by newer model architectures (i.e. transformers)
- fine-tuning data: still gated and scarce in areas requiring domain / expert knowledge ^ which is especially true for anime
LAION Aesthetic ("clip aesthetic"):
- idea: pretrained CLIP models offer useful information for aesthetic rating.
- states that a linear layer on CLIP embeddings produces "visually most appealing" rankings, but ended up using an MLP head anyway
  -> their evaluation principle is "visually appealing", so still no commonly recognized eval set.
- trained on human-scored images:
  - AVA Dataset: crowd-sourced photo-rating dataset (250k filtered photos) (Adobe / UCI)
  - Simulacra Aesthetic Captions (SAC): crowd-sourced ratings of synthetic images (176k, AI-generated)
  - LAION-Logos: private samples for text-in-image aesthetics (15k, text focused) -> in-house datasets are around this size
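The recipe amounts to a frozen CLIP image embedding feeding a small regression head. A sketch of that idea in PyTorch (the embedding dimension, layer sizes, and score range are illustrative, not the exact released LAION head):

```python
# Sketch of the LAION-style recipe: a small MLP regressor on top of frozen CLIP
# image embeddings. In practice the 768-d vectors would come from a CLIP ViT;
# here a random tensor stands in so the sketch runs without downloads.
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),          # scalar aesthetic score
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # Regressed against human ratings (e.g. AVA's mean scores).
        return self.mlp(clip_embedding).squeeze(-1)

# Stand-in for precomputed, L2-normalized CLIP image embeddings of a batch.
embeddings = nn.functional.normalize(torch.randn(8, 768), dim=-1)
targets = torch.rand(8) * 9 + 1          # fake human scores in [1, 10]

head = AestheticHead()
loss = nn.functional.mse_loss(head(embeddings), targets)
loss.backward()
print(float(loss))
```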
(notable mentions on CLIP):
zero-shot prompting for aesthetic analysis?
- Wang et al. (2022), use prompts ("good/bad photo") for aesthetic classification (see the sketch after this list)
automate the process of zero-shot prompting?
- IAA-LQ (Image Aesthetics Assessment via Learnable Queries) (2023)
- idea: keep the pretrained ViT frozen, insert a small transformer module to "probe" ViT features for aesthetic information
- automates the process of quality prompting
separate theme / composition with aesthetic?
- TC-IAA (Theme and Composition-aware IAA) (2025)
- idea: predict the image's theme and composition first, then predict aesthetic based on that
- (similar to the very early "content-adaptive" approaches, but now using CLIP!)
- MSE ~0.47 on AVA
issue: they improve on architecture, but are still data-capped.
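For the zero-shot prompting item above, the sketch below shows the general recipe using Hugging Face's CLIP implementation; the prompt wording, checkpoint, and input file are assumptions, not Wang et al.'s exact setup:

```python
# Sketch of zero-shot aesthetic classification with CLIP prompts, in the spirit
# of Wang et al. (2022). Prompt wording, checkpoint, and image path are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a good photo", "a bad photo"]          # aesthetic class prompts
image = Image.open("example.jpg")                  # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

print({p: float(v) for p, v in zip(prompts, probs[0])})
```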
Pick-a-Pic (stability.ai, 2023):
- collected 500k human preferences for ai-generated images, on 35k prompts
- used to train reward models (PickScore) for both Reinforcement Learning and automated evaluation metrics.
HPS v2 (CUHK, 2024):
- collected 798k human preferences for ai-generated images using 9 models (not just the single stable diffusion model)
- prompts selected from DiffusionDB and COCO, then cleaned using ChatGPT to remove e.g. excessive artist names / contradictory style elements
ImageReward (BUPT, 2023-25):
- similar idea, listed here for completeness (didn't look into it)
- has an update recently: THUDM/VisionReward
VILA (Google, 2023):
- harvests implicit aesthetic data via sentiment analysis
- collects image-comment pairs, from Flickr and DPChallenge, as weak aesthetic guidance
idea: what if we just use social media?
- artworks have like counts, which are naturally collected during scraping
- these should offer a more direct (and stronger) aesthetic signal than VILA-style comment-based guidance
Despite advancements in general (especially photographic) aesthetics, anime aesthetics remains an under-researched area, though some general principles carry over:
Waifu-diffusion (WD waifu score) (2022):
- a linear binary classifier (0-1)
- uses an OpenAI clip-vit-b-32 backbone + linear layer
aesthetic shadow v2 (2024):
- a 1-billion-parameter ViT binary classifier (0-1)
- (likely a fully fine-tuned ViT-Huge)
- has a strong bias toward certain styles; likely not much actual training data
SigLIP aesthetics (2024):
- an MLP regressor (approx. 3-15 range)
- uses a SigLIP backbone + MLP head (similar to the original LAION CLIP aesthetic)
- performs a bit better on anime images, but has its own limitations and inaccuracies
- biased towards high-contrast / high-saturation images
(Many other models follow similar principles but are omitted here for brevity)
(inhouse models):
aes-iter3 (2025):
- 800M parameter Siglip2 regressor (0-5)
- Uses SOTA pretrained model + higher resolution (512)
- Data collected systematically, using stratified groups & content-balanced dataset
- offers two versions: aesthetic and anatomy
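A sketch of what "stratified groups & content-balanced dataset" can look like in practice: cap the number of images drawn from each content stratum so no single style dominates (the tags, counts, and helper below are made up for illustration):

```python
# Illustrative content-balanced sampling: draw the same number of images from
# each content/style stratum so no group dominates. Tags and counts are made up.
import random
from collections import defaultdict

def content_balanced_sample(items, per_group: int, seed: int = 0):
    """items: list of (image_path, content_tag). Returns a balanced subset."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for path, tag in items:
        groups[tag].append(path)
    sample = []
    for tag, paths in groups.items():
        rng.shuffle(paths)
        sample += [(p, tag) for p in paths[:per_group]]   # cap each stratum
    return sample

pool = [(f"img_{i}.png", random.choice(["portrait", "landscape", "chibi", "mecha"]))
        for i in range(1000)]
balanced = content_balanced_sample(pool, per_group=50)
print(len(balanced))  # at most 4 * 50 images, evenly spread across tags
```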
what if we use a vision language model (VLM)?
- BAID
- UNIAA-LLaVA
- Q-Align / OneAlign
what if we combine the results of multiple aesthetic predictors?
- SkyTNT Aesthetic (a combination of WD and something else?)
- pixiv-compound-score (a weighted combination of pixiv reactions / other scores)
what if we use semi-supervised learning?
- Semilearn (though not aesthetics-specific)
- weakm-v2 / weakm-v3 series
what if we use social network feedback?
- pixiv-compound-score, twitter-logfav series, twitter-cbrtxlog: applying data transformations to numerical engagement data (see the sketch below)
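A sketch of the kind of numeric transforms those names suggest (log of favourites; cube root times log), which turn heavy-tailed like counts into smoother regression targets; the exact formulas behind those models are assumptions inferred from the names:

```python
# Illustrative transforms of heavy-tailed engagement counts into smoother
# regression targets. The exact formulas behind "logfav" / "cbrtxlog" are
# assumptions inferred from the names.
import numpy as np

favs = np.array([0, 3, 12, 85, 1_200, 45_000], dtype=np.float64)  # raw like counts

logfav = np.log1p(favs)                    # log(1 + favs): compresses the tail
cbrtxlog = np.cbrt(favs) * np.log1p(favs)  # cube root x log: keeps more spread at the top

for raw, a, b in zip(favs, logfav, cbrtxlog):
    print(f"{int(raw):>6}  log1p={a:6.2f}  cbrt*log={b:8.2f}")
```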
(to be updated)