@trojblue · Created March 13, 2025

Redo: Aesthetic Tech Talk

Overview

  • Image aesthetic analysis has changed significantly from the 2010s to the present.
  • Approaches to aesthetic analysis evolve over time, following the trends of each era.
  • Scarce data, a lack of objective metrics, and innate variance in taste make it a hard problem.
  • Anime aesthetics is a specialized branch of general aesthetics.

Before Deep Learning

Early methods relied on manually designed features (e.g. color, texture, composition); a sketch of the recipe follows the list:

  • photographic rules? (e.g. rule of thirds) -> Lo et al. (2013), SVM, 89% acc on CUHK
  • statistical methods? (foreground-background, color harmony?) -> Yang et al. (2015), SVR, 84.83% acc
  • local and global features? -> Gao et al. (2015), Bayesian Network, 72.7% acc on memorability dataset
  • content-based methods? -> Nishiyama et al. (2012), color harmony patterns, 77.6% acc on DPChallenge
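
To make the recipe concrete, here is a minimal sketch of the hand-crafted-features era: extract a few interpretable features, then fit a classic classifier. The features below (rule-of-thirds brightness, color statistics) are illustrative stand-ins, not any of the above papers' exact feature sets.

```python
import numpy as np
from sklearn.svm import SVC

def handcrafted_features(img: np.ndarray) -> np.ndarray:
    """img: HxWx3 uint8 RGB array; returns a small illustrative feature vector."""
    h, w, _ = img.shape
    gray = img.mean(axis=2)
    # Rule of thirds: mean brightness around the four third-line intersections.
    thirds = [
        gray[int(h * y) - 8 : int(h * y) + 8, int(w * x) - 8 : int(w * x) + 8].mean()
        for y in (1 / 3, 2 / 3)
        for x in (1 / 3, 2 / 3)
    ]
    # Global color statistics as a crude stand-in for color-harmony features.
    color_mean = img.reshape(-1, 3).mean(axis=0) / 255.0
    color_std = img.reshape(-1, 3).std(axis=0) / 255.0
    return np.concatenate([np.array(thirds) / 255.0, color_mean, color_std])

# X: (n_images, n_features) stacked feature vectors; y: binary labels (0 = low, 1 = high)
# clf = SVC(kernel="rbf").fit(X, y)
```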

Findings from these?

  1. Some human intuitions about aesthetics can be modelled intrinsically (using statistical or interpretable policies).
  2. Hand-crafted features have limited expressiveness (think hand-crafted features vs. LeNet on MNIST).
  3. Manually selected features don't cover the last bits of human taste variance.
  4. Eval datasets are messy; there is no de-facto evaluation standard like ImageNet.
  5. OA (overall accuracy) is the most widely used metric, despite being sensitive to class imbalance.

Deep Learning Methods

More interesting methods started developing after the introduction of CNNs and more powerful compute:

  • basic deep network? -> Zhou et al. (2015), deep network on 56 features, 82.1% acc on 28k images
  • multi-task learning? -> Kao et al. (2016), joint training on aesthetic and semantic prediction, MTCNN, 91.3% acc on AVA
  • content-adaptive? -> Kong et al. (2016), AlexNet co-conditioned with attributes, 77.33% acc on AADB

emerging (but not yet there):

  • multimodal inputs (MSCAN: self-attention on visual features, cross-attention between image and additional attributes) ^ attention mechanisms are a good idea
  • semi-supervised / active learning (SDAL: training with partially labelled data; active learning via iterative labelling; eye-tracking for data collection; probabilistic model with expert ratings mixed in) ^ issue 1: still data-capped -> contrastive pre-training ^ issue 2: CNNs don't generalize well -> universal representation learning

Early trends toward popular methods emerge, but they are more task-specific and less widely adopted. Deep learning-based methods achieve better accuracy than hand-crafted ones.

Modern Deep Learning Methods

transformer:

  • Although attention was already around in the CNN/RNN era, it was mostly auxiliary
  • transformers made it the primary mechanism, replacing recurrence and convolutions
  • better efficiency with scaled dot-product attention / multi-head attention (see the sketch below)
  • designed to be fully parallelizable, so it is better optimized for modern hardware (i.e., GPUs) ^ (no local connectivity; uses global self-attention instead)
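
For reference, a minimal PyTorch sketch of scaled dot-product attention, the core operation named above (single head, no masking or learned projections):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Fully parallel over the sequence:
    no recurrence; every token attends to every other token (global)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                             # weighted mix of values
```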

CLIP (Contrastive Language-Image Pre-training)

  • embeds text and images into a shared embedding space.

  • text encoder is a transformer -> replaces RNN

  • image encoder is a vision transformer -> replaces CNN

  • contrastive: addresses data scarcity by learning from noisy web data, basically training on everything scraped off the web -> data collection is now scalable (loss sketch after this list)

  • language-image: extends the idea of multimodal learning through direct text-image alignment. (task-specific -> generalized zero-shot)

  • pre-training: offers base models that have better transferability than previous architectures, like CNN / RNN, on unseen data -> eg. anime
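
A sketch of the symmetric contrastive (InfoNCE) objective behind CLIP, computed over a batch of matched image/text pairs; names and the temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim); row i of each is a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal (positives); all other entries are negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```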

key things to consider:

  • model arch: greatly improved over the years, offering better generalizability and accuracy
  • compute: increased over time, and is better utilized by modern architectures (i.e., transformers)
  • fine-tuning data: still gated and scarce in areas requiring domain / expert knowledge ^ which is especially the case for anime

Modern Aesthetic Predictors

LAION Aesthetic ("clip aesthetic"):

  • idea: pretrained CLIP models offer useful information for aesthetic rating.

  • claims a linear layer on CLIP embeddings produces "visually most appealing" rankings, but ended up using an MLP head anyway (sketch below)
    -> their evaluation principle is "visually appealing", so still no commonly recognized eval set.

  • trained on human-scored images:
      • AVA Dataset: crowd-sourced photo-rating dataset (250k filtered photos) (Adobe / UCI)
      • Simulacra Aesthetic Captions (SAC): crowd-sourced ratings of synthetic images (176k, AI-generated)
      • LAION-Logos: private samples for text-in-image aesthetics (15k, text-focused) -> in-house datasets are around this size
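
A minimal sketch of that recipe: a small MLP regressor on top of frozen CLIP image embeddings, trained against human mean scores. Layer sizes and dropout are assumptions, not LAION's exact configuration:

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    def __init__(self, emb_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),  # scalar aesthetic score
        )

    def forward(self, clip_emb):  # clip_emb: (batch, emb_dim), from a frozen CLIP
        return self.mlp(clip_emb).squeeze(-1)

# Train with MSE against human mean scores (e.g. AVA's 1-10 ratings):
# loss = nn.functional.mse_loss(head(emb), scores)
```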

(notable mentions on CLIP):

zero-shot prompting for aesthetic analysis?

  • Wang et al. (2022): use ("good/bad photo") prompts for aesthetic classification (sketch below)
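
A sketch of the trick using the HuggingFace CLIP API; the prompt wording here is illustrative, not the paper's exact templates:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
prompts = ["a good photo", "a bad photo"]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image      # (1, 2) image-text similarity logits
p_good = logits.softmax(dim=-1)[0, 0].item()   # probability mass on the "good" prompt
print(f"zero-shot aesthetic score: {p_good:.3f}")
```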

automate the process of zero-shot prompting?

  • IAA-LQ (Image Aesthetics Assessment via Learnable Queries) (2023)
  • idea: keep the pretrained ViT frozen, insert a small transformer module to "probe" ViT features for aesthetic information (sketch below)
  • automates the process of quality prompting
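
A sketch of the learnable-query idea: a few trainable query tokens cross-attend into frozen ViT patch features, and a linear head reads off a score. Query count, depth, and pooling are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LearnableQueryProbe(nn.Module):
    def __init__(self, dim=768, n_queries=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, vit_feats):  # vit_feats: (batch, n_patches, dim), frozen ViT output
        q = self.queries.expand(vit_feats.size(0), -1, -1)
        probed, _ = self.cross_attn(q, vit_feats, vit_feats)  # queries read ViT features
        return self.head(probed.mean(dim=1)).squeeze(-1)      # pooled scalar score
```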

separate theme / composition from aesthetics?

  • TC-IAA (Theme and Composition-aware IAA) (2025)
  • idea: predict the image's theme and composition first, then predict aesthetics conditioned on those (sketch below)
  • (similar to the very early "content-adaptive" approaches, but now using CLIP!)
  • MSE ~0.47 on AVA
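
A sketch of the two-stage conditioning idea, with illustrative class counts (the paper's actual heads and taxonomy may differ):

```python
import torch
import torch.nn as nn

class TwoStageIAA(nn.Module):
    def __init__(self, dim=768, n_themes=16, n_comps=8):
        super().__init__()
        self.theme_head = nn.Linear(dim, n_themes)
        self.comp_head = nn.Linear(dim, n_comps)
        # The aesthetic head sees the image feature plus both auxiliary predictions.
        self.aes_head = nn.Linear(dim + n_themes + n_comps, 1)

    def forward(self, feat):  # feat: (batch, dim) image feature, e.g. from CLIP
        theme = self.theme_head(feat).softmax(-1)
        comp = self.comp_head(feat).softmax(-1)
        return self.aes_head(torch.cat([feat, theme, comp], dim=-1)).squeeze(-1)
```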

issue: they improve on architecture, but are still data-capped.

Scaling up Aesthetic Data Collection

Pick-a-Pic (stability.ai, 2023):

  • collected 500k human preferences for AI-generated images, over 35k prompts
  • used to train reward models (PickScore) for both reinforcement learning and automated evaluation metrics (loss sketch below)
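
The reward-model training boils down to a pairwise (Bradley-Terry-style) loss; a sketch, where `score_model` is a placeholder for a CLIP-like scorer that rates an image given its prompt:

```python
import torch
import torch.nn.functional as F

def preference_loss(score_model, prompt, img_preferred, img_rejected):
    s_pref = score_model(prompt, img_preferred)  # scalar score per example
    s_rej = score_model(prompt, img_rejected)
    # -log sigmoid(s_pref - s_rej): minimized when the preferred image scores higher.
    return -F.logsigmoid(s_pref - s_rej).mean()
```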

HPS v2 (CUHK, 2024):

  • collected 798k human preferences for AI-generated images from 9 models (not just a single Stable Diffusion model)
  • prompts selected from DiffusionDB and COCO, then cleaned with ChatGPT to remove e.g. excessive artist names / contradictory style elements

ImageReward (BUPT, 2023-25):

  • similar idea, listed here for completeness (didn't look into it)
  • has an update recently: THUDM/VisionReward

VILA (google, 2023):

  • harvests implicit aesthetic data via sentiment analysis
  • collects image-comment pairs from Flickr and DPChallenge as weak aesthetic guidance (sketch below)
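
A sketch of mining weak labels from comments in this spirit: run off-the-shelf sentiment analysis over an image's comments and aggregate. The pipeline choice is an assumption; VILA itself pretrains on the raw comment text rather than on sentiment scores:

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default English sentiment model

def weak_label(comments: list[str]) -> float:
    """Fraction of positive-sentiment comments; a noisy 0-1 aesthetic signal.
    Assumes the comment list is non-empty."""
    results = sentiment(comments)
    return sum(r["label"] == "POSITIVE" for r in results) / len(results)
```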

idea: what if we just use social media?

  • artworks have like counts, which are collected naturally during scraping
  • they should offer a more direct (and stronger) aesthetic signal than e.g. VILA's comments (sketch below)
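
A sketch of one reasonable way to turn like counts into a training signal; the log scaling and follower normalization are assumptions, not a description of any specific pipeline:

```python
import math

def like_score(likes: int, followers: int) -> float:
    """Log-scaled likes, normalized by audience size to reduce popularity bias."""
    return math.log1p(likes) / math.log1p(max(followers, 1))
```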

...What about Anime?

Despite advances in general aesthetics (especially photographic aesthetics), anime aesthetics remains an under-researched area, though some general principles carry over:

Waifu-diffusion (WD waifu score) (2022):

  • a linear binary classifier (0-1)
  • uses the OpenAI clip-vit-b-32 backbone + a linear layer

aesthetic shadow v2 (2024):

  • 1-billion-parameter ViT binary classifier (0-1)
  • (likely a fully fine-tuned ViT-Huge)
  • has a strong bias toward certain styles; likely not much actual training data

SigLIP aesthetics (2024):

  • an MLP regressor (range approx. 3-15)
  • uses a SigLIP backbone + MLP head (similar to the original LAION CLIP aesthetic)
  • performs a bit better on anime images, but has its limitations and inaccuracies
  • biased towards high-contrast / high-saturation images

(Many other models follow similar principles but are omitted here for brevity)

(in-house models):

aes-iter3 (2025):

  • 800M-parameter SigLIP2 regressor (0-5)
  • uses a SOTA pretrained model + higher resolution (512)
  • data collected systematically, using stratified groups & a content-balanced dataset
  • offers two versions: aesthetic and anatomy

More Ideas to Explore

what if we use a vision language model (VLM)?

  • BAID
  • UNIAA-LLaVA
  • Q-Align / OneAlign

what if we combine multiple aesthetic predictors' results?

  • SkyTNT Aesthetic (a combination of the WD score and something else?)
  • pixiv-compound-score (a weighted combination of pixiv reactions / other scores)

what if we use semi-supervised learning?

  • Semilearn (though not aesthetics-specific)
  • weakm-v2 / weakm-v3 series

what if we use social network feedback?

  • pixiv-compound-score, twitter-logfav series
  • twitter-cbrtxlog: applies numeric transforms to raw engagement counts (sketch below)
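
A sketch of the kind of transform the twitter-cbrtxlog name suggests (cube root times log of favorite counts, to compress the heavy-tailed distribution); the exact formula is an assumption:

```python
import math

def cbrt_x_log(favs: int) -> float:
    """Compress the heavy-tailed favorite-count distribution before training."""
    return (favs ** (1 / 3)) * math.log1p(favs)
```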

References / journals

(to be updated)

  • Image Aesthetic Assessment: A Comparative Study of Hand-Crafted & Deep Learning Models
  • IAACS: Image aesthetic assessment through color composition and space formation
