It's a fair question. Probably the most honest one anyone can ask when they first hear about a skincare recommendation platform. ChatGPT knows what niacinamide does. It can read an ingredient list. It'll give you a ranked list of products in seconds. So why would you need anything else?
The short answer is: for a lot of things, you wouldn't. But for ingredient-level product matching against your actual skin profile, our testing showed LLMs breaking down in three specific ways.

What they actually get right
To be honest about this: LLMs are genuinely impressive at skincare.
Ask ChatGPT about an ingredient, and it'll give you an accurate explanation of what it does, how it interacts with other ingredients, and which skin types it suits. Ask it to build a routine, and it'll produce something structured, logical, and often reasonably sound.
With web search, they can pull real-time product information across e-commerce platforms, compare prices, surface user reviews, and identify what's currently available in your market. In conversation, they can adjust for sensitivity, flag common irritants, and refine their output based on your follow-ups.
Where they break
1. Ingredients are treated as checkboxes
Skincare ingredient lists are ordered by concentration down to roughly the 1% mark, which means the first several ingredients are present in substantially higher amounts, and list position directly reflects how much work an active ingredient is actually doing.
A retinol product with retinol at position 3 is delivering a meaningfully higher concentration than one with retinol at position 21, and concentration is what determines whether an active ingredient produces a clinical effect or just appears on the label.
LLMs largely treat this as a binary feature. Contains retinol: yes or no. This creates an illusion of precision: the output looks analytical and the reasoning sounds right, but it doesn't hold up against real-world formulation. Two products that should score very differently end up scoring the same.
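The gap between the two approaches can be sketched in a few lines. Everything below is illustrative: the decay exponent, the toy products, and the weighting function are invented for demonstration, not Crea8's actual parameters.

```python
# Hypothetical illustration: binary "contains ingredient" scoring versus
# position-weighted scoring. The decay exponent and products are made up.

def binary_score(ingredients, active):
    # What an LLM effectively does: contains retinol, yes or no.
    return 1.0 if active in ingredients else 0.0

def position_weighted_score(ingredients, active, decay=0.8):
    # Power-law decay: earlier list positions imply higher concentration.
    if active not in ingredients:
        return 0.0
    position = ingredients.index(active) + 1  # 1-indexed label position
    return position ** -decay

# Two retinol products: one lists retinol 3rd, the other 21st.
product_a = ["water", "glycerin", "retinol"] + ["filler"] * 18
product_b = ["water", "glycerin"] + ["filler"] * 18 + ["retinol"]

print(binary_score(product_a, "retinol"),
      binary_score(product_b, "retinol"))          # 1.0 1.0 -- identical
print(round(position_weighted_score(product_a, "retinol"), 3),
      round(position_weighted_score(product_b, "retinol"), 3))  # 0.415 0.088
```

Binary scoring rates the two products identically; position-weighted scoring separates them by almost 5x, which is the difference the paragraph above is describing.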
2. The score you get is not the score you'd get tomorrow
This is the one most people don't notice, because you'd have to run the same test twice to catch it. We did.
Across repeated tests on identical products and identical user profiles, scores varied by 10 to 19 points between sessions, frequently enough to change the winning product.
This happens because LLMs don't operate on a defined scoring system; every response is a fresh estimate, benchmarked against a reference point the model constructs in that moment. There is no stable floor, no stable ceiling, and no way for two outputs to be meaningfully compared to each other.
If a score can shift by 19 points on the same input, the score doesn't mean anything.
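The check described above is simple to run yourself: score the same product and profile several times and measure the spread. In this sketch, `simulated_llm_score` is only a stand-in for an LLM call, with seeded noise mimicking session-to-session drift; the function names and numbers are illustrative.

```python
# Repeatability check: same input, several runs, measure the score spread.
import random

def simulated_llm_score(product, profile, rng):
    # Stand-in for an LLM call: a fresh estimate on every invocation,
    # drifting around a base value rather than following fixed rules.
    return 70 + rng.randint(-10, 9)

def repeatability_spread(score_fn, product, profile, runs=10, seed=42):
    rng = random.Random(seed)
    scores = [score_fn(product, profile, rng) for _ in range(runs)]
    return max(scores) - min(scores)

spread = repeatability_spread(simulated_llm_score, "cleanser-x", "dry-skin")
# A deterministic rules engine yields spread == 0; any nonzero spread means
# scores from different sessions cannot be meaningfully compared.
print(spread)
```

A deterministic scorer passes this check trivially: feeding a constant-output function through `repeatability_spread` always returns 0.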
3. Recommendations drift toward what's popular, not what's right
LLMs are trained on the internet. That means blog posts, Reddit threads, e-commerce listings, and influencer content all feed into what surfaces as a recommendation. The most talked-about product gets recommended more. The most reviewed brand appears more credible.
Skincare isn't a popularity contest. A product with thousands of five-star reviews might be exactly wrong for someone with a compromised barrier, a fragrance sensitivity, or overlapping skin concerns. Without ingredient-first evaluation, you're not getting a clinical match; you're getting an aggregated opinion of what other people liked.

The moment it stopped being theoretical
During our testing, we ran a Shopping Search query for a simple cleanser recommendation. Alongside a list of standard over-the-counter cleansers, the output included Benzac AC and Adapalene. Both prescription-only acne treatments, with no prescription flag, no warning, and no differentiation from the OTC options.
Both require a prescription in India precisely because they carry risks that demand clinical supervision to manage: purging, skin barrier disruption, and, in Adapalene's case, contraindication during pregnancy.
The same pattern showed up in scoring. Across 25+ test profiles, three clinically distinct cases (a pregnancy-risk profile, a damaged-barrier profile, and a severe dry-skin case) received identical match scores, with no differentiation between them.
But the deeper problem wasn't just the flattened scores. LLMs only recognise the well-documented attributes of ingredients: the popular benefits, the commonly flagged risks. Less-documented properties get missed entirely, leading to over-scoring products that require caution and under-scoring ones that don't.
And because risk classification shifts with how the prompt happens to be interpreted in a given session, the same ingredient can be flagged in one run and ignored in the next.
What a structured engine actually does differently
Crea8 doesn't generate recommendations. It calculates them.
Our model runs on a classified ingredient database with pre-defined scoring rules, so there is no reinterpretation happening at the formulation level. And our system compounds over time; the more a user interacts with the platform, the sharper their recommendations get.
Our position weighting is built into a pre-calculated scoring table covering 1,000+ products and 40+ ingredient attributes, with power-law decay rates calibrated through pattern analysis of 500+ real formulations. Since ingredient lists don't disclose actual percentages, decay by list position is how the system models concentration.
The scoring logic itself is derived from 900+ structured rules built from CosIng, AAD, and NIH guidelines, alongside more than 30 peer-reviewed publications of established dermatological research.
The entire catalogue of over 1000 products across 10 categories runs through this formula simultaneously in under 2 seconds.
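Mechanically, a pre-calculated table is what makes that kind of batch sweep cheap: scoring a product becomes lookups and arithmetic rather than model inference. The sketch below shows the shape of such a table; the decay exponent, table size, and rule weights are assumptions for illustration, not Crea8's calibrated values.

```python
# Minimal sketch of position-weighted scoring from a pre-computed table.
# DECAY and the rule weights below are invented for illustration.

MAX_POSITIONS = 60
DECAY = 0.8  # assumed power-law exponent

# Computed once at build time: scoring is then pure lookup and arithmetic,
# with no per-request reinterpretation of the formulation.
POSITION_WEIGHT = {pos: pos ** -DECAY for pos in range(1, MAX_POSITIONS + 1)}

def attribute_score(ingredient_positions, attribute_rules):
    """Score one product for one attribute.

    ingredient_positions: {ingredient: 1-indexed position on the label}
    attribute_rules:      {ingredient: pre-defined rule weight}
    """
    score = 0.0
    for ingredient, rule_weight in attribute_rules.items():
        pos = ingredient_positions.get(ingredient)
        if pos is not None:
            score += rule_weight * POSITION_WEIGHT[pos]
    return score

# Retinol listed 3rd scores far higher than retinol listed 21st.
rules = {"retinol": 1.0}
early = attribute_score({"aqua": 1, "glycerin": 2, "retinol": 3}, rules)
late = attribute_score({"aqua": 1, "retinol": 21}, rules)
print(round(early, 3), round(late, 3))  # 0.415 0.088
```

Because the table is fixed, re-scoring the same product always returns the same number, and a catalogue-wide sweep is just this loop repeated per product.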
The same analysis via an LLM took a minimum of 13 seconds per product in our testing, and up to 6.5 minutes for recommendations across just 15 products. That gap isn't about speed. It's about architecture.

So, back to the question
Why can't you just ask ChatGPT?
For learning about skincare (what an ingredient does, how a routine fits together, what the difference between AHA and BHA is), you absolutely can, and it'll serve you well.
But for a decision, ChatGPT can tell you if a product seems okay. Crea8 tells you it's an 87% match, with quantified reasoning and clinical backing. That is not a feature gap. It is an architectural one: a database, a scoring engine, domain constraints, and validation against 30+ scientific publications, built specifically for this problem.