It's a fair question. Probably the most honest one anyone can ask when they first hear about a skincare recommendation platform. ChatGPT knows what niacinamide does. It can read an ingredient list. It'll give you a ranked list of products in seconds. So why would you need anything else?
The short answer is: for a lot of things, you wouldn't. But for ingredient-level product matching against your actual skin profile, our testing showed LLMs breaking down in three specific ways.

What they actually get right
To be honest about this: LLMs are genuinely impressive at skincare.
Ask ChatGPT about an ingredient, and it'll give you an accurate explanation of what it does, how it interacts with other ingredients, and which skin types it suits. Ask it to build a routine, and it'll produce something structured, logical, and often reasonably sound.
With web search, they can pull real-time product information across e-commerce platforms, compare prices, surface user reviews, and identify what's currently available in your market. In conversation, they can adjust for sensitivity, flag common irritants, and refine their output based on your follow-ups.
Where they break
1. Ingredients are treated as checkboxes
Skincare ingredient lists are ordered by concentration down to roughly the 1% mark, which means the first several ingredients are present in substantially higher amounts, and list position directly reflects how much work an active ingredient is actually doing.
A retinol product with retinol at position 3 is delivering a meaningfully higher concentration than one with retinol at position 21, and concentration is what determines whether an active ingredient produces a clinical effect or just appears on the label.
LLMs largely treat this as a binary feature. Contains retinol: yes or no. This creates an illusion of precision: the output looks analytical and the reasoning sounds right, but it doesn't hold up against real-world formulation. Two products that should score very differently end up scoring the same.
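The gap between the two approaches can be sketched in a few lines. Everything below is illustrative: the decay exponent, the toy products, and the weighting function are invented for demonstration, not Crea8's actual parameters.

```python
# Hypothetical illustration: binary "contains ingredient" scoring versus
# position-weighted scoring. The decay exponent and products are made up.

def binary_score(ingredients, active):
    # What an LLM effectively does: contains retinol, yes or no.
    return 1.0 if active in ingredients else 0.0

def position_weighted_score(ingredients, active, decay=0.8):
    # Power-law decay: earlier list positions imply higher concentration.
    if active not in ingredients:
        return 0.0
    position = ingredients.index(active) + 1  # 1-indexed label position
    return position ** -decay

# Two retinol products: one lists retinol 3rd, the other 21st.
product_a = ["water", "glycerin", "retinol"] + ["filler"] * 18
product_b = ["water", "glycerin"] + ["filler"] * 18 + ["retinol"]

print(binary_score(product_a, "retinol"),
      binary_score(product_b, "retinol"))          # 1.0 1.0 -- identical
print(round(position_weighted_score(product_a, "retinol"), 3),
      round(position_weighted_score(product_b, "retinol"), 3))  # 0.415 0.088
```

Binary scoring rates the two products identically; position-weighted scoring separates them by almost 5x, which is the difference the paragraph above is describing.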
2. The score you get is not the score you'd get tomorrow
This is the one most people don't notice, because you'd have to run the same test twice to catch it. We did.
Across repeated tests on identical products and identical user profiles, scores varied by 10 to 19 points between sessions, frequently enough to change the winning product.
This happens because LLMs don't operate on a defined scoring system; every response is a fresh estimate, benchmarked against a reference point the model constructs in that moment. There is no stable floor, no stable ceiling, and no way for two outputs to be meaningfully compared to each other.
If a score can shift by 19 points on the same input, the score doesn't mean anything.
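The check described above is simple to run yourself: score the same product and profile several times and measure the spread. In this sketch, `simulated_llm_score` is only a stand-in for an LLM call, with seeded noise mimicking session-to-session drift; the function names and numbers are illustrative.

```python
# Repeatability check: same input, several runs, measure the score spread.
import random

def simulated_llm_score(product, profile, rng):
    # Stand-in for an LLM call: a fresh estimate on every invocation,
    # drifting around a base value rather than following fixed rules.
    return 70 + rng.randint(-10, 9)

def repeatability_spread(score_fn, product, profile, runs=10, seed=42):
    rng = random.Random(seed)
    scores = [score_fn(product, profile, rng) for _ in range(runs)]
    return max(scores) - min(scores)

spread = repeatability_spread(simulated_llm_score, "cleanser-x", "dry-skin")
# A deterministic rules engine yields spread == 0; any nonzero spread means
# scores from different sessions cannot be meaningfully compared.
print(spread)
```

A deterministic scorer passes this check trivially: feeding a constant-output function through `repeatability_spread` always returns 0.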
3. Recommendations drift toward what's popular, not what's right
LLMs are trained on the internet. That means blog posts, Reddit threads, e-commerce listings, and influencer content all feed into what surfaces as a recommendation. The most talked-about product gets recommended more. The most reviewed brand appears more credible.
Skincare isn't a popularity contest. A product with thousands of five-star reviews might be exactly wrong for someone with a compromised barrier, a fragrance sensitivity, or overlapping skin concerns. Without ingredient-first evaluation, you're not getting a clinical match; you're getting an aggregated opinion of what other people liked.

The moment it stopped being theoretical
During our testing, we ran a Shopping Search query for a simple cleanser recommendation. Alongside a list of standard over-the-counter cleansers, the output included Benzac AC and Adapalene. Both prescription-only acne treatments, with no prescription flag, no warning, and no differentiation from the OTC options.
Both require a prescription in India precisely because they carry risks that demand clinical supervision to manage: purging, skin barrier disruption, and, in Adapalene's case, contraindication during pregnancy.
The same pattern showed up in scoring. Across 25+ test profiles, three clinically distinct cases (a pregnancy-risk profile, a damaged-barrier profile, and a severe dry-skin case) received identical match scores, with no differentiation between them.
But the deeper problem wasn't just the flattened scores. LLMs only recognise the well-documented attributes of ingredients: the popular benefits, the commonly flagged risks. Less-documented properties get missed entirely, leading to over-scoring products that require caution and under-scoring ones that don't.
And because risk classification shifts with how the prompt happens to be interpreted in a given session, the same ingredient can be flagged in one run and ignored in the next.
What a structured engine actually does differently
Crea8 doesn't generate recommendations. It calculates them.
Our model runs on a classified ingredient database with pre-defined scoring rules, so there is no reinterpretation happening at the formulation level. And our system compounds over time; the more a user interacts with the platform, the sharper their recommendations get.
Our position weighting is built into a pre-calculated scoring table covering 1,000+ products and 40+ ingredient attributes, with power-law decay rates calibrated through pattern analysis of 500+ real formulations. Since ingredient lists don't disclose actual percentages, decay by list position is how the system models concentration.
The scoring logic itself is derived from 900+ structured rules built from CosIng, AAD, and NIH guidelines, alongside more than 30 peer-reviewed publications of established dermatological research.
The entire catalogue of over 1000 products across 10 categories runs through this formula simultaneously in under 2 seconds.
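Mechanically, a pre-calculated table is what makes that kind of batch sweep cheap: scoring a product becomes lookups and arithmetic rather than model inference. The sketch below shows the shape of such a table; the decay exponent, table size, and rule weights are assumptions for illustration, not Crea8's calibrated values.

```python
# Minimal sketch of position-weighted scoring from a pre-computed table.
# DECAY and the rule weights below are invented for illustration.

MAX_POSITIONS = 60
DECAY = 0.8  # assumed power-law exponent

# Computed once at build time: scoring is then pure lookup and arithmetic,
# with no per-request reinterpretation of the formulation.
POSITION_WEIGHT = {pos: pos ** -DECAY for pos in range(1, MAX_POSITIONS + 1)}

def attribute_score(ingredient_positions, attribute_rules):
    """Score one product for one attribute.

    ingredient_positions: {ingredient: 1-indexed position on the label}
    attribute_rules:      {ingredient: pre-defined rule weight}
    """
    score = 0.0
    for ingredient, rule_weight in attribute_rules.items():
        pos = ingredient_positions.get(ingredient)
        if pos is not None:
            score += rule_weight * POSITION_WEIGHT[pos]
    return score

# Retinol listed 3rd scores far higher than retinol listed 21st.
rules = {"retinol": 1.0}
early = attribute_score({"aqua": 1, "glycerin": 2, "retinol": 3}, rules)
late = attribute_score({"aqua": 1, "retinol": 21}, rules)
print(round(early, 3), round(late, 3))  # 0.415 0.088
```

Because the table is fixed, re-scoring the same product always returns the same number, and a catalogue-wide sweep is just this loop repeated per product.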
The same analysis via an LLM took a minimum of 13 seconds per product in our testing, and up to 6.5 minutes for recommendations across just 15 products. That gap isn't about speed. It's about architecture.

So, back to the question
Why can't you just ask ChatGPT?
For learning about skincare (what an ingredient does, how a routine fits together, what the difference between AHA and BHA is), you absolutely can, and it'll serve you well.
But for a decision, ChatGPT can tell you if a product seems okay. Crea8 tells you it's an 87% match, with quantified reasoning and clinical backing. That is not a feature gap. It is an architectural one: a database, a scoring engine, domain constraints, and validation against 30+ scientific publications, built specifically for this problem.