"Data modeling" is one of those foundational terms that everyone in data uses and few stop to define precisely. This is a plain-language explainer for data leaders and practitioners: what data modeling actually is, the levels and traditions it spans, where it happens in a modern analytics stack, and why — in the AI era — modeling decisions made once in a semantic layer are what decide whether you can trust an agent's answer.

TL;DR

Data modeling is the process of defining how your data is structured, related, and accessed — the entities your business cares about, their attributes, the relationships between them, and, in analytics, the metrics and join paths that turn raw tables into business-ready numbers. It runs across three levels (conceptual, logical, physical) and two traditions (entity-relationship for transactional systems, dimensional for analytics). In a modern stack it happens twice: transformation tools like dbt model the data in the warehouse, and a semantic layer models the metrics, dimensions, and access rules on top — defining each metric once so every BI tool, embedded app, spreadsheet, and AI agent returns the same number. Cube is the agentic analytics platform built on a semantic layer; its open-source core, Cube Core, is where that model lives.

A working definition

Data modeling is the process of defining how data is organized, related, and accessed. You identify the entities a business cares about — customers, orders, products, sessions — describe their attributes, specify the relationships between them, and, for analytics, define the metrics and join paths that turn those raw entities into numbers people can reason about. The output is a blueprint: something the rest of the stack designs, builds, and queries against.

The payoff is consistency. A good model means data is stored once, related correctly, and interpreted the same way wherever it's used — in the application database, the warehouse, the BI tool, and the spreadsheet. The alternative is the familiar failure mode: every system re-encodes the same business logic in its own way, and "active customer" ends up meaning one thing in the product, another in finance, and a third in a one-off query.

Data modeling is defined by what it produces, not where it lives. A model is not a database — it's the specification a database (or a semantic layer, or a report) is built from. The same conceptual model can be implemented in PostgreSQL today and Snowflake tomorrow without the underlying business meaning changing.

The three levels of data modeling

Data modeling is usually described across three levels of abstraction, moving from business meaning to implementation detail:

  • Conceptual model. The highest level. It names the core entities and how they relate — "a customer places many orders; an order contains many line items" — in terms a business stakeholder can read. No attributes, keys, or database specifics yet.
  • Logical model. Adds detail without committing to a platform: attributes for each entity, primary and foreign keys, cardinality, and normalization decisions. It's precise enough to review for correctness but still independent of any particular database.
  • Physical model. Translates the logical model into a concrete schema for a specific system: real tables, column types, indexes, partitioning, and constraints, tuned to the engine you're running — PostgreSQL, Snowflake, BigQuery, Redshift, or Databricks.

The three aren't rival approaches; they're stages of the same work, each with a different audience. Conceptual is for the business, logical for data designers, physical for the engineers who implement it. Skipping the upper levels is how teams end up with schemas that are technically valid but don't match how the business actually thinks.
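To make the three levels concrete, here is a minimal sketch of the customer, order, and line-item example at each level. Everything in it is an illustrative assumption: the attribute names, the index, and the choice of SQLite as the physical engine are not taken from any particular system.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual level: entities and relationships only, readable by a stakeholder.
CONCEPTUAL = [
    ("Customer", "places many", "Order"),
    ("Order", "contains many", "LineItem"),
]

# Logical level: attributes and keys, still independent of any database engine.
@dataclass(frozen=True)
class Customer:
    customer_id: int   # primary key
    email: str

@dataclass(frozen=True)
class Order:
    order_id: int      # primary key
    customer_id: int   # foreign key -> Customer.customer_id
    ordered_at: str

@dataclass(frozen=True)
class LineItem:
    line_item_id: int  # primary key
    order_id: int      # foreign key -> Order.order_id
    sku: str
    quantity: int

# Physical level: concrete DDL for one engine (SQLite here, as an assumption).
PHYSICAL_DDL = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    ordered_at  TEXT NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE TABLE line_items (
    line_item_id INTEGER PRIMARY KEY,
    order_id     INTEGER NOT NULL REFERENCES orders(order_id),
    sku          TEXT NOT NULL,
    quantity     INTEGER NOT NULL CHECK (quantity > 0)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(PHYSICAL_DDL)

# List the tables the physical model actually created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

The business meaning is fixed at the conceptual level. Only the last block would change if the same model were implemented on a different engine.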

Two traditions: entity-relationship and dimensional modeling

Underneath those levels sit two long-standing modeling traditions, each optimized for a different workload.

Entity-relationship (ER) modeling, formalized by Peter Chen in 1976, represents data as entities with attributes and the relationships between them. Its instinct is normalization — store each fact once, reference it by key — which minimizes duplication and keeps writes consistent. That makes ER modeling the natural fit for transactional (OLTP) systems: the order-entry database behind your app.

Dimensional modeling, associated with Ralph Kimball, organizes data for analysis instead of transactions. It splits data into fact tables — the numeric events you measure, like orders or page views — surrounded by dimension tables that hold the descriptive context you slice by, like date, product, or region. This star-schema shape denormalizes deliberately so that aggregations are fast and the model is easy to reason about, which is why it underpins most reporting warehouses.

              | Entity-relationship modeling           | Dimensional modeling
Optimized for | Transactions (OLTP), consistent writes | Analytics and reporting, fast reads
Shape         | Normalized entities and relationships  | Fact tables surrounded by dimensions (star schema)
Goal          | Avoid duplication, protect integrity   | Make aggregation fast and queries intuitive
Typical home  | Application and source databases       | Data warehouse and analytics models
Trade-off     | Slower, multi-join analytical queries  | Some redundancy by design

These aren't mutually exclusive. A typical stack normalizes in the source systems and then builds dimensional models in the warehouse for analytics — the same underlying facts, modeled differently for different jobs.
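As a rough illustration of the dimensional shape, the sketch below builds a tiny star schema in an in-memory SQLite database and aggregates the fact table by two dimensions. The table names (fact_orders, dim_product, dim_date), columns, and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context you slice by.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    category    TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    iso_date TEXT NOT NULL,
    quarter  TEXT NOT NULL
);
-- Fact table: one row per measured event, with keys into each dimension.
CREATE TABLE fact_orders (
    order_id     INTEGER PRIMARY KEY,
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    amount_cents INTEGER NOT NULL
);
INSERT INTO dim_product VALUES
    (1, 'Notebook', 'Stationery'),
    (2, 'Pen',      'Stationery'),
    (3, 'Lamp',     'Home');
INSERT INTO dim_date VALUES
    (20260115, '2026-01-15', '2026-Q1'),
    (20260420, '2026-04-20', '2026-Q2');
INSERT INTO fact_orders VALUES
    (1, 1, 20260115, 1200),
    (2, 2, 20260115,  300),
    (3, 3, 20260420, 4500),
    (4, 1, 20260420, 1200);
""")

# A typical star-schema query: join the fact to its dimensions, then aggregate.
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.amount_cents) AS revenue_cents
    FROM fact_orders AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key    = f.date_key
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()

for category, quarter, revenue_cents in rows:
    print(category, quarter, revenue_cents)
```

Each fact row joins to exactly one row in each dimension, so the joins add context without multiplying the measured amounts. The category text is repeated across products by design, which is the deliberate redundancy the table above describes.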

How data modeling works in practice

Designing a model follows a recognizable arc, regardless of tradition:

  1. Requirements analysis. Understand what questions the data has to answer and what the business actually means by its key terms.
  2. Conceptual modeling. Sketch the main entities and relationships at a level a stakeholder can confirm.
  3. Logical modeling. Add attributes, keys, and structure; decide how far to normalize.
  4. Physical modeling. Implement the schema in a specific database, with the types, indexes, and constraints that engine needs.

In an analytics stack, though, modeling doesn't stop at the warehouse schema — it happens again, one layer up. Transformation tools like dbt model and shape the raw data in the warehouse into clean, tested tables. Then a semantic layer models the metrics and dimensions on top of those tables: what "revenue" means, which dimensions you slice it by, the join paths that connect entities, and the access rules that govern who sees what. The first kind of modeling produces trustworthy tables; the second produces trustworthy answers.

Where data modeling meets the semantic layer

This second modeling step is the one that's changed the most. For years, metric definitions lived wherever they were convenient — inside a BI tool's reports, in handwritten SQL, in a spreadsheet formula. Each was a small, local data model, and none of them agreed with the others. That's the "three different numbers for active users in one meeting" problem, and it's a modeling problem, not a tooling problem.

A semantic layer fixes it by making the metric model the shared source of truth. You model your metrics, dimensions, join paths, and access rules once, as code, and every consumer — a BI dashboard, an analytics feature embedded in your product, a spreadsheet, or an AI agent — requests them from that single model. Modeling stops being something each tool does privately and becomes a governed asset the whole stack reads from. Because it's defined as code, the model gets the same rigor as the rest of your engineering: version control, code review, CI/CD, and isolated environments.
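The idea of defining a metric once and letting every consumer request it by name can be sketched in a few lines. This is a toy registry, not Cube's API or data-model syntax: the Metric class, the MODEL dictionary, compile_query, and the net_revenue definition are all hypothetical names chosen for illustration.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    expression: str      # SQL aggregate that defines the metric
    table: str
    filters: tuple = ()  # conditions that are always applied

# The single, shared definition (hypothetical): net of refunds, completed orders only.
MODEL = {
    "net_revenue": Metric(
        name="net_revenue",
        expression="SUM(amount_cents - refund_cents)",
        table="orders",
        filters=("status = 'completed'",),
    ),
}
ALLOWED_DIMENSIONS = {"region"}

def compile_query(metric_name, dimensions=()):
    """Turn a metric name plus optional dimensions into SQL from the shared model."""
    metric = MODEL[metric_name]  # unknown metric names raise KeyError
    for dim in dimensions:
        if dim not in ALLOWED_DIMENSIONS:
            raise ValueError(f"dimension not in the model: {dim}")
    select = [*dimensions, f"{metric.expression} AS {metric.name}"]
    sql = f"SELECT {', '.join(select)} FROM {metric.table}"
    if metric.filters:
        sql += " WHERE " + " AND ".join(metric.filters)
    if dimensions:
        cols = ", ".join(dimensions)
        sql += f" GROUP BY {cols} ORDER BY {cols}"
    return sql

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, region TEXT, status TEXT,
    amount_cents INTEGER, refund_cents INTEGER
);
INSERT INTO orders VALUES
    (1, 'EU', 'completed', 1000,   0),
    (2, 'EU', 'completed',  500, 500),
    (3, 'US', 'completed', 2000, 100),
    (4, 'US', 'cancelled',  900,   0);
""")

# Two different consumers ask for the same metric by name.
dashboard_answer = conn.execute(compile_query("net_revenue")).fetchone()[0]
agent_answer = conn.execute(compile_query("net_revenue")).fetchone()[0]
by_region = conn.execute(compile_query("net_revenue", ("region",))).fetchall()

print(dashboard_answer, agent_answer, by_region)
```

Neither consumer writes the refund or status logic itself, so there is nowhere for the two answers to diverge. A real semantic layer adds joins, access rules, and caching on top of the same principle.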

So in 2026, "data modeling" for analytics increasingly means modeling the semantic layer. The entity and dimensional thinking is still there underneath — facts, dimensions, keys, join paths — but the deliverable is a governed business model that serves many tools, not a schema for one database.

Why data modeling matters more in the AI era

When the consumer asking a question is an AI agent answering on behalf of a person, the quality of your data model stops being a back-office concern and becomes load-bearing.

Here's the structural reason. Point a large language model at raw, unmodeled tables and it has to re-derive your business on every prompt. A table named orders doesn't encode whether revenue is gross or net, includes tax, or excludes refunds; the join graph has fan-outs and three tables that all look like "the customer"; and nothing in a SELECT distinguishes a correct query from one that leaks another tenant's data. So "what was revenue last quarter?" can return three different numbers across three sessions. No amount of prompt engineering fixes that — it's a missing model.
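The fan-out problem is easy to reproduce. In the sketch below, which uses made-up orders and line_items tables in an in-memory SQLite database, a plausible-looking join repeats each order total once per line item and inflates revenue, while the query over the correct grain does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    total_cents INTEGER NOT NULL
);
CREATE TABLE line_items (
    line_item_id INTEGER PRIMARY KEY,
    order_id     INTEGER NOT NULL REFERENCES orders(order_id),
    sku          TEXT NOT NULL
);
INSERT INTO orders VALUES (1, 10000), (2, 5000);
-- Order 1 has three line items, order 2 has one.
INSERT INTO line_items VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 1, 'C'), (4, 2, 'A');
""")

# Naive query: joining to line_items repeats each order once per line item,
# so order 1's total is counted three times.
naive_revenue = conn.execute("""
    SELECT SUM(o.total_cents)
    FROM orders AS o
    JOIN line_items AS li ON li.order_id = o.order_id
""").fetchone()[0]

# Modeled query: revenue is defined at the order grain, with no fan-out join.
modeled_revenue = conn.execute(
    "SELECT SUM(total_cents) FROM orders"
).fetchone()[0]

print(naive_revenue, modeled_revenue)
```

Both queries are valid SQL and both run without error. Only a model that records the grain of each table and the cardinality of each join can tell which one is right.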

A modeled semantic layer is that model. The agent selects from certified metrics by name instead of authoring SQL from scratch, so answers are consistent, governed, and explainable — you can see which named metrics produced a number rather than auditing a wall of generated SQL. This is the foundation of agentic analytics: AI-native BI where agents do the analytical work over a governed model. It's also not theoretical — Brex evaluated approaches for grounding AI on their data, chose Cube, and built Brex Spaces, an embedded AI financial analyst, on top of it. Their one-line summary is the cleanest case for modeling done right: the semantic layer is what makes the AI useful.

Common misconceptions about data modeling

A few myths are worth retiring:

  • "It's a one-time, up-front artifact." Models should evolve as the business does. A metric definition or entity relationship that was right last year may not be right now; treating the model as living, version-controlled code is how you keep it accurate.
  • "There's one correct technique." There isn't. Normalized ER models suit transactional systems; dimensional models suit analytics. Most stacks use both at different layers, and the right choice depends on the workload.
  • "It's only for big projects." Even a small system benefits from a clear model — and the moment more than one tool reads the same metrics, defining them once becomes the difference between consistent numbers and a debugging session.

Where Cube fits

Cube is the agentic analytics platform built on a semantic layer. Its open-source foundation, Cube Core (Apache 2.0), is where the analytics data modeling happens: you model metrics, dimensions, joins, and access rules once, as code, and serve them over SQL, REST, GraphQL, an MCP server for AI agents, and DAX/MDX for spreadsheet tools. Row-level, multi-tenant security is applied at compile time, pre-aggregation caching keeps queries fast, and the model is SQL-first and extensible at query time — governed definitions stay fixed while tools and agents build ad-hoc calculations on top. On top of that foundation, the platform adds AI agent interfaces, workbooks, dashboards, and embedded surfaces, so the same model powers both internal business intelligence for your teams and embedded analytics for your customers. That's why 400+ companies build on Cube across both use cases.

Two clarifications that come up immediately. First, dbt is a partner, not something the semantic layer replaces: dbt models and transforms the data, while the semantic layer models the metrics and serves them. The pattern is to model in dbt and serve via Cube, which reads dbt models. (The one overlapping piece is the dbt Semantic Layer (MetricFlow), which is an alternative to Cube Core rather than to the platform as a whole.) Second, the semantic layer does not replace your warehouse: it sits on top of Snowflake, BigQuery, Redshift, or Databricks, which stay your storage and compute.

Our verdict

Data modeling is the process of defining your data's entities, attributes, relationships, metrics, and join paths so data is stored consistently and interpreted the same way everywhere. It spans three levels — conceptual, logical, physical — and two traditions — entity-relationship for transactional systems, dimensional for analytics. In a modern stack the highest-leverage modeling now lives in the semantic layer, where metrics are defined once and served to BI, embedded apps, spreadsheets, and AI agents from a single governed model. That's what turns an AI that demos well into one you can trust in production — and it's where Cube, built on the open-source Cube Core, fits.

Methodology

This explainer describes data modeling as the term is used in 2026, weighted toward the parts that matter when many tools, and increasingly AI agents, consume the same data: the conceptual/logical/physical levels, the entity-relationship and dimensional traditions, and the shift of analytics modeling into a governed semantic layer that defines metrics once. As the publisher, Cube builds a semantic layer and an agentic analytics platform on top of it, so we have an obvious interest here; we've tried to define the concept neutrally and be explicit about where Cube fits versus the broader category. Product-specific capabilities move quickly; treat them as version-dependent and confirm against current documentation.