When we talk about responsible AI, the conversation often focuses on model behavior—bias, fairness, safety. These are critical concerns, but they're downstream of an equally important question: how do we build the datasets that train these models in the first place?

Dataset construction is where many of the most significant ethical decisions in AI happen. Who is represented in the data? How was consent obtained? What assumptions are embedded in labeling schemas? How is the dataset documented for downstream users? These questions don't have simple technical answers—they require careful thought about values, power dynamics, and the communities affected by our work.

Consent Is Not a Checkbox

Let's start with consent, because it's both fundamental and frequently misunderstood. In many discussions of dataset ethics, consent is treated as a binary—either you have it or you don't. But meaningful consent is more nuanced than that.

Consider a common scenario: collecting data from publicly available sources. Content posted publicly on social media, for example, is legally accessible. But does legal accessibility equal meaningful consent for use in AI training datasets?

I'd argue it doesn't. When someone posts content publicly, they're operating within specific social contexts and expectations. They may expect the content to be seen by their followers, or indexed by search engines, or embedded in news articles. That doesn't necessarily mean they expect or consent to their content being used to train commercial AI systems that might eventually generate content in their style or extract patterns from their behavior.

This gap between legal permissibility and meaningful consent is where much of the ethical complexity lives. At Drane Labs, we've tried to think carefully about this. When we collect data from public sources, we ask: What reasonable expectations would the data subjects have? Would they feel misled or violated if they knew their data was being used this way? Can we contact them to explain our use and give them an opportunity to opt out?

These questions don't always yield clear answers, but asking them is part of building responsibly. Sometimes the answer is that we shouldn't use certain data, even if we legally could. That constraint might make dataset construction harder, but it's a necessary trade-off.

Layered Consent Models

For datasets where we have direct relationships with data contributors, we've implemented what we call "layered consent." This means obtaining consent at multiple levels of specificity, allowing people to opt in or out of different uses.

The first layer is general consent: Can we include your data in our training datasets? This is the basic yes/no that determines whether someone participates at all.

The second layer specifies use cases: Are you comfortable with your data being used for research models that will be published openly? For commercial models deployed in production? For models that might be fine-tuned by third parties? Different people have different comfort levels with these scenarios, and we want to respect that.

The third layer involves retention and deletion: How long are you comfortable with us retaining your data? Do you want the ability to request deletion later? Can we retain aggregate statistics even if individual records are deleted?

Implementing this requires more complex infrastructure than a simple one-time consent checkbox. We maintain a consent database that tracks preferences at granular levels and propagates those preferences through our data pipeline. If someone requests deletion, that request flows through to all derivative datasets and training runs that used their data.
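
To make the layered model a bit more concrete, here is a minimal sketch of how consent records and deletion propagation might be represented. The class, field, and function names are illustrative assumptions for this post, not our actual production schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative sketch only: field and function names are assumptions,
# not the real consent schema.

@dataclass
class ConsentRecord:
    contributor_id: str
    # Layer 1: participation at all
    included: bool
    # Layer 2: per-use-case opt-ins
    allow_open_research: bool = False
    allow_commercial: bool = False
    allow_third_party_finetuning: bool = False
    # Layer 3: retention and deletion preferences
    retention_expires: datetime | None = None
    deletion_requested: bool = False

def usable_for(record: ConsentRecord, use_case: str, now: datetime) -> bool:
    """Check whether a contributor's data may be used for a given use case."""
    if not record.included or record.deletion_requested:
        return False
    if record.retention_expires and now > record.retention_expires:
        return False
    return {
        "open_research": record.allow_open_research,
        "commercial": record.allow_commercial,
        "third_party_finetuning": record.allow_third_party_finetuning,
    }.get(use_case, False)

def propagate_deletion(contributor_id: str,
                       derivative_datasets: dict[str, set[str]]) -> list[str]:
    """List the derivative datasets that must be rebuilt after a deletion request."""
    return [name for name, ids in derivative_datasets.items() if contributor_id in ids]
```

The essential point is that consent checks and deletion requests are queries against structured preferences, not notes in a spreadsheet, so they can be enforced automatically at every stage of the pipeline.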

This is technically challenging and operationally intensive. But it's the right thing to do, and it builds trust with contributors who know we take their preferences seriously.

Documentation Standards

Even with perfect consent practices, datasets can be used irresponsibly if they're poorly documented. Users need to understand what the data represents, how it was collected, what limitations it has, and what uses are appropriate.

We've adopted structured documentation standards for all our datasets, inspired by frameworks like Datasheets for Datasets and Data Statements. Every dataset we create includes standardized documentation covering:

Motivation: Why was this dataset created? What gap does it fill? What are the intended use cases?

Composition: What's in the dataset? How many instances does it contain, what types of data, and what time period does it cover? Are there missing values or anomalies?

Collection: How was data collected? What instruments or methods were used? Who performed the collection? What quality control processes were applied?

Preprocessing: What preprocessing or cleaning was done? Were any instances filtered out? How were labels assigned? What assumptions are embedded in preprocessing decisions?

Distribution: Who has access to this dataset? What terms govern its use? How should users cite it?

Maintenance: Who maintains the dataset? How are errors reported and corrected? Will it be updated over time?

Ethical considerations: What consent mechanisms were used? Are there privacy risks? Are there known biases or representation gaps? Are there uses that would be inappropriate?
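
As a rough illustration of what "structured" means in practice, the sketch below captures those sections as a simple machine-readable record that can be versioned and shipped alongside the data. The field names are assumptions loosely modeled on the Datasheets for Datasets framework, not a published template.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical datasheet structure; field names are illustrative,
# loosely following the Datasheets for Datasets framework.

@dataclass
class Datasheet:
    motivation: str              # why the dataset exists and intended use cases
    composition: str             # instance counts, data types, time period, known gaps
    collection: str              # methods, instruments, collectors, quality control
    preprocessing: str           # cleaning, filtering, labeling, embedded assumptions
    distribution: str            # access, license terms, citation guidance
    maintenance: str             # maintainer, error reporting, update policy
    ethical_considerations: str  # consent, privacy risks, known biases, inappropriate uses

def write_datasheet(sheet: Datasheet, path: str) -> None:
    """Serialize the datasheet alongside the dataset it documents."""
    with open(path, "w") as f:
        json.dump(asdict(sheet), f, indent=2)
```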

This documentation isn't an afterthought—it's created alongside the dataset itself. When we make collection or preprocessing decisions, we document the reasoning immediately while context is fresh. This produces more accurate, useful documentation than trying to reconstruct rationale after the fact.

Community Engagement

Responsible dataset construction isn't something we can do in isolation. The communities represented in our data—or excluded from it—have valuable perspectives on how data should be collected and used.

We've begun experimenting with participatory data collection, where representatives from affected communities are involved in dataset design decisions from the start. What should we measure? How should we label examples? What contexts are important to capture? These aren't purely technical questions—they involve judgments about what matters and how to represent complex realities in structured data.

For example, when building a dataset involving demographic attributes, standard label schemas often reflect crude categories that don't capture how people actually experience or identify with those attributes. Gender becomes a binary, race becomes a fixed set of checkboxes, disability becomes a simple yes/no. These simplifications might make machine learning easier, but they can also embed harmful assumptions and exclude people whose experiences don't fit neat categories.
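
As one hypothetical illustration of the difference a schema makes, compare a rigid single-choice field with one that allows multiple selections, self-description, and an explicit "prefer not to say" option. The category lists below are placeholders, not a recommended taxonomy; what the right categories are is exactly the kind of question community engagement should answer.

```python
# Hypothetical contrast between a rigid schema and a more flexible one.
# Category lists are placeholders, not a recommended taxonomy.

rigid_schema = {
    "gender": {"type": "single_choice", "options": ["male", "female"]},
    "disability": {"type": "boolean"},
}

flexible_schema = {
    "gender": {
        "type": "multi_choice_with_free_text",
        "options": ["woman", "man", "non-binary", "self-described", "prefer not to say"],
        "free_text_field": "gender_self_description",
    },
    "disability": {
        "type": "multi_choice_with_free_text",
        "options": ["none", "self-described", "prefer not to say"],
        "free_text_field": "disability_self_description",
    },
}
```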

Engaging with communities helps us design better schemas. We learn which distinctions matter, where our proposed categories are inadequate, and how to collect information respectfully. This engagement slows down dataset creation, but it produces datasets that are more representative and less likely to cause harm when used downstream.

Community engagement also creates accountability. When we make commitments to communities about how their data will be used, those commitments create obligations that go beyond legal compliance. We're accountable not just to regulators or commercial partners, but to the people whose data makes our work possible.

Handling Sensitive Attributes

Many datasets include sensitive attributes—race, gender, health status, financial information, political views. These attributes are often necessary for building fair models and evaluating disparate impacts. But they also create privacy risks and potential for misuse.

Our approach is to collect sensitive attributes only when genuinely necessary, document the necessity clearly, and implement strong access controls. Not everyone at Drane Labs has access to all datasets. Access is granted based on specific project needs and requires justification.

We also separate sensitive attributes from other data when possible. Rather than storing demographics alongside behavior data in a single table, we maintain separate tables linked by anonymized identifiers. This allows us to compute aggregate statistics about different groups without requiring analysts to access individual-level demographic information.
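
A minimal sketch of that separation, assuming a pandas workflow with two tables keyed by an anonymized identifier; the table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical tables: behavior data and demographics are stored separately,
# linked only by an anonymized identifier.
behavior = pd.DataFrame({
    "anon_id": ["a1", "a2", "a3", "a4"],
    "sessions": [12, 3, 7, 9],
})
demographics = pd.DataFrame({
    "anon_id": ["a1", "a2", "a3", "a4"],
    "age_band": ["18-24", "25-34", "18-24", "35-44"],
})

# Analysts request group-level aggregates; the join happens inside this
# function, so individual-level demographic rows are never handed back.
def group_means(metric: str, group_col: str) -> pd.DataFrame:
    joined = behavior.merge(demographics, on="anon_id")
    return joined.groupby(group_col)[metric].mean().reset_index()

print(group_means("sessions", "age_band"))
```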

For particularly sensitive datasets, we use secure computation techniques. Rather than directly accessing sensitive data, analysts can run approved queries that return aggregate results or models trained on the data without exposing individual records. This adds technical complexity but significantly reduces privacy risks.
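
The "approved queries" idea can be sketched as a thin gateway that only returns aggregates above a minimum group size. This is a simplification of real secure-computation approaches, which might involve differential privacy or secure enclaves; the threshold value and function name here are assumptions.

```python
import pandas as pd

MIN_GROUP_SIZE = 20  # illustrative threshold, not a recommended value

def approved_aggregate(df: pd.DataFrame, group_col: str, metric: str) -> pd.DataFrame:
    """Return per-group means, suppressing groups too small to report safely."""
    grouped = df.groupby(group_col)[metric].agg(["mean", "count"]).reset_index()
    # Suppress small groups rather than exposing near-individual results.
    grouped.loc[grouped["count"] < MIN_GROUP_SIZE, "mean"] = None
    return grouped.drop(columns="count")
```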

Addressing Historical Bias

Many real-world datasets reflect historical biases and inequalities. Training on such data can perpetuate or amplify those biases in model behavior. But simply removing sensitive attributes doesn't solve the problem—bias can persist through correlated features even when protected attributes are excluded.

We approach this through a combination of bias documentation, mitigation techniques, and careful deployment decisions. First, we audit datasets for known bias issues. Are some groups underrepresented? Are labels systematically different for different groups in ways that reflect historical discrimination rather than ground truth?
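
A very simple version of such an audit can be run in a few lines of pandas: compare each group's share of the dataset against a reference distribution, and look at label rates by group. The column names and reference shares below are placeholders, and passing checks like these is a starting point, not a clean bill of health.

```python
import pandas as pd

def representation_audit(df: pd.DataFrame, group_col: str,
                         reference_shares: dict[str, float]) -> pd.DataFrame:
    """Compare each group's share of the dataset to a reference distribution."""
    observed = df[group_col].value_counts(normalize=True)
    rows = []
    for group, expected in reference_shares.items():
        rows.append({
            "group": group,
            "observed_share": float(observed.get(group, 0.0)),
            "reference_share": expected,
        })
    return pd.DataFrame(rows)

def label_rate_by_group(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Positive-label rate per group; large gaps warrant closer review."""
    return df.groupby(group_col)[label_col].mean()
```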

When we identify bias issues, we consider mitigation approaches: resampling to balance representation, carefully designed fairness constraints in training, or separate model evaluation on different subgroups to ensure adequate performance. But we also acknowledge that technical mitigation has limits. Some datasets may be too biased for responsible use in certain applications, regardless of mitigation attempts.
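
As an example of the first of those options, here is a deliberately naive resampling sketch that upsamples every group to the size of the largest one. The helper name is invented, and real mitigation needs more care, since duplicating records can distort other properties of the data.

```python
import pandas as pd

def balance_by_group(df: pd.DataFrame, group_col: str, random_state: int = 0) -> pd.DataFrame:
    """Naive rebalancing: upsample each group to the size of the largest group.

    Illustrative only; duplication can distort other statistics, so real
    mitigation should weigh these side effects against the imbalance itself.
    """
    target = df[group_col].value_counts().max()
    parts = [
        grp.sample(n=target, replace=True, random_state=random_state)
        for _, grp in df.groupby(group_col)
    ]
    return pd.concat(parts, ignore_index=True)
```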

Documentation is critical here. Downstream users need to understand what biases exist in our datasets so they can make informed decisions about whether and how to use the data. We can't anticipate every possible use, so we provide the information needed for others to assess fit for their specific contexts.

Evolving Standards

The field's understanding of responsible dataset construction is evolving rapidly. Practices that seemed adequate a few years ago now appear insufficient. Standards that seem rigorous today may prove inadequate as we discover new failure modes or societal expectations change.

This means responsible dataset construction isn't a static checklist—it requires ongoing learning and adaptation. We participate actively in research and policy discussions about data ethics. We engage with critics who point out limitations in our approaches. We revisit older datasets to assess whether they meet current standards and decide whether updates or deprecation are needed.

One specific area of evolution involves the role of synthetic data. As generative models improve, synthetic data becomes increasingly viable for training. This could address some consent and privacy concerns, since synthetic data doesn't directly represent real individuals.

But synthetic data introduces new questions: Does synthetic data reproduce biases from the real data it was generated from? How do we validate that synthetic data adequately represents the distributions we care about? What are the intellectual property implications if synthetic data is derived from copyrighted material? These questions don't have settled answers yet, but they'll become increasingly important as synthetic data use grows.
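
At least one of those validation questions can be approached empirically: compare the marginal distributions of the synthetic data against the real data it was generated from. The sketch below runs a two-sample Kolmogorov-Smirnov test per shared numeric column; the SciPy dependency and column handling are assumptions, and passing such a check is necessary but nowhere near sufficient, since it says nothing about bias or higher-order structure.

```python
import pandas as pd
from scipy.stats import ks_2samp

def compare_marginals(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Two-sample KS test per shared numeric column: a coarse check that
    synthetic marginals resemble the real ones."""
    rows = []
    for col in real.select_dtypes("number").columns:
        if col in synthetic.columns:
            result = ks_2samp(real[col].dropna(), synthetic[col].dropna())
            rows.append({
                "column": col,
                "ks_stat": result.statistic,
                "p_value": result.pvalue,
            })
    return pd.DataFrame(rows)
```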

Transparency and Trust

Underlying all of these practices is a commitment to transparency. We document our data practices publicly, not because regulations require it, but because transparency builds trust and enables accountability.

When we describe our datasets, we're honest about limitations and uncertainties. We don't claim to have solved thorny ethical problems that remain genuinely difficult. We acknowledge when we've made judgment calls that others might disagree with.

This transparency sometimes reveals imperfections that we could have hidden. But I believe it's essential. Users of our datasets deserve to understand what they're working with. Affected communities deserve to know how their data is being used. And we deserve the feedback that transparency enables—it helps us improve.

The Long View

Building datasets responsibly is more expensive and time-consuming than optimizing purely for scale and speed. It requires infrastructure, processes, and expertise beyond what's needed for technical dataset construction. It creates constraints that limit what data we can use and how quickly we can move.

These costs are real, and in a competitive landscape, they can feel like handicaps. But I believe they're necessary investments. AI systems trained on poorly constructed datasets carry forward the ethical shortcuts that created those datasets. Models inherit biases that weren't documented or addressed. Deployment surprises reflect data limitations that were never acknowledged.

Building responsibly from the start creates datasets that are more trustworthy, more robust, and more sustainable over the long term. It establishes relationships with data contributors and affected communities that enable ongoing collaboration rather than extraction. It creates documentation and processes that make future work more efficient rather than leaving debt for later teams to clean up.

Most importantly, it builds AI systems that deserve the trust people place in them. Technology that doesn't respect the people it depends on ultimately undermines itself. Responsible dataset construction isn't separate from building capable AI—it's foundational to building AI that works well and does good.


Priya Sandoval is Lead Data Scientist at Drane Labs, where she develops ethical data practices and builds training datasets.