Datasets
Bring your own domain-specific intelligence to Dregs. Custom datasets let you enrich scoring with business-specific knowledge that no off-the-shelf fraud tool can provide. Your data, your rules, applied automatically during every scoring run.
What Are Datasets
Datasets are key-value lookup tables that analyzers consult during scoring. Think of them as reference lists that help Dregs make smarter decisions about your users. When an analyzer evaluates an identity, it can check whether a value — like an email domain, IP address, or username — exists in a dataset and factor that into the score.
The concept is simple, but the applications are broad. Any piece of knowledge you have about what makes a user suspicious or trustworthy in your specific domain can become a dataset that improves scoring accuracy.
Examples of Useful Datasets
To give you a sense of what is possible, here are some common dataset types:
- Disposable email domains — a built-in global dataset that Dregs maintains. Automatically flags users who sign up with throwaway email addresses from services like Guerrilla Mail or Temp Mail.
- Known VPN and proxy IP ranges — IP ranges associated with VPN services and data center proxies that are commonly used to mask identity.
- Trusted enterprise domains — email domains belonging to enterprise customers or known organizations, used to boost authenticity scores for users with corporate email addresses.
- Internal IP addresses — your office and infrastructure IPs, so your team's activity does not pollute scoring data.
- Known bad actors — email addresses, usernames, or other identifiers from your own fraud investigations. When a banned user comes back with a new account, the dataset catches them.
- Geographic regions of interest — country codes or city names where you see disproportionate abuse, or conversely, where your most valuable customers are concentrated.
These are just starting points. Any list of values that helps distinguish good users from bad ones in your application can become a dataset.
Global vs. Customer Datasets
Datasets come in two flavors:
- Global datasets are maintained by Dregs and shared across all customers. The disposable email domains dataset is a good example — it benefits everyone and is updated regularly. You cannot modify global datasets, but your analyzers can reference them freely.
- Customer datasets are private to your account. Only your analyzers can access them, and only your team can manage their contents. This is where your business-specific intelligence lives.
When an analyzer looks up a value, it searches both your customer datasets and any applicable global datasets. Customer data always takes precedence if both contain the same key.
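The precedence rule can be illustrated with a minimal in-memory sketch. It is not Dregs's implementation: the `lookup` function, the plain dictionaries standing in for datasets, and the sample domains are all assumptions made for illustration.

```python
from typing import Optional

def lookup(key: str, customer: dict, global_: dict) -> Optional[str]:
    """Return the value for `key`, preferring the customer dataset.

    Hypothetical helper: plain dicts stand in for a customer dataset
    and a global dataset.
    """
    if key in customer:
        return customer[key]      # customer entry wins when both have the key
    return global_.get(key)       # otherwise fall back to the global dataset

# Illustrative data only; these are not real Dregs dataset contents.
global_domains = {"throwaway.example": "disposable", "partner.example": "disposable"}
customer_domains = {"partner.example": "trusted"}

result_overridden = lookup("partner.example", customer_domains, global_domains)
result_global = lookup("throwaway.example", customer_domains, global_domains)
result_missing = lookup("unknown.example", customer_domains, global_domains)
```

Here `partner.example` appears in both dictionaries, so the customer value is returned; a key found only in the global dictionary falls through to it, and a key found in neither yields `None`.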
How Analyzers Use Datasets
During scoring, analyzers can perform two basic operations against datasets:
- Existence check — does a key exist in the dataset? For example, the disposable email analyzer checks whether a user's email domain appears in the disposable domains dataset. If it does, the authenticity score takes a hit.
- Value lookup — what value is associated with a key? For example, an IP reputation dataset might map IP ranges to risk categories (datacenter, residential, mobile), letting analyzers factor the connection type into scoring.
This is the power of datasets: they let analyzers make context-aware decisions without hardcoding lists into the analysis logic. Update the dataset, and every future scoring run benefits immediately.
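As a rough sketch of the two operations, the example below models an existence check with a set and a value lookup with a mapping from IP ranges to categories. The function names, the data structures, and the sample entries are assumptions for illustration; they do not describe how Dregs stores or queries datasets internally.

```python
import ipaddress
from typing import Optional

# Existence check: a set of keys (illustrative entries, not the real dataset).
disposable_domains = {"throwaway.example", "tempbox.example"}

def is_disposable(email: str) -> bool:
    """Return True if the email's domain appears in the sample dataset."""
    domain = email.rsplit("@", 1)[-1].lower()
    return domain in disposable_domains

# Value lookup: IP ranges mapped to a risk category (documentation ranges).
ip_reputation = {
    ipaddress.ip_network("203.0.113.0/24"): "datacenter",
    ipaddress.ip_network("198.51.100.0/24"): "residential",
}

def connection_type(ip: str) -> Optional[str]:
    """Return the category of the first range containing `ip`, else None."""
    addr = ipaddress.ip_address(ip)
    for network, category in ip_reputation.items():
        if addr in network:
            return category
    return None
```

An analyzer-like caller could then lower a score when `is_disposable(...)` is true, or branch on the category returned by `connection_type(...)`. Updating the set or the mapping changes the outcome without touching either function.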
Versioning
Datasets use versioned updates for zero-downtime changes. This matters when you are importing large datasets or making bulk modifications — you do not want scoring to see a half-updated dataset.
The process works like this: you create a new version of the dataset, populate it with entries, and then publish it. While you are building the new version, the old version continues serving reads. The switch is atomic — one moment analyzers see the old data, the next they see the new data. No partial states, no race conditions.
After publishing, the old version is cleaned up automatically. You do not need to manage version lifecycle manually.
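The create, populate, publish cycle can be sketched in memory as follows. The `VersionedDataset` class and its method names are hypothetical, and the sketch only mirrors the behavior described above: readers keep seeing the published version while a draft is built, and publishing swaps in the draft in one step.

```python
import threading

class VersionedDataset:
    """Hypothetical in-memory model of a dataset with versioned updates."""

    def __init__(self, entries=None):
        self._published = dict(entries or {})
        self._lock = threading.Lock()

    def get(self, key):
        # Reads always go to the currently published version.
        return self._published.get(key)

    def new_version(self):
        # A draft is a separate dict; readers never see it until publish.
        return {}

    def publish(self, draft):
        # Swap the reference in one step; the old dict is then
        # unreferenced and reclaimed by the garbage collector.
        with self._lock:
            self._published = dict(draft)

ds = VersionedDataset({"old-key": "old-value"})
draft = ds.new_version()
draft["new-key"] = "new-value"

seen_before_publish = (ds.get("old-key"), ds.get("new-key"))
ds.publish(draft)
seen_after_publish = (ds.get("old-key"), ds.get("new-key"))
```

Before `publish`, lookups return only the old entries; afterwards they return only the new ones, with no intermediate state in which a reader could observe a partially populated draft.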
Managing Datasets
You can create and manage datasets through the Settings page in the dashboard or programmatically via the REST API.
The dashboard interface lets you create datasets, add individual entries, and view existing data. For bulk operations — importing thousands of entries from a CSV or external source — the API is the better choice. The API supports both replacing all entries in a dataset and appending new entries to an existing dataset.
Each dataset has a slug (a URL-friendly identifier) and a human-readable name. When creating analyzers that reference datasets, you use the slug.
Case Sensitivity
Each dataset can be configured as case-sensitive or case-insensitive. This is set when the dataset is created and affects how keys are stored and matched.
Case-insensitive datasets normalize all keys to lowercase on insert. This means lookups for Gmail.com, gmail.com, and GMAIL.COM all match the same entry. Most datasets, especially those dealing with email domains, URLs, or hostnames, should be case-insensitive.
Case-sensitive datasets store keys exactly as provided. Use this when the casing carries meaning, such as identifiers or tokens where ABC123 and abc123 are genuinely different values.
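The difference between the two modes can be shown with a small sketch. The `Dataset` class below is hypothetical, and it assumes that lookup keys are lowercased the same way as inserted keys in case-insensitive mode, which is what the matching behavior described above implies.

```python
class Dataset:
    """Hypothetical in-memory dataset with a fixed case-sensitivity mode."""

    def __init__(self, case_sensitive: bool):
        self.case_sensitive = case_sensitive  # chosen once, at creation
        self._entries = {}

    def _normalize(self, key: str) -> str:
        # Case-insensitive datasets lowercase keys; case-sensitive ones
        # keep them exactly as provided.
        return key if self.case_sensitive else key.lower()

    def add(self, key: str, value: str = "") -> None:
        self._entries[self._normalize(key)] = value

    def contains(self, key: str) -> bool:
        return self._normalize(key) in self._entries

domains = Dataset(case_sensitive=False)
domains.add("Gmail.com")

tokens = Dataset(case_sensitive=True)
tokens.add("ABC123")
```

With this setup, `domains.contains("GMAIL.COM")` is true because both the stored key and the lookup key are lowercased, while `tokens.contains("abc123")` is false because the stored key `ABC123` is compared exactly.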
Datasets are most powerful when combined with custom analyzers. See Scoring to understand how analyzers use datasets during the scoring process. Contact dregs@dregs.com to learn more about enabling custom analyzers.