Checks whether the passed color data are valid, i.e. whether they are plentiful and varied enough according to the passed validation criteria. This function is normally only used indirectly through `Participant$check_valid_get_twcv()` or `ParticipantGroup$get_valid_twcv()`.
Usage
validate_get_twcv(
color_matrix,
dbscan_eps = 20,
dbscan_min_pts = 4,
max_var_tight_cluster = 150,
max_prop_single_tight_cluster = 0.6,
safe_num_clusters = 3,
safe_twcv = 250
)
Arguments
- color_matrix
An n-by-3 numerical matrix where each row corresponds to a single point in 3D color space.
- dbscan_eps
One-element numerical vector: radius of ‘epsilon neighborhood’ when applying DBSCAN clustering.
- dbscan_min_pts
One-element numerical vector: minimum number of points required in the epsilon neighborhood for core points (including the core point itself).
- max_var_tight_cluster
One-element numerical vector: maximum variance for a cluster to be considered 'tight-knit'.
- max_prop_single_tight_cluster
One-element numerical vector: maximum proportion of points allowed to be within a 'tight-knit' cluster (if this threshold is exceeded, the data are categorized as invalid).
- safe_num_clusters
One-element numerical vector: minimum number of clusters that guarantees validity if points are 'non-tight-knit'.
- safe_twcv
One-element numerical vector: minimum total within-cluster variance (TWCV) score that guarantees validity if points are 'non-tight-knit'.
Value
A list with components
- valid
One-element logical vector
- reason_invalid
One-element character vector, empty if valid is TRUE
- twcv
One-element numeric vector indicating the TWCV (NA if it can't be calculated)
- num_clusters
One-element numeric vector indicating the number of identified clusters that count toward the tally compared with `safe_num_clusters` (NA if it can't be calculated)
Details
This function relies heavily on the DBSCAN algorithm, and its implementation in the R package `dbscan`, for clustering color points. For further information about the `dbscan_eps` and `dbscan_min_pts` parameters, as well as DBSCAN itself, please see the `dbscan` documentation. Once clustering is done, the passed validation criteria are applied:
- If too high a proportion of all color points (cut-off specified with `max_prop_single_tight_cluster`) fall within a single 'tight-knit' cluster (one with a cluster variance less than or equal to `max_var_tight_cluster`), then the data are always classified as invalid.
- If the first criterion is cleared, and points form at least `safe_num_clusters` clusters, data are always classified as valid.
- If the first criterion is cleared, and the Total Within-Cluster Variance (TWCV) score is greater than or equal to `safe_twcv`, data are always classified as valid.
Note that this means data can be classified as valid either by having at least `safe_num_clusters` clusters, or by having points that compose a smaller number of clusters but are spaced relatively far apart within these clusters.
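The validation criteria can be sketched in base R as follows. This is only an illustration, not the package's implementation: `sketch_validity()` is a hypothetical helper, the per-cluster sizes and variances, the cluster tally and the TWCV are assumed to be precomputed, the `reason_invalid` strings are made up, and data that clear the first criterion but meet neither of the other two are assumed to be invalid.

```r
# Hypothetical helper (not part of the package): applies the three
# validation criteria to summary values assumed to be precomputed.
sketch_validity <- function(cluster_sizes, cluster_vars, num_clusters, twcv,
                            max_var_tight_cluster = 150,
                            max_prop_single_tight_cluster = 0.6,
                            safe_num_clusters = 3,
                            safe_twcv = 250) {
  # Criterion 1: no single tight-knit cluster may hold too large a share
  # of all points.
  props <- cluster_sizes / sum(cluster_sizes)
  is_tight <- cluster_vars <= max_var_tight_cluster
  if (any(is_tight & props > max_prop_single_tight_cluster)) {
    # Illustrative reason text, not the package's wording.
    return(list(valid = FALSE,
                reason_invalid = "single tight-knit cluster dominates"))
  }
  # Criterion 2: enough clusters guarantees validity.
  # Criterion 3: high enough TWCV guarantees validity.
  if (num_clusters >= safe_num_clusters || twcv >= safe_twcv) {
    return(list(valid = TRUE, reason_invalid = ""))
  }
  # Assumption: anything that reaches this point is invalid.
  list(valid = FALSE, reason_invalid = "few clusters and low TWCV")
}

# 90% of points sit in one cluster with variance 50: invalid.
dominated <- sketch_validity(c(9, 1), c(50, 400), num_clusters = 2, twcv = 85)
# Only two clusters, but widely spread points: valid through TWCV.
spread <- sketch_validity(c(6, 6), c(400, 500), num_clusters = 2, twcv = 450)
```

The order of the checks matters in this sketch: the tight-cluster check runs first, so a high cluster count or TWCV cannot rescue data that fail it.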
The DBSCAN 'noise' cluster only counts towards the 'cluster tally' (compared with `safe_num_clusters`) if it includes at least `dbscan_min_pts` points. Points in the noise cluster are, however, always included in other calculations, e.g. total within-cluster variance (TWCV).
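The noise-cluster rule for the tally can be illustrated with a small base R sketch. `count_tallied_clusters()` is a hypothetical helper, not a function from the package, and it assumes that noise points carry the cluster label 0, as in the `cluster` vector returned by `dbscan::dbscan()`.

```r
# Hypothetical helper: counts clusters toward the tally, including the
# noise cluster only if it holds at least dbscan_min_pts points.
# Assumes noise points are labeled 0.
count_tallied_clusters <- function(cluster_labels, dbscan_min_pts = 4) {
  n_regular <- length(unique(cluster_labels[cluster_labels != 0]))
  n_noise <- sum(cluster_labels == 0)
  n_regular + as.integer(n_noise >= dbscan_min_pts)
}

# Two regular clusters and two noise points: noise is too small to count.
few_noise <- c(1, 1, 1, 1, 2, 2, 2, 2, 0, 0)
tally_few <- count_tallied_clusters(few_noise)

# Same clusters with four noise points: noise counts as one more cluster.
more_noise <- c(few_noise, 0, 0)
tally_more <- count_tallied_clusters(more_noise)
```

With the default `dbscan_min_pts = 4`, the first tally is 2 and the second is 3.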
See also
`point_3d_variance` for single-cluster variance, `total_within_cluster_variance` for TWCV.