Maintaining Data Integrity of our CiviCRM database

Due to some conservative settings, we've ended up with some duplicate records in our database. I've written the following SQL statements to help us keep track of the dupes.

Contact with the same Email

You don't really need this one - you can simply run the Dupe/Merge rule for contacts - all of our de-dupe rules assume that same contact if the email address is the same.

SELECT e1.contact_id, e2.contact_id,e1.email 
	FROM civicrm_email e1 
		JOIN civicrm_email e2 ON e1.email = e2.email 
	WHERE e1.contact_id != e2.contact_id;

Dupe organizations

If you are looking for dupe organizations (based on just the name of the organization), you can use the Organization strict or fuzzy rule to find them.

I wrote this SQL query so I could focus on the duplicates that are registered for the social forum:

SELECT DISTINCT c1.id,c1.organization_name 
	FROM civicrm_contact c1 
		JOIN civicrm_contact c2 ON c1.organization_name = c2.organization_name 
		JOIN civicrm_relationship AS r 
	WHERE c1.id != c2.id AND (r.relationship_type_id = 11 OR r.relationship_type_id = 12) AND r.contact_id_b = c1.id;

Even these are failry straight forward merges. However, if we have duplicates and more than one of the duplicates is registered, then we have to take special care when merging.

Here are organizations with contacts with more than one registration:

SELECT DISTINCT d1.id,d1.organization_name 
 FROM ussf_distinct_registered_org_dupes d1
	JOIN ussf_distinct_registered_org_dupes d2 ON d1.organization_name = d2.organization_name 
 WHERE d1.id != d2.id;

That statement requires the following view to be created:

 CREATE VIEW ussf_distinct_registered_org_dupes AS 
 	SELECT DISTINCT c1.id,c1.organization_name 
	FROM civicrm_contact c1 JOIN civicrm_contact c2 ON c1.organization_name = c2.organization_name 
	WHERE c1.id != c2.id AND 
		c1.id IN (
			SELECT DISTINCT contact_id_b 
			FROM civicrm_relationship
 			WHERE relationship_type_id = 11 OR relationship_type_id = 12
		) ;

And here's one that show how they paid (if you want to eliminate the paid laters.

SELECT DISTINCT d1.id,d2.organization_name,c.id,c.first_name,c.last_name,is_pay_later 
	FROM ussf_distinct_registered_org_dupes d1 
		JOIN ussf_distinct_registered_org_dupes d2 ON d1.organization_name = d2.organization_name 
		JOIN civicrm_relationship r ON d1.id = r.contact_id_b 
		JOIN civicrm_contact c ON r.contact_id_a = c.id 
		JOIN civicrm_participant p ON c.id = p.contact_id 
	WHERE 
		d1.id != d2.id AND
		r.relationship_type_id = 12

Properly registered contacts

We have some duplicate data about registration in CiviCRM:

  • Participant record
  • Tag USSF2010 Registered

This queries show contacts coded with USSF2010 tag, but who do not have a participant record.

SELECT t.contact_id 
	FROM civicrm_entity_tag t 
		JOIN civicrm_contact c ON c.id = t.contact_id LEFT 
		JOIN civicrm_participant p USING(contact_id)
	WHERE p.contact_id IS NULL AND 
		contact_type = 'Individual' 
		AND t.tag_id = 11;

This one shows the opposite: contacts not coded with USSF2010 tag, but who do have a participant record.

 SELECT p.contact_id
 	FROM civicrm_participant p
	JOIN civicrm_contact c ON c.id = p.contact_id
	LEFT JOIN civicrm_entity_tag t USING(contact_id)
	WHERE p.contact_id IS NULL AND
 		contact_type = 'Individual';

Duplicate participant records

We sometimes get individuals that have registered more than once:

SELECT p1.contact_id 
	FROM civicrm_participant p1 
	JOIN civicrm_participant p2 ON p1.contact_id = p2.contact_id 
	WHERE p1.id != p2.id;

General Statistics

Total registered individuals:

SELECT count(distinct c.id) FROM civicrm_participant p JOIN
civicrm_contact c ON c.id = p.contact_id WHERE contact_type =
'Individual' AND p.event_id = 4 AND (p.status_id = 5 OR p.status_id = 1);

Total registered organizations:

 SELECT count(distinct c.id) FROM civicrm_entity_tag t JOIN
 civicrm_contact c ON c.id = t.contact_id WHERE contact_type =
 'Organization' AND t.tag_id = 11;

Count by state:

SELECT sp.name, COUNT(DISTINCT c.id) AS count 
 FROM civicrm_contact c JOIN civicrm_entity_tag et ON et.contact_id = c.id  
 JOIN civicrm_address a ON c.id = a.contact_id   
 JOIN civicrm_state_province sp ON a.state_province_id = sp.id 
 WHERE a.country_id = 1228 AND tag_id = 11 
 GROUP BY sp.name ORDER BY count;

Country by count:

SELECT co.name, COUNT(DISTINCT c.id) AS count 
 FROM civicrm_contact c JOIN civicrm_entity_tag et ON et.contact_id = c.id  
 JOIN civicrm_address a ON c.id = a.contact_id 
 JOIN civicrm_country co ON a.country_id = co.id 
 GROUP BY co.name ORDER BY count;

Count of registered users with country

SELECT count(DISTINCT c.id) 
 FROM civicrm_contact c JOIN civicrm_entity_tag et ON et.contact_id = c.id  
 JOIN civicrm_address a ON c.id = a.contact_id 
 WHERE country_id IS NOT NULL;

Count of registered users with state

SELECT count(DISTINCT c.id) 
 FROM civicrm_contact c JOIN civicrm_entity_tag et ON et.contact_id = c.id  
 JOIN civicrm_address a ON c.id = a.contact_id 
 WHERE state_province_id IS NOT NULL;

~ ~