Given the lack of ticket numbers in 2022, I am going to limit the analysis to 2017-2021. There is also strikingly little license plate information to work with. Most of the dataset (81.76%) has neither driver's license state nor license plate state information:
state | plate_and_lic_combined | %
---|---|---
(blank) | 6,028,469 | 81.76
DC | 1,130,953 | 15.34
MD | 122,663 | 1.66
VA | 54,572 | 0.74
FL | 18,733 | 0.25
DE | 4,906 | 0.07
MA | 4,798 | 0.07
MI | 3,174 | 0.04
PA | 2,590 | 0.04
NC | 2,137 | 0.03
I am looking for an approach that might yield a usable hypothesis. First, omit the empty values and recompute the shares among the four largest states:
state | plate_and_lic_combined | %
---|---|---
DC | 1,130,953 | 85.23
MD | 122,663 | 9.24
VA | 54,572 | 4.11
FL | 18,733 | 1.41
Virginia's 4.11% share seems low, but it could be representative.
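The recomputed shares can be reproduced from the counts alone. This is a minimal sketch, assuming the percentages are taken over the four listed states only (which is what the published figures imply); the variable names are illustrative and not from the original analysis.

```python
# Counts copied from the table above, with the blank category already dropped.
counts = {"DC": 1_130_953, "MD": 122_663, "VA": 54_572, "FL": 18_733}

# Denominator: only the four listed states (an assumption, see the lead-in).
total = sum(counts.values())

# Share of each state among the non-empty records, as a percentage.
shares = {state: round(100 * n / total, 2) for state, n in counts.items()}

for state, pct in shares.items():
    print(f"{state}: {pct:.2f}%")
```

If the remaining states were included in the denominator, each share would come out slightly lower.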
The big problem is that when the dataset is limited to Automated Traffic Enforcement (ATE) cameras, there is a real paucity of data:
state | plate_and_lic_combined | %
---|---|---
DC | 1,002,118 | 96.13
FL | 16,525 | 1.59
MD | 8,638 | 0.83
DE | 4,381 | 0.42
MA | 4,271 | 0.41
VA | 3,599 | 0.35
MI | 2,888 | 0.28
There’s simply no way FL belongs near the top of this list, ahead of both MD and VA.
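One way to see how skewed the ATE subset is: compare each state's ATE count with its count in the full table. The sketch below uses only figures already shown above and assumes the ATE rows are a subset of the full 2017-2021 dataset; `all_counts`, `ate_counts` and `ate_fraction` are illustrative names.

```python
# Counts from the full table (all violations with a state value).
all_counts = {"DC": 1_130_953, "MD": 122_663, "VA": 54_572, "FL": 18_733,
              "DE": 4_906, "MA": 4_798, "MI": 3_174}

# Counts from the ATE-only table.
ate_counts = {"DC": 1_002_118, "FL": 16_525, "MD": 8_638, "DE": 4_381,
              "MA": 4_271, "VA": 3_599, "MI": 2_888}

# Percentage of each state's records that come from ATE cameras.
ate_fraction = {
    state: round(100 * ate_counts[state] / all_counts[state], 1)
    for state in ate_counts
}

# Print from the lowest ATE fraction to the highest.
for state, pct in sorted(ate_fraction.items(), key=lambda kv: kv[1]):
    print(f"{state}: {pct}% of records are ATE")
```

On these numbers MD and VA come out at roughly 7%, while DC, FL, DE, MA and MI all sit at roughly 88 to 91%. That does not explain why, but it suggests the state field is populated very differently for ATE records than for the rest.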
I also looked at MAR_ID, and I believe it is correlated with ATE camera placements. I might be able to use it to better assign tickets to the top few ticket-giving ATEs in cases the text-matching algorithm may have missed.
Also worth noting: I found some metadata on the open data site: https://www.arcgis.com/sharing/rest/content/items/94455e9d5f42439788da06caeaaf35ac/info/metadata/metadata.xml?format=default&output=html
It defines MAR_ID as ‘Master Address Repository (MAR) Unique Identifier’, which is probably something the geocoders use to place the moving violations in the GIS system. Still, it may match ATE cameras.
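The MAR_ID idea could be sketched roughly as follows: for each MAR_ID, find the camera that dominates the records the text matcher already labeled, then give that camera to the unlabeled records at the same MAR_ID. Everything here is hypothetical, including the record layout, the camera labels, the `backfill_by_mar_id` helper and the `min_share` threshold; it is not the matching algorithm used in the analysis.

```python
from collections import Counter, defaultdict

# Hypothetical records: (mar_id, camera). camera is None when the
# text-matching step found no camera. All values are made up.
records = [
    (101, "CAM_A"), (101, "CAM_A"), (101, None),
    (202, "CAM_B"), (202, None),
    (303, None),
]

def backfill_by_mar_id(records, min_share=0.9):
    """Fill missing cameras with the camera that dominates the same MAR_ID."""
    # Count the cameras already matched at each MAR_ID.
    seen = defaultdict(Counter)
    for mar_id, camera in records:
        if camera is not None:
            seen[mar_id][camera] += 1

    # Keep a MAR_ID only if one camera accounts for at least min_share
    # of its already-matched records.
    dominant = {}
    for mar_id, cams in seen.items():
        camera, n = cams.most_common(1)[0]
        if n / sum(cams.values()) >= min_share:
            dominant[mar_id] = camera

    # Unmatched records inherit the dominant camera, if there is one.
    return [(m, c if c is not None else dominant.get(m)) for m, c in records]

filled = backfill_by_mar_id(records)
```

Records at a MAR_ID with no matched camera at all (303 in this toy data) stay unassigned, which keeps the backfill conservative.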