2020/11/26 How to choose teacher data
FRONTEO's Anisa Henderson (Florida, USA / Washington, DC, USA) will explain how to select teacher data for AI-based reviews.
Teacher data is easy to talk about, but which data, and how much, should you choose?
The ideal teacher data is a set of documents randomly selected from the entire set of documents to be reviewed, with the number of documents determined by statistical theory. It all comes down to statistics.
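As a minimal sketch of the random selection described above, the following Python draws a simple random sample without replacement from a list of document IDs. The function name `draw_teacher_sample`, the `DOC-` ID format, and the population and sample sizes are all hypothetical values chosen for illustration; they are not part of any FRONTEO tool.

```python
import random


def draw_teacher_sample(doc_ids, sample_size, seed=None):
    """Return a simple random sample (without replacement) of document IDs.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    docs = list(doc_ids)
    # Never ask for more documents than exist in the review set.
    k = min(sample_size, len(docs))
    return rng.sample(docs, k)


# Illustrative review set of 10,000 documents (made-up IDs).
doc_ids = [f"DOC-{i:06d}" for i in range(1, 10_001)]
sample = draw_teacher_sample(doc_ids, 500, seed=42)
print(len(sample), "documents selected for teacher data")
```

Fixing the seed is a design choice: it lets the same sample be drawn again later if the selection ever has to be explained or audited.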
How many documents are used as teacher data?
As explained above, documents randomly selected from the documents to be reviewed are used as teacher data, but how many are enough? Here, we use a statistical model to determine how many documents should be used as teacher data (the sample size).
Reference: https://www.calculator.net/sample-size-calculator.html
Entering the number of documents to be reviewed into a statistical calculator gives the number of documents needed as teacher data to achieve a given confidence level and margin of error. FRONTEO usually sets the confidence level to 95% and the margin of error to ±2.5%. With 100,000 documents, a sample size of 1,514 achieves this confidence level and margin of error.
The confidence level, margin of error, and sample size are all interrelated. This is explained in detail on EDRM.net, so please refer to it.
Reference: https://edrm.net/resources/project-guides/edrm-statistical-sampling-applied-to-electronic-discovery/
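The calculation performed by such a sample-size calculator can be sketched as follows. This assumes the standard formula for estimating a proportion with a finite population correction and the most conservative proportion p = 0.5. The article does not state which formula the calculator uses, so treat this as an assumption; the function name `sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.025, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumes the normal approximation and a worst-case proportion p = 0.5.
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = z**2 * p * (1 - p) / margin**2
    # Finite population correction shrinks the sample for small populations.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# 95% confidence, ±2.5% margin of error, 100,000 documents to review.
print(sample_size(100_000))  # → 1514
```

Under these assumptions, a population of 100,000 documents reproduces the sample size of 1,514 quoted in this article.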
I don't want to create new teacher data!
Creating teacher data is honestly not fun. Can I reuse teacher data I created before? Can the sample size be smaller? It is natural to think so.
Sometimes you cannot afford to read thousands of teacher documents but still want to use random sampling. In such cases, you can reduce the sample size by lowering the confidence level or widening the margin of error. However, it is best to keep the confidence level at 90% or higher and the margin of error within ±5%.
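To see what a smaller sample costs, the relationship can be turned around: given the number of documents you can afford to read, estimate the margin of error you end up with. The sketch below assumes the same normal-approximation formula with finite population correction and p = 0.5; the function name `margin_of_error` and the example sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(n, population, confidence=0.95, p=0.5):
    """Margin of error obtained when n of `population` documents are sampled.

    Assumes the normal approximation and a worst-case proportion p = 0.5.
    """
    if population < 2 or not 1 <= n <= population:
        raise ValueError("need population >= 2 and 1 <= n <= population")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Finite population correction: sampling a larger share narrows the margin.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc


population = 100_000
for n, conf in [(1_514, 0.95), (500, 0.95), (270, 0.90)]:
    moe = margin_of_error(n, population, conf)
    print(f"n={n:>5}, confidence={conf:.0%}: margin of error = ±{moe:.2%}")
```

Under these assumptions, about 270 documents out of 100,000 sit right at the 90% and ±5% floor mentioned above.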
You can reuse teacher data from a past review, but doing so carries risk if the review protocol has changed.
In that respect, KIBIT, FRONTEO's original artificial intelligence, can start an analysis even with little teacher data (it can run an analysis with as few as 50 relevant documents).
As for the risk in that case: AI builds a relevancy model (a model indicating whether a document is relevant) from the teacher data the user provides, so if you reuse past teacher data or reduce the amount of teacher data, the model may not fit the latest documents to be reviewed. However, because a normal review handles a large volume of data, you can use precision and recall to compare the expected results with the actual review results (how easy this is depends on your review platform and analysis tool).
If the actual review results differ greatly from what you expected, you need to improve the quality and quantity of the teacher data. Even then, you can count yourself lucky, because the newly reviewed documents can now be fed to the AI as new teacher data.
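The comparison by precision and recall mentioned above can be sketched as follows. Precision is the share of documents the AI flagged as relevant that reviewers also judged relevant; recall is the share of reviewer-confirmed relevant documents that the AI flagged. The function name `precision_recall` and the sample labels are hypothetical; a real review platform would export these labels in its own format.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from paired relevance labels.

    predicted: AI relevance calls (True = relevant)
    actual:    human reviewer decisions for the same documents
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)      # flagged and truly relevant
    fp = sum(1 for p, a in pairs if p and not a)  # flagged but not relevant
    fn = sum(1 for p, a in pairs if not p and a)  # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for eight documents, for illustration only.
predicted = [True, True, True, False, False, True, False, False]
actual = [True, False, True, True, False, True, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # → precision=0.75, recall=0.75
```

If either figure falls well below what was expected, that is the signal described above to add to or refresh the teacher data.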
Anisa Henderson (Florida, USA / Washington, DC, USA)