"especially because the next challenge with the democratization of LLMs will demand differentiation at the data level, especially of higher quality data for fine-tuning models"
What is "demand differentiation" at the data level?
What I meant by that was your ability to create truly remarkable & unique product experiences driven by ML is in collecting, aggregating & curating high quality data -- usually by thoughtful design and instrumentation of the application and services generating the data.
In terms of where people are getting data from for ML (whether training from scratch or fine-tuning), there's a number of acquisition mechanisms:
- Scraping
- Purchasing from data brokers
- Use existing sources in-house they haven't used before
- 1st party data generated by their application and services
In terms of quality, ia high-quality dataset for one use case might not be high-enough quality for another use case.
For example, a Reddit dataset consisting of all the posts & replies on the r/NFL + a scraped website of football stats could be really useful for a fun chatbot used for bar night trivia. However that same dataset would be useless for created an automated, real-time sports commentator for a Spanish channel.
So part of what feeds into quality is relevancy to the use case and the best way to create high-quality data is either leveraging existing data within a company or by working with the application developers to help engineer the first party data flywheel.
"especially because the next challenge with the democratization of LLMs will demand differentiation at the data level, especially of higher quality data for fine-tuning models"
What is "demand differentiation" at the data level?
What I meant by that is that your ability to create truly remarkable & unique product experiences driven by ML lies in collecting, aggregating & curating high-quality data -- usually through thoughtful design and instrumentation of the applications and services generating that data.
Wonderful article ... thank you for sharing! Where are people getting high quality web data from? Scraping it themselves?
Thanks for the kind words!
In terms of where people are getting data for ML (whether training from scratch or fine-tuning), there are a number of acquisition mechanisms:
- Scraping (a rough sketch of this follows the list)
- Purchasing from data brokers
- Tapping existing in-house sources they haven't used before
- Collecting 1st-party data generated by their applications and services
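For the scraping route, here's a minimal sketch of what that often looks like in practice. The URL and table layout are hypothetical placeholders, not a real stats site -- and you'd want to check the site's robots.txt and terms of service first:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical stats page -- swap in a real URL and verify you're
# allowed to scrape it before doing so.
URL = "https://example.com/nfl/stats"

def scrape_stats_table(url: str) -> list[dict]:
    """Fetch a page and pull rows out of the first HTML table."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
    return rows

if __name__ == "__main__":
    for row in scrape_stats_table(URL)[:5]:
        print(row)
```

The real work in scraping is rarely the fetch itself -- it's cleaning, deduplicating, and structuring what comes back so it's actually usable downstream.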
In terms of quality, a high-quality dataset for one use case might not be high-enough quality for another.
For example, a Reddit dataset consisting of all the posts & replies on r/NFL, plus a scraped website of football stats, could be really useful for a fun chatbot used for bar-night trivia. However, that same dataset would be useless for creating an automated, real-time sports commentator for a Spanish-language channel.
So part of what feeds into quality is relevance to the use case, and the best way to create high-quality data is either to leverage existing data within a company or to work with the application developers to help engineer the first-party data flywheel.
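To make the "instrument the application" idea concrete, here's a rough sketch of what first-party data capture can look like -- a hypothetical event logger, with all names and the schema invented for illustration:

```python
import json
import time
import uuid

# Hypothetical event log for first-party data capture. A real system
# would write to a queue or event bus rather than a local file; the
# point is the consistent, structured schema.
LOG_PATH = "events.jsonl"

def log_event(user_id: str, event_type: str, properties: dict) -> None:
    """Append one structured event; consistent schemas are what make
    this data curatable into fine-tuning sets later."""
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "event_type": event_type,
        "properties": properties,
    }
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Example: capture both the model's output and the user's reaction, so
# accepted/rejected pairs can be curated into fine-tuning data -- the
# flywheel part.
log_event("user-123", "chatbot_response",
          {"prompt": "Who won SB LI?", "response": "The Patriots."})
log_event("user-123", "response_feedback", {"rating": "thumbs_up"})
```

The flywheel comes from closing the loop: the product generates data, the curated data improves the model, and the better model drives more product usage.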
Great post! Waiting for the next one to get into the measures 🤩
Thank you! A draft of the next post was sent to Chad & Mark, so it's definitely on the horizon!