{"podcast":{"title":"Data Skeptic","slug":"data-skeptic","podcast_index_feed_id":587881,"rss_url":"https://dataskeptic.libsyn.com/rss","website_url":"https://dataskeptic.com","image_url":"https://static.libsyn.com/p/assets/0/e/4/b/0e4bd71bb64c6e45/DS_-_New_Logo_assets_-_JL_DS_Logo_Stacked_-_Color_2.jpg","author":"Kyle Polich","episode_count":601,"summary":"The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/data-skeptic"},"episode":{"title":"DataRec Library for Reproducible in Recommend Systems","slug":"datarec-library-for-reproducible-in-recommend-systems","published_at":"2025-11-13T21:41:00+00:00","page_url":"https://stenobird.com/podcast/data-skeptic/datarec-library-for-reproducible-in-recommend-systems","show_page_url":"https://stenobird.com/podcast/data-skeptic","url":"https://dataskeptic.com/blog/episodes/2025/datarec-library-for-reproducible-in-recommend-systems","audio_url":"https://pscrb.fm/rss/p/mgln.ai/e/35/traffic.libsyn.com/secure/dataskeptic/Alberto_With_Ads_V1.mp3?dest-id=201630","summary":"Standardizing dataset management is critical for reproducible research in recommender systems. This episode explores how the DataRec Python library automates downloads, verifies data integrity via checksums, and provides unified data structures to eliminate preprocessing inconsistencies.","meta_description":"Learn how the DataRec library brings reproducibility to recommender systems through automated dataset management, checksum verification, and unified APIs.","key_points":["Main idea: DataRec provides a unified DataRec object to allow the same preprocessing pipelines to run across different file formats like CSV, TSV, and JSON","Practical takeaway: Use checksum verification to ensure that datasets haven't been silently updated or modified at their original source URLs","Failure mode: Small, unnoticed changes in dataset filtering or splitting can lead to incomparable model benchmarks and invalid research conclusions","Main idea: The library is designed to integrate with, rather than replace, existing research frameworks like CoreMap by exporting standardized data versions","Practical takeaway: Implementing temporal splitting and standardized filtering is essential to reflect real-world user engagement and maintain temporal consistency"],"chapters":[{"start_ms":60000,"title":"Introduction to DataRec","summary":"An overview of DataRec's ability to handle data cleaning, splitting, and standardized dataset management."},{"start_ms":205000,"title":"Graph-Based Recommenders","summary":"Discussion on the integration of public and private knowledge graphs to enhance recommendation accuracy."},{"start_ms":350000,"title":"The Role of Offline Evaluation","summary":"How shared, publicly prepared datasets allow researchers to compare models and strategies effectively."},{"start_ms":500000,"title":"The Reproducibility Crisis","summary":"How variations in dataset configuration and even hardware can impact the final performance metrics of models."},{"start_ms":655000,"title":"Temporal Splitting and Real-World Behavior","summary":"The importance of maintaining the chronological order of events during data filtering to reflect true user engagement."},{"start_ms":805000,"title":"Library Implementation and Evolution","summary":"The process of implementing standardized utilities and the ongoing work to integrate new datasets and features."},{"start_ms":955000,"title":"Unified Data Structures","summary":"How DataRec abstracts different file formats into a single, reusable object for consistent pipeline execution."},{"start_ms":1240000,"title":"Getting Started with DataRec","summary":"Practical steps for researchers to install and begin using the library for their experiments."}],"topics":["Recommender Systems","Python Library","Machine Learning Reproducibility","Dataset Management","Knowledge Graphs","Data Engineering","Algorithm Benchmarking","Data Integrity"],"duration_seconds":1968,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/data-skeptic/episodes/datarec-library-for-reproducible-in-recommend-systems/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/data-skeptic/datarec-library-for-reproducible-in-recommend-systems.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}