By Rosan International | Data Processing
Hello world! As we launch our blog, we wanted to start by laying out some foundations for excellence in data science projects. At Rosan International we do lots of custom research and analytics, working with organizations large and small. In our experience, most organizations have a blend of implicit and explicit norms, guidelines, rules, and standards for managing their data science and analytics projects. The more these are made explicit, the easier it becomes to train new staff, ensure consistent, high-quality outputs, and, most importantly, keep clients happy!
The guidelines we present below identify the basic standards and procedures that should be put in place to ensure reproducibility, minimize errors, improve analytic quality, and introduce consistency in writing and reporting within the scope of quantitative data processing and analytics.
Data Governance
Data governance is the overall management, control, and protection of an organization’s data assets. Effective data governance is crucial for maintaining data integrity and ensuring compliance with regulations. The following aspects of data governance should be considered before launching a custom analytics research project:
- Common Data Sources: Identify data sources, including proprietary data, client proprietary data, and third-party data from reputable sources such as the World Bank, IMF, and FAO. Also lay out data sources to avoid.
- Data Governance Policies: Define policies covering storage, quality, privacy, security, retention, sharing, version control, naming standards, and documentation/metadata.
- Data Inventory: Maintain a data inventory that lists important proprietary data sources for custom research analytics (a minimal example follows below).
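To make the inventory actionable, it helps to keep it machine-readable and version-controlled. Below is a minimal sketch in Python; the entries, column names, and `contains_pii` flag are hypothetical illustrations, not a prescribed schema.

```python
# Minimal, machine-readable data inventory (hypothetical entries for illustration).
import pandas as pd

inventory = pd.DataFrame([
    {"name": "world_bank_wdi", "owner": "analytics team", "source": "World Bank WDI",
     "license": "public", "refresh": "annual", "contains_pii": False},
    {"name": "client_survey_2023", "owner": "client", "source": "client proprietary",
     "license": "restricted", "refresh": "one-off", "contains_pii": True},
])

# Quick governance check: which sources require extra privacy safeguards?
print(inventory.loc[inventory["contains_pii"], ["name", "owner", "license"]])
```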
Data Processing
A sad little secret of most data scientists is that 80% of our effort is spent running data recodes and merges. While this may not be the sexiest part of the job, it is definitely a crucial one to get right. The following data processing procedures should be standardized to ensure consistency and accuracy throughout the analysis process (a brief code sketch follows the list):
- Variable recodes: Modifying variables based on specific criteria or rules.
- Variable transformations: Applying distributional transformations, discretization, or normalization/standardization to variables.
- Variable aggregation: Summarizing data by aggregating values at different levels, such as groups or time periods.
- Merging datasets: Combining multiple datasets based on common variables or identifiers.
- Identifying missing data: Developing strategies to detect and handle missing values appropriately.
- Training/Testing/Validation splits: Partitioning the data for model training, testing, and validation purposes.
- Sampling: Applying sampling techniques to select representative subsets of data for analysis.
- Other advanced methods: Missing data imputation, data reduction, outlier detection, etc.
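To make these steps concrete, here is a minimal sketch using Python with pandas and scikit-learn. The dataset, column names, and recode rules are hypothetical and exist only to illustrate the operations listed above.

```python
# Illustrative data processing pipeline on a hypothetical survey dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

surveys = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "age": [23, 54, 37, None],
    "income": [28000, 95000, 61000, 43000],
    "country": ["KE", "KE", "PE", "PE"],
})
countries = pd.DataFrame({"country": ["KE", "PE"], "region": ["Africa", "Americas"]})

# Variable recode: collapse age into analysis categories.
surveys["age_group"] = pd.cut(surveys["age"], bins=[0, 29, 49, 120],
                              labels=["18-29", "30-49", "50+"])

# Variable transformation: standardize income.
surveys["income_z"] = (surveys["income"] - surveys["income"].mean()) / surveys["income"].std()

# Merge datasets on a common identifier.
merged = surveys.merge(countries, on="country", how="left", validate="many_to_one")

# Identify missing data before deciding how to handle it.
print(merged.isna().sum())

# Aggregation: average income by region.
print(merged.groupby("region")["income"].mean())

# Training/testing split with a fixed seed for reproducibility.
train, test = train_test_split(merged, test_size=0.25, random_state=42)
```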
Quantitative Analytics
Most projects rely on a small number of workhorse analytic methods. Surely there aren't that many ways to run a crosstab, right? Think again! The following quantitative methods should be standardized, particularly those used most frequently (a brief code sketch follows the list):
- Use of Weights: Guidelines for using weights in descriptive analytics, aggregation at different levels (e.g., country, region, world), and variance estimation.
- Frequency tables: Creating 1-way, 2-way, and 3-way tables to summarize categorical data.
- Averages: Calculating averages for continuous variables.
- Correlation: Measuring correlation among continuous or ordinal variables.
- Regression: Applying linear regression for continuous outcomes, logistic regression for binary outcomes, and other regression models for ordinal or count data.
- Graph-based methods: Utilizing choropleth maps, scatter diagrams, and other visualizations to represent quantitative data.
- Significance Testing: Guidelines for conducting significance tests on frequency tables, averages, and regression models.
- Other advanced methods: Factor analysis, cluster analysis, multilevel regression, time series analysis, classification trees, neural networks, and Bayesian methods.
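As a concrete illustration, the sketch below computes a weighted average, a weighted two-way frequency table, and a logistic regression (with significance tests in the model summary) using Python with pandas and statsmodels. The variables, weights, and data values are hypothetical.

```python
# Illustrative weighted descriptives and regression on hypothetical survey data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "satisfied": [1, 0, 1, 1, 0, 1, 0, 1],                # binary outcome
    "age": [25, 34, 45, 52, 29, 61, 38, 47],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "weight": [1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.3, 0.7],   # survey weights
})

# Weighted average of a continuous variable.
print("Weighted mean age:", np.average(df["age"], weights=df["weight"]))

# Two-way frequency table with weighted counts.
print(pd.crosstab(df["gender"], df["satisfied"], values=df["weight"], aggfunc="sum"))

# Logistic regression for a binary outcome; the summary includes significance tests.
model = smf.logit("satisfied ~ age + C(gender)", data=df).fit(disp=0)
print(model.summary())
```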
Reproducibility
Have you ever finalized a project with triumphant results, only to later find out that you can’t reproduce those results due to some stray stochastic process? Reproducibility in data science refers to the ability to recreate and obtain the same results as a previous analysis or experiment using the same data and methods. The following procedures should be specified to ensure reproducibility.
- Document and Share Code: Using version control systems like Git, writing clean and self-explanatory code, avoiding saving code/data to local drives, and organizing projects with clear structures and standardized file naming conventions.
- Manage Dependencies: Documenting and tracking software libraries, packages, and versions used in the analysis, creating environment files for dependencies, and considering containerization tools like Docker for encapsulating the analysis environment.
- Data Management: Specifying data sources, storing raw and processed data separately, and documenting data preprocessing steps.
- Parameterization and Random Seeds: Using parameterization to easily modify input variables and setting random seeds so that random processes are reproducible (see the sketch after this list).
- Record Experimental Details: Keeping records of software versions, hardware configurations, operating systems, and any custom settings or configurations used in the analysis.
- Reproducible Reporting: Using literate programming techniques (e.g., Markdown, Jupyter Notebooks) to combine code, text, and visualizations into reproducible reports that include step-by-step explanations and complete code.
- Test and Validate: Running analyses multiple times, testing on different systems or environments, and validating results against known outcomes or independent verification.
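As a small example of the parameterization and random-seed guidance above, the sketch below centralizes analysis parameters in one dictionary and seeds every random number generator the analysis touches. The parameter names and values are hypothetical.

```python
# Minimal sketch: centralize parameters and fix random seeds so reruns match exactly.
import random

import numpy as np

CONFIG = {
    "seed": 2024,        # single source of truth for all stochastic steps
    "n_bootstrap": 500,  # hypothetical analysis parameter
}

def set_seeds(seed: int) -> None:
    """Seed every random number generator the analysis uses."""
    random.seed(seed)
    np.random.seed(seed)

set_seeds(CONFIG["seed"])

# Any downstream sampling now yields identical results on every run.
bootstrap_means = [np.random.normal(size=100).mean() for _ in range(CONFIG["n_bootstrap"])]
print(round(float(np.mean(bootstrap_means)), 4))
```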
QC/QA Procedures
Quality control and quality assurance are essential to ensure the accuracy and reliability of data analysis. We recommend specifying the following procedures:
- Data Paralleling: Independently reproducing key data processing steps and comparing the results to ensure consistency and accuracy.
- Editorial Review: Conducting thorough reviews of reports, code, and analyses by experienced professionals.
- Code Version Control: Utilizing version control systems like Git (hosted on platforms such as GitHub) to track changes and enable collaboration.
- Code Review Standards and Guidance: Establishing standards and guidelines for code review to ensure code quality and adherence to best practices.
- In-line/Unit Testing Guidance: Implementing in-line or unit testing to validate individual components of code or specific functions (see the sketch after this list).
- Avoiding common but dangerous hacks: Manual data edits, ad hoc calculations in tools like Excel, local-only processing environments, and point-and-click workflows that can introduce errors or limit reproducibility.
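To illustrate the in-line/unit testing point, here is a minimal sketch that checks a hypothetical recoding function with plain assertions; the same checks could live in a pytest file and run automatically on every change.

```python
# Illustrative unit test for a hypothetical recoding function.
import pandas as pd


def recode_age_group(age: pd.Series) -> pd.Series:
    """Collapse age in years into three analysis categories."""
    return pd.cut(age, bins=[0, 29, 49, 120], labels=["18-29", "30-49", "50+"])


def test_recode_age_group():
    ages = pd.Series([22, 35, 64, None])
    groups = recode_age_group(ages)
    assert list(groups[:3]) == ["18-29", "30-49", "50+"]
    assert pd.isna(groups[3])        # missing input stays missing, not silently filled
    assert len(groups) == len(ages)  # no rows dropped


# In-line check: run directly here, or let pytest discover the test_ function.
test_recode_age_group()
print("recode checks passed")
```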
Common Statistical Programming Languages
A custom analytics research project should establish best practices regarding the use of specific programming languages. In our work, the following open-source and proprietary platforms are essential:
- R: Open-source, best for cutting-edge advanced analytics.
- Python: Open-source, best for machine learning/AI.
- SPSS: Proprietary, best for tabulation of survey data.
- Stata: Proprietary, best for econometric analysis.
- Excel: Proprietary, best for basic operations.
Where to Go for Help on Quant Methods
Custom analytics projects should specify where to go for help when encountering quantitative analysis challenges. Some recommended sources include:
- Generative AI to write code: AI tools like ChatGPT and GitHub Copilot can assist with code generation. Training resources should be suggested to enhance skills in this area.
- Stack Overflow: The ultimate Q&A site for coders.
- Methodology blogs: We strongly recommend the methodology blog from our friends at Gallup.
- Internal Experts: Reaching out to colleagues who possess deep knowledge and expertise in quantitative methods for assistance and guidance.
Custom analytics guidelines can enhance reproducibility, minimize errors, improve analytic quality, introduce consistency in reporting standards, and optimize learning opportunities. By implementing these guidelines, organizations can foster a culture of excellence in data-driven decision making and elevate the quality of their analyses.
About Rosan International
ROSAN is a technology company specializing in the development of Data Science and Artificial Intelligence solutions for the most challenging global projects. Contact us to discover how we can help you gain valuable insights from your data and optimize your processes.