Ziad Obermeyer, a physician and machine-learning scientist at the University of California, Berkeley, last night unveiled Nightingale Open Science – a treasure trove of unique medical datasets, each curated around an unsolved medical mystery that artificial intelligence could help crack.
The datasets, released after the project received $2 million in funding from former Google CEO Eric Schmidt, could help train computer algorithms to predict medical conditions earlier, triage patients better, and save lives.
The data include 40 terabytes of medical imagery, such as X-rays, electrocardiogram waveforms, and pathology samples, from patients with a range of conditions, including high-risk breast cancer, sudden cardiac arrest, fractures, and Covid-19. Each image is labeled with the patient’s medical outcomes, such as the stage of breast cancer and whether it led to death, or whether a Covid patient needed a ventilator.
Obermeyer made the datasets free to use and built them over two years, working primarily with hospitals in the U.S. and Taiwan. He plans to expand the project to Kenya and Lebanon in the coming months to reflect as much medical diversity as possible.
“Nothing exists like this,” says Obermeyer, who announced the new project in December alongside colleagues at NeurIPS, the leading academic conference on artificial intelligence. “What makes it different from anything available online is that the datasets are labeled with the ‘ground truth’ – meaning what really happened to a patient, and not just a doctor’s opinion.”
This means that ECGs in the cardiac arrest dataset, for example, are labeled not according to whether a cardiologist spotted something suspicious, but according to whether the patient eventually had a heart attack. “We can learn from real patient outcomes, rather than repeating flawed human judgment,” Obermeyer said.
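The difference between opinion-based and outcome-based labels can be sketched in a few lines. This is a hypothetical illustration – the record structure and field names are invented, not drawn from the Nightingale schema:

```python
# Hypothetical ECG records: each carries a cardiologist's reading (an opinion)
# and the patient's eventual outcome (the ground truth). Fields are invented.
records = [
    {"ecg_id": 1, "cardiologist_flagged": True,  "had_cardiac_arrest": False},
    {"ecg_id": 2, "cardiologist_flagged": False, "had_cardiac_arrest": True},
    {"ecg_id": 3, "cardiologist_flagged": True,  "had_cardiac_arrest": True},
]

# Opinion-based labels repeat the clinician's judgment, errors included.
opinion_labels = [r["cardiologist_flagged"] for r in records]

# Outcome-based labels record what actually happened to the patient, so a
# model trained on them can learn from cases the clinician missed (ecg_id 2).
outcome_labels = [r["had_cardiac_arrest"] for r in records]

print(opinion_labels)  # [True, False, True]
print(outcome_labels)  # [False, True, True]
```

A model trained on `opinion_labels` can at best reproduce the cardiologist; one trained on `outcome_labels` has a chance of beating them.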
In recent years, the AI community has undergone a sector-wide shift from collecting “big data” – as much data as possible – to meaningful data: information that is more curated and relevant to a specific problem, and that can be used to address issues such as ingrained human biases in healthcare, image recognition, or natural language processing.
Many healthcare algorithms have already been shown to amplify existing health disparities. For example, Obermeyer found that an AI system used by hospitals treating up to 70 million Americans, which allocated extra medical support to patients with chronic diseases, prioritized healthier white patients over sicker black patients in need of assistance. It assigned risk scores based on data that included an individual’s total healthcare costs in a year – in effect using healthcare costs as a measure of healthcare needs.
The crux of the problem, reflected in the model’s underlying data, is that not everyone generates healthcare costs in the same way. Minorities and other underserved populations may lack access to and resources for healthcare, be less able to make time for doctor visits, or experience discrimination within the system by receiving fewer treatments or tests – all of which can make them appear less expensive in the datasets. That does not necessarily mean they were less ill.
The researchers calculated that nearly 47 percent of black patients should have been referred for extra care, but the algorithmic bias meant that only 17 percent were.
“Your costs are going to be lower even though your needs are the same. And that was the root of the bias that we found,” Obermeyer said. He found that several other similar AI systems also used cost as a proxy, a design choice he estimates affects the lives of about 200 million patients.
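The mechanism can be illustrated with toy numbers – entirely synthetic, not drawn from the study’s data. Two patients with identical medical need generate different annual costs because one has less access to care, so a cost-based risk score ranks them differently while a need-based score treats them as equals:

```python
# Synthetic illustration of cost-as-proxy bias (all numbers are invented).
# Both patients carry the same burden of illness, but unequal access to
# care means they generate different annual healthcare costs.
patients = [
    {"name": "patient_a", "chronic_conditions": 4, "annual_cost": 12000},  # good access to care
    {"name": "patient_b", "chronic_conditions": 4, "annual_cost": 4000},   # poor access to care
]

def risk_score_by_cost(p):
    # The flawed proxy: predicted need equals observed spending.
    return p["annual_cost"]

def risk_score_by_need(p):
    # A label closer to ground truth: the actual burden of illness.
    return p["chronic_conditions"]

# Ranking by cost puts patient_a first, despite identical medical need;
# ranking by need ties them, so neither is deprioritized.
by_cost = sorted(patients, key=risk_score_by_cost, reverse=True)
print([p["name"] for p in by_cost])  # ['patient_a', 'patient_b']
print(risk_score_by_need(patients[0]) == risk_score_by_need(patients[1]))  # True
```

Swapping the proxy label (cost) for an outcome label (illness burden) is exactly the kind of fix that ground-truth datasets make possible.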
Unlike widely used computer-vision datasets such as ImageNet, built from internet images that do not necessarily reflect the diversity of the real world, a wave of new datasets contains information that is more representative of the population – leading not only to more broadly applicable and more accurate algorithms, but also to an expansion of our scientific knowledge.
These new diverse, high-quality datasets can be used to root out underlying biases “that are discriminatory against people who are underserved and under-represented” in healthcare systems, such as women and minorities, said Schmidt, whose foundation funds the Nightingale Open Science project. “You can use AI to understand what’s really going on with humans, rather than what a doctor thinks.”
The Nightingale datasets were among dozens proposed at NeurIPS this year.
Other projects included a speech dataset of Mandarin and eight of its subdialects, recorded by 27,000 speakers in 34 cities in China; the largest audio dataset of Covid respiratory sounds – breathing, coughing, and voice recordings – from more than 36,000 participants, to help screen for the disease; and a dataset of satellite images covering the whole of South Africa from 2006 to 2017, segmented and labeled by neighborhood, to study the social effects of spatial apartheid.
Elaine Nsoesie, a computational epidemiologist at Boston University School of Public Health, said new types of data could also help in studying the spread of disease in different places, as people from different cultures respond differently to illness.
She said her grandmother in Cameroon, for example, thinks about health differently than Americans do. “If someone has had a flu-like illness in Cameroon, they may be looking for traditional herbal treatments or home remedies, compared to drugs or different home remedies in the US.”
Computer scientists Serena Yeung and Joaquin Vanschoren, who proposed that NeurIPS host a dedicated track for sharing research on new datasets, pointed out that the vast majority of the AI community still struggles to find good datasets on which to evaluate its algorithms. This has meant that AI researchers keep turning to data that may be “plagued with bias,” they said. “There are no good models without good data.”